CN107609121A - News text classification method based on LDA and word2vec algorithms - Google Patents

News text classification method based on LDA and word2vec algorithms

Info

Publication number
CN107609121A
Authority
CN
China
Prior art keywords
text
vector
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710828232.XA
Other languages
Chinese (zh)
Other versions
CN107609121B (en)
Inventor
赵阔
王峰
谢珍真
孙小雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Shenzhen City Mateng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen City Mateng Technology Co Ltd filed Critical Shenzhen City Mateng Technology Co Ltd
Priority to CN201710828232.XA priority Critical patent/CN107609121B/en
Publication of CN107609121A publication Critical patent/CN107609121A/en
Application granted granted Critical
Publication of CN107609121B publication Critical patent/CN107609121B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a news text classification method based on the LDA and word2vec algorithms, including: obtaining the term vectors of a corpus with word2vec; segmenting the texts in the training sample set and removing stop words; obtaining the class core words of the training sample set with an LDA model; constructing the class center vectors of the training sample set; preprocessing the text to be classified, extracting its feature words, and obtaining its text vector; computing the similarity between the text vector of the text to be classified and the class center vectors of the training sample set and classifying the text to be classified accordingly; and performing a secondary classification of the text to be classified with the KNN algorithm. Beneficial effects of the invention: performing the first-pass classification by computing the similarity between the feature vector of the text to be classified and the class center vectors greatly reduces the amount of computation; when the first-pass classification cannot clearly decide the category, a secondary classification is performed with the KNN algorithm, and classification samples are drawn in equal amounts from a new sample set, eliminating the impact of uneven sample distribution on classification accuracy.

Description

News text classification method based on LDA and word2vec algorithms
Technical field
The present invention relates to the technical field of text classification, and in particular to a news text classification method based on the LDA and word2vec algorithms.
Background technology
The most popular document representation methods at present are all based on the bag-of-words model. The bag-of-words model regards a document as a set of words in which each word occurs independently, ignoring information such as word order, syntax, and semantics. It organizes the feature terms of the training text set into a vector space model and represents every document as a vector with the same dimension as the model, where the value at each position in the vector is the weight of the corresponding word in the training sample set. This method has the following main problems:
(1) The vector dimension is too high:
The dimension of the vector equals the number of feature terms retained in the whole training sample set, which can reach tens or even hundreds of thousands, causing the "curse of dimensionality"; moreover, these text vectors occupy a large amount of storage space;
(2) The vectors are sparse:
A document vector has non-zero weights only at the positions of the feature terms that appear in the document; the weights at most other positions are 0, which reduces the efficiency of computation in text classification tasks and also wastes storage space;
(3) The semantic information of documents cannot be well represented:
The bag-of-words model assumes that the words in a document are completely independent and ignores the semantic relations between words; for two documents that are semantically close but share no feature words, the text similarity computed from their bag-of-words vectors is 0.
The KNN algorithm is simple in principle, easy to implement, and has high stability and high accuracy; it is one of the classic algorithms currently applied to text classification. Its main shortcomings are the following two points:
(1) When the training sample set is large, the KNN algorithm is inefficient:
The ordinary KNN algorithm needs to compute the similarity between the feature vector of the text to be classified and the feature vectors of all samples in the training set, select the K nearest training samples, count the number of training samples in each category, and finally assign the text to be classified to the category with the largest count. Computing the similarity between the feature vector of the text to be classified and the feature vectors of all texts in the training sample set is the key factor in the low efficiency of the KNN algorithm;
(2) Every attribute has the same weight, which affects the accuracy of the classification result:
When the category distribution of the training sample set is unbalanced, for example when one category has a very large number of samples and the others very few, the samples of the large-capacity category may dominate the K nearest neighbors of an input text to be classified. Because the KNN algorithm only considers the "nearest" neighbor samples in the end, if a category has many samples, a text to be classified that is not actually close to that category may still be wrongly assigned to it, affecting classification accuracy.
Summary of the invention
To solve the above problems, an object of the present invention is to provide a news text classification method based on the LDA and word2vec algorithms, which performs a first-pass classification by computing the similarity between the feature vector of the text to be classified and the class center vectors, greatly reducing the amount of computation; when the first-pass classification cannot clearly decide the category, a secondary classification is performed with the KNN algorithm, classification samples are drawn in equal amounts from the pruned new sample set, and the impact of uneven sample distribution on classification accuracy is eliminated.
The invention provides a news text classification method based on the LDA and word2vec algorithms, including:
Step 1, obtaining the term vectors of a corpus with the word2vec tool:
A large-scale corpus is segmented into words, the segmented text is input into the word2vec tool, and training yields the term vector of every word in the corpus;
Step 2, preprocessing the texts of the training sample set:
The texts in the training sample set are segmented into words and stop words are removed;
Step 3, obtaining the class core words of the training sample set with LDA topic models:
An LDA topic model is trained separately on each category of the training sample set; after training under the LDA topic models, the text-topic and topic-word probability distributions of each category are obtained, and according to the LDA topic model output, the words in each category whose maximum topic probability value exceeds a threshold α are taken as the core words of that category;
Step 4, constructing the class center vector c_i of the training sample set from the term vectors a_i of the class core words;
Step 5, preprocessing the text to be classified, extracting its feature words, and obtaining the text vector d_j of the text to be classified;
Step 6, computing the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sorting the similarity values in descending order, and performing a first-pass classification of the text to be classified according to the ranking; when the difference between the first two similarity values in the descending order is less than a threshold ε, proceeding to step 7;
Step 7, performing a secondary classification of the text to be classified with the KNN algorithm.
As a further improvement of the invention, step 4 specifically includes:
Step 401, selecting the term vectors a_i of the core words of each category from all the term vectors of step 1;
Step 402, taking the topic-word probability value β_i obtained by the LDA topic model as the weight of the corresponding word for the category, adding the weighted term vectors under the same category, and averaging them to obtain the class center vector c_i of that category, expressed as formula (1):
$$c_i = \frac{1}{n}\sum_{i=1}^{n} \beta_i \cdot a_i \qquad (1)$$
As a further improvement of the invention, step 5 specifically includes:
Step 501, preprocessing the text to be classified, including word segmentation and stop word removal;
Step 502, extracting the text feature words with the TF-IDF algorithm:
The TF-IDF value of each word is calculated according to formula (2), and the words whose TF-IDF value exceeds a threshold θ are taken as the feature words w of the text to be classified:
$$\text{TF-IDF} = \frac{m}{M} \cdot \log\left(\frac{N}{n} + 0.01\right) \qquad (2)$$
where m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the total number of texts in the training sample set containing the feature word w;
Step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their term vectors, then adding the term vectors of all feature words and averaging them to obtain the text vector d_j of the text to be classified.
As a further improvement of the invention, step 6 specifically includes:
Step 601, computing the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3):
$$sim(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)$$
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of the class center vector of each category, w_ik is the value of each dimension of the class center vector, and w_jk is the value of each dimension of the text vector of the text to be classified;
Step 602, sorting the similarity values calculated in step 601 in descending order;
Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
If the difference is greater than ε, the text to be classified is assigned to the category corresponding to the first similarity value;
If the difference is less than ε, the secondary classification of step 7 is performed.
As a further improvement of the invention, step 7 specifically includes:
Step 701, extracting from the training text set the texts of the categories whose first x adjacent similarity values in the descending order of step 6 differ by less than ε;
Step 702, randomly selecting z texts from each of these categories to form a new training sample set;
Step 703, repeating step 5 for every text in the new training sample set to obtain the text vector of every text;
Step 704, using the KNN algorithm, computing the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
$$sim(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^{2}}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)$$
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of each dimension of a text vector in the new training sample set, and w_jk is the value of each dimension of the text vector of the text to be classified;
Step 705, for the K selected texts, calculating in turn the weight of the category each text belongs to, according to formula (5):
$$W(d_j, C_i) = \sum_{d_i \in KNN} sim(d_j, d_i)\, y(d_i, C_i) \qquad (5)$$
where W(d_j, C_i) is the weight with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of every text in the new training sample set is known, and for each of the K selected texts the category attribute function takes the value 1 if the text belongs to category C_i and 0 otherwise;
Step 706, assigning the text to be classified to the category corresponding to the largest weight calculated in step 705.
The beneficial effects of the present invention are:
1. The present invention represents text information with the term vectors trained by the word2vec tool. The word2vec model uses the contextual information of words in a text to convert each word into a low-dimensional real-valued vector, and the semantic similarity between words is obtained from the distance between their vectors. In constructing the text vector, adding and averaging the keyword term vectors replaces vector concatenation, which solves the problem of high vector dimensionality and also removes the restriction on the number of keywords that can be chosen;
2. Traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and reconstruct a vector space model. The present invention proposes constructing category features by combining the LDA model with the word2vec algorithm and using the topic-word probability values as the weights of the feature words. This method accounts for the contribution of different words to the same category and of the same word to different categories. Because word2vec captures the semantic relations between words, the present invention represents a text by adding and averaging term vectors, which retains the similarity information between texts while keeping the dimension of the text vector from becoming too large; therefore, computing the similarity between the feature vector of the text to be classified and the class center vectors greatly reduces the amount of computation;
3. In classifying a text, conventional methods mostly consider only the similarity between texts. The present invention proposes extracting category features directly, establishing a connection between texts and categories. When the first-pass classification cannot clearly decide the category, the KNN algorithm is used for a secondary classification, and categories that are far from the text to be classified need not be considered at this point; classification samples are drawn in equal amounts from the pruned new sample set, eliminating the impact of uneven sample distribution on classification accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of a news text classification method based on the LDA and word2vec algorithms according to an embodiment of the present invention.
Embodiment
The present invention is described in further detail below through specific embodiments and with reference to the accompanying drawings.
As shown in Fig. 1, a news text classification method based on the LDA and word2vec algorithms according to an embodiment of the present invention includes:
Step 1, obtaining the term vectors of a corpus with the word2vec tool:
A large-scale corpus is segmented into words, the segmented text is input into the word2vec tool, and training yields the term vector of every word in the corpus.
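The patent mandates no particular toolchain for step 1. As a minimal sketch only, assuming the gensim library (version 4 or later) for word2vec and the jieba segmenter for Chinese word segmentation, step 1 could look like the following; the corpus file name and the hyperparameters are illustrative assumptions:

```python
# Minimal sketch of step 1 (illustrative; the patent specifies no toolchain):
# segment a large-scale corpus and train word2vec term vectors with gensim.
import jieba                        # assumed Chinese word segmenter
from gensim.models import Word2Vec  # assumed word2vec implementation (gensim >= 4)

with open("corpus.txt", encoding="utf-8") as f:           # hypothetical corpus file
    sentences = [jieba.lcut(line.strip()) for line in f]  # word segmentation

# vector_size, window, and min_count are illustrative hyperparameters
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save("corpus.wv")          # term vector of every word in the corpus
```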
Step 2, preprocessing the texts of the training sample set:
The texts in the training sample set are segmented into words and stop words are removed.
Step 3, obtaining the class core words of the training sample set with LDA topic models:
An LDA topic model is trained separately on each category of the training sample set; after training under the LDA topic models, the text-topic and topic-word probability distributions of each category are obtained, and according to the LDA topic model output, the words in each category whose maximum topic probability value exceeds a threshold α are taken as the core words of that category.
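As an illustration of step 3, again assuming gensim, the sketch below trains an LDA topic model on one category and keeps the words whose maximum topic-word probability exceeds the threshold α as that category's core words; the topic count and threshold value are assumptions:

```python
# Minimal sketch of step 3 (illustrative): per-category LDA training and
# core-word selection by the maximum topic-word probability threshold alpha.
from gensim import corpora
from gensim.models import LdaModel

def core_words(category_texts, num_topics=10, alpha=0.01):
    """category_texts: list of token lists for the texts of one category.
    Returns {core word: its maximum topic-word probability beta_i}."""
    dictionary = corpora.Dictionary(category_texts)
    bow = [dictionary.doc2bow(t) for t in category_texts]
    lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary, passes=10)
    topic_word = lda.get_topics()       # (num_topics, vocab_size) probabilities
    max_prob = topic_word.max(axis=0)   # maximum topic probability of each word
    return {dictionary[i]: float(p) for i, p in enumerate(max_prob) if p > alpha}
```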
Step 4, constructing the class center vector c_i of the training sample set from the term vectors a_i of the class core words:
Step 401, selecting the term vectors a_i of the core words of each category from all the term vectors of step 1;
Step 402, taking the topic-word probability value β_i obtained by the LDA topic model as the weight of the corresponding word for the category, adding the weighted term vectors under the same category, and averaging them to obtain the class center vector c_i of that category, expressed as formula (1):
$$c_i = \frac{1}{n}\sum_{i=1}^{n} \beta_i \cdot a_i \qquad (1)$$
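Formula (1) is a weighted average, so step 4 reduces to a few lines of numpy; a sketch under the assumption that `core` is the dict returned by the step 3 sketch and `model` is the word2vec model from the step 1 sketch:

```python
# Minimal sketch of step 4 / formula (1): c_i = (1/n) * sum(beta_i * a_i),
# the average of the core-word term vectors a_i weighted by beta_i.
import numpy as np

def class_center(core, model):
    vecs = np.array([beta * model.wv[w]          # beta_i * a_i
                     for w, beta in core.items() if w in model.wv])
    return vecs.mean(axis=0)                     # class center vector c_i
```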
Step 5, preprocessing the text to be classified, extracting its feature words, and obtaining the text vector d_j of the text to be classified:
Step 501, preprocessing the text to be classified, including word segmentation and stop word removal;
Step 502, extracting the text feature words with the TF-IDF algorithm:
The TF-IDF value of each word is calculated according to formula (2), and the words whose TF-IDF value exceeds a threshold θ are taken as the feature words w of the text to be classified:
$$\text{TF-IDF} = \frac{m}{M} \cdot \log\left(\frac{N}{n} + 0.01\right) \qquad (2)$$
where m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the total number of texts in the training sample set containing the feature word w;
Step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their term vectors, then adding the term vectors of all feature words and averaging them to obtain the text vector d_j of the text to be classified.
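A sketch of step 5, computing TF-IDF exactly as formula (2) defines it (note the +0.01 term inside the logarithm) and averaging the feature-word term vectors; the threshold θ and the corpus statistics passed in are assumptions:

```python
# Minimal sketch of step 5 / formula (2): TF-IDF = (m/M) * log(N/n + 0.01),
# then the text vector d_j as the mean of the feature-word term vectors.
import math
import numpy as np

def text_vector(tokens, model, doc_freq, N, theta=0.1):
    """tokens: segmented text to classify; doc_freq: word -> number of
    training texts containing it; N: total texts in the training sample set."""
    M = len(tokens)
    feature_vecs = []
    for w in set(tokens):
        m = tokens.count(w)                      # occurrences in this text
        n = doc_freq.get(w, 1)                   # training texts containing w
        tfidf = (m / M) * math.log(N / n + 0.01)
        if tfidf > theta and w in model.wv:      # keep words above theta
            feature_vecs.append(model.wv[w])
    return np.mean(feature_vecs, axis=0)         # text vector d_j
```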
Step 6, computing the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sorting the similarity values in descending order, and performing the first-pass classification of the text to be classified according to the ranking:
Step 601, computing the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3):
$$sim(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)$$
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of the class center vector of each category, w_ik is the value of each dimension of the class center vector, and w_jk is the value of each dimension of the text vector of the text to be classified;
Step 602, sorting the similarity values calculated in step 601 in descending order;
Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
If the difference is greater than ε, the text to be classified is assigned to the category corresponding to the first similarity value;
If the difference is less than ε, the secondary classification of step 7 is performed.
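A sketch of step 6: formula (3) is the cosine similarity, so the first-pass classifier sorts the per-category similarities in descending order and applies the ε test; the ε value is an assumption:

```python
# Minimal sketch of step 6 / formula (3): cosine similarity of d_j against
# every class center c_i, descending sort, and the epsilon ambiguity test.
import numpy as np

def cosine(a, b):                                # formula (3)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def first_pass(d_j, centers, eps=0.05):
    """centers: {category label: class center vector c_i}."""
    ranked = sorted(((cosine(c, d_j), label) for label, c in centers.items()),
                    reverse=True)                # similarity values, descending
    if ranked[0][0] - ranked[1][0] > eps:
        return ranked[0][1]                      # clear margin: assign category
    return None                                  # ambiguous: proceed to step 7
```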
Step 7, performing a secondary classification of the text to be classified with the KNN algorithm:
Step 701, extracting from the training text set the texts of the categories whose first x adjacent similarity values in the descending order of step 6 differ by less than ε;
Step 702, randomly selecting z texts from each of these categories to form a new training sample set;
Step 703, repeating step 5 for every text in the new training sample set to obtain the text vector of every text;
Step 704, using the KNN algorithm, computing the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
$$sim(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^{2}}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)$$
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of each dimension of a text vector in the new training sample set, and w_jk is the value of each dimension of the text vector of the text to be classified;
Step 705, for the K selected texts, calculating in turn the weight of the category each text belongs to, according to formula (5):
$$W(d_j, C_i) = \sum_{d_i \in KNN} sim(d_j, d_i)\, y(d_i, C_i) \qquad (5)$$
where W(d_j, C_i) is the weight with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of every text in the new training sample set is known, and for each of the K selected texts the category attribute function takes the value 1 if the text belongs to category C_i and 0 otherwise;
Step 706, assigning the text to be classified to the category corresponding to the largest weight calculated in step 705.
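Finally, a sketch of step 7, reusing the `cosine` helper from the step 6 sketch: the K most similar texts of the pruned, evenly re-sampled training set vote with their similarity values per formula (5); the value of K and the shape of `new_set` are assumptions:

```python
# Minimal sketch of step 7 / formulas (4)-(5): KNN secondary classification.
# new_set: list of (text vector d_i, category) pairs built in steps 701-703;
# cosine() is the helper defined in the step 6 sketch above.
def knn_secondary(d_j, new_set, K=15):
    neighbors = sorted(((cosine(d_i, d_j), cat) for d_i, cat in new_set),
                       reverse=True)[:K]         # K most similar, formula (4)
    weights = {}                                 # W(d_j, C_i), formula (5)
    for sim, cat in neighbors:
        weights[cat] = weights.get(cat, 0.0) + sim
    return max(weights, key=weights.get)         # category with largest weight
```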
The present invention proposes extracting category features directly and establishing a connection between texts and categories, i.e., step 6; when the category cannot be clearly decided from the category vectors alone, the KNN algorithm is reused for further classification, and categories far from the text to be classified need not be considered, i.e., step 7, which prunes the sample set and reduces the amount of computation. Meanwhile, traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and reconstruct a vector space model; the present invention proposes constructing category features by combining the LDA model with the word2vec algorithm and using the topic-word probability values as the weights of the feature words, i.e., step 4. This method accounts for the contribution of different words to the same category and of the same word to different categories; texts are represented by adding and averaging term vectors, which retains the similarity information between texts while keeping the dimension of the text vector from becoming too large.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention; various modifications and variations will occur to those skilled in the art. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection.

Claims (5)

  1. A news text classification method based on the LDA and word2vec algorithms, characterized by including:
    Step 1, obtaining the term vectors of a corpus with the word2vec tool:
    A large-scale corpus is segmented into words, the segmented text is input into the word2vec tool, and training yields the term vector of every word in the corpus;
    Step 2, preprocessing the texts of the training sample set:
    The texts in the training sample set are segmented into words and stop words are removed;
    Step 3, obtaining the class core words of the training sample set with LDA topic models:
    An LDA topic model is trained separately on each category of the training sample set; after training under the LDA topic models, the text-topic and topic-word probability distributions of each category are obtained, and according to the LDA topic model output, the words in each category whose maximum topic probability value exceeds a threshold α are taken as the core words of that category;
    Step 4, constructing the class center vector c_i of the training sample set from the term vectors a_i of the class core words;
    Step 5, preprocessing the text to be classified, extracting its feature words, and obtaining the text vector d_j of the text to be classified;
    Step 6, computing the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sorting the similarity values in descending order, and performing a first-pass classification of the text to be classified according to the ranking; when the difference between the first two similarity values in the descending order is less than a threshold ε, proceeding to step 7;
    Step 7, performing a secondary classification of the text to be classified with the KNN algorithm.
  2. The news text classification method according to claim 1, characterized in that step 4 specifically includes:
    Step 401, selecting the term vectors a_i of the core words of each category from all the term vectors of step 1;
    Step 402, taking the topic-word probability value β_i obtained by the LDA topic model as the weight of the corresponding word for the category, adding the weighted term vectors under the same category, and averaging them to obtain the class center vector c_i of that category, expressed as formula (1):
    $$c_i = \frac{1}{n}\sum_{i=1}^{n} \beta_i \cdot a_i \qquad (1)$$
  3. The news text classification method according to claim 1, characterized in that step 5 specifically includes:
    Step 501, preprocessing the text to be classified, including word segmentation and stop word removal;
    Step 502, extracting the text feature words with the TF-IDF algorithm:
    The TF-IDF value of each word is calculated according to formula (2), and the words whose TF-IDF value exceeds a threshold θ are taken as the feature words w of the text to be classified:
    $$\text{TF-IDF} = \frac{m}{M} \cdot \log\left(\frac{N}{n} + 0.01\right) \qquad (2)$$
    where m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the total number of texts in the training sample set containing the feature word w;
    Step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their term vectors, then adding the term vectors of all feature words and averaging them to obtain the text vector d_j of the text to be classified.
  4. The news text classification method according to claim 1, characterized in that step 6 specifically includes:
    Step 601, computing the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3):
    $$sim(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)$$
    where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of the class center vector of each category, w_ik is the value of each dimension of the class center vector, and w_jk is the value of each dimension of the text vector of the text to be classified;
    Step 602, sorting the similarity values calculated in step 601 in descending order;
    Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
    If the difference is greater than ε, the text to be classified is assigned to the category corresponding to the first similarity value;
    If the difference is less than ε, the secondary classification of step 7 is performed.
  5. The news text classification method according to claim 1, characterized in that step 7 specifically includes:
    Step 701, extracting from the training text set the texts of the categories whose first x adjacent similarity values in the descending order of step 6 differ by less than ε;
    Step 702, randomly selecting z texts from each of these categories to form a new training sample set;
    Step 703, repeating step 5 for every text in the new training sample set to obtain the text vector of every text;
    Step 704, using the KNN algorithm, computing the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
    $$sim(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^{2}}\,\sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)$$
    where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of each dimension of a text vector in the new training sample set, and w_jk is the value of each dimension of the text vector of the text to be classified;
    Step 705, for the K selected texts, calculating in turn the weight of the category each text belongs to, according to formula (5):
    $$W(d_j, C_i) = \sum_{d_i \in KNN} sim(d_j, d_i)\, y(d_i, C_i) \qquad (5)$$
    where W(d_j, C_i) is the weight with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of every text in the new training sample set is known, and for each of the K selected texts the category attribute function takes the value 1 if the text belongs to category C_i and 0 otherwise;
    Step 706, assigning the text to be classified to the category corresponding to the largest weight calculated in step 705.
CN201710828232.XA 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm Expired - Fee Related CN107609121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710828232.XA CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710828232.XA CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Publications (2)

Publication Number Publication Date
CN107609121A true CN107609121A (en) 2018-01-19
CN107609121B CN107609121B (en) 2021-03-30

Family

ID=61062711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710828232.XA Expired - Fee Related CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Country Status (1)

Country Link
CN (1) CN107609121B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110343A1 (en) * 2014-10-21 2016-04-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guo Mao (郭茂): "Research and Implementation of a Text Classification Model Based on Class Center Vectors", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520030A (en) * 2018-03-27 2018-09-11 深圳中兴网信科技有限公司 File classification method, Text Classification System and computer installation
CN108597519A (en) * 2018-04-04 2018-09-28 百度在线网络技术(北京)有限公司 A kind of bill classification method, apparatus, server and storage medium
CN108829661B (en) * 2018-05-09 2020-03-27 成都信息工程大学 News subject name extraction method based on fuzzy matching
CN108829661A (en) * 2018-05-09 2018-11-16 成都信息工程大学 A kind of subject of news title extracting method based on fuzzy matching
CN108932228A (en) * 2018-06-06 2018-12-04 武汉斗鱼网络科技有限公司 INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live
CN108932228B (en) * 2018-06-06 2023-08-08 广东南方报业移动媒体有限公司 Live broadcast industry news and partition matching method and device, server and storage medium
CN108846097A (en) * 2018-06-15 2018-11-20 北京搜狐新媒体信息技术有限公司 The interest tags representation method of user, article recommended method and device, equipment
CN108846120A (en) * 2018-06-27 2018-11-20 合肥工业大学 Method, system and storage medium for classifying to text set
CN108804622A (en) * 2018-08-20 2018-11-13 天津探数科技有限公司 A kind of short text grader building method considering semantic background
CN109145116A (en) * 2018-09-03 2019-01-04 武汉斗鱼网络科技有限公司 A kind of file classification method, device, electronic equipment and storage medium
CN109284379A (en) * 2018-09-21 2019-01-29 福州大学 Adaptive microblog topic method for tracing based on double vector models
CN109284379B (en) * 2018-09-21 2022-01-04 福州大学 Adaptive microblog topic tracking method based on two-way quantity model
CN110969172A (en) * 2018-09-28 2020-04-07 武汉斗鱼网络科技有限公司 Text classification method and related equipment
CN110969023B (en) * 2018-09-29 2023-04-18 北京国双科技有限公司 Text similarity determination method and device
CN110969023A (en) * 2018-09-29 2020-04-07 北京国双科技有限公司 Text similarity determination method and device
CN109446324B (en) * 2018-10-16 2020-12-15 北京字节跳动网络技术有限公司 Sample data processing method and device, storage medium and electronic equipment
CN109446324A (en) * 2018-10-16 2019-03-08 北京字节跳动网络技术有限公司 Processing method, device, storage medium and the electronic equipment of sample data
CN109522408A (en) * 2018-10-30 2019-03-26 广东原昇信息科技有限公司 The classification method of information streaming material intention text
CN109684444A (en) * 2018-11-02 2019-04-26 厦门快商通信息技术有限公司 A kind of intelligent customer service method and system
CN109685109A (en) * 2018-11-26 2019-04-26 浙江工业大学 A kind of base station label track classification method based on twin neural network
CN110046340A (en) * 2018-12-28 2019-07-23 阿里巴巴集团控股有限公司 The training method and device of textual classification model
CN109766410A (en) * 2019-01-07 2019-05-17 东华大学 A kind of newsletter archive automatic classification system based on fastText algorithm
CN109815400A (en) * 2019-01-23 2019-05-28 四川易诚智讯科技有限公司 Personage's interest extracting method based on long text
CN109947939A (en) * 2019-01-30 2019-06-28 中兴飞流信息科技有限公司 File classification method, electronic equipment and computer readable storage medium
CN111753079A (en) * 2019-03-11 2020-10-09 阿里巴巴集团控股有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111723199A (en) * 2019-03-19 2020-09-29 北京沃东天骏信息技术有限公司 Text classification method and device and computer readable storage medium
CN110569351A (en) * 2019-09-02 2019-12-13 北京猎云万罗科技有限公司 Network media news classification method based on restrictive user preference
CN110781271A (en) * 2019-09-02 2020-02-11 国网天津市电力公司电力科学研究院 Semi-supervised network representation learning model based on hierarchical attention mechanism
CN110674239B (en) * 2019-09-27 2022-11-04 中国航空无线电电子研究所 Automatic classification method and device for geographic elements
CN110674239A (en) * 2019-09-27 2020-01-10 中国航空无线电电子研究所 Automatic classification method and device for geographic elements
CN110704626B (en) * 2019-09-30 2022-07-22 北京邮电大学 Short text classification method and device
CN110704626A (en) * 2019-09-30 2020-01-17 北京邮电大学 Short text classification method and device
CN110795564A (en) * 2019-11-01 2020-02-14 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN110795564B (en) * 2019-11-01 2022-02-22 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN111177373B (en) * 2019-12-12 2023-07-14 北京明略软件系统有限公司 Method and device for acquiring training data, and model training method and device
CN111459959A (en) * 2020-03-31 2020-07-28 北京百度网讯科技有限公司 Method and apparatus for updating event set
CN111859979A (en) * 2020-06-16 2020-10-30 中国科学院自动化研究所 Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN112069058A (en) * 2020-08-11 2020-12-11 国网河北省电力有限公司保定供电分公司 Defect disposal method based on expert database and self-learning technology
CN112052333B (en) * 2020-08-20 2024-04-30 深圳市欢太科技有限公司 Text classification method and device, storage medium and electronic equipment
CN112052333A (en) * 2020-08-20 2020-12-08 深圳市欢太科技有限公司 Text classification method and device, storage medium and electronic equipment
CN112667806A (en) * 2020-10-20 2021-04-16 上海金桥信息股份有限公司 Text classification screening method using LDA
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112632971B (en) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112632971A (en) * 2020-12-18 2021-04-09 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112287669A (en) * 2020-12-28 2021-01-29 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113268597B (en) * 2021-05-25 2023-06-27 平安科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN113268597A (en) * 2021-05-25 2021-08-17 平安科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
CN113255340A (en) * 2021-07-09 2021-08-13 北京邮电大学 Theme extraction method and device for scientific and technological requirements and storage medium
CN113535965A (en) * 2021-09-16 2021-10-22 杭州费尔斯通科技有限公司 Method and system for large-scale classification of texts
CN113920373A (en) * 2021-10-29 2022-01-11 平安银行股份有限公司 Object classification method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN107609121B (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
CN109446404B (en) Method and device for analyzing emotion polarity of network public sentiment
CN103336766B (en) Short text garbage identification and modeling method and device
CN102411563B (en) Method, device and system for identifying target words
CN107871144A (en) Invoice trade name sorting technique, system, equipment and computer-readable recording medium
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN107180023A (en) A kind of file classification method and system
CN104298665A (en) Identification method and device of evaluation objects of Chinese texts
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN103034626A (en) Emotion analyzing system and method
CN103324628A (en) Industry classification method and system for text publishing
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN108304509B (en) Junk comment filtering method based on text multi-directional expression mutual learning
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN101604322A (en) A kind of decision level text automatic classified fusion method
CN109815400A (en) Personage&#39;s interest extracting method based on long text
CN109858034A (en) A kind of text sentiment classification method based on attention model and sentiment dictionary
CN104142960A (en) Internet data analysis system
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN106528768A (en) Consultation hotspot analysis method and device
CN105224955A (en) Based on the method for microblogging large data acquisition network service state
CN107526805A (en) A kind of ML kNN multi-tag Chinese Text Categorizations based on weight

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210223

Address after: No. 601, Huangpu Avenue West, Shenzhen, Guangdong 510632

Applicant after: Jinan University

Address before: 518057 room 503, block C, building 5, Shenzhen Bay ecological science and Technology Park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN MATENG TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
CB02 Change of applicant information

Address after: 510632 No. 601, Whampoa Avenue, Guangzhou, Guangdong

Applicant after: Jinan University

Address before: No. 601, Huangpu Avenue West, Shenzhen, Guangdong 510632

Applicant before: Jinan University

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210330

Termination date: 20210914

CF01 Termination of patent right due to non-payment of annual fee