CN109344252A - Microblog text classification method and system based on high-quality topic expansion - Google Patents

Microblog text classification method and system based on high-quality topic expansion

Info

Publication number
CN109344252A
CN109344252A (application CN201811064231.3A; granted publication CN109344252B)
Authority
CN
China
Prior art keywords
theme
text
microblogging
word
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811064231.3A
Other languages
Chinese (zh)
Other versions
CN109344252B (en)
Inventor
张曦元
孙福权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811064231.3A priority Critical patent/CN109344252B/en
Publication of CN109344252A publication Critical patent/CN109344252A/en
Application granted granted Critical
Publication of CN109344252B publication Critical patent/CN109344252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a microblog text classification method and system based on high-quality topic expansion. The feature expansion is suited to the classification of short texts such as microblogs and enables effective microblog classification. Training-set microblog data are taken as the input of an LDA model to obtain the topic probability distributions and the word probability distributions; highly representative topics are extracted by information entropy, and high-quality topics are then screened according to inter-topic similarity; topic inference is performed on the test-set microblogs; topic words of the high-quality topics are selected to expand the features of the microblog texts; and the expanded microblog texts are classified with a support vector machine. The method addresses the problem of inaccurate text feature expansion caused by mixed topic words when a topic model is used to expand microblog text features.

Description

Microblog text classification method and system based on high-quality topic expansion
Technical field
The present invention relates to the field of text classification technology, and in particular to a microblog text classification method and system based on high-quality topic expansion.
Background technique
As one of the emerging media, microblogging now has hundreds of millions of users and occupies a leading position among Chinese social networking platforms. Microblogs are easy to post, updated rapidly, and of high research value. Text classification has been studied extensively over the past decades, but the results on short texts such as microblogs have remained unsatisfactory. Microblog texts are short and their features are sparse; after word segmentation and stop-word filtering remove some words, and feature selection removes still more, only a few features remain. Although this reduces computational complexity, it clearly lowers classification accuracy. For better classification, the features of microblog texts therefore need to be expanded.
The LDA model is a three-layer Bayesian probability model consisting of words, topics, and documents. It assumes that every document is composed of multiple latent topics, mines the latent topics from the co-occurrence relations among words, represents each text as a probability distribution over topics, and represents each topic as a probability distribution over a set of words. Using the topic distributions to expand the features of short texts is an effective means of improving short-text classification; however, not every topic trained by a topic model expresses a single subject completely. Mixed and unclear topics exist, and expanding short texts with them directly may introduce unrelated features.
Summary of the invention
In view of the problem set forth above when expanding microblog texts with a topic model, the present invention provides a microblog text classification method and system based on high-quality topic expansion. The method effectively extracts good topics and, after they are used to expand the microblog features, overcomes the poor classification performance caused by feature sparsity.
The technical solution adopted by the present invention is as follows:
A microblog text classification method based on high-quality topic expansion comprises the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts.
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data.
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold.
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set.
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
Further, preprocessing the microblog texts and performing feature selection comprises the following steps:
S11. Perform Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus.
S12. Remove common conjunctions, pronouns, and other stop words from the segmented texts: preprocess with a Chinese stop-word list, delete any feature word that appears in the stop-word list, and then remove the punctuation marks.
S13. Divide the preprocessed texts by class to build a dictionary, count the word statistics of each class, sort the feature words in descending order of total occurrence count, and select the top n words of each class as the feature words of that class; together they serve as the general features of the classes.
Further, in step S2, the topic probability distribution of the training-set data is obtained as follows:
S21. Set the topic-model parameter α and the topic number K, and draw the topic distribution (doc-topic matrix) of each microblog from a Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m, estimated as
θ_{m,k} = (n_{m,k} + α) / (Σ_{k=1}^{K} n_{m,k} + K·α)
where n_{m,k} denotes the number of words of the m-th microblog assigned to topic k.
Further, in step S2, the topic-word probability distribution of the training-set data is obtained as follows:
S22. Set the topic-model parameter β and the topic number K, and draw the word distribution (topic-word matrix) of each topic from a Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k denotes the probability distribution over words of topic k, estimated as
φ_{k,v} = (n_{k,v} + β) / (Σ_{v=1}^{V} n_{k,v} + V·β)
where n_{k,v} denotes the number of times word v occurs under topic k and V is the vocabulary size.
Further, step S3 specifically comprises:
S31. Compute the topic information entropy TE:
TE(k) = −Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k.
S32. Compute the relative entropy between topics:
D_KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; when two distributions are identical the relative entropy is zero, and it grows as the difference between the two distributions grows.
S33. Compute the average similarity of each topic.
The JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy:
JS(P, Q) = ½·D_KL(P‖M) + ½·D_KL(Q‖M), where M = ½(P + Q)
The average similarity measures the independence of one distribution relative to the other distributions; the average similarity of topic k is computed as
Sim(k) = (1 / (K−1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics.
S34. Screen the high-quality topics.
The topic quality score G(k) is computed from the topic entropy and the average similarity. If the topic quality score satisfies G(k) > μ, where μ is a threshold, the topic is judged to be a high-quality topic and kept as an expansion candidate; otherwise it is not a high-quality topic. The high-quality topic set S is thus obtained.
Further, in step S4, topic assignment on the training set is performed as follows:
S41. Obtain the topic distributions from the topic model trained on the training set; for each microblog, select among the high-quality topics the topic with the largest probability, take the λ feature words w = {w_1, w_2, …, w_λ} with the highest probability under that topic, and add them as expansion words to the text features of the training set; a word w is merged into the document only if it is not already present in the original document.
Further, in step S4, topic inference and feature expansion on the test set are performed as follows:
S42. Use the topic model trained on the training set to perform topic inference on the test set and obtain the document-topic distribution matrix of the test texts; for each test text, select the topic with the largest probability within the high-quality topic set S, and add its λ highest-probability feature words w = {w_1, w_2, …, w_λ} as expansion words to the text features of the test set.
Further, step S5 specifically comprises:
S51. Represent the expanded texts obtained in step S41 with a vector space model: regard each document d as an n-dimensional vector v = (ε_1, ε_2, …, ε_n) in the vector space, where ε_i denotes the weight of the i-th word, computed with TF-IDF as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus.
S52. Perform text classification with the LIBSVM tool; each document is converted to the data format "label index1:value1 index2:value2 …", where label is the class identifier, the indices number the feature words, and the values are their TF-IDF weights.
S53. Record the class labels Y = {y1, y2, …, yn} of the training set, train the model on the training set, and then perform class prediction on the test set.
The present invention also provides a microblog text classification system based on high-quality topic expansion, comprising:
a text collection unit, for collecting the self-acquired microblog text data and constructing a training set and a test set;
a text data preprocessing unit, for preprocessing the original text samples and performing feature selection, comprising:
a Chinese word segmentation module, for splitting complete sentences into words and removing the stop words in the texts,
a Chinese stop-word list module, for deleting the feature words that appear in the stop-word list and removing the punctuation marks,
a dictionary construction module, for sorting the feature words in the texts and summarizing them;
an LDA model training unit, for obtaining the document-topic distributions and topic-word distributions from the training-set data, comprising:
a data processing module, for computing the quality score from the topic-word distribution data and dividing out the high-quality topics by a set threshold;
the LDA model training unit being also used to expand the texts of the training set and the test set with the high-quality feature words; and
a text classification unit, for classifying the expanded training set with the LIBSVM tool and classifying the test data of the test set to generate the classification results.
Compared with the prior art, the present invention has the following advantages:
The microblog text classification method based on high-quality topic expansion effectively extracts good topics and, after they are used to expand the microblog features, overcomes the poor classification performance caused by feature sparsity. Compared with the prior art, its accuracy is higher and it is better suited to feature expansion in the classification of short texts such as microblogs, enabling effective microblog classification. It effectively solves the problem of inaccurate text feature expansion caused by mixed topic words when a topic model is used to expand microblog text features.
For the above reasons, the present invention can be widely applied in the field of text classification technology.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is the flowchart of the microblog text classification method based on high-quality topic expansion of the present invention.
Fig. 2 is the LDA probabilistic model of the microblog text classification method based on high-quality topic expansion of the present invention.
Specific embodiment
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely in conjunction with the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope protected by the present invention.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in sequences other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to that process, method, product, or device.
As shown in Fig. 1, the present invention provides a microblog text classification method based on high-quality topic expansion, comprising the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts.
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data.
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold.
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set.
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
Preprocessing the microblog texts and performing feature selection comprises the following steps:
S11. Perform Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus.
S12. Remove common conjunctions, pronouns, and other stop words from the segmented texts: preprocess with a Chinese stop-word list, delete any feature word that appears in the stop-word list, and then remove the punctuation marks.
S13. Divide the preprocessed texts by class to build a dictionary, count the word statistics of each class, sort the feature words in descending order of total occurrence count, and select the top n words of each class as the feature words of that class; together they serve as the general features of the classes.
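For illustration only (not part of the patented embodiment), steps S11–S13 can be sketched in Python. The patent does not name a segmenter, so a toy forward-maximum-matching segmenter stands in for a real Chinese word segmenter; the lexicon and stop-word list below are made-up assumptions:

```python
from collections import Counter

STOPWORDS = {"的", "了", "和", "是"}               # toy Chinese stop-word list (assumed)
LEXICON = {"微博", "文本", "分类", "主题", "模型"}   # toy segmentation dictionary (assumed)

def segment(sentence, lexicon=LEXICON, max_len=4):
    """S11: forward-maximum-matching segmentation, a stand-in for a real segmenter."""
    words, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + l]
            if l == 1 or cand in lexicon:   # fall back to single characters
                words.append(cand)
                i += l
                break
    return words

def preprocess(sentence):
    """S12: segment, then drop stop words and punctuation."""
    return [w for w in segment(sentence)
            if w not in STOPWORDS and w.isalnum()]

def class_features(docs_by_class, n=2):
    """S13: per class, keep the n most frequent words as its feature words."""
    return {c: [w for w, _ in
                Counter(w for d in docs for w in preprocess(d)).most_common(n)]
            for c, docs in docs_by_class.items()}
```

In practice a statistical segmenter would replace `segment`, and n would be tuned per corpus.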
As shown in Fig. 2, the topic-model parameters α and β and the topic number K are set, and parameter estimation is carried out by means of Gibbs sampling.
S21. Set the topic-model parameter α and the topic number K, and draw the topic distribution (doc-topic matrix) of each microblog from a Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m, estimated as
θ_{m,k} = (n_{m,k} + α) / (Σ_{k=1}^{K} n_{m,k} + K·α)
where n_{m,k} denotes the number of words of the m-th microblog assigned to topic k.
S22. Set the topic-model parameter β and the topic number K, and draw the word distribution (topic-word matrix) of each topic from a Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k denotes the probability distribution over words of topic k, estimated as
φ_{k,v} = (n_{k,v} + β) / (Σ_{v=1}^{V} n_{k,v} + V·β)
where n_{k,v} denotes the number of times word v occurs under topic k and V is the vocabulary size.
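A minimal sketch of collapsed Gibbs sampling for LDA, ending with the θ and φ posterior-mean estimates of S21/S22 (symmetric α and β; the corpus, hyperparameter values, and iteration count are illustrative, not the patent's):

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for LDA. docs: list of token lists.
    Returns (theta, phi, vocab) with theta[m][k] and phi[k][v] as in S21/S22."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    vid = {w: i for i, w in enumerate(vocab)}
    V, M = len(vocab), len(docs)
    n_mk = [[0] * K for _ in range(M)]   # n_{m,k}: tokens of doc m assigned to topic k
    n_kv = [[0] * V for _ in range(K)]   # n_{k,v}: count of word v under topic k
    n_k = [0] * K                        # total tokens assigned to topic k
    z = []                               # topic assignment of every token
    for m, d in enumerate(docs):         # random initialization
        zm = []
        for w in d:
            k = rng.randrange(K)
            zm.append(k)
            n_mk[m][k] += 1; n_kv[k][vid[w]] += 1; n_k[k] += 1
        z.append(zm)
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k, v = z[m][i], vid[w]
                n_mk[m][k] -= 1; n_kv[k][v] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | rest), up to a constant
                p = [(n_mk[m][j] + alpha) * (n_kv[j][v] + beta) / (n_k[j] + V * beta)
                     for j in range(K)]
                r, acc, k = rng.random() * sum(p), 0.0, K - 1
                for j, pj in enumerate(p):
                    acc += pj
                    if r < acc:
                        k = j
                        break
                z[m][i] = k
                n_mk[m][k] += 1; n_kv[k][v] += 1; n_k[k] += 1
    # posterior means theta_{m,k} and phi_{k,v} exactly as in S21/S22
    theta = [[(n_mk[m][k] + alpha) / (len(docs[m]) + K * alpha) for k in range(K)]
             for m in range(M)]
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab
```

Each row of θ and φ sums to one by construction, matching the Dirichlet posterior-mean formulas above.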
S31. Apply information entropy to the topic-word distributions to compute the topic entropy, and thereby divide out the high-quality topics. The topic information entropy TE is computed as
TE(k) = −Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k. The smaller the value of TE, the more uneven the distribution: viewed from each topic, a small number of feature words occur with high probability while the other words occur with low probability; in that case the topic is strongly representative and the topic noise is small.
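The topic entropy of S31 is a one-liner; as a sanity check, a peaked word distribution (representative topic) should score lower than a uniform one:

```python
import math

def topic_entropy(p_w_given_k):
    """TE(k) = -sum_w P(w|k) * ln P(w|k); lower entropy means the topic's
    probability mass concentrates on a few words (more representative)."""
    return -sum(p * math.log(p) for p in p_w_given_k if p > 0)
```

For a uniform distribution over n words, TE equals ln n, the maximum possible value.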
S32. Compute the relative entropy between topics. The relative entropy (KL divergence) is an index measuring the difference between probability distributions:
D_KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; when two distributions are identical the relative entropy is zero, and it grows as the difference between the two distributions grows.
S33. Compute the average similarity of each topic.
The JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy:
JS(P, Q) = ½·D_KL(P‖M) + ½·D_KL(Q‖M), where M = ½(P + Q)
The average similarity measures the independence of one distribution relative to the other distributions; the average similarity of topic k is computed as
Sim(k) = (1 / (K−1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics.
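Steps S32–S33 can be sketched directly from the formulas (distributions are plain probability lists over a shared vocabulary):

```python
import math

def kl(p, q):
    """Relative entropy D_KL(P || Q) of S32."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance of S33, built from the relative entropy; symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(k, phi):
    """Sim(k): mean JS distance of topic k's word distribution to every other topic's."""
    others = [js(phi[k], phi[j]) for j in range(len(phi)) if j != k]
    return sum(others) / len(others)
```

Using M = (P+Q)/2 keeps JS finite even when P and Q have disjoint supports, which plain KL does not.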
S34. Screen the high-quality topics.
The topic quality score G(k) is computed from the topic entropy and the average similarity. If the topic quality score satisfies G(k) > μ, where μ is a threshold, the topic is judged to be a high-quality topic and kept as an expansion candidate; otherwise it is not a high-quality topic. The high-quality topic set S is thus obtained.
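The patent's exact formula for combining topic entropy and average similarity into G(k) is not reproduced in this text. The sketch below therefore ASSUMES an illustrative score, G(k) = Sim(k) / TE(k) — rewarding a focused word distribution (low entropy) and independence from other topics (large average JS distance) — only to show the thresholding of S34:

```python
def quality_score(te, sim):
    """ILLUSTRATIVE ASSUMPTION: G(k) = Sim(k) / TE(k); the patent's actual
    combination of topic entropy and average similarity is not given here."""
    return sim / te

def select_quality_topics(te_list, sim_list, mu):
    """S34: keep topic k in the high-quality set S when G(k) > mu."""
    return {k for k, (te, sim) in enumerate(zip(te_list, sim_list))
            if quality_score(te, sim) > mu}
```

Any monotone combination with the same directionality would slot into `quality_score` unchanged.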
S41. Obtain the topic distributions from the topic model trained on the training set; for each microblog, select among the high-quality topics the topic with the largest probability, take the λ feature words w = {w_1, w_2, …, w_λ} with the highest probability under that topic, and add them as expansion words to the text features of the training set; a word w is merged into the document only if it is not already present in the original document.
S42. Use the topic model trained on the training set to perform topic inference on the test set and obtain the document-topic distribution matrix of the test texts; for each test text, select the topic with the largest probability within the high-quality topic set S, and add its λ highest-probability feature words w = {w_1, w_2, …, w_λ} as expansion words to the text features of the test set.
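The expansion step of S41/S42 is the same for training and test texts once θ and φ are available; a sketch (function and argument names are illustrative):

```python
def expand_text(tokens, theta_m, phi, vocab, quality_set, lam=3):
    """S41/S42: pick the document's highest-probability high-quality topic and
    append that topic's top-lambda words that are not already in the text."""
    if not quality_set:
        return list(tokens)
    best = max(quality_set, key=lambda k: theta_m[k])           # argmax over S
    ranked = sorted(range(len(vocab)), key=lambda v: phi[best][v], reverse=True)
    expansion = [vocab[v] for v in ranked[:lam]]
    return list(tokens) + [w for w in expansion if w not in tokens]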
S51. Represent the expanded texts obtained in step S41 with a vector space model: regard each document d as an n-dimensional vector v = (ε_1, ε_2, …, ε_n) in the vector space, where ε_i denotes the weight of the i-th word, computed with TF-IDF as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus.
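The TF-IDF weighting of S51 over a fixed vocabulary can be sketched as follows (raw term counts for tf, natural log for idf — a common convention, assumed here since the patent does not fix one):

```python
import math

def tfidf_vector(doc_tokens, corpus, vocab):
    """S51: epsilon_i = tf_ij * log(M / df_i) over a fixed vocabulary."""
    M = len(corpus)
    df = {w: sum(1 for d in corpus if w in d) for w in vocab}
    return [doc_tokens.count(w) * math.log(M / df[w]) if df[w] else 0.0
            for w in vocab]
```

A word occurring in every document gets idf = log(1) = 0, so it contributes nothing to the vector.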
S52. Perform text classification with the LIBSVM tool; each document is converted to the data format "label index1:value1 index2:value2 …", where label is the class identifier, the indices number the feature words, and the values are their TF-IDF weights.
S53. Record the class labels Y = {y1, y2, …, yn} of the training set, train the model on the training set, and then perform class prediction on the test set.
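Writing one document in the sparse LIBSVM input format of S52 can be sketched as (1-based indices and omitted zero weights follow the LIBSVM convention):

```python
def to_libsvm_line(label, weights):
    """S52: 'label index:value ...' with 1-based feature indices;
    zero weights are omitted, as the sparse LIBSVM format allows."""
    feats = " ".join(f"{i + 1}:{w:g}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {feats}".rstrip()
```

One such line per document, written to a text file, is what `svm-train` and `svm-predict` consume.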
The microblog text classification method based on high-quality topic expansion of the present invention was compared with the standalone SVM method and the LDA+SVM method; experiments verified that the accuracy is significantly improved, as shown in Table 1:
Table 1
Method                       Recall    Precision
SVM                          0.754     0.760
LDA+SVM                      0.831     0.822
High-quality topics + SVM    0.863     0.857
A microblog text classification system based on high-quality topic expansion comprises:
a text collection unit, for collecting the self-acquired microblog text data and constructing a training set and a test set;
a text data preprocessing unit, for preprocessing the original text samples and performing feature selection, comprising:
a Chinese word segmentation module, for splitting complete sentences into words and removing the stop words in the texts,
a Chinese stop-word list module, for deleting the feature words that appear in the stop-word list and removing the punctuation marks,
a dictionary construction module, for sorting the feature words in the texts and summarizing them;
an LDA model training unit, for obtaining the document-topic distributions and topic-word distributions from the training-set data, comprising:
a data processing module, for computing the quality score from the topic-word distribution data and dividing out the high-quality topics by a set threshold;
the LDA model training unit being also used to expand the texts of the training set and the test set with the high-quality feature words; and
a text classification unit, for classifying the expanded training set with the LIBSVM tool and classifying the test data of the test set to generate the classification results.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content can be realized in other ways. The device embodiments described above are merely exemplary; for example, the division of the units may be a division by logical function, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Moreover, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disk.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that it is still possible to modify the technical solutions described in the foregoing embodiments, or to replace some or all of the technical features equivalently; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A microblog text classification method based on high-quality topic expansion, characterized by comprising the following steps:
S1. Preprocess the microblog texts and perform feature selection; construct a training set and a test set from the preprocessed texts;
S2. Take the preprocessed training-set data as the input of an LDA model to obtain the topic probability distribution and the topic-word probability distribution of the training-set data;
S3. Apply information entropy to the topic-word distributions to compute the topic entropy, and compute the relative entropy and the average similarity of each topic; from these, compute a topic quality score, and filter out the high-quality topics with a set threshold;
S4. Perform topic assignment on the training set and the test set respectively: for each text, select the topic with the largest probability among the high-quality topics in its LDA topic distribution, and add the topic words of that topic as expansion words to the text features of the training set and the test set;
S5. Represent the expanded texts with a vector space model, compute the weight of each feature word with TF-IDF, convert the training and test documents into vectors, select the useful features, train the SVM classifier on the training set, and then perform class prediction on the test set to generate the classification results.
2. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that performing data preprocessing and feature selection on the microblog texts comprises the following steps:
S11. performing Chinese word segmentation preprocessing on the texts, dividing complete sentences into words to obtain the feature set of the text corpus;
S12. removing common stop words such as conjunctions and pronouns from the segmented texts using a Chinese stop-word list: if a feature word appears in the stop-word list, it is deleted; punctuation marks are also removed;
S13. dividing the preprocessed texts by class to build a dictionary: counting the word statistics of each class, sorting the feature words in descending order of total number of occurrences, and, after aggregation, selecting the top n words of each class as the representative features of that class.
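The preprocessing of steps S12 and S13 can be sketched with the standard library alone. The segmentation of S11 is assumed to have already produced token lists (e.g. via a segmentation tool such as jieba), and the stop-word list below is a tiny hypothetical stand-in for a full Chinese stoplist:

```python
from collections import Counter

# Hypothetical stop-word list standing in for a full Chinese stoplist (S12).
STOPWORDS = {"的", "了", "和", "是", ",", "。"}

def remove_stopwords(tokens):
    """S12: drop stop words and punctuation from a segmented text."""
    return [t for t in tokens if t not in STOPWORDS]

def build_class_dictionary(docs_by_class, n):
    """S13: per class, count word occurrences and keep the top-n words
    as that class's representative features."""
    dictionary = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in remove_stopwords(doc))
        dictionary[label] = [w for w, _ in counts.most_common(n)]
    return dictionary

# Toy two-class corpus of already-segmented microblogs (illustrative data).
docs = {
    "sports": [["比赛", "的", "胜利"], ["比赛", "精彩", "。"]],
    "tech":   [["手机", "发布", "了"], ["手机", "芯片", "是"]],
}
features = build_class_dictionary(docs, n=1)
```

With n = 1 each class keeps its single most frequent content word, which is the per-class "general characteristic" the claim describes.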
3. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that in step S2 the topic probability distribution of the training-set data is obtained as follows:
S21. setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m denotes the topic probability distribution of document m,
θ_{m,k} = (n_{m,k} + α) / Σ_{k'=1..K} (n_{m,k'} + α)
where n_{m,k} denotes the number of words assigned to topic k in the m-th microblog;
the probability distribution of the topic words of the training-set data is obtained as follows:
S22. setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), k ∈ [1, K], where φ_k denotes the probability distribution of the words of topic k,
φ_{k,v} = (n_{k,v} + β) / Σ_{v'=1..V} (n_{k,v'} + β)
where n_{k,v} denotes the number of times word v occurs under topic k.
4. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that step S3 specifically comprises:
S31. computing the topic information entropy TE:
TE(k) = -Σ_w P(w|k) · ln P(w|k)
where P(w|k) denotes the probability that word w appears under topic k;
S32. computing the relative entropy of the topics:
KL(P‖Q) = Σ_x P(x) · ln(P(x) / Q(x))
where P and Q denote the distributions to be compared; the relative entropy is zero when the two random distributions are identical, and increases as the difference between the two random distributions grows;
S33. computing the average similarity of the topics: the JS distance between topics, used to measure inter-topic similarity, is computed from the relative entropy as
JS(P, Q) = (1/2) KL(P‖M) + (1/2) KL(Q‖M), with M = (P + Q) / 2;
the average similarity measures the independence of one distribution relative to the others; the average similarity of topic k is computed as
AS(k) = (1 / (K − 1)) · Σ_{j≠k} JS(φ_k, φ_j)
where j ≠ k and K denotes the total number of topics;
S34. screening the high-quality topics: the topic quality measure G(k) is computed from the topic entropy and the average similarity; if the quality measure satisfies G(k) > μ, where μ is the threshold, the topic is judged to be a high-quality topic and serves as a candidate for expansion; otherwise it is not a high-quality topic; the high-quality topic set S is thereby obtained.
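The entropy and similarity quantities of steps S31–S33 can be sketched directly in numpy. The exact combination G(k) of topic entropy and average similarity is not reproduced in this text, so the sketch computes only the two ingredients; the φ matrix is toy data:

```python
import numpy as np

def topic_entropy(p):
    """S31: TE(k) = -sum_w P(w|k) * ln P(w|k); lower means a more focused topic."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    """S32: relative entropy KL(P||Q); zero iff the two distributions coincide."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """S33: symmetric JS distance built from KL, measuring topic-topic similarity."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(phi, k):
    """S33: mean JS distance of topic k to every other topic j != k."""
    K = phi.shape[0]
    return sum(js(phi[k], phi[j]) for j in range(K) if j != k) / (K - 1)

# Toy topic-word matrix: K=3 topics over V=4 words; topic 2 nearly copies topic 1.
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.6, 0.2, 0.1]])
entropies = [topic_entropy(p) for p in phi]
```

As expected, KL of a distribution against itself is zero, and the two near-identical topics have a much smaller JS distance than two distinct ones.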
5. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that in step S4 the topic partitioning of the training set is specifically:
S41. obtaining the topic distribution from the topic model trained on the training set; for each microblog, selecting the topic with the maximum probability value among the high-quality topics, and adding the λ top-ranked feature words w = {w_1, w_2, …, w_λ} of the topic-word distribution of that topic as expansion words to the text features of the training set, an expansion word w being merged into a document only if it is not already present in the original document.
6. The microblog text classification method based on high-quality topic expansion according to claim 5, characterized in that in step S4 the topic inference and feature expansion for the test set are specifically:
S42. performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability value within the high-quality topic set S, and adding the λ top-ranked feature words w = {w_1, w_2, …, w_λ} of that topic as expansion words to the text features of the test set.
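Steps S41/S42 reduce to: pick a document's highest-probability high-quality topic, then append that topic's top-λ words that the document does not already contain. A minimal sketch with hypothetical vocabulary and distributions:

```python
import numpy as np

def expand_document(tokens, theta_m, phi, vocab, good_topics, lam):
    """S41/S42: select the highest-probability topic among the high-quality
    set, take its top-lambda words, and append the ones the text lacks."""
    k = max(good_topics, key=lambda t: theta_m[t])   # best high-quality topic
    top = np.argsort(phi[k])[::-1][:lam]             # top-lambda word indices
    expansion = [vocab[v] for v in top if vocab[v] not in tokens]
    return tokens + expansion

# Illustrative vocabulary, topic-word matrix, and doc-topic distribution.
vocab = ["比赛", "胜利", "手机", "芯片"]
phi = np.array([[0.50, 0.40, 0.05, 0.05],   # topic 0: sports-like
                [0.05, 0.05, 0.50, 0.40]])  # topic 1: tech-like
theta_m = np.array([0.2, 0.8])              # this microblog leans to topic 1
expanded = expand_document(["手机"], theta_m, phi, vocab,
                           good_topics=[0, 1], lam=2)
```

Here the text already contains "手机", so only "芯片" is actually merged in, matching the claim's rule that an expansion word is added only when absent from the original document.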
7. The microblog text classification method based on high-quality topic expansion according to claim 1, characterized in that step S5 is specifically:
S51. representing the expanded texts obtained in step S41 with a vector space model, treating each document d as an n-dimensional vector in the vector space, and computing the feature weights with TF-IDF: v = (ε_1, ε_2, …, ε_n), where ε_i denotes the weight of the i-th word, computed as
ε_i = tf_{ij} · log(M / df_i)
where tf_{ij} denotes the frequency with which a feature word occurs in a given text, df_i denotes the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus;
S52. performing text classification with the LIBSVM tool, converting each document into the data format label 1:value_1 2:value_2 …, where label is the class identifier and value_i is the TF-IDF weight of the i-th feature;
S53. recording the class labels Y = {y_1, y_2, …, y_n} of the training set, training the model on the training set, and then performing classification prediction on the test set.
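A minimal sketch of the TF-IDF weighting of S51 (using the common tf·log(M/df) variant, since the patent publishes its formula only as an image) and the sparse LIBSVM input line of S52:

```python
import math

def tf_idf(tf_ij, df_i, M):
    """S51: weight of feature i in document j; a word occurring in every
    document (df_i == M) gets weight zero."""
    return tf_ij * math.log(M / df_i)

def to_libsvm_line(label, weights):
    """S52: 'label 1:value1 2:value2 ...' line expected by LIBSVM;
    feature indices start at 1 and zero weights are omitted."""
    feats = " ".join(f"{i + 1}:{w:.4f}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {feats}"

M = 4                                          # corpus size (toy value)
w = [tf_idf(2, 1, M), 0.0, tf_idf(1, 4, M)]    # third feature: df = M -> idf 0
line = to_libsvm_line(1, w)
```

Omitting zero-valued features keeps the LIBSVM file sparse, which is how the tool's format is normally used.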
8. A microblog text classification system based on high-quality topic expansion, characterized by comprising:
a text collection unit, configured to collect microblog text data and construct the training set and the test set from it;
a text data preprocessing unit, configured to preprocess the raw text samples and perform feature selection, comprising:
a Chinese word segmentation module, configured to divide complete sentences into words and remove the stop words from the texts,
a Chinese stop-word list module, configured to delete the feature words appearing in the stop-word list from the texts and remove punctuation marks,
a dictionary building module, configured to sort the feature words in the texts and aggregate them;
an LDA model training unit, configured to obtain the document-topic distribution and the topic-word distribution from the training-set data, comprising:
a data processing module, configured to compute the quality measure from the topic-word distribution data and to partition out the high-quality topics with a set threshold;
the LDA model training unit being further configured to use the high-quality feature words as expansion words for the text expansion of the training-set and test-set texts;
a text classification unit, configured to perform text classification on the expanded training set with the LIBSVM tool, classify the test data of the test set, and generate the classification results.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the method according to any one of claims 1 to 7 is executed.
10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the method according to any one of claims 1 to 7 by running the computer program.
CN201811064231.3A 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension Active CN109344252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Publications (2)

Publication Number Publication Date
CN109344252A true CN109344252A (en) 2019-02-15
CN109344252B CN109344252B (en) 2021-12-07

Family

ID=65304880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064231.3A Active CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Country Status (1)

Country Link
CN (1) CN109344252B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN104899273A (en) * 2015-05-27 2015-09-09 东南大学 Personalized webpage recommendation method based on topic and relative entropy
CN106991127A * 2017-03-06 2017-07-28 西安交通大学 Knowledge-topic short text hierarchical classification method based on topological feature expansion
CN108090231A * 2018-01-12 2018-05-29 北京理工大学 Topic model optimization method based on information entropy
CN108121736A * 2016-11-30 2018-06-05 北京搜狗科技发展有限公司 Method and device for building a topic word determination model, and electronic device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270A (en) * 2019-08-15 2019-12-13 中国人民解放军国防科技大学 bayesian-based LDA topic label calibration method, system and medium
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium
CN113177409A (en) * 2021-05-06 2021-07-27 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system
CN113177409B (en) * 2021-05-06 2024-05-31 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system

Also Published As

Publication number Publication date
CN109344252B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US11853704B2 (en) Classification model training method, classification method, device, and medium
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN106951422B (en) Webpage training method and device, and search intention identification method and device
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
Froud et al. Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering
CN108665323B (en) Integration method for financial product recommendation system
CN103207913B Method and system for acquiring fine-grained commercial semantic relations
EP3279804A1 (en) Data analysis system, data analysis method, data analysis program, and recording medium
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN107301171A Text sentiment analysis method and system based on sentiment dictionary learning
Bayot et al. Multilingual author profiling using word embedding averages and svms
CN110442841A (en) Identify method and device, the computer equipment, storage medium of resume
CN109086375A Short text topic extraction method based on word vector enhancement
CN110110225B (en) Online education recommendation model based on user behavior data analysis and construction method
CN106506327B (en) Junk mail identification method and device
CN106599054A (en) Method and system for title classification and push
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN107609113A Automatic document classification method
CN110705247A Text similarity calculation method based on χ²-C
CN108875034A Chinese text classification method based on hierarchical long short-term memory network
CN109657064A Text classification method and device
CN107463715A (en) English social media account number classification method based on information gain
CN111221968A (en) Author disambiguation method and device based on subject tree clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant