CN109344252B - Microblog text classification method and system based on high-quality theme extension - Google Patents


Info

Publication number
CN109344252B
Authority
CN
China
Prior art keywords: theme, text, words, quality, word
Prior art date
Legal status
Active
Application number
CN201811064231.3A
Other languages
Chinese (zh)
Other versions
CN109344252A (en)
Inventor
张曦元
孙福权
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811064231.3A
Publication of CN109344252A
Application granted
Publication of CN109344252B
Status: Active


Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides a microblog text classification method and system based on high-quality topic expansion. The method expands the features of short texts such as microblogs and enables their effective classification. Training-set microblog data are used as input to an LDA model to obtain topic and topic-word probability distributions; highly representative topics are extracted with information entropy and, together with inter-topic similarity, used to select high-quality topics; topic inference is performed on the test-set microblogs; feature words of the high-quality topics are selected to expand the microblog texts; and the expanded texts are classified with a support vector machine. The method addresses the inaccurate text feature expansion caused by topic mixing when a topic model is used to expand microblog text features.

Description

Microblog text classification method and system based on high-quality theme extension
Technical Field
The invention relates to the technical field of text classification, in particular to a microblog text classification method and system based on high-quality theme extension.
Background
As an emerging medium, the microblog now has hundreds of millions of users and holds a leading position among Chinese social network platforms. Microblogs are simple to operate and their content is updated rapidly, which gives them high research value. Text classification has been studied extensively over the past decades, but results on short texts such as microblogs have remained unsatisfactory. Because microblog texts are short and their features sparse, word segmentation and stop-word removal filter out many words, and few features remain after feature selection; the computational complexity drops, but classification accuracy drops markedly as well. The features of microblog texts therefore need to be expanded for better classification.
The LDA model is a three-layer Bayesian probability model over words, topics, and documents. It assumes each document is composed of several latent topics, mines those topics from the co-occurrence of words, represents each text as a probability distribution over topics, and represents each topic as a probability distribution over words. Expanding short-text features with topic distributions is an effective way to improve short-text classification, but not every topic trained by a topic model expresses a single coherent theme; topic mixing and topic ambiguity occur, and expanding short texts directly with such topics introduces inconsistent features.
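As a minimal illustration of the generative view described above (document to topic to word), the following pure-Python sketch draws each word of a document by first sampling a topic from the document's topic distribution θ and then sampling a word from that topic's word distribution φ. It is a toy with made-up distributions, not part of the patented method:

```python
import random

def generate_document(theta, phi, vocab, length, seed=0):
    """Generate one document under the LDA generative story:
    for each word slot, sample a topic z ~ theta, then a word w ~ phi[z]."""
    rng = random.Random(seed)
    doc = []
    for _ in range(length):
        # pick a topic index according to the document's topic proportions
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # pick a word index according to that topic's word distribution
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]
        doc.append(vocab[w])
    return doc
```

Training inverts this story: given only the documents, LDA recovers θ and φ from word co-occurrence.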
Disclosure of Invention
In view of the technical problems in expanding microblog texts with a topic model, a microblog text classification method and system based on high-quality topic expansion are provided. The method extracts high-quality topics effectively, and after it is used for microblog feature expansion it overcomes the poor classification caused by feature sparsity.
The technical solution adopted by the invention is as follows:
A microblog text classification method based on high-quality topic expansion comprises the following steps:
S1, preprocessing the microblog texts and selecting features, and constructing a training set and a test set from the preprocessed texts;
S2, using the preprocessed training-set data as input to an LDA model to obtain the topic probability distribution and topic-word probability distribution of the training-set data;
S3, applying information entropy to the topic-word probability distributions to calculate topic entropy, and calculating the relative entropy and average similarity of topics, thereby computing a quality coefficient for each topic and setting a threshold to screen out the high-quality topics;
S4, performing topic division on the training set and the test set respectively: through the LDA topic distributions, assigning each text to the topic with the maximum probability among the high-quality topics, and adding that topic's subject words as expansion words to the text features of the training set and the test set respectively;
S5, representing the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result.
Further, preprocessing the microblog texts and selecting features comprises the following steps:
S11, performing Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus;
S12, removing stop words such as common conjunctions and pronouns from the segmented texts with a Chinese stop-word list, deleting any feature word that appears in the stop-word list, and then removing punctuation marks;
S13, dividing the preprocessed texts by category to build a dictionary, counting the word statistics of each category, sorting the feature words by total number of occurrences in descending order, selecting the top n words of each category as its feature words, and aggregating them as the general features of the category.
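A minimal sketch of steps S12 and S13, assuming the texts are already segmented into token lists and a stop-word set is given (the function names and the `isalnum` punctuation test are illustrative choices, not from the patent):

```python
from collections import Counter

def preprocess(tokens, stopwords):
    """Step S12: drop stop words and punctuation tokens from a segmented text."""
    return [t for t in tokens if t not in stopwords and t.isalnum()]

def category_features(texts_by_category, n):
    """Step S13: per category, count word occurrences, sort in descending
    order of total occurrences, and keep the top-n words as that category's
    feature set."""
    features = {}
    for cat, texts in texts_by_category.items():
        counts = Counter(w for text in texts for w in text)
        features[cat] = [w for w, _ in counts.most_common(n)]
    return features
```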
Further, in step S2, the topic probability distribution of the training-set data is obtained through the following steps:
S21, setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) θ_m of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m is the topic probability distribution of document m, estimated as

θ_{m,k} = (n_{m,k} + α) / (∑_{k'=1}^{K} n_{m,k'} + K·α)

where n_{m,k} is the number of words assigned to topic k in the m-th microblog.
Further, in step S2, the topic-word probability distribution of the training-set data is obtained through the following steps:
S22, setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) φ_k of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k is the word probability distribution of topic k, estimated as

φ_{k,v} = (n_{k,v} + β) / (∑_{v'=1}^{V} n_{k,v'} + V·β)

where n_{k,v} is the number of occurrences of word v under topic k and V is the vocabulary size.
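The Dirichlet-smoothed count estimates of steps S21 and S22 can be sketched as follows, assuming the Gibbs-sampling counts n_{m,k} and n_{k,v} are already available (a minimal illustration, not the patent's implementation):

```python
def doc_topic_dist(n_mk, alpha):
    """theta_{m,k} = (n_{m,k} + alpha) / (sum_k' n_{m,k'} + K*alpha):
    the smoothed topic proportions of one document (step S21)."""
    K = len(n_mk)
    total = sum(n_mk) + K * alpha
    return [(n + alpha) / total for n in n_mk]

def topic_word_dist(n_kv, beta):
    """phi_{k,v} = (n_{k,v} + beta) / (sum_v' n_{k,v'} + V*beta):
    the smoothed word probabilities of one topic (step S22)."""
    V = len(n_kv)
    total = sum(n_kv) + V * beta
    return [(n + beta) / total for n in n_kv]
```

Both estimates add the Dirichlet prior as pseudo-counts, so every topic and every word keeps nonzero probability.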
Further, the step S3 specifically includes:
S31, calculating the topic information entropy TE:

TE(k) = -∑_w P(w|k) · ln P(w|k)

where P(w|k) is the probability of word w appearing under topic k;
S32, calculating the relative entropy between topics:

KL(P‖Q) = ∑_x P(x) · ln( P(x) / Q(x) )

where P and Q are the distributions being compared; the relative entropy is zero when the two distributions are identical and increases as the difference between them increases;
S33, calculating the average similarity of topics. The JS distance, built from the relative entropy, measures the similarity between topics:

JS(P, Q) = (1/2)·KL(P ‖ (P+Q)/2) + (1/2)·KL(Q ‖ (P+Q)/2)

The average similarity measures the independence of a distribution relative to the other distributions:

AS(k) = (1/(K-1)) · ∑_{j≠k} JS(φ_k, φ_j)

where K is the total number of topics;
S34, screening the high-quality topics.
A topic quality coefficient G(k) is calculated from the topic entropy TE(k) and the average similarity AS(k) (the formula appears only as an equation image in the original). If the coefficient satisfies G(k) > μ, where μ is the threshold, the topic is judged high-quality and kept as an expansion candidate; otherwise it is not, yielding the high-quality topic set S.
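A sketch of the step S3 quantities follows. The entropy, relative-entropy, and JS computations mirror the formulas above; the quality coefficient `quality` is only an assumed stand-in, since the patent gives G(k) as an equation image — here G grows as the topic entropy falls and as the average JS distance to the other topics rises:

```python
import math

def topic_entropy(phi_k):
    """TE(k) = -sum_w P(w|k) * ln P(w|k) (step S31); a small value means a
    few words dominate, i.e. the topic is sharply represented."""
    return -sum(p * math.log(p) for p in phi_k if p > 0)

def kl(p, q):
    """Relative entropy KL(P||Q) (step S32)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance built from the relative entropy to the mixture (step S33)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_similarity(k, phis):
    """AS(k): average JS distance of topic k to every other topic."""
    K = len(phis)
    return sum(js(phis[k], phis[j]) for j in range(K) if j != k) / (K - 1)

def quality(k, phis):
    """Assumed stand-in for the quality coefficient G(k) of step S34:
    rewards low entropy (sharp topics) and high independence."""
    return avg_similarity(k, phis) / (topic_entropy(phis[k]) + 1e-12)
```

Topics whose coefficient exceeds the threshold μ form the high-quality set S.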
Further, in step S4, the topic division performed on the training set specifically comprises:
S41, for each microblog, according to the topic distribution given by the topic model trained on the training set, selecting the topic with the maximum probability among the high-quality topics, taking from the corresponding topic words the λ feature words with the highest probabilities, w = {w_1, w_2, …, w_λ}, and adding the words w to the training text's features as expansion words; an expansion word not already present in the original document is merged into the document.
Further, in step S4, the topic inference and feature expansion performed on the test set specifically comprise:
S42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking its λ highest-probability feature words w = {w_1, w_2, …, w_λ}, and adding these words to the test text's features as expansion words.
Further, the step S5 is specifically:
S51, representing the expanded text obtained in step S41 with a vector space model: document d is treated as an n-dimensional vector d = (ε_1, ε_2, …, ε_n), where ε_i is the weight of the i-th word, calculated with TF-IDF:

ε_i = tf_{ij} · log( M / df_i )

where tf_{ij} is the frequency of the feature word in the text, df_i is the number of texts in the corpus that contain the feature word, and M is the total number of texts in the corpus;
S52, classifying the texts with the LIBSVM tool; each document is converted into the data format label 1:value 2:value …, where label is the category identifier, the integers 1, 2, … are feature indices, and each value is the corresponding TF-IDF weight;
S53, recording the training-set category labels Y = {y1, y2, …, yn}, training the model on the training set, and then performing classification prediction on the test set.
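Steps S51 and S52 can be sketched as follows; the natural-log base in the TF-IDF weight and the four-decimal LIBSVM formatting are assumptions, since the patent does not fix them:

```python
import math

def tfidf(tf_ij, df_i, M):
    """epsilon_i = tf_ij * log(M / df_i): weight of a feature word that
    occurs tf_ij times in the text and in df_i of the M corpus texts
    (step S51)."""
    return tf_ij * math.log(M / df_i)

def to_libsvm_line(label, weights):
    """Step S52: one LIBSVM input line `label index:value ...`, where the
    integers are 1-based feature indices and the values are TF-IDF
    weights; zero weights are omitted, as LIBSVM expects sparse input."""
    pairs = " ".join(f"{i + 1}:{w:.4f}" for i, w in enumerate(weights) if w != 0)
    return f"{label} {pairs}"
```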
The invention also provides a microblog text classification system based on high-quality topic expansion, comprising:
a text acquisition unit for acquiring collected microblog text data and constructing a training set and a test set;
a text data preprocessing unit for preprocessing the original text samples and selecting features, comprising:
a Chinese word segmentation module for splitting complete sentences into words and removing stop words from the text,
a Chinese stop-word list module for deleting feature words that appear in the stop-word list and removing punctuation marks,
a dictionary building module for sorting and aggregating the feature words in the text;
an LDA model training unit for obtaining the document-topic and topic-word distributions from the training-set data, comprising:
a data processing module for computing the quality coefficient from the topic-word distribution data and selecting the high-quality topics with a set threshold;
the LDA model training unit is further used to expand the training-set and test-set texts with the high-quality feature words;
and a text classification unit for classifying the expanded training set with the LIBSVM tool and classifying the test-set data to generate the classification result.
Compared with the prior art, the invention has the following advantages:
The microblog text classification method based on high-quality topic expansion extracts high-quality topics effectively, and after microblog feature expansion it overcomes the poor classification caused by sparse features. Compared with the prior art, it achieves higher accuracy, is better suited to feature expansion in short-text classification such as microblogs, and classifies microblogs effectively. It resolves the inaccurate text feature expansion caused by topic mixing when a topic model is used to expand microblog text features.
For these reasons, the method can be widely applied in the technical field of text classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a microblog text classification method based on high-quality topic expansion.
FIG. 2 is an LDA probability model of the microblog text classification method based on the high-quality topic expansion.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1, the invention provides a microblog text classification method based on high-quality topic expansion, which comprises the following steps:
S1, preprocessing the microblog texts and selecting features, and constructing a training set and a test set from the preprocessed texts;
S2, using the preprocessed training-set data as input to an LDA model to obtain the topic probability distribution and topic-word probability distribution of the training-set data;
S3, applying information entropy to the topic-word probability distributions to calculate topic entropy, and calculating the relative entropy and average similarity of topics, thereby computing a quality coefficient for each topic and setting a threshold to screen out the high-quality topics;
S4, performing topic division on the training set and the test set respectively: through the LDA topic distributions, assigning each text to the topic with the maximum probability among the high-quality topics, and adding that topic's subject words as expansion words to the text features of the training set and the test set respectively;
S5, representing the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result.
The preprocessing and feature selection of the microblog texts comprise the following steps:
S11, performing Chinese word segmentation on the texts, splitting complete sentences into words to obtain the feature set of the text corpus;
S12, removing stop words such as common conjunctions and pronouns from the segmented texts with a Chinese stop-word list, deleting any feature word that appears in the stop-word list, and then removing punctuation marks;
S13, dividing the preprocessed texts by category to build a dictionary, counting the word statistics of each category, sorting the feature words by total number of occurrences in descending order, selecting the top n words of each category as its feature words, and aggregating them as the general features of the category.
As shown in fig. 2, the topic model parameters α and β and the topic number K are set, and parameter estimation is performed with Gibbs sampling.
S21, setting the topic model parameter α and the topic number K, and drawing the topic distribution (doc-topic matrix) θ_m of each microblog from the Dirichlet distribution with parameter α: θ_m ~ Dir(α), m ∈ [1, M], where θ_m is the topic probability distribution of document m, estimated as

θ_{m,k} = (n_{m,k} + α) / (∑_{k'=1}^{K} n_{m,k'} + K·α)

where n_{m,k} is the number of words assigned to topic k in the m-th microblog.
S22, setting the topic model parameter β and the topic number K, and drawing the word distribution (topic-word matrix) φ_k of each topic from the Dirichlet distribution with parameter β: φ_k ~ Dir(β), where φ_k is the word probability distribution of topic k, estimated as

φ_{k,v} = (n_{k,v} + β) / (∑_{v'=1}^{V} n_{k,v'} + V·β)

where n_{k,v} is the number of occurrences of word v under topic k and V is the vocabulary size.
S31, applying information entropy to the topic-word distributions to compute the topic entropy and thereby identify the high-quality topics. The topic information entropy TE is

TE(k) = -∑_w P(w|k) · ln P(w|k)

where P(w|k) is the probability of word w appearing under topic k. The smaller TE is, the more concentrated the distribution: a small number of feature words appear with high probability and the remaining words with low probability, so the topic is strongly representative and contains little topic noise.
S32, calculating the relative entropy of topics, an index that measures the difference between probability distributions:

KL(P‖Q) = ∑_x P(x) · ln( P(x) / Q(x) )

where P and Q are the distributions being compared; the relative entropy is zero when the two distributions are identical and increases as the difference between them increases;
S33, calculating the average similarity of topics. The JS distance, built from the relative entropy, measures the similarity between topics:

JS(P, Q) = (1/2)·KL(P ‖ (P+Q)/2) + (1/2)·KL(Q ‖ (P+Q)/2)

The average similarity measures the independence of a distribution relative to the other distributions:

AS(k) = (1/(K-1)) · ∑_{j≠k} JS(φ_k, φ_j)

where K is the total number of topics;
s34, screening high-quality subjects
Calculating a theme high-quality coefficient according to the theme entropy and the average similarity, wherein the calculation method specifically comprises the following steps:
Figure BDA0001797896030000085
and if the theme high-quality coefficient meets G (k) > mu and mu is a threshold value, judging that the theme belongs to a high-quality theme as an expansion alternative, otherwise, judging that the theme is not the high-quality theme, and further obtaining a high-quality theme set S.
S41, for each microblog, according to the topic distribution given by the topic model trained on the training set, selecting the topic with the maximum probability among the high-quality topics, taking from the corresponding topic words the λ feature words with the highest probabilities, w = {w_1, w_2, …, w_λ}, and adding the words w to the training text's features as expansion words; an expansion word not already present in the original document is merged into the document.
S42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking its λ highest-probability feature words w = {w_1, w_2, …, w_λ}, and adding these words to the test text's features as expansion words.
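A minimal sketch of the expansion in steps S41/S42, assuming the document's topic distribution θ_m, the topic-word distributions φ_k, and the high-quality topic set are already available (the function and argument names are illustrative):

```python
def expand_text(tokens, theta_m, quality_topics, phis, vocab, lam):
    """Pick the highest-probability topic among the high-quality set,
    take its lam most probable words, and append those not already
    present to the text's features (steps S41/S42)."""
    if not quality_topics:
        return list(tokens)
    # topic with the maximum probability within the high-quality set
    best = max(quality_topics, key=lambda k: theta_m[k])
    # rank vocabulary indices by probability under the chosen topic
    ranked = sorted(range(len(vocab)), key=lambda v: phis[best][v], reverse=True)
    expansion = [vocab[v] for v in ranked[:lam]]
    return list(tokens) + [w for w in expansion if w not in tokens]
```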
S51, representing the expanded text obtained in step S41 with a vector space model: document d is treated as an n-dimensional vector d = (ε_1, ε_2, …, ε_n), where ε_i is the weight of the i-th word, calculated with TF-IDF:

ε_i = tf_{ij} · log( M / df_i )

where tf_{ij} is the frequency of the feature word in the text, df_i is the number of texts in the corpus that contain the feature word, and M is the total number of texts in the corpus;
S52, classifying the texts with the LIBSVM tool; each document is converted into the data format label 1:value 2:value …, where label is the category identifier, the integers 1, 2, … are feature indices, and each value is the corresponding TF-IDF weight;
S53, recording the training-set category labels Y = {y1, y2, …, yn}, training the model on the training set, and then performing classification prediction on the test set.
Experimental verification shows that, compared with the single SVM method and with LDA combined with SVM, the microblog text classification method based on high-quality topic expansion markedly improves accuracy, as shown in Table 1:
TABLE 1
Method                         Recall    Accuracy
SVM                            0.754     0.760
LDA + SVM                      0.831     0.822
High-quality topics + SVM      0.863     0.857
A microblog text classification system based on high-quality topic expansion comprises:
a text acquisition unit for acquiring collected microblog text data and constructing a training set and a test set;
a text data preprocessing unit for preprocessing the original text samples and selecting features, comprising:
a Chinese word segmentation module for splitting complete sentences into words and removing stop words from the text,
a Chinese stop-word list module for deleting feature words that appear in the stop-word list and removing punctuation marks,
a dictionary building module for sorting and aggregating the feature words in the text;
an LDA model training unit for obtaining the document-topic and topic-word distributions from the training-set data, comprising:
a data processing module for computing the quality coefficient from the topic-word distribution data and selecting the high-quality topics with a set threshold;
the LDA model training unit is further used to expand the training-set and test-set texts with the high-quality feature words;
and a text classification unit for classifying the expanded training set with the LIBSVM tool and classifying the test-set data to generate the classification result.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A microblog text classification method based on high-quality theme expansion is characterized by comprising the following steps of:
s1, performing data preprocessing on the microblog text, selecting characteristics, and constructing a training set and a test set through the preprocessed text;
s2, taking the preprocessed training set data as the input of an LDA model to obtain the probability distribution of the subjects of the training set data and the probability distribution of the subject words;
s3, applying information entropy to the topic-word probability distribution to calculate the topic entropy, and calculating the relative entropy and the average similarity of the topics, thereby computing a quality coefficient for each topic; a threshold is set to screen out the high-quality topics;
s4, performing topic division on the training set and the test set respectively: through the LDA topic distribution, for each text the topic words of the high-quality topic with the maximum probability are selected and added as expansion words to the text features of the training set and the test set respectively;
s5, performing text representation on the expanded texts with a vector space model, calculating the weight of each feature word with TF-IDF, converting the training and test documents into vectors, selecting useful features, training an SVM classifier on the training set, and performing classification prediction on the test set to generate the classification result;
in step S2, the probability distribution of the topics of the training set data is obtained by:
s21, setting a topic model parameter α and a topic number K, and drawing the document-topic (doc-topic) distribution θm of each microblog from a Dirichlet distribution with parameter α: θm~Dir(α), m∈[1,M], where θm represents the topic probability distribution of document m, estimated as

θm,k=(nm,k+α)/∑k'=1..K(nm,k'+α)

wherein nm,k represents the number of words of the m-th microblog assigned to the k-th topic;
obtaining the probability distribution of the subject term of the training set data by the following steps:
s22, setting a topic model parameter β and a topic number K, and drawing the word distribution (topic-word) φk of each topic from a Dirichlet distribution with parameter β: φk~Dir(β), k∈[1,K], where φk represents the probability distribution of words under topic k, estimated as

φk,v=(nk,v+β)/∑v'=1..V(nk,v'+β)

wherein nk,v represents the number of times the word v appears under the topic k and V is the vocabulary size;
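The smoothed doc-topic and topic-word estimates of steps S21 and S22 can be sketched in plain Python. This is a minimal sketch: the count matrices are assumed to come from Gibbs sampling of the LDA model (not shown), and the function names are illustrative, not from the patent:

```python
def doc_topic_dist(n_mk, alpha):
    """theta_m: smoothed topic distribution of one document.
    n_mk[k] = number of words in the document assigned to topic k."""
    K = len(n_mk)
    denom = sum(n_mk) + K * alpha
    return [(n_mk[k] + alpha) / denom for k in range(K)]

def topic_word_dist(n_kv, beta):
    """phi_k: smoothed word distribution of one topic.
    n_kv[v] = number of times word v was assigned to the topic."""
    V = len(n_kv)
    denom = sum(n_kv) + V * beta
    return [(n_kv[v] + beta) / denom for v in range(V)]
```

The Dirichlet priors α and β act as pseudo-counts, so every topic and every word keeps a small nonzero probability even when its raw count is zero.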
the step S3 specifically includes:
s31, calculating a topic information entropy TE, specifically:
TE(k)=-∑w P(w|k)*lnP(w|k)
wherein P (w | k) represents the probability of the word w appearing under the topic k;
s32, calculating the relative entropy (KL divergence) between topics, specifically:

D(P‖Q)=∑x P(x)*ln(P(x)/Q(x))

wherein P and Q are the two distributions being compared; the relative entropy is zero when the two distributions are identical, and increases as the difference between them increases;
s33, calculating the average similarity of the topics, specifically:

the JS distance between topics, used to measure topic similarity, is calculated from the relative entropy:

JS(P,Q)=(1/2)*D(P‖M)+(1/2)*D(Q‖M), where M=(P+Q)/2

the average similarity measures the independence of one distribution relative to the other distributions, and is calculated as:

AvgSim(k)=(1/(K-1))*∑j≠k JS(φk,φj)

wherein K represents the total number of topics;
s34, screening high-quality subjects
the topic quality coefficient is calculated from the topic entropy and the average similarity, specifically:

G(k)=AvgSim(k)/TE(k)

if the quality coefficient satisfies G(k)>μ, where μ is a threshold, the topic is judged to be a high-quality topic and retained as an expansion candidate; otherwise it is not a high-quality topic; the retained topics form the high-quality topic set S.
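Steps S31 to S34 can be sketched as follows. The exact form of the quality coefficient G(k) is not recoverable from the claim text (its formula image is lost), so the ratio AvgSim(k)/TE(k) used below is an assumption that rewards focused topics (low entropy) that sit far from the others (high average JS distance):

```python
import math

def topic_entropy(phi):
    """TE(k) = -sum_w P(w|k) * ln P(w|k)."""
    return -sum(p * math.log(p) for p in phi if p > 0)

def kl(p, q):
    """Relative entropy D(P||Q); zero when P and Q are identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS distance built from the relative entropy via M = (P+Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def quality_topics(phis, mu):
    """Screen high-quality topics: keep topic k when G(k) > mu.
    G(k) = AvgSim(k) / TE(k) is an assumed combination of the two
    quantities named in the claim, not the patent's exact formula."""
    K = len(phis)
    S = []
    for k in range(K):
        avg_sim = sum(js(phis[k], phis[j]) for j in range(K) if j != k) / (K - 1)
        if avg_sim / topic_entropy(phis[k]) > mu:
            S.append(k)
    return S
```

With this reading, a sharp topic far from the rest scores well, while a near-uniform topic (high entropy, close in JS distance to everything) falls below the threshold.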
2. The microblog text classification method based on the high-quality theme extension according to claim 1, wherein the step of performing data preprocessing and feature selection on microblog texts comprises the following steps:
s11, carrying out Chinese word segmentation pretreatment on the text, and dividing the complete sentence into words to obtain a text corpus characteristic set;
s12, removing stop words such as common conjunctions and pronouns from the segmented text: the preprocessing is performed with a Chinese stop-word list, a feature word is deleted if it appears in the stop-word list, and punctuation marks are then removed;
and s13, dividing the preprocessed texts by category to construct a dictionary, counting word statistics within the different categories, sorting the feature words in descending order of total occurrence count, selecting the top n words of each category as its feature words, and summarizing them as the general features of the category.
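The preprocessing of steps S12 and S13 can be sketched as follows, assuming the texts have already been segmented into words (step S11; a Chinese tokenizer such as jieba is typical but not named in the patent). Function names are illustrative:

```python
from collections import Counter

def preprocess(tokens, stopwords):
    """S12: drop stop words, then punctuation, from a segmented text."""
    return [t for t in tokens if t not in stopwords and t.isalnum()]

def category_features(texts_by_category, n):
    """S13: per category, keep the n feature words with the highest
    total occurrence count as the category's general features."""
    features = {}
    for cat, texts in texts_by_category.items():
        counts = Counter(w for text in texts for w in text)
        features[cat] = [w for w, _ in counts.most_common(n)]
    return features
```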
3. The microblog text classification method based on the high-quality theme extension according to claim 1, wherein in the step S4, topic division is specifically performed on the training set as follows:
s41, using the topic distribution obtained from the topic model trained on the training set, selecting for each microblog the topic with the maximum probability among the high-quality topics; the λ feature words with the highest probability under that topic, arranged as w={w1,w2,…wλ}, are added as expansion words to the text features of the training set; if an expansion word does not exist in the original document, it is merged into the document.
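Step S41 (and its test-set counterpart S42) can be sketched as follows. Names are illustrative: theta is the document's topic distribution, quality_set is the high-quality topic set S, and topic_top_words maps each topic to its words sorted by descending probability:

```python
def expand_text(tokens, theta, quality_set, topic_top_words, lam):
    """Among the high-quality topics, pick the one with the maximum
    probability for this text, take its lam most probable words, and
    merge those not already present into the text features."""
    if not quality_set:
        return list(tokens)
    best = max(quality_set, key=lambda k: theta[k])
    expansion = topic_top_words[best][:lam]
    return list(tokens) + [w for w in expansion if w not in tokens]
```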
4. The microblog text classification method based on high-quality topic expansion according to claim 3, wherein in the step S4, the topic inference and feature expansion of the test set specifically comprise:
s42, performing topic inference on the test set with the topic model trained on the training set to obtain the document-topic distribution matrix of the test texts; for each test text, selecting the topic with the maximum probability within the high-quality topic set S, taking the λ feature words w={w1,w2,…wλ} with the highest probability under that topic, and adding them as expansion words to the text features of the test set.
5. The microblog text classification method based on the high-quality subject expansion according to claim 1, wherein the step S5 specifically comprises:
s51, performing text representation on the expanded text obtained in step S41 using the vector space model: document d is regarded as an n-dimensional vector d=(ε1,ε2,…,εn) in the vector space, where εi represents the weight of the i-th word; the weight is calculated with TF-IDF, specifically:

εi=tfij*log(M/dfi)

wherein tfij refers to the frequency of occurrence of the feature word in the text, dfi represents the number of texts in the corpus containing the feature word, and M is the total number of texts in the corpus;
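The TF-IDF weighting of step S51 can be sketched as follows. This is a minimal sketch: many TF-IDF variants add normalization or smoothing, and the patent's exact variant is not recoverable from the claim text:

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, df, M):
    """epsilon_i = tf_i * log(M / df_i) for each vocabulary word.
    df[w] = number of corpus texts containing w; M = corpus size."""
    tf = Counter(tokens)
    return [tf[w] * math.log(M / df[w]) if w in df and df[w] > 0 else 0.0
            for w in vocab]
```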
s52, performing text classification with the LIBSVM tool; each document is converted into the format label 1:value 2:value …, wherein label is the category identifier, 1 and 2 are feature indices, and each value is the TF-IDF weight of the corresponding feature;
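The LIBSVM conversion of step S52 can be sketched with a small illustrative helper (sparse format, 1-based feature indices, zero weights omitted):

```python
def to_libsvm_line(label, weights):
    """Serialize one document as 'label idx:weight ...', the sparse
    LIBSVM input format with 1-based feature indices."""
    parts = [str(label)]
    parts += [f"{i}:{w:g}" for i, w in enumerate(weights, start=1) if w != 0.0]
    return " ".join(parts)
```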
and s53, recording the training set category labels Y={y1,y2,…,yn}, training the model on the training set, and then performing classification prediction on the test set.
6. A microblog text classification system implementing the microblog text classification method based on the high-quality subject expansion according to any one of claims 1 to 5, characterized by comprising:
a text acquisition unit for acquiring self-collected microblog text data and constructing a training set and a test set;
the text data preprocessing unit is used for preprocessing an original text sample and selecting features, and comprises the following steps:
a Chinese word segmentation module for dividing the complete sentence into words and eliminating stop words in the text,
a Chinese inactive word list module for deleting the feature words in the inactive word list appearing in the text and eliminating punctuation marks,
the dictionary building module is used for sequencing the characteristic words in the text and summarizing the characteristic words;
the LDA model training unit is used for obtaining document theme distribution and theme word distribution conditions through training set data, and comprises:
the data processing module is used for calculating a high-quality coefficient through the distribution data of the theme words and dividing a high-quality theme through a set threshold;
the LDA model training unit is also used for taking the high-quality characteristic words as text extension of a training set and text extension of a test set;
and the text classification unit is used for performing text classification on the training set after the text expansion through an LIBSVM tool, and classifying the data to be tested of the test set to generate a classification result.
7. A storage medium comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 5.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to perform the method of any one of claims 1 to 5.
CN201811064231.3A 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension Active CN109344252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811064231.3A CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Publications (2)

Publication Number Publication Date
CN109344252A CN109344252A (en) 2019-02-15
CN109344252B true CN109344252B (en) 2021-12-07

Family

ID=65304880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064231.3A Active CN109344252B (en) 2018-09-12 2018-09-12 Microblog text classification method and system based on high-quality theme extension

Country Status (1)

Country Link
CN (1) CN109344252B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium
CN113177409A (en) * 2021-05-06 2021-07-27 上海慧洲信息技术有限公司 Intelligent sensitive word recognition system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN104899273B (en) * 2015-05-27 2017-08-25 东南大学 A kind of Web Personalization method based on topic and relative entropy
CN108121736B (en) * 2016-11-30 2021-06-08 北京搜狗科技发展有限公司 Method and device for establishing subject term determination model and electronic equipment
CN106991127B (en) * 2017-03-06 2020-01-10 西安交通大学 Knowledge subject short text hierarchical classification method based on topological feature expansion
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy

Also Published As

Publication number Publication date
CN109344252A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
US11093854B2 (en) Emoji recommendation method and device thereof
CN105183833B (en) Microblog text recommendation method and device based on user model
EP2486470B1 (en) System and method for inputting text into electronic devices
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN105022754B (en) Object classification method and device based on social network
CN105183717B (en) A kind of OSN user feeling analysis methods based on random forest and customer relationship
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN111061957A (en) Article similarity recommendation method and device
CN109271639B (en) Hot event discovery method and device
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN109086375A (en) A kind of short text subject extraction method based on term vector enhancing
CN107885717B (en) Keyword extraction method and device
CN113836938A (en) Text similarity calculation method and device, storage medium and electronic device
CN115186665B (en) Semantic-based unsupervised academic keyword extraction method and equipment
CN109344252B (en) Microblog text classification method and system based on high-quality theme extension
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN111914554A (en) Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
CN107665222B (en) Keyword expansion method and device
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN111339778A (en) Text processing method, device, storage medium and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant