CN106991171A

CN106991171A - Topic discovery method based on an intelligent campus information service platform

Info

Publication number: CN106991171A
Application number: CN201710216639.7A
Authority: CN
Inventors: 王凤领
Original assignee: Hezhou University
Current assignee: Hezhou University
Priority date: 2017-03-25
Filing date: 2017-03-25
Publication date: 2017-07-28

Abstract

The present invention provides a topic discovery method based on an intelligent campus information service platform. The method concerns techniques for discovering hot topics on campus. It analyzes conventional text clustering algorithms and text representation models, uses a word segmentation system to segment message text and extract keywords, and proposes a knowledge representation model for message text. Building on the vector space model, it uses message word-frequency statistics to determine the initial cluster centers and to refine the Hook Jeeves algorithm. Compared with the related art, the method can obtain accurate public-opinion patterns and important elements, so that public opinion is formed properly. The intelligent topic clustering process is handled faster and better, and clustering precision remains high when the number of reported messages is large.

Description

Topic discovery method based on an intelligent campus information service platform

Technical field

The present invention relates to the field of hot topic discovery, and more particularly to a topic discovery method based on an intelligent campus information service platform.

Background

The continuous development of computer networks has increasingly enriched campus life. Network information has become an important part of campus life, and the Internet has become an important place for students to obtain information and communicate.

How to effectively handle the mass of data on the network, extract the hot topics in it, or obtain the information one wants has long troubled network users. Hot topic discovery can find, from various information resources, the topics that attract wide attention in each field within a given period, which helps students obtain important current information and keep up with it quickly.

It is therefore necessary to provide a topic discovery method for an intelligent campus information service platform that implements the above technical scheme.

Summary of the invention

The object of the invention is to provide a topic discovery method based on an intelligent campus information service platform, to meet users' need to discover sudden hot topics in network forums in real time.

The present invention provides a topic discovery method based on an intelligent campus information service platform, including:

Step 1: an intelligent campus information service platform is set up, and campus-themed messages on the Internet are collected to form a message database;

Step 2: text preprocessing is performed on the message text in the database; the preprocessing is word segmentation, including semantic disambiguation, unregistered-word extraction, keyword extraction and stop-word handling;

Step 3: feature extraction is performed on the preprocessed text. Text features are extracted with independent evaluation methods, which include information gain, the X² (chi-square) statistic and the document frequency algorithm. Information gain classifies text clusters by calculating the weights of feature items; the feature words carrying the most category information are obtained by the following formula (1):

$$ IG(t) = -\sum_{i=1}^{n} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{n} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \tag{1} $$

where P(c_i) is the probability that a text in the collection belongs to category c_i, P(t) is the probability that a text in the collection contains feature word t, P(c_i | t) is the probability that a text containing feature word t belongs to the predefined category c_i, P(c_i | t̄) is the probability that a text belongs to category c_i when feature word t does not appear in it, and n is the number of text categories;

The X² statistic assesses the importance of a feature item by quantifying the amount of text information the feature item carries. It is computed by the following formula (2):

$$ X^2(t_i, C_j) = \frac{N \times (A \times D - C \times B)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \tag{2} $$

where N is the number of texts extracted, C_j is a cluster, A is the number of texts in C_j that contain feature item t_i, B is the number of texts outside C_j that contain t_i, C is the number of texts in C_j that do not contain t_i, and D is the number of texts outside C_j that do not contain t_i;

The document frequency algorithm assesses a feature by counting the number of documents that contain it;

Step 4: the extracted feature words are represented as a knowledge representation model;

Step 5: the computer operates on the text knowledge representation model with a clustering algorithm; texts with the same subject are grouped together to form a topic library, and this topic library is the hot topic library.

Compared with the related art, the topic discovery method based on an intelligent campus information service platform provided by the present invention can obtain accurate public-opinion patterns and important elements, so that public opinion is formed properly. The intelligent topic clustering process is handled faster and better, and clustering precision remains high when the number of reported messages is large.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the campus hot topic discovery module of the present invention;

Fig. 2 is the topic discovery flow chart of the present invention;

Fig. 3 is the flow chart of the text preprocessing in Fig. 2;

Fig. 4 is the flow chart of the text representation model in Fig. 2;

Fig. 5 is the (C_Det)_Norm test chart of the clustering algorithm of the present invention.

Detailed description

Please refer to Fig. 1 and Fig. 2 together, where Fig. 1 is a structural schematic diagram of the campus hot topic discovery module of the present invention and Fig. 2 is the topic discovery flow chart of the present invention. The topic discovery method based on an intelligent campus information service platform provided by the present invention includes:

Step 1: an intelligent campus information service platform is set up, and campus-themed messages on the Internet are collected to form a message database.

Step 2: text preprocessing is performed on the message text in the database. The preprocessing specifically includes semantic disambiguation, unregistered-word extraction, keyword extraction and stop-word handling. Please refer to Fig. 3, which is the flow chart of the text preprocessing in Fig. 2. The campus hot topic discovery module uses the ICTCLAS word segmentation system; the coarse segmentation result is filtered with a given stop-word list, modal particles, auxiliary words and conjunctions are deleted, and a Chinese dictionary is finally output.

The Chinese word segmentation in step 2 uses any one or a combination of the statistical segmentation method, the N-shortest-path method and the string-matching segmentation method.

The statistical segmentation method treats characters that frequently occur next to each other in Chinese text as candidate compound words: it counts how often adjacent characters co-occur in the text to estimate the probability that they form a word. Before counting, a threshold is set; if the frequency of a character combination exceeds the threshold, the two adjacent characters are combined into one word.
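The thresholded adjacency count described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the three-string sample corpus and the threshold value are all hypothetical, and only two-character combinations are considered.

```python
from collections import Counter

def adjacent_pair_counts(texts):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for text in texts:
        for left, right in zip(text, text[1:]):
            counts[left + right] += 1
    return counts

def merge_frequent_pairs(text, counts, threshold):
    """Greedily merge adjacent characters whose pair count exceeds the threshold."""
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and counts[pair] > threshold:
            tokens.append(pair)   # frequent combination: treat as one word
            i += 2
        else:
            tokens.append(text[i])  # otherwise keep the single character
            i += 1
    return tokens

# Hypothetical sample corpus and threshold, for illustration only.
corpus = ["校园热点", "校园生活", "热点话题"]
counts = adjacent_pair_counts(corpus)
tokens = merge_frequent_pairs("校园热点", counts, threshold=1)
```

With this sample, the pairs "校园" and "热点" each occur twice, so both exceed the threshold of 1 and are merged into words.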

At present, the main statistics-based segmentation models include the hidden Markov model, the maximum probability method and the noisy-channel model. A statistics-based method must enumerate all possible words formed by neighbouring characters, so segmentation takes relatively long; it is therefore used in combination with other segmentation methods rather than alone. Statistical segmentation can produce results that accurately reflect the semantics of the text.

The N-shortest-path method is based on the idea of path splitting. Each word in the Chinese text that appears in the lexicon is regarded as an edge of a path graph, and each edge is given a weight, its edge length. The text is split along the N shortest paths by edge length, and the result set is the set of shortest paths through the graph. When paths of equal length are encountered during splitting, they are inserted into the path set together. After the path splitting, the segmentation result of the Chinese text is obtained.

The string-matching segmentation method, also called mechanical segmentation, is a relatively simple method. It is easy to implement but poor at recognizing new words. During string comparison, a character string that matches a word in the vocabulary is determined to be a word. The vocabulary can also be extended with domain nouns and proper names so that these are segmented as single words.
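One common form of string-matching segmentation is forward maximum matching, sketched below in Python. The text does not say which matching strategy is used, so the choice of forward maximum matching, the function name `forward_max_match`, the `max_word_len` parameter and the sample vocabulary are assumptions made for illustration.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Segment text by repeatedly taking the longest vocabulary word at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical vocabulary, for illustration only.
vocabulary = {"智慧", "校园", "智慧校园", "信息", "服务", "平台"}
segmented = forward_max_match("智慧校园信息服务平台", vocabulary)
```

Because the longest match wins, "智慧校园" is kept as one word rather than split into "智慧" and "校园". A string that is absent from the vocabulary falls back to single characters, which illustrates why the method handles new words poorly.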

Step 3: feature extraction is performed on the preprocessed text. Text features are extracted with independent evaluation methods, which include information gain, the X² (chi-square) statistic and the document frequency algorithm.

Information gain classifies text clusters by calculating the weights of feature items. How much category information a feature word carries is judged by the size of its information gain value, so that the feature words carrying the most category information are selected. It is calculated by the following formula (1):

$$ IG(t) = -\sum_{i=1}^{n} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{n} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \tag{1} $$
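As a rough illustration of formula (1), the Python sketch below estimates the probabilities from document counts. The function name `information_gain`, the toy documents and labels, and the use of the natural logarithm are assumptions; the text does not fix the logarithm base or the estimation procedure.

```python
import math

def information_gain(docs, labels, term):
    """Information gain of `term` following formula (1); each doc is a set of words."""
    n = len(docs)
    classes = sorted(set(labels))
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def plogp_sum(subset):
        # Sum over classes of P(c | subset) * log P(c | subset); 0 * log 0 counts as 0.
        total = 0.0
        for c in classes:
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log(p)
        return total

    p_t = len(with_t) / n  # P(t): share of documents containing the term
    return (-plogp_sum(labels)
            + p_t * plogp_sum(with_t)
            + (1 - p_t) * plogp_sum(without_t))

# Toy data, for illustration only.
docs = [{"exam", "library"}, {"exam", "grade"}, {"canteen", "menu"}, {"canteen", "price"}]
labels = ["study", "study", "life", "life"]
ig_exam = information_gain(docs, labels, "exam")
ig_menu = information_gain(docs, labels, "menu")
```

In this toy set "exam" separates the two categories perfectly, so its gain equals the full category entropy, while "menu" appears in only one document and scores lower.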

The X² statistic assesses the importance of a feature item by quantifying the amount of text information the feature item carries. A large statistic indicates that the feature represents the theme of the text content well. It is computed by the following formula (2):

$$ X^2(t_i, C_j) = \frac{N \times (A \times D - C \times B)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \tag{2} $$
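Formula (2) can be evaluated directly from a 2×2 contingency count, as in the Python sketch below. It assumes the usual reading of the four counts (A: in the cluster and containing the term, B: outside the cluster and containing it, C: in the cluster without it, D: outside the cluster without it); the function name and toy data are hypothetical.

```python
def chi_square(docs, labels, term, cluster):
    """X^2 statistic of `term` for `cluster` following formula (2); each doc is a set of words."""
    n = len(docs)
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_cluster = label == cluster
        if has_term and in_cluster:
            a += 1      # A: in the cluster, contains the term
        elif has_term:
            b += 1      # B: outside the cluster, contains the term
        elif in_cluster:
            c += 1      # C: in the cluster, lacks the term
        else:
            d += 1      # D: outside the cluster, lacks the term
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0      # degenerate table: the term or cluster covers all or no documents
    return n * (a * d - c * b) ** 2 / denominator

# Toy data, for illustration only.
docs = [{"exam", "library"}, {"exam", "grade"}, {"canteen", "menu"}, {"canteen", "price"}]
labels = ["study", "study", "life", "life"]
x2_exam = chi_square(docs, labels, "exam", "study")
x2_menu = chi_square(docs, labels, "menu", "life")
```

Here "exam" occurs in every "study" document and nowhere else, so it receives the largest possible value for four documents, while "menu" covers only half of the "life" cluster and scores lower.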

The document frequency algorithm is one of the most basic feature evaluation methods. It assesses a feature by counting the number of documents that contain it. A feature item that is contained in most documents, or only in a few texts, has a value that is too high or too low, and in either case it is excluded.
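A document-frequency filter of this kind might look like the Python sketch below. The bounds `min_df` and `max_df`, the function names and the sample documents are assumptions chosen for illustration; the text does not give concrete cut-off values.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return df

def select_by_df(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if min_df <= count <= max_df)

# Toy documents, for illustration only.
docs = [
    ["campus", "exam", "library"],
    ["campus", "exam"],
    ["campus", "canteen"],
    ["campus", "exam", "typo"],
]
kept = select_by_df(docs, min_df=2, max_df=3)
```

In this sample "campus" appears in every document and is dropped as too common, the terms that appear once are dropped as too rare, and only "exam" is kept.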

Step 4: the extracted feature words are represented as a knowledge representation model. Please refer to Fig. 4, which is the flow chart of the text representation model in Fig. 2. The hot topic discovery module represents message text with the knowledge representation model. The steps are as follows: the words obtained after preprocessing are taken as the feature selection samples; the dimension of the text knowledge representation model is reduced by the relevant feature selection rules; the weighted feature vector is calculated from the weights of the selected text features; and the weighted feature vector is stored in the database for subsequent cluster analysis.

The campus hot topic discovery model takes the importance of the campus message subject into account. The ordinary vector space model, however, models only the feature items of the message text, although the message subject is important for presenting campus messages. The campus message knowledge representation model can represent a campus message as PK = (C, id, F₁, wf₁, F₂, wf₂, ..., F_i, wf_i), where C is the column (category) to which the message belongs, id uniquely distinguishes the message, F_i is the value of the i-th field, and wf_i is its corresponding weight, representing its value to the message text.
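The PK tuple can be held in a small record type, as in the Python sketch below. The class name `MessageKnowledge`, its field names and the sample values are hypothetical; only the layout (C, id, F₁, wf₁, ..., F_i, wf_i) comes from the text.

```python
from dataclasses import dataclass, field

@dataclass
class MessageKnowledge:
    """One campus message in the PK = (C, id, F1, wf1, ..., Fi, wfi) layout."""
    category: str                                  # C: column the message belongs to
    message_id: str                                # id: unique identifier of the message
    features: dict = field(default_factory=dict)   # maps each feature F_i to its weight wf_i

    def as_tuple(self):
        """Flatten the record into the (C, id, F1, wf1, ...) form."""
        flat = [self.category, self.message_id]
        for name, weight in self.features.items():
            flat.extend([name, weight])
        return tuple(flat)

# Sample values, for illustration only.
pk = MessageKnowledge("notices", "msg-001", {"exam": 0.8, "library": 0.3})
pk_tuple = pk.as_tuple()
```

Keeping the features in a dictionary makes the weights directly usable as a sparse vector in the later similarity and clustering steps.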

Because text data cannot be processed directly by a computer, the text is first represented in a specified model so that the computer can operate on it with a clustering algorithm. The knowledge representation models include the probabilistic model, the Boolean model, the vector space model and the language model.

The probabilistic model is based on Bayesian theory. Its advantages are that it ranks documents by probability of relevance and that results can be adjusted to user needs to achieve higher accuracy. However, the model imposes a heavy workload on text clustering. It also ignores the meaning of the words in the text, which reduces the accuracy of the text representation.

The Boolean model is a simple way of representing text, based on Boolean algebra and set theory. Each feature is marked 1 or 0 according to whether it is present in the text, and the similarity of two messages is determined by calculating the proportion of text features present in both. The Boolean model has shortcomings: its ability to represent documents is relatively poor, because it discards most of the characteristics of the document itself, so it is often used as a sub-model against which other similarity measures are compared.

The similarity in the vector space model can be calculated from the cos θ value between vectors:

$$ Sim(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^2\right)\left(\sum_{k=1}^{n} w_{2k}^2\right)}} \tag{3} $$

A document is treated as a vector in n-dimensional space. For a given document D(t₁, w₁; t₂, w₂; ...; t_n, w_n), t_i is a feature item of the text and w_i is the weight of that feature item, expressing its importance to the text content. Taking the n feature items as n coordinate axes, w_i is the value on the corresponding axis, i.e. the text is abstracted as a vector in a multidimensional coordinate system. The key steps in building the vector space model are to determine the feature items of the text and to establish the importance of each feature item by calculating its weight.
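The cosine similarity between two weighted documents can be computed as in the Python sketch below. Representing each document as a dictionary from feature item to weight is an assumption made for compactness; the function name and sample weights are hypothetical.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse weight vectors (dicts of term -> weight)."""
    shared_terms = set(d1) & set(d2)  # only shared terms contribute to the dot product
    dot = sum(d1[t] * d2[t] for t in shared_terms)
    norm = math.sqrt(sum(w * w for w in d1.values()) * sum(w * w for w in d2.values()))
    return dot / norm if norm else 0.0  # an all-zero vector has no direction

# Sample weights, for illustration only.
doc_a = {"exam": 3.0, "library": 4.0}
doc_b = {"exam": 3.0}
doc_c = {"canteen": 1.0}
sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Documents with no feature items in common, such as `doc_a` and `doc_c`, have similarity 0, and a document compared with itself has similarity 1.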

The language model is a model based on probability and statistics. Language models generally fall into two categories: one based on the grammar rules of linguistics, the other based on statistics. The statistical approach is the mainstream; it processes a corpus and estimates the probability distributions of the linguistic knowledge in it, thereby capturing the linguistic knowledge contained in the corpus.

Hot topic discovery is the core application of text clustering: text clustering groups messages into topic clusters and obtains new themes from them. The basic idea is to compare each message report with the topic clusters that already exist. If the similarity is higher than a given threshold, the message is inserted into that topic cluster; if the similarity is lower, the report starts a new topic cluster.

The clustering algorithm is any one or a combination of a partition clustering algorithm, a hierarchical clustering algorithm and an incremental clustering algorithm.

The partition clustering algorithm is based on partitioning: it assumes that each text can be assigned to exactly one set, and computes the similarity between the text and each set in order to classify the text into the corresponding set. Intelligent campus hot topic discovery is implemented on the basis of the K-Means algorithm. K-Means clustering pre-selects k cluster centers and iterates to achieve clustering.

The traditional K-Means algorithm selects its k initial cluster centers at random, which strongly affects the clustering result. To solve this problem, a topic-word frequency method is applied before clustering, and k texts that can separate the topics are selected as the initial cluster centers of the algorithm. The specific steps are as follows:

1) The title of each message article is selected from the sample set to form the title set {T₁, T₂, ..., T_n};

2) The n extracted titles are segmented into words, and the frequency of occurrence of each word in the titles is counted;

3) After sorting by topic-word frequency, the k words with the highest frequency are selected as keywords to form the topic feature set {w_t1, w_t2, ..., w_tk};

4) The initial message samples are divided into k groups of documents according to the keyword set, i.e. D_i = {w_i1, w_i2, ..., w_in}, where w_ij is the j-th text containing the feature word w_ti and n is the number of texts containing the feature word w_ti;

5) For each text in D_i, its similarity to each of the remaining texts in D_i is computed and the similarity values are summed. The message with the largest similarity sum is taken as the representative text for the topic word w_ti, giving k representative texts in total;

6) A threshold is set and the similarities between the k representative texts are calculated. If a similarity exceeds the threshold, the two center points are merged. If the similarities between all texts are below the threshold, go to step 9);

7) Take the k₁-th feature word obtained in step 2), then continue from step 4);

8) k representative texts are finally obtained;

9) The k representative texts are used as the initial cluster centers and clustering is performed with the K-Means algorithm.

The texts selected in this way are the k initial center points of the K-Means clustering algorithm, which improves clustering accuracy.
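Steps 1) to 5) of this initialization can be sketched in Python as follows. This is a simplified illustration under stated assumptions: titles are taken to be already segmented into word lists, raw word counts serve as weights, and the merging loop of steps 6) to 8) is left out. The function names and sample titles are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def initial_centers(titles, k):
    """Pick one representative document index for each of the k most frequent title words."""
    word_counts = Counter(word for title in titles for word in title)  # step 2
    vectors = [Counter(title) for title in titles]
    centers = []
    for word, _ in word_counts.most_common(k):                         # step 3
        group = [i for i, title in enumerate(titles) if word in title]  # step 4
        # Step 5: the representative is the text with the largest summed
        # similarity to the other texts in its group.
        best = max(group, key=lambda i: sum(cosine(vectors[i], vectors[j])
                                            for j in group if j != i))
        if best not in centers:
            centers.append(best)
    return centers

# Sample segmented titles, for illustration only.
titles = [
    ["exam", "schedule"],
    ["exam", "schedule", "change"],
    ["exam", "room"],
    ["canteen", "menu"],
    ["canteen", "menu", "price"],
    ["canteen", "queue"],
]
centers = initial_centers(titles, k=2)
```

The returned indices point at the texts to use as starting centers. If a K-Means implementation accepts explicit initial centers, their feature vectors could be passed in place of random seeds.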

The K-Means algorithm requires the number of clusters to be fixed in advance, but in practice this number is hard to determine, and the algorithm cannot handle newly inserted text objects. It therefore needs to be combined, according to actual needs, with one of the other two methods or with other methods.

The hierarchical clustering algorithm divides text categories into appropriate levels, and the levels change as the text types change. By direction of clustering it falls into two classes: top-down splitting and bottom-up merging.

The incremental clustering algorithm is the Single-pass algorithm. It takes the first text as the initial cluster center and compares the other texts with it for similarity. A text whose similarity is higher than the preset value is inserted into the cluster; when the similarity is low, a new cluster center is created automatically. When there is a large sequence of news reports, the order in which they are input has a certain influence on the clustering result.

The incremental clustering algorithm adapts to new text samples and so solves the problem that K-Means cannot handle new text objects. The message texts are input to the algorithm one after another in order of publication time, clusters are formed dynamically during clustering, and the initial number of clusters need not be fixed in advance. It is suited to processing online information and is generally used for online topic detection.
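A minimal Single-pass sketch in Python is shown below. It assumes documents arrive as sparse weight dictionaries in publication order and that a cluster center is the sum of its members' vectors, which points in the same direction as their mean under cosine similarity. The function names, threshold and sample documents are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def single_pass(docs, threshold):
    """Assign each document, in arrival order, to its most similar cluster or start a new one."""
    clusters = []  # each cluster holds a running "center" vector and its member indices
    for index, doc in enumerate(docs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(doc, cluster["center"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best["members"].append(index)
            for term, weight in doc.items():  # fold the document into the center
                best["center"][term] = best["center"].get(term, 0.0) + weight
        else:
            clusters.append({"center": dict(doc), "members": [index]})
    return [cluster["members"] for cluster in clusters]

# Sample documents in arrival order, for illustration only.
docs = [
    {"exam": 1.0, "schedule": 1.0},
    {"canteen": 1.0, "menu": 1.0},
    {"exam": 1.0, "room": 1.0},
    {"canteen": 1.0, "price": 1.0},
]
groups = single_pass(docs, threshold=0.4)
```

No cluster count is supplied; clusters appear as the stream is processed. Changing the order of `docs` can change the result, which is the input-order sensitivity noted above.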

The topic discovery method based on an intelligent campus information service platform provided by the present invention verifies the correctness of the clustering results by comparing the hot topic discovery algorithm with text clustering algorithms, and the following experimental data are analyzed. The test metrics are introduced first.

Assume the total number of message samples tested is n. For a topic i, a of the n samples are reports actually related to topic i, the topic clustering algorithm assigns m messages to topic i, and b of those m messages truly belong to topic i. The probability that the algorithm misses a correct message is then given by formula (4):

The algorithm then wrongly clusters 100 messages into topic i, and the false-detection probability of the algorithm is defined as shown in formula (5).

For topic detection and tracking systems, the detection cost is calculated as shown in formula (6).

In the cost formula, C_Miss is the cost of the algorithm missing a report that truly belongs to topic i, and C_Fa is the cost of assigning to topic i a report that does not belong to it. If the experiment is to assign as many correct messages as possible to topic i, the system will also assign many messages that do not belong to topic i. The experiment therefore assumes that the influence of C_Miss is relatively high and that of C_Fa relatively low, and sets C_Miss = 1.0 and C_Fa = 0.1. P_Target and P_Non-target are coefficients obtained from a large number of past clustering experiments: P_Target = 0.02 and P_Non-target = 0.98. The smaller the value of (C_Det)_Norm, the better the accuracy of the algorithm.
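Formulas (4) to (6) are not reproduced in this text, so the Python sketch below falls back on the definitions commonly used in topic detection and tracking evaluation. Those definitions are an assumption, not taken from the source: P_Miss = (a − b) / a, P_Fa = (m − b) / (n − a), C_Det = C_Miss·P_Miss·P_Target + C_Fa·P_Fa·P_Non-target, normalized by min(C_Miss·P_Target, C_Fa·P_Non-target). The function name and the sample counts are hypothetical; the cost and prior values are the ones stated above.

```python
def normalized_detection_cost(n, a, m, b, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost under the assumed standard TDT definitions.

    n: total messages, a: messages truly on topic i,
    m: messages the algorithm assigned to topic i, b: correct ones among those m.
    """
    p_non_target = 1.0 - p_target
    p_miss = (a - b) / a        # share of on-topic messages that were missed
    p_fa = (m - b) / (n - a)    # share of off-topic messages wrongly assigned
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    return c_det / min(c_miss * p_target, c_fa * p_non_target)

# Hypothetical counts, for illustration only.
cost = normalized_detection_cost(n=1000, a=50, m=60, b=45)
perfect = normalized_detection_cost(n=1000, a=50, m=50, b=50)
```

Under these definitions a perfect detector scores 0, and a detector that assigns nothing to the topic scores 1, so lower values indicate better clustering, in line with the reading of (C_Det)_Norm given above.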

Please refer to Fig. 5, the (C_Det)_Norm test chart of the clustering algorithms of the present invention. The figure is a histogram comparing the (C_Det)_Norm values of three algorithms: the Single-pass algorithm, the K-Means algorithm and the intelligent campus topic clustering algorithm. As the number of samples processed in the experiment increases, the C_Det values of the intelligent campus topic discovery algorithm, the K-Means algorithm and the Single-pass algorithm all increase, which shows that the clustering precision of the algorithms decreases as the number of input samples grows. With 100 test messages, the difference in clustering correctness between the K-Means and Single-pass algorithms is not obvious; when the number of test messages increases to 800, the Single-pass algorithm is clearly more accurate than the K-Means algorithm. This is mainly because K-Means is affected by its initial cluster centers: when many texts are tested, it is difficult to select k suitable centers at random. The intelligent campus hot topic algorithm solves this problem, so its (C_Det)_Norm value is little affected by the sample size.

The foregoing is only an embodiment of the present invention and does not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims

1. A topic discovery method based on an intelligent campus information service platform, characterized by including:

Step 1: an intelligent campus information service platform is set up, and campus-themed messages on the Internet are collected to form a message database;

Step 2: text preprocessing is performed on the message text in the database; the preprocessing is word segmentation, including semantic disambiguation, unregistered-word extraction, keyword extraction and stop-word handling;

Step 3: feature extraction is performed on the preprocessed text; text features are extracted with independent evaluation methods, which include information gain, the X² (chi-square) statistic and the document frequency algorithm; information gain classifies text clusters by calculating the weights of feature items, and the feature words carrying the most category information are obtained by the following formula (1):

$$ IG(t) = -\sum_{i=1}^{n} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{n} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \tag{1} $$

where P(c_i) is the probability that a text in the collection belongs to category c_i, P(t) is the probability that a text in the collection contains feature word t, P(c_i | t) is the probability that a text containing feature word t belongs to the predefined category c_i, P(c_i | t̄) is the probability that a text belongs to category c_i when feature word t does not appear in it, and n is the number of text categories;

The X² statistic assesses the importance of a feature item by quantifying the amount of text information the feature item carries, and is computed by the following formula (2):

$$ X^2(t_i, C_j) = \frac{N \times (A \times D - C \times B)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \tag{2} $$

where N is the number of texts extracted, C_j is a cluster, A is the number of texts in C_j that contain feature item t_i, B is the number of texts outside C_j that contain t_i, C is the number of texts in C_j that do not contain t_i, and D is the number of texts outside C_j that do not contain t_i;

Step 5: the computer operates on the text knowledge representation model with a clustering algorithm; texts with the same subject are grouped together to form a topic library, and this topic library is the hot topic library.

2. The topic discovery method based on an intelligent campus information service platform according to claim 1, characterized in that the Chinese word segmentation in step 2 uses any one or a combination of the statistical segmentation method, the N-shortest-path method and the string-matching segmentation method.

3. The topic discovery method based on an intelligent campus information service platform according to claim 1, characterized in that the knowledge representation model in step 4 includes the probabilistic model, the Boolean model, the vector space model and the language model.

4. The topic discovery method based on an intelligent campus information service platform according to claim 3, characterized in that the similarity in the vector space model can be calculated from the cos θ value between vectors:

$$ Sim(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^2\right)\left(\sum_{k=1}^{n} w_{2k}^2\right)}} \tag{3} $$

A document is treated as a vector in n-dimensional space. For a given document D(t₁, w₁; t₂, w₂; ...; t_n, w_n), t_i is a feature item of the text and w_i is the weight of that feature item, expressing its importance to the text content. Taking the n feature items as n coordinate axes, w_i is the value on the corresponding axis, i.e. the text is abstracted as a vector in a multidimensional coordinate system. The key steps in building the vector space model are to determine the feature items of the text and to establish the importance of each feature item by calculating its weight.

5. The topic discovery method based on an intelligent campus information service platform according to claim 1, characterized in that the clustering algorithm in step 5 is any one or a combination of a partition clustering algorithm, a hierarchical clustering algorithm and an incremental clustering algorithm.

6. The topic discovery method based on an intelligent campus information service platform according to claim 5, characterized in that the partition clustering algorithm is the K-Means algorithm, which pre-selects k cluster centers and iterates to achieve clustering.

7. The topic discovery method based on an intelligent campus information service platform according to claim 5, characterized in that the incremental clustering algorithm is the Single-pass algorithm, which takes the first text as the initial cluster center and compares the other texts with it for similarity; a text whose similarity is higher than the preset value is inserted into the cluster, and when the similarity is low a new cluster center is created automatically.