CN102779190A

CN102779190A - Rapid detection method for hot issues of timing sequence massive network news

Info

Publication number: CN102779190A
Application number: CN2012102293775A
Authority: CN
Inventors: 王厚峰; 彭楠赟
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2012-07-03
Filing date: 2012-07-03
Publication date: 2012-11-14
Anticipated expiration: 2032-07-03
Also published as: CN102779190B

Abstract

The invention provides a rapid method for detecting hot events in time-ordered, massive online news, comprising the following steps: dividing the online news text sequence into a sequence of blocks by time interval; clustering the news texts of the first block with a Dirichlet process to form a set of clusters; decaying and filtering the clustering result of the preceding block and using it as the prior distribution of the subsequent block; clustering each subsequent block with the Dirichlet process; ranking the events represented by the clusters by heat according to their report volume; taking the T highest-ranked clusters as the hot events; and selecting the M features with the highest tf-idf values in each cluster as the hot-event keywords and displaying the hot events. The method greatly improves the efficiency of clustering online news; at the same time, memory usage does not grow linearly with the volume of data, so the method is suitable for large-scale text data analysis.

Description

A rapid method for detecting hot events in time-ordered, massive online news

Technical field

The invention provides a method for discovering hot events in online news. Specifically, it relates to rapidly finding hot events in the massive, time-ordered news texts reported on the network and ranking the events by heat. It belongs to the fields of natural language processing and data mining.

Background technology

With the rapid growth of network technology and the resulting information explosion, people can access the latest and most complete news at any time; on the other hand, the time a reader must spend to obtain key information has grown accordingly. Automatically extracting useful information from massive online news has become an urgent task. Hot-event detection for online news lets people obtain the important information in time-ordered, massive network news and read more efficiently; it can also help government departments monitor online public opinion and emergencies.

At present, many methods use topic models (Topic Model) or the affinity propagation (Affinity Propagation) algorithm for online news hot-spot detection and news recommendation. Both classes of methods, however, require the number of hot spots k in the news to be given in advance and can only handle static data. In practice, the number of major news events occurring each day is not fixed, and news reporting is dynamic and real-time.

Beyond these problems, an event itself goes through a process of emergence, development, and decay, and hot-event discovery should take this natural pattern into account as well.

Summary of the invention

In the present invention, an online news hot event is an event that appears in a time-ordered stream of online news texts and that is continuously and widely reported and closely followed within a particular period. Unless otherwise stated, the time unit in this description is the day, and the time interval is likewise 1 day. The method, however, applies to any time unit.

The purpose of the invention is to provide a new method that rapidly processes massive online news text data, detects the hot events in it, and ranks the events by heat. For time-ordered, massive news texts, the algorithm must be time-efficient, its space complexity must not grow linearly with the amount of news data, and it must also be able to model the emergence, development, and decay of hot events.

The principle of the invention is as follows. A Dirichlet process with a time factor is used to cluster online news. On the one hand, it represents the dynamic evolution of hot news well; on the other hand, it turns the ordinary Dirichlet process into an incremental model whose memory usage does not grow linearly with the data volume, which makes it suitable for processing large-scale network text data. In addition, to further improve time efficiency, the invention proposes a fast inference algorithm based on greedy search to replace Gibbs sampling, which greatly speeds up the algorithm. The discovered hot events (that is, the news text clusters) are then ranked, and the hottest events are extracted.

Several terms are explained first:

- Cluster: each class formed by a clustering method is called a cluster. In the present invention, each cluster represents one possible event.

- Cluster size: the number of elements in a cluster; for text clustering, the size of a cluster is the number of texts it contains.

- Dirichlet process (Dirichlet Process): also known as the Chinese restaurant process (Chinese Restaurant Process); a detailed explanation is available on Wikipedia (http://en.wikipedia.org/wiki/Dirichlet_process).

- tf-idf value: a concept commonly used in information retrieval that measures how well a word (or phrase) characterizes the content of a text. Suppose a word (or phrase) term occurs with frequency tf in a text Text and occurs in df texts of the text collection. If the collection contains Num texts in total, the tf-idf value of term in Text is computed by the following formula (the logarithm is base 10):

\text{tf-idf} = tf \times \log_{10} \frac{Num}{df}
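As a worked illustration of the formula, using made-up values that do not come from the patent (tf = 3, df = 10, Num = 1000):

```latex
\text{tf-idf} = 3 \times \log_{10} \frac{1000}{10} = 3 \times 2 = 6
```

A word that occurs in every text of the collection (df = Num) gets a tf-idf value of 0, whatever its frequency.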

The operation of the invention is illustrated in Fig. 1. In the figure, [f₁ f₂ ...] denotes a feature set; the features are in fact the set of words in the text collection. Each cluster represents one event, and the set of clusters constitutes the hot-event library.

The technical solution provided by the invention is as follows:

A rapid method for detecting hot events in time-ordered, massive online news (see Fig. 2 for the flow), comprising:

A. Clustering the online news texts online using a Dirichlet process with a time factor, in the following three steps:

A1. Divide the online news text sequence into a sequence of blocks by time interval; each block contains the news texts of one time interval (for example, with an interval of 1 day, each block contains one day of news texts).

A2. Cluster the news texts of the first block (for example, the first day) with a Dirichlet process to form a set of clusters.

A3. For each subsequent block, also cluster with the Dirichlet process, using the clustering result of the preceding block; before clustering, however, first decay the preceding block's clustering result and then filter it. The basic idea of decay is that, after the preceding block has been processed, every cluster formed is decayed by a decay factor a: if the size of a cluster is r (that is, it contains r texts), then after decay its size becomes r' = a*r, where a ∈ (0,1). The feature distribution inside the cluster remains unchanged. The basic idea of filtering is to delete clusters whose size is below a threshold t (for example t = 30; other values may be used) and, at the same time, to delete clusters that have been reported on continuously for longer than a given length of time (for example 150 days; other values may be used).

B. Ranking and displaying the hot events, in the following two steps:

B1. For each cluster, compute its average report volume per unit time over its reporting period, then rank the events by heat according to this report volume.

B2. Take the T highest-ranked clusters as the T hot events (T is a user-defined value); in each of these clusters, select the M features (that is, representative words) with the highest tf-idf values (M is user-defined, for example M = 20) as the hot-event keywords and display the hot events.
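Step A1 above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_into_blocks`, the `(timestamp, text)` input format, and the choice to align blocks to midnight of the earliest article's day are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_into_blocks(articles, interval=timedelta(days=1)):
    """Group (timestamp, text) pairs into time-ordered blocks.

    Assumption: blocks are aligned to midnight of the earliest article's day,
    and intervals that contain no articles produce no block.
    """
    ordered = sorted(articles, key=lambda item: item[0])
    if not ordered:
        return []
    origin = ordered[0][0].replace(hour=0, minute=0, second=0, microsecond=0)
    blocks = defaultdict(list)
    for timestamp, text in ordered:
        index = (timestamp - origin) // interval  # integer block number
        blocks[index].append(text)
    return [blocks[i] for i in sorted(blocks)]


# Hypothetical sample data: two articles on one day, one on the next.
articles = [
    (datetime(2012, 7, 3, 9, 0), "earthquake report"),
    (datetime(2012, 7, 3, 18, 30), "exam report"),
    (datetime(2012, 7, 4, 1, 15), "defense report"),
]
blocks = split_into_blocks(articles)
```

Passing a different `interval`, such as `timedelta(days=3)` or `timedelta(weeks=1)`, gives the other block lengths the description allows.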

The technical solution provided by the invention greatly improves the efficiency of online news clustering. Memory usage does not grow linearly with the data volume, so the method is suitable for large-scale text data analysis. In addition, the Dirichlet process mixture model, improved by adding a time factor, can model the emergence, development, and decay of hot news, which matches reality. Filtering the hot news improves system efficiency on the one hand and removes noise on the other, improving the accuracy of the system.

Description of drawings

Fig. 1 is a schematic diagram of the operation of the method of the invention.

Fig. 2 is a flow chart of the method of the invention.

Embodiment

The invention is further described below through an example.

Suppose there are three consecutive days of online news reports. On the first day there are 100 articles about an earthquake, 60 about the college entrance examination, 30 about national defense, 10 about diplomacy, and 5 about the economy. On the second day there are 70 articles about national defense, 30 about health care, 50 about the earthquake, and 20 with no clear topic. On the third day there are 80 articles about health care, 50 about tourism, and 10 with no clear topic. We do not know how many hot news events there are in total, nor which event each article belongs to.

First, some notation:

(1) m is the size of a block, that is, its number of texts. The texts are written in time order as x_{1:m} = (x_1, x_2, ..., x_m), where x_i denotes the i-th text, i = 1, ..., m.

(2) The clusters corresponding to the m texts in a block are written as the sequence assign_{1:m} = (assign_1, assign_2, ..., assign_m), where assign_j ∈ C and C denotes the set of clusters, that is, C = {c_1, c_2, ..., c_K}; the number of clusters is K = |C|.

(3) N_j denotes the number of texts belonging to cluster c_j.

(4) L denotes the total number of distinct words in the text collection (each word has an index).

(5) n_j^l denotes the total number of occurrences of the word with index l in the texts belonging to cluster c_j.

(6) β_j^l is the hyperparameter corresponding to n_j^l. The hyperparameters are given initial constant values (for example, each β_j^l is set to 1); α is also a hyperparameter (its value may likewise be set to 1).

(7) Γ(a) is the gamma function of mathematics. Its general form is:

\Gamma(a) = \int_{0}^{\infty} t^{a-1} e^{-t} \, dt

When the variable a is a positive integer, its value is a factorial, that is: Γ(a+1) = aΓ(a) = a! (for details see the Mathematics Handbook, Higher Education Press, 1st edition, pp. 587-589).
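The clustering formulas in this description contain the ratio Π Γ(n^l + β^l) / Γ(Σ n^l + β_0). Evaluated directly, gamma functions overflow floating point for even moderate word counts, so a common workaround is to compute the ratio in log space. The sketch below does this with Python's `math.lgamma`. The function name `log_gamma_ratio`, the single shared β, and the choice β_0 = L·β are assumptions for illustration; the description itself does not define β_0.

```python
import math


def log_gamma_ratio(word_counts, beta=1.0, beta0=None):
    """Return log( prod_l Gamma(n^l + beta) / Gamma(sum_l n^l + beta0) ).

    word_counts: list of length L holding the count n^l of each word
    (zeros included).
    Assumption: beta0 defaults to L * beta, the sum of a symmetric beta.
    """
    if beta0 is None:
        beta0 = beta * len(word_counts)
    numerator = sum(math.lgamma(n + beta) for n in word_counts)
    denominator = math.lgamma(sum(word_counts) + beta0)
    return numerator - denominator


# Three-word vocabulary with counts 2, 1, 0 and beta = 1:
# Gamma(3) * Gamma(2) * Gamma(1) / Gamma(6) = 2 / 120 = 1 / 60
value = log_gamma_ratio([2, 1, 0])
```

Because only comparisons between candidate clusters matter, working with log values throughout loses nothing and avoids overflow.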

Part A is implemented as follows:

A1. Divide the online news text sequence into a sequence of blocks by time interval; each block contains multiple news texts (below, the time unit is the day and the block interval is 1 day, so each block contains one day of news texts; the interval may also be set to other values, such as 3 days, 1 week, or 1 month).

A2. Cluster the texts of the first block (that is, the first day) in time order with the following algorithm:

Input: m ordered texts, written x_{1:m}

Output: the cluster corresponding to each text, that is, the sequence assign_{1:m}

Step 1: Initialize the cluster set to empty, that is, C = {}, and the number of clusters to 0, K = 0.

Step 2: Set an initial value p_max = 0.

Step 3: For each text in the block (let the current one be the i-th text x_i), repeat steps 3.1 to 3.3:

Step 3.1: Add a new cluster c_new, that is: C' = C ∪ {c_new}.

Step 3.2: For each cluster c_j ∈ C', repeat steps 3.2.1 to 3.2.2:

Step 3.2.1: With the text x_i assigned to c_j, compute the overall probability p of the current block as follows:

1. For the current text x_i and each text x_r before it (1 ≤ r ≤ i), compute the probability that it belongs to its corresponding cluster:

p(x_r) = \frac{N_{assign_r}}{\sum_{k=1}^{K} N_k + \alpha} \times \frac{\prod_{l=1}^{L} \Gamma(n_{assign_r}^{l} + \beta_{assign_r}^{l})}{\Gamma(\sum_{l=1}^{L} n_{assign_r}^{l} + \beta_0)}

2. Assume that each text x_r after x_i (i < r ≤ m) belongs to a new cluster of its own; its probability is:

p(x_r) = \frac{\alpha}{\sum_{k=1}^{K+1} N_k + \alpha} \times \frac{\prod_{l=1}^{L} \Gamma(n_r^{l} + \beta_r^{l})}{\Gamma(\sum_{l=1}^{L} n_r^{l} + \beta_0)}

3. The overall probability of the current block is the product of the probabilities computed above:

p = \prod_{r=1}^{m} p(x_r)

Step 3.2.2: If the probability p is greater than the maximum probability p_max, that is, p > p_max:

Step 3.2.2.1: Assign the i-th text x_i to cluster c_j: assign_i = c_j.

Step 3.2.2.2: Update the maximum probability: p_max = p.

Step 3.3: If the cluster assigned to the i-th text x_i does not belong to the set C, that is, assign_i = c_{K+1}:

Step 3.3.1: Add the new cluster c_{K+1} to the cluster set C: C = C ∪ {c_{K+1}}.

Step 3.3.2: Increase the number of clusters by 1, that is: K = K + 1.

Step 4: Return the cluster corresponding to each text, that is, assign_{1:m}.
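The greedy idea behind the steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not a literal transcription of step 3.2.1. Instead of multiplying probabilities over the whole block, it scores only the current text, using the standard Chinese-restaurant-process prior (cluster size, or α for a new cluster) times the Dirichlet-multinomial predictive likelihood, and it resets the best score for every text. The names `log_predictive` and `greedy_cluster`, the symmetric β, and β_0 = L·β are hypothetical choices.

```python
import math
from collections import Counter


def log_predictive(doc, cluster, beta, vocab_size):
    """log P(doc | cluster) under a Dirichlet-multinomial, up to a constant."""
    n_doc = sum(doc.values())
    beta0 = beta * vocab_size  # assumption: beta_0 is the sum of a symmetric beta
    total = cluster["total"]
    score = math.lgamma(total + beta0) - math.lgamma(total + n_doc + beta0)
    for word, count in doc.items():
        seen = cluster["counts"][word]
        score += math.lgamma(seen + count + beta) - math.lgamma(seen + beta)
    return score


def greedy_cluster(docs, alpha=1.0, beta=1.0, prior=()):
    """Assign each tokenized text, in order, to its best-scoring cluster.

    prior: optional (size, word_counts) pairs carried over from the
    previous block after decay and filtering.
    """
    clusters = [
        {"size": float(size), "counts": Counter(counts), "total": sum(counts.values())}
        for size, counts in prior
    ]
    vocab = {w for doc in docs for w in doc} | {w for c in clusters for w in c["counts"]}
    empty = {"size": alpha, "counts": Counter(), "total": 0}
    assignments = []
    for tokens in docs:
        doc = Counter(tokens)
        best_index, best_score = None, -math.inf
        # Candidates: every existing cluster, then one brand-new cluster whose
        # prior weight is alpha (the shared denominator is dropped).
        for index, cluster in enumerate(clusters + [empty]):
            score = math.log(cluster["size"]) + log_predictive(doc, cluster, beta, len(vocab))
            if score > best_score:
                best_index, best_score = index, score
        if best_index == len(clusters):
            clusters.append({"size": 0.0, "counts": Counter(), "total": 0})
        chosen = clusters[best_index]
        chosen["size"] += 1
        chosen["counts"].update(doc)
        chosen["total"] += sum(doc.values())
        assignments.append(best_index)
    return assignments, clusters


# Hypothetical toy block: two texts about an earthquake, two about an exam.
docs = [
    ["quake", "quake", "rescue"],
    ["quake", "rescue", "rescue"],
    ["exam", "exam", "school"],
    ["exam", "school", "school"],
]
assignments, clusters = greedy_cluster(docs)
```

Each text is visited once and compared against the current clusters only, so no repeated sampling sweeps over the data are needed, which is the source of the speed-up over Gibbs sampling that the description claims for greedy search.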

The above process yields 5 clusters, representing the earthquake, the college entrance examination, national defense, diplomacy, and the economy, containing 100, 60, 30, 10, and 5 texts respectively.

A3. Before clustering the second day, first decay the clustering result of the first day. Suppose the decay factor a is 0.5; after decay, the sizes of the clusters become 50, 30, 15, 5, and 2.5 respectively. Then filter. Suppose the filtering threshold is t = 30; after filtering, only the first two clusters remain, namely the earthquake (cluster c_1) and the college entrance examination (cluster c_2), which serve as the hot-event prior distribution for clustering the second day (that is, the second block). The second block is clustered in the same way as above; the only difference is that the initialization in step 1 becomes:

Step 1': Initialize the cluster set C = {c_1, c_2}, and the number of clusters to 2, K = 2.
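The decay and filtering of step A3 can be checked against the numbers of this example with a short sketch. The function name `decay_and_filter` and the dictionary keys `name`, `size`, and `days_reported` are assumptions made for illustration; the default values a = 0.5, t = 30, and 150 days are the ones quoted in the description.

```python
def decay_and_filter(clusters, a=0.5, t=30, max_days=150):
    """Decay every cluster size by factor a, then drop weak or stale clusters."""
    kept = []
    for cluster in clusters:
        decayed = a * cluster["size"]  # r' = a * r
        if decayed < t:
            continue  # smaller than the threshold after decay
        if cluster["days_reported"] > max_days:
            continue  # reported on for longer than the allowed duration
        kept.append({**cluster, "size": decayed})
    return kept


# First-day clustering result from the example above.
day1 = [
    {"name": "earthquake", "size": 100, "days_reported": 1},
    {"name": "college entrance examination", "size": 60, "days_reported": 1},
    {"name": "national defense", "size": 30, "days_reported": 1},
    {"name": "diplomacy", "size": 10, "days_reported": 1},
    {"name": "economy", "size": 5, "days_reported": 1},
]
prior = decay_and_filter(day1)
```

The college entrance examination cluster survives because its decayed size, 30, is not below the threshold; only sizes strictly less than t are deleted.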

Part B: the heat ranking and the display of events are implemented as follows:

B1. Heat ranking of events:

Step 1: Count the number of texts in each cluster c_j, denoted N_j.

Step 2: Compute the time span D_j of the texts in each cluster c_j (in time units such as days: the interval between the latest and the earliest report time, for example the number of days of continued reporting).

Step 3: Compute the average report volume of cluster c_j per time unit: Score_j = N_j / D_j.

Step 4: Sort the clusters by Score_j in descending order and take the first T (for example T = 10) as the T hottest events.
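The ranking in B1 amounts to one division and one sort. The sketch below is a minimal illustration: the function name `rank_hot_events`, the dictionary keys, and the sample counts are hypothetical, and it assumes that D_j is at least 1 so that an event reported on a single day does not cause a division by zero.

```python
def rank_hot_events(clusters, top_t=10):
    """Return the top_t (name, Score_j) pairs, with Score_j = N_j / D_j."""
    scored = []
    for cluster in clusters:
        span = max(cluster["span_days"], 1)  # assumption: D_j is at least one day
        scored.append((cluster["name"], cluster["n_texts"] / span))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_t]


# Illustrative counts only, not results stated in the patent.
events = [
    {"name": "earthquake", "n_texts": 150, "span_days": 2},
    {"name": "health care", "n_texts": 110, "span_days": 2},
    {"name": "tourism", "n_texts": 50, "span_days": 1},
    {"name": "college entrance examination", "n_texts": 60, "span_days": 1},
]
hottest = rank_hot_events(events, top_t=3)
```

Dividing by the time span means a short, intense burst of reports can outrank a longer-running story with more total articles.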

B2. Display of hot events:

Step 1: Treat each cluster as one "big text", so that all clusters together form a collection of "big texts".

Step 2: With the "big text" collection as the background, compute the tf-idf value of each feature f (a feature is a word in the text) in the T hottest events (clusters).

Step 3: Take the M (for example M = 20) features with the highest tf-idf values as the hot-event keywords and display them.
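The keyword selection in B2 can be sketched as follows, assuming the texts are already tokenized. The function name `top_keywords`, the input layout (cluster id mapped to a list of token lists), and the alphabetical tie-breaking are assumptions; the tf-idf formula with a base-10 logarithm is the one defined earlier in this description.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, m=20):
    """For each cluster, return the m words with the highest tf-idf value.

    Each cluster is merged into one "big text"; df counts how many
    big texts contain a word, and Num is the number of big texts.
    """
    big_texts = {
        cid: Counter(token for doc in docs for token in doc)
        for cid, docs in cluster_docs.items()
    }
    num = len(big_texts)
    df = Counter()
    for counts in big_texts.values():
        df.update(counts.keys())
    keywords = {}
    for cid, counts in big_texts.items():
        scores = {word: tf * math.log10(num / df[word]) for word, tf in counts.items()}
        # Highest score first; ties are broken alphabetically (an assumption).
        keywords[cid] = sorted(scores, key=lambda w: (-scores[w], w))[:m]
    return keywords


# Hypothetical toy clusters.
cluster_docs = {
    "quake": [["quake", "rescue", "news"], ["quake", "news"]],
    "exam": [["exam", "school", "news"]],
}
keywords = top_keywords(cluster_docs, m=2)
```

A word such as "news" that appears in every big text has df = Num and therefore a score of 0, so it never displaces event-specific words.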

Claims

1. A rapid method for detecting hot events in time-ordered, massive online news, comprising:

A1. dividing the online news text sequence into a sequence of blocks by time interval, each block containing the news texts of one time interval;

A2. clustering the news texts of the first block with a Dirichlet process to form a set of clusters;

A3. decaying and filtering the result of clustering the preceding block, using it as the prior distribution of the subsequent block, and then clustering the subsequent block with the Dirichlet process;

B. ranking and displaying the hot events, comprising:

B2. taking the T highest-ranked clusters as the hot events, selecting the M features with the highest tf-idf values in each cluster as the hot-event keywords, and displaying the hot events,

wherein T and M are user-defined values;

tf is the frequency with which a word or phrase term occurs in a text Text, df is the number of texts in the text collection in which the word or phrase occurs, Num is the total number of texts in the collection, and the logarithm log is base 10.

2. The hot-event detection method of claim 1, characterized in that in step A1 the time interval is 1 day, and each block contains one day of news texts.

3. The hot-event detection method of claim 1, characterized in that in step A3 the decay is performed as follows: after the preceding block has been processed, each cluster formed is decayed by a decay factor a; if the size of a cluster is r, then after decay its size becomes r' = a*r, where a ∈ (0,1), and the feature distribution inside the cluster remains unchanged.

4. The hot-event detection method of claim 1, characterized in that in step A3 the filtering is performed as follows: clusters whose size is below a threshold t are deleted and, at the same time, clusters that have been reported on continuously for longer than a given length of time are deleted.

5. The hot-event detection method of claim 1, characterized in that step A2 is implemented as follows:

Step 1: initialize the cluster set C to empty and the number of clusters K to 0;

Step 2: set an initial value of the maximum probability, p_max = 0;

Step 3: for each text x_i in the block, repeat steps 3.1 to 3.3:

Step 3.1: add a new cluster c_new, denoting C' = C ∪ {c_new};

1. for the current text x_i and each text x_r before it, 1 ≤ r ≤ i, compute the probability that it belongs to its corresponding cluster:

p(x_r) = \frac{N_{assign_r}}{\sum_{k=1}^{K} N_k + \alpha} \times \frac{\prod_{l=1}^{L} \Gamma(n_{assign_r}^{l} + \beta_{assign_r}^{l})}{\Gamma(\sum_{l=1}^{L} n_{assign_r}^{l} + \beta_0)}

2. assume that each text x_r after x_i, i < r ≤ m, belongs to a new cluster of its own; its probability is then:

p(x_r) = \frac{\alpha}{\sum_{k=1}^{K+1} N_k + \alpha} \times \frac{\prod_{l=1}^{L} \Gamma(n_r^{l} + \beta_r^{l})}{\Gamma(\sum_{l=1}^{L} n_r^{l} + \beta_0)}

3. the overall probability of the current block is the product of the probabilities computed above for the clusters to which the texts belong:

p = \prod_{r=1}^{m} p(x_r)

Step 3.2.2.2: update the maximum probability: p_max = p;

Step 3.3: if the cluster assigned to the i-th text x_i does not belong to the set C, that is, assign_i = c_{K+1}:

Step 3.3.2: increase the number of clusters by 1, that is: K = K + 1;

wherein the clusters corresponding to the m texts in the block are written as the sequence assign_{1:m} = (assign_1, assign_2, ..., assign_m), where assign_j ∈ C, C denotes the set of clusters, that is, C = {c_1, c_2, ..., c_K}, and the number of clusters is K = |C|; N_j denotes the number of texts belonging to cluster c_j; L denotes the total number of distinct words in the text collection; n_j^l denotes the total number of occurrences of the word with index l in the texts belonging to cluster c_j; β_j^l is the hyperparameter corresponding to n_j^l; α is also a hyperparameter; and the hyperparameters are given initial constant values.