WO2017035922A1 - Online Internet topic mining method based on improved LDA model

Info

Publication number: WO2017035922A1
Application number: PCT/CN2015/092047
Authority: WO
Inventors: 杨鹏; 卢云骋; 董永强
Original assignee: 杨鹏
Priority date: 2015-09-02
Filing date: 2015-10-16
Publication date: 2017-03-09
Also published as: CN105138665B; CN105138665A

Abstract

Disclosed is an online Internet topic mining method based on an improved LDA model. The method is a continuous, streaming topic mining process carried out in segments: n web pages are processed at a time, usually acquired from the Internet by web crawlers online and in real time, and mining their contents generates k topics. After the current n web pages are processed, the next n newly acquired web pages are processed in the same way. The process mainly comprises initialization of the On-LDA model hyper-parameters, dynamic updating of the On-LDA model hyper-parameters, and Internet topic mining based on the On-LDA model. The invention fundamentally changes how the hyper-parameters α and β of a traditional LDA model are assigned and used in the topic mining process. The classification information of the web page contents is fully used to assign initial values to the model hyper-parameters, so that the initial values depend entirely on the web page contents to be mined, which simplifies the computation and makes the assignment more reasonable.

Description

An Online Topic Mining Method Based on Improved LDA Model

Technical field

The invention belongs to the field of Internet technology and particularly relates to an online topic mining method based on an improved LDA model. The method overcomes the inaccuracy of the traditional LDA model when Internet topics are mined dynamically, and can detect and mine online, in real time, the topics contained in a large number of web resources.

Background art

The rapid development and widespread popularity of the Internet have made it an important medium through which people quickly acquire, publish and deliver information. In recent years the mobile Internet in particular has developed greatly; it combines the advantages of mobile communication and the Internet and makes information even easier to obtain. Information resources from many sources and with differing standpoints keep emerging on the Internet, and some of the hot and sensitive topics they reflect spread very quickly over the Internet, with a major impact on society. It is therefore very important to detect and mine the topics contained in large numbers of web page information resources, to quickly discover and capture hot topics on the network, to cluster Internet information resources by topic, to track and monitor the network in real time, to organize Internet content big data sensibly, and to guide readers quickly to the information that interests them.

Research on complex networks has shown that the Internet has evolved into a scale-free network that obeys a power law. One of the main manifestations of this scale-free property is that a small number of websites have thousands of links, far more than the average website; they form hubs in the World Wide Web (Web) and are the main source of Internet content access traffic. Exploiting this property, web-crawler-based information collection applied to mainstream and popular websites can dynamically and efficiently gather a large number of web page information resources with high coverage, which provides the basis for real-time detection and mining of Internet topics. However, the dynamically aggregated web page information resources are large in volume and complex, their content is generally time-sensitive, and topics and their popularity often change over time.

Among existing algorithm models for topic mining and detection, the more influential include the PLSA (Probabilistic Latent Semantic Analysis) model and the LDA (Latent Dirichlet Allocation) model. Analysis shows that the PLSA model's probabilistic modelling of the topic multinomial distributions is incomplete (it focuses only on the likelihood function and ignores the prior distribution of the parameters), and that its model complexity and iterative computation grow significantly as the number of documents and words increases. The LDA model relies on two Dirichlet distribution hyperparameters, α and β. These are usually set empirically at the outset, or set to the values that gave the best result in a preliminary experiment on a specific corpus, and once set they remain unchanged throughout the topic mining process. In addition, the LDA model uses the same hyperparameter β when generating the word probability distribution of every topic, which is not reasonable.

Therefore, topic mining and detection models such as PLSA and LDA are generally suited to offline topic mining on a relatively static corpus; for real-time, streaming online mining of Internet topics, their reasonableness, timeliness, computational efficiency and accuracy are greatly reduced.

Summary of the invention

Object of the invention: In view of the problems existing in the prior art, the present invention provides an online topic mining method based on an improved LDA model. The method is built on an improved LDA model (Online LDA, abbreviated On-LDA), which first uses the classification information of the web page content to be mined to assign initial values to the hyperparameters α and β, and then dynamically updates the hyperparameters of the On-LDA model with the relevant statistics each time a round of topic mining is completed. The online topic mining method based on the improved LDA model (On-LDA model) effectively overcomes the limitation of traditional models such as PLSA and LDA to static, offline topic mining environments. It reflects more accurately and promptly how topics on the Internet evolve dynamically as new web pages keep emerging, enabling online detection and mining of the topics contained in a large number of web content resources.

The "topic" in the present invention refers to a collection of subject words or phrases that are extracted from the content of a given web page collection and that are normalized and reflect deep semantic features such as the subject matter and meaning of the web page content. The invention adopts the On-LDA model as a basis for online mining of topics included in a large number of web resources in the Internet. The On-LDA model is an improved LDA model that supports dynamic, online topic mining.

Technical solution: An online topic mining method based on an improved LDA model, corresponding to a continuous, streaming, segment-by-segment topic mining process that handles n (≥ 1) web pages at a time. The web pages are usually collected from the Internet by web crawlers online and in real time, and mining their contents generates k (≥ 1) topics. After the current n web pages are processed, the same process continues with the n newly acquired web pages. Suppose that at the initial time t₀ topic mining is performed on a web resource set C⁰ consisting of n web pages, that all the distinct words contained in C⁰ form the word set W⁰, and that mining generates a topic set consisting of k topics. At time t_i (i > 0), topic mining is performed on the web resource set Cⁱ; the distinct words contained in all of its web pages form the word set Wⁱ, and mining generates the topic set Zⁱ. The sizes of the word sets are v⁰ = |W⁰| and vⁱ = |Wⁱ|.
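The segment-by-segment processing described above can be illustrated with a minimal Python sketch. The names `batches` and `vocabulary` are hypothetical helpers introduced only for this illustration, and pages are assumed to be whitespace-tokenized strings; the source does not specify tokenization or how a final short segment is handled.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(pages: Iterable[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive segments of n pages from a (possibly endless) stream.

    Assumption: a trailing segment with fewer than n pages is still yielded.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    stream = iter(pages)
    while True:
        segment = list(islice(stream, n))
        if not segment:
            return
        yield segment


def vocabulary(segment: List[str]) -> List[str]:
    """Distinct words of one segment, i.e. the word set W of that segment."""
    return sorted({word for page in segment for word in page.split()})


if __name__ == "__main__":
    for index, segment in enumerate(batches(["a b", "b c", "c d"], n=2)):
        print(index, segment, vocabulary(segment))
```

Each yielded segment plays the role of one web resource set Cⁱ, and `len(vocabulary(segment))` corresponds to vⁱ.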

The online topic mining method based on the improved LDA model (On-LDA model) mainly involves three computing processes: initialization of the On-LDA model hyperparameters, dynamic update of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model.

Initialization of the On-LDA model hyperparameters. The On-LDA model uses the classification information of web content to assign initial values to the hyperparameters α and β. For web resources in a given domain of the Internet (such as news), the content of each web page corresponds to one category of that domain (such as current affairs, military or technology), which is part of the page's content metadata. Suppose that all classification information for all web resource content in the given domain is represented by the set G = {cat₁, cat₂, ..., cat_g}, where g = |G| and cat_s (1 ≤ s ≤ g) is one specific category (such as current affairs). First, the parameter k is set to the size of G, that is, k = g = |G|, which determines the number of topics generated by each round of On-LDA mining. On this basis, α and β are initialized to obtain their values at the initial time t₀ (the superscript T in the formulas denotes matrix transpose). [Initialization formulas not reproduced in this text.]

For 1 ≤ s ≤ k, the s-th component of the initial α is computed from count_doc(cat_s), the total number of web pages in the web resource set C⁰ whose content belongs to the category cat_s.

The entries of the s-th component of the initial β are computed from a per-word count (also rendered as count_doc(cat_s) in this text): the total number of times the given word appears in all web pages of C⁰ whose classification information is cat_s.
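As an illustration of category-driven initialization, the following Python sketch derives one α component per category from page counts and one β row per category from word counts. The exact formulas are not reproduced in this text, so the normalization used here (relative frequencies plus a small `smoothing` constant) is an assumption, as are the function name `init_hyperparameters` and the input layout.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def init_hyperparameters(
    pages: Sequence[Tuple[str, List[str]]],  # (category, tokens) for each page of C0
    categories: Sequence[str],               # the set G, so k = len(categories)
    vocab: Sequence[str],                    # the word set W0
    smoothing: float = 0.01,                 # assumption: keeps Dirichlet parameters positive
) -> Tuple[List[float], List[List[float]]]:
    """Return (alpha, beta): alpha has k entries, beta is a k x v matrix."""
    n = len(pages)
    if n == 0:
        raise ValueError("at least one page is required")
    # Number of pages per category (the role of count_doc(cat_s) in the text).
    doc_count = Counter(cat for cat, _ in pages)
    # Number of occurrences of each word within the pages of each category.
    word_count: Dict[str, Counter] = {cat: Counter() for cat in categories}
    for cat, tokens in pages:
        word_count[cat].update(tokens)  # assumes every page category is in G
    # Assumed normalization: share of pages per category.
    alpha = [doc_count[cat] / n + smoothing for cat in categories]
    beta = []
    for cat in categories:
        total = sum(word_count[cat].values())
        # Assumed normalization: share of each word within the category.
        beta.append(
            [(word_count[cat][w] / total if total else 0.0) + smoothing for w in vocab]
        )
    return alpha, beta
```

With this layout, each topic starts out aligned with one category, and each topic has its own row of β rather than a single shared value.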

Dynamic update of the On-LDA model hyperparameters. In the continuous, streaming topic mining process, each time a round of topic mining is completed the On-LDA model uses the resulting statistics to update the hyperparameters α and β, and the updated hyperparameters are used for the next round of topic mining. This differs significantly from the classic LDA model.

The update process of the On-LDA model hyperparameters is as follows. At the initial time t₀, α and β take their initialization values. Assume that at time t_i (i ≥ 1) α and β have their current values and that, with these values, topic mining of the web resource set Cⁱ has generated the topic set Zⁱ. The hyperparameters are then updated as follows.

First, α is updated to its new value by a formula that combines a count matrix with a time weight matrix [formula not reproduced in this text]. The j-th column (0 ≤ j ≤ i) of the count matrix gives, over all web pages of the web resource set C^j, the frequency of the words belonging to each topic of the topic set Zⁱ; that is, its s-th entry is the number of words in all pages of C^j that are labelled with the s-th topic.

Considering that the further in the past web content lies relative to the current time (t_i), the less influence it has on the current topic mining, an exponential decay function can be used when updating the hyperparameters of the On-LDA model to represent the weight of the web content of past moments on the current topic mining, forming a time weight matrix [formula not reproduced in this text], where λ is the attenuation factor and n₀ is the normalization constant.
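A small Python sketch of such exponentially decaying time weights follows. The precise form of the weight matrix is not reproduced in this text; the sketch assumes that segment j receives a weight proportional to exp(-λ(i - j)) and that n₀ is chosen so the weights sum to 1. The function name `time_weights` is hypothetical.

```python
import math
from typing import List


def time_weights(i: int, lam: float) -> List[float]:
    """Weights for segments j = 0..i at current time index i.

    Assumption: weight_j = exp(-lam * (i - j)) / n0, with n0 the sum of the
    unnormalized weights, so that the weights add up to 1.
    """
    if i < 0:
        raise ValueError("i must be >= 0")
    raw = [math.exp(-lam * (i - j)) for j in range(i + 1)]
    n0 = sum(raw)  # normalization constant
    return [value / n0 for value in raw]


if __name__ == "__main__":
    # Older segments receive smaller weights than recent ones.
    print(time_weights(3, lam=0.5))
```

A larger λ makes the model forget older segments faster; λ = 0 would weight all segments equally.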

Next, β is updated to its new value [formula not reproduced in this text]. For 1 ≤ s ≤ k, the update of the s-th component of β uses a count matrix whose j-th column (0 ≤ j ≤ i) records, for the s-th topic, the number of occurrences of each word of the word set Wⁱ at time t_i. If the topic contains a given word, the corresponding entry equals the total number of occurrences of that word in all pages of C^j; if the topic does not contain the word, the entry equals 0. The time weight matrix is the same as before.
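One plausible reading of the update step is a time-weighted sum of per-segment counts, sketched below in Python. This is an assumption, since the update formulas are not reproduced in this text; `weighted_update` is a hypothetical name, `counts[s][j]` stands for the statistic of topic s gathered from segment C^j, and `weights[j]` for the time weight of that segment.

```python
from typing import List, Sequence


def weighted_update(
    counts: Sequence[Sequence[float]],  # k rows, one column per segment j = 0..i
    weights: Sequence[float],           # one time weight per segment
) -> List[float]:
    """Combine per-segment counts into one value per topic.

    Assumed rule: new_value[s] = sum over j of counts[s][j] * weights[j],
    i.e. the count matrix multiplied by the time weight vector.
    """
    if any(len(row) != len(weights) for row in counts):
        raise ValueError("every row needs one count per segment")
    return [sum(c * w for c, w in zip(row, weights)) for row in counts]


if __name__ == "__main__":
    # Two topics, two segments; the recent segment carries weight 0.75.
    print(weighted_update([[10, 20], [30, 0]], [0.25, 0.75]))
```

For β the same combination would be applied once per word of each topic, with zero counts for words the topic does not contain.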

Internet topic mining based on the On-LDA model. Suppose that at time t_i (i ≥ 0) topic mining is to be performed on a web resource set Cⁱ. The values of the hyperparameters α and β of the On-LDA model are determined first. If topic mining is performed on the first collected web resource set C⁰ at time t₀, the initial values of α and β are calculated according to the initialization process of the On-LDA model hyperparameters. If topic mining is performed on the web resource set Cⁱ collected at time t_i (i ≥ 1), α and β take the values obtained by the dynamic update of the On-LDA model hyperparameters at the end of topic mining at the previous time (t_{i-1}).

Then, according to the On-LDA probabilistic graphical model and using the Gibbs sampling method shown in FIG. 3, topic mining is performed on the web resource set Cⁱ to generate the topic set Zⁱ and to obtain, for every web page in Cⁱ (indexed by u, 1 ≤ u ≤ n), its semantic feature vector with respect to the topic set Zⁱ. The s-th component of this vector (1 ≤ s ≤ k) is the probability that the web page belongs to the s-th topic.

It should be noted that, in the Internet topic mining process based on the On-LDA model, not only are the values of the hyperparameters α and β updated dynamically as mining proceeds, but when the word probability distributions of the k topics are generated at time t_i, different topics use different hyperparameters (that is, the k different components of β). This is much more reasonable than the traditional LDA model, which always uses a fixed, preset hyperparameter β.
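To make the role of per-topic hyperparameters concrete, here is a minimal collapsed Gibbs sampler for LDA with an asymmetric α vector and a separate β row per topic. It is standard collapsed Gibbs sampling written as a sketch, not necessarily identical to the procedure in the patent's figure; the function name `gibbs_lda` and the integer word-id encoding are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def gibbs_lda(
    docs: Sequence[Sequence[int]],    # each page as a list of word ids in [0, v)
    alpha: Sequence[float],           # k positive values, one per topic
    beta: Sequence[Sequence[float]],  # k x v positive values, one row per topic
    iterations: int = 50,
    seed: int = 0,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (theta, n_kw): page-topic probabilities and topic-word counts."""
    rng = random.Random(seed)
    k, v = len(alpha), len(beta[0])
    beta_sum = [sum(row) for row in beta]
    n_dk = [[0] * k for _ in docs]      # words of page d assigned to topic s
    n_kw = [[0] * v for _ in range(k)]  # times word w is assigned to topic s
    n_k = [0] * k                       # total words assigned to topic s
    assign: List[List[int]] = []
    for d, doc in enumerate(docs):      # random initial topic assignment
        topics = []
        for w in doc:
            z = rng.randrange(k)
            topics.append(z)
            n_dk[d][z] += 1
            n_kw[z][w] += 1
            n_k[z] += 1
        assign.append(topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for r, w in enumerate(doc):
                z = assign[d][r]        # exclude the current assignment
                n_dk[d][z] -= 1
                n_kw[z][w] -= 1
                n_k[z] -= 1
                weights = [
                    (n_dk[d][s] + alpha[s])
                    * (n_kw[s][w] + beta[s][w]) / (n_k[s] + beta_sum[s])
                    for s in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assign[d][r] = z        # record the newly sampled topic
                n_dk[d][z] += 1
                n_kw[z][w] += 1
                n_k[z] += 1
    alpha_sum = sum(alpha)
    theta = [
        [(n_dk[d][s] + alpha[s]) / (len(doc) + alpha_sum) for s in range(k)]
        for d, doc in enumerate(docs)
    ]
    return theta, n_kw
```

Each row of `theta` plays the role of a page's semantic feature vector, and the rows of `n_kw` supply the per-topic word counts that a subsequent hyperparameter update could draw on.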

Beneficial effects: The online topic mining method based on the improved LDA model (On-LDA model) fundamentally changes how the hyperparameters α and β of the traditional LDA model are assigned and used in the topic mining process. It makes full use of the classification information of the web content to assign initial values to α and β, so that the initial values depend entirely on the content of the web pages to be mined (rather than on a pre-selected corpus), which simplifies the calculation and is more reasonable. At the same time, the values of α and β change dynamically with the content of the web pages already processed (rather than remaining fixed throughout topic mining), so that the evolution of topics on the Internet is reflected more accurately and promptly. These features mean that the invention is no longer limited to static, offline topic mining environments; in online detection and mining of Internet topics in particular, it offers better timeliness, computational efficiency and accuracy than traditional topic mining methods.

Description of the drawings

Figure 1 is the probabilistic graphical model of the improved LDA model (On-LDA model), which describes how the On-LDA model generates the word sets of all documents. In the figure, α and β are the hyperparameters of the Dirichlet distributions, which take specific values at different times, and the s-th component vector of the current β corresponds to the s-th topic (1 ≤ s ≤ k). Suppose that topic mining is performed on the content of n web pages at a certain time t and that k topics are generated. The figure then shows the topic distribution of the i-th web page c_i (1 ≤ i ≤ n), the word distribution of the s-th topic (1 ≤ s ≤ k), the topic number tn_{i,r} assigned to the r-th word of the web page c_i, and the r-th word w_{i,r} of the web page c_i.

Figure 2 is the dynamic update process of the On-LDA model hyperparameters α and β.

Figure 3 shows the Gibbs sampling process for topic mining based on the On-LDA model. Z⁽⁰⁾ is the initial value of the topic set Zⁱ. Two counts are used in the figure: the number of times a word appears in a topic, and the number of times a topic appears in a web page. The sampling probability is computed for the r-th word of a web page: with the topic number currently assigned to that word excluded, the information of the web page set Cⁱ and the word set Wⁱ is used to calculate the probability distribution of the r-th word over the topics. Θ denotes the matrix whose row vectors are the semantic feature vectors of the web pages. Φ denotes the matrix whose row vectors are the probability distributions of the k topics over all words of Wⁱ.

Detailed description

The invention will be further clarified below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; modifications of the invention in various equivalent forms fall within the scope defined by the appended claims.

(1) The On-LDA model is used as the basis for online mining of the topics contained in a large number of web resources on the Internet. The On-LDA model is an improved LDA model that supports dynamic, online topic mining; its probabilistic graphical model is shown in Figure 1. Its meaning is that the process of mining k topics from n web pages (documents) can essentially be regarded as the process of generating the set of words in each web page (document). First, the current hyperparameter α is used to generate the topic distribution of each web page c_i (1 ≤ i ≤ n), and the topic number tn_{i,r} of each word of c_i is generated by sampling from that topic distribution. At the same time, each component vector of the current hyperparameter β is used to sample the word distribution of the corresponding topic (the s-th topic). Finally, each word w_{i,r} of the web page c_i is generated by sampling, which yields the word set of c_i.

The online topic mining method based on the On-LDA model corresponds to a continuous, streaming, segment-by-segment topic mining process that handles n (≥ 1) web pages at a time. The web pages are usually collected from the Internet by web crawlers online and in real time, and mining their contents generates k (≥ 1) topics. After the current n web pages are processed, the same process continues with the n newly acquired web pages. Suppose that at the initial time t₀ topic mining is performed on a web resource set C⁰ consisting of n web pages, that the distinct words contained in C⁰ form the word set W⁰, and that mining generates a topic set consisting of k topics. At time t_i (i > 0), topic mining is performed on the web resource set Cⁱ; the distinct words contained in all of its web pages form the word set Wⁱ, and mining generates the topic set Zⁱ. The sizes of the word sets are v⁰ = |W⁰| and vⁱ = |Wⁱ|.
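The generative reading of the graphical model in item (1) can be sketched in Python as follows. Dirichlet draws are built from gamma variates, which is a standard construction; the function names `sample_dirichlet` and `generate_corpus`, the fixed page length and the integer word ids are assumptions made for the illustration.

```python
import random
from typing import List, Sequence


def sample_dirichlet(params: Sequence[float], rng: random.Random) -> List[float]:
    """Draw one probability vector from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    if total == 0.0:
        raise ValueError("parameters too small for a stable draw")
    return [x / total for x in draws]


def generate_corpus(
    alpha: Sequence[float],           # k values: prior of the page-topic distribution
    beta: Sequence[Sequence[float]],  # k x v values: one prior row per topic
    n_pages: int,
    length: int,                      # assumption: every page has the same length
    rng: random.Random,
) -> List[List[int]]:
    """Generate n_pages pages, each a list of word ids."""
    k = len(alpha)
    # One word distribution per topic, each drawn from its own row of beta.
    phi = [sample_dirichlet(row, rng) for row in beta]
    corpus = []
    for _ in range(n_pages):
        theta = sample_dirichlet(alpha, rng)  # topic distribution of this page
        page = []
        for _ in range(length):
            tn = rng.choices(range(k), weights=theta)[0]                # topic number
            w = rng.choices(range(len(phi[tn])), weights=phi[tn])[0]    # word id
            page.append(w)
        corpus.append(page)
    return corpus


if __name__ == "__main__":
    pages = generate_corpus([0.5, 0.5], [[1.0, 1.0, 1.0]] * 2, 3, 5, random.Random(1))
    print(pages)
```

Topic mining inverts this process: given only the words, it infers the topic assignments and the two families of distributions.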

(2) Initialization of the On-LDA model hyperparameters. The On-LDA model uses the classification information of web content to initialize the hyperparameters α and β, obtaining their values at the initial time t₀ (the superscript T in the formulas denotes matrix transpose). [Initialization formulas not reproduced in this text.] For 1 ≤ s ≤ k, the components are computed from the category statistics count_doc(cat_s) defined in the technical solution.

(3) Dynamic update of the On-LDA model hyperparameters. In the continuous, streaming topic mining process, each time a round of topic mining is completed the On-LDA model uses the resulting statistics to update the hyperparameters α and β, and the updated hyperparameters are used for the next round of topic mining. This differs significantly from the classic LDA model. The update process of the On-LDA model hyperparameters is shown in Figure 2. At the initial time t₀, α and β take their initialization values. Assume that at time t_i (i ≥ 1) α and β have their current values and that, with these values, topic mining of the web resource set Cⁱ has generated the topic set Zⁱ.

First, α is updated [formula not reproduced in this text] using a count matrix whose j-th column (0 ≤ j ≤ i) holds, for each topic, the number of words in all pages of C^j that are labelled with that topic, together with a time weight matrix in which λ is the attenuation factor and n₀ is the normalization constant.

Next, β is updated [formula not reproduced in this text]. For 1 ≤ s ≤ k, the update uses a count matrix whose j-th column (0 ≤ j ≤ i) is defined per word: if the s-th topic contains a given word, the corresponding entry equals the total number of occurrences of that word at time t_i in all pages of C^j; if the topic does not contain the word, the entry equals 0. The time weight matrix is the same as before.

(4) Internet topic mining based on the On-LDA model. Suppose that at time t_i (i ≥ 0) topic mining is to be performed on a web resource set Cⁱ. The values of the hyperparameters α and β are determined first: their initial values at time t₀, or the values obtained by the dynamic update at the end of the previous round of mining for i ≥ 1. Then, according to the On-LDA probabilistic graphical model shown in FIG. 1 and using the Gibbs sampling method shown in FIG. 3, topic mining is performed on the web resource set Cⁱ to generate the topic set Zⁱ and to obtain, for every web page in Cⁱ, a semantic feature vector with respect to the topic set Zⁱ, each component of which is the probability that the web page belongs to the corresponding topic.

Different topics use different hyperparameters (that is, the k different components of β), which is much more reasonable than the traditional LDA model, which always uses a fixed, preset hyperparameter.

The online topic mining method based on the improved LDA model (On-LDA model) proposed by the present invention is verified by an example, including:

(1) First, using a web resource corpus of a given domain of the Internet (such as news), collect all the classification information of all web resource content in that domain to obtain the set G = {cat₁, cat₂, ..., cat_g}, and set the parameter k to the size of this set. For example, when the invention is applied to online mining and real-time detection of news web pages on the Internet, the web content of mainstream and popular news websites is first classified, yielding 20 categories (including current affairs, international, rule of law, military, technology, and so on), so the parameter is set to k = 20.

(2) Next, tools such as web crawlers collect popular news web resources from the network in real time, batch by batch, and topic mining is performed once for every n web pages collected. In this example n = 1000. The time at which the first 1000 news pages have been collected is recorded as t₀; these web pages form the web resource set C⁰, and the classification information of each web page is recorded when it is collected. The initial values of the hyperparameters α and β are calculated according to the initialization process of the On-LDA model hyperparameters in the technical solution. [Numerical values not reproduced in this text.] The components that correspond to high-dimensional sparse matrices are omitted here.

Then, based on the On-LDA model, topic mining is performed on the web resource set C⁰: 20 topics are computed by Gibbs sampling, each consisting of 5 words. The first four topics are: [topic table not reproduced in this text]

Then, according to the dynamic update process of the On-LDA model hyperparameters in the technical solution, the hyperparameters α and β are updated to their new values. [Numerical values not reproduced in this text.] The components that correspond to high-dimensional sparse matrices are omitted.

(3) Next, in this example, each time a further 1000 hot news web pages have been collected in real time, 20 topics are first mined from these web pages according to the On-LDA-based Internet topic mining process in the technical solution. At the end of mining, the hyperparameters α and β are updated according to the dynamic update process of the On-LDA model hyperparameters in the technical solution.

(4) After topic mining has continued for a period of time, topic mining is performed at time t₁₀ on the 1000 newly collected news pages, producing the 20 topics of the current time, of which the first four are: [topic table not reproduced in this text]

At the end of topic mining at time t₁₀, the On-LDA model hyperparameters α and β are dynamically updated to their new values. [Numerical values not reproduced in this text.] The components that correspond to high-dimensional sparse matrices are omitted.

The above example shows that, with the online topic mining method based on the On-LDA model, the topics generated in two mining rounds separated by a certain time interval are related to each other and reflect the dynamic evolution of topics, so that changes in news focus over time are captured promptly. When the invention is applied to topic mining of Internet web content, the online mining results can be used not only to detect and analyze hot topics currently emerging on the network, but also, via the semantic feature vectors of web pages, to determine the similarity between web page contents and to perform content aggregation analysis and personalized recommendation.
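As a sketch of how semantic feature vectors might be compared, the following Python function computes the cosine similarity of two page-topic vectors. The source does not state which similarity measure is used, so cosine similarity and the name `cosine_similarity` are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two topic-probability vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_u = [0.7, 0.2, 0.1]  # hypothetical feature vectors over k = 3 topics
    page_v = [0.6, 0.3, 0.1]
    print(cosine_similarity(page_u, page_v))
```

Pages whose vectors score close to 1 concentrate on the same topics and are natural candidates for aggregation or recommendation.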

Claims

1. An online topic mining method based on an improved LDA model, which comprises initialization of the On-LDA model hyperparameters, dynamic update of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model;

initialization of the On-LDA model hyperparameters: the On-LDA model uses the classification information of web content to assign initial values to the hyperparameters α and β;

dynamic update of the On-LDA model hyperparameters: in the continuous, streaming topic mining process, each time a round of topic mining is completed the On-LDA model uses the resulting statistics to update the hyperparameters α and β, and the updated hyperparameters are used for the next round of topic mining;

Internet topic mining based on the On-LDA model: suppose that at time t_i (i ≥ 0) topic mining is to be performed on a web resource set Cⁱ; the values of the hyperparameters α and β of the On-LDA model are determined first; if topic mining is performed on the first collected web resource set C⁰ at time t₀, the initial values of α and β are calculated according to the initialization process of the On-LDA model hyperparameters; if topic mining is performed on the web resource set Cⁱ collected at time t_i (i ≥ 1), α and β take the values obtained by the dynamic update of the On-LDA model hyperparameters at the end of topic mining at the previous time (t_{i-1}); then, according to the On-LDA probabilistic graphical model and using the Gibbs sampling method, topic mining is performed on the web resource set Cⁱ to generate the topic set Zⁱ and to obtain, for every web page in Cⁱ, a semantic feature vector with respect to the topic set Zⁱ, each component of which is the probability that the web page belongs to the corresponding topic.
2. The online topic mining method based on the improved LDA model according to claim 1, wherein all the classification information of all web resource content in a given domain is represented by a set G = {cat₁, cat₂, ..., cat_g}, where g = |G| and cat_s (1 ≤ s ≤ g) represents one specific category; first, the value of the parameter k is set by the size of the set G, that is, k = g = |G|, which determines the number of topics generated by each round of mining of the On-LDA model; the hyperparameters α and β of the On-LDA model are initialized to obtain their values at the initial time t₀ [formulas not reproduced in this text];

for 1 ≤ s ≤ k, the s-th component of the initial α is computed from count_doc(cat_s), the total number of web pages in the web resource set C⁰ whose content belongs to the category cat_s (1 ≤ s ≤ k);

the entries of the s-th component of the initial β are computed from a per-word count (also rendered as count_doc(cat_s) in this text), namely the total number of times the given word appears in all web pages of C⁰ whose classification information is cat_s.
3. The online topic mining method based on the improved LDA model according to claim 1, characterized in that the update process of the On-LDA model hyperparameters is as follows: at the initial time t₀, the hyperparameters α and β of the On-LDA model take their initialization values; assume that at time t_i (i ≥ 1) α and β have their current values and that, with these values, topic mining of the web resource set Cⁱ has generated the topic set Zⁱ; the hyperparameters are then updated as follows; first, α is updated by a formula that combines a count matrix with a time weight matrix [formula not reproduced in this text];

the j-th column (0 ≤ j ≤ i) of the count matrix gives, over all web pages of the web resource set C^j, the frequency of the words belonging to each topic of the topic set Zⁱ, that is, its s-th entry is the number of words in all pages of C^j that are labelled with the s-th topic.
4. The online topic mining method based on the improved LDA model according to claim 3, wherein, considering that the further in the past web content lies relative to the current time (t_i), the less influence it has on the current topic mining, an exponential decay function is used when updating the hyperparameters of the On-LDA model to represent the weight of the web content of past moments on the current topic mining, forming a time weight matrix [formula not reproduced in this text], where λ is the attenuation factor and n₀ is the normalization constant;

next, β is updated [formula not reproduced in this text];

for 1 ≤ s ≤ k, the update uses a count matrix whose j-th column (0 ≤ j ≤ i) records, for the s-th topic, the number of occurrences of each word of the word set Wⁱ at time t_i; if the topic contains a given word, the corresponding entry equals the total number of occurrences of that word in all pages of C^j; if the topic does not contain the word, the entry equals 0; the time weight matrix is the same as before.