WO2017035922A1 - Online Internet topic mining method based on improved LDA model

Publication number
WO2017035922A1
Authority
WO
WIPO (PCT)
Prior art keywords
topic
lda model
mining
time
hyperparameter
Prior art date
Application number
PCT/CN2015/092047
Other languages
French (fr)
Chinese (zh)
Inventor
杨鹏
卢云骋
董永强
Original Assignee
杨鹏
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杨鹏
Publication of WO2017035922A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The invention belongs to the field of Internet technology and relates in particular to an online Internet topic mining method based on an improved LDA model. The method overcomes the poor fit of the traditional LDA model to dynamic mining of Internet topics, and can detect and mine, online and in real time, the topics contained in a large number of web page resources.
  • PLSA Probabilistic Latent Semantic Analysis
  • LDA Latent Dirichlet Allocation
  • Analysis shows that the PLSA model's multinomial probability model of topics is incomplete (it considers only the likelihood function and ignores the prior distribution of the parameters), and that model complexity and the amount of iterative computation increase significantly as the number of documents and words grows.
  • The LDA model relies on two Dirichlet-distribution hyperparameters. They are usually set empirically at the outset, or chosen as the best-performing values in an experiment on a specific corpus, and once set they remain unchanged throughout the topic mining process.
  • Topic mining and detection models such as PLSA and LDA are therefore generally suited to offline topic mining over a relatively static corpus; for real-time, streaming online mining of Internet topics they fall well short in soundness, timeliness, computational efficiency, and accuracy.
  • The present invention provides an online Internet topic mining method based on an improved LDA model.
  • The method is built on an improved LDA model (abbreviated On-LDA), which initially uses the classification information of the web page content to be mined to assign initial values to the hyperparameters, and then dynamically updates the hyperparameters of the On-LDA model with relevant statistics after each round of topic mining is completed.
  • The online topic mining method based on the improved LDA model effectively overcomes the limitation of traditional models such as PLSA and LDA to static, offline topic mining environments. It reflects more accurately and promptly how "topics" on the Internet evolve dynamically as new web pages continue to emerge, and thereby supports online detection and mining of the topics contained in a large number of web content resources.
  • A "topic" in the present invention is a set of subject words or phrases that are extracted from the content of a given set of web pages, normalized, and reflective of deep semantic features such as the gist and main points of the web page content.
  • The invention uses the On-LDA model as the basis for online mining of the topics contained in the large number of web page resources on the Internet.
  • The On-LDA model is an improved LDA model that supports dynamic, online topic mining.
  • The online topic mining method based on the improved LDA model corresponds to a continuous, streaming, segment-by-segment topic mining process. Each round processes n (≥ 1) web pages, which are usually collected from the Internet by web crawlers in an online, real-time manner, and mining their content generates k (≥ 1) topics. After the current n web pages are processed, the process continues with the n newly collected web pages.
  • Suppose that at the initial time t_0 topic mining is performed on a web page resource set consisting of n web pages.
  • The online topic mining method based on the improved LDA model mainly involves three computation processes: initialization of the On-LDA model hyperparameters, dynamic updating of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model.
  • The On-LDA model mainly uses the classification information of web page content to assign initial values to the hyperparameters. For web page resources in a given domain of the Internet (such as news), the content of each web page corresponds to one category of that domain (such as current affairs, military, or technology), which is content metadata of the web page.
  • The hyperparameters in the On-LDA model are initialized to obtain their values at the initial time t_0 (superscript T denotes matrix transpose).
  • count_doc(cat_s) denotes the total number of web pages in the web resource set C_0 whose content belongs to category cat_s (1 ≤ s ≤ k);
  • count_doc(cat_s) denotes the total number of times a given word appears in all web pages of C_0 that have category cat_s.
  • On-LDA model hyperparameter update process: at the initial time t_0, the hyperparameters of the On-LDA model take their initialization values. Suppose that at time t_i (i ≥ 1) the hyperparameters take their current values; topic mining is performed on the web page resource set accordingly, generating a topic set.
  • The hyperparameters are then updated as follows. First, one hyperparameter is updated to its new value using a formula.
  • The j-th (0 ≤ j ≤ i) column of the first matrix gives, over all web pages of the web resource set C_j, the frequency of the words belonging to each topic in the topic set Z_i; that is, each entry is the number of words in all pages of C_j that are labeled with the corresponding topic.
  • The j-th (0 ≤ j ≤ i) column of the second matrix gives, with the words of a given topic as reference, the number of occurrences at time t_i of every word in the word set W_i. If the topic contains a word, the corresponding entry equals the total number of times that word appears, at time t_i, in all pages of C_j; if the topic does not contain the word, the entry equals 0. The time weight matrix is the same as before.
  • The online topic mining method based on the improved LDA model fundamentally changes how the traditional LDA model assigns and uses its hyperparameters during topic mining. It makes full use of the classification information of web page content to assign initial values to the model hyperparameters, so that the initial values depend entirely on the web page content to be mined (rather than on a pre-selected corpus), which simplifies the computation and makes it better founded.
  • The values of the model hyperparameters change dynamically with the web page content already processed (rather than remaining fixed during topic mining), so that the evolution of topics on the Internet is reflected more accurately and promptly.
  • Figure 1 is the probabilistic graphical model of the improved LDA model (On-LDA model); it describes how the On-LDA model generates the corresponding word sets for all documents. In it, the hyperparameters are Dirichlet-distribution hyperparameters that take specific values at different times, and one symbol denotes the s-th column vector of the current hyperparameter, corresponding to the s-th (1 ≤ s ≤ k) topic. Suppose that topic mining is performed on the content of n web pages at some time t and k topics are generated.
  • A further symbol denotes the word distribution of the s-th (1 ≤ s ≤ k) topic; tn_{i,r} denotes the topic number assigned to the r-th word of web page c_i, and w_{i,r} denotes the r-th word of web page c_i.
  • Figure 2 shows the dynamic update process of the On-LDA model hyperparameters.
  • Figure 3 shows the Gibbs sampling process for topic mining based on the On-LDA model.
  • Z (0) is the initial value of the topic set Z i
  • One count denotes the number of times a word appears in a topic; another denotes the number of times a topic appears in a web page.
  • The probability denotes, with the topic number currently assigned to the r-th word of a web page excluded, the probability distribution of that word over each of the topics, computed from the information in the web page set C_i and the word set W_i.
  • One matrix is composed of the semantic feature vectors of the web pages as row vectors.
  • Another matrix is composed, as row vectors, of the probability distributions of the k topics over all words of W_i.
  • The probabilistic graphical model is shown in Figure 1. Its meaning is that the process of generating k topics for n web pages (documents) can essentially be regarded as the process of generating the word set of each web page (document). First, the current hyperparameter is used to generate a topic distribution for each web page c_i (1 ≤ i ≤ n), from which the topic number tn_{i,r} of each word of c_i is sampled. At the same time, each column vector of the other current hyperparameter is used to sample the word distribution of the corresponding (s-th) topic. Finally, each word w_{i,r} of c_i is sampled, which yields the word set of c_i.
  • In the example, topic mining is performed on the web resource set C_0; 20 topics are computed by Gibbs sampling, each consisting of 5 words.
  • The first four topics are listed. Then, following the dynamic update process of the On-LDA model hyperparameters in the technical solution, the hyperparameters are each updated to new values, where the high-dimensional sparse matrix is omitted.
  • The example shows that, with the online topic mining method based on the On-LDA model, the topics generated in two mining rounds separated by a certain time interval are related and reflect the dynamic evolution of topics, so that changes in news focus over time are captured promptly.
  • In applications, the online mining results for topics in Internet web content can be used not only to detect and analyze hot topics currently emerging on the network, but also, through the semantic feature vectors of web pages, to determine the similarity between web page contents for content aggregation analysis and personalized recommendation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is an online Internet topic mining method based on an improved LDA model. The method corresponds to a continuous, streaming, segment-by-segment topic mining process: each round processes n web pages, usually collected from the Internet by web crawlers in an online, real-time manner, and mining the content of these pages generates k topics. After the current n web pages are processed, the process continues with the n newly collected web pages. The process mainly comprises initialization of the On-LDA model hyperparameters, dynamic updating of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model. The invention fundamentally changes how the two hyperparameters of a traditional LDA model are assigned and used during topic mining. The classification information of the web page content is used to assign initial values to the model hyperparameters, so that the initial values depend entirely on the web page content to be mined, which simplifies the computation and makes it better founded.

Description

An Online Internet Topic Mining Method Based on an Improved LDA Model
Technical Field
The invention belongs to the field of Internet technology and relates in particular to an online Internet topic mining method based on an improved LDA model. The method overcomes the poor fit of the traditional LDA model to dynamic mining of Internet topics, and can detect and mine, online and in real time, the topics contained in a large number of web page resources.
Background
The rapid development and wide adoption of the Internet have gradually made it an important medium for quickly obtaining, publishing, and transmitting information. In recent years in particular, the mobile Internet has developed considerably; it combines the advantages of mobile communication and the Internet and makes access to information more convenient. Large volumes of information from many sources and with differing standpoints keep emerging on the Internet, and the hot and sensitive topics they reflect often spread extremely quickly over the network, with a major impact on society. Detecting and mining in real time the topics contained in large numbers of web page information resources, quickly discovering and capturing hot topics on the network, and/or aggregating and clustering Internet information resources by topic therefore play a very important role in tracking and monitoring online public opinion in real time, organizing Internet content big data, and guiding readers quickly to information of interest.
Research on complex networks has shown that the Internet has evolved into a scale-free network that obeys a power law. One main manifestation of this scale-free property is that a small number of websites receive thousands or tens of thousands of times more connections and visits than ordinary websites; they form hubs in the World Wide Web and are the main source of Internet content access traffic. Exploiting this property by applying web-crawler-based information collection to mainstream and popular websites makes it possible to gather large numbers of web page information resources dynamically and efficiently with high coverage, which provides the basis for real-time detection and mining of Internet topics. However, dynamically gathered web page resources are voluminous and varied in meaning, their content is generally highly time-sensitive, and the topics they reflect, together with the popularity of those topics, often change over time. Among existing algorithmic models for topic mining and detection, the more influential include the PLSA (Probabilistic Latent Semantic Analysis) model and the LDA (Latent Dirichlet Allocation) model. Analysis shows that the PLSA model's multinomial probability model of topics is incomplete (it considers only the likelihood function and ignores the prior distribution of the parameters), and that model complexity and the amount of iterative computation increase significantly as the number of documents and words grows. The LDA model relies on two Dirichlet-distribution hyperparameters, [Figure PCTCN2015092047-appb-000001] and [Figure PCTCN2015092047-appb-000002]. They are usually set empirically at the outset, or chosen as the best-performing values in an experiment on a specific corpus, and once set they remain unchanged throughout the topic mining process. In addition, the LDA model uses the same hyperparameter [Figure PCTCN2015092047-appb-000003] when generating the probability distributions of all topics over words, which is also not entirely reasonable. Consequently, topic mining and detection models such as PLSA and LDA are generally suited to offline topic mining over a relatively static corpus; for real-time, streaming online mining of Internet topics they fall well short in soundness, timeliness, computational efficiency, and accuracy.
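For orientation, the generative process of standard LDA, whose fixed hyperparameters the background criticizes, is commonly written as below. The symbols α, β, θ, φ, z, and w are the conventional textbook notation and are an assumption here; the patent's own symbols survive only as formula images and may differ.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,n
  && \text{(topic distribution of document } d\text{)} \\
\varphi_s &\sim \mathrm{Dirichlet}(\beta), & s &= 1,\dots,k
  && \text{(word distribution of topic } s\text{)} \\
z_{d,r} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  &&&& \text{(topic of the } r\text{-th word of } d\text{)} \\
w_{d,r} \mid z_{d,r}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,r}})
  &&&& \text{(the } r\text{-th word of } d\text{)}
\end{align*}
```

In this standard form α and β are chosen once and every topic shares the same β, which is the behaviour the paragraph above describes as a poor match for streaming web content.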
Summary of the Invention
Object of the invention: In view of the problems in the prior art, the present invention provides an online Internet topic mining method based on an improved LDA model. The method is built on an improved LDA model (abbreviated On-LDA), which initially uses the classification information of the web page content to be mined to assign initial values to the hyperparameters [Figure PCTCN2015092047-appb-000004], and then, after each round of topic mining is completed, dynamically updates the hyperparameters of the On-LDA model using relevant statistics. The online Internet topic mining method based on the improved LDA model (On-LDA model) effectively overcomes the limitation of traditional models such as PLSA and LDA to static, offline topic mining environments. It reflects more accurately and promptly how "topics" on the Internet evolve dynamically as new web pages continue to emerge, and thereby supports online detection and mining of the topics contained in a large number of web content resources.
A "topic" in the present invention is a set of subject words or phrases that are extracted from the content of a given set of web pages, normalized, and reflective of deep semantic features such as the gist and main points of the web page content. The invention uses the On-LDA model as the basis for online mining of the topics contained in the large number of web page resources on the Internet. The On-LDA model is an improved LDA model that supports dynamic, online topic mining.
Technical solution: An online Internet topic mining method based on an improved LDA model corresponds to a continuous, streaming, segment-by-segment topic mining process. Each round processes n (≥ 1) web pages, which are usually collected from the Internet by web crawlers in an online, real-time manner, and mining their content generates k (≥ 1) topics. After the current n web pages are processed, the process continues with the n newly collected web pages. Suppose that at the initial time t_0 topic mining is performed on the web page resource set [Figure PCTCN2015092047-appb-000005] consisting of n web pages; the distinct words contained in all web pages of C_0 form the set [Figure PCTCN2015092047-appb-000006], and mining generates the topic set [Figure PCTCN2015092047-appb-000007] consisting of k topics. At time t_i (i > 0), topic mining is performed on the web page resource set [Figure PCTCN2015092047-appb-000008]; here the distinct words contained in all web pages of [Figure PCTCN2015092047-appb-000009] form the set [Figure PCTCN2015092047-appb-000010] [Figure PCTCN2015092047-appb-000011], and mining generates the topic set [Figure PCTCN2015092047-appb-000012]. In W_0 and W_i above, v_0 = |W_0| and v_i = |W_i|.
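The segment-by-segment procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `segments` and `vocabulary` and the whitespace tokenizer are assumptions. Which segments contribute to W_i is defined in a formula image that is not recoverable here, so `vocabulary` simply accepts whichever segments the caller passes.

```python
from typing import Iterable, Iterator, List, Set


def segments(pages: Iterable[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive segments C_0, C_1, ... of up to n pages each."""
    if n < 1:
        raise ValueError("n must be >= 1")
    batch: List[str] = []
    for page in pages:
        batch.append(page)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # trailing partial segment when the stream ends early
        yield batch


def vocabulary(*segs: List[str]) -> Set[str]:
    """Distinct words over the given segments (toy whitespace tokenizer)."""
    return {tok for seg in segs for page in seg for tok in page.lower().split()}


if __name__ == "__main__":
    stream = ["election vote law", "chip launch", "vote recount"]
    for i, seg in enumerate(segments(stream, 2)):
        words = vocabulary(seg)
        print(f"C_{i}: {len(seg)} pages, v_{i} = {len(words)}")
```

A generator is used so that the crawler output can be consumed as an unbounded stream, with each segment mined before the next one is read.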
The online Internet topic mining method based on the improved LDA model (On-LDA model) mainly involves three computation processes: initialization of the On-LDA model hyperparameters, dynamic updating of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model.
Initialization of the On-LDA model hyperparameters. The On-LDA model mainly uses the classification information of web page content to assign initial values to the hyperparameters [Figure PCTCN2015092047-appb-000013]. For web page resources in a given domain of the Internet (such as news), the content of each web page corresponds to one category of that domain (such as current affairs, military, or technology), which is content metadata of the web page. Suppose that all categories of all web page resource content in the given domain are represented by the set G = {cat_1, cat_2, …, cat_g}, where g = |G| and cat_s (1 ≤ s ≤ g) denotes one specific category (such as current affairs). First, the size of G is used to set the parameter k, that is, k = g = |G|, which determines the number of topics produced by each round of On-LDA mining. On this basis, the hyperparameters [Figure PCTCN2015092047-appb-000014] of the On-LDA model are initialized, giving the hyperparameter values [Figure PCTCN2015092047-appb-000015] and [Figure PCTCN2015092047-appb-000016] at the initial time t_0 (superscript T denotes matrix transpose):
Figure PCTCN2015092047-appb-000017
Figure PCTCN2015092047-appb-000018
In [Figure PCTCN2015092047-appb-000019] and [Figure PCTCN2015092047-appb-000020], for 1 ≤ s ≤ k:
Figure PCTCN2015092047-appb-000021
where count_doc(cat_s) denotes the total number of web pages in the web page resource set C_0 whose content belongs to category cat_s (1 ≤ s ≤ k);
Figure PCTCN2015092047-appb-000022
where [Figure PCTCN2015092047-appb-000023] takes the following values:
Figure PCTCN2015092047-appb-000024
where count_doc(cat_s) denotes the total number of times the word [Figure PCTCN2015092047-appb-000025] appears in all web pages of C_0 that have category cat_s.
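The statistics that feed the initialization above can be gathered in a single pass over the first segment C_0. The sketch below is illustrative only: the function name `category_statistics`, the `(category, text)` page representation, and the whitespace tokenizer are assumptions. The step that turns these counts into initial hyperparameter values is deliberately left out, because the patent gives it only in formula images.

```python
from collections import Counter
from typing import Dict, List, Tuple

Page = Tuple[str, str]  # (category, text); a hypothetical representation


def category_statistics(
    pages: List[Page], categories: List[str]
) -> Tuple[int, Dict[str, int], Dict[str, Counter]]:
    """Return k = |G|, pages per category, and word counts per category."""
    doc_counts: Dict[str, int] = {cat: 0 for cat in categories}
    word_counts: Dict[str, Counter] = {cat: Counter() for cat in categories}
    for category, text in pages:
        # A category outside G raises KeyError, which flags bad metadata.
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return len(categories), doc_counts, word_counts


if __name__ == "__main__":
    c0 = [
        ("politics", "vote vote law"),
        ("tech", "chip"),
        ("politics", "law"),
    ]
    k, docs, words = category_statistics(c0, ["politics", "military", "tech"])
    print(k, docs, words["politics"].most_common(2))
```

Passing the full category set G explicitly keeps k equal to |G| even when some category has no pages in C_0, which matches the statement k = g = |G| above.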
Dynamic updating of the On-LDA model hyperparameters. During continuous, streaming topic mining, the On-LDA model promptly uses statistics to update the hyperparameters [Figure PCTCN2015092047-appb-000026] after each round of topic mining, and uses the updated hyperparameters for the next round; this differs markedly from the classic LDA model. The update process is as follows. At the initial time t_0, the hyperparameters [Figure PCTCN2015092047-appb-000027] of the On-LDA model take the initialization values [Figure PCTCN2015092047-appb-000028] and [Figure PCTCN2015092047-appb-000029] respectively. Suppose that at time t_i (i ≥ 1) the hyperparameters [Figure PCTCN2015092047-appb-000030] take the values [Figure PCTCN2015092047-appb-000031] and [Figure PCTCN2015092047-appb-000032] respectively; topic mining is performed on the web page resource set [Figure PCTCN2015092047-appb-000033] accordingly, generating the topic set [Figure PCTCN2015092047-appb-000034]. Immediately afterwards, the hyperparameters [Figure PCTCN2015092047-appb-000035] are updated as follows. First, the hyperparameter [Figure PCTCN2015092047-appb-000036] is updated to [Figure PCTCN2015092047-appb-000037] using the following formula:
Figure PCTCN2015092047-appb-000038
where [Figure PCTCN2015092047-appb-000039] and [Figure PCTCN2015092047-appb-000040] take the following values:
Figure PCTCN2015092047-appb-000041
Figure PCTCN2015092047-appb-000042
The j-th (0 ≤ j ≤ i) column of the matrix [Figure PCTCN2015092047-appb-000043] is [Figure PCTCN2015092047-appb-000044]; it gives, over all web pages of the web page resource set C_j, the frequency of the words belonging to each topic in the topic set Z_i. That is, [Figure PCTCN2015092047-appb-000045] denotes the number of words in all web pages of C_j that are labeled with topic [Figure PCTCN2015092047-appb-000046].
The further in the past web content lies relative to the current time t_i, the less it influences the current topic mining. When updating the hyperparameters of the On-LDA model, an exponential decay function can therefore be used to express the weight with which the web content of each past time influences the current topic mining, forming the time weight matrix
[Figure PCTCN2015092047-appb-000047]
where λ is the decay factor and n_0 is a normalization constant.
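The weighting idea can be illustrated with a short sketch. The exact form of the time weight matrix appears only in the formula figure, so the version below rests on assumptions: the batch collected at time t_j receives the raw weight exp(-λ(i - j)), and n_0 is taken to be the sum of the raw weights so that the weights add up to 1. The function name `time_weights` is hypothetical.

```python
import math

def time_weights(i, lam):
    """Return one weight per batch 0..i; the newest batch (j = i) weighs most.

    Assumption: raw weight exp(-lam * (i - j)), normalized to sum to 1.
    """
    raw = [math.exp(-lam * (i - j)) for j in range(i + 1)]
    n0 = sum(raw)  # normalization constant (assumed to be the plain sum)
    return [r / n0 for r in raw]

weights = time_weights(3, 0.5)
```

With a positive decay factor, each older batch receives a strictly smaller weight than the batch that follows it.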
Next, the following formula updates the hyperparameter [Figure PCTCN2015092047-appb-000048] to [Figure PCTCN2015092047-appb-000049]:
[Figure PCTCN2015092047-appb-000050]
where, for 1 ≤ s ≤ k:
[Figure PCTCN2015092047-appb-000051]
The j-th column (0 ≤ j ≤ i) of the matrix [Figure PCTCN2015092047-appb-000052] is [Figure PCTCN2015092047-appb-000053]. Taking the words of topic [Figure PCTCN2015092047-appb-000054] as the reference, it gives the number of occurrences at time t_i of every word in the word set W_i. If topic [Figure PCTCN2015092047-appb-000055] contains the word [Figure PCTCN2015092047-appb-000056], then [Figure PCTCN2015092047-appb-000057] equals the total number of times the word [Figure PCTCN2015092047-appb-000058] occurs, at time t_i, in all web pages of C_j; if topic [Figure PCTCN2015092047-appb-000059] does not contain the word [Figure PCTCN2015092047-appb-000060], then [Figure PCTCN2015092047-appb-000061] equals 0. [Figure PCTCN2015092047-appb-000062] is the same time weight matrix as above.
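A minimal sketch of how one column of this word-count matrix could be filled, assuming that a topic is represented by the set of words it contains and a batch by the flat list of word occurrences in its pages. The function `masked_word_counts` and its arguments are illustrative names, not taken from the patent.

```python
from collections import Counter

def masked_word_counts(batch_words, topic_words, vocabulary):
    """Count each vocabulary word in the batch, zeroing words outside the topic.

    batch_words : every word occurrence in the pages of one batch C_j
    topic_words : set of words that the topic contains
    vocabulary  : ordered list of the words in W_i
    """
    counts = Counter(batch_words)
    return [counts[w] if w in topic_words else 0 for w in vocabulary]

column = masked_word_counts(
    batch_words=["a", "b", "a", "c"],
    topic_words={"a", "c"},
    vocabulary=["a", "b", "c", "d"],
)
```

Here "b" occurs in the batch but is not part of the topic, so its entry is 0, matching the rule stated above.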
Internet topic mining based on the On-LDA model. Suppose that at time t_i (i ≥ 0) topic mining is to be performed on the web resource set [Figure PCTCN2015092047-appb-000063]. The values of the On-LDA model hyperparameters [Figure PCTCN2015092047-appb-000064] are determined first. If topic mining is performed at time t_0 on the first collected web resource set C_0, the initial values [Figure PCTCN2015092047-appb-000066] and [Figure PCTCN2015092047-appb-000067] of the hyperparameters [Figure PCTCN2015092047-appb-000065] are computed by the hyperparameter initialization process of the On-LDA model. If topic mining is performed at time t_i (i ≥ 1) on the collected web resource set C_i, the hyperparameters [Figure PCTCN2015092047-appb-000068] take the values [Figure PCTCN2015092047-appb-000069] and [Figure PCTCN2015092047-appb-000070] obtained from the dynamic hyperparameter update at the end of topic mining at the previous time t_{i-1}. Then, following the On-LDA probabilistic graphical model and using the Gibbs sampling method shown in Figure 3, topic mining is performed on the web resource set C_i. This generates the topic set [Figure PCTCN2015092047-appb-000071] and, for each web page [Figure PCTCN2015092047-appb-000072] (1 ≤ u ≤ n) in C_i, its semantic feature vector [Figure PCTCN2015092047-appb-000073] with respect to the topic set Z_i, where [Figure PCTCN2015092047-appb-000074] (1 ≤ s ≤ k) is the probability that web page [Figure PCTCN2015092047-appb-000075] belongs to topic [Figure PCTCN2015092047-appb-000076].
It should be noted that, in Internet topic mining based on the On-LDA model, not only are the values of the hyperparameters [Figure PCTCN2015092047-appb-000077] dynamically updated from past mining information, but when the probability distributions of the k topics over all words are generated at time t_i, different topics also use different hyperparameters (namely the k different components [Figure PCTCN2015092047-appb-000079] of [Figure PCTCN2015092047-appb-000078]). This is far more reasonable than the fixed, preset hyperparameters [Figure PCTCN2015092047-appb-000080] that the traditional LDA model uses throughout.
Beneficial effects: The online Internet topic mining method based on the improved LDA model (On-LDA model) fundamentally changes how the hyperparameters [Figure PCTCN2015092047-appb-000081] are assigned and used during topic mining in the traditional LDA model. It makes full use of the classification information of web content to assign initial values to the model hyperparameters [Figure PCTCN2015092047-appb-000082], so that the initial values depend entirely on the web content to be mined (rather than on a preselected corpus), which both simplifies the computation and is more reasonable.
In addition, the values of the model hyperparameters [Figure PCTCN2015092047-appb-000083] change dynamically with the web content already processed (rather than staying fixed during topic mining), so they reflect the evolution of topics on the Internet more accurately and promptly. These features mean that the invention is no longer limited to static, offline topic mining environments; in particular, for online detection and mining of Internet topics it shows better timeliness, computational efficiency and accuracy than traditional topic mining methods.
DRAWINGS
Figure 1 is the probabilistic graphical model of the improved LDA model (On-LDA model); it describes how the On-LDA model generates the word set of every document. Here [Figure PCTCN2015092047-appb-000084] are the hyperparameters of the Dirichlet distributions, which take specific values at different times, and [Figure PCTCN2015092047-appb-000085] is the s-th column vector of the current hyperparameter [Figure PCTCN2015092047-appb-000086], corresponding to the s-th topic (1 ≤ s ≤ k). Suppose that at some time t topic mining is performed on the content of n web pages and k topics are generated. Then [Figure PCTCN2015092047-appb-000087] is the topic distribution of the i-th web page c_i (1 ≤ i ≤ n), [Figure PCTCN2015092047-appb-000088] is the word distribution of the s-th topic (1 ≤ s ≤ k), tn_{i,r} is the topic number assigned to the r-th word of web page c_i, and w_{i,r} is the r-th word of web page c_i.
Figure 2 shows the dynamic update process of the On-LDA model hyperparameters [Figure PCTCN2015092047-appb-000089].
Figure 3 shows the Gibbs sampling process for topic mining based on the On-LDA model. Here Z^(0) is the initial value of the topic set Z_i, [Figure PCTCN2015092047-appb-000090] is the number of times the word [Figure PCTCN2015092047-appb-000091] appears in topic [Figure PCTCN2015092047-appb-000092], and [Figure PCTCN2015092047-appb-000093] is the number of times topic [Figure PCTCN2015092047-appb-000094] appears in web page [Figure PCTCN2015092047-appb-000095]. The probability [Figure PCTCN2015092047-appb-000096] is computed with the topic number currently assigned to the r-th word of web page [Figure PCTCN2015092047-appb-000097] excluded, using the information in the web page set C_i and the word set W_i; it is the probability distribution of the r-th word of web page [Figure PCTCN2015092047-appb-000098] over each of the remaining topics. Θ is the matrix whose row vectors are the semantic feature vectors [Figure PCTCN2015092047-appb-000101] of the web pages [Figure PCTCN2015092047-appb-000099] [Figure PCTCN2015092047-appb-000100]. Φ is the matrix whose row vectors are the probability distributions of the k topics over all words in W_i.
DETAILED DESCRIPTION
The invention is further illustrated below with reference to specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms by those skilled in the art fall within the scope defined by the claims appended to this application.
(1) The On-LDA model serves as the basis for online mining of the topics contained in the large volume of web resources on the Internet. The On-LDA model is an improved LDA model that supports dynamic, online topic mining; its probabilistic graphical model is shown in Figure 1. Its meaning is as follows: mining n web pages (documents) to generate k topics can be viewed, in essence, as the process that generates the word set of each web page (document). First, the current hyperparameter [Figure PCTCN2015092047-appb-000102] is used to sample, for each web page c_i (1 ≤ i ≤ n), its topic distribution [Figure PCTCN2015092047-appb-000103]; then the topic number tn_{i,r} of each word of web page c_i is sampled according to [Figure PCTCN2015092047-appb-000104]. Meanwhile, each column vector [Figure PCTCN2015092047-appb-000106] of the current hyperparameter [Figure PCTCN2015092047-appb-000105] is used to sample the word distribution [Figure PCTCN2015092047-appb-000107] of the corresponding topic (the s-th topic). Finally, each word w_{i,r} of web page c_i is generated by sampling, which yields the word set of web page c_i.
The online Internet topic mining method based on the On-LDA model corresponds to a continuous, streaming, batch-by-batch topic mining process. Each round processes n (≥ 1) web pages, usually collected from the Internet by a web crawler online and in real time, and mining their content generates k (≥ 1) topics. After the current n web pages have been processed, the process continues with the next n newly collected web pages. Suppose that at the initial time t_0 topic mining is performed on the web resource set [Figure PCTCN2015092047-appb-000108] consisting of n web pages, that the distinct words contained in all web pages of C_0 form the set [Figure PCTCN2015092047-appb-000109], and that mining generates the topic set [Figure PCTCN2015092047-appb-000110] [Figure PCTCN2015092047-appb-000111] consisting of k topics. At time t_i (i > 0), topic mining is performed on the web resource set [Figure PCTCN2015092047-appb-000112]; the distinct words contained in all web pages of the set [Figure PCTCN2015092047-appb-000113] then form the set [Figure PCTCN2015092047-appb-000114], and mining generates the topic set [Figure PCTCN2015092047-appb-000115]. For the sets W_0 and W_i above, v_0 = |W_0| and v_i = |W_i|.
The online Internet topic mining method based on the improved LDA model (On-LDA model) mainly involves three computational processes: initialization of the On-LDA model hyperparameters, dynamic update of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model.
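The generative reading described in (1) can be sketched in a few lines. This is an illustration under stated assumptions rather than the patent's implementation: the names `sample_dirichlet`, `sample_categorical` and `generate_page` are hypothetical, the hyperparameter values are arbitrary, and Dirichlet samples are drawn by normalizing independent gamma draws.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_page(alpha, phi, length, rng):
    """Generate one page: topic distribution, topic numbers, then words."""
    theta = sample_dirichlet(alpha, rng)           # topic distribution of the page
    topics, words = [], []
    for _ in range(length):
        s = sample_categorical(theta, rng)         # topic number of this word
        topics.append(s)
        words.append(sample_categorical(phi[s], rng))  # word id from topic s
    return theta, topics, words

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]                            # k = 3 topics (arbitrary values)
beta = [[0.5] * 5 for _ in range(3)]               # one prior vector per topic, 5 words
phi = [sample_dirichlet(b, rng) for b in beta]     # word distribution of each topic
theta, topics, words = generate_page(alpha, phi, length=8, rng=rng)
```

Giving each topic its own prior vector in `beta` mirrors the per-topic column vectors mentioned in the text; a classic LDA sketch would reuse a single vector for all topics.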
(2) Initialization of the On-LDA model hyperparameters. The On-LDA model mainly uses the classification information of web content to assign initial values to the hyperparameters [Figure PCTCN2015092047-appb-000116]. For web resources in a given domain of the Internet (such as news), the content of each web page corresponds to one piece of classification information in that domain (such as current affairs, military, or science and technology), which is content metadata of the page. Let the full set of classification information for all web resource content in the given domain be G = {cat_1, cat_2, …, cat_g}, where g = |G| and cat_s (1 ≤ s ≤ g) is one specific category (such as current affairs). The size of G first fixes the value of the parameter k, that is, k = g = |G|, which determines the number of topics the On-LDA model generates in each mining round. On this basis, the hyperparameters [Figure PCTCN2015092047-appb-000117] of the On-LDA model are initialized to obtain their values [Figure PCTCN2015092047-appb-000118] and [Figure PCTCN2015092047-appb-000119] at the initial time t_0 (the superscript T denotes matrix transpose):
[Figure PCTCN2015092047-appb-000120]
[Figure PCTCN2015092047-appb-000121]
In [Figure PCTCN2015092047-appb-000122] and [Figure PCTCN2015092047-appb-000123], for 1 ≤ s ≤ k:
[Figure PCTCN2015092047-appb-000124]
where count_doc(cat_s) is the total number of web pages in the web resource set C_0 whose content belongs to the category cat_s (1 ≤ s ≤ k);
[Figure PCTCN2015092047-appb-000125]
where [Figure PCTCN2015092047-appb-000126] takes the following values:
[Figure PCTCN2015092047-appb-000127]
where count_doc(cat_s) denotes the total number of times the word [Figure PCTCN2015092047-appb-000128] appears in all web pages of C_0 that have the classification information cat_s.
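The two statistics that the initialization relies on, pages per category and word occurrences per category, can be gathered as in the sketch below. How these counts are turned into the initial hyperparameter values is defined by the formula figures and is not reproduced here; the function `category_statistics` and the `(category, words)` page representation are assumptions made for illustration.

```python
from collections import Counter

def category_statistics(pages, categories):
    """Collect per-category page counts and per-category word counts.

    pages      : list of (category, list_of_words) pairs for the first batch C_0
    categories : ordered list of category labels (the set G)
    """
    doc_counts = Counter(cat for cat, _ in pages)
    word_counts = {cat: Counter() for cat in categories}
    for cat, words in pages:
        word_counts[cat].update(words)
    # One page count per category, in the order given by `categories`.
    return [doc_counts[cat] for cat in categories], word_counts

pages = [
    ("politics", ["vote", "law"]),
    ("tech", ["chip"]),
    ("politics", ["vote"]),
]
categories = ["politics", "tech", "military"]
k = len(categories)  # number of topics equals the number of categories
docs_per_category, words_per_category = category_statistics(pages, categories)
```

A category with no pages in the first batch, such as "military" here, ends up with a zero count, so any real initialization would need to decide how to treat that case.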
(3) Dynamic update of the On-LDA model hyperparameters. During continuous, streaming topic mining, each time a round of topic mining completes, the On-LDA model uses the resulting statistics to dynamically update the hyperparameters [Figure PCTCN2015092047-appb-000129] and applies the updated hyperparameters to the next round of topic mining, which differs significantly from the classic LDA model. The update process of the On-LDA model hyperparameters is shown in Figure 2. At the initial time t_0, the hyperparameters [Figure PCTCN2015092047-appb-000130] of the On-LDA model take the initialization values [Figure PCTCN2015092047-appb-000131] and [Figure PCTCN2015092047-appb-000132] respectively. Suppose that at time t_i (i ≥ 1) the hyperparameters [Figure PCTCN2015092047-appb-000133] take the values [Figure PCTCN2015092047-appb-000134] and [Figure PCTCN2015092047-appb-000135] respectively, and that topic mining is performed accordingly on the web resource set [Figure PCTCN2015092047-appb-000136], generating the topic set [Figure PCTCN2015092047-appb-000137]. The hyperparameters [Figure PCTCN2015092047-appb-000138] are then updated as follows. First, the following formula updates the hyperparameter [Figure PCTCN2015092047-appb-000139] to [Figure PCTCN2015092047-appb-000140]:
[Figure PCTCN2015092047-appb-000141]
where [Figure PCTCN2015092047-appb-000142] and [Figure PCTCN2015092047-appb-000143] take the following values:
[Figure PCTCN2015092047-appb-000144]
[Figure PCTCN2015092047-appb-000145]
The j-th column (0 ≤ j ≤ i) of the matrix [Figure PCTCN2015092047-appb-000146] is [Figure PCTCN2015092047-appb-000147]. It gives, over all web pages in the web resource set C_j, the frequency of the words belonging to each topic in the topic set Z_i; that is, [Figure PCTCN2015092047-appb-000148] is the number of words in all web pages of C_j that are labeled with topic [Figure PCTCN2015092047-appb-000149].
The further in the past web content lies relative to the current time t_i, the less it influences the current topic mining. When updating the hyperparameters of the On-LDA model, an exponential decay function can therefore be used to express the weight with which the web content of each past time influences the current topic mining, forming the time weight matrix
[Figure PCTCN2015092047-appb-000150]
where λ is the decay factor and n_0 is a normalization constant.
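One plausible way to combine the per-batch topic counts with the time weights is a weighted sum over batches, sketched below. The actual update formula is contained in the figures, so this combination is an assumption, as are the function name `weighted_topic_counts` and the representation of each batch as a flat list of topic indices.

```python
def weighted_topic_counts(batch_assignments, k, weights):
    """Sum, per topic, the time-weighted number of words labeled with it.

    batch_assignments[j] : topic index of every word in batch C_j
    weights[j]           : time weight of batch C_j (older batches weigh less)
    """
    totals = [0.0] * k
    for assignments, weight in zip(batch_assignments, weights):
        for s in assignments:
            totals[s] += weight
    return totals

totals = weighted_topic_counts(
    batch_assignments=[[0, 0, 1], [1, 2]],  # an older batch, then the newest one
    k=3,
    weights=[0.25, 0.75],
)
```

In this toy input the two words labeled with topic 0 come from the older batch and contribute 0.25 each, while the single word labeled with topic 2 in the newest batch contributes 0.75.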
Next, the following formula updates the hyperparameter [Figure PCTCN2015092047-appb-000151] to [Figure PCTCN2015092047-appb-000152]:
[Figure PCTCN2015092047-appb-000153]
where, for 1 ≤ s ≤ k:
[Figure PCTCN2015092047-appb-000154]
The j-th column (0 ≤ j ≤ i) of the matrix [Figure PCTCN2015092047-appb-000155] is [Figure PCTCN2015092047-appb-000156]. Taking the words of topic [Figure PCTCN2015092047-appb-000157] as the reference, it gives the number of occurrences at time t_i of every word in the word set W_i. If topic [Figure PCTCN2015092047-appb-000158] contains the word [Figure PCTCN2015092047-appb-000159], then [Figure PCTCN2015092047-appb-000160] equals the total number of times the word [Figure PCTCN2015092047-appb-000161] occurs, at time t_i, in all web pages of C_j; if topic [Figure PCTCN2015092047-appb-000162] does not contain the word [Figure PCTCN2015092047-appb-000163], then [Figure PCTCN2015092047-appb-000164] equals 0. [Figure PCTCN2015092047-appb-000165] is the same time weight matrix as above.
(4) Internet topic mining based on the On-LDA model. Suppose that at time t_i (i ≥ 0) topic mining is to be performed on the web resource set [Figure PCTCN2015092047-appb-000166]. The values of the On-LDA model hyperparameters [Figure PCTCN2015092047-appb-000167] are determined first. If topic mining is performed at time t_0 on the first collected web resource set C_0, the initial values [Figure PCTCN2015092047-appb-000169] and [Figure PCTCN2015092047-appb-000170] of the hyperparameters [Figure PCTCN2015092047-appb-000168] are computed by the hyperparameter initialization process of the On-LDA model. If topic mining is performed at time t_i (i ≥ 1) on the collected web resource set C_i, the hyperparameters [Figure PCTCN2015092047-appb-000171] take the values [Figure PCTCN2015092047-appb-000172] and [Figure PCTCN2015092047-appb-000173] obtained from the dynamic hyperparameter update at the end of topic mining at the previous time t_{i-1}. Then, following the On-LDA probabilistic graphical model shown in Figure 1 and using the Gibbs sampling method shown in Figure 3, topic mining is performed on the web resource set C_i. This generates the topic set [Figure PCTCN2015092047-appb-000174] and, for each web page [Figure PCTCN2015092047-appb-000175] in C_i, its semantic feature vector [Figure PCTCN2015092047-appb-000176] with respect to the topic set Z_i, where [Figure PCTCN2015092047-appb-000177] is the probability that web page [Figure PCTCN2015092047-appb-000178] belongs to topic [Figure PCTCN2015092047-appb-000179].
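For orientation, the mining step can be sketched as a standard collapsed Gibbs sampler for LDA in which every topic has its own word prior. This is a generic textbook-style sampler written under that assumption, not the procedure of Figure 3; the function `gibbs_lda`, its parameters and the toy data are illustrative only.

```python
import random

def gibbs_lda(docs, k, v, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with one beta vector per topic.

    docs  : list of pages, each a list of word ids in [0, v)
    alpha : list of k positive floats (page-topic prior)
    beta  : k lists of v positive floats (beta[s] is the prior of topic s)
    Returns the topic assignments and the per-page topic probabilities.
    """
    rng = random.Random(seed)
    n_dk = [[0] * k for _ in docs]      # words of page d assigned to topic s
    n_kw = [[0] * v for _ in range(k)]  # times word w is assigned to topic s
    n_k = [0] * k                       # total words assigned to topic s
    beta_sum = [sum(row) for row in beta]

    def adjust(d, s, w, delta):
        n_dk[d][s] += delta
        n_kw[s][w] += delta
        n_k[s] += delta

    # Random initial topic assignment for every word.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(k) for _ in doc])
        for s, w in zip(z[d], doc):
            adjust(d, s, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for r, w in enumerate(doc):
                adjust(d, z[d][r], w, -1)  # exclude the current assignment
                weights = [
                    (n_dk[d][t] + alpha[t])
                    * (n_kw[t][w] + beta[t][w]) / (n_k[t] + beta_sum[t])
                    for t in range(k)
                ]
                z[d][r] = rng.choices(range(k), weights=weights, k=1)[0]
                adjust(d, z[d][r], w, +1)

    alpha_sum = sum(alpha)
    theta = [
        [(n_dk[d][t] + alpha[t]) / (len(doc) + alpha_sum) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    return z, theta

docs = [[0, 1, 0, 2], [3, 4, 3], [0, 3, 1]]
z, theta = gibbs_lda(docs, k=2, v=5, alpha=[0.5, 0.5],
                     beta=[[0.1] * 5, [0.1] * 5])
```

Each row of `theta` plays the role of a page's semantic feature vector: entry s estimates the probability that the page belongs to topic s.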
It should be noted that, in Internet topic mining based on the On-LDA model, not only are the values of the hyperparameters [Figure PCTCN2015092047-appb-000180] dynamically updated from past mining information, but when the probability distributions of the k topics over all words are generated at time t_i, different topics also use different hyperparameters (namely the k different components [Figure PCTCN2015092047-appb-000182] of [Figure PCTCN2015092047-appb-000181]). This is far more reasonable than the fixed, preset hyperparameters [Figure PCTCN2015092047-appb-000183] that the traditional LDA model uses throughout.
The online Internet topic mining method based on the improved LDA model (On-LDA model) proposed by the invention is verified below through an example, as follows:
(1) First, using a corpus of web resources from a given Internet domain (such as news), the full classification information of all web resource content in that domain is compiled to obtain the set G = {cat_1, cat_2, …, cat_g}, and the parameter k is set to the size of this set. For example, to apply the invention to online topic mining and real-time detection on Internet news pages, the web content of mainstream and popular news sites is first classified, yielding 20 categories (including current affairs, international, rule of law, military, science and technology, and so on); the parameter is therefore set to k = 20.
(2) Next, tools such as a web crawler collect popular news web resources from the network in real time, batch by batch, and topic mining is performed once for every n web pages collected. In this example n = 1000. Let t_0 denote the time at which the first 1000 news pages have been collected; these pages form the web resource set C_0, and the classification information of each page is recorded when it is collected. Following the hyperparameter initialization process of the On-LDA model in the technical solution, the initial values [Figure PCTCN2015092047-appb-000185] and [Figure PCTCN2015092047-appb-000186] of the hyperparameters [Figure PCTCN2015092047-appb-000184] are computed, where
[Figure PCTCN2015092047-appb-000187]
[Figure PCTCN2015092047-appb-000188]
and [Figure PCTCN2015092047-appb-000189] corresponds to a high-dimensional sparse matrix, which is omitted here.
Topic mining is then performed on the web resource set C_0 based on the On-LDA model; Gibbs sampling yields 20 topics, each consisting of 5 words. The first four topics are:
[Figure PCTCN2015092047-appb-000190]
[Figure PCTCN2015092047-appb-000191]
[Figure PCTCN2015092047-appb-000192]
Immediately afterwards, following the dynamic hyperparameter update process of the On-LDA model in the technical solution, the hyperparameters [Figure PCTCN2015092047-appb-000193] are updated to [Figure PCTCN2015092047-appb-000194] and [Figure PCTCN2015092047-appb-000195] respectively, where
[Figure PCTCN2015092047-appb-000196]
[Figure PCTCN2015092047-appb-000197]
and the high-dimensional sparse matrix [Figure PCTCN2015092047-appb-000198] is omitted.
(3) Next, in this example, each time 1000 popular news web resources have been collected in real time, 20 topics are first mined from these pages following the On-LDA-based Internet topic mining process of the technical solution; at the end of mining, the hyperparameters [Figure PCTCN2015092047-appb-000199] are updated following the dynamic hyperparameter update process of the On-LDA model in the technical solution.
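The batch-by-batch control flow of steps (2) and (3) can be summarized in a small driver loop. The sketch only fixes the order of operations; the three callables `initialize`, `mine` and `update` are hypothetical stand-ins for the initialization, mining and update processes, and the toy lambdas below exist purely so that the loop can be run.

```python
def run_stream(batches, initialize, mine, update):
    """Process batches in arrival order and return the topics mined from each."""
    hyper = None
    history = []
    mined = []
    for i, batch in enumerate(batches):
        if i == 0:
            hyper = initialize(batch)           # time t_0: initial hyperparameters
        topics = mine(batch, hyper)             # mine with the current hyperparameters
        history.append(batch)
        mined.append(topics)
        hyper = update(hyper, history, topics)  # values used for the next batch
    return mined

# Toy stand-ins: the "hyperparameter" is a counter and the "topics" are
# the sorted distinct words of the batch.
mined = run_stream(
    batches=[["a", "b"], ["b", "c"], ["c"]],
    initialize=lambda batch: 1,
    mine=lambda batch, hyper: (hyper, sorted(set(batch))),
    update=lambda hyper, history, topics: hyper + 1,
)
```

Passing the full `history` to `update` reflects the fact that the update described in the text draws on all batches collected so far, not only the latest one.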
(4) After topic mining has continued for some time, topic mining is performed at time t_10 on the 1000 newly collected popular news pages, yielding the 20 topics of the current time. The first four topics are:
[Figure PCTCN2015092047-appb-000200]
[Figure PCTCN2015092047-appb-000201]
[Figure PCTCN2015092047-appb-000202]
[Figure PCTCN2015092047-appb-000203]
At the end of topic mining at time t_10, the On-LDA model hyperparameters [Figure PCTCN2015092047-appb-000204] are dynamically updated to [Figure PCTCN2015092047-appb-000205] and [Figure PCTCN2015092047-appb-000206], where
[Figure PCTCN2015092047-appb-000207]
[Figure PCTCN2015092047-appb-000208]
and the high-dimensional sparse matrix [Figure PCTCN2015092047-appb-000209] is omitted.
The example above shows that, with the online Internet topic mining method based on the On-LDA model, the topics generated in two mining rounds separated by some time interval are related to each other yet also exhibit the dynamic evolution of topics, promptly reflecting how the focus of the news changes over time. Based on the online mining results for web content topics on the Internet, an application can both detect hot topics emerging on the current network in real time and analyze public opinion, and use the semantic feature vectors of web pages to judge the similarity between page contents and carry out content aggregation analysis, personalized recommendation and the like.
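As one possible way to compare pages by their semantic feature vectors, the sketch below uses cosine similarity. The text does not name a similarity measure, so this choice is an assumption, and `cosine_similarity` is an illustrative helper rather than part of the described method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

same_topic = cosine_similarity([1.0, 0.0], [1.0, 0.0])
different_topic = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Pages concentrated on the same topic score close to 1, while pages concentrated on different topics score close to 0.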

Claims (4)

  1. An online Internet topic mining method based on an improved LDA model, characterized in that it comprises initialization of the On-LDA model hyperparameters, dynamic updating of the On-LDA model hyperparameters, and Internet topic mining based on the On-LDA model;
    Initialization of the On-LDA model hyperparameters: the hyperparameters
    Figure PCTCN2015092047-appb-100001
    are assigned initial values by the On-LDA model using the classification information of web page content;
    Dynamic updating of the On-LDA model hyperparameters: during continuous, streaming topic mining, each time a topic mining run completes, the On-LDA model promptly uses statistical information to dynamically update the hyperparameters
    Figure PCTCN2015092047-appb-100002
    and uses the updated hyperparameters for the next topic mining run;
    Internet topic mining based on the On-LDA model: suppose that at time t_i (i ≥ 0) the web resource set
    Figure PCTCN2015092047-appb-100003
    is to be mined for topics; the values of the On-LDA model hyperparameters
    Figure PCTCN2015092047-appb-100004
    are determined first; if topic mining is performed at time t_0 on the first collected web resource set C_0, the hyperparameters
    Figure PCTCN2015092047-appb-100005
    are computed according to the On-LDA hyperparameter initialization process, giving the initial values
    Figure PCTCN2015092047-appb-100006
    and
    Figure PCTCN2015092047-appb-100007
    if topic mining is performed at time t_i (i ≥ 1) on the collected web resource set C_i, the hyperparameters
    Figure PCTCN2015092047-appb-100008
    take the values obtained from the dynamic update of the On-LDA model hyperparameters at the end of topic mining at the previous time (t_(i-1)), namely
    Figure PCTCN2015092047-appb-100009
    and
    Figure PCTCN2015092047-appb-100010
    then, following the On-LDA probabilistic graphical model and using Gibbs sampling, topic mining is performed on the web resource set C_i to generate the topic set
    Figure PCTCN2015092047-appb-100011
    and to obtain, for each web page
    Figure PCTCN2015092047-appb-100012
    in C_i, the semantic feature vector
    Figure PCTCN2015092047-appb-100013
    corresponding to the topic set Z_i, where
    Figure PCTCN2015092047-appb-100014
    is, for web page
    Figure PCTCN2015092047-appb-100015
    and topic
    Figure PCTCN2015092047-appb-100016
    the probability that the page belongs to that topic.
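The mining step above relies on Gibbs sampling. The On-LDA probabilistic graphical model itself is given only in figures, so the following is a minimal sketch of a standard collapsed Gibbs sampler for plain LDA, used here as an assumed stand-in. It accepts a per-topic prior vector `alpha` and a per-topic, per-word prior matrix `beta`, and returns one topic-probability vector per document, which plays the role of the semantic feature vector. The function name, argument layout, and toy data are hypothetical.

```python
import random


def gibbs_lda(docs, alpha, beta, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with per-topic priors.

    docs  : list of documents, each a list of word ids in [0, V)
    alpha : list of k strictly positive document-topic prior weights
    beta  : k x V nested list of strictly positive topic-word prior weights
    Returns (theta, z): theta[d][s] estimates P(topic s | document d),
    and z[d][pos] is the topic assigned to each word occurrence.
    """
    rng = random.Random(seed)
    k = len(alpha)
    vocab_size = len(beta[0])
    beta_sum = [sum(row) for row in beta]
    n_dk = [[0] * k for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(k)]  # word counts per topic
    n_k = [0] * k                                # total word count per topic

    # Random initial topic assignment for every word occurrence.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            s = rng.randrange(k)
            z_d.append(s)
            n_dk[d][s] += 1
            n_kw[s][w] += 1
            n_k[s] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                # Remove the current assignment from the counts.
                s = z[d][pos]
                n_dk[d][s] -= 1
                n_kw[s][w] -= 1
                n_k[s] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha[t])
                    * (n_kw[t][w] + beta[t][w])
                    / (n_k[t] + beta_sum[t])
                    for t in range(k)
                ]
                s = rng.choices(range(k), weights=weights)[0]
                z[d][pos] = s
                n_dk[d][s] += 1
                n_kw[s][w] += 1
                n_k[s] += 1

    alpha_sum = sum(alpha)
    theta = [
        [(n_dk[d][t] + alpha[t]) / (len(doc) + alpha_sum) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    return theta, z


# Toy corpus: 3 documents over a 4-word vocabulary, k = 2 topics.
docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
alpha = [0.5, 0.5]
beta = [[0.1] * 4 for _ in range(2)]
theta, z = gibbs_lda(docs, alpha, beta, n_iter=20, seed=1)
```

In an online setting, `alpha` and `beta` would come from the initialization step at t_0 and from the dynamic update at every later time, rather than being fixed constants as in this toy call.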
  2. The online Internet topic mining method based on the improved LDA model according to claim 1, characterized in that: suppose all classification information of all web resource content in a given domain is represented by the set G = {cat_1, cat_2, …, cat_g}, where g = |G| and cat_s (1 ≤ s ≤ g) denotes one specific classification; first, the size of the set G is used to set the parameter k, that is, k = g = |G|, which determines the number of topics produced by each mining run of the On-LDA model; the hyperparameters in the On-LDA model
    Figure PCTCN2015092047-appb-100017
    are initialized to obtain the hyperparameter values at the initial time t_0,
    Figure PCTCN2015092047-appb-100018
    and
    Figure PCTCN2015092047-appb-100019
    Figure PCTCN2015092047-appb-100020
    Figure PCTCN2015092047-appb-100021
    In
    Figure PCTCN2015092047-appb-100022
    and
    Figure PCTCN2015092047-appb-100023
    for 1 ≤ s ≤ k:
    Figure PCTCN2015092047-appb-100024
    where count_doc(cat_s) denotes the total number of web pages in the web resource set C_0 whose content belongs to the classification cat_s (1 ≤ s ≤ k);
    Figure PCTCN2015092047-appb-100025
    where
    Figure PCTCN2015092047-appb-100026
    takes the following value:
    Figure PCTCN2015092047-appb-100027
    where count_doc(cat_s) denotes the total number of times the word
    Figure PCTCN2015092047-appb-100028
    appears in all web pages in C_0 that have the classification cat_s.
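The initialization described in this claim can be sketched as follows. The exact formulas are given only in figures, so the normalization and the additive smoothing used here are assumptions chosen to keep every prior strictly positive. What the sketch takes from the text is that k equals the number of categories, that the document-topic prior is driven by the per-category page counts in C_0, and that the topic-word prior is driven by per-category word occurrence counts. All names are hypothetical.

```python
from collections import Counter


def init_hyperparameters(pages, categories, vocabulary, smoothing=0.01):
    """Derive initial priors from labeled pages of the initial set C_0.

    pages      : list of (category, [words]) pairs; every category must be
                 in `categories` and every word in `vocabulary`
    categories : list of the k category names (one topic per category)
    vocabulary : list of all distinct words
    Returns (alpha0, beta0) with len(alpha0) == k and beta0 of shape k x V.
    """
    k = len(categories)
    # Number of pages per category, i.e. count_doc(cat_s) in the claim.
    doc_counts = Counter(cat for cat, _ in pages)
    total_docs = len(pages)
    # ASSUMPTION: smoothed relative frequency of each category.
    alpha0 = [
        (doc_counts[cat] + smoothing) / (total_docs + k * smoothing)
        for cat in categories
    ]

    # Word occurrence counts per category.
    word_counts = {cat: Counter() for cat in categories}
    for cat, words in pages:
        word_counts[cat].update(words)

    beta0 = []
    for cat in categories:
        total = sum(word_counts[cat].values()) + smoothing * len(vocabulary)
        # ASSUMPTION: smoothed relative frequency of each word in the category.
        beta0.append([(word_counts[cat][w] + smoothing) / total for w in vocabulary])
    return alpha0, beta0


categories = ["sports", "finance"]
vocabulary = ["match", "goal", "team", "stock", "market"]
pages = [
    ("sports", ["match", "goal", "match"]),
    ("sports", ["goal", "team"]),
    ("finance", ["stock", "market"]),
]
alpha0, beta0 = init_hyperparameters(pages, categories, vocabulary)
```

With this toy input the "sports" topic receives the larger document-topic prior because two of the three pages carry that label, and each row of `beta0` favors the words seen under its own category.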
  3. The online Internet topic mining method based on the improved LDA model according to claim 1, characterized by the following update process for the On-LDA model hyperparameters: at the initial time t_0, the hyperparameters in the On-LDA model
    Figure PCTCN2015092047-appb-100029
    take the initialization values
    Figure PCTCN2015092047-appb-100030
    and
    Figure PCTCN2015092047-appb-100031
    respectively; suppose that at time t_i (i ≥ 1) the hyperparameters
    Figure PCTCN2015092047-appb-100032
    take the values
    Figure PCTCN2015092047-appb-100033
    and
    Figure PCTCN2015092047-appb-100034
    respectively, and that topic mining is performed accordingly on the web resource set
    Figure PCTCN2015092047-appb-100035
    to generate the topic set
    Figure PCTCN2015092047-appb-100036
    immediately afterwards, the hyperparameters
    Figure PCTCN2015092047-appb-100037
    are updated as follows. First, the hyperparameter
    Figure PCTCN2015092047-appb-100038
    is updated to
    Figure PCTCN2015092047-appb-100039
    using the following formula:
    Figure PCTCN2015092047-appb-100040
    where
    Figure PCTCN2015092047-appb-100041
    and
    Figure PCTCN2015092047-appb-100042
    take the following values:
    Figure PCTCN2015092047-appb-100043
    Figure PCTCN2015092047-appb-100044
    The j-th (0 ≤ j ≤ i) column of the matrix
    Figure PCTCN2015092047-appb-100045
    is
    Figure PCTCN2015092047-appb-100046
    It represents, over all web pages of the web resource set C_j, the frequency of the words corresponding to each topic in the topic set Z_i, that is,
    Figure PCTCN2015092047-appb-100047
    denotes the number of words labeled as topic
    Figure PCTCN2015092047-appb-100048
    in all web pages of C_j.
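One column of the count matrix described in this claim can be computed directly from per-word topic labels. The sketch below assumes the labels for the pages of C_j are available as one list of topic indices per page, for example as produced by a Gibbs sampler; the function name and data layout are hypothetical.

```python
def topic_word_counts(assignments, k):
    """Count, for each of k topics, the words labeled with that topic.

    assignments : list of pages, each a list of topic indices in [0, k),
                  one index per word occurrence
    Returns a list of k counts, i.e. one column of the count matrix.
    """
    counts = [0] * k
    for page in assignments:
        for topic in page:
            counts[topic] += 1
    return counts


# Two pages with 3 and 2 labeled words, k = 3 topics.
column = topic_word_counts([[0, 1, 1], [2, 1]], k=3)
```

Stacking one such column per time t_0 through t_i gives a matrix with k rows and i + 1 columns.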
  4. The online Internet topic mining method based on the improved LDA model according to claim 3, characterized in that: considering that the further back web page content lies from the current time (t_i), the smaller its influence on the current topic mining, an exponential decay function may be used, when updating the hyperparameters of the On-LDA model, to express the weight with which the web page content of each past time influences the current topic mining, forming the time weight matrix
    Figure PCTCN2015092047-appb-100049
    where λ is the decay factor and n_0 is a normalization constant;
    next, the hyperparameter
    Figure PCTCN2015092047-appb-100050
    is updated to
    Figure PCTCN2015092047-appb-100051
    using the following formula:
    Figure PCTCN2015092047-appb-100052
    where, for 1 ≤ s ≤ k:
    Figure PCTCN2015092047-appb-100053
    The j-th (0 ≤ j ≤ i) column of the matrix
    Figure PCTCN2015092047-appb-100054
    is
    Figure PCTCN2015092047-appb-100055
    It represents, taking the words of the topic
    Figure PCTCN2015092047-appb-100056
    as reference, the number of times all words in the word set W_i appear at time t_i. If the topic
    Figure PCTCN2015092047-appb-100057
    contains the word
    Figure PCTCN2015092047-appb-100058
    then
    Figure PCTCN2015092047-appb-100059
    equals the total number of times that the word
    Figure PCTCN2015092047-appb-100060
    appears at time t_i in all web pages of C_j; if the topic
    Figure PCTCN2015092047-appb-100061
    does not contain the word
    Figure PCTCN2015092047-appb-100062
    then
    Figure PCTCN2015092047-appb-100063
    equals 0;
    Figure PCTCN2015092047-appb-100064
    is the same time weight matrix as above.
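The time weighting in this claim can be sketched as follows. The exact expressions appear only in figures, so two points are assumptions: the weight for time t_j is taken as exp(-λ(i - j)) / n_0 with n_0 chosen so that the weights sum to 1, and the update is taken as a weighted sum of the per-time count columns. The function names are hypothetical.

```python
import math


def time_weights(i, decay):
    """Normalized exponential-decay weights for times t_0 .. t_i.

    decay is the decay factor (lambda); older times get smaller weights.
    """
    raw = [math.exp(-decay * (i - j)) for j in range(i + 1)]
    n0 = sum(raw)  # normalization constant, ASSUMED to make weights sum to 1
    return [r / n0 for r in raw]


def weighted_update(count_columns, weights):
    """Combine per-time count columns into a single weighted vector.

    count_columns[j][s] is the count for entry s observed at time t_j.
    """
    if len(count_columns) != len(weights):
        raise ValueError("need exactly one weight per time step")
    size = len(count_columns[0])
    return [
        sum(weights[j] * count_columns[j][s] for j in range(len(weights)))
        for s in range(size)
    ]


weights = time_weights(3, decay=0.5)
# Topic 0 was prominent at t_0, topic 1 at t_1; the recent one dominates.
updated = weighted_update([[10, 0], [0, 10]], time_weights(1, decay=1.0))
```

A larger decay factor makes the update follow the most recent mining run more closely, while a value near zero weights all past runs almost equally.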
PCT/CN2015/092047 2015-09-02 2015-10-16 Online internet topic mining method based on improved lda model WO2017035922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510557916.1A CN105138665B (en) 2015-09-02 2015-09-02 A kind of internet topic online mining method based on improvement LDA models
CN201510557916.1 2015-09-02

Publications (1)

Publication Number Publication Date
WO2017035922A1 true WO2017035922A1 (en) 2017-03-09

Family

ID=54724012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/092047 WO2017035922A1 (en) 2015-09-02 2015-10-16 Online internet topic mining method based on improved lda model

Country Status (2)

Country Link
CN (1) CN105138665B (en)
WO (1) WO2017035922A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138665B (en) * 2015-09-02 2017-06-20 东南大学 A kind of internet topic online mining method based on improvement LDA models
CN105447179B (en) * 2015-12-14 2019-02-05 清华大学 Topic auto recommending method and its system based on microblogging social networks
CN106777243A (en) * 2016-12-27 2017-05-31 浪潮软件集团有限公司 Dynamic modeling of streaming data analysis
CN108133140A (en) * 2017-12-08 2018-06-08 成都数聚城堡科技有限公司 A kind of mode of the anti-reptile of dynamic
CN108509517B (en) * 2018-03-09 2021-05-11 东南大学 Streaming topic evolution tracking method for real-time news content
CN108596717B (en) * 2018-04-16 2021-02-05 世纪美映影院技术服务(北京)有限公司 Method for intelligently recommending cinema showing schedule
CN111368092B (en) * 2020-02-21 2020-12-04 中国科学院电子学研究所苏州研究院 Knowledge graph construction method based on trusted webpage resources

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
US7853596B2 (en) * 2007-06-21 2010-12-14 Microsoft Corporation Mining geographic knowledge using a location aware topic model
CN102439597A (en) * 2011-07-13 2012-05-02 华为技术有限公司 Parameter deducing method, computing device and system based on potential dirichlet model
CN105138665A (en) * 2015-09-02 2015-12-09 东南大学 Online internet topic mining method based on improved LDA model

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325165B (en) * 2018-08-29 2023-08-22 中国平安保险(集团)股份有限公司 Network public opinion analysis method, device and storage medium
CN109325165A (en) * 2018-08-29 2019-02-12 中国平安保险(集团)股份有限公司 Internet public opinion analysis method, apparatus and storage medium
CN109829112A (en) * 2019-01-31 2019-05-31 平安科技(深圳)有限公司 Fission Topic Tracking method, apparatus and computer equipment based on big data
CN109829112B (en) * 2019-01-31 2023-11-14 平安科技(深圳)有限公司 Fissile topic tracking method and device based on big data and computer equipment
CN111241846A (en) * 2020-01-15 2020-06-05 沈阳工业大学 Theme dimension self-adaptive determination method in theme mining model
CN111241846B (en) * 2020-01-15 2023-05-26 沈阳工业大学 Self-adaptive determination method for theme dimension in theme mining model
CN112115327A (en) * 2020-03-04 2020-12-22 云南大学 Public opinion news event tracking method based on topic model
CN112115327B (en) * 2020-03-04 2023-10-20 云南大学 Topic model-based public opinion news event tracking method
CN111475638A (en) * 2020-06-02 2020-07-31 北京邮电大学 Interest mining method and device
CN113705247B (en) * 2021-10-27 2022-02-11 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product
CN113705247A (en) * 2021-10-27 2021-11-26 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product
CN117422063A (en) * 2023-12-18 2024-01-19 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system
CN117422063B (en) * 2023-12-18 2024-02-23 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system

Also Published As

Publication number Publication date
CN105138665B (en) 2017-06-20
CN105138665A (en) 2015-12-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15902729

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15902729

Country of ref document: EP

Kind code of ref document: A1
