CN103023714A

CN103023714A - Activeness and cluster structure analyzing system and method based on network topics

Info

Publication number: CN103023714A
Application number: CN2012104773175A
Authority: CN
Inventors: 陈秀真; 李生红; 李建华; 李琳; 楼昊; 蔡贵贤; 陶彤彤
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2012-11-21
Filing date: 2012-11-21
Publication date: 2013-04-03
Anticipated expiration: 2032-11-21
Also published as: CN103023714B

Abstract

The invention provides an activity and cluster structure analysis system based on network topics. The system comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for user interaction and for displaying analysis results. The data acquisition and normalization module receives a URL (Uniform Resource Locator) specified by the user and acquires and normalizes network data through a network data crawling subunit and a web page data normalization subunit. The data storage module stores the normalized web page data and supplies analysis data to the application analysis module. The application analysis module, building on web page clustering and hot spot mining, mines topic activity and community structure in depth and presents the results to the user through the user interaction and display module. The system and method overcome the limitations of existing Internet public opinion systems, which rely on a single detection approach and cannot mine web page content, and they address the problems of cluster structure mining and in-depth state evaluation in web page information analysis.

Description

Network-topic-based activity and cluster structure analysis system and method

Technical field

The present invention relates to a system in the field of network public opinion monitoring and analysis, specifically to a system and method for analyzing activity and cluster structure based on network topics.

Background technology

The Internet is highly developed. With the emergence of applications such as blogs, microblogs, and forums, the network has become the main medium through which large volumes of information spread in modern life and the main platform that netizens use to obtain information, post comments, and carry out online promotion and marketing; it has been called the "fourth medium" after newspapers, radio, and television. Monitoring network public opinion in real time and guiding it correctly bears on national security, social harmony, management decisions, and business survival, and it is increasingly important to government departments such as government offices, Internet publicity offices, and foreign affairs offices. Current network public opinion systems use web crawler technology to obtain large amounts of network data. A web crawler (also called a web spider or web robot, and often called a web chaser in the FOAF community) is a program or script that automatically fetches World Wide Web information according to certain rules. The collected data are then filtered and denoised, and hot issues spreading on the network are discovered through methods such as word segmentation, clustering, and statistical analysis.

However, most websites now use Ajax, and comment data are generally added to the page dynamically by JavaScript, so web crawlers have difficulty obtaining the comment content. As a result, current analysis of web page comments stops at identifying the opinion themes of the comments: the comment content is segmented with a word segmentation tool, and opinion keywords are then obtained by statistical methods. These systems do not mine or evaluate the cluster structure state of network topic data and cannot obtain the deeper relationships within the network data.

A further search of the prior art shows that existing public opinion systems neither analyze web page comment data in depth nor mine and evaluate the cluster structure of network topic data.

In other words, from a technical standpoint, existing web crawlers have difficulty obtaining the required data accurately, which makes such analysis hard to implement.

Summary of the invention

In view of the above problems in the prior art, the present invention provides a system for mining and evaluating activity and cluster structure based on network topic data.

The present invention is achieved by the following technical solution. The system comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for user interaction and for displaying data analysis results. The data acquisition and normalization module receives the Uniform Resource Locator (URL) specified by the user and obtains and normalizes network data through the network data crawling subunit and the web page data normalization subunit. The data storage module stores the normalized web page data and provides analysis data to the application analysis module. The application analysis module builds on web page clustering and hot spot mining: it uses the hot spot mining results to further analyze the normalized data, mines topic activity and community structure in depth, and presents the analysis results to the user through the user interaction and display module.

1. The data acquisition and normalization module comprises the network data crawling subunit and the web page data normalization subunit. The network data crawling subunit obtains the web page data of the specified website and saves the web page data together with the local storage address of each page. The web page data normalization subunit processes the original web page data, extracts the key page information through analysis, and stores it in the normalized web page database.

The network data crawling subunit reads the SeedURL table to obtain the seed URLs, uses the web page link crawler module to obtain the links of the specified website, and places them in the web page link queue. The web page fetching module takes page URLs from the web page link queue, fetches each whole page, and then stores the page data, the URL, and the local storage address of the page in the original web page database through the web page information storage module.

The SeedURL table specifies the initial URLs from which the crawler starts. Each record corresponds to one crawler thread, and that thread crawls only the pages of the specified website. The table contains the following fields: (1) url field: the specified URL link; (2) Parser field: the parsing method corresponding to this URL; (3) inUse field: whether this seed URL is in use; (4) finish field: whether crawling of this URL has finished; (5) Depth field: the number of link levels to crawl.

The web page link crawler module starts multiple link crawler threads according to the number of records in the SeedURL table. Each thread crawls all links within the specified number of levels on its specified website and adds them to the web page link queue. Each link crawler thread maintains a URL filter database that records the pages already crawled. When the system obtains a URL, it checks the URL filter database: if a record for this URL already exists, the URL has been crawled and is discarded; if there is no record, the URL is added to the URL queue to wait for the network data request module to process its page. The URL filter database is implemented with a HashSet in Java. The network data request module uses objects from the htmlParser package to obtain data from the network, passes the web page data to the URL extraction module, and stores the processed URL in the URL filter database. The URL extraction module uses the LinkBean object in the htmlParser package to extract URLs, which pass through the URL filter database before being placed in the web page link queue and the URL queue. Once the web page link crawler module has crawled the specified number of levels, it clears the records in the URL filter database and crawls again starting from the seed URL. Link crawler threads are created according to the records in the SeedURL table in the database: a thread is created in the web page link crawler module for every record whose inUse field is 0. The main thread of the web page link crawler module checks the SeedURL table every ten minutes for newly added records. The Parser field of the SeedURL record is added to each LocalRecord object in the link queue crawled by that link crawler thread, to specify how the page is to be parsed.
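The de-duplication and depth limit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `LinkCrawlerSketch`, the method `crawl`, and the in-memory `linkGraph` (which stands in for fetching a page and extracting its links with htmlParser) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the core loop of one link-crawler thread.
class LinkCrawlerSketch {

    // A URL waiting to be processed, together with its link depth from the seed.
    private static final class Pending {
        final String url;
        final int depth;
        Pending(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Returns the URLs reachable from seedUrl within maxDepth link levels,
    // in the order they would be appended to the web page link queue.
    // linkGraph is an assumption: it replaces real page fetching and link extraction.
    static List<String> crawl(String seedUrl, int maxDepth, Map<String, List<String>> linkGraph) {
        Set<String> urlFilter = new HashSet<>();        // plays the role of the URL filter database
        Deque<Pending> urlQueue = new ArrayDeque<>();   // URLs waiting for the request module
        List<String> pageLinkQueue = new ArrayList<>(); // output: web page link queue

        urlFilter.add(seedUrl);
        urlQueue.add(new Pending(seedUrl, 0));

        while (!urlQueue.isEmpty()) {
            Pending current = urlQueue.poll();
            pageLinkQueue.add(current.url);
            if (current.depth == maxDepth) {
                continue; // do not follow links beyond the configured depth
            }
            List<String> links = linkGraph.getOrDefault(current.url, Collections.<String>emptyList());
            for (String link : links) {
                // HashSet.add returns false when the URL was already recorded, so it is dropped
                if (urlFilter.add(link)) {
                    urlQueue.add(new Pending(link, current.depth + 1));
                }
            }
        }
        return pageLinkQueue;
    }
}
```

Clearing `urlFilter` and calling `crawl` again from the seed would correspond to the re-crawl that the text describes once the depth limit has been reached.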

The web page fetching module takes records from the page link queue, requests the corresponding web page data, and stores it on the local disk. The URL of each fetched page and its local storage address are placed together in the page storage queue, and the original page data, the page URL, and the local storage address are passed together to the web page information storage module.

The web page information storage module uses a data access object (DAO) interface based on Hibernate. The corresponding database table is the LocalRecord table, which contains the following fields: (1) url field: the URL of the page; (2) LocalDir field: the local storage address of the page data; (3) Parsed field: the method used to parse this page; (4) Processed field: whether this page has been parsed. The corresponding Java class is LocalRecord, and the corresponding DAO is LocalRecordDAO.

The web page data normalization subunit analyzes the original web page data and extracts the title, author, time, main content, and comments. In this subunit, the original web page data extraction module obtains the original pages saved locally and places them in the queue of pages to be analyzed. The web page analysis module takes original page data from that queue, parses it with the parsing method provided by the web page parsing library, and places the URL of each processed page in the parsed-record queue. The original web page data extraction module then marks the parsed pages stored in the original web page database. The web page analysis module places the parsed data in the normalized web page queue to await the normalized data storage module, which takes the parsed page data from the normalized web page queue and stores it in the normalized web page database through the data access interface. The original web page data extraction module, the web page analysis module, and the normalized data storage module each run as a separate thread, and multiple threads are created for page parsing.

The original web page data extraction module periodically obtains page data from the original web page database and places it in the queue of pages to be analyzed. Based on the entries in the parsed-record queue, it sets the Processed field of the corresponding record in the original web page database to 1, indicating that the page has been parsed.

The queue of pages to be analyzed is a linked list of LocalRecord objects. It supplies data to the web page analysis module and balances the speed difference between the web page analysis module and the original web page data extraction module.

The web page analysis module selects, according to the parsing method specified in the page record, the corresponding method from the web page parsing library and uses it to parse the page. It places the parsing result of the original page data in the normalized web page queue as a buffer and stores the identification information of the parsed page in the parsed-record queue.

The web page parsing library is a set of parsing classes for the pages of different websites. All classes in the library inherit from a common abstract base class.

Each parsing class overrides the base-class method in an object-oriented manner, so that each website implements its own parsing method. The type FormatForParseResult returned by the method defines the following fields:

private String title;

private String url;

private String content;

private String date;

private String author;

private String site; // the website the page belongs to; set by the main thread, so a concrete parsing method need not set it

private List<Comment> comments;

Given the characteristics of web page markup, the parsing library uses htmlParser in Java to parse pages in HTML format. Each HTML page is treated as a tree, and page information is obtained through the corresponding tags. To obtain the main content, each tag node of the input HTML page is traversed in order. For a text node, the module checks whether the text is longer than 30 characters and contains more Chinese characters than English characters; if so, the text is extracted as main content of the page. For an HTML tag node that has child nodes, the child nodes are extracted and analyzed in the same way until every node in the page has been processed. To obtain comments on NetEase and Sina news pages, the URL of the corresponding comment page is constructed first, and the comment information is then fetched from it. The comment URLs are constructed as follows. The NetEase news comment URL is: "http://comment."+channelID+".163.com/data/"+tieChannel+"/df/"+articleID+"_"+pageNum+".html". To obtain channelID, all script nodes of the news page are traversed to find the script node containing the keyword "Site ID"; the content in quotation marks following "ntes_nacc=" in that node is the channelID. The script node containing the keyword "tieAnywhere.HotTieArea" is then located, and the second and third parameters of that function are taken as articleID and tieChannel. In the same script node, the value of the replyCount field, taken as a number and divided by the number of comments per page, gives pageNum. The Sina news comment URL is: "http://comment5.news.sina.com.cn/page/info?format=js&jsvar=pagedata&channel="+Channel+"&newsid="+NewsId+"&group=0&page=1", where Channel and NewsId are parameters. The script node containing sinaCMNT.embed.init is located in the Sina news page; in the parameter string of that function, the part following "channel:" is the value of Channel and the part following "newsid:" is the value of NewsId.

The normalized web page queue buffers the parsed FormatForParseResult objects until the normalized data storage module stores these records in the database.

The normalized data storage module periodically stores the data in the normalized web page queue in the normalized web page database through the data access interface.
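The text-node test for main content and the two comment URL patterns used by the parsing library can be sketched as follows. This is an illustration only: the class and method names are hypothetical, "Chinese character" is approximated here by the CJK Unified Ideographs range U+4E00 to U+9FFF, "English character" by ASCII letters, and the `?` before the Sina query string is an assumption, since the source text of that URL is garbled at that position.

```java
// Hypothetical helper; not part of htmlParser or of the patent's code.
class PageParsingSketch {

    // A text node counts as main content when it is longer than 30 characters
    // and contains more Chinese characters than English letters.
    static boolean looksLikeMainContent(String text) {
        if (text.length() <= 30) {
            return false;
        }
        int chinese = 0;
        int english = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x4E00 && c <= 0x9FFF) {
                chinese++; // assumption: CJK Unified Ideographs block only
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                english++;
            }
        }
        return chinese > english;
    }

    // NetEase comment page URL assembled from the values found in the page's script nodes.
    static String neteaseCommentUrl(String channelId, String tieChannel, String articleId, int pageNum) {
        return "http://comment." + channelId + ".163.com/data/" + tieChannel
                + "/df/" + articleId + "_" + pageNum + ".html";
    }

    // Sina comment page URL; the '?' separator is assumed.
    static String sinaCommentUrl(String channel, String newsId) {
        return "http://comment5.news.sina.com.cn/page/info?format=js&jsvar=pagedata&channel="
                + channel + "&newsid=" + newsId + "&group=0&page=1";
    }
}
```

Extracting channelID, articleID, tieChannel, Channel, and NewsId from the script nodes is left out, because it depends on page markup that is not reproduced here.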

2. The data storage module is the normalized web page database. It stores the information produced by the web page data normalization subunit, such as page title, author, time, main content, and comments. The database contains three tables. The web page basic information table contains: (1) url field: the URL of the page; (2) Title field: the title of the page; (3) Publisher field: the publisher of the page; (4) Data field: the publication time of the page; (5) Site field: the website the page belongs to; (6) id field: the primary key of the table. The web page main content table contains: (1) url field: the URL of the page; (2) MainContent field: the main content of the page; (3) DataId field: the primary key of the table. The comment table contains: (1) url field: the URL of the page; (2) Commented field: the ID of the user who posted the comment; (3) BeCommentedID field: the ID the comment is addressed to (if the comment is not addressed to a specific individual, it is treated as addressed to the publisher of the page); (4) CommentContent field: the content of the comment; (5) CommentDate field: the time of the comment; (6) DataId field: the primary key of the table. The database is accessed through Hibernate, which maps each table to a class and each record to an object; the three corresponding classes are Data, Content, and Comment. The three tables are linked by the id value in the web page basic information table: each time a record is stored in the basic information table, its id value is obtained and assigned to the DataId field of the main content and comment objects that are about to be stored.

3. The application analysis module comprises a web page clustering subunit, a hot spot mining subunit, a topic activity analysis and display unit, and a community structure mining and display unit. The hot spot mining subunit builds on the clustering performed by the web page clustering subunit and analyzes the content of the original web page data to obtain hot topics; it can also search for hot spots related to keywords supplied by the user. The topic activity analysis and display unit and the community structure mining and display unit build on the hot spot mining subunit to analyze network structure and page content in depth, and then feed the results back to the user interaction and display module, which presents them visually.

The web page clustering subunit segments the main content of each page into words, computes the weight of each word, and maps the page content to a vector in a vector space. It then clusters the pages within a specified time period using the K-means method, so that the network data are classified by subtopic.
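A compact sketch of this vectorize-then-cluster step is given below. Several choices are assumptions made only for illustration, because the text does not specify them: raw term frequency is used as the word weight, the first k vectors seed the centroids, and squared Euclidean distance is the similarity measure. The class and method names are hypothetical, and word segmentation is assumed to have been done already.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; weighting scheme and initialization are assumptions.
class KMeansSketch {

    // Maps a segmented page to a vector of raw term counts over a fixed vocabulary.
    static double[] termFrequencyVector(List<String> tokens, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    // Standard K-means (Lloyd's iteration); returns the cluster index of each point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: first k points as initial centroids
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double distance = 0.0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        distance += diff * diff;
                    }
                    if (distance < bestDistance) {
                        bestDistance = distance;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[assignment[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid when a cluster is empty
                }
                for (int j = 0; j < dim; j++) {
                    centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

In practice a weighting such as TF-IDF and a randomized or K-means++ initialization are common alternatives; they are not claimed to be what the system uses.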

The hot spot mining subunit uses the clustering results of the web page clustering subunit together with the web page data. It segments all page titles collected within a given period, extracts popular words according to properties such as part of speech, and uses the resulting popular vocabulary to mine hot topics. The unit can also mine hot spots related to keywords specified by the user. Hot topics are mined mainly by the maximum-comment method, which treats the page titles with the most comments in a period as the hot spots of that period: the 100 pages with the most comments in the period are extracted from the comment table, their pairwise similarities are computed, highly similar pages are merged into one class, and the words common to each class are extracted as the hot topic of that class.
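The merge-and-extract part of the maximum-comment method might look like the sketch below. It is a simplification under stated assumptions: the input titles are taken to be already segmented, sorted by descending comment count, and truncated to the top entries; Jaccard similarity over word sets and the greedy comparison against the first title of each class are illustrative choices, since the text names neither the similarity measure nor the merging procedure. All identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of grouping similar titles and extracting their common words.
class HotTopicSketch {

    // Jaccard similarity of two word sets (an assumed measure).
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Returns, for each class of similar titles, the words shared by every title in it.
    static List<Set<String>> hotTopics(List<List<String>> segmentedTitles, double threshold) {
        List<Set<String>> representatives = new ArrayList<>(); // word set of the first title per class
        List<Set<String>> commonWords = new ArrayList<>();     // running intersection per class
        for (List<String> title : segmentedTitles) {
            Set<String> words = new LinkedHashSet<>(title);
            boolean merged = false;
            for (int g = 0; g < representatives.size(); g++) {
                if (similarity(words, representatives.get(g)) >= threshold) {
                    commonWords.get(g).retainAll(words); // keep only words shared by the class
                    merged = true;
                    break;
                }
            }
            if (!merged) {
                representatives.add(words);
                commonWords.add(new LinkedHashSet<>(words));
            }
        }
        return commonWords;
    }
}
```

A full pairwise comparison followed by hierarchical merging would follow the description more literally; the greedy pass above only keeps the example short.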

The community structure mining and display unit retrieves, from each website, the page comments related to the specified keywords and constructs a network graph from the reply relationships among the comments. It analyzes the community structure of the network with the community mining method based on network potential energy. The nodes of the graph are then placed at random within the specified window, each edge between nodes is treated as a spring, and the relative position of each node is computed by analyzing the forces in the system, which yields a layout of the whole network. Nodes belonging to different communities are shown in different colors.

The community mining method based on network potential energy introduces the physical concept of potential energy into complex network analysis: it defines a network potential energy and mines the community structure of a complex network by optimizing the potential energy function of the network. The method is implemented as follows. Nodes of degree 1 are removed from the network to obtain the preprocessed adjacency matrix A'. A breadth-first search over A' computes the number of edges on the shortest path between every pair of nodes, which gives the distance matrix D. The potential energy between any two nodes is computed from D and a constant R, and the network potential energy of the whole network G is obtained by summing the potential energy between all nodes. For each edge e_k in the network, the method computes the difference between the network potential energy of the sub-network G_k = G - {e_k}, obtained by deleting that edge, and the network potential energy of the original network G. The edge with the largest difference is removed. If the removal does not produce a new sub-network, the method continues to remove the remaining edge with the largest difference. If it produces a new disconnected sub-network, the method checks, against the definitions of strong and weak community structure, whether the resulting sub-networks satisfy the predefined strong and weak community structure. If they do, a new distance matrix D is computed on the current network structure and the potential energy steps above are repeated to split the network further. If they do not, the nodes of degree 1 removed during preprocessing are assigned to the community of the node to which each is directly connected, which gives the community partition of the network nodes.

In the first step, for a complex network G = (V, E), V denotes the set of nodes and E the set of edges. The adjacency matrix A of network G is input and preprocessed to obtain the preprocessed adjacency matrix A'. Preprocessing removes the nodes of degree 1 from the network, that is, it deletes the rows and columns corresponding to those nodes from the input adjacency matrix.

In the second step, the topology of the network is analyzed from the preprocessed adjacency matrix A'. A breadth-first search from each node reaches the other nodes of the whole network and gives the distance between every pair of nodes, from which the distance matrix D is built. The distance between two nodes is the number of edges on the shortest path between them.

In the third step, the network potential energy of the whole network is computed from the distance matrix of the nodes. Each node in the network is regarded as the source of a gravitational field, which gives the potential energy between any two nodes. This pairwise potential energy is computed from R, a constant that can be set to a positive number, and D, the distance matrix obtained in the second step. The network potential energy of the whole network G is the sum of the potential energy between all nodes contained in the network.

In the fourth step, for each edge e_k in the network, the difference between the network potential energy of the sub-network G_k = G - {e_k}, obtained by deleting that edge, and the network potential energy of the original network G is computed.

In the fifth step, the edge with the largest potential energy difference is removed, and the method checks whether an independent sub-network has been produced. If not, the method returns to the fourth step. If so, it checks whether the sub-networks produced by the split satisfy the predefined strong and weak community structure. If they do, the method returns to the first step and continues; if they do not, the method ends and proceeds to the sixth step.

In the sixth step, the original network graph is reconstructed from the result of the fifth step. The nodes removed during preprocessing in the first step are added back to the original network graph, and each is assigned to the community of the node to which it is directly connected.
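The distance matrix of the second step and the sub-network check of the fifth step can be sketched as follows. The sketch covers only these two pieces, because the potential energy formulas are not reproduced in this text. The class name, the method names, and the use of -1 to mark unreachable node pairs are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch: breadth-first distances over an adjacency matrix.
class DistanceMatrixSketch {

    static final int UNREACHABLE = -1; // assumed marker for pairs with no connecting path

    // Returns D, where D[i][j] is the number of edges on the shortest path from i to j.
    static int[][] distanceMatrix(int[][] adjacency) {
        int n = adjacency.length;
        int[][] dist = new int[n][n];
        for (int source = 0; source < n; source++) {
            Arrays.fill(dist[source], UNREACHABLE);
            dist[source][source] = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(source);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int v = 0; v < n; v++) {
                    // visit each neighbor once; the first visit gives the shortest distance
                    if (adjacency[u][v] != 0 && dist[source][v] == UNREACHABLE) {
                        dist[source][v] = dist[source][u] + 1;
                        queue.add(v);
                    }
                }
            }
        }
        return dist;
    }

    // True when every node can reach every other node, i.e. no independent sub-network exists.
    static boolean isConnected(int[][] dist) {
        for (int[] row : dist) {
            for (int d : row) {
                if (d == UNREACHABLE) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

After an edge is removed in the fifth step, recomputing the matrix and calling `isConnected` is one simple way to detect that the network has split, at the cost of a full breadth-first pass per removal.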

The topic activity analysis and display unit studies topic-based network clusters in forums and proposes a hierarchical quantitative model and calculation method for evaluating cluster activity. The model is divided from top to bottom into three levels: the cluster layer, the individual layer, and the behavior layer. Taking into account factors such as the size of the cluster and differences in individual behavior, it builds an activity state evaluation model for the whole cluster, quantifies the activity state, and provides a quantitative method for assessing how active the cluster is. The method is as follows.

An ID that takes part in interaction on a BBS is called an individual. Each individual has three kinds of behavior: (1) raising a topic, called posting a topic post; (2) replying to a topic, called posting a reply; (3) reading topics in the forum as a visitor without joining any discussion, called browsing. Posting a topic post and posting a reply are together called posting behavior. The set of individuals linked by a topic, that is, all individuals who post on or browse the same topic, is called a cluster. According to their behavior, the individuals in a cluster fall into two classes: IDs that actually take part in the discussion, that is, IDs with posting behavior (topic posts or replies), are called real individuals; individuals that browse the topic without posting are called virtual individuals. The network cluster is divided from top to bottom into the cluster layer, the individual layer, and the behavior layer, and it is evaluated bottom-up.

The following concepts and symbols are defined.

Definition 1. Cluster: C(T, I, B), where

T denotes the topic on which the cluster is based;

I = {i_1, ..., i_m} denotes the set of individuals participating in topic T, with i_k ∈ {real, virtual} and m finite;

B = {post_{fst}, post_{rpl}, browse} denotes the set of individual behaviors, where post_{fst} denotes posting a topic post, post_{rpl} denotes posting a reply, and browse denotes browsing.

Each individual i_k in I then satisfies

i_{k} = \begin{cases} real, & b_{ik} = post_{fst} \text{ or } post_{rpl} \\ virtual, & b_{ik} = browse \end{cases}

where b_{ik} denotes the behavior of individual i_k.

Definition 2. Decay factor (ω): after an individual behavior occurs, its activity declines steadily over time. The factor ω reflects the degree to which the activity of an individual behavior decays; its value lies in the interval (0, 1).

Definition 3. Half-life: the time interval required for the activity of an individual behavior to fall to 1/2 of its initial value. The decay factor and the half-life are related to each other.

A real individual in a cluster may post many times. A post made within the half-life still has relatively high activity, reflects the activity of the individual, and is a valid behavior; a post made outside the half-life has lost more than half of its activity and is regarded as an invalid behavior.

The symbolic variables are defined as follows.

Table 1. Explanation of symbolic variables

The activity state evaluation model of the cluster is divided from top to bottom into three levels: the cluster layer, the individual layer, and the behavior layer.

The behavior layer divides the individuals of a topic-based network cluster into two classes, real individuals and virtual individuals, which correspond to two types of behavior, posting and browsing. The activity of these two behaviors is quantified as follows.

(1) Posting: suppose posting behavior x occurs at time T, and define its activity at that moment as the unit value 1. Its activity at time t, B_{ijx post}(t), is given by formula (1).

(2) Browsing: the contribution of browsing to cluster activity is far lower than that of posting and lies in the interval (0, 1). It is computed as

B_{il browse}(t) = \frac{Num_{i post}(t)}{Num_{i browse}(t)} \qquad (2)

Extensive statistical calculation on historical data shows that when t is sufficiently large, B_{il browse}(t) ≈ 10^{-3}.

The individual layer defines activity formulas for real individuals and virtual individuals separately.

(1) The activity of real individual j at time t is

N_{ij post}(t) = \sum_{x} B_{ijx post}(t) \qquad (3)

(2) The activity of virtual individual l at time t is

N_{il browse}(t) = B_{il browse}(t) \qquad (4)

The activity of the cluster layer is determined by the activity of all real individuals and virtual individuals in the cluster at time t and is given by formula (5), in which:

1) \overrightarrow{N_{ij post}(t)} = (N_{i1 post}(t), \cdots, N_{ip post}(t)) is the activity vector of the real individuals in cluster i at time t. Its element N_{ij post}(t) (j = 1, ..., p) is the real individual activity computed by formula (3), and p is the number of real individuals in the cluster.

2) \overrightarrow{W(N_{ij post}(t))} = (W(N_{i1 post}(t)), \cdots, W(N_{ip post}(t))) is the importance weight vector of the real individuals in cluster i at time t.

3) \overrightarrow{N_{il browse}(t)} = (N_{i1 browse}(t), \cdots, N_{iq browse}(t)) is the activity vector of the virtual individuals in cluster i at time t. Its element N_{il browse}(t) (l = 1, ..., q) is the virtual individual activity computed by formula (4), and q is the number of virtual individuals in the cluster.

4) \overrightarrow{W(N_{il browse}(t))} = (W(N_{i1 browse}(t)), \cdots, W(N_{iq browse}(t))) is the importance weight vector of the virtual individuals in cluster i at time t.

5) The importance weight of real individual j in cluster i at time t is W(N_{ij post}(t)), computed as

W(N_{ij post}(t)) = \begin{cases} 1, & b_{j} = post_{fst} \\ \dfrac{\sum_{t=0}^{t-Half\text{-}life} pNum_{ij}(t)}{\max_{j} \sum_{t=0}^{t-Half\text{-}life} pNum_{ij}(t)}, & b_{j} = post_{rpl} \end{cases} \qquad (6)

where b_j denotes the behavior of real individual j. In the numerator, pNum_{ij}(t) counts the historical posts of real individual j, so \sum_{t=0}^{t-Half\text{-}life} pNum_{ij}(t) is the total of all its posts made outside the half-life. The denominator, \max_{j} \sum_{t=0}^{t-Half\text{-}life} pNum_{ij}(t), is the maximum of that total over all individuals.

6) The importance weight of virtual individual l in cluster i at time t is W(N_{il browse}(t)) = 1.

7) determining of ω value: obtain by the mass data that collects is carried out statistical analysis, in the forum of ends of the earth community " ends of the earth tittle-tattle ", the half-life of the behavior of posting is about 3 hours, thereby its reduction factor ω ≈ 0.71.

8) W(C_{i}(t)) can be defined as the ratio of the number of individuals in the cluster to the number of users online in the whole forum:

W(C_{i}(t)) = \frac{Num_{i post}(t) + Num_{i browse}(t)}{Num_{online}(t)}   (7)

This weight is used to compare activeness between different clusters.
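As an illustration of formulas (6) and (7), the following Java sketch computes the importance weights of the real individuals in one cluster and the cluster weight W(C_i(t)). The class and method names are hypothetical, and the per-individual posting totals outside the half-life are assumed to have been counted beforehand. Returning 0 when no individual has any posting history is an assumption; the text does not cover that case.

```java
final class ActivenessWeights {
    /**
     * Formula (6). isFirstPost[j] is true when b_j = post_fst, false when b_j = post_rpl.
     * historicalPosts[j] is the posting total of individual j outside the half-life.
     */
    static double[] realIndividualWeights(boolean[] isFirstPost, long[] historicalPosts) {
        long max = 0;
        for (long n : historicalPosts) {
            max = Math.max(max, n);   // denominator: maximum over all individuals
        }
        double[] w = new double[historicalPosts.length];
        for (int j = 0; j < w.length; j++) {
            if (isFirstPost[j]) {
                w[j] = 1.0;
            } else {
                // Assumption: weight 0 when nobody in the cluster has posting history.
                w[j] = (max == 0) ? 0.0 : (double) historicalPosts[j] / max;
            }
        }
        return w;
    }

    /** Formula (7): share of the forum's online users that belong to the cluster. */
    static double clusterWeight(int numPost, int numBrowse, int numOnline) {
        if (numOnline <= 0) {
            throw new IllegalArgumentException("numOnline must be positive");
        }
        return (double) (numPost + numBrowse) / numOnline;
    }
}
```

Because every weight lies between 0 and 1 for non-negative counts, clusters of very different sizes can be compared on the same scale.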

The user interaction and display module receives user commands such as query input, parameter settings and system control, and presents the results of modules such as hot spot analysis and community mining in visual form.

Compared with the prior art, the present invention overcomes the limitations of existing network public opinion systems, namely a single detection approach, simplistic analysis results and the inability to mine web content. It addresses cluster structure mining in web information analysis and in-depth information mining in state evaluation, and it supports tasks such as public opinion behavior prediction from two aspects: network topic activeness and cluster structure. In the field of public opinion information systems, the invention is the first to take the topic as the core for analyzing cluster activeness and cluster structure. By analyzing web information, the invention obtains hot network topics and, on that basis, mines the inherent structural features of network cluster behavior and the cluster activeness. The invention therefore has good application prospects and can provide useful information for guiding and controlling public emergencies. From the standpoint of technical implementation, the invention is readily implementable and highly precise, which makes it a valuable analysis method.

Description of the drawings

Fig. 1 is a structural schematic diagram of the present invention.

Fig. 2 is a structural schematic diagram of the network data crawling subunit.

Fig. 3 is a schematic diagram of the web link crawler module.

Fig. 4 is a schematic diagram of the web data normalization subunit.

Fig. 5 is a diagram of the hierarchical quantitative evaluation model for network cluster state.

Embodiments

Embodiments of the invention are described in detail below. The present embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and a specific operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.

As shown in Fig. 1, the present embodiment comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for user interaction and for displaying data analysis results. The data acquisition and normalization module receives a URL specified by the user and acquires and normalizes network data through the network data crawling subunit and the web data normalization subunit. The data storage module stores the normalized web data and provides analysis data to the application analysis module. On the basis of web page clustering and hot spot mining, the application analysis module uses the hot spot mining results to further analyze the normalized data, mines topic activeness and community structure in depth, and presents the analysis results to the user through the user interaction and display module.

The data acquisition and normalization module comprises the network data crawling subunit and the web data normalization subunit. The network data crawling subunit obtains the web data of the specified website and saves the web data together with the local storage address of each page. The web data normalization subunit processes the original page data and stores the key page information extracted by the analysis in the normalized web database.

As shown in Fig. 2, the network data crawling subunit reads the SeedURL table to obtain the seed URLs, uses the web link crawler module to obtain the links of the specified website, and places them in the web link queue. The page fetching module takes page URLs from the web link queue, fetches each whole page, and stores the page data, the URL and the local storage address of the page in the original web page database through the web information storage module.

As shown in Fig. 3, the web link crawler module starts multiple link crawler threads according to the number of records in the SeedURL table. Each thread crawls all links of the specified website down to the specified depth and adds them to the web link queue. Each link crawler thread maintains a URL filter database that holds information on the pages already crawled. When the system obtains a URL, it checks the URL filter database: if a record for this URL already exists, the URL has been crawled and is discarded; if there is no record, the URL is added to the URL queue to wait for the network data request module to process its page. The URL filter database is implemented with a HashSet in Java. The network data request module uses objects in the htmlParser package to fetch data from the network, passes the page data to the URL extraction module, and stores the processed URL in the URL filter database. The URL extraction module uses the LinkBean object in the htmlParser package to extract URLs, which pass through the URL filter database and are placed in the web link queue and the URL queue. After the web link crawler module has crawled a certain number of layers, it empties the records in the URL filter database and crawls again starting from the seed URL. Link crawler threads are created according to the records of the SeedURL table in the database: a thread is created in the web link crawler module for every record whose inUse field is 0. The main thread of the web link crawler module checks the SeedURL table every ten minutes for newly added records. The Parser field of SeedURL is added to the LocalRecord objects in the link queue crawled by the link crawler thread, to specify how the page is to be parsed.
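The URL filter check described above can be sketched with java.util.HashSet as follows. The class and method names are hypothetical. As a simplification, this sketch marks a URL as seen at the moment it is queued, whereas the text stores the URL in the filter after the page has been processed.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

final class UrlFilter {
    private final Set<String> seen = new HashSet<>();       // URLs already queued or crawled
    private final Queue<String> urlQueue = new ArrayDeque<>();

    /** Queues the URL only if it has no record in the filter; returns true when it was queued. */
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false;   // a record exists: the URL is discarded
        }
        urlQueue.add(url);
        return true;
    }

    /** Next URL waiting to be requested, or null when the queue is empty. */
    String next() {
        return urlQueue.poll();
    }

    /** Empties the filter records so that crawling can restart from the seed URL. */
    void reset() {
        seen.clear();
    }

    int pending() {
        return urlQueue.size();
    }
}
```

Since each link crawler thread keeps its own filter, the sketch uses no synchronization; a filter shared between threads would need a concurrent set instead.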

As shown in Fig. 4, the web data normalization subunit analyzes the original web page data and extracts the title, author, time, main content and comments. The original web page data extraction module obtains the original pages saved locally and places them in the queue of pages to be analyzed. The web page analysis module takes original page data from this queue, parses it using the parsing method provided by the web page parsing database, and places the URL of each processed page in the parsing record queue. At the same time, the original web page data extraction module marks the parsed pages stored in the original web page database. The web page analysis module stores the analyzed data in the normalized page queue to await processing by the normalized data storage module, which takes the analyzed web data from the normalized page queue and stores it in the normalized web database through the data access interface. The original web page data extraction module, the web page analysis module and the normalized data storage module are each implemented as a separate thread, and multiple threads are created for page parsing.
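The three-thread arrangement above, with queues decoupling extraction, analysis and storage, might look like the following minimal sketch. All names are hypothetical, pages are plain strings, trimming stands in for real parsing, and the end-of-input marker is an assumption introduced so that the example terminates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class NormalizationPipeline {
    // Hypothetical end-of-input marker passed down the queues.
    private static final String END = "<<END>>";

    /** Runs extraction, analysis and storage as three threads and returns the stored pages. */
    static List<String> run(final List<String> rawPages) {
        final BlockingQueue<String> toAnalyse = new LinkedBlockingQueue<>();
        final BlockingQueue<String> normalized = new LinkedBlockingQueue<>();
        final List<String> stored = Collections.synchronizedList(new ArrayList<String>());

        Thread extractor = new Thread(() -> {
            toAnalyse.addAll(rawPages);          // pages saved locally by the crawler
            toAnalyse.add(END);
        });
        Thread analyser = new Thread(() -> {
            try {
                String page;
                while (!(page = toAnalyse.take()).equals(END)) {
                    normalized.add(page.trim()); // stand-in for the real parsing method
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                normalized.add(END);             // always release the storage thread
            }
        });
        Thread storer = new Thread(() -> {
            try {
                String page;
                while (!(page = normalized.take()).equals(END)) {
                    stored.add(page);            // stand-in for the database write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread[] stages = {extractor, analyser, storer};
        for (Thread t : stages) t.start();
        try {
            for (Thread t : stages) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return new ArrayList<>(stored);
    }
}
```

The blocking queues let a fast stage run ahead of a slow one without losing pages, which is the balancing role the queues play in the subunit.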

The data storage module is the normalized web database. It stores the information generated by the web data normalization subunit, such as page title, author, time, main content and comments. The database contains three tables: the page basic information table, the page main content table and the comment table. The database is accessed using Hibernate: each table is mapped to a class and each record in a table to an object of that class; the three corresponding classes are Data, Content and Comment. The three tables are associated through the id value of the page basic information table: each time a record is stored in the page basic information table, the id value of that record is obtained and assigned to the DataId field of the main content and comment objects that are about to be stored.

The application analysis module obtains data from the normalized web database and mines information in depth using the web page clustering subunit, the hot spot mining subunit, the topic activeness analysis and visualization subunit, and the community structure mining and visualization subunit. The hot spot mining subunit works on the basis of the clustering performed by the web page clustering subunit: it analyzes the content of the original page data to obtain hot topics, and it can also run hot spot queries for keywords supplied by the user. Building on the hot spot mining subunit, the topic activeness analysis and visualization subunit and the community structure mining and visualization subunit analyze network structure and page content in depth and feed the analysis results back to the user interaction and display module, which presents them visually. The web page clustering subunit segments the main content of each page into words, computes the weight of each word, and maps the page content to a vector in the vector space. The pages within a specified time period are then clustered with the K-means method, so that the network data is classified by sub-topic.
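The clustering step can be sketched as follows, assuming the pages have already been segmented into words. The names are hypothetical, raw term frequency is used as the word weight (the text does not specify the weighting), and the first k vectors seed the centroids so that the result is deterministic; the input is assumed to hold at least k non-empty pages.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PageClustering {
    /** Maps each segmented page to a term-frequency vector over the shared vocabulary. */
    static double[][] toVectors(List<List<String>> pages) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> page : pages) {
            for (String word : page) {
                if (!vocab.containsKey(word)) vocab.put(word, vocab.size());
            }
        }
        double[][] v = new double[pages.size()][vocab.size()];
        for (int i = 0; i < pages.size(); i++) {
            for (String word : pages.get(i)) v[i][vocab.get(word)] += 1.0;
        }
        return v;
    }

    /** Lloyd's K-means with squared Euclidean distance; returns a cluster label per vector. */
    static int[] kMeans(double[][] x, int k, int maxIter) {
        int n = x.length, d = x[0].length;
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = x[j].clone();   // assumed seeding: first k vectors
        int[] label = new int[n];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {                   // assignment step
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int t = 0; t < d; t++) dist += (x[i][t] - c[j][t]) * (x[i][t] - c[j][t]);
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            if (!changed && it > 0) break;                  // assignments are stable
            double[][] sum = new double[k][d];              // update step
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[label[i]]++;
                for (int t = 0; t < d; t++) sum[label[i]][t] += x[i][t];
            }
            for (int j = 0; j < k; j++) {
                if (count[j] == 0) continue;                // keep an empty cluster's centroid
                for (int t = 0; t < d; t++) c[j][t] = sum[j][t] / count[j];
            }
        }
        return label;
    }
}
```

Pages that share most of their words end up with the same label, which is how the subunit separates the collected pages into sub-topics.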

The hot spot mining subunit uses the clustering results of the web page clustering subunit together with the web data. It segments the titles of all pages collected within a given time period, filters candidate words by properties such as part of speech to obtain the popular vocabulary, and mines hot topics from this vocabulary. The subunit can also mine the hot spots related to a keyword specified by the user. Hot topic mining mainly uses the maximum-comment method: the page titles with the most comments in a period are taken as the hot spots of that period. The 100 pages with the most comments in the period are extracted from the comment table, the pairwise similarity between them is calculated, highly similar pages are merged into one class, and the words common to each class are extracted as the hot topic of that class.
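The merging step of the maximum-comment method can be sketched as below. The input is assumed to be the segmented titles of the most-commented pages, already sorted and limited to the top 100. The text does not name the similarity measure or the merging rule, so the Jaccard similarity, the greedy grouping against the first title of each class, and the threshold are all assumptions, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class HotTopicMiner {
    /** Jaccard similarity of two word sets (an assumed measure). */
    static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    /**
     * Greedily groups titles: a title joins the first class whose first title is
     * similar enough, otherwise it starts a new class. Returns the words common
     * to all titles of each class, which serve as that class's hot topic.
     */
    static List<Set<String>> topics(List<Set<String>> titles, double threshold) {
        List<Set<String>> representatives = new ArrayList<>();
        List<Set<String>> common = new ArrayList<>();
        for (Set<String> title : titles) {
            boolean merged = false;
            for (int g = 0; g < representatives.size() && !merged; g++) {
                if (similarity(representatives.get(g), title) >= threshold) {
                    common.get(g).retainAll(title);   // keep only the shared words
                    merged = true;
                }
            }
            if (!merged) {
                representatives.add(title);
                common.add(new LinkedHashSet<>(title));
            }
        }
        return common;
    }
}
```

A higher threshold yields more, narrower topics; a lower one merges loosely related titles and leaves fewer common words per class.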

The community structure mining and display subunit retrieves the page comments related to a specified keyword from each website and builds a network graph from the message (reply) relationships between comments. The community structure of the network is analyzed with a community mining method based on network potential energy. Each node of the graph is then placed at random in the specified window, the edges between nodes are treated as springs, and the relative position of each node is computed by force analysis of the system, so that the whole network is drawn, with the nodes of different communities shown in different colors. The topic activeness analysis and visualization subunit, based on a study of topic-based network clusters in forums, proposes a hierarchical quantitative evaluation model and calculation method for cluster activeness. From top to bottom, the model is divided into three levels: the cluster layer, the individual layer and the behavior layer. Taking into account factors such as cluster size and differences in individual behavior, it builds an active-state evaluation model for the whole cluster, quantifies the active state, and gives a quantitative method for evaluating how active a cluster is.
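The spring-based node placement used for drawing the community graph can be sketched as follows. This is a deliberately reduced illustration: only the spring forces on the edges are modelled, and the rest length, stiffness, iteration count and all names are assumptions. Practical force-directed layouts usually add a repulsive force between all node pairs, which is omitted here.

```java
import java.util.Random;

final class SpringLayout {
    /**
     * Places n nodes at random in a width x height window, then relaxes every edge
     * as a spring with the given rest length. Returns positions as [node][x, y].
     */
    static double[][] layout(int n, int[][] edges, double width, double height,
                             double restLength, int iterations, long seed) {
        Random rnd = new Random(seed);
        double[][] pos = new double[n][2];
        for (double[] p : pos) {
            p[0] = rnd.nextDouble() * width;
            p[1] = rnd.nextDouble() * height;
        }
        final double stiffness = 0.1;   // assumed spring constant times step size
        for (int it = 0; it < iterations; it++) {
            double[][] force = new double[n][2];
            for (int[] e : edges) {
                double dx = pos[e[1]][0] - pos[e[0]][0];
                double dy = pos[e[1]][1] - pos[e[0]][1];
                double dist = Math.max(Math.hypot(dx, dy), 1e-9);
                // Hooke's law along the edge: pull together when stretched, push apart when compressed.
                double pull = stiffness * (dist - restLength) / dist;
                force[e[0]][0] += pull * dx;
                force[e[0]][1] += pull * dy;
                force[e[1]][0] -= pull * dx;
                force[e[1]][1] -= pull * dy;
            }
            for (int i = 0; i < n; i++) {
                pos[i][0] += force[i][0];
                pos[i][1] += force[i][1];
            }
        }
        return pos;
    }
}
```

For two connected nodes the distance converges to the rest length; in a larger graph, densely connected communities are pulled into visible groups.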

Taking three websites, NetEase, Sina and the Shanghai Jiao Tong University "Yinshui Siyuan" forum, as target sites and collecting data from 2011-06-01 to 2011-07-01, the workflow of the present embodiment comprises the following steps:

(1) After the user inputs a URL, the system takes this URL as the seed, obtains the original web page data of the website through the network data crawling subunit, and stores each original page and its storage address locally. The web data normalization subunit extracts the title, author, time and main content and stores them in the normalized web database. The URL format of the specified website is analyzed to construct the target URL of the comments loaded via Ajax, and the related comments are then obtained and stored in the database.

(2) The web data in the normalized web database is classified by a clustering method. The hot spot mining subunit uses the clustering results to perform hot spot mining on the network data.

(3) The data in the normalized web database and the analysis results of the hot spot mining subunit are supplied as input data to the topic activeness evaluation unit and to the community structure mining and visualization subunit. The analysis results of these two units are shown to the user together with those of the hot spot mining subunit.

The present embodiment implements web information structure mining and topic activeness evaluation with a user-friendly interactive interface. It is applicable to web information analysis in many fields, such as network public opinion analysis, and solves problems of the prior art such as a single detection approach, simplistic analysis results and the inability to mine web content. It is functionally complete, modular, extensible and easy to interact with, and has good prospects for wider adoption.

Claims

1. An activeness and cluster structure analysis system based on network topics, characterized by comprising a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module, wherein:

the user interaction and display module is used for providing the interface for user interaction and for displaying data analysis results;

the data acquisition and normalization module is used for receiving a uniform resource locator (URL) specified by the user and for acquiring and normalizing network data through a network data crawling subunit and a web data normalization subunit; it comprises the network data crawling subunit and the web data normalization subunit, the network data crawling subunit being used for obtaining the web data of the specified website and saving the web data together with the local storage address of each page, and the web data normalization subunit processing the original page data and storing the key page information extracted by the analysis in the corresponding data storage module;

the data storage module is used for storing the original data of web pages and the normalized web data, and for providing analysis data to the application analysis module;

the application analysis module, on the basis of web page clustering and hot spot mining, uses the hot spot mining results to mine topic activeness and community structure in depth from the normalized data, and shows the results to the user through the user interaction and display module.

2. The activeness and cluster structure analysis system based on network topics according to claim 1, characterized in that the network data crawling subunit comprises a web link crawler module, a page fetching module and a web information storage module, which obtain and store web data by means of a web link queue and a page storage queue.

3. The activeness and cluster structure analysis system based on network topics according to claim 1, characterized in that the web data normalization subunit further comprises an original web page data extraction module, a web page analysis module and a normalized data storage module, together with a queue of pages to be analyzed, a parsing record queue and a normalized page queue. The original web page data extraction module obtains the original pages saved locally and places them in the queue of pages to be analyzed. The web page analysis module takes original page data from the queue of pages to be analyzed, parses it using the parsing method provided by the web page parsing database, and places the URL of each processed page in the parsing record queue. At the same time, the original web page data extraction module marks the parsed pages stored in the original web page database. The web page analysis module stores the analyzed data in the normalized page queue to await processing by the normalized data storage module. The normalized data storage module takes the analyzed web data from the normalized page queue and stores it in the normalized web database through the data access interface.

4. The activeness and cluster structure analysis system based on network topics according to claim 1, characterized in that the application analysis module comprises a web page clustering subunit, a hot spot mining subunit, a topic activeness analysis and visualization subunit, and a community structure mining and visualization subunit, wherein the hot spot mining subunit, on the basis of the clustering performed by the web page clustering subunit, analyzes the content of the original page data to obtain hot topics and can also run hot spot queries for keywords supplied by the user; and the topic activeness analysis and visualization subunit and the community structure mining and visualization subunit, on the basis of the hot spot mining subunit, analyze network structure and page content in depth and feed the analysis results back to the user interaction and display module, which presents the results.

5. An activeness and cluster structure analysis method based on network topics, characterized by comprising:

(1) after the user inputs a URL, the system takes this URL as the seed, obtains the original web page data of the website through the network data crawling subunit, and stores each original page and its storage address locally; the web data normalization subunit extracts the title, author, time and main content and stores them in the normalized web database; the URL format of the specified website is analyzed to construct the target URL of the comments loaded via Ajax, and the related comments are then obtained and stored in the database;

(2) the web data in the original web page database is classified by a clustering method, and the hot spot mining subunit uses the clustering results to perform hot spot mining on the network data;

(3) the data in the normalized web database and the analysis results of the hot spot mining subunit are supplied as input data to the topic activeness evaluation unit and to the community structure mining and visualization subunit, and the analysis results of these two units are shown to the user together with those of the hot spot mining subunit.

6. The method according to claim 5, characterized by further comprising:

reading the SeedURL table to obtain the seed URLs, using the web link crawler module to obtain the links of the specified website and placing them in the web link queue; the page fetching module taking page URLs from the web link queue, fetching each whole page, and storing the page data, the URL and the local storage address of the page in the original web page database through the web information storage module; wherein the SeedURL table specifies the initial URLs from which the crawler crawls pages, each record starts one corresponding crawler thread, and that thread crawls only the pages of the specified website. The table comprises the following fields: (1) url field: the specified URL link; (2) Parser field: the parsing method corresponding to this URL; (3) inUse field: whether this seed URL is in use; (4) finish field: whether crawling of this URL is complete; (5) Depth field: the number of layers to crawl;

the web link crawler module starts multiple link crawler threads according to the number of records in the SeedURL table; each thread crawls all links of the specified website down to the specified depth and adds them to the web link queue; each link crawler thread maintains a URL filter database that holds information on the pages already crawled; when the system obtains a URL, it checks the URL filter database: if a record for this URL already exists, the URL has been crawled and is discarded; if there is no record, the URL is added to the URL queue to wait for the network data request module to process its page; the network data request module uses objects in the htmlParser package to fetch data from the network, passes the page data to the URL extraction module, and stores the processed URL in the URL filter database; the URL extraction module uses the LinkBean object in the htmlParser package to extract URLs, which pass through the URL filter database and are placed in the web link queue and the URL queue;

after the web link crawler module has crawled a certain number of layers, it empties the records in the URL filter database and crawls again starting from the seed URL; link crawler threads are created according to the records of the SeedURL table in the database, a thread being created in the web link crawler module for every record whose inUse field is 0; the main thread of the web link crawler module checks the records of SeedURL at regular intervals for newly added records;

the Parser field of SeedURL is added to the LocalRecord objects in the link queue crawled by the link crawler thread, to specify how the page is to be parsed;

the page fetching module takes records from the web link queue, requests the corresponding web data and stores it on the local disk, places the URL of the fetched page together with its local storage address in the page storage queue, and passes the URL of the original page data, the fetched page data and the local storage address together to the web information storage module;

and the web information storage module uses a data access interface (DAO) in Hibernate form. The corresponding table in the database is the LocalRecord table, which comprises the following fields: (1) url field: the URL of the page; (2) LocalDir field: the local storage address of the web data; (3) Parsed field: the method for parsing this page; (4) Processed field: whether this page has been parsed. The corresponding Java class is LocalRecord, and the corresponding DAO is LocalRecordDAO.

7. The method according to claim 5, further comprising:

the web data normalization subunit analyzing the original web page data and extracting the title, author, time, main content and comments; the original web page data extraction module obtaining the original pages saved locally and placing them in the queue of pages to be analyzed; the web page analysis module taking original page data from that queue, parsing it using the parsing method provided by the web page parsing database, and placing the URL of each processed page in the parsing record queue; the original web page data extraction module at the same time marking the parsed pages stored in the original web page database; the web page analysis module storing the analyzed data in the normalized page queue to await processing by the normalized data storage module; and the normalized data storage module taking the analyzed web data from the normalized page queue and storing it in the normalized web database through the data access interface;

wherein the original web page data extraction module, the web page analysis module and the normalized data storage module are each implemented as a separate thread, and multiple threads are created for page parsing.

8. The method according to claim 7, further comprising:

the queue of pages to be analyzed is a linked list of LocalRecord objects; it supplies analysis data to the web page analysis module and balances the speeds of the web page analysis module and the original web page data extraction module,

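private String title;            // page title
private String url;              // page URL
private String content;          // main content of the page
private String date;             // publication time
private String author;           // author
private List<Comment> comments;  // comments on the page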

Each HTML page is treated as a tree, and the information of the page is obtained through the corresponding tags. To obtain the main content, each tag node of the input HTML page is traversed in order. For a text node, the text is checked: if it is longer than 30 characters and contains more Chinese characters than English characters, it is extracted as main content of the page. For an HTML tag node that has child nodes, the child nodes are extracted and each is analyzed in the same way, until all nodes of the page have been analyzed.
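The text-node rule above can be sketched in Java as follows. The Node class is a hypothetical stand-in for a parsed HTML tree, not the htmlParser API. Counting only code points in the basic CJK Unified Ideographs block as Chinese characters, and only ASCII letters as English characters, is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class MainContentExtractor {
    /** Hypothetical stand-in for a parsed HTML node; text is null for tag nodes. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** True when the text is longer than 30 characters and has more Chinese than English characters. */
    static boolean isMainContent(String text) {
        int chinese = 0;
        int english = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x4E00 && c <= 0x9FFF) {
                chinese++;   // assumed range: basic CJK Unified Ideographs
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                english++;
            }
        }
        return text.length() > 30 && chinese > english;
    }

    /** Visits the tree in document order and appends every qualifying text node to out. */
    static void collect(Node node, StringBuilder out) {
        if (node.text != null) {
            if (isMainContent(node.text)) out.append(node.text);
            return;
        }
        for (Node child : node.children) collect(child, out);
    }
}
```

Short navigation labels and English-heavy script or markup fragments fail the test, so only the longer Chinese passages of the page body are kept.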

9. The method according to claim 8, characterized by further comprising:

First step: for a complex network G = (V, E), where V is the set of nodes and E is the set of edges, the adjacency matrix A of the network G is input and preprocessed to obtain the preprocessed adjacency matrix A'. The preprocessing removes the nodes of degree 1 from the network, that is, the rows and columns corresponding to those nodes are removed from the input adjacency matrix,

Second step: for the preprocessed adjacency matrix A', the topology of the network is analyzed. Starting from each node, all other nodes of the network are searched breadth-first to obtain the distance between any two nodes, from which the distance matrix D is built. The distance between two nodes is the number of edges on the shortest path between them in the network,

Third step: based on the distance matrix of the network nodes, the network potential energy of the whole network is calculated. Each node of the network is regarded as the source of a gravitational field, which gives the potential energy between any two nodes of the network.

The computing formula is as follows:

where R is a constant that can be set to a positive number, and D is the distance matrix obtained in the second step,

Computing formula is as follows:

Fifth step: remove the edge with the largest value and check whether an independent sub-network has been produced. If not, return to the fourth step. If an independent sub-network has been produced, check whether the sub-networks produced by the split conform to the predefined strong or weak community structure: if they do, return to the first step and continue; if they do not, the method ends and proceeds to the sixth step.