CN103023714B - Topic-based network activity and cluster structure analysis system and method - Google Patents
- Publication number: CN103023714B
- Application number: CN201210477317.5A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A topic-based network activity and cluster structure analysis system comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for user interaction and for displaying analysis results. The data acquisition and normalization module receives a URL specified by the user and obtains and normalizes network data through a network data crawling subunit and a web page data normalization subunit. The data storage module stores the normalized web page data and supplies analysis data to the application analysis module. Building on web page clustering and hotspot mining, the application analysis module mines topic activity and community structure in depth and presents the results to the user through the user interaction and display module. The invention overcomes limitations of existing network public opinion systems, such as reliance on a single detection method and the inability to mine web content, and addresses deep information mining problems in web page analysis, namely cluster structure mining and state assessment.
Description
Technical field
The present invention relates to a system in the field of network public opinion monitoring and analysis, specifically a topic-based network activity and cluster structure analysis system and method.
Background technology
With the rapid development of the Internet, and especially with the emergence of applications such as blogs, microblogs, and forums, the network has become the main medium through which large amounts of information spread in modern life and the main platform where netizens obtain information, post comments, and carry out online promotion and marketing. It has been called the "fourth medium" after newspapers, radio, and television. Monitoring network public opinion in real time and guiding it correctly bears on national security, social harmony, management decisions, and business survival, and is increasingly important to government departments such as government offices, external publicity offices, and online publicity offices. Current network public opinion systems use web crawler technology (often called web spiders, web robots, or web chasers in the FOAF community), that is, programs or scripts that automatically fetch web information according to certain rules, to obtain large amounts of network data. They then filter and denoise the data and find the hot topics spreading on the network through word segmentation, clustering, statistical analysis, and similar methods.
However, because most websites now use Ajax, comment data is generally added to the page dynamically by JavaScript, which makes it difficult for web crawlers to obtain the comments. Current analysis of web page comments therefore stops at identifying the opinion topics of comments: the comment text is segmented with a word segmentation tool, and opinion topic words are obtained by statistical methods. These systems neither mine nor assess the cluster structure of network topic data, and cannot obtain the deeper relationships in network data.
A further search found that existing public opinion systems neither analyze web page comment data in depth nor mine and assess the cluster structure of network topic data.
In other words, existing web crawlers have difficulty obtaining the required data accurately, and doing so is technically hard to implement.
Summary of the invention
To address these problems in the prior art, the present invention provides a topic-based system for mining and evaluating network data activity and cluster structure.
The invention is achieved through the following technical solution. It comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for user interaction and for displaying analysis results. The data acquisition and normalization module receives the Uniform Resource Locator (URL) specified by the user and obtains and normalizes network data through the network data crawling subunit and the web page data normalization subunit. The data storage module stores the normalized web page data and supplies analysis data to the application analysis module. Building on web page clustering and hotspot mining, the application analysis module uses the hotspot mining results to analyze the normalized data further, mines topic activity and community structure in depth, and presents the analysis results to the user through the user interaction and display module.
1. The data acquisition and normalization module comprises a network data crawling subunit and a web page data normalization subunit. The network data crawling subunit obtains the web page data of the specified websites and saves the page data together with each page's local storage address. The web page data normalization subunit processes the raw page data and stores the key page information it extracts in the normalized web page database.
The network data crawling subunit reads the SeedURL table to obtain the seed URLs, then uses the web link crawler module to obtain the links of the specified websites and store them in the web link queue. The page fetching module takes page URLs from the web link queue, fetches each whole page, and passes the page data, URL, and local storage address to the page information storage module, which stores them in the raw page database.
The SeedURL table specifies the initial URLs from which the crawler starts. Each record opens one crawler thread, and that thread crawls only the pages of its specified website. The table contains the following fields: (1) url: the specified URL link; (2) Parser: the parsing method for this URL; (3) inUse: whether this seed URL is in use; (4) finish: whether crawling of this URL has finished; (5) Depth: the number of levels to crawl.
The web link crawler module starts one link crawler thread per record in the SeedURL table. Each thread crawls all links within the specified number of levels of its website and adds them to the web link queue. Each link crawler thread maintains a URL filter database that holds information about the pages already crawled. When the system obtains a URL, it checks the URL filter database: if a record for the URL exists, the URL has already been crawled and is discarded; if not, the URL is added to the URL queue to await processing by the network data request module. The URL filter database is implemented with a HashSet in Java. The network data request module uses objects from the htmlParser package to fetch data from the network, passes the page data to the URL extraction module, and then stores the processed URL in the URL filter database. The URL extraction module uses the LinkBean object from the htmlParser package to extract URLs, which pass through the URL filter database into the web link queue and the URL queue. Once the module has crawled the specified number of levels, it empties the URL filter database and starts crawling again from the seed URL. Link crawler threads are created according to the records in the SeedURL table in the database: every record whose inUse field is 0 gets its own thread in the web link crawler module. The module's main thread checks the SeedURL table every ten minutes for newly added records. The link crawler thread copies the Parser field of the SeedURL record into each LocalRecord object it adds to the link queue, to specify how the page is to be parsed.
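The duplicate check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `UrlFilter` and its methods are hypothetical, and for brevity a URL is marked as seen when it is queued, whereas the text stores it in the filter after the page request has been processed.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical per-thread URL filter backed by a HashSet, as the text describes. */
final class UrlFilter {
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> urlQueue = new ArrayDeque<>();

    /** Queues the URL if it has not been seen; returns false for a duplicate. */
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // a record already exists: discard the URL
        }
        urlQueue.add(url);
        return true;
    }

    /** Returns the next URL awaiting a page request, or null if the queue is empty. */
    String poll() {
        return urlQueue.poll();
    }

    /** Clears the crawl history, for example once the configured depth is reached. */
    void reset() {
        seen.clear();
    }

    /** Number of URLs still waiting in the queue. */
    int pending() {
        return urlQueue.size();
    }
}
```

Because each link crawler thread owns its own filter in the description, the sketch uses unsynchronized collections; a filter shared between threads would need a concurrent set instead.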
The page fetching module takes records from the page link queue, requests the corresponding page data, and stores it on the local disk. It places the fetched page's URL and local storage address in the page storage queue, and sends the raw page data, the URL, and the local storage address to the page information storage module.
The page information storage module uses a Hibernate-style data access object (DAO). The corresponding database table is LocalRecord, with the following fields: (1) url: the URL of the page; (2) LocalDir: the local storage address of the page data; (3) Parsed: the method used to parse this page; (4) Processed: whether this page has been parsed. The corresponding Java class is LocalRecord, and the corresponding DAO is LocalRecordDAO.
The web page data normalization subunit analyzes the raw page data and extracts the title, author, time, main content, and comments. The raw page data extraction module obtains the raw pages saved locally and places them in the pending page queue. The web page analysis module takes raw page data from the pending page queue, parses it with the parsing method supplied by the web page parsing library, and places the URL of each processed page in the parse record queue. The raw page data extraction module then marks the parsed pages in the raw page database. The web page analysis module places the parsed data in the normalized page queue to await the normalized data storage module, which takes the parsed data from that queue and stores it in the normalized web page database through the data access interface. The raw page data extraction module, the web page analysis module, and the normalized data storage module each run as a separate thread, and multiple threads are started for web page parsing.
The raw page data extraction module periodically obtains page data from the raw page database and stores it in the pending page queue. Based on the entries in the parse record queue, it sets the Processed field of the corresponding records in the raw page database to 1, indicating that the pages have been parsed.
The pending page queue is a linked list of LocalRecord objects. It supplies data to the web page analysis module and balances the speeds of the web page analysis module and the raw page data extraction module.
The web page analysis module takes the parsing method named in the page record, retrieves the corresponding method from the web page parsing library, and uses it to parse the page. It places the parsing result of the raw page data in the normalized page queue and stores the identifier of each parsed page in the parse record queue.
The web page parsing library is a collection of page parsing classes for different websites. Every class in the library inherits from a common abstract base class.
Each parsing class overrides the base class's parsing method in the object-oriented manner, so that every website implements its own parsing method. The type returned by the method, FormatForParseResult, defines the following fields:
private String title;
private String url;
private String content;
private String date;
private String author;
private String site; // the website the page belongs to; set by the main thread, so concrete parsing methods need not set it
private List<Comment> comments;
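The base class definition itself does not appear in this text, so the following is only a sketch of how such a parsing library could be organized. The names `PageParser`, `parse`, and `TitleOnlyParser`, as well as the fields of `Comment`, are hypothetical; only `FormatForParseResult` and its field names come from the description. The fields are left package-private here to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical comment holder; the text names the type but not its fields. */
final class Comment {
    final String author;
    final String content;

    Comment(String author, String content) {
        this.author = author;
        this.content = content;
    }
}

/** Result type carrying the fields listed in the description. */
final class FormatForParseResult {
    String title;
    String url;
    String content;
    String date;
    String author;
    String site; // set by the main thread, not by concrete parsers
    List<Comment> comments = new ArrayList<>();
}

/** Hypothetical abstract base class; the name and signature are assumptions. */
abstract class PageParser {
    abstract FormatForParseResult parse(String url, String html);
}

/** Toy site-specific parser that extracts only the title element. */
final class TitleOnlyParser extends PageParser {
    @Override
    FormatForParseResult parse(String url, String html) {
        FormatForParseResult result = new FormatForParseResult();
        result.url = url;
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        // "<title>" is 7 characters long, so the text starts at start + 7.
        result.title = (start >= 0 && end > start)
                ? html.substring(start + 7, end).trim()
                : "";
        return result;
    }
}
```

A site-specific class would replace the string search with real HTML parsing; the point of the sketch is only the override structure.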
Given the characteristics of web markup languages, the parsing library uses htmlParser in Java to parse HTML pages. Each HTML page is treated as a tree, and the page information is obtained through the corresponding tags. To obtain the main content of an input HTML page, every tag node in the page is traversed in order. For a text node, the parser checks whether the text is longer than 30 characters and contains more Chinese characters than English characters; if so, the text is extracted as main content. For an HTML tag node with child nodes, each child node is analyzed in the same way until all nodes in the page have been analyzed. To obtain comments, the URL of the comment page must be constructed for Netease and Sina news pages, and the comment information is then fetched from it. The comment URLs are constructed as follows. The Netease comment URL is "http://comment." + channelID + ".163.com/data/" + tieChannel + "/df/" + articleID + "_" + pageNum + ".html". To obtain channelID, all script nodes of the news page are traversed to find the script node containing the keyword "Site ID"; the content inside the quotation marks after "ntes_nacc=" in that node is the channelID. Next, the script node containing the keyword "tieAnywhere.HotTieArea" is located, and the second and third parameters of that function are taken as articleID and tieChannel. In the same script node, the value of the replyCount field, taken as a number and divided by the number of comments per page, gives pageNum. The Sina comment URL is "http://comment5.news.sina.com.cn/page/info?format=js&jsvar=pagedata&channel=" + Channel + "&newsid=" + NewsId + "&group=0&page=1", where Channel and NewsId are parameters. The script node containing sinaCMNT.embed.init is located in the Sina news page; in that function's parameter string, the part after "channel:" is the value of Channel and the part after "newsid:" is the value of NewsId.
The normalized page queue buffers the parsed FormatForParseResult objects until the normalized data storage module periodically stores them in the database.
The normalized data storage module periodically stores the data in the normalized page queue in the normalized web page database through the data access interface.
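The main-content test and the Netease comment URL pattern described for the parsing library can be sketched as below. The class and method names are hypothetical, "Chinese character" is approximated by the CJK Unified Ideographs range, and the parameter values in any call are illustrative rather than real channel or article identifiers.

```java
/** Hypothetical helpers for the text-node heuristic and the Netease comment URL. */
final class ContentHeuristics {
    /** True if the text is longer than 30 characters and has more Chinese than English letters. */
    static boolean isMainContent(String text) {
        if (text.length() <= 30) {
            return false;
        }
        int chinese = 0;
        int english = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\u4e00' && c <= '\u9fff') {
                chinese++; // CJK Unified Ideographs block (an approximation)
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                english++;
            }
        }
        return chinese > english;
    }

    /** Assembles the Netease comment URL from the four parts named in the description. */
    static String neteaseCommentUrl(String channelId, String tieChannel,
                                    String articleId, int pageNum) {
        return "http://comment." + channelId + ".163.com/data/" + tieChannel
                + "/df/" + articleId + "_" + pageNum + ".html";
    }
}
```

The length threshold of 30 and the Chinese-versus-English comparison come from the description; how punctuation and digits are counted is not stated, so the sketch simply ignores them.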
2. The data storage module is the normalized web page database. It stores the page title, author, time, main content, comments, and other information produced by the web page data normalization subunit. The database contains three tables. The page basic information table has the fields: (1) url: the URL of the page; (2) Title: the title of the page; (3) Publisher: the publisher of the page; (4) Data: the publication time of the page; (5) Site: the website the page belongs to; (6) id: the primary key of the table. The page main content table has the fields: (1) url: the URL of the page; (2) MainContent: the main content of the page; (3) DataId: the primary key of the table. The comment table has the fields: (1) url: the URL of the page; (2) Commented: the ID that posted the comment; (3) BeCommentedID: the ID the comment is addressed to (if the comment is not addressed to a particular individual, it is taken to be the publisher of the page); (4) CommentContent: the content of the comment; (5) CommentDate: the time of the comment; (6) DataId: the primary key of the table. The database is accessed through Hibernate: each table is mapped to a class and each record becomes an object of that class. The three classes are Data, Content, and Comment. The three tables are linked by the id value of the page basic information table: each time a record is stored in the page basic information table, its id value is obtained and assigned to the DataId field of the main content and comment objects that are about to be stored.
3. The application analysis module comprises a web page clustering subunit, a hotspot mining subunit, a topic activity analysis and visualization unit, and a community structure mining and visualization unit. Building on the clustering performed by the web page clustering subunit, the hotspot mining subunit analyzes the content of the raw page data to obtain hot topics; it can also search for hotspots related to keywords supplied by the user. The topic activity analysis and visualization unit and the community structure mining and visualization unit build on the hotspot mining subunit to analyze the network structure and page content in depth, and then feed the results back to the user interaction and display module, which presents them visually.
The web page clustering subunit segments the main content of each page into words, computes the weight of each word, and maps the page content to a vector in a vector space. It then clusters the pages from a specified time period with the K-means method, so that the network data is classified by subtopic.
The hotspot mining subunit uses the clustering results of the web page clustering subunit together with the page data. It segments the titles of all pages collected in a given time period, extracts popular words according to properties such as part of speech, and mines hot topics from the resulting popular vocabulary. It can also mine hotspots related to keywords specified by the user. Hot topics are mined mainly with the maximum-comment method, which treats the page titles with the most comments in a period as that period's hotspots: the 100 pages with the most comments in the period are taken from the comment table, their pairwise similarity is computed, pages with very high similarity are merged into one class, and the words shared within each class are extracted as the hot topic of that class.
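The pairwise similarity and merging step can be illustrated with the sketch below. The description does not name the similarity measure or the merging rule, so cosine similarity over term counts and a greedy threshold grouping are assumptions, as are the class name `TitleGrouping` and the whitespace tokenizer (the system described here would use Chinese word segmentation instead).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: term-count vectors, cosine similarity, greedy grouping. */
final class TitleGrouping {
    /** Maps text to term counts; whitespace splitting stands in for word segmentation. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two term-count vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** A title joins the first group whose first member is at least threshold-similar. */
    static List<List<String>> group(List<String> titles, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String title : titles) {
            List<String> home = null;
            for (List<String> g : groups) {
                if (cosine(termCounts(title), termCounts(g.get(0))) >= threshold) {
                    home = g;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(title);
        }
        return groups;
    }
}
```

The same term-count vectors could serve as input to the K-means step of the clustering subunit; the greedy grouping is used here only because it is short enough to read at a glance.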
The community structure mining and display unit retrieves the page comments related to a specified keyword on each website and builds a network graph from the message relationships between comments. It analyzes the community structure of the network with the community mining method based on network potential energy. It then places the nodes of the graph at random positions in a specified window, treats each edge between nodes as a spring, and uses a force analysis of the system to compute the relative positions of the nodes, so that the whole network is laid out and the nodes of different communities are shown in different colors.
The community mining method based on network potential energy introduces the physical concept of potential energy into complex network analysis, defines a network potential energy, and mines the community structure of a complex network by optimizing the network's potential energy function. The method is implemented as follows. After the nodes of degree 1 are removed, the preprocessed adjacency matrix A' is obtained; a breadth-first search over A' computes the number of edges on the shortest path between every pair of nodes, which gives the distance matrix D. The potential energy between any two nodes is computed from R and D (where R is a constant and D is the distance matrix), and from these values the network potential energy of the whole network G is obtained. For each edge $e_k$ in the network, the difference between the network potential energy of the sub-network $G_k = G - \{e_k\}$, obtained by deleting that edge, and the network potential energy of the original network G is computed. The edge with the largest difference is removed. If no sub-network is produced, the edge with the largest difference in the current network is removed next. If a new disconnected sub-network is produced, the split is checked against the strong and weak community structure definitions to see whether the generated sub-networks meet the preset strong or weak community structure. If they do, a new distance matrix D is computed on the current network structure and the potential energy steps above are repeated to split the network further. If they do not, the nodes of degree 1 removed during preprocessing are merged into the community of the node they are directly connected to, which yields the community partition of the network nodes.
Step 1: for a complex network G = (V, E), V denotes the set of nodes and E the set of edges. The input adjacency matrix A of network G is preprocessed to obtain the preprocessed adjacency matrix A'. Preprocessing means removing the nodes of degree 1 from the network, that is, removing the rows and columns corresponding to those nodes from the input adjacency matrix.
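A minimal sketch of this preprocessing step is given below, assuming a symmetric 0/1 adjacency matrix. The class name `LeafPruning` is hypothetical, and the sketch performs a single pass: nodes that only become degree 1 after the removal are kept, since the description does not say whether pruning is repeated.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of removing degree-1 nodes from an adjacency matrix. */
final class LeafPruning {
    /** Indices of the nodes whose degree in the symmetric 0/1 matrix is not 1. */
    static List<Integer> keptNodes(int[][] a) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            int degree = 0;
            for (int j = 0; j < a.length; j++) {
                degree += a[i][j];
            }
            if (degree != 1) {
                kept.add(i);
            }
        }
        return kept;
    }

    /** Builds A' by dropping the rows and columns of all degree-1 nodes. */
    static int[][] prune(int[][] a) {
        List<Integer> kept = keptNodes(a);
        int[][] pruned = new int[kept.size()][kept.size()];
        for (int r = 0; r < kept.size(); r++) {
            for (int c = 0; c < kept.size(); c++) {
                pruned[r][c] = a[kept.get(r)][kept.get(c)];
            }
        }
        return pruned;
    }
}
```

Keeping the list returned by `keptNodes` makes it possible to map rows of A' back to the original node numbers when the removed nodes are reattached in the last step.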
Step 2: analyze the topology of the network using the preprocessed adjacency matrix A'. Starting from each node, a breadth-first search visits the other nodes of the network and yields the distance between every pair of nodes, from which the distance matrix D is built. The distance between two nodes is the number of edges on the shortest path between them.
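The breadth-first construction of D can be sketched as follows for an unweighted, undirected network stored as an adjacency matrix. The class name `ShortestPaths` and the use of -1 for unreachable pairs are choices made for this illustration, not details taken from the description.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

/** Hypothetical sketch: all-pairs hop distances by repeated breadth-first search. */
final class ShortestPaths {
    static final int UNREACHABLE = -1;

    /** Returns D where D[s][v] is the number of edges on a shortest path from s to v. */
    static int[][] distanceMatrix(int[][] a) {
        int n = a.length;
        int[][] d = new int[n][n];
        for (int s = 0; s < n; s++) {
            Arrays.fill(d[s], UNREACHABLE);
            d[s][s] = 0;
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int u = queue.remove();
                for (int v = 0; v < n; v++) {
                    // First visit of v from s gives its shortest hop count.
                    if (a[u][v] != 0 && d[s][v] == UNREACHABLE) {
                        d[s][v] = d[s][u] + 1;
                        queue.add(v);
                    }
                }
            }
        }
        return d;
    }
}
```

With an adjacency matrix each search costs on the order of n squared operations, so the full matrix costs on the order of n cubed; an adjacency list would be the usual choice for large sparse networks.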
Step 3: compute the network potential energy of the whole network from the node distance matrix. Each node in the network is regarded as the source of a gravitational field, which gives a potential energy between any two nodes in the network. This potential energy is computed from a constant R, which can be set to a positive number, and the distance matrix D obtained in step 2.
The network potential energy of the whole network G is the sum of the potential energies between all the nodes that the network contains.
Step 4: for each edge $e_k$ in the network, compute the difference between the network potential energy of the sub-network $G_k = G - \{e_k\}$, obtained by deleting that edge, and the network potential energy of the original network G.
Step 5: remove the edge with the largest potential energy difference and check whether an independent sub-network has been generated. If not, return to step 4. If an independent sub-network has been produced, check whether the sub-networks generated by the split meet the preset strong or weak community structure. If they do, go to step 1 and continue the method; if they do not, stop splitting and go to step 6.
Step 6: rebuild the original network graph from the result of step 5. The nodes removed by preprocessing in step 1 are added back to the original network graph and assigned to the community of the node they are directly connected to.
The topic activity analysis and visualization unit is based on a study of topic-based network clusters in forums and uses a hierarchical model and calculation method for quantitatively evaluating cluster activity. The model is divided from top to bottom into three levels: a cluster layer, an individual layer, and a behavior layer. It takes into account factors such as the scale of the cluster and differences in individual behavior, builds an activity assessment model for the whole cluster, and quantifies the active state, giving a quantitative method for assessing how active a cluster is. The method is as follows.
An ID that takes part in interaction on a BBS is called an individual. Each individual has three kinds of behavior: (1) raising a topic, called posting a topic post; (2) replying to a topic, called posting a reply; (3) not joining any discussion but browsing the forum's topics as a visitor, called browsing. Posting a topic post and posting a reply are together called posting behavior. The set of individuals linked by one topic, that is, all individuals with posting or browsing behavior on the same topic, is called a cluster. By behavior, the individuals in a cluster fall into two classes. IDs that actually take part in the discussion, that is, those with posting behavior (topic posts or replies), are called real individuals. Individuals that do not post but have browsed the topic are called virtual individuals. The network cluster is divided from top to bottom into a cluster layer, an individual layer, and a behavior layer, and it is evaluated with a bottom-up strategy.
The following concepts and symbols are defined.
Definition 1 (cluster): $C(T, I, B)$, where
$T$ is the topic on which the cluster is based;
$I = \{i_1, \ldots, i_m\}$ is the set of individuals participating in topic $T$, with $i_k \in \{real, virtual\}$ and $m$ finite;
$B = \{post_{fst}, post_{rpl}, browse\}$ is the set of individual behaviors, where $post_{fst}$ is posting a topic post, $post_{rpl}$ is posting a reply, and $browse$ is browsing.
The individuals and their behaviors satisfy a condition whose formula is not reproduced in this text, in which $b_{i_k}$ denotes the behavior of individual $i_k$.
Definition 2 (weakening factor ω): after an individual behavior occurs, its activity declines continuously over time. The factor ω reflects how quickly the activity of an individual behavior weakens; its range is (0, 1).
Definition 3 (half-life): the time interval needed for the activity of an individual behavior to fall to 1/2 of its initial value. The weakening factor and the half-life are related to each other (the formula for this relation is not reproduced in this text).
A real individual in a cluster can post many times. A posting behavior within its half-life still has high activity and reflects the activity of the individual, so it counts as a valid behavior. A posting behavior beyond its half-life has lost more than half of its activity and is treated as an invalid behavior.
The following symbolic variables are defined:
Table 1. Explanation of symbolic variables
The activity assessment model of the cluster is divided from top to bottom into three levels: the cluster layer, the individual layer, and the behavior layer.
The behavior layer divides the individuals of a topic-based network cluster into two classes, real individuals and virtual individuals. The two classes correspond to two different types of behavior, posting and browsing. The activity of these two behaviors is quantified as follows.
(1) post: suppose posting behavior $x$ occurs at time $T$ and its activity at that moment is defined as the unit value 1. The activity $B_{ijx}^{post}(t)$ is then given by a formula that is not reproduced in this text.
(2) browse: browsing contributes far less to cluster activity than posting, with a value in (0, 1). Its formula is likewise not reproduced in this text.
Statistical analysis of a large amount of historical data shows that, when $t$ is sufficiently large, $B_{il}^{browse}(t) \approx 10^{-3}$.
The individual layer defines separate activity formulas for real individuals and virtual individuals:
(1) The activity of real individual $j$ at time $t$ is given by formula (3), which is not reproduced in this text.
(2) The activity of virtual individual $l$ at time $t$ is:
$N_{il}^{browse}(t) = B_{il}^{browse}(t)$    (4)
The activity of the cluster layer is determined by the activity of all real individuals and virtual individuals in the cluster at time $t$. The formula is not reproduced in this text; its terms are as follows:
1) The real-individual activity vector of cluster $i$ at time $t$. Its elements $N_{ij}^{post}(t)$ ($j = 1, \ldots, p$) are the real-individual activities computed by formula (3), and $p$ is the number of real individuals in the cluster.
2) The importance weight vector of the real individuals in cluster $i$ at time $t$.
3) The virtual-individual activity vector of cluster $i$ at time $t$. Its elements $N_{il}^{browse}(t)$ ($l = 1, \ldots, q$) are the virtual-individual activity values computed by formula (4), and $q$ is the number of virtual individuals in the cluster.
4) The importance weight vector of the virtual individuals in cluster $i$ at time $t$.
5) The importance weight of real individual $j$ in cluster $i$ at time $t$ is $W(N_{ij}^{post}(t))$. In its formula, $b_j$ denotes the behavior of real individual $j$; the numerator $PNum_{ij}(t)$ is the total of real individual $j$'s historical posting behaviors, that is, all of its posting behaviors outside the half-life; the denominator is the maximum, over all individuals, of the total posting behaviors outside the half-life.
6) The importance weight of virtual individual $l$ in cluster $i$ at time $t$ is $W(N_{il}^{browse}(t)) = 1$.
7) Determining ω: statistical analysis of a large amount of collected data shows that in the "Tianya Zatan" forum of the Tianya community the half-life of posting behavior is about 3 hours, so its weakening factor is ω ≈ 0.71.
8) $W(C_i(t))$ can be defined as the ratio of the number of individuals in the cluster to the number of users online in the whole forum. This weight is used to compare activity between different clusters.
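The two weights described in items 5) and 8) can be sketched as below. The formulas themselves are not reproduced in this text, so the code follows only the prose: an individual's weight is its historical post count divided by the largest such count in the cluster, and the cluster weight is the cluster size divided by the number of users online. The class name `ActivityWeights`, the method names, and the handling of an all-zero input are assumptions.

```java
/** Hypothetical sketch of the importance weights described in items 5) and 8). */
final class ActivityWeights {
    /**
     * Divides each real individual's historical post count by the maximum count.
     * Returns all zeros when no individual has any historical posts (an assumption).
     */
    static double[] realIndividualWeights(int[] historicalPosts) {
        int max = 0;
        for (int posts : historicalPosts) {
            max = Math.max(max, posts);
        }
        double[] weights = new double[historicalPosts.length];
        for (int j = 0; j < weights.length; j++) {
            weights[j] = (max == 0) ? 0.0 : (double) historicalPosts[j] / max;
        }
        return weights;
    }

    /** Ratio of the individuals in the cluster to the users online in the whole forum. */
    static double clusterWeight(int clusterIndividuals, int forumOnlineUsers) {
        if (forumOnlineUsers <= 0) {
            throw new IllegalArgumentException("forumOnlineUsers must be positive");
        }
        return (double) clusterIndividuals / forumOnlineUsers;
    }
}
```

For example, historical post counts of 2, 4, and 1 would give weights of 0.5, 1.0, and 0.25 under this reading, so the most prolific individual always has weight 1.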
The user interaction and display module receives the user's queries, parameter settings, system control commands, and similar input, and presents the results of modules such as hotspot analysis and community mining in visual form.
Compared with the prior art, the invention overcomes limitations of existing network public opinion systems, such as a single detection method, simplistic analysis results, and the inability to mine web content. It addresses deep information mining problems in web page analysis, namely cluster structure mining and state assessment, and supports tasks such as public opinion behavior prediction from two aspects: network topic activity and cluster structure. Among public opinion information systems, the invention is the first to take the topic as its core in analyzing cluster activity and cluster structure. With it, web page information can be analyzed to obtain hot network topics, and on that basis the inherent structural features and activity of network cluster behavior can be mined. The invention therefore has good application prospects and can provide useful information for guiding and controlling public emergencies. In terms of implementation, the invention is highly feasible and offers high precision.
Brief description of the drawings
Fig. 1 is a structural diagram of the invention.
Fig. 2 is a structural diagram of the web data crawling unit.
Fig. 3 is a diagram of the web link crawler module.
Fig. 4 is a diagram of the web page data normalization module.
Fig. 5 is a diagram of the hierarchical quantitative evaluation model for network cluster state.
Detailed description of the embodiments
An embodiment of the present invention is described in detail below. The embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and a concrete operating procedure are given, but the protection scope of the present invention is not limited to the following embodiment.
As shown in Fig. 1, the present embodiment comprises a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module. The user interaction and display module provides the interface for interacting with the user and displaying data analysis results. The data acquisition and normalization module receives the URL specified by the user and obtains and normalizes network data through the network data crawling subunit and the web data normalization subunit. The data storage module stores the normalized web data and provides analysis data to the application analysis module. The application analysis module builds on web page clustering and hot-spot mining: it uses the hot-spot mining results to analyze the normalized data further, mines topic liveness and community structure in depth, and shows the analysis results to the user through the user interaction and display module.
The data acquisition and normalization module comprises the network data crawling subunit and the web data normalization subunit. The network data crawling subunit obtains the web data of the specified website and saves the web data together with the local storage address of each page. The web data normalization subunit processes the raw page data and stores the key page information extracted by the analysis in the normalized web database.
As shown in Fig. 2, the network data crawling subunit reads the SeedURL table to obtain the seed URLs, uses the web page link crawler module to obtain the links of the specified website, and stores them in the web page link queue. The web page fetch module takes page URLs from the web page link queue, fetches each whole page, and stores the page data, the URL, and the local storage address in the raw page database through the web page information storage module.
As shown in Fig. 3, the web page link crawler module starts multiple link crawler threads according to the number of records in the SeedURL table. Each thread crawls all links within the specified number of layers of its specified website and adds them to the web page link queue. Each link crawler thread maintains a URL filter database that stores information on the pages already crawled. When the system obtains a URL, it checks the URL filter database: if a record for the URL already exists, the URL has been crawled and is discarded; if no record exists, the URL is added to the URL queue to wait for the network data request module to process its page. The URL filter database is implemented with a Java HashSet. The network data request module uses objects in the htmlParser package to fetch data from the network, passes the page data to the URL extraction module, and then stores the processed URL in the URL filter database. The URL extraction module uses the LinkBean object in the htmlParser package to extract URLs, which pass through the URL filter database before being placed in the web page link queue and the URL queue. Once the web page link crawler module has crawled the set number of layers, it clears the records in the URL filter database and starts crawling again from the seed URL. Link crawler threads are created according to the records in the SeedURL table of the database: every record whose inUse field is 0 starts one thread in the web page link crawler module. Every ten minutes the main thread of the module checks the SeedURL table for newly added records. The Parser field of a SeedURL record is copied into the LocalRecord objects in the link queue produced by the corresponding link crawler thread, to specify how each page is to be parsed.
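The deduplication and layer limit described above can be illustrated with a minimal sketch. It is not the patented implementation: the class name `LinkCrawlerSketch`, the method `crawl`, and the in-memory `linkGraph` (which stands in for fetching a page and extracting its links with htmlParser) are hypothetical, and for brevity a URL is recorded in the filter when it is discovered rather than after its page has been requested.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of one link crawler thread's deduplicating, layer-limited crawl. */
final class LinkCrawlerSketch {

    /** A URL paired with the layer (depth) at which it was discovered. */
    private static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Crawls breadth-first from seedUrl down to maxDepth layers.
     * linkGraph is a hypothetical stand-in for "request the page and extract its links".
     * Returns the URLs in the order they were accepted into the link queue.
     */
    static List<String> crawl(String seedUrl, int maxDepth, Map<String, List<String>> linkGraph) {
        Set<String> urlFilter = new HashSet<>();     // plays the role of the URL filter database
        Deque<Entry> urlQueue = new ArrayDeque<>();  // URLs whose pages still have to be requested
        List<String> linkQueue = new ArrayList<>();  // plays the role of the web page link queue

        urlFilter.add(seedUrl);
        urlQueue.add(new Entry(seedUrl, 0));
        linkQueue.add(seedUrl);

        while (!urlQueue.isEmpty()) {
            Entry current = urlQueue.poll();
            if (current.depth >= maxDepth) {
                continue; // pages on the last layer are not expanded further
            }
            for (String link : linkGraph.getOrDefault(current.url, List.of())) {
                // Set.add returns false when the URL is already recorded, so it is discarded.
                if (urlFilter.add(link)) {
                    urlQueue.add(new Entry(link, current.depth + 1));
                    linkQueue.add(link);
                }
            }
        }
        return linkQueue;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "http://example.com/", List.of("http://example.com/a", "http://example.com/b"),
                "http://example.com/a", List.of("http://example.com/", "http://example.com/c"));
        System.out.println(crawl("http://example.com/", 1, site));
    }
}
```

A HashSet gives constant-time membership checks on average, which is presumably why the description names it for the filter; clearing the set after a full pass corresponds to restarting the crawl from the seed URL.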
As shown in Fig. 4, the web data normalization subunit analyzes the raw page data and extracts the title, author, time, main content, and comments of each page. The raw page data extraction module obtains the original web pages saved locally and places them in the queue of pages awaiting analysis. The web page analysis module takes raw page data from that queue, analyzes them with the parsing method provided by the web page parsing library, and places the URL of each processed page in the parsing record queue. The raw page data extraction module then marks the parsed pages stored in the raw page database. The web page analysis module stores the analyzed data in the normalized page queue to await processing by the normalized data storage module, which takes the analyzed web data from that queue and stores them in the normalized web database through the data access interface. The raw page data extraction module, the web page analysis module, and the normalized data storage module are each implemented as a separate thread, and multiple threads are started during web page parsing.
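The three-stage, queue-connected design can be sketched with bounded blocking queues. This is an illustration under assumptions, not the embodiment's code: `NormalizationPipelineSketch`, `run`, and the end marker are hypothetical names, one thread is used per stage, and "parsing" is reduced to trimming and lower-casing a string.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: extraction, analysis and storage threads connected by two queues. */
final class NormalizationPipelineSketch {

    // Hypothetical end-of-input marker passed down the pipeline.
    private static final String END_MARKER = "\u0000END\u0000";

    static List<String> run(List<String> rawPages) {
        BlockingQueue<String> pendingQueue = new LinkedBlockingQueue<>(8);    // pages awaiting analysis
        BlockingQueue<String> normalizedQueue = new LinkedBlockingQueue<>(8); // analyzed pages
        List<String> database = Collections.synchronizedList(new ArrayList<>());

        Thread extractor = new Thread(() -> {
            try {
                for (String page : rawPages) {
                    pendingQueue.put(page); // blocks when the analyzer falls behind
                }
                pendingQueue.put(END_MARKER);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread analyzer = new Thread(() -> {
            try {
                while (true) {
                    String page = pendingQueue.take();
                    if (page.equals(END_MARKER)) {
                        break;
                    }
                    // Placeholder for site-specific parsing of title, author, content, etc.
                    normalizedQueue.put(page.trim().toLowerCase(Locale.ROOT));
                }
                normalizedQueue.put(END_MARKER);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread storer = new Thread(() -> {
            try {
                while (true) {
                    String record = normalizedQueue.take();
                    if (record.equals(END_MARKER)) {
                        break;
                    }
                    database.add(record); // stands in for the data access interface
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        extractor.start();
        analyzer.start();
        storer.start();
        try {
            extractor.join();
            analyzer.join();
            storer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return new ArrayList<>(database);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("  Page One ", "PAGE TWO")));
    }
}
```

The bounded queues let a fast stage wait for a slow one instead of exhausting memory, which matches the speed-balancing role the claims assign to the queue of pages awaiting analysis.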
The data storage module is the normalized web database. This database stores the information generated by the web data normalization subunit, such as page title, author, time, main content, and comments. It comprises three tables: the page basic information table, the page main content table, and the comment table. The database is accessed with Hibernate: each table is mapped to a class and each record in a table becomes an object of that class; the three corresponding classes are Data, Content, and Comment. The three tables are linked by the id value in the page basic information table. Each time a record is stored in the page basic information table, the id value of that record is obtained and assigned to the DataId field of the main content and comment objects that are about to be stored.
The application analysis module obtains data from the normalized web database and mines information in depth using the web page clustering subunit, the hot-spot mining subunit, the topic liveness analysis and visualization subunit, and the community structure mining and visualization subunit. The hot-spot mining subunit builds on the clustering performed by the web page clustering subunit and analyzes the page content of the raw page data to obtain hot topics; it can also run hot-spot queries on keywords supplied by the user. The topic liveness analysis and visualization subunit and the community structure mining and visualization subunit build on the hot-spot mining subunit to analyze network structure and page content in depth, and then feed the analysis results back to the user interaction and display module, which presents them visually. The web page clustering subunit segments the main content of each page into words, computes the weight of each word, and maps the page content to a vector in a vector space. It then clusters the pages of a specified time period with the K-means method, so that the network data are classified by sub-topic.
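As an illustration of the vector-space mapping and K-means step, the following minimal sketch clusters pages that have already been segmented into words. Several points are assumptions rather than details from the description: raw term frequency is used as the word weight, distance is Euclidean, the first k vectors serve as initial centroids, and the names `PageClusteringSketch`, `toVectors`, and `kMeans` are hypothetical. Chinese word segmentation itself is outside the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: term-frequency vectors plus a basic K-means loop. */
final class PageClusteringSketch {

    /** Maps each tokenized page to a term-frequency vector over a shared vocabulary. */
    static double[][] toVectors(List<List<String>> pages) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> page : pages) {
            for (String word : page) {
                vocabulary.putIfAbsent(word, vocabulary.size());
            }
        }
        double[][] vectors = new double[pages.size()][vocabulary.size()];
        for (int p = 0; p < pages.size(); p++) {
            for (String word : pages.get(p)) {
                vectors[p][vocabulary.get(word)] += 1.0; // assumed weight: raw term frequency
            }
        }
        return vectors;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    /** Returns the cluster index of every vector; requires 1 <= k <= vectors.length. */
    static int[] kMeans(double[][] vectors, int k, int iterations) {
        int n = vectors.length;
        int dim = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = vectors[c].clone(); // deterministic initialization for the sketch
        }
        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach every page to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(vectors[i], centroids[c])
                            < squaredDistance(vectors[i], centroids[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: move every non-empty centroid to the mean of its pages.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += vectors[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim && counts[c] > 0; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] vectors = toVectors(List.of(
                List.of("football", "match", "goal"),
                List.of("stock", "market", "price"),
                List.of("football", "goal", "team")));
        System.out.println(java.util.Arrays.toString(kMeans(vectors, 2, 10)));
    }
}
```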
The hot-spot mining subunit uses the clustering results of the web page clustering subunit together with the web data. It segments all page titles collected in a given time period into words, selects candidate words by properties such as part of speech to obtain a list of popular words, and mines hot topics from that list. The subunit can also mine hot spots related to keywords specified by the user. Hot topics are mined mainly with the maximum-comment method, which takes the page titles with the most comments in a period as the hot spots of that period: the 100 pages with the most comments in the period are extracted from the comment table, the pairwise similarity between them is computed, pages with very high similarity are merged into one class, and the words common to each class are extracted as the hot topic of that class of pages.
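The maximum-comment method can be sketched as follows. The description does not specify the similarity measure or the merge procedure, so this sketch assumes Jaccard similarity over title words and a single greedy pass that compares each page with the first page of every existing class; `HotTopicSketch`, `Page`, `jaccard`, and `hotTopics` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the maximum-comment method: top-N pages, merge similar titles, keep common words. */
final class HotTopicSketch {

    /** A page reduced to its segmented title words and its number of comments. */
    static final class Page {
        final int commentCount;
        final Set<String> titleWords;

        Page(int commentCount, String... titleWords) {
            this.commentCount = commentCount;
            this.titleWords = new LinkedHashSet<>(Arrays.asList(titleWords));
        }
    }

    /** Assumed similarity measure: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> common = new LinkedHashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    /** Returns one set of common title words per class of similar, heavily commented pages. */
    static List<Set<String>> hotTopics(List<Page> pages, int topN, double threshold) {
        List<Page> top = new ArrayList<>(pages);
        top.sort(Comparator.comparingInt((Page p) -> p.commentCount).reversed());
        top = top.subList(0, Math.min(topN, top.size()));

        List<Set<String>> classSeeds = new ArrayList<>(); // title words of the first page of each class
        List<Set<String>> topics = new ArrayList<>();     // words common to all pages of each class
        for (Page page : top) {
            boolean merged = false;
            for (int c = 0; c < classSeeds.size() && !merged; c++) {
                if (jaccard(classSeeds.get(c), page.titleWords) >= threshold) {
                    topics.get(c).retainAll(page.titleWords);
                    merged = true;
                }
            }
            if (!merged) {
                classSeeds.add(page.titleWords);
                topics.add(new LinkedHashSet<>(page.titleWords));
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
                new Page(50, "earthquake", "relief", "donation"),
                new Page(40, "earthquake", "relief", "volunteers"),
                new Page(30, "stock", "market"));
        System.out.println(hotTopics(pages, 100, 0.5));
    }
}
```

With `topN` set to 100 the selection step matches the figure given in the description; the threshold of 0.5 in the usage example is arbitrary.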
The community structure mining and display subunit retrieves the page comments related to a specified keyword on each website and builds a network graph from the reply relations between the comments. The community structure of the network is analyzed with a community mining method based on network potential energy. Each node of the graph is then placed at a random position in the specified window, the edges between nodes are treated as springs, and the relative position of each node is computed by force analysis of the system, so that the whole network can be displayed, with the nodes of different network communities shown in different colors. The topic liveness analysis and visualization subunit, based on a study of topic-based network clusters in forums, proposes a hierarchical quantitative evaluation model and calculation method for cluster liveness. From top to bottom the model is divided into three levels, the "cluster layer", the "individual layer", and the "behavior layer". It takes into account factors such as cluster size and differences in individual behavior, builds an active-state assessment model of the whole cluster, quantifies the active state, and provides a quantitative method for assessing how active a cluster is.
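The spring-based placement can be illustrated with a minimal relaxation loop. This is a sketch under assumptions: the description gives no force law, so Hooke's law with a fixed rest length and stiffness is assumed, node repulsion and damping (which practical layouts usually add) are omitted, positions are not clamped to the window, and `SpringLayoutSketch` and `layout` are hypothetical names.

```java
import java.util.Random;

/** Sketch: random initial placement followed by spring relaxation along the edges. */
final class SpringLayoutSketch {

    /**
     * Returns an array of {x, y} positions, one per node.
     * edges holds pairs of node indices; restLength is the assumed natural spring length.
     */
    static double[][] layout(int nodeCount, int[][] edges, double width, double height,
                             double restLength, int iterations, long seed) {
        Random random = new Random(seed);
        double[][] pos = new double[nodeCount][2];
        for (double[] p : pos) {
            p[0] = random.nextDouble() * width;  // random start inside the window
            p[1] = random.nextDouble() * height;
        }
        double stiffness = 0.1; // assumed spring constant, small enough for low-degree graphs
        for (int it = 0; it < iterations; it++) {
            double[][] force = new double[nodeCount][2];
            for (int[] e : edges) {
                double dx = pos[e[1]][0] - pos[e[0]][0];
                double dy = pos[e[1]][1] - pos[e[0]][1];
                double dist = Math.hypot(dx, dy);
                if (dist < 1e-9) {
                    continue; // coincident nodes: direction undefined, skip this spring
                }
                // Hooke's law: pull together when stretched, push apart when compressed.
                double pull = stiffness * (dist - restLength) / dist;
                force[e[0]][0] += pull * dx;
                force[e[0]][1] += pull * dy;
                force[e[1]][0] -= pull * dx;
                force[e[1]][1] -= pull * dy;
            }
            for (int i = 0; i < nodeCount; i++) {
                pos[i][0] += force[i][0];
                pos[i][1] += force[i][1];
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        double[][] pos = layout(3, new int[][] {{0, 1}, {1, 2}}, 800, 600, 100, 200, 1L);
        for (double[] p : pos) {
            System.out.printf("%.1f %.1f%n", p[0], p[1]);
        }
    }
}
```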
Taking the forums of three websites, NetEase, Sina, and Jiao Tong University's "Yinshuisiyuan" (饮水思源), as target sites and collecting data from 2011-06-01 to 2011-07-01, the workflow of the present embodiment comprises the following steps:
(1) After the user inputs a URL, the system takes this URL as the seed, obtains the raw page data of the website through the network data crawling subunit, and saves each original web page and its storage address locally. The web data normalization subunit extracts the title, author, time, and main content of each page and stores them in the normalized web database. The URL format of the specified website is analyzed and the target URL from which comments are fetched via Ajax is constructed, so that the related comments can be obtained and stored in the database.
(2) The web data in the normalized web database are classified by the clustering method. The hot-spot mining subunit uses the web page clustering results to mine hot spots from the network data.
(3) The data in the normalized web database and the analysis results of the hot-spot mining subunit are supplied as input to the topic liveness assessment unit and the community structure mining and visualization unit. The analysis results of these two units are shown to the user together with the analysis results of the hot-spot mining subunit.
The present embodiment implements web information structure mining and topic liveness assessment with a user-friendly interactive interface. It is applicable to web information analysis problems in many fields, such as Internet public opinion analysis, and addresses problems of the prior art such as a single means of detection, simplistic analysis results, and the inability to mine web content. It is functionally complete, modular, extensible, and easy to interact with, and has good prospects for wider adoption.
Claims (9)
1. A system for analyzing liveness and cluster topology based on network topics, characterized by comprising a data acquisition and normalization module, a data storage module, an application analysis module, and a user interaction and display module, wherein:
the user interaction and display module is configured to provide an interface for interacting with the user and displaying data analysis results;
the data acquisition and normalization module is configured to receive the uniform resource locator (URL) specified by the user and to obtain and normalize web data through a web data crawling subunit and a web data normalization subunit; it comprises the web data crawling subunit and the web data normalization subunit, wherein the web data crawling subunit obtains the raw page data of the specified website and saves each original web page together with its local storage address, and the web data normalization subunit analyzes the raw page data using the parsing method provided by a web page parsing library, obtains normalized web data, and saves them;
the data storage module is configured to store the raw page data and the normalized web data of the web pages and to provide analysis data to the application analysis module;
the application analysis module is configured, on the basis of web page clustering and hot-spot mining, to use the hot-spot mining results to further mine topic liveness and community structure from the normalized web data, and to show the results to the user through the user interaction and display module.
2. The system for analyzing liveness and cluster topology based on network topics according to claim 1, characterized in that the web data crawling subunit comprises a web page link crawler module, a web page fetch module, and a web page information storage module, which obtain and store web data by means of a web page link queue and a web page storage queue.
3. The system for analyzing liveness and cluster topology based on network topics according to claim 1, characterized in that the web data normalization subunit further comprises a raw page data extraction module, a web page analysis module, and a normalized data storage module, together with a queue of pages awaiting analysis, a parsing record queue, and a normalized page queue; the raw page data extraction module obtains the original web pages saved locally and places them in the queue of pages awaiting analysis; the web page analysis module takes raw page data from the queue of pages awaiting analysis, analyzes the web data using the parsing method provided by the web page parsing library, and places the URL of each processed page in the parsing record queue; the raw page data extraction module marks the parsed pages stored in the raw page database; the web page analysis module stores the analyzed data in the normalized page queue to await processing by the normalized data storage module; and the normalized data storage module takes the analyzed web data from the normalized page queue and stores them in the normalized web database through the data access interface.
4. The system for analyzing liveness and cluster topology based on network topics according to claim 1, characterized in that the application analysis module comprises a web page clustering subunit, a hot-spot mining subunit, a topic liveness analysis and visualization subunit, and a community structure mining and visualization subunit, wherein the hot-spot mining subunit, on the basis of the clustering performed by the web page clustering subunit, analyzes the page content of the raw page data to obtain hot topics and can also run hot-spot queries on keywords supplied by the user; and the topic liveness analysis and visualization subunit and the community structure mining and visualization subunit, on the basis of the hot-spot mining subunit, retrieve the page comments in the raw page data that are related to said keywords, build a network graph from said page comments, and show the analysis results to the user.
5. A method for analyzing liveness and cluster topology based on network topics, characterized by comprising:
(1) after the user inputs a URL, the system takes this URL as the seed, obtains the raw page data of the website through the web data crawling subunit, and saves each original web page together with its local storage address; the web data normalization subunit extracts the title, author, time, and main content of each page and stores them in the normalized web database; the URL format of the specified website is analyzed and the target URL from which comments are fetched via Ajax is constructed, so that the related comments are obtained and stored in the database;
(2) the web data in the raw page database are classified by the clustering method; the hot-spot mining subunit uses the web page clustering results, segments all page title data collected in a given time period into words and extracts keywords to obtain a keyword list, and mines hot topics from said keyword list with the maximum-comment method;
(3) the web data in the normalized web database and the analysis results of the hot-spot mining subunit are supplied as input data to the topic liveness analysis and visualization unit and the community structure mining and visualization unit; said topic liveness analysis and visualization unit analyzes the network clusters in the web data that are based on hot topics and quantitatively analyzes the liveness of the network clusters; said community structure mining and visualization unit retrieves the page comments in the web data that are related to the hot words, builds a network graph from the relations between said page comments, and applies the potential-energy-based community mining method to analyze the community structure of the web data; the analysis results of the two units are shown to the user together with the analysis results of the hot-spot mining subunit.
6. The method according to claim 5, characterized by further comprising:
the web data crawling subunit reads the SeedURL table to obtain the seed URLs, uses the web page link crawler module to obtain the links of the specified website, and stores them in the web page link queue; the web page fetch module takes page URLs from the web page link queue, fetches each whole page, and stores the page data, the URL, and the local storage address in the raw page database through the web page information storage module; the SeedURL table specifies the initial URLs from which the crawler crawls pages, each record starts one corresponding crawler thread, and that thread crawls only the pages of its specified website; the fields of this table are: (1) url field: the specified URL link; (2) Parser field: the parsing method corresponding to this URL; (3) inUse field: whether this seed URL is in use; (4) finish field: whether crawling of this URL has finished; (5) Depth field: the number of layers to crawl;
the web page link crawler module starts multiple link crawler threads according to the number of records in the SeedURL table; each thread crawls all links within the specified number of layers of its specified website and adds them to the web page link queue; each link crawler thread maintains a URL filter database that stores information on the pages already crawled; when the system obtains a URL, it checks the URL filter database: if a record for the URL already exists, the URL has been crawled and is discarded; if no record exists, the URL is added to the URL queue to wait for the web data request module to process its page; the web data request module uses objects in the htmlParser package to fetch data from the web page, passes the web data to the URL extraction module, and then stores the processed URL in the URL filter database; the URL extraction module uses the LinkBean object in the htmlParser package to extract URLs, which pass through the URL filter database before being placed in the web page link queue and the URL queue;
once the web page link crawler module has crawled the set number of layers, it clears the records in the URL filter database and starts crawling again with the seed URL as the initial address; crawler threads are created according to the records in the SeedURL table of the database, and every record whose inUse field is 0 starts one thread in the web page link crawler module; at regular intervals the main thread of the web page link crawler module checks the records in SeedURL for newly added records; the Parser field of a SeedURL record is copied into the LocalRecord objects in the link queue produced by the corresponding link crawler thread, to specify how each page is to be parsed;
the web page fetch module takes records from the page link queue, requests the corresponding web data, and stores the web data on the local disk; it places the URL of each fetched page together with its local storage address in the page storage queue, and passes the obtained raw page data, the URL of the fetched page, and the local storage address to the page information storage module;
further, the web page information storage module uses a data access interface (DAO) in Hibernate form; the corresponding table in the raw page database is the LocalRecord table, whose fields are: (1) url field: the URL of the page; (2) LocalDir field: the local storage address of the web data; (3) Parsed field: the method used to parse this page; (4) Processed field: whether this page has been parsed.
7. The method according to claim 5, further comprising:
the web data normalization subunit analyzes the raw page data and extracts the title, author, time, main content, and comments of each page; the raw page data extraction module obtains the raw page data saved in the local raw page database and places them in the queue of pages awaiting analysis; the web page analysis module takes raw page data from the queue of pages awaiting analysis, analyzes the web data using the parsing method provided by the web page parsing library, and places the URL of each processed page in the parsing record queue; the raw page data extraction module marks the parsed pages stored in the raw page database; the web page analysis module stores the analyzed data in the normalized page queue to await processing by the normalized data storage module; the normalized data storage module takes the analyzed normalized web data from the normalized page queue and stores them in the normalized web database through the data access interface;
further, the raw page data extraction module, the web page analysis module, and the normalized data storage module are each implemented as a separate thread, and multiple threads are started during web page parsing.
8. The method according to claim 7, further comprising:
the queue of pages awaiting analysis is a linked list of LocalRecord entries that supplies analysis data to the web page analysis module and balances the speeds of the web page analysis module and the raw page data extraction module;
the web page analysis module, according to the parsing method specified in the page record, retrieves the corresponding method from the web page parsing library to analyze the page, sends the analysis results of the raw page data to the normalized page queue for buffering, and stores the identification information of each parsed page in the parsing record queue;
the web page parsing library is a set of page parsing classes for different websites, and all classes in this library inherit a common abstract base class.
Each parsing class overrides the parsing method of that base class in an object-oriented manner, so that every website implements its own parsing method; the type FormatForParseResult returned by the method defines the following fields:
private String title;
private String url;
private String content;
private String date;
private String author;
private String site; // indicates which website the page belongs to; this value is set by the main thread and need not be set in the concrete method
private List<Comment> comments;
Each HTML page is treated as a tree, and the information of the page is obtained through the corresponding tags. To obtain the main content, each tag node of the input HTML page is traversed in order. For a text node, it is checked whether the text content is longer than 30 characters and contains more Chinese characters than English characters; if so, the text of the node is extracted as main content of the page. For an HTML tag node that has child nodes, the child nodes are extracted and each is analyzed in the same way, until all nodes of the page have been analyzed.
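The text-node test in claim 8 can be sketched as follows. The tiny `Node` tree stands in for a parsed HTML document (the embodiment uses htmlParser, whose API is not reproduced here), "Chinese character" is approximated by the Unicode Han script, "English character" by the letters A to Z in either case, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the main-content heuristic: long text nodes dominated by Chinese characters. */
final class MainContentSketch {

    /** Minimal stand-in for a parsed HTML node; text is null for tag nodes. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** True when the text is longer than 30 characters and has more Chinese than English characters. */
    static boolean looksLikeMainContent(String text) {
        int chinese = 0;
        int english = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                chinese++;
            } else if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')) {
                english++;
            }
            i += Character.charCount(cp);
        }
        return text.length() > 30 && chinese > english;
    }

    /** Depth-first, in-order traversal that appends every qualifying text node to out. */
    static void collect(Node node, StringBuilder out) {
        if (node.text != null && looksLikeMainContent(node.text)) {
            out.append(node.text).append('\n');
        }
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String body = "\u4e2d\u6587".repeat(20); // 40 Chinese characters
        Node page = new Node(null)
                .add(new Node("Home | Login | Register"))
                .add(new Node(null).add(new Node(body)));
        StringBuilder out = new StringBuilder();
        collect(page, out);
        System.out.print(out);
    }
}
```

The length threshold filters out navigation and button labels, while the character-ratio test filters out script fragments and English boilerplate, which is presumably the intent of the two conditions.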
9. The method according to claim 8, characterized in that the analysis method of the community structure mining and visualization unit is specifically:
First step: for a complex network G = (V, E), where V is the set of nodes and E is the set of edges, the adjacency matrix A of the input network G is preprocessed to obtain the preprocessed network adjacency matrix A'; said preprocessing removes the nodes of degree 1 from the network, that is, it removes the rows and columns corresponding to those nodes from the input adjacency matrix;
Second step: for the preprocessed network adjacency matrix A', the topology of the network is analyzed; from each node, the other nodes of the whole network are searched breadth-first to obtain the distance between any two nodes, from which the distance matrix D is built; the distance between two nodes is the number of edges on the shortest path between them in the network;
Third step: based on the distance matrix of the network nodes, the network potential energy of the whole network is computed. Each node of the network is regarded as the source of a gravitational field, which gives the potential energy between any two nodes of the network; in the formula for this potential energy, R is a constant that can be set to a positive number and D is the distance matrix obtained in the second step.
The network potential energy of the whole network G is the sum of the potential energies between all the nodes that the network contains;
Fourth step: for every edge $e_k$ in the network, the difference between the network potential energy of the sub-network $G_k = G - \{e_k\}$ obtained by deleting that edge and the network potential energy of the original network G is computed;
Fifth step: the edge for which this difference is largest is removed, and it is checked whether an independent sub-network has been generated; if not, return to the fourth step; if an independent sub-network has been produced, it is checked whether the sub-networks generated by the split satisfy the preset strong or weak community structure; if the result satisfies the strong or weak community structure, go to the first step and the method continues; if the result does not satisfy it, the method terminates and proceeds to the sixth step;
Sixth step: for the result obtained in the fifth step, the original network graph is reconstructed: the nodes removed by the preprocessing in the first step are added back to the original network graph, and each is assigned to the community of the node to which it is directly connected.
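The second to fifth steps can be illustrated with a short sketch. The breadth-first distance matrix follows the claim directly. The pair potential, however, is not reproduced in this text, so the sketch ASSUMES a gravitational-style potential of -R/d between two nodes at distance d (zero for unreachable pairs); that choice, like the names `PotentialCommunitySketch`, `networkPotential`, and `edgeWithLargestPotentialChange`, is hypothetical and should not be read as the claimed formula.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Sketch: BFS distance matrix and selection of the edge whose removal changes the potential most. */
final class PotentialCommunitySketch {

    /** All-pairs shortest-path lengths (edge counts) by breadth-first search; -1 marks unreachable pairs. */
    static int[][] distanceMatrix(boolean[][] adj) {
        int n = adj.length;
        int[][] dist = new int[n][n];
        for (int s = 0; s < n; s++) {
            Arrays.fill(dist[s], -1);
            dist[s][s] = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int v = 0; v < n; v++) {
                    if (adj[u][v] && dist[s][v] < 0) {
                        dist[s][v] = dist[s][u] + 1;
                        queue.add(v);
                    }
                }
            }
        }
        return dist;
    }

    /** Sum over node pairs of the ASSUMED pair potential -r/d; unreachable pairs contribute 0. */
    static double networkPotential(boolean[][] adj, double r) {
        int[][] dist = distanceMatrix(adj);
        double total = 0.0;
        for (int i = 0; i < dist.length; i++) {
            for (int j = i + 1; j < dist.length; j++) {
                if (dist[i][j] > 0) {
                    total -= r / dist[i][j];
                }
            }
        }
        return total;
    }

    /** Returns {i, j} for the edge whose deletion gives the largest potential difference, or null if no edge exists. */
    static int[] edgeWithLargestPotentialChange(boolean[][] adj, double r) {
        double base = networkPotential(adj, r);
        int[] best = null;
        double bestDelta = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < adj.length; i++) {
            for (int j = i + 1; j < adj.length; j++) {
                if (!adj[i][j]) {
                    continue;
                }
                adj[i][j] = adj[j][i] = false;                   // temporarily delete the edge
                double delta = networkPotential(adj, r) - base;  // potential of G_k minus potential of G
                adj[i][j] = adj[j][i] = true;                    // restore the edge
                if (delta > bestDelta) {
                    bestDelta = delta;
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {3, 4}, {4, 5}, {3, 5}, {2, 3}};
        boolean[][] adj = new boolean[6][6];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = adj[e[1]][e[0]] = true;
        }
        System.out.println(Arrays.toString(edgeWithLargestPotentialChange(adj, 1.0)));
    }
}
```

Under the assumed potential, deleting an edge that bridges two dense groups lengthens or breaks many shortest paths at once, so it produces the largest difference; the degree-1 preprocessing, the strong/weak community test, and the final reinsertion step are left out of the sketch.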
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210477317.5A CN103023714B (en) | 2012-11-21 | 2012-11-21 | The liveness of topic Network Based and cluster topology analytical system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103023714A CN103023714A (en) | 2013-04-03 |
CN103023714B true CN103023714B (en) | 2015-12-23 |
Family
ID=47971866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210477317.5A Active CN103023714B (en) | 2012-11-21 | 2012-11-21 | The liveness of topic Network Based and cluster topology analytical system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103023714B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279483B (en) * | 2013-04-23 | 2016-04-13 | 中国科学院计算技术研究所 | A kind of topic Epidemic Scope appraisal procedure towards micro-blog and system |
CN103516805A (en) * | 2013-10-10 | 2014-01-15 | 贝壳网际(北京)安全技术有限公司 | Platform, method and system for application distribution |
CN104915879B (en) * | 2014-03-10 | 2019-08-13 | 华为技术有限公司 | The method and device that social relationships based on finance data are excavated |
CN103823890B (en) * | 2014-03-10 | 2016-11-02 | 中国科学院信息工程研究所 | A kind of microblog hot topic detection method for special group and device |
CN104035997B (en) * | 2014-06-13 | 2017-05-10 | 淮阴工学院 | Scientific and technical information acquisition and pushing method based on text classification and image deep mining |
CN105654387B (en) * | 2015-03-17 | 2020-03-17 | 重庆邮电大学 | Time-varying network community evolution visualization method introducing quantization index |
CN104951539B (en) * | 2015-06-19 | 2017-12-22 | 成都艾尔普科技有限责任公司 | Internet data center's harmful information monitoring system |
CN107102997A (en) * | 2016-02-22 | 2017-08-29 | 北京国双科技有限公司 | data crawling method and device |
CN105930528B (en) * | 2016-06-03 | 2020-09-08 | 腾讯科技(深圳)有限公司 | Webpage caching method and server |
CN106095919B (en) * | 2016-06-12 | 2019-08-02 | 上海交通大学 | Data variation trend spring visualization system and method towards analysis of central issue |
CN106649576A (en) * | 2016-11-15 | 2017-05-10 | 北京集奥聚合科技有限公司 | Storing method and system for e-commerce commodities crawled by crawlers |
CN107707662A (en) * | 2017-10-16 | 2018-02-16 | 大唐网络有限公司 | A kind of distributed caching method based on node, device and storage medium |
CN107729564A (en) * | 2017-11-13 | 2018-02-23 | 北京众荟信息技术股份有限公司 | A kind of distributed focused web crawler web page crawl method and system |
CN108363748B (en) * | 2018-01-26 | 2021-07-09 | 南京邮电大学 | Topic portrait system and topic portrait method based on knowledge |
CN109543086B (en) * | 2018-11-23 | 2022-11-22 | 北京信息科技大学 | Network data acquisition and display method oriented to multiple data sources |
CN109766478B (en) * | 2019-01-08 | 2021-06-29 | 浙江财经大学 | Semantic-enhanced large-scale multivariate graph simplified visualization method |
CN113408089B (en) * | 2021-05-31 | 2023-09-26 | 上海师范大学 | Inter-cluster influence modeling method based on gravitational field idea and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101441662A (en) * | 2008-11-28 | 2009-05-27 | 北京交通大学 | Topic information acquisition method based on network topology |
CN101661513A (en) * | 2009-10-21 | 2010-03-03 | 上海交通大学 | Detection method of network focus and public sentiment |
CN101751438A (en) * | 2008-12-17 | 2010-06-23 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
Non-Patent Citations (1)
Title |
---|
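Xiong Jingxian et al., "Text information monitoring technology based on concept networks" (基于概念网络的文本信息监控技术), Information Security and Communications Privacy (《信息安全与通信保密》), 2005-10-31, pp. 57-59 * |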
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |