CN101441662A

CN101441662A - Topic information acquisition method based on network topology

Info

Publication number: CN101441662A
Application number: CNA2008102275821A
Authority: CN
Inventors: 刘云; 熊菲; 李勇; 沈波; 张振江; 贾凡; 程辉; 张立; 张彦超; 司夏萌
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2008-11-28
Filing date: 2008-11-28
Publication date: 2009-05-27
Anticipated expiration: 2028-11-28
Also published as: CN101441662B

Abstract

The invention relates to a topic information acquisition method based on network topology. An initial web page set is obtained from a search engine; each page is cleaned, segmented into words, stripped of stop words, and represented as a vector, and text similarity is computed with the vector space model. The network structure is then used to analyze the extracted URLs: links are filtered according to the directory hierarchy of their URLs, and URL weights are corrected according to the scale-free property of the network so that links are selected by preferential attachment. In addition, relevance feedback is applied to irrelevant topic regions, and the buffer (tunnel) length for the URLs of an irrelevant page is set according to the distance between those URLs and the seed set. Finally, the hotness of each acquired topic is computed, and topics are selected by hotness to fetch their new replies.

Description

Topic information acquisition method based on network topology

Technical field

The present invention relates to a topic information acquisition method based on network topology and belongs to the field of network security.

Background technology

As information becomes increasingly networked, the amount of information on the Internet grows daily, and this massive, heterogeneous body of Web information holds enormous potential value. The Internet is a communication platform on which information is easy to publish and audiences can readily interact, which has allowed it to surpass traditional media and become the main way of obtaining real-time information. News events usually appear on the Internet first and trigger discussion there.

How to extract and use network information effectively has become a major challenge. Search engines give users an efficient and effective way to obtain information through queries. The web information acquisition system is the component with which a search engine downloads web pages from the World Wide Web, and it is an important part of the search engine (J. Cho, Crawling the Web: Discovery and Maintenance of Large-Scale Web Data, Computer Science, 2001). A general breadth-first search (BFS) acquisition system searches the next level only after finishing the current level; its coverage is broad, but it often collects information the user does not care about. A topic information acquisition system built around a specific topic addresses this problem well. According to a given crawl target, it applies a web page analysis algorithm and selectively visits relevant links to obtain the required information; its purpose is to prepare data resources for topic-oriented user queries (Zhou Lizhu, Lin Ling, A survey of focused crawler technology, Computer Applications, 2005, 25(9): 1965-1969).

Jyh-Jong Tsay et al. (Jyh-Jong Tsay, Chen-Yang Shih, Bo-Liang Wu. Auto Crawler: an Integrated System for Automatic Topical Crawler. Computer and Information Science, 2005. Fourth Annual ACIS International Conference on, 2005: 462-467) added relevance feedback to breadth-first search and set the tunnel length appropriately, so that the acquisition system leaves irrelevant regions as early as possible while still mining relevant content hidden behind irrelevant links. Jamali M. et al. (Jamali M., Sayyadi H., Hariri B.B., et al. A Method for Focused Crawling Using Combination of Link Structure and Content Similarity. Web Intelligence, 2006. IEEE/WIC/ACM International Conference on, 2006: 753-756) process URLs by combining link structure with text similarity: the URL weight is defined as the product of the page similarity and the sum of the URL's in-degree and out-degree, which is a post-hoc way of processing URLs. Wang Tao and Fan Xiaozhong (Wang Tao, Fan Xiaozhong. Improvement of topic crawlers through link analysis. Computer Applications, 2004, 24(B12): 174-176), building on the vector space model, filter URLs by analyzing the physical and logical structure of links, pursuing higher precision at the cost of slightly lower recall.

General topic information acquisition systems treat all URLs extracted from the same page alike, retain many irrelevant links, and cannot reduce the impact of similarity misjudgments.

Summary of the invention

The object of the invention is to overcome the above shortcomings of the prior art and to provide a topic information acquisition method based on network topology. The invention processes URLs according to the Internet topology: it performs link analysis on URLs, corrects their weights according to the characteristics of scale-free networks, and adjusts the tunnel length. It also revisits already-collected topics according to topic hotness to obtain their reply information.

The object of the invention can be achieved by the following technical solution:

The topic information acquisition method based on network topology comprises the following steps:

a. obtain the seed web page set from a search engine;

b. segment each page in the seed web page set into words according to its subject terms, represent it as a vector, extract its URLs, and initialize the unvisited-URL queue;

c. select a URL from the unvisited-URL queue, fetch the corresponding page, and compute the similarity between the fetched page and the seed web page set;

d. compare the similarity between the fetched page and the seed web page set with a preset threshold.

Step d specifically comprises:

If the similarity is greater than the preset threshold:

1) parse the URLs from the page, remove duplicates, and insert them into the unvisited-URL queue; compare the path relationship between the parent URL and each child URL, and assign different weights to the child URLs accordingly;

2) compute the link weight of each child URL; the link weighting coefficient of child page i with respect to parent page j is link_{ji} = path_{ji} + freq_i, where path_{ji} is the path weight assigned to the URL and freq_i is the normalized keyword frequency of the anchor text;

3) correct the weight of each child URL; the corrected weight is:

\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)

where n is the in-degree of page i, sim(V_t, D) is the relevance of parent page V_t to the seed set D, link_{ti} is the link weighting coefficient of page i with respect to parent page t, and

\eta(k_t) = k_t / \sum_{j} k_j

is the preferential-attachment probability of a topic page, where k_t is the number of valid links cited by parent page t.

If the similarity is not greater than the preset threshold, the tunnel length is set according to the distance between the URL and the seed set. The tunnel length is

\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right),

where floor denotes rounding down, σ is a constant initial-depth parameter, and n(i) is the link depth from the seed set to page i. If the tunnel length of the URL is greater than 0, its child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.

Assigning different link weights to the child URLs specifically comprises:

1) the child URL contains the parent URL, so the child page lies in a subdirectory of the parent page and its topic expands and extends the topic of the parent page; the weight assigned to the child URL is t;

2) the child URL has a path similar to that of the parent URL: the child page has the same directory depth and file-name length as the parent page, and the new topic precedes or follows the parent topic; the weight assigned to the child URL is t;

3) the child URL is a redundant link such as a background illustration or an advertisement; the weight assigned to the child URL is

where 0.4 < t < 0.6.

The topic information acquisition method based on network topology of the present invention can also be implemented by the following steps:

a. obtain the seed web page set from a search engine;

b. segment each page in the seed web page set into words according to its subject terms, represent it as a vector, and extract its URLs;

c. apply template matching to the URLs in the visited queue; when a page carries reply information, preferentially select the topic URLs with high hotness, according to topic hotness, and fetch their new replies.

The topic hotness is:

\mathrm{heft}(t) = (n + 1)^{\alpha} e^{-(1 + t - \bar{t}) / \beta}, \quad \bar{t} = \sum_{i=1}^{n} t_i / n;

where \bar{t} is the mean reply time of the topic, i.e., its active time point; n is the total number of replies to the topic from the initial time to the current time; α and β are constants with 0 < α < 1; α is the weighting exponent of the preferential-attachment probability; and β determines the smoothness of the topic hotness function.

The specific method and steps of the invention are described in detail below:

First, the seed pages are obtained. Using the focus keywords, each search engine is queried and the first m records are taken as the initial focus links. The source files of the initial links are fetched to obtain the seed web page set D = <D_1, D_2, D_3, ..., D_m>. For each page D_i in the set, the subject information is extracted and segmented into words; meaningless auxiliary words, adverbs, and stop words are removed; and the page is represented as a document vector. If document D_i contains the terms <t_1, t_2, t_3, ..., t_n>, the corresponding n-dimensional document vector is <w_{i1}, w_{i2}, w_{i3}, ..., w_{in}>, where w_{ij} is the weight of term j. w_{ij} uses the classical TF × IDF definition, and IDF is updated incrementally as the total number of documents grows. The seed web page set is thus mapped to the document vector set W = <W_1, W_2, W_3, ..., W_m>. The URLs parsed from the seed page set are given an initial weight of 1 and added to the search queue of the acquisition system; during acquisition, links with higher weights are searched first.
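The vectorization step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the pages are already segmented and stripped of stop words, it uses one common TF × IDF variant (relative term frequency times log(m / df)), and it recomputes IDF from scratch rather than updating it incrementally. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector.

    docs: list of token lists (already segmented, stop words removed).
    Assumed weighting: (count / doc length) * log(m / document frequency).
    """
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty pages
        vectors.append(
            {term: (count / total) * math.log(m / df[term])
             for term, count in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    seed_pages = [["beijing", "olympic"], ["olympic", "games"]]
    for vec in tfidf_vectors(seed_pages):
        print(vec)
```

With this variant a term that occurs in every seed page receives weight 0, so only terms that discriminate between pages contribute to the similarity.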

A newly fetched page is preprocessed, segmented, and converted into a term vector, and its similarity to the seed web page set is computed. The similarity between documents is measured by the cosine of the angle between their document vectors. For two pages D_i and D_j, the similarity is

\mathrm{sim}\langle D_i, D_j \rangle = \frac{D_i \cdot D_j}{|D_i| \times |D_j|} = \frac{\sum_{k=1}^{n} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{jk}^{2}}}.

Given the seed web page set D and a new page V, the similarity between the new page and the seed page set is the mean of the similarities between this page and every page in the set:

\mathrm{sim}\langle V, D \rangle = \frac{1}{m} \sum_{k=1}^{m} \mathrm{sim}\langle V, D_k \rangle.

The higher the similarity between two pages, the smaller the angle between them in vector space and the more likely they describe the same topic; conversely, the lower the similarity, the more likely the pages belong to different topics. If the similarity between a page and the seed web page set is above the threshold, the page is added to the seed set.
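The two similarity formulas can be sketched in a few lines. This is an illustrative sketch under the assumption that document vectors are stored as sparse {term: weight} dictionaries; the function names are hypothetical and the threshold value is left to the caller because the text does not state one.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_u * norm_v)


def seed_similarity(page, seeds):
    """Mean cosine similarity between a page and every seed page."""
    return sum(cosine(page, seed) for seed in seeds) / len(seeds)


def maybe_add_to_seeds(page, seeds, threshold):
    """Append the page to the seed set when it exceeds the threshold."""
    relevant = seed_similarity(page, seeds) > threshold
    if relevant:
        seeds.append(page)
    return relevant
```

Because the seed set grows as relevant pages are added, later pages are compared against a larger and possibly drifting reference set; a caller may want to cap its size, which the text does not discuss.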

Link analysis is applied to the URLs parsed from a page in order to filter them. A page pointed to by a link cited in another page is called a child page of that parent page. The structure of a URL parsed from the parent page reflects the relationship between the child page and the parent page, and different weights are assigned in the following cases (the weight parameter is t, 0.4 < t < 0.6):

(1) The child URL contains the parent URL. For example, if the parent URL is "http://mil.news.sina.com.cn/" and the child URL matches "http://mil.news.sina.com.cn/\w/\d{4}-\d{2}-\d{2}/\d+.html" or "http://mil.news.sina.com.cn/\d{4}-\d{2}-\d{2}/\d+.html", the child page lies in a subdirectory of the parent page. The topic of the child page expands and extends the topic of the parent page, and the child URL is assigned weight t.

(2) The child URL has a path similar to that of the parent URL. The child page has the same directory depth and file-name length as the parent page, and the new topic precedes or follows the parent topic; the child URL is assigned weight t.

(3) Redundant links such as background illustrations and advertisements are assigned weight

In addition, for URLs that are more relevant to the topic, the text near the link usually contains the focus keywords. The link weighting coefficient of page i with respect to parent page j is therefore link_{ji} = path_{ji} + freq_i, where path_{ji} is the URL path weight from the three cases above and freq_i is the normalized keyword frequency of the anchor text.
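The link weighting coefficient link_{ji} = path_{ji} + freq_i can be sketched as below. Several choices here are assumptions rather than statements of the method: "contains the parent URL" is read as a prefix test, "similar path" as same host, same directory depth, and same file-name length, every other link is treated as redundant, and freq_i is normalized by the number of anchor tokens. The text does not give the weight for redundant links, so `redundant_weight=0.0` is a placeholder, and all function names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def _depth_and_file(url):
    """Return (directory depth, file name) of a URL path."""
    directory, filename = posixpath.split(urlparse(url).path)
    depth = len([seg for seg in directory.split("/") if seg])
    return depth, filename


def path_weight(parent_url, child_url, t=0.5, redundant_weight=0.0):
    """Path weight for the three cases; redundant_weight is an assumption."""
    # Case 1: the child URL contains (here: starts with) the parent URL.
    if child_url.startswith(parent_url):
        return t
    # Case 2: same host, same directory depth, same file-name length.
    same_host = urlparse(parent_url).netloc == urlparse(child_url).netloc
    p_depth, p_file = _depth_and_file(parent_url)
    c_depth, c_file = _depth_and_file(child_url)
    if same_host and p_depth == c_depth and len(p_file) == len(c_file):
        return t
    # Case 3: everything else is treated as a redundant link.
    return redundant_weight


def anchor_keyword_freq(anchor_tokens, keywords):
    """Share of anchor-text tokens that are focus keywords (assumed form)."""
    if not anchor_tokens:
        return 0.0
    return sum(1 for tok in anchor_tokens if tok in keywords) / len(anchor_tokens)


def link_weight(parent_url, child_url, anchor_tokens, keywords, t=0.5):
    """link_ji = path_ji + freq_i."""
    return path_weight(parent_url, child_url, t) + anchor_keyword_freq(
        anchor_tokens, keywords)
```

For instance, with a parent at `http://example.com/news/` and a child at `http://example.com/news/2008/a.html` whose anchor text has one keyword among three tokens, the sketch yields t + 1/3.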

On the basis of link analysis, URLs are weighted and sorted, and links are selected by URL weight for insertion into the unvisited-URL queue. The World Wide Web has the characteristics of a scale-free network: pages or sites are the nodes of the network, links are its edges, and the network has a power-law degree distribution (SEN QIN, GUAN-ZHONG DAI, YAN-LING LI, Design and Implementation of Web Hot-topic Talk Mining Based on Scale-free Network, Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, 2006, pp. 13-16). The growth and preferential attachment principles of scale-free networks mean that pages containing more links are more likely to acquire new links. The probability that a new topic joining the network links to an existing topic is proportional to the number of links that topic contains. Similarity misjudgments can admit irrelevant pages into the focus topic, and link analysis alone cannot effectively filter out the links extracted from those pages. Such wrongly focused pages, however, usually contain few valid links. Weighting URLs by the preferential-attachment probability of the scale-free network therefore lowers the weights of links extracted from misjudged pages and improves precision. Building on the PageRank algorithm and taking into account relevance computation, link analysis, and the scale-free network property, the corrected weight is:

\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)

where

\eta(k_t) = k_t / \sum_{j} k_j

is the preferential-attachment probability of a topic page, and k_t is the number of valid links cited by parent page t. sim(V_t, D) is an important part of the URL weight, and the similarity threshold determines how the child URLs are directed.
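The corrected weight can be illustrated with a short sketch. The text does not say which pages the sum over j in η ranges over; the sketch assumes the caller supplies the valid-link counts of all pages it wants to normalize against (`all_k`). Function names and the example numbers are hypothetical.

```python
def attachment_probability(k_t, all_k):
    """eta(k_t) = k_t / sum_j k_j, with all_k chosen by the caller (assumption)."""
    total = sum(all_k)
    return k_t / total if total else 0.0


def url_score(parents, all_k):
    """score(i) = sum over parent pages t of link_ti * eta(k_t) * sim(V_t, D).

    parents: one (link_ti, k_t, sim_t) tuple per parent page that links to URL i.
    """
    return sum(
        link * attachment_probability(k_t, all_k) * sim
        for link, k_t, sim in parents
    )


if __name__ == "__main__":
    # Two parents link to the same URL; three pages are known in total.
    parents = [(0.8, 30, 0.9), (0.5, 10, 0.6)]
    print(url_score(parents, all_k=[30, 10, 60]))
```

A parent with few valid links contributes little to the score even when its similarity is high, which is how the weighting dampens links taken from misjudged pages.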

When the content of a page is relevant to the focus topic, the child URLs parsed from it are more likely to be relevant to the topic; otherwise, the child URLs tend to describe new topics. A highly relevant page, however, may cite a large number of recommended links, so that the extracted child URLs do not belong to the focus topic yet carry high weights because they inherit the relevance of the parent page. Fetching these child URLs would collect too much irrelevant content and degrade system performance. Therefore, when fetching the child URLs parsed from the same parent page, if topic-irrelevant information is obtained several times in a row, the weights of all child URLs of that page are reduced.
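One way to picture this feedback rule is a per-parent counter of consecutive irrelevant fetches. The sketch below is illustrative only: the text says "several times in a row" and "reduced" without giving numbers, so `max_misses` and `penalty` are hypothetical parameters, as is the class name.

```python
class ParentFeedback:
    """Track consecutive irrelevant fetches per parent page (illustrative)."""

    def __init__(self, max_misses=3, penalty=0.5):
        self.max_misses = max_misses  # assumed streak length, not from the text
        self.penalty = penalty        # assumed multiplicative reduction
        self.misses = {}

    def record(self, parent_url, relevant, weights, children):
        """Record one fetch result; return True if the children were penalized.

        weights: {url: weight} for queued URLs, modified in place.
        children: URLs parsed from parent_url.
        """
        if relevant:
            self.misses[parent_url] = 0  # a relevant fetch breaks the streak
            return False
        self.misses[parent_url] = self.misses.get(parent_url, 0) + 1
        if self.misses[parent_url] < self.max_misses:
            return False
        for child in children:
            if child in weights:
                weights[child] *= self.penalty
        self.misses[parent_url] = 0
        return True
```

Resetting the counter after a penalty means a parent whose links keep failing is penalized repeatedly, so its remaining children sink steadily in the queue.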

In a topic network, the path from one relevant page to another may pass through irrelevant regions, so that relevant topic pages are hidden behind irrelevant links; this is called tunneling. A page with low similarity to the seed set may, after several levels of links, lead to a page with high similarity. Ignoring the URLs extracted from pages whose similarity is below the threshold would lose the relevant links hidden behind those pages and reduce the number of topics obtained. Setting a suitable buffer for irrelevant pages improves the recall of the topic information acquisition system. An atomic model is therefore used to allocate the number of buffered link hops. The seed set is the basis of topic focusing and plays the role of the atomic nucleus, which is continually updated. The URLs extracted from the seed set act as the extranuclear electrons. When a URL has several in-links, its minimum distance to the seed set is taken as its link depth. URLs at the same depth are regarded as being at the same energy level. The smaller the depth of a URL, the closer it is to the nucleus, the more strongly it is bound by the nucleus, and the more likely it is to lead to relevant topic pages, so a larger number of buffer levels is set. Conversely, the greater the depth of a URL, the more likely it is to escape the binding of the nucleus and the less likely it is to reach the focus topic. The number of buffer hops (i.e., the tunnel length) is set as

\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right),

where step(i) is the number of buffered page hops for page i, floor denotes rounding down, σ is a constant initial-depth parameter, and n(i) is the link depth from the seed set to page i. If the buffer hop count of a URL in an irrelevant page equals 0, the weights of all child URLs in that page are reduced.
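The tunnel-length rule is simple enough to show directly. In this sketch the link depth is assumed to start at 1 for URLs extracted from the seed set, and the amount by which child weights are reduced (`penalty`) is a hypothetical parameter, since the text only says the weights are reduced.

```python
import math


def tunnel_length(sigma, depth):
    """step(i) = floor(sigma / n(i)); depth is assumed to be at least 1."""
    if depth < 1:
        raise ValueError("link depth must be >= 1")
    return math.floor(sigma / depth)


def weights_for_irrelevant_page(sigma, depth, child_weights, penalty=0.5):
    """Return child URL weights for a page whose similarity is below threshold.

    While the tunnel length is positive the children keep their weights;
    once it reaches 0 every child weight is reduced (penalty is an assumption).
    """
    if tunnel_length(sigma, depth) > 0:
        return dict(child_weights)
    return {url: w * penalty for url, w in child_weights.items()}


if __name__ == "__main__":
    for depth in (1, 2, 3, 4):
        print(depth, tunnel_length(3, depth))
```

With σ = 3, pages at depths 1, 2, and 3 get 3, 1, and 1 buffer hops, and pages at depth 4 or more get none, so the crawler tolerates irrelevant pages only near the seed set.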

Visited topic URLs are not discarded but added to the visited-URL queue. Template matching is applied to the URL links; if a page carries reply information, as with followed news comments, bulletin board systems (BBS), blogs, and the like, URLs are selected according to topic hotness to fetch the new replies of the topics with high popularity. Topic posts that already have many replies have attracted public interest, so the probability that they receive further replies is higher; at the same time, as time passes, outdated topic posts receive fewer replies and their attraction gradually tends to 0. Topic hotness is defined as follows:

\mathrm{heft}(t) = (n + 1)^{\alpha} e^{-(1 + t - \bar{t}) / \beta}, \quad \bar{t} = \sum_{i=1}^{n} t_i / n.

where \bar{t} is the mean reply time of the topic, i.e., its active time point, n is the total number of replies to the topic from the initial time to the current time, and α and β are constants. 0 < α < 1, and α is the weighting exponent of the preferential-attachment probability. β determines the smoothness of the topic hotness function: the smaller β is, the more detail the function reflects; the larger β is, the flatter the function. Because a flatter function better estimates the future reply trend of a topic, β is generally greater than 2.

Topic hotness usually lies between 0 and 30, and a topic with hotness greater than 15 can be considered a hot topic. When selecting from the visited queue, the URLs of news comments, BBS threads, or blog titles with high popularity are selected first.
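The hotness function can be sketched as follows. The text does not state the time unit or the values of α and β beyond 0 < α < 1 and β generally greater than 2, so the defaults below and the choice to reject topics with no replies (where the mean reply time is undefined) are assumptions; the function name is hypothetical.

```python
import math


def topic_hotness(reply_times, now, alpha=0.5, beta=3.0):
    """heft(t) = (n + 1)**alpha * exp(-(1 + t - t_bar) / beta).

    reply_times: timestamps of the replies so far, in the same (unspecified)
    unit as now; t_bar is their mean.
    """
    n = len(reply_times)
    if n == 0:
        raise ValueError("mean reply time is undefined without replies")
    t_bar = sum(reply_times) / n
    return (n + 1) ** alpha * math.exp(-(1 + now - t_bar) / beta)


def pick_hottest(topics, now):
    """Return the URL whose topic currently has the highest hotness.

    topics: {url: list of reply timestamps}.
    """
    return max(topics, key=lambda url: topic_hotness(topics[url], now))
```

The first factor grows with the reply count and the second decays as the mean reply time recedes into the past, so a thread with many recent replies is revisited before an old or quiet one.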

Compared with the prior art, the invention has the following advantages:

(1) Link analysis is performed on URLs to filter out redundant and irrelevant links, saving system resources.

(2) Reflecting the aggregation characteristic of Internet information, URL weights are corrected with the preferential-attachment probability of scale-free networks, reducing the impact of similarity misjudgments.

(3) Relevance feedback is applied to URLs so that the system leaves irrelevant regions as early as possible, improving accuracy.

(4) Topics that carry replies are evaluated by topic hotness, so that system time is spent mainly on hot topics.

Description of drawings

Fig. 1 is the workflow diagram of the entire method;

Fig. 2 compares the precision of the topic information acquisition system and the BFS acquisition system;

Fig. 3 compares the relevant-topic acquisition rate of the topic information acquisition system and the BFS acquisition system;

Fig. 4 compares precision with and without link analysis;

Fig. 5 compares precision under different weighting schemes;

Fig. 6 compares precision before and after similarity feedback.

Embodiment

The performance of a topic information acquisition system is measured by precision. Precision measures how accurately relevant topic pages are obtained: if M is the total number of topics fetched and T is the number of relevant topics among the fetched pages, then precision = T/M. Focusing on the keyword "Beijing Olympic", topic pages from upwards of a hundred websites were fetched in order to compare the relevant-topic acquisition efficiency of the topic information acquisition system with that of a general acquisition system, and to examine the effect of different system parameters on topic information acquisition performance (the weight parameter t is 0.5). Fig. 1 shows the workflow of the topic information acquisition method based on network topology. The results in Fig. 2 to Fig. 6 are averages over repeated simulation runs. Because the network is updated quickly and the accuracy of information acquisition depends on website structure, the seed set of the topic acquisition system is kept identical to the initial queue of the BFS acquisition system to ensure comparability.
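For completeness, the precision measure can be written as a one-line helper. The function name and the choice to return 0.0 before anything has been fetched are assumptions made for this sketch.

```python
def precision(relevant_fetched, total_fetched):
    """precision = T / M, where T is relevant topics and M is all fetched topics."""
    if total_fetched == 0:
        return 0.0  # nothing fetched yet; assumed convention
    return relevant_fetched / total_fetched
```

Tracking this value as the number of fetched pages grows gives the kind of precision curve described for the figures.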

Fig. 2 shows how the precision of the topic information acquisition system and the BFS acquisition system varies with the total number of fetched pages. Starting from the seed page set of the topic acquisition system and the initial URL queue of BFS, both obtained from the search engine, more than 10,000 pages were fetched continuously. Fig. 3 shows the corresponding relevant-page acquisition rate. The precision of the proposed topic acquisition system is clearly higher than that of the BFS acquisition system: as the number of collected pages increases, BFS collects relevant topics slowly and its precision drops below 40%, while the topic acquisition system stays above 70%.

Fig. 4 shows the effect of link analysis on the precision of the topic acquisition system. The initial seed page set contains 10 pages, and 3,500 pages are fetched. Without link analysis, a large number of invalid and redundant links in topic pages are collected and degrade system performance; precision falls below 0.6 after 2,500 pages have been fetched.

Fig. 5 shows how precision varies with the number of collected pages under different weighting schemes. The weighting scheme based on Internet topology gives higher precision than PageRank weights and declines more gently. Equal-gain weighting gives higher precision early in acquisition, but as the system runs it greedily pursues the highest similarity, easily falls into a local optimum, and its precision drops quickly.

Fig. 6 compares precision before and after similarity feedback. Similarity feedback lets the system leave a wrong crawl region more quickly after it has fallen into an irrelevant region, preventing further performance degradation.

Table 1 below gives the precision of the topic acquisition system and the BFS acquisition system for different focus topics, with 10 initial links and 5,000 fetched pages.

Table 1: Precision for different topics

Topic	Topic acquisition system	BFS acquisition system
China's large aircraft project	0.38	0.071
Shenzhou VI spacecraft	0.65	0.12
Wenchuan earthquake	0.76	0.28
Notebook	0.73	0.13

Claims

1. A topic information acquisition method based on network topology, characterized in that it comprises the following steps:

a. obtain the seed web page set from a search engine;

2. The topic information acquisition method based on network topology according to claim 1, characterized in that step d specifically comprises:

If the similarity is greater than the preset threshold:

\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)

\eta(k_t) = k_t / \sum_{j} k_j

\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right),

where floor denotes rounding down, σ is a constant initial-depth parameter, and n(i) is the link depth from the seed set to page i; if the tunnel length of the URL is greater than 0, its child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.

3. The topic information acquisition method based on network topology according to claim 2, characterized in that assigning different link weights to the child URLs specifically comprises:

where 0.4 < t < 0.6.

4. The topic information acquisition method based on network topology according to claim 1, characterized in that it comprises the following steps:

a. obtain the seed web page set from a search engine;

The topic hotness is:

\mathrm{heft}(t) = (n + 1)^{\alpha} e^{-(1 + t - \bar{t}) / \beta}, \quad \bar{t} = \sum_{i=1}^{n} t_i / n;