CN101441662A - Topic information acquisition method based on network topology - Google Patents

Publication number
CN101441662A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102275821A
Other languages
Chinese (zh)
Other versions
CN101441662B (en)
Inventor
刘云
熊菲
李勇
沈波
张振江
贾凡
程辉
张立
张彦超
司夏萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Application filed by Beijing Jiaotong University
Priority to CN2008102275821A
Publication of CN101441662A
Application granted
Publication of CN101441662B
Legal status: Expired - Fee Related

Abstract

The invention relates to a topic information acquisition method based on network topology. An initial web page set is obtained from a search engine; after cleaning, word segmentation, and stop-word removal it is expressed as a set of vectors, and a vector space model is used to calculate text similarity. The network structure is used to perform link analysis on the extracted URLs: links are first filtered according to the directory hierarchy of the URLs, and the URL weights are then corrected according to the scale-free property of the network so that selection follows preferential attachment. At the same time, feedback is applied to irrelevant topic regions, and the buffer length for irrelevant URLs is set according to the distance between the URL and the seed set. The heat of already-acquired topics is calculated, and topics are selected by heat in order to obtain their new replies.

Description

Topic information acquisition method based on network topology
Technical field
The present invention relates to a topic information acquisition method based on network topology and belongs to the field of network security.
Background art
As information becomes increasingly networked, the amount of information on the Internet grows daily, and great potential value lies in these massive, heterogeneous Web information resources. The Internet is a convenient platform for publishing information and interacting with audiences, which has let it surpass traditional media and become the main channel for obtaining real-time information. News events usually appear first on the Internet and then provoke discussion online.
How to extract and use network information effectively has become a major challenge. Search engines give users an efficient and effective way to obtain information through queries. The network information acquisition system is the component with which a search engine downloads web pages from the World Wide Web, and it is an important part of the search engine (J. Cho, Crawling the Web: Discovery and Maintenance of Large-Scale Web Data, Computer Science, 2001). A general breadth-first search (BFS) acquisition system searches the next level only after it has finished the current level; its coverage is broad, but it often collects information the user does not care about. A topic information acquisition system built around a specific topic solves this problem well. Such a system follows a set crawl target, applies web page analysis algorithms, selectively visits relevant links, and obtains the needed information; its purpose is to prepare data resources for topic-oriented user queries (Zhou Lizhu, Lin Ling, A survey of focused crawler technology, Computer Applications, 2005, 25(9): 1965-1969).
Jyh-Jong Tsay et al. (Jyh-Jong Tsay, Chen-Yang Shih, Bo-Liang Wu. Auto Crawler: an Integrated System for Automatic Topical Crawler. Computer and Information Science, 2005. Fourth Annual ACIS International Conference on, 2005: 462-467) build on breadth-first search, apply relevance feedback, and set the tunnel length appropriately, so that the acquisition system leaves irrelevant regions as early as possible while still mining relevant content hidden behind irrelevant links. Jamali M. et al. (Jamali M., Sayyadi H., Hariri B.B., et al. A Method for Focused Crawling Using Combination of Link Structure and Content Similarity. Web Intelligence, 2006. IEEE/WIC/ACM International Conference on, 2006: 753-756) process URLs by combining link structure with text similarity; the URL weight is defined as the product of the page similarity and the sum of the URL's in-degree and out-degree, which is an after-the-fact way of handling URLs. Wang Tao and Fan Xiaozhong (Wang Tao, Fan Xiaozhong. Improvement of topic crawlers by link analysis. Computer Applications, 2004, 24(B12): 174-176) start from the vector space model and filter URLs by analyzing the physical and logical structure of links, pursuing higher precision at the cost of slightly lower recall.
General topic information acquisition systems treat all URLs extracted from the same page alike, keep many irrelevant links, and cannot reduce the impact of similarity misjudgments.
Summary of the invention
The object of the invention is to overcome the above shortcomings of the prior art and to provide a topic information acquisition method based on network topology. The invention processes URLs according to the Internet topology: it performs link analysis on URLs, corrects their weights according to scale-free network characteristics, and adjusts the tunnel length. It also revisits already-acquired topics according to topic heat in order to obtain their reply information.
The object of the invention can be achieved by the following technical scheme:
The topic information acquisition method based on network topology comprises the following steps:
a. obtain the seed page set from a search engine;
b. segment every page in the seed page set into words according to the topic words, represent it as a vector set, extract URLs, and initialize the unvisited URL queue;
c. select a URL from the unvisited URL queue, fetch the corresponding page, and calculate the similarity between the fetched page and the seed page set;
d. compare the similarity between the fetched page and the seed page set with a set threshold.
Step d specifically comprises:
If the similarity is greater than the set threshold:
1) parse URLs from the page, remove duplicates, and insert them into the unvisited URL queue; compare the path relation between the parent URL and each child URL, and assign different weights to the child URLs;
2) calculate the link weight of each child URL; the link weighting coefficient of child page i with respect to parent page j is $\mathrm{link}_{ji} = \mathrm{path}_{ji} + \mathrm{freq}_i$, where $\mathrm{path}_{ji}$ is the URL path weight for the different cases and $\mathrm{freq}_i$ is the normalized anchor-text keyword frequency;
3) correct the weight of the child URL; the corrected weight is:
$$\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)$$
where n is the in-degree of page i, $\mathrm{sim}(V_t, D)$ is the relevance of the parent page to the seed set, $\mathrm{link}_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $\eta(k_t) = k_t / \sum_j k_j$ is the preference probability of the topic page, and $k_t$ is the number of valid links cited by the parent page.
If the similarity is not greater than the set threshold, the tunnel length is set according to the distance between the URL and the seed set. The tunnel length is $\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right)$, where floor rounds down, $\sigma$ is the initial depth parameter constant, and $n(i)$ is the link depth from the seed set to page i. If the tunnel length of the URL is greater than 0, the child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.
Assigning different link weights to the child URLs specifically comprises:
1) the child URL contains the parent URL, so the child page lies in a subdirectory of the parent page, and the topic of the child page expands and extends the topic of the parent page; the weight assigned to the child URL is t;
2) the child URL has a path similar to that of the parent URL: the child page and the parent page have the same directory depth and file name length, and the new topic precedes or follows the parent topic; the weight assigned to the child URL is t;
3) the child URL is a redundant link such as a background illustration or an advertisement; the weight assigned to the child URL is [formula image A200810227582D0008100150QIETU],
where 0.4 < t < 0.6.
The topic information acquisition method based on network topology of the present invention can also be realized by the following steps:
a. obtain the seed page set from a search engine;
b. segment every page in the seed page set into words according to the topic words, represent it as a vector set, and extract URLs;
c. perform template matching on the URLs in the visited queue; when a page contains reply information, preferentially select the topic URLs with high heat, according to topic heat, to obtain new replies.
The topic heat is: $$\mathrm{heft}(t) = (n+1)^{\alpha} e^{-(1 + t - \bar{t})/\beta}, \qquad \bar{t} = \sum_{i=1}^{n} t_i / n$$
where $\bar{t}$ is the average reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; $\alpha$ and $\beta$ are constants; $0 < \alpha < 1$, and $\alpha$ is the weighting exponent of the preference probability; $\beta$ determines the smoothness of the topic heat function.
The specific method and steps of the invention are described in detail below.
First, the seed pages are obtained. Each search engine is queried with the focus keywords, and the first m records are taken as the initial focus links. The source files of the initial links are fetched to obtain the seed page set $D = \langle D_1, D_2, D_3, \ldots, D_m \rangle$. For every page $D_i$ in the set, the topic information is extracted and segmented into words; meaningless auxiliary words, adverbs, and stop words are removed; and the page is expressed in document vector form. If document $D_i$ contains the terms $\langle t_1, t_2, t_3, \ldots, t_n \rangle$, the corresponding n-dimensional document vector is $\langle w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in} \rangle$, where $w_{ij}$ is the weight of term j. $w_{ij}$ uses the classical TF × IDF definition, and IDF is updated incrementally with the total number of documents. The seed page set is thereby mapped to the document vector set $W = \langle W_1, W_2, W_3, \ldots, W_m \rangle$. The URLs parsed from the seed page set are given an initial weight of 1 and added to the search queue of the acquisition system; during acquisition, links with higher weights are searched first.
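The mapping from segmented seed pages to document vectors can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes raw term counts for TF and log(N / df) for IDF as one common reading of "classical TF × IDF", it recomputes IDF from scratch rather than incrementally, and the function name `tfidf_vectors` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to sparse TF-IDF vectors (dict: term -> weight).

    Assumption: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical, already segmented and stop-word-filtered seed pages.
seed_docs = [
    ["beijing", "olympic", "games"],
    ["olympic", "torch", "relay"],
    ["beijing", "weather"],
]
vectors = tfidf_vectors(seed_docs)
```

A term that occurs in every document receives weight 0 under this IDF variant, so very common words contribute nothing to later similarity scores.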
A newly fetched page is preprocessed, segmented, and converted to a term vector, and its similarity to the seed page set is then calculated. The similarity between documents is measured by the cosine of the angle between their document vectors. For two pages $D_i$ and $D_j$, the similarity is
$$\mathrm{sim}\langle D_i, D_j \rangle = \frac{D_i \cdot D_j}{|D_i| \times |D_j|} = \frac{\sum_{k=1}^{n} w_{ik} \times w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2} \times \sqrt{\sum_{k=1}^{n} w_{jk}^2}}$$
For the seed page set D and a new page V, the similarity between the new page and the seed page set is the mean of the similarities between V and every page in the set:
$$\mathrm{sim}\langle V, D \rangle = \frac{1}{m} \sum_{k=1}^{m} \mathrm{sim}\langle V, D_k \rangle$$
The higher the similarity of two pages, the smaller the angle between them in the vector space and the more likely they are to describe the same topic; conversely, the lower the similarity, the more likely the pages belong to different topics. If the similarity between a page and the seed page set is higher than the threshold, the page is added to the seed set.
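The cosine measure and the averaging over the seed set can be sketched as below, under stated assumptions: document vectors are sparse dictionaries mapping terms to weights, a zero vector is given similarity 0, and the threshold value 0.5 and the function names are hypothetical rather than taken from the patent.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty page is similar to nothing
    return dot / (norm_u * norm_v)


def seed_similarity(page, seeds):
    """Mean cosine similarity between one page and every seed page."""
    return sum(cosine(page, seed) for seed in seeds) / len(seeds)


seeds = [{"olympic": 1.0, "beijing": 1.0}, {"olympic": 2.0}]
page = {"olympic": 1.0}

THRESHOLD = 0.5  # hypothetical value; the patent does not state one
sim = seed_similarity(page, seeds)
is_relevant = sim > THRESHOLD
```

In a crawler loop, a page for which `is_relevant` holds would be appended to `seeds`, which matches the statement that the seed set keeps growing.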
The URLs parsed from a page are filtered by link analysis. A page pointed to by a link cited in another page is called a child page of that parent page. The structure of a URL parsed from the parent page reflects the relation between the child page and the parent page, and different weights are assigned in the following cases (the weight parameter is t, 0.4 < t < 0.6):
(1) The child URL contains the parent URL. For example, if the parent URL is "http://mil\.news\.sina\.com\.cn/" and the child URL matches "http://mil\.news\.sina\.com\.cn/\w/\d{4}-\d{2}-\d{2}/\d+\.html" or "http://mil\.news\.sina\.com\.cn/\d{4}-\d{2}-\d{2}/\d+\.html", the child page lies in a subdirectory of the parent page. The topic of the child page expands and extends the topic of the parent page, and the child URL is assigned weight t.
(2) The child URL has a path similar to that of the parent URL. The child page and the parent page have the same directory depth and file name length, the new topic precedes or follows the parent topic, and the child URL is assigned weight t.
(3) Redundant links such as background illustrations and advertisements are assigned the weight [formula image not reproduced in the source text].
In addition, for URLs that are more relevant to the topic, the text near the link tends to contain the focus keywords. The link weighting coefficient of page i with respect to parent page j is therefore $\mathrm{link}_{ji} = \mathrm{path}_{ji} + \mathrm{freq}_i$, where $\mathrm{path}_{ji}$ is the URL path weight from the three cases above and $\mathrm{freq}_i$ is the normalized anchor-text keyword frequency.
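The three path cases and the link weighting coefficient can be sketched as follows. Several points are assumptions rather than the patent's definitions: the weight for redundant links is not recoverable from the source text, so `REDUNDANT_WEIGHT = 0.0` is a placeholder; every link that matches neither case (1) nor case (2) is treated as case (3); and the anchor-text frequency is normalized by the number of words in the anchor text. All names and example URLs are hypothetical.

```python
from urllib.parse import urlparse

T = 0.5                 # weight parameter t; the text requires 0.4 < t < 0.6
REDUNDANT_WEIGHT = 0.0  # assumption: the case (3) value is missing in the source


def _host_dirs_file(url):
    """Split a URL into host, list of directory names, and file name."""
    parts = urlparse(url)
    segments = parts.path.split("/")
    return parts.netloc, segments[1:-1], segments[-1]


def path_weight(parent_url, child_url):
    """Path weight of a child URL relative to its parent URL."""
    if child_url.startswith(parent_url):
        return T  # case (1): the child URL contains the parent URL
    p_host, p_dirs, p_file = _host_dirs_file(parent_url)
    c_host, c_dirs, c_file = _host_dirs_file(child_url)
    if (p_host == c_host and len(p_dirs) == len(c_dirs)
            and len(p_file) == len(c_file)):
        return T  # case (2): same directory depth and file name length
    return REDUNDANT_WEIGHT  # case (3), by assumption


def anchor_keyword_freq(anchor_text, keywords):
    """Share of anchor-text words that are focus keywords (assumed normalization)."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in keywords) / len(words)


def link_coefficient(parent_url, child_url, anchor_text, keywords):
    """link = path + freq for one child link."""
    return (path_weight(parent_url, child_url)
            + anchor_keyword_freq(anchor_text, keywords))


coefficient = link_coefficient(
    "http://mil.news.sina.com.cn/",
    "http://mil.news.sina.com.cn/2008-11-28/123.html",
    "Beijing Olympic torch relay",
    {"beijing", "olympic"},
)
```

Here the child URL falls under case (1), and two of the four anchor words are keywords, so the coefficient is 0.5 + 0.5.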
On the basis of link analysis, the URLs are weighted and sorted, and links are selected by URL weight and added to the unvisited URL queue. The World Wide Web has the characteristics of a scale-free network: pages or sites are the nodes of the network, links are its edges, and the degree distribution follows a power law (Sen Qin, Guan-Zhong Dai, Yan-Ling Li, Design and Implementation of Web Hot-topic Talk Mining Based on Scale-free Network, Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, 2006, pp. 13-16). Growth and preferential attachment in a scale-free network mean that pages that already contain more links are more likely to acquire new links. The preference probability with which a new topic link joining the network attaches to an existing topic is proportional to the number of links that topic contains. Similarity misjudgments can admit irrelevant pages to the focus topic, and link analysis alone cannot effectively filter out the links extracted from such pages. However, these wrongly focused pages usually contain few valid links. Weighting URLs by the scale-free preference probability therefore lowers the weights of the links extracted from misjudged pages and improves precision. Building on the PageRank algorithm and taking into account relevance calculation, link analysis, and scale-free network characteristics, the corrected weight is:
$$\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)$$
where n is the in-degree of page i, $\mathrm{sim}(V_t, D)$ is the relevance of the parent page to the seed set, $\mathrm{link}_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $\eta(k_t) = k_t / \sum_j k_j$ is the preference probability of the topic page, and $k_t$ is the number of valid links cited by the parent page. $\mathrm{sim}(V_t, D)$ is an important part of the URL weight, and the similarity threshold determines how the child URLs are handled.
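The corrected weight can be illustrated with a small worked sketch. It assumes that the sum in the preference probability runs over the parent pages known to the crawler, which the text does not state explicitly; the page identifiers, link counts, and similarity values are made up for illustration.

```python
def preference_probabilities(valid_link_counts):
    """eta(k_t) = k_t / sum_j k_j for every known parent page."""
    total = sum(valid_link_counts.values())
    if total == 0:
        return {page: 0.0 for page in valid_link_counts}
    return {page: k / total for page, k in valid_link_counts.items()}


def score(in_links, eta, sim):
    """Corrected weight of one URL.

    in_links: list of (parent_id, link_coefficient) pairs, one per in-link.
    eta:      preference probability of each parent page.
    sim:      similarity of each parent page to the seed set.
    """
    return sum(link * eta[parent] * sim[parent] for parent, link in in_links)


# Hypothetical data: page A cites 30 valid links, page B only 10.
eta = preference_probabilities({"A": 30, "B": 10})
sim = {"A": 0.8, "B": 0.4}
url_score = score([("A", 1.0), ("B", 0.5)], eta, sim)
```

With these numbers the in-link from A contributes 1.0 × 0.75 × 0.8 and the in-link from B contributes 0.5 × 0.25 × 0.4, so a parent with few valid links adds little to the score even when a link from it exists.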
If the content of a page is relevant to the focus topic, the child URLs parsed from it are more likely to be relevant to the topic; otherwise the child URLs tend to describe a new topic. A highly relevant page may, however, cite a large number of recommended links, so that the extracted child URLs do not belong to the focus topic yet carry high weights because they inherit the relevance of the parent page. Fetching these child URLs yields too much irrelevant content and degrades system performance. Therefore, when the child URLs parsed from the same parent page are fetched, if several consecutive fetches all return information irrelevant to the topic, the weights of all child URLs of that page are reduced.
In a topic network, the path from one relevant page to another may pass through irrelevant regions, so that relevant topic pages are hidden behind irrelevant links; this is called the tunneling phenomenon. A page with low similarity to the seed set may, after several levels of links, lead to a page with high similarity. Ignoring the URLs extracted from pages whose similarity is below the threshold loses the relevant links hidden behind those pages and reduces the number of topics obtained. Setting a suitable buffer for irrelevant pages improves the recall of the topic information acquisition system. An atomic model is therefore used to allocate the number of buffered link hops. The seed set is the basis of topic focusing and plays the role of the atomic nucleus, which is continually updated. The URLs extracted from the seed set play the role of the electrons outside the nucleus. When the same URL has several in-links, its minimum distance to the seed set is taken as its link depth. URLs at the same depth are regarded as being at the same energy level. The smaller the depth of a URL, the closer it is to the nucleus, the more strongly the nucleus binds it, and the more likely it is to lead to relevant topic pages, so a larger number of buffer levels is set. Conversely, the larger the depth of a URL, the more likely it is to escape the binding of the nucleus and the less likely it is to lead to the focus topic. The buffer hop count (that is, the tunnel length) is set to
$$\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right)$$
where step(i) is the number of buffered page hops for page i, floor rounds down, $\sigma$ is the initial depth parameter constant, and $n(i)$ is the link depth from the seed set to page i. If the buffer hop count of a URL in an irrelevant page equals 0, the weights of all child URLs in that page are reduced.
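The buffering rule can be sketched as below. The sketch assumes that the tunnel length reads step(i) = floor(σ / n(i)), which is the reading consistent with "smaller depth, larger buffer"; the value σ = 3, the multiplicative penalty 0.5, and the function names are hypothetical, since the patent only says that the weights are "reduced".

```python
import math


def tunnel_length(depth, sigma):
    """Buffered hops allowed below an irrelevant page at the given link depth.

    Assumption: step(i) = floor(sigma / n(i)), with depth n(i) >= 1.
    """
    if depth < 1:
        raise ValueError("link depth must be at least 1")
    return math.floor(sigma / depth)


def adjust_child_weights(child_weights, depth, sigma, penalty=0.5):
    """Handle the child URLs of a page whose similarity is below the threshold.

    While the tunnel length is positive, the children keep their weights;
    once it reaches 0, every child weight is scaled down by `penalty`
    (a hypothetical reduction rule).
    """
    if tunnel_length(depth, sigma) > 0:
        return dict(child_weights)
    return {url: weight * penalty for url, weight in child_weights.items()}


SIGMA = 3  # hypothetical initial depth parameter
steps = [tunnel_length(depth, SIGMA) for depth in (1, 2, 3, 4)]
near = adjust_child_weights({"http://example.com/a": 1.0}, depth=1, sigma=SIGMA)
far = adjust_child_weights({"http://example.com/a": 1.0}, depth=4, sigma=SIGMA)
```

With σ = 3, pages at depth 1 to 3 still receive at least one buffered hop, while pages at depth 4 or deeper receive none and have their child weights reduced.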
Topic URLs that have been visited are not released but are added to the visited URL queue. Link template matching is applied to these URLs; if a page carries reply information, as followed news comments, bulletin boards, and blogs do, URLs are selected according to topic heat in order to obtain the new replies of the most popular topics. Topic posts that already have more replies have attracted public interest, and their probability of obtaining further replies is larger; meanwhile, as time passes, outdated topic posts receive fewer replies and their attraction gradually tends to 0. The topic heat is defined as follows:
$$\mathrm{heft}(t) = (n+1)^{\alpha} e^{-(1 + t - \bar{t})/\beta}, \qquad \bar{t} = \sum_{i=1}^{n} t_i / n$$
where $\bar{t}$ is the average reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; and $\alpha$ and $\beta$ are constants. $0 < \alpha < 1$, and $\alpha$ is the weighting exponent of the preference probability. $\beta$ determines the smoothness of the topic heat function: the smaller $\beta$ is, the more detail the function reflects, and the larger $\beta$ is, the flatter the function. Because a flatter function estimates the future reply trend of a topic better, $\beta$ is generally greater than 2.
The heat of a topic usually lies between 0 and 30, and a topic with heat greater than 15 can be considered a hot topic. When URLs are selected from the visit queue, the title URLs of the most popular news comments, bulletin boards, or blogs are chosen first.
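The heat function can be sketched as follows, under stated assumptions: t is taken to be the current time, times are plain numbers in an arbitrary unit such as days, a topic without replies uses the current time as its mean reply time, and the default values α = 0.5 and β = 3.0 are hypothetical choices inside the ranges the text gives.

```python
import math


def topic_heat(now, reply_times, alpha=0.5, beta=3.0):
    """heft(t) = (n + 1)**alpha * exp(-(1 + t - mean_reply_time) / beta).

    now:         current time t (assumed meaning of t)
    reply_times: times of all replies to the topic so far
    """
    n = len(reply_times)
    # Assumption: with no replies yet, treat the topic as active right now.
    mean_reply_time = sum(reply_times) / n if n else now
    return (n + 1) ** alpha * math.exp(-(1 + now - mean_reply_time) / beta)


# Two hypothetical topics with three replies each, evaluated at time 10.
fresh = topic_heat(10, [8, 9, 10])  # replies clustered near the present
stale = topic_heat(10, [1, 2, 3])   # replies clustered long ago
```

Both topics have the same reply count, so the difference in heat comes only from the exponential decay term: the topic whose replies are older scores lower and would be revisited later.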
Compared with the prior art, the present invention has the following advantages:
(1) Link analysis is performed on URLs to filter out redundant and irrelevant links, which saves system resources.
(2) The preference probability of a scale-free network is used to correct URL weights; this exploits the way Internet information clusters and reduces the impact of similarity misjudgments.
(3) Relevance feedback is applied to URLs so that the system leaves irrelevant regions as early as possible, which improves accuracy.
(4) Topic heat is used to evaluate topics that carry replies, so that system time is spent mainly on hot topics.
Description of drawings
Fig. 1 is the workflow diagram of the entire method;
Fig. 2 compares the precision of the topic information acquisition system and the BFS acquisition system;
Fig. 3 compares the relevant-topic acquisition rate of the topic information acquisition system and the BFS acquisition system;
Fig. 4 compares the precision with and without link analysis;
Fig. 5 compares the precision under different weighting schemes;
Fig. 6 compares the precision before and after similarity feedback.
Embodiment
The performance of the topic information acquisition system is measured by precision. Precision measures how accurately relevant topic pages are obtained: if M is the total number of topics fetched and T is the number of relevant topics among the fetched pages, then precision = T/M. With the keyword "Beijing Olympic" as the focus, a large number of topic pages were fetched from on the order of a hundred websites, in order to compare the relevant-topic acquisition efficiency of the topic information acquisition system with that of a general acquisition system and to examine how different system parameters affect topic information acquisition performance (the weight parameter t is set to 0.5). Fig. 1 shows the workflow of the topic information acquisition method based on network topology. The results in Figs. 2 to 6 are averages over repeated simulation runs. Because the network updates quickly and the accuracy of information acquisition depends on site structure, the seed set of the topic acquisition system is kept identical to the initial queue of the BFS acquisition system to ensure comparability.
Fig. 2 shows how the precision of the topic information acquisition system and of the BFS acquisition system changes with the total number of pages fetched. The seed page set of the topic acquisition system and the initial URL queue of the BFS system are obtained from a search engine, and more than 10,000 pages are fetched continuously. Fig. 3 shows the corresponding relevant-page acquisition rate. The precision of the proposed topic acquisition system is clearly higher than that of the BFS acquisition system: as the number of fetched pages increases, the BFS system acquires relevant topics slowly and its precision falls below 40%, while the topic acquisition system stays above 70%.
Fig. 4 shows the effect of link analysis on the precision of the topic acquisition system. The initial seed page set contains 10 pages, and 3,500 pages are fetched. Without link analysis, a large number of invalid and redundant links in the topic pages are collected, which degrades system performance; precision falls below 0.6 after 2,500 pages have been fetched.
Fig. 5 shows how precision changes with the number of fetched pages under different weighting schemes. The weight allocation scheme based on Internet topology gives higher precision than PageRank weights and declines more gently. Equal-gain weight allocation gives higher precision early in the acquisition, but as the system runs, its greedy pursuit of the highest similarity easily becomes trapped in a local optimum, and precision drops rapidly.
Fig. 6 compares precision before and after similarity feedback. Similarity feedback lets the system leave an irrelevant region more quickly after it has become trapped there, which prevents performance from deteriorating further.
Table 1 below gives the precision of the topic acquisition system and the BFS acquisition system for different focus topics, with 10 initial links and 5,000 pages fetched.
Table 1. Precision for different topics
Topic                             Topic acquisition system   BFS acquisition system
China's large aircraft project    0.38                       0.071
Shenzhou VI spacecraft            0.65                       0.12
Wenchuan earthquake               0.76                       0.28
Notebook                          0.73                       0.13

Claims (4)

1. A topic information acquisition method based on network topology, characterized by comprising the following steps:
a. obtain the seed page set from a search engine;
b. segment every page in the seed page set into words according to the topic words, represent it as a vector set, extract URLs, and initialize the unvisited URL queue;
c. select a URL from the unvisited URL queue, fetch the corresponding page, and calculate the similarity between the fetched page and the seed page set;
d. compare the similarity between the fetched page and the seed page set with a set threshold.
2. The topic information acquisition method based on network topology according to claim 1, characterized in that step d specifically comprises:
If the similarity is greater than the set threshold:
1) parse URLs from the page, remove duplicates, and insert them into the unvisited URL queue; compare the path relation between the parent URL and each child URL, and assign different weights to the child URLs;
2) calculate the link weight of each child URL; the link weighting coefficient of child page i with respect to parent page j is $\mathrm{link}_{ji} = \mathrm{path}_{ji} + \mathrm{freq}_i$, where $\mathrm{path}_{ji}$ is the URL path weight for the different cases and $\mathrm{freq}_i$ is the normalized anchor-text keyword frequency;
3) correct the weight of the child URL; the corrected weight is:
$$\mathrm{score}(i) = \sum_{t=1}^{n} \mathrm{link}_{ti} \cdot \eta(k_t) \cdot \mathrm{sim}(V_t, D)$$
where n is the in-degree of page i, $\mathrm{sim}(V_t, D)$ is the relevance of the parent page to the seed set, $\mathrm{link}_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $\eta(k_t) = k_t / \sum_j k_j$ is the preference probability of the topic page, and $k_t$ is the number of valid links cited by the parent page.
If the similarity is not greater than the set threshold, the tunnel length is set according to the distance between the URL and the seed set. The tunnel length is $\mathrm{step}(i) = \mathrm{floor}\left(\frac{\sigma}{n(i)}\right)$, where floor rounds down, $\sigma$ is the initial depth parameter constant, and $n(i)$ is the link depth from the seed set to page i; if the tunnel length of the URL is greater than 0, the child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.
3. The topic information acquisition method based on network topology according to claim 2, characterized in that assigning different link weights to the child URLs specifically comprises:
1) the child URL contains the parent URL, so the child page lies in a subdirectory of the parent page, and the topic of the child page expands and extends the topic of the parent page; the weight assigned to the child URL is t;
2) the child URL has a path similar to that of the parent URL: the child page and the parent page have the same directory depth and file name length, and the new topic precedes or follows the parent topic; the weight assigned to the child URL is t;
3) the child URL is a redundant link such as a background illustration or an advertisement; the weight assigned to the child URL is [formula image A200810227582C0003101956QIETU],
where 0.4 < t < 0.6.
4. The topic information acquisition method based on network topology according to claim 1, characterized by comprising the following steps:
a. obtain the seed page set from a search engine;
b. segment every page in the seed page set into words according to the topic words, represent it as a vector set, and extract URLs;
c. perform template matching on the URLs in the visited queue; when a page contains reply information, preferentially select the topic URLs with high heat, according to topic heat, to obtain new replies.
The topic heat is: $$\mathrm{heft}(t) = (n+1)^{\alpha} e^{-(1 + t - \bar{t})/\beta}, \qquad \bar{t} = \sum_{i=1}^{n} t_i / n$$
where $\bar{t}$ is the average reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; $\alpha$ and $\beta$ are constants; $0 < \alpha < 1$, and $\alpha$ is the weighting exponent of the preference probability; $\beta$ determines the smoothness of the topic heat function.
CN2008102275821A 2008-11-28 2008-11-28 Topic information acquisition method based on network topology Expired - Fee Related CN101441662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102275821A CN101441662B (en) 2008-11-28 2008-11-28 Topic information acquisition method based on network topology

Publications (2)

Publication Number Publication Date
CN101441662A true CN101441662A (en) 2009-05-27
CN101441662B CN101441662B (en) 2010-12-22

Family

ID=40726096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102275821A Expired - Fee Related CN101441662B (en) 2008-11-28 2008-11-28 Topic information acquisition method based on network topology

Country Status (1)

Country Link
CN (1) CN101441662B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866362A (en) * 2010-07-01 2010-10-20 优视科技有限公司 Method and system for automatically positioning main contents of webpages for mobile communication equipment terminal
CN102129472A (en) * 2011-04-14 2011-07-20 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN101727494B (en) * 2009-12-29 2012-03-28 华中师范大学 Network hot word generating system in specific area
CN102567313A (en) * 2010-12-07 2012-07-11 盛乐信息技术(上海)有限公司 Progressive webpage library deduplication system and realization method thereof
CN102779120A (en) * 2011-05-09 2012-11-14 北京百度网讯科技有限公司 Method, system and device for determining field information of station and judging correlation
CN102821088A (en) * 2012-05-07 2012-12-12 北京京东世纪贸易有限公司 System and method for acquiring network data
CN103023714A (en) * 2012-11-21 2013-04-03 上海交通大学 Activeness and cluster structure analyzing system and method based on network topics
CN102087648B (en) * 2009-12-03 2013-06-19 北京大学 Method and system for fetching news comment page
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN103310026A (en) * 2013-07-08 2013-09-18 焦点科技股份有限公司 Lightweight common webpage topic crawler method based on search engine
CN103544220A (en) * 2013-09-29 2014-01-29 北京航空航天大学 Method and device for recommending applications
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method
CN105589892A (en) * 2014-11-12 2016-05-18 中国银联股份有限公司 Webpage theme analysis method based on anchor text backtracking chain
CN105608072A (en) * 2015-12-23 2016-05-25 厦门市美亚柏科信息股份有限公司 Text related region analysis method and system
CN105677772A (en) * 2015-12-30 2016-06-15 赛尔网络有限公司 ISP interconnection port URL activity level statistics method and device
CN106126688A (en) * 2016-06-29 2016-11-16 厦门趣处网络科技有限公司 Based on WEB content and the intelligent network information acquisition system of structure excavation, method
CN106168977A (en) * 2016-07-15 2016-11-30 河南山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN106257449A (en) * 2015-06-19 2016-12-28 阿里巴巴集团控股有限公司 A kind of information determines method and apparatus
CN108121741A (en) * 2016-11-30 2018-06-05 百度在线网络技术(北京)有限公司 Website quality appraisal procedure and device
CN111125564A (en) * 2018-11-01 2020-05-08 百度在线网络技术(北京)有限公司 Thermodynamic diagram generation method and device, computer equipment and storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100371932C (en) * 2004-03-23 2008-02-27 南京大学 Expandable and customizable theme centralized universal-web net reptile setup method
CN100338610C (en) * 2005-06-22 2007-09-19 浙江大学 Individual searching engine method based on linkage analysis

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102087648B (en) * 2009-12-03 2013-06-19 北京大学 Method and system for fetching news comment page
CN101727494B (en) * 2009-12-29 2012-03-28 华中师范大学 Network hot word generating system in specific area
CN101866362A (en) * 2010-07-01 2010-10-20 优视科技有限公司 Method and system for automatically positioning main contents of webpages for mobile communication equipment terminal
CN102567313A (en) * 2010-12-07 2012-07-11 盛乐信息技术(上海)有限公司 Progressive webpage library deduplication system and realization method thereof
CN102129472A (en) * 2011-04-14 2011-07-20 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN102129472B (en) * 2011-04-14 2012-12-19 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN102779120B (en) * 2011-05-09 2014-12-10 北京百度网讯科技有限公司 Method, system and device for determining field information of station and judging correlation
CN102779120A (en) * 2011-05-09 2012-11-14 北京百度网讯科技有限公司 Method, system and device for determining field information of station and judging correlation
CN103257957B (en) * 2012-02-15 2017-09-08 深圳市腾讯计算机系统有限公司 A kind of text similarity recognition methods and device based on Chinese word segmentation
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN102821088A (en) * 2012-05-07 2012-12-12 北京京东世纪贸易有限公司 System and method for acquiring network data
CN102821088B (en) * 2012-05-07 2015-12-16 北京京东世纪贸易有限公司 Obtain the system and method for network data
CN103023714B (en) * 2012-11-21 2015-12-23 上海交通大学 The liveness of topic Network Based and cluster topology analytical system and method
CN103023714A (en) * 2012-11-21 2013-04-03 上海交通大学 Activeness and cluster structure analyzing system and method based on network topics
CN103310026A (en) * 2013-07-08 2013-09-18 焦点科技股份有限公司 Lightweight common webpage topic crawler method based on search engine
CN103310026B (en) * 2013-07-08 2016-11-23 焦点科技股份有限公司 A kind of lightweight common webpage topic crawler method based on search engine
CN103544220A (en) * 2013-09-29 2014-01-29 北京航空航天大学 Method and device for recommending applications
CN103544220B (en) * 2013-09-29 2017-04-05 北京航空航天大学 Using recommendation method and apparatus
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method
CN105589892A (en) * 2014-11-12 2016-05-18 中国银联股份有限公司 Webpage theme analysis method based on anchor text backtracking chain
CN105589892B (en) * 2014-11-12 2019-01-18 中国银联股份有限公司 Web page subject analysis method based on Anchor Text trace-back chain
CN106257449B (en) * 2015-06-19 2019-11-12 阿里巴巴集团控股有限公司 A kind of information determines method and apparatus
CN106257449A (en) * 2015-06-19 2016-12-28 阿里巴巴集团控股有限公司 A kind of information determines method and apparatus
CN105608072B (en) * 2015-12-23 2019-02-19 厦门市美亚柏科信息股份有限公司 Text is related to ground analysis method and its system
CN105608072A (en) * 2015-12-23 2016-05-25 厦门市美亚柏科信息股份有限公司 Text related region analysis method and system
CN105677772A (en) * 2015-12-30 2016-06-15 赛尔网络有限公司 ISP interconnection port URL activity level statistics method and device
CN105677772B (en) * 2015-12-30 2019-07-09 赛尔网络有限公司 The statistical method and device of interconnection port URL liveness between a kind of ISP
CN106126688A (en) * 2016-06-29 2016-11-16 厦门趣处网络科技有限公司 Based on WEB content and the intelligent network information acquisition system of structure excavation, method
CN106126688B (en) * 2016-06-29 2020-03-24 厦门趣处网络科技有限公司 Intelligent network information acquisition system and method based on WEB content and structure mining
CN106168977A (en) * 2016-07-15 2016-11-30 河南山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN106168977B (en) * 2016-07-15 2019-07-02 山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN108121741A (en) * 2016-11-30 2018-06-05 百度在线网络技术(北京)有限公司 Website quality appraisal procedure and device
CN108121741B (en) * 2016-11-30 2021-12-28 百度在线网络技术(北京)有限公司 Website quality evaluation method and device
CN111125564A (en) * 2018-11-01 2020-05-08 百度在线网络技术(北京)有限公司 Thermodynamic diagram generation method and device, computer equipment and storage medium
CN111125564B (en) * 2018-11-01 2023-09-15 百度在线网络技术(北京)有限公司 Thermodynamic diagram generation method, thermodynamic diagram generation device, thermodynamic diagram generation computer device and thermodynamic diagram generation storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device

Also Published As

Publication number Publication date
CN101441662B (en) 2010-12-22

Similar Documents

Publication Publication Date Title
CN101441662B (en) Topic information acquisition method based on network topology
CN102054004B (en) Webpage recommendation method and device adopting same
Aggarwal et al. Intelligent crawling on the World Wide Web with arbitrary predicates
CN102930059B (en) Method for designing focused crawler
JP4996300B2 (en) File system search ranking method and related search engine
CN102298622B (en) Search method for focused web crawler based on anchor text and system thereof
Pal et al. Effective focused crawling based on content and link structure analysis
CN102355488A (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN101477554A (en) User interest based personalized meta search engine and search result processing method
CN103714149B (en) Self-adaptive incremental deep web data source discovery method
CN101770521A (en) Focusing relevancy ordering method for vertical search engine
CN103714140A (en) Searching method and device based on topic-focused web crawler
Bhatia Link analysis algorithms for web mining
CN103853831A (en) Personalized searching realization method based on user interest
CN102722499A (en) Search engine and implementation method thereof
CN102750380B (en) Page sorting method in combination with difference feature distribution and link feature
CN103279492A (en) Method and device for catching webpage
CN104636403A (en) Query request processing method and device
CN102799686A (en) Water resource information vertical search method based on cloud platform
Neunerdt et al. Focused crawling for building web comment corpora
Wang et al. Ts-ids algorithm for query selection in the deep web crawling
CN102521313A (en) Static index pruning method based on web page quality
Joshi et al. Improving Pagerank Calculation by using Content Weight
Hati et al. Improved focused crawling approach for retrieving relevant pages based on block partitioning
CN101382956A (en) Information acquisition method and system for orienting subject

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2010-12-22

Termination date: 2012-11-28