CN101441662A - Topic information acquisition method based on network topology - Google Patents
- Publication number
- CN101441662A
- Authority
- CN
- China
- Prior art keywords
- url
- webpage
- sub
- link
- theme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention relates to a topic information acquisition method based on network topology. An initial set of web pages is obtained from a search engine; after cleaning, word segmentation, and stop-word removal, the pages are represented as a set of vectors, and a vector space model is used to calculate text similarity. Extracted URLs are first subjected to link analysis based on the network structure: links are filtered according to the directory hierarchy of their URLs, and URL weights are then corrected according to the scale-free property of the network so that selection follows preferential attachment. In addition, feedback is applied to irrelevant topic regions, and the buffer length for URLs in irrelevant pages is set according to the distance between the URL and the seed set. The heat of each acquired topic is calculated, and hot topics are selected in order to obtain their new replies.
Description
Technical field
The present invention relates to a topic information acquisition method based on network topology, and belongs to the field of network security.
Background technology
As information becomes increasingly networked, the amount of information on the Internet grows daily, and great potential value lies in these massive, heterogeneous Web information resources. The Internet's convenient publishing model and its platform for audience interaction have let it surpass traditional media and become the main channel for obtaining real-time information. News events usually appear on the Internet first and spark discussion there.
Extracting and using network information effectively has become a major challenge. Search engines give users an efficient and effective way to obtain information through queries. The network information acquisition system downloads web pages from the World Wide Web for the search engine and is an important component of it (J. Cho, Crawling the Web: Discovery and Maintenance of Large-Scale Web Data, Computer Science, 2001). A general breadth-first search (BFS) acquisition system searches the next level only after it has finished the current level; its coverage is wide, but it often includes information the user does not care about. Topic information acquisition systems that focus on a specific topic address this problem well. Such a system, according to a set crawling target, uses a web page analysis algorithm to visit relevant links selectively and obtain the required information; its purpose is to prepare data resources for topic-oriented user queries (Zhou Lizhu, Lin Ling, A survey of focused crawler technology, Computer Applications, 2005, 25(9): 1965-1969).
Jyh-Jong Tsay et al. (Jyh-Jong Tsay, Chen-Yang Shih, Bo-Liang Wu. Auto Crawler: an Integrated System For Automatic Topical Crawler. Computer and Information Science, 2005. Fourth Annual ACIS International Conference on, 2005: 462-467) used relevance feedback on top of breadth-first search and set the tunnel length appropriately, so that the acquisition system leaves irrelevant regions as early as possible while still mining relevant content hidden behind irrelevant links. Jamali M. et al. (Jamali M., Sayyadi H., Hariri B.B., et al. A Method for Focused Crawling Using Combination of Link Structure and Content Similarity. Web Intelligence, 2006. IEEE/WIC/ACM International Conference on, 2006: 753-756) process URLs by combining link structure with text similarity; the URL weight is defined as the product of the page similarity and the sum of the URL's in-links and out-links, which handles URLs only after the fact. Wang Tao and Fan Xiaozhong (Wang Tao, Fan Xiaozhong. Improvement of topic crawlers by link analysis. Computer Applications, 2004, 24(B12): 174-176) build on the vector space model and filter URLs by analyzing the physical and logical structure of links, pursuing higher precision at the cost of slightly lower recall.
General topic information acquisition systems treat all URLs extracted from the same web page identically, retain many irrelevant links, and cannot reduce the impact of similarity misjudgments.
Summary of the invention
The object of the invention is to avoid the above shortcomings of the prior art by providing a topic information acquisition method based on network topology. The invention processes URLs according to the topology of the Internet: it performs link analysis on URLs, corrects their weights according to scale-free network characteristics, and adjusts the tunnel length. It also revisits topics that have already been collected, according to topic heat, to obtain their reply information.
The object of the invention is achieved by the following technical solution:
The topic information acquisition method based on network topology comprises the following steps:
a. obtain a seed web page set from a search engine;
b. segment every web page in the seed web page set into words according to the topic words, represent the pages as a vector set, extract URLs, and initialize the unvisited URL queue;
c. select a URL from the unvisited URL queue, collect the corresponding web page, and calculate the similarity between the collected page and the seed web page set;
d. compare the similarity between the collected page and the seed web page set with a preset threshold.
Step d specifically comprises:
If the similarity is greater than the preset threshold:
1) parse URLs from the web page, deduplicate them, and insert them into the unvisited URL queue; compare the path relationship between the parent URL and each child URL and assign different weights to the child URLs;
2) calculate the link weight of each child URL; the link weighting coefficient of child page i with respect to parent page j is $link_{ji} = path_{ji} + freq_i$, where $path_{ji}$ is the weight assigned for the URL path relationship and $freq_i$ is the normalized anchor-text keyword frequency;
3) correct the weight of the child URL; the corrected weight is given by a formula that is not reproduced in this text. In that formula, n is the in-degree of page i, $sim(V_t, D)$ is the relevance of the parent page to the seed set, $link_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $k_t$ is the number of effective links referenced by the parent page, and one further term, whose symbol is also missing here, is the preferential attachment probability of the topic page.
If the similarity is not greater than the preset threshold, the tunnel length is set according to the distance between the URL and the seed set. The tunnel length formula is not reproduced in this text; in it, floor denotes rounding down, σ is the initial depth parameter constant, and n(i) is the link depth from the seed set to page i. If the tunnel length of the URL is greater than 0, its child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.
Assigning different link weights to child URLs specifically includes:
1) the child URL contains the parent URL, so the child page lies in a subdirectory of the parent page and its topic expands and extends the topic of the parent page; the weight assigned to the child URL is t;
2) the child URL and the parent URL have similar paths, that is, the child page and the parent page have the same directory depth and file name length, and the child page's topic precedes or follows that of the parent; the weight assigned to the child URL is t;
3) the child URL is a redundant link such as a background illustration or an advertisement; the weight assigned to the child URL is given by an expression that is not reproduced in this text;
where 0.4 < t < 0.6.
The topic information acquisition method based on network topology can also be implemented by the following steps:
a. obtain a seed web page set from a search engine;
b. segment every web page in the seed web page set into words according to the topic words, represent the pages as a vector set, and extract URLs;
c. apply template matching to the URLs in the visited queue; when a web page contains reply information, preferentially select the topic URLs with high heat, according to topic heat, and obtain their new replies.
The topic heat is given by a formula that is not reproduced in this text.
In that formula, t is the mean reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; α and β are constants; 0 < α < 1, and α is the weighting exponent of the preferential attachment probability; β determines the smoothness of the topic heat function.
The specific method and steps of the invention are described in detail below.
First, the seed web pages are obtained. Each search engine is queried with the focus keywords, and the first m records are taken as the initial focused links. The source files of the initial links are fetched to obtain the seed web page set $D = \langle D_1, D_2, D_3, \ldots, D_m \rangle$. For every web page $D_i$ in the set, the topic information is extracted and segmented into words, meaningless auxiliary words, adverbs, and stop words are removed, and the page is represented as a document vector. If document $D_i$ contains the terms $\langle t_1, t_2, t_3, \ldots, t_n \rangle$, the corresponding n-dimensional document vector is $\langle w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in} \rangle$, where $w_{ij}$ is the weight of term j. $w_{ij}$ uses the classical TF × IDF definition, and IDF is updated incrementally as the total number of documents grows. The seed web page set is thus mapped to the document vector set $W = \langle W_1, W_2, W_3, \ldots, W_m \rangle$. URLs parsed from the seed web page set are given an initial weight of 1 and added to the search queue of the acquisition system; during collection, links with higher weights are searched first.
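The weighted search queue can be sketched as a max-priority frontier. This is a minimal illustration rather than the patent's implementation: the class name `Frontier`, its methods, and the example URLs are hypothetical, and only the initial weight of 1 and the higher-weight-first ordering come from the text.

```python
import heapq
import itertools


class Frontier:
    """Unvisited-URL queue that pops the highest-weight URL first (hypothetical helper)."""

    def __init__(self):
        self._heap = []                   # entries: (-weight, insertion order, url)
        self._seen = set()                # every URL ever queued, for deduplication
        self._order = itertools.count()   # tie-breaker: earlier insertions pop first

    def push(self, url, weight=1.0):
        """Queue a URL once; the default weight 1 matches the seed-URL weight in the text."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-weight, next(self._order), url))
        return True

    def pop(self):
        """Return (url, weight) for the highest-weight queued URL."""
        neg_weight, _, url = heapq.heappop(self._heap)
        return url, -neg_weight

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.com/a")        # seed URL, weight 1
frontier.push("http://example.com/b", 0.4)
frontier.push("http://example.com/c", 1.5)
frontier.push("http://example.com/a", 2.0)   # duplicate, ignored
```

Negating the weight lets Python's min-heap behave as a max-heap, and the `_seen` set gives the deduplication that the method applies before inserting parsed URLs.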
A newly fetched web page is preprocessed and segmented, converted into a term vector, and its similarity to the seed web page set is calculated. The similarity between documents is measured by the cosine of the angle between their document vectors. For two web pages $D_i$ and $D_j$, the similarity is
$$sim(D_i, D_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2}\,\sqrt{\sum_{k=1}^{n} w_{jk}^2}}$$
For the seed web page set D and a new page V, the similarity between the new page and the seed set is the mean of its similarity with every page in the set:
$$sim(V, D) = \frac{1}{m} \sum_{i=1}^{m} sim(V, D_i)$$
The more similar two pages are, the smaller the angle between them in the vector space and the more likely they are to describe the same topic; conversely, the lower the similarity, the greater the probability that they belong to different topics. If the similarity between a page and the seed web page set is higher than the threshold, the page is added to the seed set.
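The similarity calculation can be sketched as follows, under stated assumptions. Documents are taken to be already segmented into term lists, vectors are sparse dictionaries, and IDF is computed as log(N/df) in one batch; the text says only "classical TF × IDF" with incremental IDF updates, so the exact variant and all function names here are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to sparse TF x IDF vectors (dicts of term -> weight).

    Assumption: tf is the raw term count and idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def seed_similarity(page_vec, seed_vecs):
    """Mean cosine similarity between one page and every page of the seed set."""
    return sum(cosine(page_vec, s) for s in seed_vecs) / len(seed_vecs)
```

A page would then join the seed set when `seed_similarity` exceeds the chosen threshold. A term that occurs in every document gets weight 0 under this IDF variant, which is one reason real systems often smooth the IDF.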
Link analysis is performed on the URLs parsed from a web page in order to filter them. A page pointed to by a link in another page is called a child page of that parent page. The structure of a URL parsed from a parent page reflects the relationship between the child page and the parent page, and different weights are assigned in the following cases (the weight parameter is t, 0.4 < t < 0.6):
(1) The child URL contains the parent URL. For example, if the parent URL is "http://mil.news.sina.com.cn/" and the child URL matches "http://mil\.news\.sina\.com\.cn/\w/\d{4}-\d{2}-\d{2}/\d+\.html" or "http://mil\.news\.sina\.com\.cn/\d{4}-\d{2}-\d{2}/\d+\.html", the child page lies in a subdirectory of the parent page. The topic of the child page expands and extends the topic of the parent page, and the child URL is assigned weight t.
(2) The child URL and the parent URL have similar paths: the child page and the parent page have the same directory depth and file name length. The child page's topic precedes or follows that of the parent, and weight t is assigned.
(3) Redundant links such as background illustrations and advertisements are assigned a weight given by an expression that is not reproduced in this text.
In addition, for URLs that are more relevant to the topic, the text near the link tends to contain the focus keywords. The link weighting coefficient of page i with respect to parent page j is therefore $link_{ji} = path_{ji} + freq_i$, where $path_{ji}$ is the URL path weight from the three cases above and $freq_i$ is the normalized anchor-text keyword frequency.
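The path comparison and the link weighting coefficient can be sketched as below. This is an illustration under assumptions: "contains" is read as a URL prefix match, "similar path" additionally requires the same host, and the function names are hypothetical. The weight for redundant links is not recoverable from this text, so it is a required argument rather than a built-in value.

```python
from urllib.parse import urlparse


def path_relation(parent_url, child_url):
    """Classify a child URL as 'contains', 'similar', or 'other' relative to its parent."""
    if child_url != parent_url and child_url.startswith(parent_url):
        return "contains"      # case (1): child lies under the parent's directory
    p, c = urlparse(parent_url), urlparse(child_url)
    p_parts, c_parts = p.path.split("/"), c.path.split("/")
    same_depth = len(p_parts) == len(c_parts)
    same_name_len = len(p_parts[-1]) == len(c_parts[-1])
    if p.netloc == c.netloc and same_depth and same_name_len:
        return "similar"       # case (2): same directory depth and file name length
    return "other"             # neither case; candidates for case (3) redundant links


def link_coefficient(parent_url, child_url, freq, t, redundant_weight):
    """link = path + freq, with path = t for cases (1) and (2).

    freq is the normalized anchor-text keyword frequency. redundant_weight is
    supplied by the caller because its value is missing from the source text.
    """
    if not 0.4 < t < 0.6:
        raise ValueError("t must satisfy 0.4 < t < 0.6")
    relation = path_relation(parent_url, child_url)
    path_weight = t if relation in ("contains", "similar") else redundant_weight
    return path_weight + freq
```

For example, `path_relation("http://example.com/news/", "http://example.com/news/2008-11-28/1.html")` falls under case (1), so with t = 0.5 and an anchor-text frequency of 0.25 the coefficient is 0.75.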
On the basis of link analysis, URLs are weighted and sorted, and links are selected according to their weights to join the unvisited URL queue. The World Wide Web has the characteristics of a scale-free network: web pages or websites are the nodes of the network, links are its edges, and the network has a power-law degree distribution (Sen Qin, Guan-Zhong Dai, Yan-Ling Li, Design and Implementation of Web Hot-topic Talk Mining Based on Scale-free Network, Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, 2006, pp. 13-16). The growth and preferential attachment principles of scale-free networks mean that pages containing more links are more likely to acquire new links. The probability that a new topic joining the network links to an existing topic is proportional to the number of links that the existing topic contains. Similarity misjudgments can admit irrelevant pages into the focused topic, and link analysis alone cannot effectively filter the links extracted from these pages. However, such wrongly focused pages usually contain few effective links. Weighting URLs by the preferential attachment probability of the scale-free network therefore lowers the weights of links extracted from misjudged pages and improves precision. Based on the PageRank algorithm, and taking into account relevance calculation, link analysis, and scale-free network characteristics, the corrected weight is given by a formula that is not reproduced in this text.
In that formula, n is the in-degree of page i, $sim(V_t, D)$ is the relevance of the parent page to the seed set, $link_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $k_t$ is the number of effective links referenced by the parent page, and one further term, whose symbol is also missing here, is the preferential attachment probability of the topic page. $sim(V_t, D)$ is an important part of the URL weight, and the similarity threshold determines the direction taken by the child URLs.
If the content of a web page is relevant to the focused topic, the child URLs parsed from it are more likely to be relevant to the topic; otherwise, the child URLs tend to describe new topics. A highly relevant page may nevertheless cite many recommended links, so that the extracted child URLs do not belong to the focused topic and yet carry high weights because they inherit the relevance of the parent page. Fetching these child URLs yields too much irrelevant content and degrades system performance. Therefore, when fetching the child URLs parsed from the same parent page, if topic-irrelevant information is obtained several times in a row, the weights of all child URLs are reduced.
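The feedback rule for child URLs of one parent page can be sketched as a small streak counter. The text gives neither the number of consecutive misses that triggers the reduction nor the size of the reduction, so `max_misses` and `penalty` are hypothetical parameters, as is the class name.

```python
class SiblingFeedback:
    """Track consecutive irrelevant fetches among the child URLs of one parent page."""

    def __init__(self, max_misses, penalty):
        if not 0.0 < penalty < 1.0:
            raise ValueError("penalty must be between 0 and 1")
        self.max_misses = max_misses   # hypothetical: misses in a row that trigger feedback
        self.penalty = penalty         # hypothetical: multiplicative weight reduction
        self._misses = 0

    def record(self, relevant, sibling_weights):
        """Update the miss streak; scale all sibling weights in place when it reaches the limit."""
        self._misses = 0 if relevant else self._misses + 1
        if self._misses >= self.max_misses:
            for url in sibling_weights:
                sibling_weights[url] *= self.penalty
            self._misses = 0
            return True
        return False
```

A relevant fetch resets the streak, so the reduction is applied only when irrelevant pages arrive consecutively, as the paragraph above describes.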
In a topic network, the path from one relevant page to another may pass through irrelevant regions, so that relevant topic pages are hidden behind irrelevant links; this is called tunneling. A page with low similarity to the seed set may, after several levels of links, lead to pages with high similarity. Ignoring the URLs extracted from pages whose similarity is below the threshold loses the relevant links hidden behind those pages and reduces the number of topics obtained. Setting a suitable buffer for irrelevant pages improves the recall of the topic information acquisition system. An atomic model is therefore used to allocate the number of buffered link hops. The seed set is the basis for topic focusing and acts as the nucleus, which is continuously updated. URLs extracted from the seed set act as the electrons outside the nucleus. A URL may have several in-links; its link depth is its minimum distance from the seed set. URLs at the same depth are regarded as being at the same energy level. The smaller a URL's depth, the closer it is to the nucleus, the more strongly it is bound by the nucleus, and the greater the probability of obtaining relevant topic pages through it, so a larger number of buffer levels is set. Conversely, the greater a URL's depth, the greater its probability of escaping the nucleus, and the harder it is to reach the focused topic through it. The buffer hop count (that is, the tunnel length) is given by a formula that is not reproduced in this text.
In that formula, step(i) is the buffer hop count of page i, floor denotes rounding down, σ is the initial depth parameter constant, and n(i) is the link depth from the seed set to page i. If the buffer hop count of a URL in an irrelevant page equals 0, the weights of all child URLs in that page are reduced.
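The link depth n(i), defined as the minimum distance from the seed set, can be computed with a multi-source breadth-first search. The sketch below assumes an in-memory link graph (`out_links`, a hypothetical structure) and assigns depth 0 to the seed pages themselves; it computes only n(i) and does not attempt the step(i) formula.

```python
from collections import deque


def link_depths(seed_urls, out_links):
    """Minimum number of link hops from the seed set to every reachable URL.

    out_links maps a URL to the URLs it links to. Breadth-first search reaches
    each URL first along a shortest path, so a URL with several in-links keeps
    the smallest depth.
    """
    depth = {url: 0 for url in seed_urls}
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        for child in out_links.get(url, ()):
            if child not in depth:
                depth[child] = depth[url] + 1
                queue.append(child)
    return depth
```

Because the seed set is continuously updated, a crawler using this approach would need to recompute or lower depths whenever a new page joins the seed set.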
Topic URLs that have been visited are not discarded; they are added to the visited URL queue. Link template matching is applied to these URLs. If a web page carries reply information, as followed news comments, bulletin boards, and blogs do, the URLs of the hottest topics are selected according to topic heat in order to obtain their new replies. Topic posts with more replies have attracted public interest, and the probability that they receive further replies is greater; at the same time, as time passes, outdated topic posts receive fewer replies and their attraction gradually tends to 0. Topic heat is defined by a formula that is not reproduced in this text.
In that formula, t is the mean reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; α and β are constants. 0 < α < 1, and α is the weighting exponent of the preferential attachment probability. β determines the smoothness of the topic heat function: the smaller β is, the more detail the function reflects, and the larger β is, the flatter the function. Because a flatter function better estimates the future reply trend of a topic, β is generally greater than 2.
Topic heat usually lies between 0 and 30, and a topic with heat greater than 15 can be regarded as a hot topic. When selecting from the visited queue, the title URLs of hot news comments, bulletin boards, or blogs are selected preferentially.
Compared with the prior art, the invention has the following advantages:
(1) Link analysis is performed on URLs to filter redundant and irrelevant links, saving system resources.
(2) URL weights are corrected with the preferential attachment probability of the scale-free network, which makes full use of the way Internet information clusters and reduces the impact of similarity misjudgments.
(3) Relevance feedback is applied to URLs so that the system leaves irrelevant regions as early as possible, improving accuracy.
(4) Topics that carry replies are evaluated by topic heat, so that system time is spent mainly on hot topics.
Description of drawings
Fig. 1 is the workflow diagram of the whole method;
Fig. 2 compares the precision of the topic information acquisition system and the BFS acquisition system;
Fig. 3 compares the relevant-topic acquisition rate of the topic information acquisition system and the BFS acquisition system;
Fig. 4 compares precision with and without link analysis;
Fig. 5 compares precision under different weighting schemes;
Fig. 6 compares precision before and after similarity feedback.
Embodiment
The performance of the topic information acquisition system is measured by precision. Precision measures how accurately relevant topic pages are obtained: if M is the total number of topics fetched and T is the number of relevant topics among the fetched pages, then precision = T/M. Focusing on the keyword "Beijing Olympic", topic pages were fetched from up to a hundred websites in order to compare the relevant-topic collection efficiency of the topic information acquisition system with that of a general acquisition system, and to examine how different system parameters affect topic acquisition performance (the weight parameter t is 0.5). Fig. 1 shows the workflow of the topic information acquisition method based on network topology. The results in Fig. 2 to Fig. 6 are averages over repeated simulation runs. Because the network is updated quickly and the accuracy of information acquisition depends on website structure, the seed set of the topic acquisition system is kept identical to the initial queue of the BFS acquisition system to ensure comparability.
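The precision measure can be written directly from its definition. The function name and the guard against an empty crawl are illustrative additions, not part of the source.

```python
def precision(relevant_count, fetched_count):
    """precision = T / M: relevant topics T over all fetched topics M."""
    if fetched_count <= 0:
        raise ValueError("fetched_count must be positive")
    return relevant_count / fetched_count
```

For instance, 3 relevant topics out of 4 fetched gives a precision of 0.75.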
Fig. 2 shows how the precision of the topic information acquisition system and of the BFS acquisition system changes with the total number of pages fetched. Starting from the seed web page set of the topic acquisition system and the initial URL queue of BFS, both obtained from search engines, more than 10000 pages were fetched continuously. Fig. 3 shows the corresponding relevant-page acquisition rate. The precision of the proposed topic acquisition system is clearly higher than that of the BFS acquisition system; as the number of collected pages increases, BFS collects relevant topics slowly and its precision falls below 40%, while the topic acquisition system stays above 70%.
Fig. 4 shows the effect of link analysis on the precision of the topic acquisition system. The initial seed web page set contains 10 pages, and 3500 pages were fetched. Without link analysis, the many invalid and redundant links collected from topic pages degrade system performance, and precision falls below 0.6 after 2500 pages have been fetched.
Fig. 5 shows how precision changes with the number of collected pages under different weighting schemes. The weighting scheme based on Internet topology gives higher precision than PageRank weights, and its decline is gentler. With equal-gain weights, precision is higher early in collection, but as the system runs, greedily pursuing the highest similarity easily traps it in a local optimum, and precision drops rapidly.
Fig. 6 compares precision before and after similarity feedback. Similarity feedback lets the system, after it falls into an irrelevant region, leave the mistaken crawl area more quickly and prevents performance from deteriorating further.
Table 1 below gives the precision of the topic acquisition system and the BFS acquisition system for different focused topics, with 10 initial links and 5000 pages fetched.
Table 1. Precision for different topics
Topic | Topic acquisition system | BFS acquisition system |
---|---|---|
China's large aircraft project | 0.38 | 0.071 |
Shenzhou VI spacecraft | 0.65 | 0.12 |
Wenchuan earthquake | 0.76 | 0.28 |
Notebook | 0.73 | 0.13 |
Claims (4)
1. A topic information acquisition method based on network topology, characterized by comprising the following steps:
a. obtain a seed web page set from a search engine;
b. segment every web page in the seed web page set into words according to the topic words, represent the pages as a vector set, extract URLs, and initialize the unvisited URL queue;
c. select a URL from the unvisited URL queue, collect the corresponding web page, and calculate the similarity between the collected page and the seed web page set;
d. compare the similarity between the collected page and the seed web page set with a preset threshold.
2. The topic information acquisition method based on network topology according to claim 1, characterized in that step d specifically comprises:
If the similarity is greater than the preset threshold:
1) parse URLs from the web page, deduplicate them, and insert them into the unvisited URL queue; compare the path relationship between the parent URL and each child URL and assign different weights to the child URLs;
2) calculate the link weight of each child URL; the link weighting coefficient of child page i with respect to parent page j is $link_{ji} = path_{ji} + freq_i$, where $path_{ji}$ is the weight assigned for the URL path relationship and $freq_i$ is the normalized anchor-text keyword frequency;
3) correct the weight of the child URL; the corrected weight is given by a formula that is not reproduced in this text. In that formula, n is the in-degree of page i, $sim(V_t, D)$ is the relevance of the parent page to the seed set, $link_{ti}$ is the link weighting coefficient of page i with respect to the parent page, $k_t$ is the number of effective links referenced by the parent page, and one further term, whose symbol is also missing here, is the preferential attachment probability of the topic page;
If the similarity is not greater than the preset threshold, the tunnel length is set according to the distance between the URL and the seed set. The tunnel length formula is not reproduced in this text; in it, floor denotes rounding down, σ is the initial depth parameter constant, and n(i) is the link depth from the seed set to page i. If the tunnel length of the URL is greater than 0, its child URLs are processed in the same way as when the similarity exceeds the threshold; otherwise, the weights of all child URLs are reduced.
3. The topic information acquisition method based on network topology according to claim 2, characterized in that assigning different link weights to child URLs specifically includes:
1) the child URL contains the parent URL, so the child page lies in a subdirectory of the parent page and its topic expands and extends the topic of the parent page; the weight assigned to the child URL is t;
2) the child URL and the parent URL have similar paths, that is, the child page and the parent page have the same directory depth and file name length, and the child page's topic precedes or follows that of the parent; the weight assigned to the child URL is t;
3) the child URL is a redundant link such as a background illustration or an advertisement; the weight assigned to the child URL is given by an expression that is not reproduced in this text;
where 0.4 < t < 0.6.
4. The topic information acquisition method based on network topology according to claim 1, characterized by comprising the following steps:
a. obtain a seed web page set from a search engine;
b. segment every web page in the seed web page set into words according to the topic words, represent the pages as a vector set, and extract URLs;
c. apply template matching to the URLs in the visited queue; when a web page contains reply information, preferentially select the topic URLs with high heat, according to topic heat, and obtain their new replies;
The topic heat is given by a formula that is not reproduced in this text.
In that formula, t is the mean reply time of the topic, that is, its active time point; n is the total number of replies to the topic from the initial time to the current time; α and β are constants; 0 < α < 1, and α is the weighting exponent of the preferential attachment probability; β determines the smoothness of the topic heat function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008102275821A CN101441662B (en) | 2008-11-28 | 2008-11-28 | Topic information acquisition method based on network topology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008102275821A CN101441662B (en) | 2008-11-28 | 2008-11-28 | Topic information acquisition method based on network topology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101441662A true CN101441662A (en) | 2009-05-27 |
CN101441662B CN101441662B (en) | 2010-12-22 |
Family
ID=40726096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008102275821A Expired - Fee Related CN101441662B (en) | 2008-11-28 | 2008-11-28 | Topic information acquisition method based on network topology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101441662B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866362A (en) * | 2010-07-01 | 2010-10-20 | 优视科技有限公司 | Method and system for automatically positioning main contents of webpages for mobile communication equipment terminal |
CN102129472A (en) * | 2011-04-14 | 2011-07-20 | 上海红神信息技术有限公司 | Construction method for high-efficiency hybrid storage structure of semantic-orient search engine |
CN101727494B (en) * | 2009-12-29 | 2012-03-28 | 华中师范大学 | Network hot word generating system in specific area |
CN102567313A (en) * | 2010-12-07 | 2012-07-11 | 盛乐信息技术(上海)有限公司 | Progressive webpage library deduplication system and realization method thereof |
CN102779120A (en) * | 2011-05-09 | 2012-11-14 | 北京百度网讯科技有限公司 | Method, system and device for determining field information of station and judging correlation |
CN102821088A (en) * | 2012-05-07 | 2012-12-12 | 北京京东世纪贸易有限公司 | System and method for acquiring network data |
CN103023714A (en) * | 2012-11-21 | 2013-04-03 | 上海交通大学 | Activeness and cluster structure analyzing system and method based on network topics |
CN102087648B (en) * | 2009-12-03 | 2013-06-19 | 北京大学 | Method and system for fetching news comment page |
CN103257957A (en) * | 2012-02-15 | 2013-08-21 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
CN103310026A (en) * | 2013-07-08 | 2013-09-18 | 焦点科技股份有限公司 | Lightweight common webpage topic crawler method based on search engine |
CN103544220A (en) * | 2013-09-29 | 2014-01-29 | 北京航空航天大学 | Method and device for recommending applications |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
CN105589892A (en) * | 2014-11-12 | 2016-05-18 | 中国银联股份有限公司 | Webpage theme analysis method based on anchor text backtracking chain |
CN105608072A (en) * | 2015-12-23 | 2016-05-25 | 厦门市美亚柏科信息股份有限公司 | Text related region analysis method and system |
CN105677772A (en) * | 2015-12-30 | 2016-06-15 | 赛尔网络有限公司 | ISP interconnection port URL activity level statistics method and device |
CN106126688A (en) * | 2016-06-29 | 2016-11-16 | 厦门趣处网络科技有限公司 | Based on WEB content and the intelligent network information acquisition system of structure excavation, method |
CN106168977A (en) * | 2016-07-15 | 2016-11-30 | 河南山谷网安科技股份有限公司 | A kind of column recognition methods for web portal security monitoring |
CN106257449A (en) * | 2015-06-19 | 2016-12-28 | 阿里巴巴集团控股有限公司 | A kind of information determines method and apparatus |
CN108121741A (en) * | 2016-11-30 | 2018-06-05 | 百度在线网络技术(北京)有限公司 | Website quality appraisal procedure and device |
CN111125564A (en) * | 2018-11-01 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Thermodynamic diagram generation method and device, computer equipment and storage medium |
CN111143649A (en) * | 2019-12-09 | 2020-05-12 | 杭州迪普科技股份有限公司 | Webpage searching method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100371932C (en) * | 2004-03-23 | 2008-02-27 | 南京大学 | Expandable and customizable theme centralized universile-web net reptile setup method |
CN100338610C (en) * | 2005-06-22 | 2007-09-19 | 浙江大学 | Individual searching engine method based on linkage analysis |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102087648B (en) * | 2009-12-03 | 2013-06-19 | 北京大学 | Method and system for fetching news comment page |
CN101727494B (en) * | 2009-12-29 | 2012-03-28 | 华中师范大学 | Network hot word generating system in specific area |
CN101866362A (en) * | 2010-07-01 | 2010-10-20 | 优视科技有限公司 | Method and system for automatically positioning main contents of webpages for mobile communication equipment terminal |
CN102567313A (en) * | 2010-12-07 | 2012-07-11 | 盛乐信息技术(上海)有限公司 | Progressive webpage library deduplication system and realization method thereof |
CN102129472A (en) * | 2011-04-14 | 2011-07-20 | 上海红神信息技术有限公司 | Construction method for high-efficiency hybrid storage structure of semantic-orient search engine |
CN102129472B (en) * | 2011-04-14 | 2012-12-19 | 上海红神信息技术有限公司 | Construction method for high-efficiency hybrid storage structure of semantic-orient search engine |
CN102779120B (en) * | 2011-05-09 | 2014-12-10 | 北京百度网讯科技有限公司 | Method, system and device for determining field information of station and judging correlation |
CN102779120A (en) * | 2011-05-09 | 2012-11-14 | 北京百度网讯科技有限公司 | Method, system and device for determining field information of station and judging correlation |
CN103257957B (en) * | 2012-02-15 | 2017-09-08 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
CN103257957A (en) * | 2012-02-15 | 2013-08-21 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
CN102821088A (en) * | 2012-05-07 | 2012-12-12 | 北京京东世纪贸易有限公司 | System and method for acquiring network data |
CN102821088B (en) * | 2012-05-07 | 2015-12-16 | 北京京东世纪贸易有限公司 | System and method for acquiring network data |
CN103023714B (en) * | 2012-11-21 | 2015-12-23 | 上海交通大学 | Activeness and cluster structure analyzing system and method based on network topics |
CN103023714A (en) * | 2012-11-21 | 2013-04-03 | 上海交通大学 | Activeness and cluster structure analyzing system and method based on network topics |
CN103310026A (en) * | 2013-07-08 | 2013-09-18 | 焦点科技股份有限公司 | Lightweight common webpage topic crawler method based on search engine |
CN103310026B (en) * | 2013-07-08 | 2016-11-23 | 焦点科技股份有限公司 | Lightweight common webpage topic crawler method based on search engine |
CN103544220A (en) * | 2013-09-29 | 2014-01-29 | 北京航空航天大学 | Method and device for recommending applications |
CN103544220B (en) * | 2013-09-29 | 2017-04-05 | 北京航空航天大学 | Method and device for recommending applications |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
CN105589892A (en) * | 2014-11-12 | 2016-05-18 | 中国银联股份有限公司 | Webpage theme analysis method based on anchor text backtracking chain |
CN105589892B (en) * | 2014-11-12 | 2019-01-18 | 中国银联股份有限公司 | Web page subject analysis method based on Anchor Text trace-back chain |
CN106257449B (en) * | 2015-06-19 | 2019-11-12 | 阿里巴巴集团控股有限公司 | Information determination method and apparatus |
CN106257449A (en) * | 2015-06-19 | 2016-12-28 | 阿里巴巴集团控股有限公司 | Information determination method and apparatus |
CN105608072B (en) * | 2015-12-23 | 2019-02-19 | 厦门市美亚柏科信息股份有限公司 | Text related region analysis method and system |
CN105608072A (en) * | 2015-12-23 | 2016-05-25 | 厦门市美亚柏科信息股份有限公司 | Text related region analysis method and system |
CN105677772A (en) * | 2015-12-30 | 2016-06-15 | 赛尔网络有限公司 | ISP interconnection port URL activity level statistics method and device |
CN105677772B (en) * | 2015-12-30 | 2019-07-09 | 赛尔网络有限公司 | ISP interconnection port URL activity level statistics method and device |
CN106126688A (en) * | 2016-06-29 | 2016-11-16 | 厦门趣处网络科技有限公司 | Intelligent network information acquisition system and method based on WEB content and structure mining |
CN106126688B (en) * | 2016-06-29 | 2020-03-24 | 厦门趣处网络科技有限公司 | Intelligent network information acquisition system and method based on WEB content and structure mining |
CN106168977A (en) * | 2016-07-15 | 2016-11-30 | 河南山谷网安科技股份有限公司 | Column recognition method for web portal security monitoring |
CN106168977B (en) * | 2016-07-15 | 2019-07-02 | 山谷网安科技股份有限公司 | Column recognition method for web portal security monitoring |
CN108121741A (en) * | 2016-11-30 | 2018-06-05 | 百度在线网络技术(北京)有限公司 | Website quality evaluation method and device |
CN108121741B (en) * | 2016-11-30 | 2021-12-28 | 百度在线网络技术(北京)有限公司 | Website quality evaluation method and device |
CN111125564A (en) * | 2018-11-01 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Thermodynamic diagram generation method and device, computer equipment and storage medium |
CN111125564B (en) * | 2018-11-01 | 2023-09-15 | 百度在线网络技术(北京)有限公司 | Thermodynamic diagram generation method and device, computer equipment and storage medium |
CN111143649A (en) * | 2019-12-09 | 2020-05-12 | 杭州迪普科技股份有限公司 | Webpage searching method and device |
Also Published As
Publication number | Publication date |
---|---|
CN101441662B (en) | 2010-12-22 |
Similar Documents
Publication | Title |
---|---|
CN101441662B (en) | Topic information acquisition method based on network topology |
CN102054004B (en) | Webpage recommendation method and device adopting same |
Aggarwal et al. | Intelligent crawling on the World Wide Web with arbitrary predicates |
CN102930059B (en) | Method for designing focused crawler |
JP4996300B2 (en) | File system search ranking method and related search engine |
CN102298622B (en) | Search method for focused web crawler based on anchor text and system thereof |
Pal et al. | Effective focused crawling based on content and link structure analysis |
CN102355488A (en) | Crawler seed obtaining method and equipment and crawler crawling method and equipment |
CN101477554A (en) | User interest based personalized meta search engine and search result processing method |
CN103714149B (en) | Self-adaptive incremental deep web data source discovery method |
CN101770521A (en) | Focusing relevancy ordering method for vertical search engine |
CN103714140A (en) | Searching method and device based on topic-focused web crawler |
Bhatia | Link analysis algorithms for web mining |
CN103853831A (en) | Personalized searching realization method based on user interest |
CN102722499A (en) | Search engine and implementation method thereof |
CN102750380B (en) | Page sorting method in combination with difference feature distribution and link feature |
CN103279492A (en) | Method and device for catching webpage |
CN104636403A (en) | Query request processing method and device |
CN102799686A (en) | Water resource information vertical search method based on cloud platform |
Neunerdt et al. | Focused crawling for building web comment corpora |
Wang et al. | Ts-ids algorithm for query selection in the deep web crawling |
CN102521313A (en) | Static index pruning method based on web page quality |
Joshi et al. | Improving Pagerank Calculation by using Content Weight |
Hati et al. | Improved focused crawling approach for retrieving relevant pages based on block partitioning |
CN101382956A (en) | Information acquisition method and system for orienting subject |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
C17 | Cessation of patent right | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2010-12-22; Termination date: 2012-11-28 |