CN103841173A - Vertical web spider - Google Patents
- Publication number
- CN103841173A
- Authority
- CN
- China
- Prior art keywords
- page
- theme
- spider
- web
- url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a vertical web spider, a concept defined in contrast to the web spider of a general-purpose search engine. A vertical search engine differs from a general-purpose search engine in that it serves a specific group of users and attends only to information in a specific field; a vertical web spider therefore does not need to traverse the whole Web when searching and only needs to visit pages relevant to that field. The web page acquisition technique of a vertical web spider differs considerably from that of a general-purpose web spider, and its algorithm and workflow are more complex. When searching the Web, the vertical web spider judges the topic relevance of each web page according to a web page analysis algorithm, performs topic prediction and identification on the URLs it finds, and keeps the useful links by placing them in a queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats this process until a stop condition of the system is met.
Description
Technical field
The present invention relates to web spider technology, and in particular to a vertical web spider for a vertical search engine.
Background technology
The web spider is a basic component of a search engine. It is the starting point of the search engine workflow, and its performance directly affects the overall performance of the search engine. When collecting Web information, the web spider of a general-purpose search engine normally starts from a set of seed pages, requests and downloads Web pages over the HTTP protocol, parses each page and extracts its links, and then visits the newly discovered links, traversing the Web by continually spreading outward in this way. Viewed on the topology graph of the whole Internet, the web spider starts from a few discrete nodes and, by following the edges formed by links between pages, gradually reaches every node of the graph; this is the typical working mode of a general-purpose web spider. Depending on the graph traversal mode, a general-purpose web spider may use depth-first or breadth-first traversal, among others. Its main shortcomings are low coverage of Web pages by the crawl and poor freshness of the crawled pages.
A vertical web spider, also called a specialized web spider or topical web crawler, is a concept defined in contrast to the web spider of a general-purpose search engine. Unlike a general-purpose search engine, a vertical search engine serves a specific group of users and is concerned with information in one professional field. A vertical web spider therefore has no need to traverse the whole Web during a search; it only needs to select and visit pages related to that field. Compared with a general-purpose web spider, a vertical web spider differs considerably in its web page collection technique, and its algorithm and workflow are more complex. When searching the Web, it judges the topic relevance of each web page according to a web page analysis algorithm, performs topic prediction and identification on the URLs it finds, and keeps the useful links by placing them in the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy and repeats this process until a stop condition of the system is reached. In addition, all crawled web pages are stored by the system, analyzed and filtered, and indexed for later query and retrieval. For a vertical web spider, the analysis results obtained in this process may also provide feedback that guides the subsequent crawl.
Summary of the invention
To solve the problems of the prior art, the present invention provides a vertical web spider whose operation comprises the following steps:
A. Topic target description
A1. Specify the initial seed URLs
According to the characteristics of the target web pages in the field, specify in advance the initial seed URLs, i.e. the start pages from which the web spider crawls;
A2. Establish the topic feature keywords
First extract feature keywords automatically from a collection of web pages, then screen and adjust them manually;
Once the topic features are established, the vertical spider can also learn dynamically while it continues crawling and so expand the keyword set;
B. Web page search:
B1. Search strategy
Adopt the best-first search strategy: dynamically build a queue of URLs to be crawled, sort the URLs in the queue according to an evaluation strategy, and each time select the best URL to crawl first;
B2. URL evaluation strategy
Adopt an evaluation method based on web page content; use the topic discrimination method to calculate the topic relevance of each web page, and discard web pages whose topic relevance value is below a threshold;
C. Topic relevance judgment
Adopt a vector space model based on web page content and structure; the procedure is as follows;
C1. Preprocessing
Before the web spider starts collecting, extract and weight keywords from the seed set pages that describe the topic, thereby obtaining the feature vector of the topic and the weights of its components;
C2. Text processing
Segment the body text of each page collected by the spider into words, remove stop words, and retain keywords; then calculate the weighted frequency of each keyword according to the different positions at which it appears in the article, using the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$;
C3. Keyword expansion
Adjust and expand the page keywords obtained, according to the feature vector set for the topic;
C4. Calculate the similarity between the page and the topic
Calculate the similarity between the page and the topic according to the similarity formula [formula not reproduced in this text];
C5. Judge whether the page is relevant to the topic
Compare the similarity value with a predefined threshold d: if the similarity value is greater than or equal to d, the page is relevant to the topic and is downloaded and saved locally; otherwise the page is judged irrelevant and is discarded.
Compared with the prior art, the present invention has the following beneficial effects:
1. It can find as many topic-relevant web pages as possible without traversing the whole Web, which reduces network bandwidth consumption and also saves local storage space and computing time;
2. Because far fewer web pages need to be crawled, it becomes feasible to keep the crawled pages up to date;
3. It can index more topic-relevant web pages at a lower hardware cost.
Brief description of the drawings
The invention has two accompanying drawings, in which:
Fig. 1 is the system architecture diagram of the vertical web spider;
Fig. 2 is the workflow diagram of the vertical web spider.
Embodiment
The system architecture and the detailed workflow of the vertical web spider are shown in Fig. 1 and Fig. 2 respectively. Compared with the web spider of a general-purpose search engine, a vertical web spider must additionally solve three main problems: topic target description, the web page search strategy, and the topic relevance judgment algorithm. Each part is implemented as follows.
A. Topic target description
A1. Specify the initial seed URLs
According to the characteristics of the target web pages in the field, the initial seed URLs, i.e. the start pages from which the web spider crawls, are specified in advance. The choice of seed pages directly affects the quality of the vertical spider's search. The principle for choosing initial seed URLs is that a seed page should itself be highly relevant to the topic and should link widely to topic resources on other authoritative websites; it may be either the home page of a website or one of its subpages. Starting the crawl from such addresses not only yields rich resources but also broadens the topic search and covers as many topic resources as possible, ultimately maximizing what the crawl can reach.
A2. Establish the topic feature keywords
Feature keywords are first extracted automatically from a collection of web pages and then screened and adjusted manually to obtain the best result. The domain web pages collected initially should cover a wide range and be sufficient in number; the keyword feature vector is then more broadly distributed, the statistical weights are more accurate, and the hit rate of later topic resource collection is higher. Once the topic features are established, the vertical spider can also learn dynamically while it continues crawling and so expand the keyword set, aiming to cover the domain information as fully as possible and to judge topic relevance accurately.
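The automatic extraction and manual screening of step A2 can be sketched as follows. The text does not say how keywords are extracted or weighted, so this minimal Python sketch assumes plain term counts normalised into weights; the function names, the stop-word list and the regular-expression tokenizer are all hypothetical, and Chinese pages would need a proper word segmenter in place of the tokenizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a domain-appropriate one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def extract_topic_keywords(seed_texts, top_n=5):
    """Automatic step: keep the top_n most frequent terms, weights summing to 1."""
    counts = Counter()
    for text in seed_texts:
        counts.update(tokenize(text))
    top = counts.most_common(top_n)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}


def screen_keywords(topic, rejected):
    """Manual step: drop terms a human reviewer rejected, then renormalise."""
    kept = {term: w for term, w in topic.items() if term not in rejected}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {term: w / total for term, w in kept.items()}
```

The resulting dictionary plays the role of the topic feature vector; dynamic expansion during the crawl would add further terms to it under whatever learning rule an implementation chooses.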
B. Web page search
B1. Search strategy
The best-first search strategy is adopted. The basic idea of this algorithm is to dynamically build a queue of URLs to be crawled, sort the URLs in the queue according to an evaluation strategy, and each time select the best URL to crawl first.
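One way to hold the queue of URLs to be crawled is a priority queue. The sketch below uses Python's `heapq` module; the class name `Frontier` and its methods are hypothetical, and the score passed to `push` is whatever the evaluation strategy produces.

```python
import heapq
import itertools


class Frontier:
    """Queue of URLs to be crawled; pop() always returns the best-scored URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: earlier pushes come first

    def push(self, url, score):
        """Add a URL with its priority score; return False if already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def pop(self):
        """Remove and return (url, score) for the highest-scored URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

Keeping the queue as a heap avoids re-sorting the whole queue after every insertion, which is one possible reading of "dynamically build and sort".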
B2. URL evaluation strategy
An evaluation method based on web page content is adopted. The content of a web page expresses its topic accurately, and two web pages connected by a hyperlink are very likely to belong to the same topic; the relevance of the URLs contained in a web page can therefore be predicted from the relevance between the text of that page and the topic. The more relevant a web page is to the topic, the higher the priority of the URLs it contains, and this determines the order of URLs in the queue to be crawled. This prediction will sometimes be wrong, but such errors do not affect the quality of the pages the web spider collects: before the page corresponding to a URL is saved locally, the topic discrimination method is used to calculate its topic relevance, and pages whose topic relevance value is below a threshold are discarded. In that case only the performance of the web spider is affected to some extent.
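The evaluation strategy above can be sketched as a crawl loop in which each outgoing link inherits the relevance of the page it was found on, and each fetched page is checked against the threshold before it is kept. The function name and the `fetch` and `relevance` callables are hypothetical stand-ins supplied by the caller. The sketch also assumes that links on a discarded page are not followed, which the text does not state.

```python
import heapq


def best_first_crawl(seeds, fetch, relevance, threshold, max_pages=100):
    """Crawl best-first and return {url: relevance} for the pages kept.

    fetch(url) -> (text, links) and relevance(text) -> float are supplied
    by the caller; both are placeholders for real components.
    """
    # Seeds get the top priority; heapq is a min-heap, so scores are negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    saved = {}
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        fetched += 1
        score = relevance(text)
        if score < threshold:
            # A wrongly predicted link costs one fetch but is never saved.
            continue
        saved[url] = score
        for link in links:
            if link not in seen:
                seen.add(link)
                # Predicted priority of a link is the relevance of its parent page.
                heapq.heappush(frontier, (-score, link))
    return saved
```

The `max_pages` limit stands in for the unspecified stop condition of the system.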
C. Topic relevance judgment
A vector space model based on web page content and structure is adopted. The procedure is as follows.
C1. Preprocessing
Before the web spider starts collecting, keywords are extracted from the seed set pages that describe the topic and weighted, yielding the feature vector of the topic and the weights of its components.
C2. Text processing
The body text of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained. The weighted frequency of each keyword is then calculated according to the different positions at which it appears in the article, using the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$.
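The position-weighted frequency can be illustrated with a short Python sketch. The text does not say which page positions the subscripts m, t, k, d and a denote, nor what values the coefficients a to e take, so the sketch keeps the subscripts as opaque zone labels and uses purely illustrative coefficient values.

```python
from collections import Counter

# Illustrative coefficients a..e, keyed by the subscript of the term they
# multiply (TF_m, TF_t, TF_k, TF_d, TF_a). The values are assumptions.
COEFFS = {"m": 1.0, "t": 3.0, "k": 2.0, "d": 2.0, "a": 1.5}


def weighted_tf(zone_tokens, coeffs=COEFFS):
    """Return {term: TF_i}, the coefficient-weighted sum of per-zone counts.

    zone_tokens maps a zone label ("m", "t", "k", "d", "a") to the list of
    keyword tokens found in that zone of the page.
    """
    tf = Counter()
    for zone, tokens in zone_tokens.items():
        weight = coeffs[zone]
        for term, count in Counter(tokens).items():
            tf[term] += weight * count
    return dict(tf)
```

With these values, a term occurring twice in zone m and once in zone t gets 1.0 × 2 + 3.0 × 1 = 5.0.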
C3. Keyword expansion
The page keywords obtained are adjusted and expanded according to the feature vector set for the topic.
C4. Calculate the similarity between the page and the topic
The similarity between the page and the topic is calculated according to the similarity formula [formula not reproduced in this text].
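The similarity formula itself does not survive in this text. Cosine similarity is a common choice for comparing two term vectors in a vector space model, so the Python sketch below uses it purely as an assumption; it should not be read as the formula of the invention. The function name and the dictionary representation of the vectors are likewise hypothetical.

```python
import math


def cosine_similarity(page_vec, topic_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    shared = set(page_vec) & set(topic_vec)
    dot = sum(page_vec[term] * topic_vec[term] for term in shared)
    norm_page = math.sqrt(sum(w * w for w in page_vec.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_page * norm_topic)
```

For non-negative weights the value lies between 0 and 1, which makes it convenient to compare against a fixed threshold.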
C5. Judge whether the page is relevant to the topic
The similarity value is compared with a predefined threshold d: if the similarity value is greater than or equal to d, the page is relevant to the topic and is downloaded and saved locally; otherwise the page is judged irrelevant and is discarded.
The threshold is obtained mainly by combining manual setting with training statistics. In the initial stage the threshold is manually set relatively high, to prevent a large number of irrelevant web pages from being admitted early on, which would cause many irrelevant pages to be collected during the subsequent crawl and incur unnecessary overhead. Some relevant web pages can be sampled, their relevance values calculated, and the mean relevance value used as the initial threshold. Then, at regular intervals, a number of crawled original HTML documents are sampled at random, their relevance is judged manually, and the accuracy of relevance-based collection is calculated. The accuracy is measured repeatedly: if the hit rate is very stable and remains high, the threshold is lowered by a certain step so that the crawl covers the topic as fully as possible; if the hit rate is low and unstable, the threshold is raised by a certain step to improve the hit rate of the crawl. This process is repeated until the statistics yield a threshold that achieves the maximum hit rate.
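The threshold procedure above can be sketched in Python as follows. The step size, the "high" and "low" hit-rate bounds and the stability measure are not given in the text; the values and names below are illustrative assumptions. The sketch also raises the threshold on a low hit rate alone, a simplification of the "low and unstable" condition.

```python
from statistics import mean, pstdev


def initial_threshold(sample_relevances):
    """Initial threshold: mean relevance of a sample of known relevant pages."""
    return mean(sample_relevances)


def hit_rate(manual_labels):
    """Fraction of randomly sampled crawled pages a human judged relevant."""
    return sum(manual_labels) / len(manual_labels)


def adjust_threshold(threshold, recent_hit_rates, step=0.05,
                     high=0.9, low=0.6, max_spread=0.05):
    """Lower the threshold when the hit rate is high and stable, raise it when low."""
    average = mean(recent_hit_rates)
    spread = pstdev(recent_hit_rates)  # small spread is taken to mean "stable"
    if average >= high and spread <= max_spread:
        return max(0.0, threshold - step)  # widen topic coverage
    if average < low:
        return min(1.0, threshold + step)  # tighten to improve the hit rate
    return threshold
```

Calling `adjust_threshold` after each sampling round and stopping once the value no longer changes is one way to realise the repeated adjustment.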
Claims (1)
1. A vertical web spider, characterized in that it comprises the following steps:
A. Topic target description
A1. Specify the initial seed URLs
According to the characteristics of the target web pages in the field, specify in advance the initial seed URLs, i.e. the start pages from which the web spider crawls;
A2. Establish the topic feature keywords
First extract feature keywords automatically from a collection of web pages, then screen and adjust them manually;
Once the topic features are established, the vertical spider can also learn dynamically while it continues crawling and so expand the keyword set;
B. Web page search:
B1. Search strategy
Adopt the best-first search strategy: dynamically build a queue of URLs to be crawled, sort the URLs in the queue according to an evaluation strategy, and each time select the best URL to crawl first;
B2. URL evaluation strategy
Adopt an evaluation method based on web page content; use the topic discrimination method to calculate the topic relevance of each web page, and discard web pages whose topic relevance value is below a threshold;
C. Topic relevance judgment
Adopt a vector space model based on web page content and structure; the procedure is as follows;
C1. Preprocessing
Before the web spider starts collecting, extract and weight keywords from the seed set pages that describe the topic, thereby obtaining the feature vector of the topic and the weights of its components;
C2. Text processing
Segment the body text of each page collected by the spider into words, remove stop words, and retain keywords; then calculate the weighted frequency of each keyword according to the different positions at which it appears in the article, using the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$;
C3. Keyword expansion
Adjust and expand the page keywords obtained, according to the feature vector set for the topic;
C4. Calculate the similarity between the page and the topic
Calculate the similarity between the page and the topic according to the similarity formula [formula not reproduced in this text];
C5. Judge whether the page is relevant to the topic
Compare the similarity value with a predefined threshold d: if the similarity value is greater than or equal to d, the page is relevant to the topic and is downloaded and saved locally; otherwise the page is judged irrelevant and is discarded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210495397.7A CN103841173A (en) | 2012-11-27 | 2012-11-27 | Vertical web spider |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210495397.7A CN103841173A (en) | 2012-11-27 | 2012-11-27 | Vertical web spider |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103841173A (en) | 2014-06-04 |
Family ID=50804298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210495397.7A Pending CN103841173A (en) | 2012-11-27 | 2012-11-27 | Vertical web spider |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103841173A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060277175A1 (en) * | 2000-08-18 | 2006-12-07 | Dongming Jiang | Method and Apparatus for Focused Crawling |
CN101630327A (en) * | 2009-08-14 | 2010-01-20 | 昆明理工大学 | Design method of theme network crawler system |
CN102298622A (en) * | 2011-08-11 | 2011-12-28 | 中国科学院自动化研究所 | Search method for focused web crawler based on anchor text and system thereof |
CN102662954A (en) * | 2012-03-02 | 2012-09-12 | 杭州电子科技大学 | Method for implementing topical crawler system based on learning URL string information |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016000511A1 (en) * | 2014-06-30 | 2016-01-07 | 北京奇虎科技有限公司 | Method and apparatus for mining rare resource of internet |
CN105677772A (en) * | 2015-12-30 | 2016-06-15 | 赛尔网络有限公司 | ISP interconnection port URL activity level statistics method and device |
CN105677772B (en) * | 2015-12-30 | 2019-07-09 | 赛尔网络有限公司 | The statistical method and device of interconnection port URL liveness between a kind of ISP |
CN106612279B (en) * | 2016-12-22 | 2020-04-17 | 北京知道创宇信息技术股份有限公司 | Network address processing method, equipment and system |
CN106612279A (en) * | 2016-12-22 | 2017-05-03 | 北京知道创宇信息技术有限公司 | Network address processing method, device and system |
CN108694197A (en) * | 2017-04-10 | 2018-10-23 | 富士通株式会社 | Hypertext grasping means and device |
CN110147473A (en) * | 2017-08-28 | 2019-08-20 | 北京国双科技有限公司 | A kind of crawling method and device of crawler |
CN110147473B (en) * | 2017-08-28 | 2022-03-01 | 北京国双科技有限公司 | Crawling method and device for crawler |
CN108256110A (en) * | 2018-02-08 | 2018-07-06 | 平安科技(深圳)有限公司 | Gathering method, device, computer equipment and the storage medium of information |
CN110609952A (en) * | 2019-08-15 | 2019-12-24 | 中国平安财产保险股份有限公司 | Data acquisition method and system and computer equipment |
CN110609952B (en) * | 2019-08-15 | 2024-04-26 | 中国平安财产保险股份有限公司 | Data acquisition method, system and computer equipment |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
CN113449168A (en) * | 2021-07-14 | 2021-09-28 | 北京锐安科技有限公司 | Method, device and equipment for capturing theme webpage data and storage medium |
CN113449168B (en) * | 2021-07-14 | 2024-02-20 | 北京锐安科技有限公司 | Theme webpage data grabbing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140604 |