CN103841173A - Vertical web spider - Google Patents

Info

Publication number
CN103841173A
Authority
CN
China
Prior art keywords
page
theme
spider
web
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210495397.7A
Other languages
Chinese (zh)
Inventor
郑世超
苏晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd filed Critical DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201210495397.7A priority Critical patent/CN103841173A/en
Publication of CN103841173A publication Critical patent/CN103841173A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a vertical web spider, a concept defined in contrast to the web spider of a general-purpose search engine. A vertical search engine differs from a general-purpose one in that it serves a specific group of users and covers only the information of a specific field. A vertical web spider therefore does not need to traverse the whole Web when crawling; it only needs to select and visit pages relevant to that field. Compared with a general-purpose web spider, a vertical web spider differs considerably in its page acquisition technique, and its algorithms and workflow are more complex. When crawling the Web, the vertical web spider judges the topic relevance of each page with a page analysis algorithm, predicts and identifies the topic of each discovered URL, and keeps the useful links by placing them in a queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats the process until a stopping condition of the system is met.

Description

A vertical web spider
Technical field
The present invention relates to web spider technology, and in particular to a vertical web spider for vertical search engines.
Background
The web spider is a basic component of a search engine. It is the starting point of the search engine workflow, and its performance directly affects the overall performance of the search engine. When collecting Web information, the web spider of a general-purpose search engine usually starts from a seed set, requests and downloads Web pages over HTTP, parses each page and extracts its links, and then visits the newly discovered links, traversing the Web through this continuous expansion. Viewed on the topology graph of the whole Internet, the web spider starts from a few discrete nodes and, following the edges formed by links between pages, gradually reaches every node of the graph. This is the typical way a general-purpose web spider works. Depending on the graph traversal method, a general-purpose web spider may crawl depth-first or breadth-first. Its main shortcomings are low coverage of Web pages and poor freshness of the crawled pages.
A vertical web spider, also called a specialized web spider or topical web crawler, is a concept defined in contrast to the web spider of a general-purpose search engine. Unlike a general-purpose search engine, a vertical search engine serves a specific group of users and covers the information of one professional field, so a vertical web spider does not need to traverse the whole Web; it only needs to select and visit pages relevant to that field. Compared with a general-purpose web spider, a vertical web spider differs considerably in its page acquisition technique, and its algorithms and workflow are more complex. When crawling the Web, the vertical web spider judges the topic relevance of each page with a page analysis algorithm, predicts and identifies the topic of each discovered URL, and keeps the useful links by placing them in a queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats this process until a stopping condition of the system is met. In addition, all crawled pages are stored by the system, analyzed, filtered, and indexed for later query and retrieval. For a vertical web spider, the results of this analysis may also provide feedback that guides later crawling.
Summary of the invention
To solve the problems of the prior art, the present invention provides a vertical web spider whose operation comprises the following steps:
A. Topic target description
A1. Specify the initial seed URLs
According to the features of the target pages in the field, initial seed URLs are given in advance as the pages from which the web spider starts crawling;
A2. Establish the topic feature keywords
Feature keywords are first extracted automatically from a collection of web pages and then screened and adjusted manually;
After the topic features have been established, the vertical spider can also learn dynamically and expand the keyword set as it crawls more pages;
B. Web page search
B1. Search strategy
A best-first search strategy is adopted: a queue of URLs to be crawled is built dynamically, the URLs in the queue are sorted according to an evaluation strategy, and the best URL is selected for crawling each time;
B2. URL evaluation strategy
An evaluation method based on page content is adopted; a topic discrimination method computes the topic relevance of each page, and pages whose topic relevance falls below a threshold are discarded;
C. Topic relevance judgment
A vector space model based on page content and structure is adopted; the procedure is as follows;
C1. Preprocessing
Before the web spider starts collecting, keywords are extracted from the seed pages describing the topic and weighted, yielding the feature vector of the topic and the weights of its components;
C2. Text processing
The body of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained; a weighted frequency is then computed from the different positions at which each keyword occurs in the document, according to the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$;
C3. Keyword expansion
The page keywords obtained are adjusted and expanded according to the feature vector set for the topic;
C4. Compute the similarity between the page and the topic
The similarity between the page and the topic is computed according to the following formula;
$$\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\left(\sum_{i=1}^{n} D_i^2\right) \times \left(\sum_{i=1}^{n} T_i^2\right)}}$$
C5. Judge whether the page is relevant to the topic
The similarity value is compared with a predefined threshold d. If the similarity is greater than or equal to d, the page is considered relevant to the topic and is downloaded and saved locally; otherwise it is judged irrelevant and discarded.
Compared with the prior art, the present invention has the following beneficial effects:
1. As many topic-relevant pages as possible can be found without traversing the whole Web, which reduces network bandwidth consumption and also saves local storage space and computing time;
2. Because far fewer pages need to be crawled, timely updating of the pages becomes feasible;
3. More topic-relevant pages can be indexed at a lower hardware cost.
Brief description of the drawings
The invention has two drawings, wherein:
Fig. 1 is the system architecture diagram of the vertical web spider;
Fig. 2 is the workflow diagram of the vertical web spider.
Embodiment
The system architecture and the detailed workflow of the vertical web spider are shown in Fig. 1 and Fig. 2 respectively. Compared with the web spider of a general-purpose search engine, a vertical web spider must additionally solve three main problems: topic target description, the web page search strategy, and the topic relevance judgment algorithm. Each part is implemented as follows.
A. Topic target description
A1. Specify the initial seed URLs
According to the features of the target pages in the field, initial seed URLs are given in advance as the pages from which the web spider starts crawling. The choice of seed pages directly affects the quality of the vertical spider's search. The principle for choosing initial seed URLs is that a seed page should itself have high topic relevance and should link widely to topic resources on other authoritative sites; it may be the home page of a site or one of its inner pages. Starting from these addresses, the web spider not only obtains rich resources but also broadens the topic search and covers as many topic resources as possible, ultimately maximizing the crawl target.
A2. Establish the topic feature keywords
Feature keywords are first extracted automatically from a collection of web pages and then screened and adjusted manually to achieve the best result. The field pages collected initially should be broad in scope and sufficient in number; the keyword feature vector then covers a wider distribution, the weights derived from the statistics are more accurate, and the hit rate of later topic resource collection is higher. After the topic features have been established, the vertical spider can also learn dynamically and expand the keyword set as it crawls more pages, so as to cover the field's information as fully as possible and judge topic relevance accurately.
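Step A2 can be illustrated with a minimal sketch. The patent does not specify how keywords are extracted or weighted, so everything here is an assumption: plain term counting over the seed pages, a tiny illustrative stop-word list, weights normalised by the largest count, and a `rejected` set standing in for the manual screening step.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_topic_keywords(seed_texts, top_k=5, rejected=()):
    """Return {keyword: weight} for the top_k most frequent terms.

    Weights are term counts normalised by the largest count, so the most
    frequent keyword gets weight 1.0. `rejected` stands in for the manual
    screening step (terms a human reviewer removed).
    """
    counts = Counter()
    for text in seed_texts:
        counts.update(
            t for t in tokenize(text)
            if t not in STOP_WORDS and t not in rejected
        )
    if not counts:
        return {}
    top = counts.most_common(top_k)
    max_count = top[0][1]
    return {term: count / max_count for term, count in top}
```

Calling `extract_topic_keywords` again on a larger set of crawled pages is one simple way to model the dynamic expansion of the keyword set described above.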
B. Web page search
B1. Search strategy
A best-first search strategy is adopted. The basic idea of this algorithm is to build a queue of URLs to be crawled dynamically, sort the URLs in the queue according to an evaluation strategy, and select the best URL for crawling each time.
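The queue described in B1 behaves like a priority queue keyed on the evaluation score. The following is a minimal sketch using Python's standard `heapq` module; the class name `BestFirstFrontier` and the duplicate-URL check are illustrative choices, not details taken from the patent.

```python
import heapq
import itertools


class BestFirstFrontier:
    """Queue of URLs to crawl, ordered by descending score."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: insertion order

    def push(self, url, score):
        """Add a URL unless it has been queued before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is stored negated.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

Keeping the heap ordered on insertion avoids re-sorting the whole queue each time a new URL is discovered, while still returning the best URL first.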
B2. URL evaluation strategy
An evaluation method based on page content is adopted. Page content expresses the topic of a page accurately, and two pages connected by a hyperlink are very likely to share the same topic. The relevance of the URLs contained in a page can therefore be predicted from the relevance between the page's text and the topic. The higher the topic relevance of a page, the higher the priority of the URLs it contains, and this determines the order of URLs in the crawl queue. The prediction can of course be wrong in some cases, but such errors do not affect the quality of the collected pages: before the page behind a URL is saved locally, the topic discrimination method computes its topic relevance, and pages whose relevance falls below a threshold are discarded. In this situation only the performance of the web spider is affected to some extent.
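The interaction between link priority and the relevance check can be sketched as a small crawl loop. This is an illustration under stated assumptions, not the patented implementation: `fetch` and `relevance` are hypothetical caller-supplied callables, seed URLs are given priority 1.0, and links found on a discarded page are not followed (the patent only says that such pages are dropped).

```python
import heapq


def focused_crawl(seeds, fetch, relevance, threshold, max_pages=100):
    """Best-first crawl; returns kept URLs in the order they were accepted.

    fetch(url) -> (text, links) and relevance(text) -> float are
    caller-supplied, hypothetical callables.
    """
    # Seeds start with the highest priority (1.0); negate for the min-heap.
    heap = [(-1.0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(heap)
    seen = set(seeds)
    order = len(seeds)  # insertion counter used to break score ties
    kept = []
    while heap and len(kept) < max_pages:
        _, _, url = heapq.heappop(heap)
        text, links = fetch(url)
        score = relevance(text)
        if score < threshold:
            # Off-topic page: discard it. Not following its links is an
            # assumption of this sketch.
            continue
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page containing it.
                heapq.heappush(heap, (-score, order, link))
                order += 1
    return kept
```

A wrongly predicted link costs one wasted fetch but never pollutes the result, because every fetched page is scored before it is kept.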
C. Topic relevance judgment
A vector space model based on page content and structure is adopted. The procedure is as follows.
C1. Preprocessing
Before the web spider starts collecting, keywords are extracted from the seed pages describing the topic and weighted, yielding the feature vector of the topic and the weights of its components.
C2. Text processing
The body of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained. A weighted frequency is then computed from the different positions at which each keyword occurs in the document, according to the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$.
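The weighted frequency of step C2 can be sketched as follows. The patent does not define the subscripts m, t, k, d and a, nor the values of the coefficients a to e. As an assumption made only for this example, the subscripts are read as page regions (main body, title, meta keywords, meta description, anchor text), and the coefficient values are purely illustrative.

```python
# ASSUMPTION: the source does not define the subscripts m, t, k, d, a.
# Here they are read as page regions: main body, title, meta keywords,
# meta description and anchor text. The coefficient values are illustrative.
WEIGHTS = {"m": 1.0, "t": 3.0, "k": 2.0, "d": 2.0, "a": 1.5}


def weighted_tf(keyword, regions, weights=WEIGHTS):
    """Compute TF_i = a*TF_m + b*TF_t + c*TF_k + d*TF_d + e*TF_a.

    `regions` maps a region code to the list of tokens found in that
    region; regions missing from the mapping count as empty.
    """
    total = 0.0
    for code, weight in weights.items():
        tokens = regions.get(code, [])
        total += weight * tokens.count(keyword)
    return total
```

For example, with these illustrative weights a keyword that appears twice in the body and once in the title scores 1.0 × 2 + 3.0 × 1 = 5.0.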
C3. Keyword expansion
The page keywords obtained are adjusted and expanded according to the feature vector set for the topic.
C4. Compute the similarity between the page and the topic
The similarity between the page and the topic is computed according to the following formula.
$$\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\left(\sum_{i=1}^{n} D_i^2\right) \times \left(\sum_{i=1}^{n} T_i^2\right)}}$$
C5. Judge whether the page is relevant to the topic
The similarity value is compared with a predefined threshold d. If the similarity is greater than or equal to d, the page is considered relevant to the topic and is downloaded and saved locally; otherwise it is judged irrelevant and discarded.
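Steps C4 and C5 amount to a cosine similarity followed by a threshold test. A minimal sketch, assuming the page and the topic are represented as sparse dictionaries mapping keywords to weights (a representation chosen for this example, not prescribed by the patent):

```python
import math


def cosine_similarity(page_vec, topic_vec):
    """Cosine of the angle between two sparse {keyword: weight} vectors."""
    dot = sum(w * topic_vec.get(k, 0.0) for k, w in page_vec.items())
    norm_page = math.sqrt(sum(w * w for w in page_vec.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_page == 0.0 or norm_topic == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_page * norm_topic)


def is_relevant(page_vec, topic_vec, d):
    """Keep the page when Sim(D) >= d, as in step C5."""
    return cosine_similarity(page_vec, topic_vec) >= d
```

Because the measure depends only on the angle between the vectors, a long page and a short page with the same keyword proportions receive the same score.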
The threshold is obtained mainly by combining manual setting with training statistics. In the initial stage the threshold is manually set relatively high, to prevent a large number of irrelevant pages from being admitted early on and collected during subsequent crawling, which would cause unnecessary overhead. Alternatively, a number of relevant pages can be sampled, their relevance computed, and the average relevance used as the initial threshold. Then, at regular intervals, some of the crawled raw HTML documents are sampled at random, their relevance is judged manually, and the collection accuracy is computed. The accuracy is measured repeatedly. If the hit rate is stable and remains high, the threshold is lowered by a certain step so that the crawl covers the topic as fully as possible. If the hit rate is low and unstable, the threshold is raised by a certain step to improve the hit rate. This process is repeated until the statistics yield a threshold that achieves the maximum hit rate.
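The tuning procedure above can be sketched as a simple feedback rule. The patent gives no numeric values, so the step size, the "high" and "low" hit-rate bounds and the stability measure (the spread between the best and worst sampled hit rate) are all illustrative assumptions. As a simplification, this sketch raises the threshold whenever the mean hit rate is low, without also testing for instability.

```python
def initial_threshold(sample_scores):
    """Average relevance of a sample of known on-topic pages."""
    return sum(sample_scores) / len(sample_scores)


def adjust_threshold(threshold, hit_rates, step=0.05,
                     high=0.9, low=0.6, max_spread=0.05):
    """Return the threshold to use for the next crawl period.

    hit_rates holds the manually judged accuracy of recent random samples.
    All numeric defaults are illustrative assumptions.
    """
    mean = sum(hit_rates) / len(hit_rates)
    spread = max(hit_rates) - min(hit_rates)
    if mean >= high and spread <= max_spread:
        # Stable and accurate: loosen the threshold to widen coverage.
        return max(0.0, threshold - step)
    if mean < low:
        # Too many off-topic pages: tighten the threshold.
        return min(1.0, threshold + step)
    return threshold
```

Repeated calls with fresh samples move the threshold in small steps, which mirrors the iterative procedure described in the paragraph above.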

Claims (1)

1. A vertical web spider, characterized by comprising the following steps:
A. Topic target description
A1. Specify the initial seed URLs
According to the features of the target pages in the field, initial seed URLs are given in advance as the pages from which the web spider starts crawling;
A2. Establish the topic feature keywords
Feature keywords are first extracted automatically from a collection of web pages and then screened and adjusted manually;
After the topic features have been established, the vertical spider can also learn dynamically and expand the keyword set as it crawls more pages;
B. Web page search
B1. Search strategy
A best-first search strategy is adopted: a queue of URLs to be crawled is built dynamically, the URLs in the queue are sorted according to an evaluation strategy, and the best URL is selected for crawling each time;
B2. URL evaluation strategy
An evaluation method based on page content is adopted; a topic discrimination method computes the topic relevance of each page, and pages whose topic relevance falls below a threshold are discarded;
C. Topic relevance judgment
A vector space model based on page content and structure is adopted; the procedure is as follows;
C1. Preprocessing
Before the web spider starts collecting, keywords are extracted from the seed pages describing the topic and weighted, yielding the feature vector of the topic and the weights of its components;
C2. Text processing
The body of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained; a weighted frequency is then computed from the different positions at which each keyword occurs in the document, according to the formula $TF_i = a\,TF_m + b\,TF_t + c\,TF_k + d\,TF_d + e\,TF_a$;
C3. Keyword expansion
The page keywords obtained are adjusted and expanded according to the feature vector set for the topic;
C4. Compute the similarity between the page and the topic
The similarity between the page and the topic is computed according to the following formula;
$$\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\left(\sum_{i=1}^{n} D_i^2\right) \times \left(\sum_{i=1}^{n} T_i^2\right)}}$$
C5. Judge whether the page is relevant to the topic
The similarity value is compared with a predefined threshold d. If the similarity is greater than or equal to d, the page is considered relevant to the topic and is downloaded and saved locally; otherwise it is judged irrelevant and discarded.
CN201210495397.7A 2012-11-27 2012-11-27 Vertical web spider Pending CN103841173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210495397.7A CN103841173A (en) 2012-11-27 2012-11-27 Vertical web spider

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210495397.7A CN103841173A (en) 2012-11-27 2012-11-27 Vertical web spider

Publications (1)

Publication Number Publication Date
CN103841173A true CN103841173A (en) 2014-06-04

Family

ID=50804298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210495397.7A Pending CN103841173A (en) 2012-11-27 2012-11-27 Vertical web spider

Country Status (1)

Country Link
CN (1) CN103841173A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016000511A1 (en) * 2014-06-30 2016-01-07 北京奇虎科技有限公司 Method and apparatus for mining rare resource of internet
CN105677772A (en) * 2015-12-30 2016-06-15 赛尔网络有限公司 ISP interconnection port URL activity level statistics method and device
CN106612279A (en) * 2016-12-22 2017-05-03 北京知道创宇信息技术有限公司 Network address processing method, device and system
CN108256110A (en) * 2018-02-08 2018-07-06 平安科技(深圳)有限公司 Gathering method, device, computer equipment and the storage medium of information
CN108694197A (en) * 2017-04-10 2018-10-23 富士通株式会社 Hypertext grasping means and device
CN110147473A (en) * 2017-08-28 2019-08-20 北京国双科技有限公司 A kind of crawling method and device of crawler
CN110609952A (en) * 2019-08-15 2019-12-24 中国平安财产保险股份有限公司 Data acquisition method and system and computer equipment
CN112597369A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage spider theme type search system based on improved cloud platform
CN113449168A (en) * 2021-07-14 2021-09-28 北京锐安科技有限公司 Method, device and equipment for capturing theme webpage data and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060277175A1 (en) * 2000-08-18 2006-12-07 Dongming Jiang Method and Apparatus for Focused Crawling
CN101630327A (en) * 2009-08-14 2010-01-20 昆明理工大学 Design method of theme network crawler system
CN102298622A (en) * 2011-08-11 2011-12-28 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016000511A1 (en) * 2014-06-30 2016-01-07 北京奇虎科技有限公司 Method and apparatus for mining rare resource of internet
CN105677772A (en) * 2015-12-30 2016-06-15 赛尔网络有限公司 ISP interconnection port URL activity level statistics method and device
CN105677772B (en) * 2015-12-30 2019-07-09 赛尔网络有限公司 The statistical method and device of interconnection port URL liveness between a kind of ISP
CN106612279B (en) * 2016-12-22 2020-04-17 北京知道创宇信息技术股份有限公司 Network address processing method, equipment and system
CN106612279A (en) * 2016-12-22 2017-05-03 北京知道创宇信息技术有限公司 Network address processing method, device and system
CN108694197A (en) * 2017-04-10 2018-10-23 富士通株式会社 Hypertext grasping means and device
CN110147473A (en) * 2017-08-28 2019-08-20 北京国双科技有限公司 A kind of crawling method and device of crawler
CN110147473B (en) * 2017-08-28 2022-03-01 北京国双科技有限公司 Crawling method and device for crawler
CN108256110A (en) * 2018-02-08 2018-07-06 平安科技(深圳)有限公司 Gathering method, device, computer equipment and the storage medium of information
CN110609952A (en) * 2019-08-15 2019-12-24 中国平安财产保险股份有限公司 Data acquisition method and system and computer equipment
CN110609952B (en) * 2019-08-15 2024-04-26 中国平安财产保险股份有限公司 Data acquisition method, system and computer equipment
CN112597369A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage spider theme type search system based on improved cloud platform
CN113449168A (en) * 2021-07-14 2021-09-28 北京锐安科技有限公司 Method, device and equipment for capturing theme webpage data and storage medium
CN113449168B (en) * 2021-07-14 2024-02-20 北京锐安科技有限公司 Theme webpage data grabbing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103841173A (en) Vertical web spider
CN102662954B (en) Method for implementing topical crawler system based on learning URL string information
CN105138558B (en) The real time individual information collecting method of content is accessed based on user
CN101441662B (en) Topic information acquisition method based on network topology
CN103365839B (en) The recommendation searching method and device of a kind of search engine
CN102222187B (en) Domain name structural feature-based hang horse web page detection method
CN103530365B (en) Obtain the method and system of the download link of resource
CN106991160B (en) Microblog propagation prediction method based on user influence and content
CN104182412B (en) A kind of web page crawl method and system
CN101477554A (en) User interest based personalized meta search engine and search result processing method
CN102567407B (en) Method and system for collecting forum reply increment
CN102930059A (en) Method for designing focused crawler
CN101908071A (en) Method and device thereof for improving search efficiency of search engine
CN110266528B (en) Traffic prediction method for Internet of vehicles communication based on machine learning
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN103218431A (en) System and method for identifying and automatically acquiring webpage information
CN103714149B (en) Self-adaptive incremental deep web data source discovery method
CN106021418B (en) The clustering method and device of media event
CN106980651B (en) Crawling seed list updating method and device based on knowledge graph
CN106156230A (en) A kind of method and device generating interior chain
CN104252348A (en) Webpage access statistics method and device based on browser
CN107086925B (en) Deep learning-based internet traffic big data analysis method
CN109526027B (en) Cell capacity optimization method, device, equipment and computer storage medium
CN102930016B (en) A kind of method and apparatus for providing Search Results on mobile terminals
CN109977285A (en) A kind of auto-adaptive increment collecting method towards Deep Web

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140604
