CN103279507B

CN103279507B - Webpage spider operational method and system

Info

Publication number: CN103279507B
Application number: CN201310181364.XA
Authority: CN
Inventors: 许大伦; 毛颖; 黄明军
Original assignee: BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Current assignee: Lele Kaihang (Beijing) Education Technology Co., Ltd.
Priority date: 2013-05-16
Filing date: 2013-05-16
Publication date: 2016-12-28
Anticipated expiration: 2033-05-16
Also published as: CN103279507A

Abstract

The invention discloses a web crawler operating method and system. The method mainly comprises: crawling the URLs of a website according to preset-mode parameters and adding them to a memory queue; the memory queue judging whether a newly added URL duplicates a URL it already stores; crawling data from the web page at that URL, traversing the lower-level link URLs contained in the page, and judging whether they are duplicates; crawling data from the web pages at the lower-level link URLs, then judging whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing the result to a data processing queue; and the data processing queue comparing this data with existing data and modifying the crawl frequency in the preset-mode parameters according to the analysis result. The invention addresses the problems in the prior art that web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.

Description

Webpage spider operational method and system

Technical field

The present invention relates to the field of network technology, and in particular to a web crawler operating method and system.

Background art

A search engine is a system that collects information from the Internet with specific computer programs according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's query. To collect information from the Internet, a search engine relies on web crawlers to crawl the relevant website information.

A web crawler is a program that automatically acquires web page content and is an important component of a search engine.

In the prior art, a traditional crawler for a general-purpose search engine starts from the URLs of one or several initial pages and obtains the URLs on those pages. While crawling, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.

In the current prior art, web crawlers analyze web page content poorly and can only crawl website information mechanically and continuously, often issuing tens or even hundreds of concurrent requests that repeat the crawl in a loop. The crawl frequency and crawl pressure are therefore very high, which consumes large amounts of site resources, burdens the website, and can even bring it down. At the same time, such crawlers cannot crawl the useful information in a website accurately and efficiently.

Therefore, how to solve the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively has become a technical problem in urgent need of a solution.

Summary of the invention

The technical problem to be solved by the present invention is to provide a web crawler operating method and system that solve the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.

To solve the above technical problem, the present invention provides a web crawler operating method, characterized in that it comprises:

crawling the URLs of a website according to preset-mode parameters and adding them to a memory queue;

the memory queue judging whether a newly added URL duplicates a URL it already stores; if so, ignoring the URL; if not, crawling data from the web page at the URL and traversing the lower-level link URLs contained in the page, and judging whether each lower-level link URL is a duplicate; if so, ignoring it; if not, crawling data from the web page at the lower-level link URL; the memory queue then judging whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing the result to a data processing queue;

the data processing queue comparing this data with existing data and modifying the preset-mode parameters according to the analysis result.

Preferably, the preset-mode parameters further comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

Preferably, the preset conditions further comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

Preferably, crawling the URLs of a website according to the preset-mode parameters and adding them to the memory queue further comprises: setting up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; setting the preset-mode parameters through the preset template; and crawling the URLs of the website according to the preset-mode parameters and adding them to the memory queue.

Preferably, parsing and extracting the crawled data according to the preset conditions and passing the result to the data processing queue further comprises: parsing the crawled data and extracting information according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data; packaging the information; and passing it to the data processing queue.

Preferably, modifying the preset-mode parameters according to the analysis result further comprises: modifying the content of the preset template according to the analysis result, and modifying the crawl frequency in the preset-mode parameters through the preset template.

To solve the above technical problem, the present invention further provides a web crawler operating system, characterized in that it comprises a capture module, a memory module, and a data analysis and processing module, wherein:

the capture module is configured to crawl the URLs of a website according to preset-mode parameters and to pass the URLs to the memory queue in the memory module;

the memory module is configured to receive the website URLs passed by the capture module and store them in its internal memory queue, and to judge whether a newly added URL duplicates a URL already stored in the memory queue; if so, the URL is ignored; if not, data is crawled from the web page at the URL and the lower-level link URLs contained in the page are traversed, and each lower-level link URL is judged for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data analysis and processing module;

the data analysis and processing module is configured to receive the extracted data passed by the memory module and place it in its internal data processing queue; the data processing queue compares this data with existing data, and the preset-mode parameters in the capture module are modified according to the analysis result.

Preferably, the preset-mode parameters in the capture module further comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

Preferably, the preset conditions in the memory module further comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

Preferably, the capture module is further configured to set up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and to set the preset-mode parameters through the preset template.

Preferably, the memory module is further configured to parse the crawled data and extract information according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data, to package the information, and to pass it to the data processing queue in the data analysis and processing module.

Preferably, the data analysis and processing module is further configured to modify the content of the preset template in the capture module according to the analysis result, and to modify the crawl frequency in the preset-mode parameters through the preset template.

Compared with the prior art, the web crawler operating method and system of the present invention achieve the following effects:

1) The present invention reduces the crawl frequency and crawl pressure of the web crawler, effectively reducing the excessive additional burden placed on websites.

2) The present invention achieves large-scale distributed concurrent collection, greatly improving data collection efficiency and making task customization highly efficient.

3) The present invention uses the cloud to achieve high accuracy in acquiring the desired content.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the web crawler operating method according to Embodiment One of the present invention;

Fig. 2 is a structural block diagram of the web crawler operating system according to Embodiment Two of the present invention.

Detailed description of the invention

Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by differences in name but by differences in function. The term "comprising" as used throughout the description and claims is open-ended and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, those skilled in the art can solve the stated technical problem and essentially achieve the stated technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text states that a first device is coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the present invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the present invention is defined by the appended claims.

The present invention is described in further detail below with reference to the accompanying drawings, which are not to be taken as limiting the invention.

Embodiment one

Fig. 1 shows the flow of the web crawler operating method according to Embodiment One of the present invention.

Step 101: crawl the URLs (Uniform/Universal Resource Locator, web page address) of a website according to preset-mode parameters and add them to a memory queue;

Step 102: the memory queue judges whether a newly added URL duplicates a URL it already stores; if so, the URL is ignored; if not, data is crawled from the web page at the URL and the lower-level link URLs contained in the page are traversed, and each lower-level link URL is judged for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to a data processing queue;

Here, crawling data from the web page at the URL and traversing the lower-level link URLs contained in the page means parsing out the lower-level links contained in the page at that URL according to the preset conditions, and traversing them according to the traversal depth value given in the preset conditions. The same applies below and is not repeated.
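The flow of steps 101 and 102 can be illustrated by a minimal Python sketch. The names `crawl`, `fetch`, `extract_links`, and `max_depth` are hypothetical and do not appear in the original text; the page download and link parsing are passed in as functions, and a set is assumed as the data structure that detects duplicate URLs.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], List[str]],
          max_depth: int) -> Dict[str, str]:
    """Breadth-first crawl with duplicate detection and a traversal depth limit."""
    seen = set()                 # URLs already added (assumed de-duplication structure)
    pending = deque()            # (url, depth) pairs that are still unprocessed
    pages: Dict[str, str] = {}   # URL -> raw page data

    for url in seed_urls:
        if url not in seen:      # a URL that is already stored is ignored
            seen.add(url)
            pending.append((url, 0))

    while pending:               # stop once no unprocessed URL remains
        url, depth = pending.popleft()
        page = fetch(url)
        pages[url] = page
        if depth >= max_depth:   # traversal depth value from the preset conditions
            continue
        for link in extract_links(page):
            if link in seen:     # duplicate lower-level link: ignore it
                continue
            seen.add(link)
            pending.append((link, depth + 1))
    return pages
```

Because each URL enters `seen` before it is queued, no page is fetched twice, which is the property the duplicate check in step 102 is meant to provide.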

Step 103: the data processing queue compares this data with existing data and modifies the crawl frequency in the preset-mode parameters according to the analysis result. (Embodiment One and the embodiments below modify the crawl frequency, but the invention is not limited to this parameter; the other preset-mode parameters can also be modified. This is not repeated below.)

The preset-mode parameters in step 101 comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

Further, the preset mode is set up as follows: a preset template is set up according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and the preset-mode parameters are set through this preset template.

Modifying the crawl frequency in the preset mode according to the analysis result in step 103 further means: modifying the content of the preset template according to the analysis result, and modifying the crawl frequency in the preset-mode parameters through the preset template.
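One possible way to represent the four preset-mode parameters and to set them through a template is sketched below. The class `PresetParams`, its field names, the default values, and the function `apply_template` are all illustrative assumptions; the original text does not specify a data format for the template.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List


@dataclass
class PresetParams:
    """Hypothetical container for the preset-mode parameters."""
    start_urls: List[str]                 # initial crawl addresses
    crawl_interval_s: float = 3600.0      # crawl frequency, as seconds between crawls
    page_delay_s: float = 1.0             # delay applied between page requests
    store_rules: List[str] = field(default_factory=list)  # which page data is queued


def apply_template(params: PresetParams, template: Dict[str, Any]) -> PresetParams:
    """Return a copy of params with the template's values applied.

    dataclasses.replace raises TypeError for a key that is not a field,
    so a misspelled template entry is rejected instead of silently ignored.
    """
    return replace(params, **template)
```

For example, `apply_template(params, {"crawl_interval_s": 600.0})` yields new parameters with a changed crawl frequency while leaving the original object untouched.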

Further, in this embodiment the memory queue is also called the de-duplication queue. Those skilled in the art will readily understand that "de-duplication queue" and "memory queue" mean the same thing; this is not repeated below.

Further, the preset conditions in step 102 comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

In step 102, parsing and extracting the crawled data according to the preset conditions and passing the result to the data processing queue further means: parsing the crawled data and extracting information (including data as well as binary files such as pictures and/or Flash) according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data; packaging the information; and passing it to the data processing queue.
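The rule-driven parsing and packaging of step 102 can be sketched with the Python standard library alone. The rule format (`kind`, `selector`, `path`), the functions `extract` and `package`, and the reduced `tag.class` selector are assumptions made for illustration; a full jQuery-like selector engine and the script case are left out.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class _TextBySelector(HTMLParser):
    """Collect the text of elements matching a simple 'tag' or 'tag.class' selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.depth = 0                    # > 0 while inside a matching element
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            if tag == self.tag:
                self.depth += 1           # nested element with the same tag name
        elif tag == self.tag and (not self.cls or self.cls in classes):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(raw: str, rule: Dict[str, str]) -> List[Any]:
    """Apply one hypothetical extraction rule to raw crawled data."""
    kind = rule["kind"]
    if kind == "json":
        value: Any = json.loads(raw)
        for key in rule["path"].split("."):   # dotted path through nested objects
            value = value[key]
        return [value]
    if kind == "dom":
        parser = _TextBySelector(rule["selector"])
        parser.feed(raw)
        return [text.strip() for text in parser.results]
    raise ValueError(f"unsupported rule kind: {kind}")


def package(url: str, items: List[Any]) -> Dict[str, Any]:
    """Wrap extracted items so they can be passed to a data processing queue."""
    return {"url": url, "items": items}
```

Keeping the rule as plain data means that the same crawler code can serve different sites by swapping the rule, which matches the role the preset conditions play in the text.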

The purpose of the comparative analysis in step 103 is to determine how frequently the website's data changes and to modify the crawl frequency accordingly, thereby solving the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.
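A minimal sketch of such a feedback rule follows. The original text does not state how the new crawl frequency is computed; the halve-on-change and double-on-no-change policy, the bounds, and the name `next_interval` are assumptions chosen only to make the idea concrete.

```python
import hashlib


def next_interval(old_data: bytes, new_data: bytes, interval_s: float,
                  min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Return the next crawl interval in seconds.

    Assumed policy: if the newly crawled data differs from the existing
    data, crawl twice as often; if it is unchanged, crawl half as often.
    The result is clamped to the range [min_s, max_s].
    """
    changed = hashlib.sha256(old_data).digest() != hashlib.sha256(new_data).digest()
    proposed = interval_s / 2 if changed else interval_s * 2
    return max(min_s, min(max_s, proposed))
```

Under this policy a site whose content rarely changes is visited less and less often, which is how the feedback reduces the load placed on it.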

The concrete operation of Embodiment One of the present invention may be as follows:

First, a preset template is set up according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; the preset-mode parameters are set through the preset template; the number of processes and the number of websites crawled concurrently are set, and the crawl processes are distributed evenly among the websites. In this way the websites can be crawled at a reasonable rate, the crawler avoids putting the pressure of intensive access on the crawled websites, and efficiency is not lost;

Second, the URLs of the website are crawled according to the preset-mode parameters and added to the memory queue. A website has many pages, and before a template is set the crawler would crawl every page of the website; but not every page is useful to the user, so network resources are wasted. With the preset template, the crawler crawls only the data the user is interested in. The preset-mode parameters comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue;

Third, the memory queue judges whether a newly added URL duplicates a URL it already stores; if so, the URL is ignored; if not, data is crawled from the web page at the URL and the lower-level link URLs contained in the page are traversed, and each lower-level link URL is judged for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains and, if none remains, the crawled data is parsed and information is extracted (including data as well as binary files such as pictures and/or Flash) according to the preset conditions (which comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data); the information is packaged and passed to the data processing queue;

Finally, the data processing queue compares this data with existing data; the content of the preset template is modified according to the analysis result, and the crawl frequency in the preset-mode parameters is modified through the preset template.
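The even distribution of crawl processes among websites mentioned in the first operation can be sketched as follows. The function name `allocate_processes` and the choice to give any remainder to the first sites in the list are assumptions; the original text only states that the processes are distributed evenly.

```python
from typing import Dict, List


def allocate_processes(sites: List[str], process_count: int) -> Dict[str, int]:
    """Split process_count crawl processes as evenly as possible across sites."""
    if not sites:
        return {}
    base, extra = divmod(process_count, len(sites))
    # Every site gets `base` processes; the first `extra` sites get one more.
    return {site: base + (1 if i < extra else 0) for i, site in enumerate(sites)}
```

With eight processes and three sites, the split is 3, 3, and 2, so no single site receives a disproportionate share of concurrent requests.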

Embodiment two

As shown in Fig. 2, the web crawler operating system according to Embodiment Two of the present invention comprises a capture module 201, a memory module 202, and a data analysis and processing module 203, wherein:

The capture module 201 is coupled to the memory module 202 and the data analysis and processing module 203, and is configured to crawl the URLs (Uniform/Universal Resource Locator, web page address) of a website according to preset-mode parameters and to pass the URLs to the memory queue in the memory module 202.

The memory module 202 is coupled to the capture module 201 and the data analysis and processing module 203, and is configured to receive the website URLs passed by the capture module 201 and store them in its internal memory queue. It then judges whether a newly added URL duplicates a URL already stored in the memory queue; if so, the URL is ignored; if not, data is crawled from the web page at the URL and the lower-level link URLs contained in the page are traversed, and each lower-level link URL is judged for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data processing queue of the data analysis and processing module 203.

The preset conditions further comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

Further, the memory module 202 parses the crawled data and extracts information (including data as well as binary files such as pictures and/or Flash) according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data; it packages the information and passes it to the data processing queue in the data analysis and processing module 203.

The data analysis and processing module 203 is coupled to the capture module 201 and the memory module 202, and is configured to receive the extracted data passed by the memory module 202 and place it in its internal data processing queue. The data processing queue compares this data with existing data, and the crawl frequency in the preset-mode parameters of the capture module 201 is modified according to the analysis result.

In this embodiment, the preset-mode parameters further comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

The preset mode is set up as follows: the capture module 201 sets up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and sets the preset-mode parameters through this preset template.

Further, the data analysis and processing module 203 modifies the content of the preset template in the capture module 201 according to the analysis result, and modifies the crawl frequency in the preset-mode parameters through the preset template.
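The coupling between the three modules can be sketched with three small Python classes. The class names, fields, and the halve-or-double adjustment of the crawl interval are hypothetical; they only illustrate how the analysis result could flow back into the capture module's parameters.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Set, Tuple


@dataclass
class CaptureModule:
    """Holds the preset-mode parameter that the analysis step feeds back into."""
    crawl_interval_s: float = 600.0   # crawl frequency, as seconds between crawls


@dataclass
class MemoryModule:
    """Memory queue of URLs with duplicate detection."""
    seen: Set[str] = field(default_factory=set)
    queue: Deque[str] = field(default_factory=deque)

    def add(self, url: str) -> bool:
        if url in self.seen:          # duplicate URL: ignore it
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


@dataclass
class AnalysisModule:
    """Compares extracted data with existing data and adjusts the capture module."""
    capture: CaptureModule
    existing: Dict[str, str] = field(default_factory=dict)
    processing_queue: Deque[Tuple[str, str]] = field(default_factory=deque)

    def receive(self, url: str, data: str) -> None:
        self.processing_queue.append((url, data))

    def process(self) -> int:
        """Drain the queue and return how many items differed from stored data."""
        changed = 0
        handled = 0
        while self.processing_queue:
            url, data = self.processing_queue.popleft()
            if self.existing.get(url) != data:
                changed += 1
            self.existing[url] = data
            handled += 1
        if handled:
            # Assumed policy: crawl sooner if anything changed, later otherwise.
            self.capture.crawl_interval_s *= 0.5 if changed else 2.0
        return changed
```

The analysis module holds a reference to the capture module rather than a copy of its parameters, so a change made during `process` takes effect on the next crawl cycle.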

The above transmission and collection processes can be implemented through the cloud, so that large-scale distributed concurrent collection can be carried out, data collection efficiency is improved, the desired content is acquired accurately, and efficient customization of tasks is facilitated as far as possible. Meanwhile, through configured templates the web crawler operating system flexibly collects all structured content that a browser can see and supports various page types, including news, forums, blogs, and pictures.

The above description shows and describes some preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed here; these should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims.

Claims

1. A web crawler operating method, characterized in that it comprises:

the memory queue judging whether a newly added URL duplicates a URL it already stores; if so, ignoring the newly added URL; if not, crawling data from the web page at the newly added URL and traversing the lower-level link URLs contained in that page, and judging whether each lower-level link URL is a duplicate; if so, ignoring it; if not, crawling data from the web page at the lower-level link URL; the memory queue then judging whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing the result to a data processing queue, wherein crawling data from the web page at the newly added URL and traversing the lower-level link URLs contained in that page comprises parsing out, according to the preset conditions, the lower-level links contained in the web page at the newly added URL and traversing them according to the traversal depth value in the preset conditions;

the data processing queue comparing the extracted data with existing data and modifying the preset-mode parameters according to the analysis result, wherein the preset-mode parameters further comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

2. The web crawler operating method according to claim 1, characterized in that the preset conditions further comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

3. The web crawler operating method according to claim 2, characterized in that crawling the URLs of a website according to the preset-mode parameters and adding them to the memory queue further comprises: setting up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; setting the preset-mode parameters through the preset template; and crawling the URLs of the website according to the preset-mode parameters and adding them to the memory queue.

4. The web crawler operating method according to claim 3, characterized in that parsing and extracting the crawled data according to the preset conditions and passing the result to the data processing queue further comprises: parsing the crawled data and extracting data according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data; packaging the extracted data; and passing it to the data processing queue.

5. The web crawler operating method according to claim 4, characterized in that modifying the preset-mode parameters according to the analysis result further comprises: modifying the content of the preset template according to the analysis result, and modifying the crawl frequency in the preset-mode parameters through the preset template.

6. A web crawler operating system, characterized in that it comprises a capture module, a memory module, and a data analysis and processing module, wherein:

the capture module is configured to crawl the URLs of a website according to preset-mode parameters and to pass the URLs of the website to the memory queue in the memory module;

the memory module is configured to receive the URLs of the website passed by the capture module and store them in its internal memory queue, and to judge whether a newly added URL duplicates a URL already stored in the memory queue; if so, the newly added URL is ignored; if not, data is crawled from the web page at the newly added URL and the lower-level link URLs contained in that page are traversed, and each lower-level link URL is judged for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data analysis and processing module, wherein crawling data from the web page at the newly added URL and traversing the lower-level link URLs contained in that page comprises parsing out, according to the preset conditions, the lower-level links contained in the web page at the newly added URL and traversing them according to the traversal depth value in the preset conditions;

the data analysis and processing module is configured to receive the extracted data passed by the memory module and place it in its internal data processing queue; the data processing queue compares the extracted data with existing data, and the preset-mode parameters in the capture module are modified according to the analysis result, wherein the preset-mode parameters in the capture module further comprise: an initial crawl address, a crawl frequency, a page crawl delay condition, and a condition for storing data in a web page into the queue.

7. The web crawler operating system according to claim 6, characterized in that the preset conditions in the memory module further comprise: parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data.

8. The web crawler operating system according to claim 7, characterized in that the capture module is further configured to set up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and to set the preset-mode parameters through the preset template.

9. The web crawler operating system according to claim 8, characterized in that

the memory module is further configured to parse the crawled data and extract data according to the preset conditions, namely parsing DOM data with jQuery-like syntax, parsing JSON data, and/or parsing script data, to package the extracted data, and to pass it to the data processing queue in the data analysis and processing module.

10. The web crawler operating system according to claim 9, characterized in that

the data analysis and processing module is further configured to modify the content of the preset template in the capture module according to the analysis result, and to modify the crawl frequency in the preset-mode parameters through the preset template.