CN103279507B - Webpage spider operational method and system - Google Patents

Publication number
CN103279507B
Authority
CN
China
Prior art keywords
data
url
webpage
parameter
crawl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310181364.XA
Other languages
Chinese (zh)
Other versions
CN103279507A (en)
Inventor
许大伦
毛颖
黄明军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lele Kaihang (Beijing) Education Technology Co., Ltd.
Original Assignee
BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310181364.XA
Publication of CN103279507A
Application granted
Publication of CN103279507B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web crawler operating method and system. The method mainly comprises: capturing website URLs according to preset-mode parameters and adding them to an in-memory queue; the in-memory queue judging whether a newly added URL duplicates a URL it already stores; capturing data from the webpage at that URL, traversing the lower-level link URLs contained in the webpage, and judging whether each of them is a duplicate; capturing data from the webpages at the lower-level link URLs, then judging whether any unprocessed URL remains and, if none remains, parsing and extracting the captured data according to preset conditions and passing the result to a data processing queue; and the data processing queue comparing and analyzing these data against existing data and revising the crawl frequency in the preset-mode parameters according to the analysis result. The invention addresses the problems in the prior art that web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.

Description

Webpage spider operational method and system
Technical field
The present invention relates to the field of network technology and, in particular, to a web crawler operating method and system.
Background art
A search engine is a system that collects information from the Internet using specific computer programs according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's search. To collect information from the Internet, the search engine relies on web crawlers to crawl information from relevant websites.
A web crawler is a program that automatically fetches webpage content and is an important component of a search engine.
In the prior art, for an ordinary search engine, a traditional crawler starts from the URLs of one or several initial pages and obtains the URLs found on those pages. While capturing webpages, it continually extracts new URLs from the current page and puts them into a queue until some stop condition of the system is met.
In the current prior art, web crawlers have a poor ability to analyze webpage content and can only capture website information mechanically and continuously. Tens or even hundreds of concurrent requests capture the same content repeatedly in a loop, so the crawl frequency and crawl pressure are very high. This consumes a large amount of site resources, burdens the website, and may even cause it to crash. At the same time, such crawlers cannot crawl the useful information in a website accurately and efficiently.
Therefore, how to solve the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a web crawler operating method and system that solve the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.
To solve the above technical problem, the invention provides a web crawler operating method, characterized in that it comprises:
capturing the URLs of a website according to preset-mode parameters and adding them to an in-memory queue;
the in-memory queue judging whether a newly added URL duplicates a URL it already stores; if so, ignoring that URL; if not, capturing data from the webpage at that URL, traversing the lower-level link URLs contained in the webpage, and judging whether each lower-level link URL is a duplicate; if so, ignoring it; if not, capturing data from the webpage at that lower-level link URL; the in-memory queue then judging whether any unprocessed URL remains and, if none remains, parsing and extracting the captured data according to preset conditions and passing the result to a data processing queue; and
the data processing queue comparing and analyzing these data against existing data and revising the preset-mode parameters according to the analysis result.
Preferably, the preset-mode parameters further include: the initial crawl address, the crawl frequency, the crawl delay condition for webpages, and the condition under which data in a webpage is stored in the queue.
Preferably, the preset conditions further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Preferably, capturing the URLs of a website according to the preset-mode parameters and adding them to the in-memory queue further comprises: setting up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; setting the preset-mode parameters through this preset template; and capturing the URLs of the website according to those parameters and adding them to the in-memory queue.
Preferably, parsing and extracting the captured data according to the preset conditions and passing the result to the data processing queue further comprises: parsing the captured data and extracting information according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), packaging this information, and passing it to the data processing queue.
Preferably, revising the preset-mode parameters according to the analysis result further comprises: modifying the content of the preset template according to the analysis result and, through the preset template, revising the crawl frequency in the preset-mode parameters.
To solve the above technical problem, the invention also provides a web crawler operating system, characterized in that it comprises a capture module, a memory module, and a data analysis and processing module, wherein:
the capture module is configured to capture the URLs of a website according to preset-mode parameters and to pass these URLs to the in-memory queue in the memory module;
the memory module is configured to receive the website URLs passed by the capture module and store them in its internal in-memory queue; to judge whether a newly added URL duplicates a URL already stored in the in-memory queue; if so, to ignore that URL; if not, to capture data from the webpage at that URL, traverse the lower-level link URLs contained in the webpage, and judge whether each lower-level link URL is a duplicate; if so, to ignore it; if not, to capture data from the webpage at that lower-level link URL; the in-memory queue then judges whether any unprocessed URL remains and, if none remains, the captured data are parsed and extracted according to preset conditions and passed to the data analysis and processing module; and
the data analysis and processing module is configured to receive the extracted data passed by the memory module and place them in its internal data processing queue; the data processing queue compares and analyzes these data against existing data and revises the preset-mode parameters in the capture module according to the analysis result.
Preferably, the preset-mode parameters in the capture module further include: the initial crawl address, the crawl frequency, the crawl delay condition for webpages, and the condition under which data in a webpage is stored in the queue.
Preferably, the preset conditions in the memory module further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Preferably, the capture module is further configured to set up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and to set the preset-mode parameters through this preset template.
Preferably, the memory module is further configured to parse the captured data and extract information according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), to package this information, and to pass it to the data processing queue in the data analysis and processing module.
Preferably, the data analysis and processing module is further configured to modify the content of the preset template in the capture module according to the analysis result and, through the preset template, to revise the crawl frequency in the preset-mode parameters.
Compared with the prior art, the web crawler operating method and system of the present invention achieve the following effects:
1) The invention reduces the crawl frequency and crawl pressure of the web crawler, effectively reducing the excessive additional burden placed on websites.
2) The invention achieves large-scale distributed concurrent collection, greatly improving data collection efficiency and the efficiency of task customization.
3) The invention uses the cloud and achieves high accuracy in obtaining the desired content.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow diagram of the web crawler operating method of Embodiment One of the invention;
Fig. 2 is a structural block diagram of the web crawler operating system of Embodiment Two of the invention.
Detailed description of the embodiments
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by differences in name but by differences in function. "Comprising", as used throughout the description and claims, is an open-ended term and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the stated technical problem and essentially achieve the stated technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Thus, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
The invention is described in further detail below with reference to the accompanying drawings, which are not to be taken as limiting the invention.
Embodiment one
As shown in Fig. 1, the flow of the web crawler operating method of Embodiment One of the invention is as follows.
Step 101: capture the URLs (Uniform/Universal Resource Locator, i.e. webpage address) of a website according to the preset-mode parameters and add them to the in-memory queue.
Step 102: the in-memory queue judges whether a newly added URL duplicates a URL it already stores. If so, that URL is ignored. If not, data are captured from the webpage at that URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication. If it is a duplicate, it is ignored; if not, data are captured from the webpage at that lower-level link URL. The in-memory queue then judges whether any unprocessed URL remains and, if none remains, the captured data are parsed and extracted according to the preset conditions and passed to the data processing queue.
Here, capturing data from the webpage at a URL and traversing the lower-level link URLs contained in it means parsing out the lower-level links contained in the webpage at that URL according to the preset conditions and traversing them according to the traversal depth value in the preset conditions. The same applies below and is not repeated.
Step 103: the data processing queue compares and analyzes these data against existing data and revises the crawl frequency in the preset-mode parameters according to the analysis result. (Embodiment One and the later embodiment revise the crawl frequency, but the revision is of course not limited to this parameter; the other preset-mode parameters may also be revised. This is not repeated below.)
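Steps 101 and 102 can be sketched as a breadth-first loop over a queue of unprocessed URLs. This is a minimal illustration rather than the patented implementation: the `fetch` callable, the `max_depth` argument, and the tuple it returns are assumptions introduced here to stand in for the page capture and the traversal depth value.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Tuple[str, List[str]]],
          max_depth: int) -> Dict[str, str]:
    """Capture pages breadth-first, skipping duplicate URLs.

    `fetch` is a hypothetical callable that returns
    (page_data, lower_level_link_urls) for a URL.
    """
    seen = set()        # every URL ever admitted to the queue
    pending = deque()   # unprocessed (url, depth) pairs
    captured = {}       # url -> captured page data

    for url in seed_urls:
        if url not in seen:          # ignore a URL that duplicates a stored one
            seen.add(url)
            pending.append((url, 0))

    while pending:                   # stop once no unprocessed URL remains
        url, depth = pending.popleft()
        data, links = fetch(url)
        captured[url] = data
        if depth >= max_depth:
            continue                 # do not traverse below the depth limit
        for link in links:
            if link not in seen:     # duplicate lower-level links are ignored
                seen.add(link)
                pending.append((link, depth + 1))
    return captured
```

Once the loop ends, `captured` holds the data that would next be parsed, extracted, and handed to the data processing queue.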
The preset-mode parameters in step 101 include: the initial crawl address, the crawl frequency, the crawl delay condition for webpages, and the condition under which data in a webpage is stored in the queue.
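One way to picture these four parameters is as a small record type. The field names, the units (visits per hour, seconds), and the representation of storage conditions as a list of strings are all assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlParameters:
    """Hypothetical container for the four preset-mode parameters."""
    initial_urls: List[str]             # initial crawl address(es)
    crawl_frequency_per_hour: float     # crawl frequency (unit assumed)
    page_delay_seconds: float           # crawl delay between page requests
    storage_conditions: List[str] = field(default_factory=list)  # when page data is queued

    def revisit_interval_seconds(self) -> float:
        """Seconds between two crawls implied by the crawl frequency."""
        return 3600.0 / self.crawl_frequency_per_hour
```

For example, a frequency of 4 crawls per hour corresponds to a revisit interval of 900 seconds.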
Further, the preset mode is set up as follows: a preset template is set according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and the preset-mode parameters are set through this preset template.
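The text does not say how the environment maps onto template values, so the sketch below only illustrates the idea of deriving a template from the host environment and then setting parameters through it. `HostEnvironment`, both heuristics, and the dictionary keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HostEnvironment:
    cpu_cores: int          # stands in for "server performance"
    bandwidth_mbps: float   # stands in for "network bandwidth"
    crawl_processes: int    # requested number of crawl processes


def build_preset_template(env: HostEnvironment) -> Dict[str, float]:
    """Derive template values from the environment (illustrative heuristics)."""
    # Assumption: never run more crawl processes than there are cores.
    processes = max(1, min(env.crawl_processes, env.cpu_cores))
    # Assumption: a slower link warrants a longer pause between requests.
    page_delay = max(0.5, 10.0 / env.bandwidth_mbps)
    return {"processes": processes, "page_delay_seconds": page_delay}


def apply_template(template: Dict[str, float],
                   parameters: Dict[str, float]) -> Dict[str, float]:
    """Set the parameters through the template; template values take precedence."""
    updated = dict(parameters)
    updated.update(template)
    return updated
```

Keeping the template separate from the parameters means a later analysis step can edit the template alone and have the change flow into the parameters.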
Revising the crawl frequency in the preset mode according to the analysis result in step 103 further means: modifying the content of the preset template according to the analysis result and, through the preset template, revising the crawl frequency in the preset-mode parameters.
Further, the in-memory queue is also called the de-duplication queue in this embodiment. Those skilled in the art will fully understand that "de-duplication queue" and "in-memory queue" mean the same thing; this is not repeated below.
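A de-duplication queue of this kind might look like the sketch below. The patent only says that overlapping URLs are ignored; treating two URLs as the same when they differ only in host case, an empty path, or a fragment is an assumption added here, as are the class and method names.

```python
from collections import deque
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonical form used for duplicate detection (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Lower-case scheme and host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class DedupQueue:
    """In-memory queue that refuses URLs it has already stored."""

    def __init__(self) -> None:
        self._seen = set()
        self._pending = deque()

    def add(self, url: str) -> bool:
        """Enqueue `url`; return False if it duplicates a stored URL."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def has_unprocessed(self) -> bool:
        return bool(self._pending)

    def next_unprocessed(self) -> Optional[str]:
        return self._pending.popleft() if self._pending else None
```

The set of seen URLs is kept separate from the pending deque so that a URL stays a known duplicate after it has been processed.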
Further, the preset conditions in step 102 include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Parsing and extracting the captured data according to the preset conditions and passing the result to the data processing queue in step 102 further means: parsing the captured data and extracting information (including data information and binary files such as pictures and/or Flash) according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), packaging this information, and passing it to the data processing queue.
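As a rough illustration of condition-driven extraction, the sketch below handles two of the three kinds of preset condition using only the Python standard library. The condition dictionaries (`kind`, `path`, `selector`) are a hypothetical format, and the selector support is a deliberately tiny subset (`tag` or `tag.class`) rather than real jQuery syntax; script data is not covered.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class _SelectorParser(HTMLParser):
    """Collects the text of elements matching a `tag` or `tag.class` selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self._depth = 0                  # > 0 while inside a matching element
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if tag == self.tag and (not self.cls or self.cls in classes):
                self._depth = 1
                self.results.append("")
        elif tag == self.tag:
            self._depth += 1             # nested element with the same tag name

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data


def extract(raw: str, condition: Dict[str, Any]) -> List[Any]:
    """Parse `raw` according to one preset condition and return the extracted items."""
    if condition["kind"] == "json":
        value = json.loads(raw)
        for key in condition["path"]:    # walk down the configured key path
            value = value[key]
        return value if isinstance(value, list) else [value]
    if condition["kind"] == "dom":
        parser = _SelectorParser(condition["selector"])
        parser.feed(raw)
        parser.close()
        return [text.strip() for text in parser.results]
    raise ValueError("unsupported condition kind: %r" % condition["kind"])
```

A production crawler would more likely rely on a full HTML parsing library with CSS selector support; the hand-written parser here only keeps the example self-contained.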
The purpose of the comparative analysis in step 103 is to determine how often the website's data changes and to revise the crawl frequency accordingly, thereby solving the problems that prior-art web crawlers place an excessive additional burden on websites and cannot acquire website information accurately and effectively.
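The patent does not give a formula for this revision. The sketch below assumes one simple policy: compare a fingerprint of the newly captured data with the stored one, halve the revisit interval when the content changed, and double it when it did not, within fixed bounds. The function names, the SHA-256 fingerprint, and the bounds are all illustrative choices.

```python
import hashlib


def fingerprint(data: str) -> str:
    """Compact stand-in for the 'existing data' kept from the last crawl."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def revise_interval(interval_s: float, new_data: str, stored_fingerprint: str,
                    min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Return a revised revisit interval in seconds (assumed policy)."""
    changed = fingerprint(new_data) != stored_fingerprint
    # Changed content: crawl sooner. Unchanged content: back off.
    proposed = interval_s / 2 if changed else interval_s * 2
    return min(max_s, max(min_s, proposed))
```

Under this policy a site whose content never changes drifts toward one visit per day, which is the kind of load reduction the embodiment aims for.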
The concrete operation of Embodiment One may be as follows:
First, a preset template is set according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and the preset-mode parameters are set through this template. The number of concurrently running crawl processes and the number of websites are set, and the crawl processes are distributed evenly across the websites. In this way websites are crawled at a reasonable rate, the crawler avoids putting the pressure of intensive access on the crawled website, and efficiency is not lost;
Second, the URLs of a website are captured according to the preset-mode parameters and added to the in-memory queue. A website has many pages; without a template the crawler would capture every page of the site, yet not every page is useful to the user, which wastes network resources. With the preset template, the crawler captures only the data the user is interested in. The preset-mode parameters include: the initial crawl address, the crawl frequency, the crawl delay condition for webpages, and the condition under which data in a webpage is stored in the queue;
Third, the in-memory queue judges whether a newly added URL duplicates a URL it already stores. If so, that URL is ignored. If not, data are captured from the webpage at that URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication. If it is a duplicate, it is ignored; if not, data are captured from the webpage at that lower-level link URL. The in-memory queue then judges whether any unprocessed URL remains and, if none remains, the captured data are parsed and information (including data information and binary files such as pictures and/or Flash) is extracted according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data); this information is packaged and passed to the data processing queue;
Finally, the data processing queue compares and analyzes these data against existing data; the content of the preset template is modified according to the analysis result, and the crawl frequency in the preset-mode parameters is revised through the preset template.
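The even distribution of crawl processes across websites mentioned in the first step can be sketched as a simple division with remainder. The function name and the choice to give leftover processes to the earliest sites in the list are assumptions.

```python
from typing import Dict, List


def allocate_processes(process_count: int, sites: List[str]) -> Dict[str, int]:
    """Spread `process_count` crawl processes as evenly as possible over `sites`."""
    if not sites:
        return {}
    base, extra = divmod(process_count, len(sites))
    # The first `extra` sites receive one additional process each.
    return {site: base + (1 if i < extra else 0) for i, site in enumerate(sites)}
```

With 7 processes and 3 sites, no site receives more than 3 processes, so no single website absorbs the whole crawl load.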
Embodiment two
As shown in Fig. 2, the web crawler operating system of Embodiment Two of the invention comprises a capture module 201, a memory module 202, and a data analysis and processing module 203, wherein:
The capture module 201 is coupled to the memory module 202 and the data analysis and processing module 203. It captures the URLs (Uniform/Universal Resource Locator, i.e. webpage address) of a website according to the preset-mode parameters and passes these URLs to the in-memory queue in the memory module 202.
The memory module 202 is coupled to the capture module 201 and the data analysis and processing module 203. It receives the website URLs passed by the capture module 201 and stores them in its internal in-memory queue. It then judges whether a newly added URL duplicates a URL already stored in the in-memory queue. If so, that URL is ignored. If not, data are captured from the webpage at that URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication. If it is a duplicate, it is ignored; if not, data are captured from the webpage at that lower-level link URL. The in-memory queue then judges whether any unprocessed URL remains and, if none remains, the captured data are parsed and extracted according to the preset conditions and passed to the data processing queue of the data analysis and processing module 203.
The preset conditions further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Further, the memory module 202 parses the captured data and extracts information (including data information and binary files such as pictures and/or Flash) according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), packages this information, and passes it to the data processing queue in the data analysis and processing module 203.
The data analysis and processing module 203 is coupled to the capture module 201 and the memory module 202. It receives the extracted data passed by the memory module 202 and places them in its internal data processing queue. The data processing queue compares and analyzes these data against existing data and revises the crawl frequency in the preset-mode parameters of the capture module 201 according to the analysis result.
In this embodiment, the preset-mode parameters further include: the initial crawl address, the crawl frequency, the crawl delay condition for webpages, and the condition under which data in a webpage is stored in the queue.
The preset mode is set up as follows: the capture module 201 sets a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and sets the preset-mode parameters through this preset template.
Further, the data analysis and processing module 203 modifies the content of the preset template in the capture module 201 according to the analysis result and, through the preset template, revises the crawl frequency in the preset-mode parameters.
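The coupling between the three modules, including the feedback from module 203 to module 201, can be sketched as three small classes. Class names, the record format, the injected `fetch` callable, and the rule "halve the crawl frequency when nothing changed" are assumptions for illustration; link traversal and parsing are left out to keep the wiring visible.

```python
import queue
from typing import Callable, Dict, List


class CaptureModule:
    """Holds the seed URLs and the crawl frequency that feedback may revise."""

    def __init__(self, seeds: List[str], crawl_frequency: float) -> None:
        self.seeds = seeds
        self.crawl_frequency = crawl_frequency


class MemoryModule:
    """De-duplicates URLs, captures page data, and packages it as records."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self.fetch = fetch       # hypothetical page-capture callable
        self.seen = set()

    def process(self, urls: List[str], out: "queue.Queue") -> None:
        for url in urls:
            if url in self.seen:
                continue         # duplicate URL: ignore
            self.seen.add(url)
            out.put({"url": url, "data": self.fetch(url)})


class AnalysisModule:
    """Compares new records with existing data and feeds back to the capture module."""

    def __init__(self, capture: CaptureModule) -> None:
        self.capture = capture
        self.data_queue: "queue.Queue" = queue.Queue()
        self.existing: Dict[str, str] = {}

    def run(self) -> None:
        total = changed = 0
        while not self.data_queue.empty():
            record = self.data_queue.get()
            total += 1
            if self.existing.get(record["url"]) != record["data"]:
                changed += 1
            self.existing[record["url"]] = record["data"]
        if total and changed == 0:
            self.capture.crawl_frequency /= 2   # assumed policy: back off when static
```

In this sketch each crawl pass uses a fresh `MemoryModule`, so the de-duplication set applies within a pass while the analysis module keeps data across passes.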
The above transfer and collection process can be carried out through the cloud, which allows large-scale distributed concurrent collection, improves data collection efficiency, obtains the desired content accurately, and makes task customization as efficient as possible. In addition, through configurable templates, the web crawler operating system flexibly collects any structured content that a browser can display and supports many page types, including news, forums, blogs, and pictures.
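The text does not describe how work is divided among cloud nodes. One common approach, shown here purely as an assumption, is to assign each URL to a worker by hashing its host name, so that all pages of one website are handled by the same worker and per-site rate limits stay easy to enforce. `worker_for` and `worker_count` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url: str, worker_count: int) -> int:
    """Map a URL to a worker index based on a stable hash of its host."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; stable across processes and runs.
    return int.from_bytes(digest[:8], "big") % worker_count
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is randomized per process for strings and would give different assignments on different nodes.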
Compared with the prior art, the web crawler operating method and system of the present invention achieve the following effects:
1) The invention reduces the crawl frequency and crawl pressure of the web crawler, effectively reducing the excessive additional burden placed on websites.
2) The invention achieves large-scale distributed concurrent collection, greatly improving data collection efficiency and the efficiency of task customization.
3) The invention uses the cloud and achieves high accuracy in obtaining the desired content.
The above description shows and describes several preferred embodiments of the invention. As stated earlier, however, it should be understood that the invention is not limited to the forms disclosed here, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the relevant art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims (10)

1. a webpage spider operational method, it is characterised in that including:
Capture the URL of website by the parameter of predetermined manner and add memory queue to;
Whether the URL that described memory queue stores in judging it exists overlapping with the URL just adding entrance, If any, then ignore this and just add the URL entered;Such as nothing, the then net under the URL this firm interpolation entered Page captures data and travels through lower floor's link URL involved in this webpage, and judges that this lower floor links Whether URL exists overlap, if any, then ignore;Such as nothing, then the webpage under this lower floor's link URL is grabbed Fetching data, the most described memory queue judges whether untreated URL, such as nothing, then crawl is gone out Described data carry out resolving and extracting being transferred to data handling queues, wherein, to this according to pre-conditioned Just add the webpage capture data under the URL entered and traveled through lower floor's link involved in this webpage URL, is according to involved in the described pre-conditioned webpage parsed under this URL just having added entrance Lower floor link, and according to described pre-conditioned in traversal depth value carry out traversal search;
The data extracted are analyzed by described data handling queues with data with existing, and according to dividing Analysis object information revises the parameter of described predetermined manner, wherein, the parameter of described predetermined manner, further Including: initially capture address, capture data in frequency, the crawl time delay condition of Webpage and webpage Deposit queue condition.
2. webpage spider operational method as claimed in claim 1, it is characterised in that described pre-conditioned, Farther include: resolve the DOM data supporting class Jquery grammer, resolve and support json data and/or solution The data of script script are supported in analysis.
3. webpage spider operational method as claimed in claim 2, it is characterised in that described by presetting The parameter of mode captures the URL of website and adds memory queue to, further comprises: according to place be The initialization context of system, server performance, network broadband situation and crawl number of processes arrange one Preset template, is configured the parameter in predetermined manner by this preset template, by described default side The parameter of formula captures the URL of website and adds memory queue to.
4. webpage spider operational method as claimed in claim 3, it is characterised in that described crawl is gone out Described data carry out resolving and extracting being transferred to data handling queues according to pre-conditioned, be further: Described data crawl gone out support the DOM data of class Jquery grammer according to resolving in pre-conditioned, Resolve and support json data and/or resolve the data supporting script script, carry out resolving and extracting data, After the data extracted are packaged, it is transferred to data handling queues.
5. webpage spider operational method as claimed in claim 4, it is characterised in that described according to analysis Object information revises the parameter of described predetermined manner, is further: according to analysis result information by amendment Content in preset template, and the crawl frequency in the parameter of described predetermined manner is revised by this preset template Rate.
6. a spiders operating system, it is characterised in that including: handling module, memory modules With data analysis and processing module;It is characterized in that,
Described handling module, for capturing the URL of website, by this website by the parameter of predetermined manner URL transmits and adds the memory queue in described memory modules to;
Described memory modules, stores in the inner for receiving the URL of the website that described handling module transmits In memory queue, it is judged that in described memory queue, whether the URL of storage exists with the URL just adding entrance Overlap, if any, then ignore this and just add the URL entered;Such as nothing, the then URL this firm interpolation entered Under webpage capture data and travel through in this webpage involved lower floor's link URL, and judge this lower floor Whether link URL exists overlap, if any, then ignore;Such as nothing, then to the net under this lower floor's link URL Page captures data, and the most described memory queue judges whether untreated URL, such as nothing, then will grab The described data taken out carry out resolving and extracting being transferred to described Data Analysis Services mould according to pre-conditioned Block, wherein, to these firm webpage capture data added under the URL entered and travel through in this webpage involved And lower floor's link URL, be pre-conditioned to parse this according to described and just added under the URL entered Lower floor's link involved in webpage, and according to described pre-conditioned in traversal depth value carry out traversal and look into Look for;
Described Data Analysis Services module, the number that the process transmitted for receiving described memory modules extracts According to putting in its internal data handling queues, the described data handling queues data to extracting are with existing Data are analyzed, and revise the ginseng of predetermined manner in described handling module according to analysis result information Number, wherein, the parameter of predetermined manner in described handling module, farther include: initially capture address, Capture data in frequency, the crawl time delay condition of Webpage and webpage and deposit queue condition.
7. The webpage spider operating system as claimed in claim 6, characterized in that the preset conditions in the memory module further include: parsing support for DOM data with jQuery-like syntax, parsing support for JSON data, and/or parsing support for script data.
8. The webpage spider operating system as claimed in claim 7, characterized in that the handling module is further configured to set up a preset template according to the initialization environment of the host system, the server performance, the network bandwidth conditions, and the number of crawl processes, and to set the parameters of the predetermined manner through this preset template.
9. The webpage spider operating system as claimed in claim 8, characterized in that
the memory module is further configured to parse the captured data and extract data from it according to the parsing support, in the preset conditions, for DOM data with jQuery-like syntax, for JSON data, and/or for script data, and, after packaging the extracted data, to transmit it to the data handling queue in the Data Analysis Services module.
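The parse, extract and package step of claim 9 can be sketched as follows. This is a hedged illustration using only the Python standard library: `ClassTextExtractor` and `extract` are hypothetical names, the class-based lookup is a very small stand-in for a jQuery-like `.class` selector, the markup is assumed to be well formed, and script data is left out of the sketch.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self._depth = 0              # > 0 while inside a matching element
        self.texts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.css_class in classes:
            self._depth += 1
            if self._depth == 1:     # a new matching element starts here
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


def extract(raw: str, kind: str, css_class: str = "") -> Dict[str, Any]:
    """Parse `raw` as JSON or DOM data and package the extracted payload."""
    if kind == "json":
        payload: Any = json.loads(raw)
    elif kind == "dom":
        parser = ClassTextExtractor(css_class)
        parser.feed(raw)
        payload = [text.strip() for text in parser.texts]
    else:
        raise ValueError(f"unsupported kind: {kind}")
    return {"kind": kind, "payload": payload}   # the packaged record


page = '<div><p class="title">Hello <b>world</b></p><p>skip</p></div>'
print(extract(page, "dom", "title"))
print(extract('{"price": 12}', "json"))
```

Each returned dictionary plays the role of a packaged record that would be handed to the data handling queue.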
10. The webpage spider operating system as claimed in claim 9, characterized in that
the Data Analysis Services module is further configured to modify the content of the preset template in the handling module according to the analysis result information, and to revise, through this preset template, the crawl frequency in the parameters of the predetermined manner.
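The feedback loop of claims 8 and 10, in which comparing newly extracted data with existing data leads to a revised crawl frequency in the preset template, might look like the sketch below. The claims do not state an adjustment rule, so the halve-or-double policy, the interval bounds, and the names `PresetTemplate` and `revise_frequency` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PresetTemplate:
    """The four preset-mode parameters listed in claim 6."""
    start_url: str             # initial crawl address
    crawl_interval_s: float    # crawl frequency, as seconds between crawls
    page_delay_s: float        # crawl delay condition for webpages
    queue_limit: int           # condition for depositing page data into the queue


def revise_frequency(template: PresetTemplate,
                     old_data: Any,
                     new_data: Any,
                     min_interval_s: float = 60.0,
                     max_interval_s: float = 86400.0) -> PresetTemplate:
    """Assumed policy: crawl more often if the data changed, less often if not."""
    changed = old_data != new_data
    factor = 0.5 if changed else 2.0
    interval = template.crawl_interval_s * factor
    # Clamp so the crawler neither hammers the site nor stops visiting it.
    template.crawl_interval_s = min(max(interval, min_interval_s), max_interval_s)
    return template


template = PresetTemplate("http://example.com/", 600.0, 1.0, 1000)
revise_frequency(template, {"price": 1}, {"price": 1})   # unchanged data
print(template.crawl_interval_s)
```

Lengthening the interval when nothing has changed is one way to reduce the extra load on the crawled website that the abstract describes as the problem to be solved.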
CN201310181364.XA 2013-05-16 2013-05-16 Webpage spider operational method and system Active CN103279507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310181364.XA CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Publications (2)

Publication Number Publication Date
CN103279507A (en) 2013-09-04
CN103279507B (en) 2016-12-28

Family

ID=49062027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310181364.XA Active CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Country Status (1)

Country Link
CN (1) CN103279507B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942309B (en) * 2014-04-18 2017-06-30 网易乐得科技有限公司 A kind of implementation method of Network Data Capture equipment, method and acquisition process
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A kind of unit crawler capturing method and system
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104572901B (en) * 2014-12-25 2018-12-18 小米科技有限责任公司 The method for down loading and device of web data
CN106202300A (en) * 2016-06-30 2016-12-07 浪潮软件集团有限公司 Network information acquisition method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN106649720B (en) * 2016-12-22 2020-10-13 北京一览群智数据科技有限责任公司 Data processing method and server
CN109213824B (en) * 2017-06-29 2022-03-04 北京京东尚科信息技术有限公司 Data capture system, method and device
CN107480264B (en) * 2017-08-17 2019-11-15 北京知道创宇信息技术股份有限公司 A kind of web crawlers De-weight method and calculate equipment
CN108763279B (en) * 2018-04-11 2020-12-15 北京中科闻歌科技股份有限公司 Webpage data distributed template acquisition method and system
CN110851746B (en) * 2018-07-27 2022-08-12 北京国双科技有限公司 Crawler seed generation method and device
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103092999A (en) * 2013-02-22 2013-05-08 人民搜索网络股份公司 Webpage crawling cycle adjusting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9922119B2 (en) * 2007-11-08 2018-03-20 Entit Software Llc Navigational ranking for focused crawling

Also Published As

Publication number Publication date
CN103279507A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
CN103279507B (en) Webpage spider operational method and system
CN105243159B (en) A kind of distributed network crawler system based on visualization script editing machine
CN102486799B (en) World wide web (WWW) page processing method and device
CN110020062B (en) Customizable web crawler method and system
CN104750471A (en) WEB page performance detection and analysis plug-in and method based on browser
CN102739663A (en) Detection method and scanning engine of web pages
CN108664559A (en) A kind of automatic crawling method of website and webpage source code
CN103927370A (en) Network information batch acquisition method of combined text and picture information
CN101441629A (en) Automatic acquiring method of non-structured web page information
US20150120692A1 (en) Method, device, and system for acquiring user behavior
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN106446113A (en) Mobile big data analysis method and device
CN107766509A (en) A kind of method and apparatus of webpage static backup
CN106599270B (en) Network data capturing method and crawler
CN103455600A (en) Video URL (Uniform Resource Locator) grabbing method and device and server equipment
US10491606B2 (en) Method and apparatus for providing website authentication data for search engine
CN103312692B (en) Chained address safety detecting method and device
CN104281629A (en) Method and device for extracting picture from webpage and client equipment
CN103036746B (en) Passive measurement method and passive measurement system of web page responding time based on network intermediate point
CN104636340A (en) Webpage URL filtering method, device and system
CN103927367A (en) Microblog acquisition system and method based on events
CN105930385A (en) Data crawling method and system
CN103354546A (en) Message filtering method and message filtering apparatus
US9749352B2 (en) Apparatus and method for collecting harmful website information
CN103117892B (en) Add method and the device of website visiting record

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190923

Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1

Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.

Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing

Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.
