CN103279507B - Webpage spider operational method and system - Google Patents
- Publication number
- CN103279507B (application number CN201310181364.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- url
- webpage
- parameter
- crawl
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a web crawler operation method and system. The method mainly comprises: crawling the URLs of a website according to preset-mode parameters and adding them to an in-memory queue; the in-memory queue determining whether a newly added URL duplicates a URL it already stores; crawling data from the web page at that URL, traversing the lower-level link URLs contained in the page, and determining whether each of them is a duplicate; crawling data from the web pages at those lower-level link URLs, then determining whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing it to a data processing queue; and the data processing queue comparing this data with existing data and revising the crawl frequency in the preset-mode parameters according to the analysis result. The invention addresses the prior-art problems that web crawlers place an excessive additional burden on websites and cannot obtain website information accurately and effectively.
Description
Technical field
The present invention relates to the field of network technology and, specifically, to a web crawler operation method and system.
Background
A search engine is a system that, following a certain strategy, uses specific computer programs to collect information from the Internet, organizes and processes that information, provides a retrieval service, and presents the information relevant to a user's query. To collect information from the Internet, a search engine relies on a web crawler that crawls the relevant websites.
A web crawler is a program that automatically obtains web page content and is an important component of a search engine.
In the prior art, a traditional crawler for a general-purpose search engine starts from the URLs of one or several initial pages, obtains the URLs on those pages and, while crawling, continually extracts new URLs from the current page and puts them into a queue until some stop condition of the system is met.
In the current prior art, web crawlers analyze web page content poorly and can only fetch website information mechanically and continuously, repeatedly crawling in a loop with tens or even hundreds of concurrent requests. Their crawl frequency and crawl pressure are very high, so they consume a large amount of site resources, burden the website, and may even cause it to crash. At the same time, they cannot crawl the useful information on a website accurately and efficiently.
Therefore, how to keep a web crawler from placing an excessive additional burden on a website while obtaining website information accurately and effectively has become a technical problem that urgently needs to be solved.
Summary of the invention
The technical problem to be solved by the present invention is to provide a web crawler operation method and system, so as to solve the prior-art problems that web crawlers place an excessive additional burden on websites and cannot obtain website information accurately and effectively.
To solve the above technical problem, the invention provides a web crawler operation method, characterized by comprising:
crawling the URLs of a website according to preset-mode parameters and adding them to an in-memory queue;
the in-memory queue determining whether a newly added URL duplicates a URL it already stores; if so, ignoring the URL; if not, crawling data from the web page at the URL, traversing the lower-level link URLs contained in that page, and determining whether each lower-level link URL is a duplicate; if so, ignoring it; if not, crawling data from the web page at the lower-level link URL; the in-memory queue then determining whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing it to a data processing queue;
the data processing queue comparing this data with existing data and revising the preset-mode parameters according to the analysis result.
Preferably, the preset-mode parameters further include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
Preferably, the preset conditions further include: parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data.
Preferably, crawling the URLs of a website according to the preset-mode parameters and adding them to the in-memory queue further comprises: setting a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; setting the preset-mode parameters through this preset template; and crawling the URLs of the website according to the preset-mode parameters and adding them to the in-memory queue.
Preferably, parsing and extracting the crawled data according to the preset conditions and passing it to the data processing queue further comprises: parsing the crawled data and extracting information according to the preset conditions (parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), encapsulating the information, and passing it to the data processing queue.
Preferably, revising the preset-mode parameters according to the analysis result further comprises: modifying the content of the preset template according to the analysis result, and revising the crawl frequency in the preset-mode parameters through this preset template.
To solve the above technical problem, the invention also provides a web crawler operation system, characterized by comprising a crawl module, a memory module, and a data analysis and processing module, wherein:
the crawl module is configured to crawl the URLs of a website according to preset-mode parameters and to pass the URLs to the in-memory queue in the memory module;
the memory module is configured to receive the website URLs passed by the crawl module and store them in its in-memory queue, and to determine whether a newly added URL duplicates a URL already stored in the in-memory queue; if so, the URL is ignored; if not, data is crawled from the web page at the URL, the lower-level link URLs contained in that page are traversed, and each lower-level link URL is checked for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the in-memory queue then determines whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data analysis and processing module;
the data analysis and processing module is configured to receive the extracted data passed by the memory module and place it in its internal data processing queue; the data processing queue compares this data with existing data, and the preset-mode parameters in the crawl module are revised according to the analysis result.
Preferably, the preset-mode parameters in the crawl module further include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
Preferably, the preset conditions in the memory module further include: parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data.
Preferably, the crawl module is further configured to set a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and to set the preset-mode parameters through this preset template.
Preferably, the memory module is further configured to parse the crawled data and extract information according to the preset conditions (parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), to encapsulate the information, and to pass it to the data processing queue in the data analysis and processing module.
Preferably, the data analysis and processing module is further configured to modify the content of the preset template in the crawl module according to the analysis result, and to revise the crawl frequency in the preset-mode parameters through this preset template.
Compared with the prior art, the web crawler operation method and system of the present invention achieve the following effects:
1) The invention lowers the crawl frequency and crawl pressure of the web crawler, effectively reducing the excessive additional burden placed on websites.
2) The invention enables large-scale distributed concurrent collection, greatly improving data collection efficiency and the efficiency of task customization.
3) The invention uses the cloud to obtain the desired content with high accuracy.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow chart of the web crawler operation method of Embodiment 1 of the present invention;
Fig. 2 is a structural block diagram of the web crawler operation system of Embodiment 2 of the present invention.
Detailed description
Certain terms are used in the description and claims to refer to specific components. Those skilled in the art will understand that hardware manufacturers may refer to the same component by different names. This description and the claims distinguish components not by differences in name but by differences in function. "Comprising", as used throughout the description and claims, is an open-ended term and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and essentially achieve the technical effect. In addition, "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text states that a first device is coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
The invention is described in further detail below with reference to the drawings, but this description does not limit the invention.
Embodiment 1
Fig. 1 shows the flow of the web crawler operation method of Embodiment 1 of the present invention.
Step 101: crawl the URLs (Uniform/Universal Resource Locator, web page address) of a website according to preset-mode parameters and add them to an in-memory queue;
Step 102: the in-memory queue determines whether a newly added URL duplicates a URL it already stores; if so, the URL is ignored; if not, data is crawled from the web page at the URL, the lower-level link URLs contained in that page are traversed, and each lower-level link URL is checked for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the in-memory queue then determines whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to a data processing queue;
Here, crawling data from the web page at the URL and traversing the lower-level link URLs contained in that page means parsing the lower-level links in the page according to the preset conditions and performing the traversal search according to the traversal depth value in the preset conditions. The same applies below and is not repeated.
Step 103: the data processing queue compares this data with existing data and revises the crawl frequency in the preset-mode parameters according to the analysis result. (Embodiment 1 and the later embodiment revise the crawl frequency; the revision is of course not limited to this parameter, and other preset-mode parameters may also be revised. This is not repeated below.)
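The deduplication and depth-limited traversal of steps 101 and 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function `crawl`, the callable `fetch_links`, the `SITE` dictionary, and the example URLs are all hypothetical, and a dictionary of links stands in for real page downloads.

```python
from collections import deque


def crawl(start_urls, fetch_links, max_depth):
    """Breadth-first crawl with URL deduplication and a traversal depth limit.

    fetch_links(url) is a hypothetical callable returning the link URLs found
    on the page at `url`; a real crawler would download and parse the page.
    """
    seen = set()      # URLs already stored in the in-memory queue
    queue = deque()   # the in-memory (deduplication) queue
    crawled = []      # order in which pages were crawled

    for url in start_urls:
        if url not in seen:          # ignore a URL that duplicates a stored one
            seen.add(url)
            queue.append((url, 0))

    while queue:                     # loop until no unprocessed URL remains
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:       # do not follow links past the depth limit
            continue
        for link in fetch_links(url):
            if link not in seen:     # lower-level link is not a duplicate
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled


# Toy link graph standing in for real pages (hypothetical URLs).
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

pages = crawl(["http://example.com/"], lambda u: SITE.get(u, []), max_depth=1)
print(pages)
```

With `max_depth=1` the sketch crawls the start page and its direct links only; raising the depth value lets the traversal reach pages linked from those.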
The preset-mode parameters in step 101 include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
Further, the preset mode is set up as follows: a preset template is set according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and the preset-mode parameters are set through this preset template.
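One way to picture the relationship between the preset template and the preset-mode parameters is the sketch below. The patent does not specify field names or how environment facts map to parameter values, so `PresetTemplate`, `PresetParams`, `params_from_template`, and the heuristic inside it are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PresetTemplate:
    """Environment facts the template is derived from (field names assumed)."""
    cpu_cores: int          # stands in for "server performance"
    bandwidth_mbps: float   # network bandwidth
    crawl_processes: int    # requested number of crawl processes


@dataclass
class PresetParams:
    start_url: str           # initial crawl address
    crawl_interval_s: float  # crawl frequency, as seconds between crawls
    page_delay_s: float      # crawl delay condition for a page
    queue_rule: str          # condition for placing page data into the queue


def params_from_template(template, start_url, queue_rule):
    # Hypothetical heuristic: cap processes at the core count, then slow
    # down when each process gets less than 1 Mbit/s of bandwidth.
    processes = max(1, min(template.crawl_processes, template.cpu_cores))
    per_process = template.bandwidth_mbps / processes
    page_delay = 1.0 if per_process >= 1.0 else 5.0
    return PresetParams(start_url, crawl_interval_s=3600.0,
                        page_delay_s=page_delay, queue_rule=queue_rule)


template = PresetTemplate(cpu_cores=4, bandwidth_mbps=2.0, crawl_processes=8)
params = params_from_template(template, "http://example.com/", "div.article")
print(params)
```

Keeping the environment facts in one object and the derived parameters in another mirrors the text: later revisions go through the template rather than touching the parameters directly.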
In step 103, revising the crawl frequency in the preset mode according to the analysis result further comprises: modifying the content of the preset template according to the analysis result, and revising the crawl frequency in the preset-mode parameters through this preset template.
Further, the in-memory queue is also referred to as the deduplication queue in this embodiment; those skilled in the art will readily understand that the deduplication queue and the in-memory queue mean the same thing. This is not repeated below.
Further, the preset conditions in step 102 include: parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data.
In step 102, parsing and extracting the crawled data according to the preset conditions and passing it to the data processing queue further comprises: parsing the crawled data and extracting information (including data information and binary files such as pictures and/or Flash) according to the preset conditions (parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), encapsulating the information, and passing it to the data processing queue.
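The parse, extract, and encapsulate sequence can be illustrated with the standard library alone. This is a simplified sketch: it selects DOM text by tag name rather than by jQuery-like selectors, it does not cover script data or binary files, and the names `TagTextCollector` and `extract` and the shape of the returned dictionary are assumptions.

```python
import json
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


def extract(raw, kind, rule):
    """Parse `raw` according to a preset condition and package the result."""
    if kind == "dom":       # rule is a tag name, e.g. "h1"
        collector = TagTextCollector(rule)
        collector.feed(raw)
        values = collector.texts
    elif kind == "json":    # rule is a top-level key
        values = [json.loads(raw)[rule]]
    else:
        raise ValueError("unsupported kind: " + kind)
    # Encapsulate the extracted information before it is queued.
    return {"kind": kind, "rule": rule, "values": values}


dom_item = extract("<html><h1>News</h1><p>body</p></html>", "dom", "h1")
json_item = extract('{"title": "News", "views": 3}', "json", "title")
print(dom_item, json_item)
```

A production crawler would more likely use a selector library for jQuery-like queries; the tag-name rule here only keeps the example dependency-free.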
The purpose of the comparison in step 103 is to determine how frequently the website's data changes and to revise the crawl frequency accordingly, thereby solving the prior-art problems that web crawlers place an excessive additional burden on websites and cannot obtain website information accurately and effectively.
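A sketch of how a comparison with existing data might drive the crawl frequency is shown below. The patent does not state a concrete adjustment rule, so the halve-or-double policy, the bounds, and the names `fingerprint` and `revise_interval` are assumptions chosen only to make the idea concrete.

```python
import hashlib


def fingerprint(data):
    """Stable digest used to compare newly crawled data with stored data."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def revise_interval(interval_s, old_data, new_data, min_s=60.0, max_s=86400.0):
    """Return a revised crawl interval in seconds.

    Assumed policy: if the site's data changed since the last crawl, crawl
    twice as often; if it did not change, crawl half as often. The result
    is clamped to [min_s, max_s].
    """
    changed = fingerprint(old_data) != fingerprint(new_data)
    if changed:
        return max(min_s, interval_s / 2)
    return min(max_s, interval_s * 2)


print(revise_interval(600.0, "page v1", "page v1"))  # unchanged data
print(revise_interval(600.0, "page v1", "page v2"))  # changed data
```

Backing off on unchanged sites is what reduces load on the crawled website, while changed sites are revisited sooner so that fresh content is still collected.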
The concrete operation of Embodiment 1 may be as follows:
First, a preset template is set according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and the preset-mode parameters are set through this preset template. The number of concurrent crawl processes and the number of websites are set, and the crawl processes are distributed evenly across the websites. In this way each website is crawled at a reasonable rate, the crawler does not subject the crawled website to the pressure of intensive access, and efficiency is not sacrificed;
Second, the URLs of a website are crawled according to the preset-mode parameters and added to the in-memory queue. A website has many pages, and without a template the crawler would crawl every page of the website; not every page is useful to the user, so network resources would be wasted. With the preset template, the crawler crawls only the data the user is interested in. The preset-mode parameters include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue;
Third, the in-memory queue determines whether a newly added URL duplicates a URL it already stores; if so, the URL is ignored; if not, data is crawled from the web page at the URL, the lower-level link URLs contained in that page are traversed, and each lower-level link URL is checked for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the in-memory queue then determines whether any unprocessed URL remains and, if none remains, the crawled data is parsed and information (including data information and binary files such as pictures and/or Flash) is extracted according to the preset conditions (which include parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), and the information is encapsulated and passed to the data processing queue;
Finally, the data processing queue compares this data with existing data, the content of the preset template is modified according to the analysis result, and the crawl frequency in the preset-mode parameters is revised through this preset template.
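The even distribution of crawl processes across websites mentioned in the first operation can be sketched as a simple integer split. The function name `allocate_processes` and the choice to give any remainder to the first sites in the list are assumptions; the patent only says the processes are distributed evenly.

```python
def allocate_processes(num_processes, sites):
    """Split crawl processes across sites as evenly as integer counts allow."""
    if not sites:
        raise ValueError("at least one site is required")
    base, extra = divmod(num_processes, len(sites))
    # The first `extra` sites receive one additional process each.
    return {site: base + (1 if i < extra else 0)
            for i, site in enumerate(sites)}


plan = allocate_processes(7, ["site-a", "site-b", "site-c"])
print(plan)
```

Because no site receives more than one process above its fair share, concurrent requests are spread out instead of concentrating on a single website.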
Embodiment 2
As shown in Fig. 2, the web crawler operation system of Embodiment 2 of the present invention comprises a crawl module 201, a memory module 202, and a data analysis and processing module 203, wherein:
The crawl module 201 is coupled to the memory module 202 and the data analysis and processing module 203, and is configured to crawl the URLs (Uniform/Universal Resource Locator, web page address) of a website according to preset-mode parameters and to pass the URLs to the in-memory queue in the memory module 202.
The memory module 202 is coupled to the crawl module 201 and the data analysis and processing module 203, and is configured to receive the website URLs passed by the crawl module 201 and store them in its in-memory queue, and then to determine whether a newly added URL duplicates a URL already stored in the in-memory queue; if so, the URL is ignored; if not, data is crawled from the web page at the URL, the lower-level link URLs contained in that page are traversed, and each lower-level link URL is checked for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the in-memory queue then determines whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data processing queue of the data analysis and processing module 203.
The preset conditions further include: parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data.
Further, the memory module 202 parses the crawled data and extracts information (including data information and binary files such as pictures and/or Flash) according to the preset conditions (parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), encapsulates the information, and passes it to the data processing queue in the data analysis and processing module 203.
The data analysis and processing module 203 is coupled to the crawl module 201 and the memory module 202, and is configured to receive the extracted data passed by the memory module 202 and place it in its internal data processing queue; the data processing queue compares this data with existing data, and the crawl frequency in the preset-mode parameters of the crawl module 201 is revised according to the analysis result.
In this embodiment, the preset-mode parameters further include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
The preset mode is set up as follows: the crawl module 201 sets a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes, and sets the preset-mode parameters through this preset template.
Further, the data analysis and processing module 203 modifies the content of the preset template in the crawl module 201 according to the analysis result, and revises the crawl frequency in the preset-mode parameters through this preset template.
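The coupling between the three modules can be pictured with the sketch below. It is an assumption-laden outline, not the patented system: the class and method names are hypothetical, pages are plain strings, and the feedback rule (double the crawl interval when a page's data is unchanged) is an illustrative placeholder for the unspecified analysis.

```python
from collections import deque


class CrawlModule:
    """Holds the preset-mode parameters (only the crawl interval here)."""

    def __init__(self, interval_s):
        self.interval_s = interval_s


class MemoryModule:
    """Owns the in-memory (deduplication) queue of URLs."""

    def __init__(self):
        self.seen = set()
        self.queue = deque()

    def add(self, url):
        if url in self.seen:    # duplicate URL: ignore it
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


class AnalysisModule:
    """Owns the data processing queue and feeds results back to the crawler."""

    def __init__(self, crawl_module):
        self.crawl_module = crawl_module
        self.processing_queue = deque()
        self.existing = {}      # url -> previously stored data

    def submit(self, url, data):
        self.processing_queue.append((url, data))

    def run(self):
        while self.processing_queue:
            url, data = self.processing_queue.popleft()
            if self.existing.get(url) == data:
                # Assumed rule: unchanged data means crawl less often.
                self.crawl_module.interval_s *= 2
            self.existing[url] = data


crawler = CrawlModule(interval_s=10.0)
memory = MemoryModule()
analysis = AnalysisModule(crawler)

first = memory.add("http://example.com/")
second = memory.add("http://example.com/")    # rejected as a duplicate
analysis.submit("http://example.com/", "same content")
analysis.submit("http://example.com/", "same content")
analysis.run()
print(first, second, crawler.interval_s)
```

The analysis module holds a reference to the crawl module so that it can revise the crawl parameters, which reflects the feedback path from module 203 back to module 201.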
The above transmission and collection process can be implemented in the cloud, which allows large-scale distributed concurrent collection, improves data collection efficiency, obtains the desired content accurately, and makes efficient task customization as convenient as possible. In addition, through configurable templates the web crawler operation system can flexibly collect any structured content visible in a browser and supports various page types, including news, forums, blogs, and pictures.
Compared with the prior art, the web crawler operation method and system of the present invention achieve the following effects:
1) The invention lowers the crawl frequency and crawl pressure of the web crawler, effectively reducing the excessive additional burden placed on websites.
2) The invention enables large-scale distributed concurrent collection, greatly improving data collection efficiency and the efficiency of task customization.
3) The invention uses the cloud to obtain the desired content with high accuracy.
The above description shows and describes several preferred embodiments of the invention. As stated earlier, however, the invention is not limited to the forms disclosed here, and these forms should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here, through the above teachings or through the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention all fall within the scope of protection of the appended claims.
Claims (10)
1. A web crawler operation method, characterized by comprising:
crawling the URLs of a website according to preset-mode parameters and adding them to an in-memory queue;
the in-memory queue determining whether a newly added URL duplicates a URL it already stores; if so, ignoring the newly added URL; if not, crawling data from the web page at the newly added URL, traversing the lower-level link URLs contained in that page, and determining whether each lower-level link URL is a duplicate; if so, ignoring it; if not, crawling data from the web page at the lower-level link URL; the in-memory queue then determining whether any unprocessed URL remains and, if none remains, parsing and extracting the crawled data according to preset conditions and passing it to a data processing queue; wherein crawling data from the web page at the newly added URL and traversing the lower-level link URLs contained in that page means parsing, according to the preset conditions, the lower-level links contained in the web page at the newly added URL and performing the traversal search according to the traversal depth value in the preset conditions;
the data processing queue comparing the extracted data with existing data and revising the preset-mode parameters according to the analysis result, wherein the preset-mode parameters further include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
2. The web crawler operation method of claim 1, characterized in that the preset conditions further include: parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data.
3. The web crawler operation method of claim 2, characterized in that crawling the URLs of a website according to the preset-mode parameters and adding them to the in-memory queue further comprises: setting a preset template according to the initialization environment of the host system, the server performance, the network bandwidth, and the number of crawl processes; setting the preset-mode parameters through this preset template; and crawling the URLs of the website according to the preset-mode parameters and adding them to the in-memory queue.
4. The web crawler operation method of claim 3, characterized in that parsing and extracting the crawled data according to the preset conditions and passing it to the data processing queue further comprises: parsing the crawled data and extracting data according to the preset conditions (parsing support for DOM data using jQuery-like syntax, for JSON data, and/or for script data), encapsulating the extracted data, and passing it to the data processing queue.
5. The web crawler operation method of claim 4, characterized in that revising the preset-mode parameters according to the analysis result further comprises: modifying the content of the preset template according to the analysis result, and revising the crawl frequency in the preset-mode parameters through this preset template.
6. A web crawler operation system, characterized by comprising a crawl module, a memory module, and a data analysis and processing module, wherein:
the crawl module is configured to crawl the URLs of a website according to preset-mode parameters and to pass the website URLs to the in-memory queue in the memory module;
the memory module is configured to receive the website URLs passed by the crawl module and store them in its in-memory queue, and to determine whether a newly added URL duplicates a URL already stored in the in-memory queue; if so, the newly added URL is ignored; if not, data is crawled from the web page at the newly added URL, the lower-level link URLs contained in that page are traversed, and each lower-level link URL is checked for duplication; if it is a duplicate, it is ignored; if not, data is crawled from the web page at the lower-level link URL; the in-memory queue then determines whether any unprocessed URL remains and, if none remains, the crawled data is parsed and extracted according to preset conditions and passed to the data analysis and processing module; wherein crawling data from the web page at the newly added URL and traversing the lower-level link URLs contained in that page means parsing, according to the preset conditions, the lower-level links contained in the web page at the newly added URL and performing the traversal search according to the traversal depth value in the preset conditions;
the data analysis and processing module is configured to receive the extracted data passed by the memory module and place it in its internal data processing queue; the data processing queue compares the extracted data with existing data, and the preset-mode parameters in the crawl module are revised according to the analysis result, wherein the preset-mode parameters in the crawl module further include: an initial crawl address, a crawl frequency, a crawl delay condition for web pages, and a condition for placing data within a web page into the queue.
7. spiders operating system as claimed in claim 6, it is characterised in that described memory modules
In pre-conditioned, farther include: resolve support class Jquery grammer DOM data, resolve support json
The data of script script are supported in data and/or parsing.
8. spiders operating system as claimed in claim 7, it is characterised in that described handling module,
Be additionally operable to further the initialization context according to place system, server performance, network broadband situation, with
And crawl number of processes arranges a preset template, by this preset template, the parameter in predetermined manner is entered
Row is arranged.
9. The webpage spider operating system as claimed in claim 8, characterized in that
the memory module is further configured to parse the captured data and extract data from it
according to the parsing options in the preset condition, namely parsing DOM data with support
for jQuery-like syntax, parsing JSON data, and/or parsing script data, and, after packaging
the extracted data, to transmit it to the data handling queue in the Data Analysis Services
module.
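
The parse, extract, package, and enqueue sequence can be sketched roughly as follows, using only the Python standard library. The rule dictionaries, the `ClassTextCollector` helper (a very small stand-in for a jQuery-like `.class` selector), and the record layout are all assumptions for this example, not the patent's implementation.

```python
import json
import queue
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag


class ClassTextCollector(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")  # start a new match

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(raw, rule):
    """Apply one hypothetical parsing rule (dom, json, or script) to raw text."""
    kind = rule["type"]
    if kind == "dom":
        parser = ClassTextCollector(rule["css_class"])
        parser.feed(raw)
        parser.close()
        return [text.strip() for text in parser.results]
    if kind == "json":
        return json.loads(raw)[rule["key"]]
    if kind == "script":
        match = re.search(rule["pattern"], raw)
        return match.group(1) if match else None
    raise ValueError(f"unsupported rule type: {kind}")


def package_and_enqueue(url, raw, rule, handling_queue):
    """Package the extracted data and hand it to the data handling queue."""
    record = {"url": url, "type": rule["type"], "data": extract(raw, rule)}
    handling_queue.put(record)
    return record
```

A caller would create one `queue.Queue()` shared with the analysis side and call `package_and_enqueue` once per captured page and rule.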
10. The webpage spider operating system as claimed in claim 9, characterized in that
the Data Analysis Services module is further configured to revise, according to the analysis
result information, the content of the preset template in the handling module, and to revise
the crawl frequency in the parameters of the predetermined manner by means of this preset
template.
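
One possible reading of this feedback step is sketched below: newly extracted records are compared with the stored ones, and the crawl interval held in the template is shortened or lengthened accordingly. The `change_ratio` measure, the thresholds, and the halving and doubling rule are assumptions for illustration; the claim only states that the crawl frequency is revised according to the analysis result.

```python
def change_ratio(new_records, existing_records):
    """Fraction of newly extracted records that are not already stored.

    Records must be hashable (for example strings or tuples).
    """
    if not new_records:
        return 0.0
    existing = set(existing_records)
    changed = sum(1 for record in new_records if record not in existing)
    return changed / len(new_records)


def revise_interval(template, ratio, low=0.1, high=0.5,
                    min_interval_s=600.0, max_interval_s=86400.0):
    """Update template['crawl_interval_s'] in place and return the new value."""
    interval = template["crawl_interval_s"]
    if ratio >= high:
        interval /= 2   # the site changes often: crawl more frequently
    elif ratio <= low:
        interval *= 2   # the site rarely changes: back off to reduce its load
    # Keep the interval inside the allowed range.
    template["crawl_interval_s"] = min(max_interval_s, max(min_interval_s, interval))
    return template["crawl_interval_s"]


template = {"crawl_interval_s": 3600.0}  # hypothetical preset template content
unchanged = change_ratio(["a", "b", "c", "d"], ["a", "b", "c", "d"])
after_quiet_round = revise_interval(template, unchanged)
mostly_new = change_ratio(["a", "x", "y", "z"], ["a"])
after_busy_round = revise_interval(template, mostly_new)
```

Backing off when nothing changes is what lets the crawler avoid placing unnecessary extra load on the target site, which is the problem the abstract sets out to solve.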
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310181364.XA CN103279507B (en) | 2013-05-16 | 2013-05-16 | Webpage spider operational method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310181364.XA CN103279507B (en) | 2013-05-16 | 2013-05-16 | Webpage spider operational method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103279507A (en) | 2013-09-04 |
CN103279507B (en) | 2016-12-28 |
Family ID: 49062027
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310181364.XA Active CN103279507B (en) | 2013-05-16 | 2013-05-16 | Webpage spider operational method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103279507B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103942309B (en) * | 2014-04-18 | 2017-06-30 | 网易乐得科技有限公司 | A kind of implementation method of Network Data Capture equipment, method and acquisition process |
CN104252530B (en) * | 2014-09-10 | 2017-09-15 | 北京京东尚科信息技术有限公司 | A kind of unit crawler capturing method and system |
CN104391917A (en) * | 2014-11-19 | 2015-03-04 | 四川长虹电器股份有限公司 | Method for incrementally capturing webpage contents |
CN104572901B (en) * | 2014-12-25 | 2018-12-18 | 小米科技有限责任公司 | The method for down loading and device of web data |
CN106202300A (en) * | 2016-06-30 | 2016-12-07 | 浪潮软件集团有限公司 | Network information acquisition method and device |
CN106126747A (en) * | 2016-07-14 | 2016-11-16 | 北京邮电大学 | Data capture method based on reptile and device |
CN106649720B (en) * | 2016-12-22 | 2020-10-13 | 北京一览群智数据科技有限责任公司 | Data processing method and server |
CN109213824B (en) * | 2017-06-29 | 2022-03-04 | 北京京东尚科信息技术有限公司 | Data capture system, method and device |
CN107480264B (en) * | 2017-08-17 | 2019-11-15 | 北京知道创宇信息技术股份有限公司 | A kind of web crawlers De-weight method and calculate equipment |
CN108763279B (en) * | 2018-04-11 | 2020-12-15 | 北京中科闻歌科技股份有限公司 | Webpage data distributed template acquisition method and system |
CN110851746B (en) * | 2018-07-27 | 2022-08-12 | 北京国双科技有限公司 | Crawler seed generation method and device |
CN109063144A (en) * | 2018-08-07 | 2018-12-21 | 广州金猫信息技术服务有限公司 | Visual network crawler method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101635718A (en) * | 2009-08-26 | 2010-01-27 | 中兴通讯股份有限公司 | Network crawler system and method for acquiring resource as well as network resource gripping device |
CN102930059A (en) * | 2012-11-26 | 2013-02-13 | 电子科技大学 | Method for designing focused crawler |
CN103067521A (en) * | 2013-01-08 | 2013-04-24 | 中国科学院声学研究所 | Distributed-type nodes and distributed-type system in a crawler cluster |
CN103092999A (en) * | 2013-02-22 | 2013-05-08 | 人民搜索网络股份公司 | Webpage crawling cycle adjusting method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9922119B2 (en) * | 2007-11-08 | 2018-03-20 | Entit Software Llc | Navigational ranking for focused crawling |
Also Published As
Publication number | Publication date |
---|---|
CN103279507A (en) | 2013-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103279507B (en) | Webpage spider operational method and system | |
CN105243159B (en) | A kind of distributed network crawler system based on visualization script editing machine | |
CN102486799B (en) | World wide web (WWW) page processing method and device | |
CN110020062B (en) | Customizable web crawler method and system | |
CN104750471A (en) | WEB page performance detection and analysis plug-in and method based on browser | |
CN102739663A (en) | Detection method and scanning engine of web pages | |
CN108664559A (en) | A kind of automatic crawling method of website and webpage source code | |
CN103927370A (en) | Network information batch acquisition method of combined text and picture information | |
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
US20150120692A1 (en) | Method, device, and system for acquiring user behavior | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
CN106446113A (en) | Mobile big data analysis method and device | |
CN107766509A (en) | A kind of method and apparatus of webpage static backup | |
CN106599270B (en) | Network data capturing method and crawler | |
CN103455600A (en) | Video URL (Uniform Resource Locator) grabbing method and device and server equipment | |
US10491606B2 (en) | Method and apparatus for providing website authentication data for search engine | |
CN103312692B (en) | Chained address safety detecting method and device | |
CN104281629A (en) | Method and device for extracting picture from webpage and client equipment | |
CN103036746B (en) | Passive measurement method and passive measurement system of web page responding time based on network intermediate point | |
CN104636340A (en) | Webpage URL filtering method, device and system | |
CN103927367A (en) | Microblog acquisition system and method based on events | |
CN105930385A (en) | Data crawling method and system | |
CN103354546A (en) | Message filtering method and message filtering apparatus | |
US9749352B2 (en) | Apparatus and method for collecting harmful website information | |
CN103117892B (en) | Add method and the device of website visiting record |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20190923. Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1. Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd. Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing. Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd. |