CN102932448B - URL deduplication system and method for a distributed web crawler - Google Patents


Publication number
CN102932448B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210425213.XA
Other languages
Chinese (zh)
Other versions
CN102932448A (en)
Inventor
刘述
徐贵宝
江文学
何宝宏
高强
赵劲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Information and Communications Technology CAICT
Original Assignee
Research Institute of Telecommunications Transmission Ministry of Industry and Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Institute of Telecommunications Transmission, Ministry of Industry and Information Technology
Priority to CN201210425213.XA
Publication of CN102932448A
Application granted
Publication of CN102932448B

Abstract

A URL deduplication system and method for a distributed web crawler. The system comprises crawler collection child nodes, a central server, and a database server. The method comprises: a crawler collection child node registers with the central server; the child node takes a URL from the database waiting queue and obtains new URL information from that URL; the child node performs first-level deduplication on the newly obtained URL and discards the URL if the first-level check does not pass; if the first-level check passes, the child node adds the new URL to its local URL digest table and sends it to the central server; the central server performs second-level deduplication on the new URL and, if the second-level check passes, adds the URL to the global URL digest table; the child node then adds the link for this URL to the waiting queue. Through this hierarchical deduplication mechanism, the deduplication work formerly concentrated on a central node is distributed to the individual crawler collection child nodes as first-level deduplication, while the central server maintains a global deduplication table through second-level deduplication. The system is easy to scale, and its design, deployment, and operation become flexible and convenient.

Description

URL deduplication system and method for a distributed web crawler
Technical field
The present invention relates to the Internet field, and in particular to a URL deduplication system and method for a distributed web crawler based on a distributed architecture.
Background art
Existing distributed web crawlers can be divided, by communication mode, into three types: master-slave mode, autonomous mode, and hybrid mode. In master-slave mode, one host acts as the control node and manages all hosts that run crawlers. A crawler only needs to receive tasks from the control node and submit newly generated tasks back to it, without communicating with other crawlers; this mode is simple to implement and easy to manage. The control node, in turn, must communicate with every crawler and needs an address list that stores information about all crawlers in the system. When the number of crawlers in the system changes, the control node updates the address list, and this process is transparent to the crawlers. However, as the number of crawled web pages grows, the control node becomes the bottleneck of the whole system and degrades the performance of the entire distributed crawler system. In autonomous mode, the system has no coordinator and all crawlers must communicate with one another, which makes each crawler more complex than in master-slave mode. Each crawler maintains an address list that stores the location of every crawler in the system, so data can be sent directly to the crawler that needs it. When the number of crawlers changes, the address list of every crawler must be updated. Hybrid mode is a compromise that combines features of the two modes above. All crawlers can communicate with one another and all have task-allocation capability, but one special crawler centrally distributes the tasks that remain unallocated after the crawlers' own task allocation. Each ordinary crawler only needs to maintain the address list for its own collection range, while the special crawler must keep, in addition to the address list for its own collection range, the address list needed for centralized distribution.
URL deduplication is a core part of a web crawler system. Current web crawlers mainly use four URL deduplication approaches: database-based, in-memory hash-based, disk path-based, and Bloom filter-based. Database-based deduplication works by traversing every record in the database; it is very stable but very inefficient, and database performance drops sharply once the number of records exceeds one million, so its time cost becomes the bottleneck of the whole distributed system. In-memory hash-based deduplication uses a hash table in memory, but consumes too much memory. Disk path-based deduplication MD5-encodes each link and creates a corresponding path on disk; it has relatively high time efficiency but is not suitable for a large-scale distributed crawler system that crawls the whole web. Finally, Bloom filter-based deduplication is a commonly used scheme; it is fast and space-efficient but has a certain false-positive rate, and it is also a memory-based approach.
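As a rough illustration of the Bloom filter approach mentioned above, the following Python sketch stores URLs as bits in a fixed-size array. It is not taken from the patent: the class name `BloomFilter`, the bit-array size, the number of hash functions, and the use of MD5 with double hashing are all assumptions chosen for brevity.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter for URL membership tests (illustrative only)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        # Sizes are arbitrary assumptions, not values from the patent.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str) -> Iterator[int]:
        # Derive k bit positions from the two halves of one MD5 digest
        # (double hashing). MD5 is an assumption made for this sketch.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        # Set every bit position derived from the URL.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )
```

A membership test that returns False is always correct, while True may occasionally be wrong; this is the false-positive rate the paragraph above refers to.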
A typical distributed web crawler chooses the more stable database-based deduplication, but its efficiency is too low for large data volumes. In master-slave mode, deduplication is concentrated on the central control node: each collection child node only processes the tasks assigned to it, and tasks go through centralized deduplication before being assigned, which places very high demands on the central control node.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a new URL deduplication system and method for a distributed web crawler.
To achieve the above object, the present invention adopts the following technical solution:
A URL deduplication system for a distributed web crawler, characterized in that it comprises crawler collection child nodes, a central server, and a database server, wherein:
The crawler collection child nodes are deployed on multiple servers within the network and perform first-level deduplication;
The central server performs second-level deduplication; each crawler collection child node must register with the central server to obtain identity authentication; the central server maintains an address list that records the address information of the currently active crawler collection child nodes;
The database server distributes crawl tasks and stores collection results; the crawler collection child nodes obtain tasks from the waiting queue on the database server and store the collection results in the database server after processing.
Further, in the above URL deduplication system, each crawler collection child node internally stores and maintains a local URL digest table.
Further, in the above URL deduplication system, the central server stores and maintains a global URL digest table.
In another aspect, the present invention provides a URL deduplication method for a distributed web crawler, for use with any of the above URL deduplication systems, characterized in that it comprises the following steps:
Step 1: the crawler collection child node registers with the central server;
Step 2: the crawler collection child node requests a new task from the database server, that is, it takes a URL from the database waiting queue and obtains new URL information from that URL;
Step 3: the crawler collection child node performs first-level deduplication locally on the newly obtained URL information; if the first-level check does not pass, it discards the URL information, obtains a new task, and returns to step 2; if the first-level check passes, it adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection child node sends the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information; if the second-level check does not pass, the URL information is discarded, the central server notifies the crawler collection child node to obtain a new task, and the flow returns to step 2; if the second-level check passes, the flow proceeds to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection child node;
Step 7: the crawler collection child node adds the link for this URL information to the waiting queue.
Further, in the above URL deduplication method, step 1 also comprises: the central server creates and stores a global URL digest table, and each crawler collection child node creates and stores a local URL digest table.
Further, in any of the above URL deduplication methods, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL digest table stored on the crawler collection child node; if identical URL information is found, the first-level check does not pass; if none is found, the first-level check passes.
Further, in any of the above URL deduplication methods, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL digest table stored on the central server; if identical URL information is found, the second-level check does not pass; if none is found, the second-level check passes.
Further, in any of the above URL deduplication methods, the central server builds up the global URL digest table through the second-level deduplication mechanism, and each crawler collection child node learns interactively with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces traffic to the central server.
Further, in any of the above URL deduplication methods, the following step is also performed before step 1:
Step 0: deploy multiple crawler collection child nodes on multiple servers within the network.
With the URL deduplication system and method provided by the present invention, the hierarchical deduplication mechanism distributes the deduplication work formerly concentrated on the central node (central server) to the individual crawler collection child nodes, so the load on the central node does not become excessive. The central node also no longer performs task allocation, so communication between the central node and the child nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the state of each crawler collection child node, the system is easy to scale, and its design, deployment, and operation become flexible, convenient, and cost-effective.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of a URL deduplication system for a distributed web crawler;
Fig. 2 is a flow chart of a URL deduplication method for a distributed web crawler.
Embodiments
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The architecture of the URL deduplication system for a distributed web crawler provided by an embodiment of the present invention is shown in Fig. 1. The system comprises crawler collection child nodes (crawler child nodes), a central server, and a database server, wherein:
The crawler collection child nodes are deployed on multiple servers within the network and perform first-level deduplication;
The central server performs second-level deduplication; each crawler collection child node must register with the central server to obtain identity authentication; the central server maintains an address list that records the address information of the currently active crawler collection child nodes;
The database server distributes crawl tasks and stores collection results; the crawler collection child nodes obtain tasks from the waiting queue on the database server and store the collection results in the database server after processing.
In this embodiment, multiple crawler collection child nodes are deployed on multiple servers within the network, and each child node uses its own unique identifier to represent its "identity". The URL deduplication system of this embodiment has no central control node: a child node does not obtain crawl tasks from a central control node but fetches URL information from the database server itself. The child nodes therefore need not communicate with one another and can work independently. A child node only needs to "register" with the central server when it joins the system, which makes adding and removing nodes very convenient and gives good scalability.
The central server in this embodiment serves as the aggregation node for distributed deduplication. Each crawler collection child node must register with the central server when it first joins the system. The central server does not distribute tasks; it is only responsible for second-level URL deduplication, so the demands placed on it are lower. The central server here acts only as a data aggregation hub: it can communicate with each crawler collection child node and completes the overall data deduplication task.
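The registration step and the address list of active child nodes could be sketched as follows. The patent does not specify a registration protocol or how identifiers are issued, so the class `CentralRegistry`, its method names, and the use of a random UUID as the node's unique identifier are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CentralRegistry:
    """Address list of currently active crawler child nodes (hypothetical)."""

    # Maps a node identifier to the node's network address.
    address_list: Dict[str, str] = field(default_factory=dict)

    def register(self, address: str) -> str:
        # Issue a unique identifier that represents the node's "identity".
        # Using a random UUID is an assumption made for this sketch.
        node_id = uuid.uuid4().hex
        self.address_list[node_id] = address
        return node_id

    def unregister(self, node_id: str) -> None:
        # Remove a node that leaves the system; unknown ids are ignored.
        self.address_list.pop(node_id, None)

    def is_registered(self, node_id: str) -> bool:
        return node_id in self.address_list
```

In this sketch, adding or removing a node touches only the central registry; the other child nodes are not informed, which matches the statement that child nodes need not communicate with one another.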
This embodiment uses the database server to distribute crawl tasks and store collection results; keeping the waiting queue in the database makes it easy to control the crawl process. The database server handles task allocation and relay, and the crawler collection child nodes interact with one another through it.
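One way to picture a waiting queue kept in a database is the sketch below, which uses Python's built-in `sqlite3` module with an in-memory database. The table name `waiting_queue`, its columns, and the helper functions are assumptions; the patent does not describe the schema or the database product.

```python
import sqlite3
from typing import Optional


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) waiting-queue table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS waiting_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL DEFAULT 'waiting')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> bool:
    """Add a link to the queue; return False if it is already queued."""
    try:
        conn.execute("INSERT INTO waiting_queue (url) VALUES (?)", (url,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint keeps links in the queue from repeating.
        conn.rollback()
        return False


def fetch_task(conn: sqlite3.Connection) -> Optional[str]:
    """Take the oldest waiting link, mark it as taken, and return it."""
    row = conn.execute(
        "SELECT id, url FROM waiting_queue"
        " WHERE status = 'waiting' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # A real multi-node deployment would need locking or an atomic
    # claim here; this single-connection sketch does not attempt that.
    conn.execute("UPDATE waiting_queue SET status = 'taken' WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]
```

Because every child node reads from and writes to the same table, the database acts as the relay point through which the nodes interact.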
Further, each crawler collection child node internally stores and maintains a local URL digest table for first-level deduplication.
Further, the central server stores and maintains a global URL digest table for second-level deduplication.
The deduplication scheme of the present invention is carried out with digest tables held in memory. Each collection child node maintains a local URL digest table for first-level deduplication. After a new link is taken from the waiting queue and filtered, it enters the first-level deduplication stage, where the collection node first checks the link against its local digest table. If the first-level check passes, the collection node sends the link's digest to the central server for second-level deduplication and also writes the URL digest into its local digest table. If the first-level check does not pass, second-level deduplication is skipped. The central server also maintains a global digest table, obtained by aggregating the data of the local digest tables of all collection child nodes, so its information is more complete. A link is treated as new and added to the work queue for processing if and only if the second-level check passes.
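A minimal sketch of an in-memory local digest table and the first-level check is given below. The patent does not name the digest function used for the tables, so the MD5 hex digest, the function `url_digest`, and the class `LocalDigestTable` are assumptions for illustration.

```python
import hashlib
from typing import Optional, Set


def url_digest(url: str) -> str:
    # Fixed-length digest of a URL. MD5 is an assumed choice; the
    # patent only says that a digest of the link is stored and sent.
    return hashlib.md5(url.strip().encode("utf-8")).hexdigest()


class LocalDigestTable:
    """In-memory local URL digest table of one child node (hypothetical)."""

    def __init__(self) -> None:
        self._digests: Set[str] = set()

    def first_level_check(self, url: str) -> Optional[str]:
        """Return the digest if the URL is new locally, otherwise None.

        A returned digest means the first-level check passed: the digest
        has been written to the local table and can now be sent to the
        central server for second-level deduplication.
        """
        d = url_digest(url)
        if d in self._digests:
            return None
        self._digests.add(d)
        return d
```

Storing fixed-length digests rather than full URLs keeps each table entry small and means only the digest has to travel to the central server.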
Another embodiment of the present invention provides a URL deduplication method for a distributed web crawler that uses any of the above URL deduplication systems. The flow of the method is shown in Fig. 2 and comprises the following steps:
Step 1: the crawler collection child node (collection child node) registers with the central server;
Step 2: the crawler collection child node requests a new task from the database server, that is, it takes a URL from the database waiting queue and obtains new URL information from that URL;
Step 3: the crawler collection child node performs first-level deduplication locally on the newly obtained URL information; if the first-level check does not pass, it discards the URL information, obtains a new task, and returns to step 2; if the first-level check passes, it adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection child node sends (uploads) the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information; if the second-level check does not pass, the URL information is discarded, the central server notifies the crawler collection child node to obtain a new task, and the flow returns to step 2; if the second-level check passes, the flow proceeds to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection child node;
Step 7: the crawler collection child node adds the link for this URL information to the waiting queue.
In the above process, one complete deduplication pass includes two levels of deduplication. The specific steps are as follows. First, the collection child node takes a link from the database waiting queue. Because links in the waiting queue on the database server are guaranteed to be unique, the node requests the link and parses out new URLs, which are deduplicated before being added to the waiting queue. The collection child node first performs first-level deduplication locally on a new URL, that is, it checks whether the URL is already in its local URL digest table. If it is, the check does not pass, the URL is discarded, and the node moves on to the next URL. If the local table does not contain the digest, the check passes and the node adds the URL to its local digest table. The node then sends the URL's digest to the central server for second-level deduplication. If the digest already exists on the central server, the link has already been processed and the crawler is notified to obtain a new URL. If the second-level check passes, the central server adds the URL digest to its own deduplication table and the collection child node adds the link to the waiting queue, which completes one round of deduplication.
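The two-level flow described above can be sketched in a single process as follows. This is a simplified illustration rather than the patented implementation: network communication is replaced by direct method calls, the waiting queue is a `collections.deque` instead of a database table, and the names `CentralServer`, `CrawlerNode`, and `handle_new_url` as well as the MD5 digest are assumptions.

```python
import hashlib
from collections import deque
from typing import Deque, Set


def digest(url: str) -> str:
    # Assumed digest function; the patent does not specify one.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class CentralServer:
    """Holds the global URL digest table (second-level deduplication)."""

    def __init__(self) -> None:
        self.global_digests: Set[str] = set()

    def second_level_check(self, d: str) -> bool:
        # Steps 5 and 6: reject a known digest, otherwise record it.
        if d in self.global_digests:
            return False
        self.global_digests.add(d)
        return True


class CrawlerNode:
    """Holds a local URL digest table (first-level deduplication)."""

    def __init__(self, server: CentralServer, waiting_queue: Deque[str]) -> None:
        self.server = server
        self.waiting_queue = waiting_queue
        self.local_digests: Set[str] = set()

    def handle_new_url(self, url: str) -> bool:
        """Return True if the URL passed both levels and was queued."""
        d = digest(url)
        if d in self.local_digests:
            return False                      # step 3: first-level check fails
        self.local_digests.add(d)             # step 3: record locally
        if not self.server.second_level_check(d):
            return False                      # step 5: second-level check fails
        self.waiting_queue.append(url)        # step 7: queue the new link
        return True
```

For example, if two nodes share one server and one queue, a URL accepted from the first node is rejected at the second level when the second node submits it, and is rejected at the first level if the second node sees it again.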
Further, in the above method, in step 1 the central server creates and stores a global URL digest table, and each crawler collection child node creates and stores a local URL digest table.
Further, in the above method, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL digest table stored on the crawler collection child node; if identical URL information is found, the first-level check does not pass; if none is found, the first-level check passes.
Further, in the above method, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL digest table stored on the central server; if identical URL information is found, the second-level check does not pass; if none is found, the second-level check passes.
Further, in any of the above URL deduplication methods, the central server builds up the global URL digest table through the second-level deduplication mechanism, and each crawler collection child node learns interactively with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces traffic to the central server.
Initially, the global deduplication table (global URL digest table) maintained by the central server is the union of the deduplication tables (local URL digest tables) of all child nodes. After some time, the local tables of the crawler collection child nodes tend toward consistency, so second-level deduplication is needed less and less often; ideally, a result can be obtained from first-level deduplication alone, which greatly improves efficiency.
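The effect described above can be illustrated with a toy count of how many digests are sent to the central server in each round. The function `count_second_level_requests` and its inputs are hypothetical, and the sketch stores raw URL strings in place of digests for brevity; it assumes every node discovers the same links in each round, which is an idealized case.

```python
from typing import List, Set


def count_second_level_requests(
    discovered_per_round: List[List[str]], num_nodes: int = 2
) -> List[int]:
    """Count second-level requests per round in an idealized setting."""
    global_table: Set[str] = set()
    local_tables: List[Set[str]] = [set() for _ in range(num_nodes)]
    requests_per_round: List[int] = []
    for discovered in discovered_per_round:
        requests = 0
        for local in local_tables:
            for url in discovered:
                if url in local:
                    continue           # settled by the first-level check alone
                local.add(url)         # first-level check passed
                requests += 1          # digest is sent to the central server
                global_table.add(url)  # no effect if already present
        requests_per_round.append(requests)
    return requests_per_round
```

With two nodes that each see the links `a`, `b`, `c` in rounds one and two and an additional link `d` in round three, the function returns `[6, 0, 2]`: once a link is in a node's local table, that node no longer contacts the central server about it.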
Further, in the above method, the following step is also performed before step 1:
Step 0: deploy multiple crawler collection child nodes on multiple servers within the network.
With the URL deduplication system and method provided by the above embodiments, the hierarchical deduplication mechanism distributes the deduplication work formerly concentrated on the central node (central server) to the individual crawler collection child nodes, so the load on the central node does not become excessive. The central node also no longer performs task allocation, so communication between the central node and the child nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the state of the child nodes, the system is easy to scale, and its design, deployment, and operation become flexible, convenient, and cost-effective.
The distributed web crawler framework and data deduplication scheme of the present invention can meet the needs of a variety of targeted information collection tasks, and are easy to deploy, flexible to configure, and highly scalable.
The above embodiments only express implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the protection scope of the invention. The protection scope of this patent is therefore defined by the appended claims.

Claims (3)

1. A URL deduplication system for a distributed web crawler, characterized in that it comprises crawler collection child nodes, a central server, and a database server, wherein:
The crawler collection child nodes are deployed on multiple servers within the network and are configured to perform first-level deduplication using a local URL digest table and, when the first-level check passes, to add the corresponding URL information to the local URL digest table and send that URL information to the central server;
The central server is configured to perform second-level deduplication on the received URL information using a global URL digest table and, when the second-level check passes, to add the URL information to the global URL digest table; each crawler collection child node must register with the central server to obtain identity authentication; the central server maintains an address list that records the address information of the currently active crawler collection child nodes;
The database server distributes crawl tasks and stores collection results; the crawler collection child nodes obtain tasks from the waiting queue on the database server and store the collection results in the database server after processing;
Each crawler collection child node internally stores and maintains a local URL digest table;
The central server stores and maintains a global URL digest table;
Each crawler collection child node learns interactively with the central server, so that the local URL digest table it maintains converges toward the global URL digest table.
2. A URL deduplication method for a distributed web crawler, for use with the system of claim 1, characterized in that it comprises the following steps:
Step 1: the crawler collection child node registers with the central server;
Step 2: the crawler collection child node requests a new task from the database server, that is, it takes a URL from the waiting queue on the database server and obtains new URL information from that URL;
Step 3: the crawler collection child node performs first-level deduplication locally on the newly obtained URL information; if the first-level check does not pass, it discards the URL information, obtains a new task, and returns to step 2; if the first-level check passes, it adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection child node sends the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information; if the second-level check does not pass, the URL information is discarded, the central server notifies the crawler collection child node to obtain a new task, and the flow returns to step 2; if the second-level check passes, the flow proceeds to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection child node;
Step 7: the crawler collection child node adds the link for this URL information to the waiting queue;
Step 1 also comprises: the central server creates and stores a global URL digest table, and each crawler collection child node creates and stores a local URL digest table;
In the first-level deduplication of step 3, the newly obtained URL information is compared with the URL information recorded in the local URL digest table stored on the crawler collection child node; if identical URL information is found, the first-level check does not pass; if none is found, the first-level check passes;
In the second-level deduplication of step 5, the URL information received from the crawler collection child node is compared with the URL information recorded in the global URL digest table stored on the central server; if identical URL information is found, the second-level check does not pass; if none is found, the second-level check passes;
The central server builds up the global URL digest table through the second-level deduplication mechanism, and each crawler collection child node learns interactively with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces traffic to the central server.
3. The URL deduplication method for a distributed web crawler according to claim 2, characterized in that the following step is also performed before step 1:
Step 0: deploy multiple crawler collection child nodes on multiple servers within the network.
CN201210425213.XA 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler Active CN102932448B (en)


Publications (2)

Publication Number Publication Date
CN102932448A CN102932448A (en) 2013-02-13
CN102932448B true CN102932448B (en) 2016-04-27

Family

ID=47647145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210425213.XA Active CN102932448B (en) 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler

Country Status (1)

Country Link
CN (1) CN102932448B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080235163A1 (en) * 2007-03-22 2008-09-25 Srinivasan Balasubramanian System and method for online duplicate detection and elimination in a web crawler

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102271331A (en) * 2010-06-02 2011-12-07 中国移动通信集团广东有限公司 Method and system for detecting reliability of service provider (SP) site
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN102073683A (en) * 2010-12-22 2011-05-25 四川大学 Distributed real-time news information acquisition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
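Improvement of the URL deduplication strategy for distributed web crawlers (分布式网络爬虫URL去重策略的改进); Wu Xiaohui (吴小惠); Journal of Pingdingshan University (《平顶山学院学报》); 2009-10-31; Vol. 24, No. 5; pp. 116-119 *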

Also Published As

Publication number Publication date
CN102932448A (en) 2013-02-13

Similar Documents

Publication Publication Date Title
CN102932448B (en) The URL re-scheduling system and method for a kind of distributed network reptile
CN104618693A (en) Cloud computing based online processing task management method and system for monitoring video
CN104052759B (en) System for realizing add-and-play technology of internet of things
CN104184785B (en) Forestry Internet of things system based on cloud platform
CN102902669B (en) Distributed information grasping means based on internet system
CN107465764A (en) Internet of Things network communication system, gateway device and method based on stelliform connection topology configuration
CN103108387B (en) A kind of method for building up of the multi-hop wireless self-organizing network being applied to agricultural
CN106210152B (en) Vehicle-mounted cloud system based on Internet of things and resource acquisition method
CN106211311B (en) A kind of method and apparatus of user equipment registration to network
CN110266783A (en) A kind of railway CTC system communications platform based on DDS
CN108293063A (en) The system and method for information catapult on network tapestry and moment granularity
CN108964949A (en) Virtual machine migration method, SDN controller and computer readable storage medium
CN104956629B (en) Event distributing method in software defined network, control device and processor
CN106777308A (en) The synchronous method and device of civil aviaton's sequence information
CN104866528B (en) Multi-platform collecting method and system
CN109074287A (en) Infrastructure resources state
CN102394903A (en) Active reconstruction calculating system constructing system
Gia et al. Exploiting LoRa, edge, and fog computing for traffic monitoring in smart cities
CN102882979B (en) Data acquisition based on cloud computing system and the system and method collecting shunting
CN105450520A (en) Message processing method and device, and method and device for building aggregation tunnel
CN110169019A (en) The network switch and Database Systems that database function defines
CN112351106B (en) Service grid platform containing event grid and communication method thereof
CN104301131A (en) Fault management method and device
CN103973466B (en) A kind of method and device waking up suspend mode link
CN101867988A (en) Centralized self-adaptation network manager node selection algorithm in Ad Hoc network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20211228

Address after: 100191 No. 40, Haidian District, Beijing, Xueyuan Road

Patentee after: CHINA ACADEMY OF INFORMATION AND COMMUNICATIONS

Address before: 100191 6th Floor, Block B, Telecommunications Research Institute, No. 52 Huayuan North Road, Haidian District, Beijing

Patentee before: The Research Institute of Telecommunications Transmission MIIT