CN107193960B - Distributed crawler system and periodic incremental grabbing method - Google Patents

Distributed crawler system and periodic incremental grabbing method

Info

Publication number
CN107193960B
Authority
CN
China
Prior art keywords
component
distributed
spider
crawler
grabbing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710372282.1A
Other languages
Chinese (zh)
Other versions
CN107193960A (en)
Inventor
张雷
韩建军
张文哲
谭龙海
王崇骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a distributed crawler system comprising three major parts: a ZooKeeper-based distributed service, system components, and a database. The system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger and a basic crawler component Spider. The database comprises a Redis in-memory database, which is a key-value store and holds a distributed URL task queue and a distributed BloomFilter. The invention also discloses a periodic incremental grabbing method based on the system, which comprises the following steps: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; and each Spider component either sleeps or performs periodic incremental grabbing according to the current state of the distributed URL task queue. The system and the method solve the problem of how to effectively combine single-machine crawlers, realize a distributed crawler with high availability, high stability and high throughput in a cluster environment, and realize periodic incremental grabbing.

Description

Distributed crawler system and periodic incremental grabbing method
Technical Field
The invention relates to the technical field of efficient data acquisition of internet big data, in particular to a distributed crawler system and a periodic incremental grabbing method.
Background
A web crawler starts from the URLs (Uniform Resource Locators) of one or more initial web pages, obtains the URLs found on those pages, and, while grabbing web pages, continuously extracts new URLs from the current page and puts them into a task queue according to its grabbing strategy, until the stopping condition of the system is met.
With the rapid development of the internet, network data is growing explosively and its sources are increasingly diverse. Faced with such huge and diverse internet data, it is very important to improve the grabbing efficiency of web crawlers and to support customizable crawling strategies for different data sources.
Compared with a traditional stand-alone crawler, a distributed crawler can significantly improve grabbing efficiency, but it also introduces new problems in the distributed environment: multi-node task distribution, load balancing, duplicate web pages, periodic incremental grabbing, and the like.
In summary, the main problem in the prior art is how to solve the problems introduced by a distributed crawler while effectively improving its crawling speed, and how to realize periodic incremental grabbing by the distributed crawler on that basis.
Disclosure of Invention
The invention aims to provide a distributed crawler system and a periodic incremental grabbing method that solve the problem of how to effectively combine single-machine crawlers, realize a distributed crawler with high availability, high stability and high throughput in a cluster environment, and realize periodic incremental grabbing. The technical scheme adopted by the invention to solve these problems is as follows:
The invention discloses a distributed crawler system comprising three major parts: a ZooKeeper-based distributed service, system components, and a database. The system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger and a basic crawler component Spider. The database comprises a Redis in-memory database, which is a key-value store and holds a distributed URL task queue and a distributed BloomFilter. The ZooKeeper-based distributed service provides distributed coordination services for each system component. The system monitoring component Monitor is responsible for dynamic configuration of the system and for monitoring the state of the system. The coordination component Coordinator is responsible for one or more of: importing the seed URLs into the Redis-based distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes. The log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster. The basic crawler component Spider is responsible for the web page crawling tasks. The Redis-based distributed URL task queue stores all task URLs to be crawled. The Redis-based distributed BloomFilter serves the URL deduplication requests of all basic crawler components Spider in the cluster.
Further, the ZooKeeper-based distributed service, by working in coordination with each system component, provides each system component with one or more distributed services including dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation.
Further, the system monitoring component Monitor has a Monitor interface through which a user can modify the system configuration parameters stored on ZooKeeper. The coordination component Coordinator, the log collection component Logger and the basic crawler component Spider in the cluster watch the corresponding data nodes on ZooKeeper, are notified when the contents of those data nodes are modified, and then adjust themselves according to the modified configuration parameters.
Further, the Monitor interface can also display in real time the state parameters of the system and of each component stored on ZooKeeper.
Further, the basic crawler component Spider has multiple component kernels, and the crawling strategies of the component kernels are not all the same.
Further, the basic crawler component Spider is highly extensible, so that a new component kernel can be written conveniently for a new data source.
Further, tasks in the distributed URL task queue are distributed in a pull mode, in which the basic crawler component Spider pulls tasks from the queue.
Further, the distributed BloomFilter adopts a segmentation mechanism to split its bit vector into segments stored on different Redis keys, and uses a segmented optimistic lock to synchronize access by the basic crawler components Spider.
The invention also discloses a periodic incremental grabbing method based on the distributed crawler system, which comprises the following steps: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; each Spider component sleeps or performs periodic incremental grabbing according to the current state of the distributed URL task queue; when no tasks remain to be grabbed, the Spider component enters a dormant state; and when a dormant Spider component is woken up by another Spider component or by the Coordinator, it continues grabbing tasks.
Further, the method comprises the following steps:
S1, the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components. That is, the Coordinator component of the system periodically imports tasks into the distributed URL queue, and after the tasks are imported, the Coordinator wakes up all dormant Spiders to start a new round of incremental grabbing. The grabbing task is executed periodically, and each period starts with the import of the seed tasks.
S2, the Spider component judges whether the periodic incremental grabbing of the system has ended; if so, S6 is executed, otherwise S3 is executed. That is, a grabbing thread in the Spider component checks the corresponding data node in ZooKeeper, whose content is set by the Monitor. If it reads that the periodic incremental grabbing of the system has ended, the Spider component performs the necessary cleanup and saving work and then ends its process; otherwise, periodic incremental grabbing continues.
S3, judging whether the current distributed task queue is empty; if so, S4 is executed, otherwise S5 is executed. That is, a grabbing thread in the Spider component checks whether tasks to be grabbed still exist in the distributed task queue in Redis; if so, it acquires a task and enters the grabbing stage; otherwise, it enters the sleep stage.
S4, entering the sleep stage of the basic crawler component Spider, which mainly comprises: 1) blocking the grabbing thread or putting the Spider component to sleep; 2) the thread being woken up.
Specifically: a) judging whether all grabbing threads of the current Spider component other than the current grabbing thread are blocked; if so, executing step b), otherwise executing step c); b) creating a sleep marker node in ZooKeeper; this node indicates that the current Spider component is dormant, and another component that needs to wake up the Spider component only has to delete this data node; c) blocking the grabbing thread; d) the grabbing thread stays blocked and waits to be woken up by another thread; e) the grabbing thread is woken up by another thread and executes S2.
The Spider components enter this stage when no tasks remain to be grabbed; sleeping avoids wasting system resources on idle spinning. When another Spider component adds new tasks to the task queue, or a new round of incremental grabbing starts, the dormant Spider components are woken up by other Spider components or by the Coordinator component and continue grabbing tasks.
S5, entering the grabbing stage of the basic crawler component Spider, which mainly comprises: 1) acquiring a task from the distributed URL task queue; 2) the Spider executing the grabbing task; 3) waking up blocked grabbing threads or dormant Spider components.
Specifically: a) acquiring a grabbing task from the distributed URL queue; b) grabbing the corresponding web page according to the acquired task and storing the result; c) parsing the hyperlinks of the grabbed web page to obtain a set of new tasks; d) sending the new tasks to the distributed BloomFilter for deduplication; e) adding the deduplicated new tasks to the distributed task queue; f) judging whether the current Spider component has any blocked grabbing threads; if so, executing step g), otherwise executing step h); g) waking up the blocked grabbing threads in the current Spider component; h) judging whether other Spider components in the current cluster are dormant; if so, waking up the corresponding dormant Spiders, otherwise executing S2.
S6, ending. That is, when each component detects that the system needs to stop working, it performs the necessary cleanup work and then ends its process.
Compared with the prior art, and in the face of huge and diverse internet data, the distributed crawler system and the periodic incremental grabbing method have the following beneficial effects:
1) Simple implementation: the distributed crawler system is built on the open-source distributed coordination service ZooKeeper and the open-source in-memory database Redis, and is developed further on top of these existing frameworks, which meets the specific requirements while reducing development cost.
2) High performance: grabbing tasks run in a multi-node, multi-thread mode, which achieves high-performance web page grabbing and supports linear scaling of the Spider component.
3) High availability: based on ZooKeeper and Redis, all components of the system work in cluster mode, which avoids single-node failures and provides a highly available and highly stable web page grabbing service.
4) Automatic periodic incremental grabbing: once the initial tasks and the related system parameters have been set, the system performs periodic incremental grabbing automatically, without human intervention.
5) Customizable grabbing strategy: the Spider component comprises multiple component kernels, each corresponding to a different crawling strategy, and the Spider is designed as a highly extensible component, so that a new component kernel can be written conveniently for a new data source.
6) Good scalability: all components of the system are loosely coupled, a single node going online or offline has very little impact on the system, and linear scaling of each component is supported.
Therefore, the invention has the advantages of reasonable design, simple architecture, high availability, high stability, high performance, good scalability and the like.
Drawings
FIG. 1 is a diagram of the distributed crawler system architecture.
FIG. 2 is the main flow chart of the periodic incremental grabbing method.
FIG. 3 is a flow chart of the Spider grabbing stage.
FIG. 4 is a flow chart of the Spider sleep stage.
Detailed Description
For a better understanding of the technical content of the invention, specific embodiments are described below in conjunction with the accompanying drawings.
FIG. 1 is a diagram of the distributed crawler system architecture of the present invention, which includes three major parts: a ZooKeeper-based distributed service, system components, and a database. The ZooKeeper-based distributed service provides distributed coordination services for each system component. The system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger and a basic crawler component Spider. The database comprises a Redis in-memory database and other databases for storing the grabbed web pages; the distributed URL task queue and the distributed BloomFilter are stored in the Redis in-memory database.
The ZooKeeper-based distributed service, by working in coordination with each system component, provides each system component with distributed coordination services such as dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation. ZooKeeper maintains in memory a tree data structure similar to a file system, and each of these services is realized by the components creating, querying, deleting and watching their corresponding data nodes in that tree.
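The four operations named above (create, query, delete, watch) can be illustrated with a small in-memory model. This is only a sketch: `ZNodeTree` and its method names are hypothetical and are not the ZooKeeper client API, and the one-shot watcher behaviour is modelled on ZooKeeper's standard watches.

```python
from typing import Callable, Dict, List

Watcher = Callable[[str, str], None]


class ZNodeTree:
    """In-memory stand-in for a ZooKeeper-style tree of data nodes.

    Hypothetical illustration only; this is not the ZooKeeper client API.
    """

    def __init__(self) -> None:
        self._nodes: Dict[str, bytes] = {"/": b""}
        self._watchers: Dict[str, List[Watcher]] = {}

    def create(self, path: str, value: bytes = b"") -> None:
        # A node can only be created under an existing parent, as in a file system.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent node {parent!r} does not exist")
        if path in self._nodes:
            raise KeyError(f"node {path!r} already exists")
        self._nodes[path] = value

    def get(self, path: str) -> bytes:
        return self._nodes[path]

    def set(self, path: str, value: bytes) -> None:
        if path not in self._nodes:
            raise KeyError(f"node {path!r} does not exist")
        self._nodes[path] = value
        self._fire(path, "changed")

    def delete(self, path: str) -> None:
        if any(p.startswith(path + "/") for p in self._nodes):
            raise ValueError(f"node {path!r} still has children")
        del self._nodes[path]
        self._fire(path, "deleted")

    def watch(self, path: str, callback: Watcher) -> None:
        """Register a one-shot watcher on an existing node."""
        if path not in self._nodes:
            raise KeyError(f"node {path!r} does not exist")
        self._watchers.setdefault(path, []).append(callback)

    def _fire(self, path: str, event: str) -> None:
        # Watchers are removed before being called, so each fires at most once.
        for callback in self._watchers.pop(path, []):
            callback(path, event)
```

A component that wants continuous notifications would re-register its watcher inside the callback.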
The system monitoring component Monitor is responsible for dynamic configuration of the system and for monitoring the state of the system. Through the Monitor interface, a user can modify the system configuration parameters stored on ZooKeeper (for example, the parameters of the basic crawler components Spider). Each corresponding component in the cluster (Spiders, Coordinators and Loggers) watches its data node on ZooKeeper and, when the content of that node is modified, receives the data-change notification sent by ZooKeeper and then adjusts itself according to the modified configuration parameters. The Monitor interface can also display in real time the state parameters of the system and of each component stored on ZooKeeper, so that a user can monitor the system in real time, find problems promptly and take the corresponding remedial measures. The system configuration parameters mainly include the seed import period, the regular expression constraint, the number of grabbing threads, the grabbing depth and the maximum number of errors, as well as many other more detailed configuration parameters.
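As a minimal sketch of how a Spider might apply such parameters after a change notification, the example below parses a configuration payload and uses it to decide whether a URL should be grabbed. Several things here are assumptions not stated in the source: the JSON encoding of the payload, the field names, and the reading of the "regular constraint" as a regular expression that URLs must match.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SpiderConfig:
    """Hypothetical field names for the parameters listed in the description."""
    seed_import_period_s: int  # seed import period, in seconds
    url_pattern: str           # regular expression constraint on URLs (assumed meaning)
    grab_threads: int          # number of grabbing threads
    max_depth: int             # grabbing depth
    max_errors: int            # maximum number of errors


def parse_config(payload: bytes) -> SpiderConfig:
    """Decode a config payload; JSON is an assumed encoding."""
    return SpiderConfig(**json.loads(payload.decode("utf-8")))


def should_grab(url: str, depth: int, config: SpiderConfig) -> bool:
    """Apply the depth limit and the URL pattern to one candidate task."""
    if depth > config.max_depth:
        return False
    return re.search(config.url_pattern, url) is not None
```

Because the dataclass is frozen, a configuration change replaces the whole object rather than mutating it, which keeps grabbing threads from seeing a half-updated configuration.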
The coordination component Coordinator is responsible for importing the URLs of the seed web pages into the distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes.
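One Coordinator period, combined with the wake-up step that the method assigns to the Coordinator, might look like the following sketch. The `deque` stands in for the Redis task queue and the set of Spider IDs stands in for sleep marker nodes in ZooKeeper; `start_round` and its parameters are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterable, List, Set


def start_round(
    seeds: Iterable[str],
    task_queue: Deque[str],
    sleep_markers: Set[str],
) -> List[str]:
    """Import the seed URLs, then wake every dormant Spider.

    Deleting a Spider's sleep marker is the wake-up signal in this sketch.
    Returns the IDs of the Spiders that were woken.
    """
    task_queue.extend(seeds)
    woken = sorted(sleep_markers)
    sleep_markers.clear()
    return woken
```

A scheduler would call `start_round` once per seed import period.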
The log collection component Logger is responsible for collecting log data from each base crawler component Spider in the cluster for subsequent log analysis.
The basic crawler component Spider is responsible for the actual web page crawling tasks. The Spider component comprises multiple component kernels, each corresponding to a different crawling strategy, and is designed as a highly extensible component, so that a new component kernel can be written conveniently for a new data source. During crawling, the Spider component first initializes itself according to the system configuration and then repeatedly requests URLs from the distributed task queue. For each URL it switches to the corresponding crawling strategy, crawls the web page, extracts the page features and text, stores the extraction results, parses the page's hyperlinks, deduplicates the newly acquired URLs through the distributed BloomFilter, and adds them to the distributed task queue, until the distributed task queue is empty.
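One way to organize multiple kernels is a registry from which the Spider picks a strategy per URL. The sketch below selects by host name, which is an assumption; the source only says that the strategy is switched according to the URL. The kernel names and example hosts are hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

Kernel = Callable[[str], str]

# Registry of component kernels, keyed by host name (assumed selection rule).
_KERNELS: Dict[str, Kernel] = {}


def register_kernel(host: str) -> Callable[[Kernel], Kernel]:
    """Decorator that registers a kernel for one data source."""
    def decorator(func: Kernel) -> Kernel:
        _KERNELS[host] = func
        return func
    return decorator


@register_kernel("news.example.com")
def news_kernel(url: str) -> str:
    return f"news strategy for {url}"


@register_kernel("forum.example.com")
def forum_kernel(url: str) -> str:
    return f"forum strategy for {url}"


def default_kernel(url: str) -> str:
    return f"generic strategy for {url}"


def select_kernel(url: str) -> Kernel:
    """Pick the kernel registered for the URL's host, or the generic one."""
    host = urlsplit(url).hostname or ""
    return _KERNELS.get(host, default_kernel)
```

Supporting a new data source then only requires writing one function and decorating it; the dispatch code stays unchanged.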
The Redis-based distributed URL task queue is responsible for storing all task URLs to be crawled. Tasks are distributed in a pull mode: when a Spider finishes its current crawling task, it actively pulls a new task from the distributed queue for its next round of work. It is worth noting that, for the current Redis-based distributed queue, the pull mode is the best and simplest choice. Push mode, or a combination of push and pull, could also be used, but those require additional implementation work, whereas the pull mode does not.
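The pull mode can be sketched with worker threads that each ask a shared queue for work whenever they are free. Here Python's thread-safe `queue.Queue` stands in for the Redis-based queue, and the function names are hypothetical.

```python
import queue
import threading
from typing import List, Tuple


def spider_worker(
    name: str,
    tasks: "queue.Queue[str]",
    done: List[Tuple[str, str]],
    lock: threading.Lock,
) -> None:
    """Pull tasks until the queue is empty; nobody pushes work to the worker."""
    while True:
        try:
            url = tasks.get_nowait()  # pull: the worker asks when it is free
        except queue.Empty:
            return
        with lock:
            done.append((name, url))  # placeholder for the real grabbing work


def run_pull_demo(urls: List[str], workers: int = 3) -> List[Tuple[str, str]]:
    """Run several workers against one shared queue and report who did what."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        tasks.put(url)
    done: List[Tuple[str, str]] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=spider_worker, args=(f"spider-{i}", tasks, done, lock))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

No dispatcher is needed: load balancing falls out of the fact that a faster worker simply returns to the queue more often.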
The Redis-based distributed BloomFilter serves the URL deduplication requests of all basic crawler components Spider in the cluster. Redis is a key-value store; the distributed BloomFilter adopts a segmentation mechanism to split its bit vector into segments stored on different Redis keys, and uses a segmented optimistic lock to synchronize access by the Spiders. The segmented optimistic lock works as follows: each Spider deduplication request first computes the keys of all bit-vector segments it needs to access, then watches those keys, and then initiates a Redis transaction containing the update requests for all of those segments. When the transaction executes, Redis first checks whether the bit vector under any watched key has changed since the watch began. If so, the transaction is abandoned and the deduplication request is automatically initiated again; otherwise, the update succeeds and the URL is added to the BloomFilter. A distributed BloomFilter built on the segmentation mechanism and the optimistic lock can serve deduplication requests at high throughput, and its capacity grows with the linear scaling of the Redis cluster, so there is no capacity limit.
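The watch, check and retry cycle described above can be sketched without a Redis server. In the example below, `VersionedStore` is a hypothetical in-memory stand-in whose `watch` and `commit` play the roles of Redis WATCH and MULTI/EXEC; the segment size, segment count, hash count, key naming and SHA-256 hashing are all assumptions, not values from the source.

```python
import hashlib
import threading
from typing import Dict, Iterable, List, Tuple

SEGMENT_BITS = 1024  # bits per segment (assumed)
NUM_SEGMENTS = 8     # number of segment keys (assumed)
NUM_HASHES = 3       # hash functions per URL (assumed)

Snapshot = Dict[str, Tuple[int, int]]  # key -> (bits, version)


class VersionedStore:
    """In-memory stand-in for Redis keys with check-and-set semantics."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, int] = {}
        self._versions: Dict[str, int] = {}

    def watch(self, keys: Iterable[str]) -> Snapshot:
        with self._lock:
            return {k: (self._values.get(k, 0), self._versions.get(k, 0)) for k in keys}

    def commit(self, watched: Snapshot, updates: Dict[str, int]) -> bool:
        """OR the masks into the segments only if no watched key has changed."""
        with self._lock:
            if any(self._versions.get(k, 0) != ver for k, (_, ver) in watched.items()):
                return False  # a watched segment changed: abandon the transaction
            for key, mask in updates.items():
                self._values[key] = self._values.get(key, 0) | mask
                self._versions[key] = self._versions.get(key, 0) + 1
            return True


def bit_positions(url: str) -> List[Tuple[str, int]]:
    """Map a URL to (segment key, bit offset) pairs, one per hash function."""
    positions = []
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % (SEGMENT_BITS * NUM_SEGMENTS)
        positions.append((f"bloom:{index // SEGMENT_BITS}", index % SEGMENT_BITS))
    return positions


def add_if_new(store: VersionedStore, url: str) -> bool:
    """Return True if the URL was not in the filter and has now been added.

    As with any Bloom filter, False may occasionally be a false positive.
    """
    positions = bit_positions(url)
    keys = {key for key, _ in positions}
    while True:
        watched = store.watch(keys)
        if all((watched[key][0] >> offset) & 1 for key, offset in positions):
            return False  # every bit already set: treat as a duplicate
        updates: Dict[str, int] = {}
        for key, offset in positions:
            updates[key] = updates.get(key, 0) | (1 << offset)
        if store.commit(watched, updates):
            return True
        # Another writer touched a watched segment; retry the whole request.
```

Because a request only watches the segments it touches, two Spiders whose URLs hash to different segments never force each other to retry.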
The embodiment also discloses a periodic incremental grabbing method based on the distributed crawler system, which is described in detail with reference to FIG. 2 to FIG. 4.
FIG. 2 is the main flow chart of the periodic incremental grabbing method in the embodiment, whose steps are as follows:
Step 1-0, initial state of the periodic incremental grabbing method;
Step 1-1, the coordination component Coordinator periodically imports tasks into the distributed URL task queue;
Step 1-2, judging whether the periodic incremental grabbing of the system has ended; if so, going to step 1-9, otherwise executing step 1-3;
Step 1-3, judging whether the current distributed task queue is empty; if so, entering the Spider sleep stage and executing steps 1-4 and 1-5; otherwise, entering the Spider grabbing stage and executing steps 1-6, 1-7 and 1-8;
Step 1-4, blocking the grabbing thread or putting the Spider component to sleep;
Step 1-5, the blocked thread is woken up, and step 1-2 is executed;
Step 1-6, acquiring a task from the distributed queue;
Step 1-7, the Spider executes the grabbing task;
Step 1-8, waking up blocked grabbing threads or dormant Spider components, and executing step 1-2;
Step 1-9, end state.
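The control flow of steps 1-2 to 1-9 for a single grabbing thread can be sketched as a loop. The callables passed in are hypothetical hooks: `should_stop` stands for reading the end flag set by the Monitor, `sleep_until_woken` for the sleep stage, and `grab` for the grabbing stage (including its wake-up step 1-8). A `deque` stands in for the distributed queue.

```python
from collections import deque
from typing import Callable, Deque


def spider_main_loop(
    task_queue: Deque[str],
    should_stop: Callable[[], bool],
    grab: Callable[[str], None],
    sleep_until_woken: Callable[[], None],
) -> None:
    """Single-threaded sketch of the main flow for one grabbing thread."""
    while True:
        if should_stop():            # step 1-2: has periodic grabbing ended?
            return                   # step 1-9: end state
        if not task_queue:           # step 1-3: is the task queue empty?
            sleep_until_woken()      # steps 1-4 and 1-5: block, then be woken
            continue                 # back to step 1-2
        url = task_queue.popleft()   # step 1-6: acquire a task
        grab(url)                    # steps 1-7 and 1-8: grab, then wake others
```

Re-checking the stop flag after every wake-up is what lets a dormant thread exit cleanly when the system is shut down.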
FIG. 3 is a flow chart of the Spider grabbing stage in the embodiment, whose steps are as follows:
Step 2-0, start of the Spider grabbing stage; this step immediately follows step 1-3;
Step 2-1, acquiring a task from the distributed queue;
Step 2-2, grabbing the corresponding web page according to the acquired task and storing the result;
Step 2-3, parsing the hyperlinks of the grabbed web page to obtain a set of new tasks;
Step 2-4, sending the new tasks to the distributed BloomFilter for deduplication;
Step 2-5, adding the deduplicated new tasks to the distributed task queue;
Step 2-6, judging whether the Spider component has any blocked grabbing threads; if so, executing step 2-7, otherwise executing step 2-8;
Step 2-7, waking up the blocked grabbing threads of the Spider component;
Step 2-8, judging whether other Spider components in the current cluster are dormant; if so, executing step 2-9, otherwise executing step 2-10;
Step 2-9, waking up the dormant Spiders;
Step 2-10, end state of the Spider grabbing stage, after which step 1-2 is executed.
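Steps 2-1 to 2-5 can be sketched for a single task as follows. The `fetch` callable is a hypothetical hook for downloading a page, a plain `set` stands in for the distributed BloomFilter, and a `deque` stands in for the distributed task queue; the wake-up steps 2-6 to 2-9 are left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Deque, Dict, List, Set
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def grab_once(
    task_queue: Deque[str],
    seen: Set[str],
    results: Dict[str, str],
    fetch: Callable[[str], str],
) -> List[str]:
    """Process one task and return the new tasks that were enqueued."""
    url = task_queue.popleft()       # step 2-1: acquire a task
    html = fetch(url)                # step 2-2: grab the page ...
    results[url] = html              # ... and store the result
    parser = LinkParser(url)         # step 2-3: parse the hyperlinks
    parser.feed(html)
    new_tasks = []
    for link in parser.links:        # step 2-4: deduplicate (set as stand-in)
        if link not in seen:
            seen.add(link)
            new_tasks.append(link)
    task_queue.extend(new_tasks)     # step 2-5: enqueue the new tasks
    return new_tasks
```

Deduplicating before enqueueing keeps the queue free of repeated URLs, so each page is grabbed at most once per round.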
FIG. 4 is a flow chart of the Spider sleep stage in the embodiment, whose steps are as follows:
Step 3-0, start of the Spider sleep stage; this step immediately follows step 1-3;
Step 3-1, judging whether all grabbing threads of the Spider component other than the current grabbing thread are blocked; if so, executing step 3-2, otherwise executing step 3-3;
Step 3-2, creating a sleep marker node in ZooKeeper; this node indicates that the corresponding Spider component is dormant, and another component that needs to wake up the Spider component only has to delete this data node;
Step 3-3, blocking the grabbing thread;
Step 3-4, the grabbing thread stays blocked and waits to be woken up by another thread;
Step 3-5, the grabbing thread is woken up by another thread;
Step 3-6, end state of the Spider sleep stage, after which step 1-2 is executed.
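The sleep stage for the grabbing threads of one Spider can be sketched with a condition variable. A shared set of Spider IDs stands in for sleep marker nodes in ZooKeeper, and `wake` is called directly here, whereas in the described system the wake-up is triggered by another component deleting the marker node. All names are hypothetical.

```python
import threading
from typing import Set


class SpiderSleepState:
    """Tracks blocked grabbing threads of one Spider component."""

    def __init__(self, spider_id: str, thread_count: int, sleep_markers: Set[str]) -> None:
        self.spider_id = spider_id
        self.thread_count = thread_count
        self.sleep_markers = sleep_markers  # stand-in for marker nodes in ZooKeeper
        self.blocked = 0
        self.wake_generation = 0
        self.cond = threading.Condition()

    def sleep(self) -> None:
        """Block the calling grabbing thread until wake() is called."""
        with self.cond:
            # Step 3-1: are all other grabbing threads already blocked?
            if self.blocked == self.thread_count - 1:
                # Step 3-2: the whole Spider is now dormant; create the marker.
                self.sleep_markers.add(self.spider_id)
            self.blocked += 1                          # step 3-3: block
            generation = self.wake_generation
            while generation == self.wake_generation:  # step 3-4: wait
                self.cond.wait()
            self.blocked -= 1                          # step 3-5: woken up

    def wake(self) -> None:
        """Wake every blocked grabbing thread and remove the sleep marker."""
        with self.cond:
            self.sleep_markers.discard(self.spider_id)
            self.wake_generation += 1
            self.cond.notify_all()
```

The generation counter guards against spurious wake-ups: a thread resumes only after `wake` has actually been called since it went to sleep.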
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims (9)

1. A distributed crawler system, characterized in that the system comprises a ZooKeeper-based distributed service, system components and a database, wherein the system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger and a basic crawler component Spider, the database comprises a Redis in-memory database, and a distributed URL task queue and a distributed BloomFilter are stored in the Redis in-memory database; wherein the ZooKeeper-based distributed service provides distributed coordination services for each system component,
the system monitoring component Monitor is responsible for dynamic configuration of the system and state monitoring of the system,
the coordination component Coordinator is responsible for one or more of: importing the seed URLs into the Redis-based distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes,
the log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster, the basic crawler component Spider is responsible for processing the web page crawling tasks,
the Redis-based distributed URL task queue is responsible for storing all task URLs to be crawled,
the Redis-based distributed BloomFilter is responsible for the URL (Uniform Resource Locator) deduplication requests of all basic crawler components Spider in the cluster; and the ZooKeeper-based distributed service, by working in coordination with each system component, provides each system component with one or more distributed services including dynamic configuration, cluster node detection and management, Master election, distributed locks and global URL ID generation.
2. The distributed crawler system according to claim 1, wherein the system monitoring component Monitor has a Monitor interface through which a user can modify the system configuration parameters stored on ZooKeeper, and the coordination component Coordinator, the log collection component Logger and the basic crawler component Spider in the cluster watch the corresponding data nodes on ZooKeeper, are notified after the contents of the data nodes are modified, and then adjust themselves according to the modified configuration parameters.
3. The distributed crawler system of claim 2, wherein the Monitor interface is further capable of displaying in real time the system state parameters and the component state parameters stored on ZooKeeper.
4. The distributed crawler system according to claim 1, wherein the basic crawler component Spider has multiple component kernels, and the crawling strategies of the component kernels are not all the same.
5. The distributed crawler system of claim 1, wherein the basic crawler component Spider is highly extensible to facilitate writing new component kernels for new data sources.
6. The distributed crawler system of claim 1, wherein tasks in the distributed URL task queue are distributed in a pull mode in which the basic crawler component Spider pulls tasks from the queue.
7. The distributed crawler system according to claim 1, wherein the distributed BloomFilter adopts a segmentation mechanism to split its bit vector into segments stored on different Redis keys, and realizes synchronization control of access by each basic crawler component Spider through a segmented optimistic lock.
8. A periodic incremental crawling method, based on the distributed crawler system of any one of claims 1 to 7, comprising: the coordination component Coordinator periodically imports tasks into a distributed URL task queue and wakes up a dormant Spider component; the Spider component conducts dormancy or periodic incremental grabbing according to the execution condition of the current distributed URL task queue, when the task is not grabbed, the Spider component enters a dormant state, and when the dormant Spider component is awakened by other basic crawler components or Coordinator components, the task can be continuously grabbed.
9. The method of periodic incremental grabbing according to claim 8, comprising the steps of:
s1, the coordination component Coordinator periodically imports tasks to a distributed URL task queue and wakes up a dormant Spider component;
s2, judging whether the periodic incremental grabbing of the system is finished, if so, jumping to S6, otherwise, executing S3;
s3, judging whether the current distributed task queue is empty, if so, executing S4, otherwise, jumping to S5;
s4, entering a basic crawler component (Spider) dormancy stage, and comprising the following steps: a) judging whether the current Spider component except the current grabbing thread and other grabbing threads are blocked, if so, executing the step b), otherwise, executing the step c); b) a dormancy marking node is created in the ZooKeeper, the node can be used for indicating that the current Spider assembly is dormant, and when other assemblies need to wake up the Spider assembly, only the data node needs to be deleted; c) blocking the fetch thread; d) the grabbing thread is blocked and waits for other threads to wake up; e) the grab thread is woken up by other threads and executes S2;
s5, entering a basic crawler component Spider grabbing stage, comprising: a) acquiring a grabbing task from a distributed URL queue; b) capturing a corresponding webpage according to the acquired task and storing a result; c) analyzing the captured webpage hyperlink and acquiring a new task set; d) sending the acquired new task to a distributed BloomFilter for duplicate removal; e) adding the new task after the duplication removal to a distributed task queue; f) judging whether the current Spider component is blocked by a capturing thread, if so, executing the step g), otherwise, executing the step h); g) waking up a blocked grabbing thread in the current spinner component; h) judging whether other Spider assemblies in the current cluster are dormant or not, if so, awakening the corresponding dormant Spider, otherwise, executing S2;
and S6, ending.
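
The control flow of steps S1 to S5 can be illustrated with a minimal single-process sketch. It is not the patented implementation: the in-memory `BloomFilter` class, the `run_period` function, the `dormant` set (standing in for the ZooKeeper dormancy marker nodes), the `get_links` callback (standing in for page fetching and hyperlink parsing), and the id `spider-1` are all hypothetical names chosen for illustration. Thread blocking and wake-up (S4 c to e, S5 f to h) are omitted, and the sketch assumes that periodically imported seed URLs bypass de-duplication while newly discovered links do not.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Small in-memory Bloom filter (stand-in for the distributed BloomFilter)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add_if_new(self, item):
        """Record item; return True only if it was definitely not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte_index, bit_index = divmod(pos, 8)
            mask = 1 << bit_index
            if not self.bits[byte_index] & mask:
                is_new = True
                self.bits[byte_index] |= mask
        return is_new


def run_period(seeds, get_links, bloom, dormant, spider_id="spider-1"):
    """Run one crawl period and return the URLs fetched, in order."""
    queue = deque(seeds)            # S1: Coordinator imports the periodic tasks
    dormant.discard(spider_id)      # S1: waking up = deleting the dormancy marker
    fetched = []
    while queue:                    # S3: is the task queue empty?
        url = queue.popleft()       # S5 a: acquire a task
        fetched.append(url)         # S5 b: fetch and store (simulated)
        for link in get_links(url):         # S5 c: parse hyperlinks
            if bloom.add_if_new(link):      # S5 d: de-duplicate
                queue.append(link)          # S5 e: enqueue only new tasks
    dormant.add(spider_id)          # S4 b: queue drained, create dormancy marker
    return fetched


# Usage: a toy link graph crawled in two consecutive periods.
site = {"list": ["a", "b"], "a": ["b", "c"]}
bloom, dormant = BloomFilter(), set()
first = run_period(["list"], lambda u: site.get(u, []), bloom, dormant)
site["list"].append("d")            # a new page appears between periods
second = run_period(["list"], lambda u: site.get(u, []), bloom, dormant)
```

In this toy run the first period fetches `list`, `a`, `b`, and `c`, while the second period fetches only `list` and the newly linked `d`, which is the incremental behavior the claim describes. Because a Bloom filter can report false positives, a real deployment would occasionally skip a URL that was in fact new.
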
CN201710372282.1A 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method Active CN107193960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710372282.1A CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710372282.1A CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Publications (2)

Publication Number Publication Date
CN107193960A CN107193960A (en) 2017-09-22
CN107193960B true CN107193960B (en) 2020-11-10

Family

ID=59874541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710372282.1A Active CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Country Status (1)

Country Link
CN (1) CN107193960B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657053A (en) * 2017-10-17 2018-02-02 山东浪潮云服务信息科技有限公司 Crawler implementation method and device
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 Distributed crawler framework and implementation method based on in-memory database
CN109359231A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 Information crawling method, server, and storage medium for a distributed web crawler
CN108376142B (en) * 2018-01-10 2021-05-14 北京思特奇信息技术股份有限公司 Distributed memory database data synchronization method and system
CN108133041A (en) * 2018-01-11 2018-06-08 四川九洲电器集团有限责任公司 Data collecting system and method based on web crawlers and data transfer technology
CN109684058B (en) * 2018-12-18 2022-11-04 成都睿码科技有限责任公司 Efficient crawler platform capable of being linearly expanded for multiple tenants and using method thereof
CN109471979B (en) * 2018-12-20 2021-09-10 奇安信科技集团股份有限公司 Method, system, equipment and medium for capturing dynamic page
CN109948079A (en) * 2019-03-11 2019-06-28 湖南衍金征信数据服务有限公司 Method for distributed capture of public page data
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method
CN110929126A (en) * 2019-12-02 2020-03-27 杭州安恒信息技术股份有限公司 Distributed crawler scheduling method based on remote procedure call
CN111209460A (en) * 2019-12-27 2020-05-29 青岛海洋科学与技术国家实验室发展中心 Data acquisition system and method based on script crawler framework
CN111274013B (en) * 2020-01-16 2022-05-03 北京思特奇信息技术股份有限公司 Method and system for optimizing timed task scheduling based on memory database in container
CN112381317A (en) * 2020-11-26 2021-02-19 方是哲如管理咨询有限公司 Big data platform for tissue behavior analysis and result prediction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN105243125A (en) * 2015-09-29 2016-01-13 北京京东尚科信息技术有限公司 PrestoDB cluster running method and apparatus, cluster and data query method and apparatus
CN105447097A (en) * 2015-11-10 2016-03-30 北京北信源软件股份有限公司 Data acquisition method and system
CN105868327A (en) * 2016-03-28 2016-08-17 浪潮软件集团有限公司 Distributed web crawler capturing method based on different updating strategies
KR101664712B1 (en) * 2015-06-19 2016-10-10 이화여자대학교 산학협력단 Bloomfilter query apparatus and method for identifying true positiveness without accessing hashtable

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103037010A (en) * 2012-12-26 2013-04-10 人民搜索网络股份公司 Distributed network crawler system and catching method thereof
CN103559219B (en) * 2013-10-18 2016-12-07 北京京东尚科信息技术有限公司 Distributed network crawler capturing method for scheduling task, dispatching terminal equipment and crawl node
CN106383896A (en) * 2016-09-28 2017-02-08 浪潮软件集团有限公司 Crawler + RocktMQ-based data capturing and distributing method

Also Published As

Publication number Publication date
CN107193960A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
CN107193960B (en) Distributed crawler system and periodic incremental grabbing method
Chen et al. G-miner: an efficient task-oriented graph mining system
Olston et al. Automatic optimization of parallel dataflow programs
Meehan et al. Data Ingestion for the Connected World.
Mishne et al. Fast data in the era of big data: Twitter's real-time related query suggestion architecture
Verma et al. Breaking the MapReduce stage barrier
Borkar et al. Hyracks: A flexible and extensible foundation for data-intensive computing
US8055918B2 (en) Optimizing preemptible read-copy update for low-power usage by avoiding unnecessary wakeups
US9396226B2 (en) Highly scalable tree-based trylock
US20100023732A1 (en) Optimizing non-preemptible read-copy update for low-power usage by avoiding unnecessary wakeups
Mahajan et al. Improving the energy efficiency of relational and NoSQL databases via query optimizations
CN107820611B (en) Event processing system paging
US10963839B2 (en) Nested hierarchical rollups by level using a normalized table
WO2019047441A1 (en) Communication optimization method and system
Sun et al. Efficient parallel subgraph enumeration on a single machine
Zhang et al. Recognizing patterns in streams with imprecise timestamps
Yan et al. G-thinker: big graph mining made easier and faster
Ding et al. ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms
Li et al. R-Store: A scalable distributed system for supporting real-time analytics
Arora et al. Multi-representation based data processing architecture for IoT applications
Chen et al. E3: an elastic execution engine for scalable data processing
Zhao et al. MapReduce model-based optimization of range queries
Bian et al. Rainbow: Adaptive layout optimization for wide tables
He et al. The high-activity parallel implementation of data preprocessing based on MapReduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant