CN107193960A - Distributed crawler system and periodic incremental grabbing method - Google Patents

Info

Publication number
CN107193960A
Authority
CN
China
Prior art keywords
component
spider
distributed
crawler
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710372282.1A
Other languages
Chinese (zh)
Other versions
CN107193960B (en)
Inventor
张雷
韩建军
张文哲
谭龙海
王崇骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710372282.1A
Publication of CN107193960A
Application granted
Publication of CN107193960B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a distributed crawler system consisting of three parts: ZooKeeper-based distributed services, system components, and databases. The system components comprise a system monitoring component (Monitor), a coordination component (Coordinator), a log collection component (Logger), and a basic crawler component (Spider). The databases include a Redis in-memory database, which uses key-value storage and holds the distributed URL task queue and the distributed BloomFilter. The invention further discloses a periodic incremental crawling method based on the system: the Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components, and each Spider component either sleeps or performs periodic incremental crawling, depending on the state of the distributed URL task queue. The system and method address how to combine single-machine crawlers effectively into a highly available, highly stable, high-throughput distributed crawler in a cluster environment, and how to realize periodic incremental crawling.

Description

Distributed crawler system and periodic incremental grabbing method
Technical field
The present invention relates to the technical field of efficient data acquisition for Internet big data, and in particular to a distributed crawler system and a periodic incremental crawling method.
Background
A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page according to its crawl strategy and puts them into a task queue, until the system's stop condition is met.
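The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `crawl` function, the `get_links` callback, the `max_pages` stop condition, and the in-memory `WEB` dictionary are hypothetical stand-ins, and a breadth-first order is just one possible crawl strategy.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then queue newly discovered URLs."""
    queue = deque(seeds)          # the task queue
    seen = set(seeds)             # URLs already queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:   # stop condition (assumed)
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):             # URLs extracted from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical in-memory "web" standing in for real HTTP fetches.
WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: WEB.get(url, []))
print(order)
```

In a distributed crawler the `queue` and `seen` structures have to be shared by all nodes, which is the role the task queue and BloomFilter play in the system described here.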
With the rapid development of the Internet, network data is growing explosively and data sources are becoming increasingly diverse. Faced with Internet data this large and this varied, it is crucial to improve the crawling efficiency of web crawlers and to support customizable crawl strategies for different data sources.
Compared with a conventional single-machine crawler, a distributed crawler can significantly improve crawling efficiency, but it also introduces new problems: task scheduling across multiple nodes, load balancing, duplicate web pages in a distributed environment, periodic incremental crawling, and so on.
In summary, the main problem in the prior art is how to effectively increase crawling speed while solving the problems that distribution introduces, and how to realize periodic incremental crawling for a distributed crawler on that basis.
Summary of the invention
The technical problem to be solved by the invention is to provide a distributed crawler system and a periodic incremental crawling method that effectively combine single-machine crawlers into a highly available, highly stable, high-throughput distributed crawler in a cluster environment and that realize periodic incremental crawling. The technical solution adopted by the invention is as follows:
The invention discloses a distributed crawler system consisting of three parts: ZooKeeper-based distributed services, system components, and databases. The system components comprise a system monitoring component (Monitor), a coordination component (Coordinator), a log collection component (Logger), and a basic crawler component (Spider). The databases include a Redis in-memory database, which uses key-value storage and holds the distributed URL task queue and the distributed BloomFilter. The ZooKeeper-based distributed services provide distributed coordination for every system component. The Monitor is responsible for dynamic configuration and status monitoring of the system. The Coordinator is responsible for one or more of the following: importing seed URLs into the Redis-based distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the Logger, and detecting and managing cluster nodes. The Logger collects log data from every Spider in the cluster. The Spider handles web page crawling tasks. The Redis-based distributed URL task queue stores the task URLs waiting to be crawled. The Redis-based distributed BloomFilter serves the URL deduplication requests of all Spiders in the cluster.
Further, the ZooKeeper-based distributed services work in coordination with the system components and provide them with one or more of the following: dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation.
Further, the Monitor has a Monitor interface through which users can modify the system configuration parameters stored in ZooKeeper. The Coordinator, Logger, and Spider components in the cluster watch the corresponding data nodes in ZooKeeper, are notified when the content of a data node is modified, and then adjust themselves according to the modified configuration parameters.
Further, the Monitor interface can also display in real time the system state parameters and component state parameters stored in ZooKeeper.
Further, the Spider component has multiple component kernels whose crawl strategies are not all the same.
Further, the Spider component is highly extensible, which makes it easy to write a new component kernel for a new data source.
Further, tasks in the distributed URL task queue are distributed in pull mode: each Spider pulls its own tasks.
Further, the distributed BloomFilter uses a segmentation scheme that stores the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by the Spiders.
The invention further discloses a periodic incremental crawling method based on the above distributed crawler system. The Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components. Each Spider component either sleeps or performs periodic incremental crawling, depending on the state of the distributed URL task queue. When there are no crawl tasks, a Spider component enters the sleep state; a sleeping Spider component resumes crawling when it is woken by another Spider component or by the Coordinator.
Further, the method comprises the following steps:
S1. The Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components. That is, the Coordinator periodically imports tasks into the distributed URL queue; after each import, it wakes all sleeping Spider components so that they start a new round of incremental crawling. Crawling is carried out periodically, and every cycle begins with the import of seed tasks.
S2. The Spider component determines whether the system's periodic incremental crawling should end. If so, go to S6; otherwise, go to S3. That is, a crawl thread in the Spider component checks the corresponding data node in ZooKeeper, whose content is set through the Monitor. If the node indicates that periodic incremental crawling is to end, the Spider component performs its cleanup and state-saving work and then terminates its own process; otherwise, it continues periodic incremental crawling.
S3. Determine whether the distributed task queue is currently empty. If so, go to S4; otherwise, go to S5. That is, a crawl thread in the Spider component checks whether the distributed task queue in Redis holds tasks waiting to be crawled. If it does, the thread obtains a task and enters the crawl phase; otherwise, it enters the sleep phase.
S4. Enter the Spider sleep phase, which mainly consists of: 1) blocking the crawl thread or putting the Spider component to sleep; 2) waking the thread.
Specifically, the phase includes the following steps: a) determine whether all crawl threads of the current Spider component other than the current one have already blocked; if so, go to step b), otherwise go to step c); b) create a sleep flag node in ZooKeeper; the node indicates that the current Spider component is asleep, and another component that needs to wake this Spider component only has to delete the node; c) block the current crawl thread; d) the crawl thread is now blocked and waits to be woken by another thread; e) the crawl thread is woken by another thread and goes to S2.
When there are no crawl tasks, the Spider component enters this phase and puts itself to sleep, so that system resources are not consumed idly. When another Spider component adds new tasks to the task queue, or when a new round of incremental crawling starts, the sleeping Spider component is woken by another Spider component or by the Coordinator and resumes crawling.
S5. Enter the Spider crawl phase, which mainly consists of: 1) obtaining a task from the distributed URL task queue; 2) the Spider executing the crawl task; 3) waking blocked crawl threads or sleeping Spider components.
Specifically, the phase includes the following steps: a) obtain a crawl task from the distributed URL queue; b) crawl the corresponding web page for the obtained task and store the result; c) analyze the hyperlinks in the crawled page to obtain a set of new tasks; d) send the new tasks to the distributed BloomFilter for deduplication; e) add the deduplicated new tasks to the distributed task queue; f) determine whether the current Spider component has blocked crawl threads; if so, go to step g), otherwise go to step h); g) wake the blocked crawl threads of the current Spider component; h) determine whether other Spider components in the cluster are asleep; if so, wake the corresponding sleeping Spider components; otherwise, go to S2.
S6. End. That is, when a component detects that the system needs to stop, it performs the necessary cleanup and then terminates its own process.
For large and diverse Internet data, the distributed crawler system and periodic incremental crawling method of the invention have the following advantages over the prior art:
1) Simple to implement: the system is built on the open-source distributed coordination service ZooKeeper and the open-source in-memory database Redis, with further development on top of these existing frameworks. This meets the specific requirements while reducing development cost.
2) High performance: crawl tasks run on multiple nodes with multiple threads, which gives high-performance web page crawling and supports linear scaling of Spider components.
3) High availability: built on ZooKeeper and Redis, every system component works as a cluster, which avoids single points of failure and provides a highly available, highly stable web crawling service.
4) Automated periodic incremental crawling: once the initial tasks and the relevant system parameters have been set, the system performs periodic incremental crawling automatically, without human intervention.
5) Customizable crawl strategies: the Spider component contains multiple component kernels, each corresponding to a different crawl strategy, and the Spider is designed as a highly extensible component, so a new component kernel can easily be written for a new data source.
6) Good scalability: all system components are loosely coupled, so any single node going online or offline has very little impact on the system, and every component supports linear scaling.
As can be seen, the invention offers a reasonable design, a simple architecture, high availability, high stability, high performance, and good scalability.
Brief description of the drawings
Fig. 1 is an architecture diagram of the distributed crawler system.
Fig. 2 is the main flow chart of the periodic incremental crawling method.
Fig. 3 is a flow chart of the Spider crawl phase.
Fig. 4 is a flow chart of the Spider sleep phase.
Detailed description of the embodiments
For a better understanding of the technical content of the invention, specific embodiments are described below with reference to the accompanying drawings.
Fig. 1 is an architecture diagram of the distributed crawler system of the invention. The system consists of three parts: ZooKeeper-based distributed services, system components, and databases. The ZooKeeper-based distributed services provide distributed coordination for every system component. The system components comprise the system monitoring component Monitor, the coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider. The databases include the Redis in-memory database and other databases that store the crawled web pages; the Redis in-memory database holds the distributed URL task queue and the distributed BloomFilter.
The ZooKeeper-based distributed services work in coordination with the system components and provide them with distributed coordination services including dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation. ZooKeeper maintains in memory a tree data structure similar to a file system; these distributed services are implemented by creating, querying, deleting, and watching the data nodes that correspond to each component in that tree.
The system monitoring component Monitor is responsible for dynamic configuration and status monitoring of the system. Through the Monitor interface, users can modify the system configuration parameters stored in ZooKeeper (for example, the parameters of each Spider). Each component in the cluster (Spider, Coordinator, and Logger) watches its corresponding data nodes in ZooKeeper. When the content of a data node is modified, ZooKeeper sends a data-change notification to the component, which then adjusts itself according to the modified configuration parameters. The Monitor interface can also display in real time the system state parameters and component state parameters stored in ZooKeeper, so that users can monitor the system, spot problems promptly, and take remedial action. The system configuration parameters mainly include the seed import period, regular-expression constraints, number of crawl threads, crawl depth, and maximum error count, along with many other finer-grained parameters.
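The watch-and-notify pattern behind this dynamic configuration can be illustrated with a small in-memory sketch. It does not use a real ZooKeeper client: `ConfigNode` is a hypothetical stand-in for one data node, and the parameter names `crawl_threads` and `crawl_depth` are assumed for illustration only.

```python
class ConfigNode:
    """In-memory stand-in for one ZooKeeper data node (not a real client)."""

    def __init__(self, data):
        self.data = dict(data)
        self._watchers = []

    def watch(self, callback):
        """Register a watcher and deliver the current value immediately."""
        self._watchers.append(callback)
        callback(self.data)

    def set(self, **changes):
        """Modify the node and notify every watcher of the new content."""
        self.data.update(changes)
        for callback in self._watchers:
            callback(self.data)


class Spider:
    """Keeps its settings in sync with the watched configuration node."""

    def __init__(self, node):
        self.threads = None
        self.depth = None
        node.watch(self.on_config)

    def on_config(self, cfg):
        self.threads = cfg["crawl_threads"]   # assumed parameter name
        self.depth = cfg["crawl_depth"]       # assumed parameter name


node = ConfigNode({"crawl_threads": 4, "crawl_depth": 3})
spider = Spider(node)
node.set(crawl_threads=8)   # the kind of change a user would make via Monitor
print(spider.threads, spider.depth)
```

With a real ZooKeeper deployment the notification would arrive over the network and the component would re-read the node; the sketch only shows the control flow.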
The coordination component Coordinator is responsible for importing seed page URLs into the distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the Logger, and detecting and managing cluster nodes.
The log collection component Logger collects log data from every Spider in the cluster for subsequent log analysis.
The basic crawler component Spider performs the actual web page crawling. The Spider component contains multiple component kernels, each corresponding to a different crawl strategy, and it is designed to be highly extensible, so a new component kernel can easily be written for a new data source. During crawling, the Spider first initializes itself according to the system configuration. It then repeatedly requests a URL from the distributed task queue, switches to the crawl strategy that matches the URL, crawls the page, extracts page features and body text, stores the extraction result, analyzes the page's hyperlinks, and adds the newly obtained URLs to the distributed task queue after deduplication by the distributed BloomFilter, until the distributed task queue is empty.
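One way to picture pluggable component kernels is a registry keyed by host name, from which the Spider selects a strategy for each URL. The text does not say how kernels are registered or selected, so everything below (the `kernel` decorator, the `KERNELS` dictionary, the example domains, and the fields each kernel returns) is a hypothetical sketch.

```python
from urllib.parse import urlparse

KERNELS = {}   # host name -> kernel function (hypothetical registry)

def kernel(domain):
    """Decorator that registers a kernel for one data source."""
    def register(fn):
        KERNELS[domain] = fn
        return fn
    return register

@kernel("news.example.com")
def news_kernel(url, html):
    return {"type": "news", "url": url, "length": len(html)}

@kernel("forum.example.com")
def forum_kernel(url, html):
    return {"type": "forum", "url": url, "posts": html.count("<li>")}

def default_kernel(url, html):
    """Fallback strategy for hosts without a dedicated kernel."""
    return {"type": "generic", "url": url}

def select_kernel(url):
    """Pick the kernel whose data source matches the URL's host."""
    return KERNELS.get(urlparse(url).hostname, default_kernel)

result = select_kernel("http://forum.example.com/t/1")(
    "http://forum.example.com/t/1", "<li>a</li><li>b</li>")
print(result)
```

Supporting a new data source then only requires writing one more decorated function, which is the kind of extensibility the paragraph describes.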
The Redis-based distributed URL task queue stores the task URLs waiting to be crawled. Tasks are distributed in pull mode: when a Spider finishes its current crawl task, it actively pulls a new task from the distributed queue and starts the next round of work. Note that with the Redis-based distributed queue used here, pull mode is the simplest and best-suited choice. Push mode and a combined push-pull mode are also possible, but both require additional implementation work, whereas pull mode does not.
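A minimal sketch of pull-mode distribution follows. So that it runs without a Redis server, it uses `InMemoryList`, a hypothetical stand-in that mimics only the `lpush` and `rpop` list operations; the key name `crawler:url_queue` and the helper names are also assumptions.

```python
from collections import deque

class InMemoryList:
    """Stand-in exposing only the lpush/rpop subset of a Redis list."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, *values):
        q = self._lists.setdefault(key, deque())
        for value in values:
            q.appendleft(value)       # push onto the head, like LPUSH
        return len(q)

    def rpop(self, key):
        q = self._lists.get(key)
        return q.pop() if q else None  # pop from the tail, like RPOP

QUEUE_KEY = "crawler:url_queue"   # hypothetical key name

def push_tasks(store, urls):
    """Producer side: the Coordinator or a Spider adds new task URLs."""
    if urls:
        store.lpush(QUEUE_KEY, *urls)

def pull_task(store):
    """Consumer side: a Spider asks for work when it is ready; None means empty."""
    return store.rpop(QUEUE_KEY)

store = InMemoryList()
push_tasks(store, ["http://example.com/1", "http://example.com/2"])
first = pull_task(store)
second = pull_task(store)
third = pull_task(store)
print(first, second, third)
```

Because each Spider takes a task only when it is idle, faster nodes naturally take more tasks, and no central scheduler has to track node load.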
The Redis-based distributed BloomFilter serves the URL deduplication requests of all Spiders in the cluster. Redis uses key-value storage; the distributed BloomFilter uses a segmentation scheme that stores the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by the Spiders. Segmented optimistic locking works as follows. For each deduplication request, the Spider first computes the key of every segment containing a bit it needs to access and watches those keys. It then starts a Redis transaction that updates the bits in each segment. When the transaction executes, it first checks whether the bit vector under any watched key has been modified since the watch began. If so, the transaction is abandoned and the Spider spins, reissuing the deduplication request; otherwise, the update succeeds and the URL is added to the BloomFilter. A distributed BloomFilter built on segmentation and optimistic locking not only serves deduplication requests at high throughput but also scales linearly with the Redis cluster, so its capacity is not limited by a single Redis node.
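The segmented optimistic-locking idea can be sketched without a Redis server by giving every segment a version number and committing updates only if the versions seen at read time are unchanged. `SegmentStore` below is a hypothetical in-memory stand-in for the watch-then-transaction behaviour; the segment count, segment size, hash construction, and `bloom:<n>` key names are all assumptions, and the sketch is single-process only.

```python
import hashlib

class SegmentStore:
    """In-memory stand-in for Redis keys that hold bit-vector segments.

    Each key carries a version so that commit() can reject updates when a
    watched segment changed in the meantime (optimistic locking).
    """

    def __init__(self):
        self._data = {}   # key -> (version, bitmap stored as an int)

    def read(self, key):
        return self._data.get(key, (0, 0))

    def commit(self, watched, updates):
        """Apply updates only if no watched key changed; return success."""
        for key, version in watched.items():
            if self.read(key)[0] != version:
                return False
        for key, bits in updates.items():
            version, old = self.read(key)
            self._data[key] = (version + 1, old | bits)
        return True


class SegmentedBloomFilter:
    def __init__(self, store, segments=8, bits_per_segment=1024, hashes=3):
        self.store, self.segments = store, segments
        self.bits, self.hashes = bits_per_segment, hashes

    def _positions(self, url):
        """Yield (segment key, bit offset) for each hash of the URL."""
        total = self.segments * self.bits
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            pos = int.from_bytes(digest[:8], "big") % total
            yield f"bloom:{pos // self.bits}", pos % self.bits

    def add_if_new(self, url):
        """Return True if the URL was not seen before, and record it."""
        while True:   # spin until the optimistic commit succeeds
            watched, updates, seen = {}, {}, True
            for key, offset in self._positions(url):
                version, bitmap = self.store.read(key)
                watched[key] = version
                if not (bitmap >> offset) & 1:
                    seen = False
                updates[key] = updates.get(key, 0) | (1 << offset)
            if seen:
                return False          # probably a duplicate
            if self.store.commit(watched, updates):
                return True


bloom = SegmentedBloomFilter(SegmentStore())
first = bloom.add_if_new("http://example.com/a")   # new URL
again = bloom.add_if_new("http://example.com/a")   # same URL again
print(first, again)
```

As with any Bloom filter, a `False` result can occasionally be a false positive, so a small fraction of genuinely new URLs may be skipped. Only the segments touched by a request are watched, so requests that hit different segments do not force each other to retry.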
The embodiment also discloses a periodic incremental crawling method based on the above distributed crawler system, described in detail below with reference to Figs. 2 to 4.
Fig. 2 is the main flow chart of the periodic incremental crawling method in the embodiment, described as follows:
Step 1-0: start state of the periodic incremental crawling method;
Step 1-1: the Coordinator periodically imports tasks into the distributed URL task queue;
Step 1-2: determine whether the system's periodic incremental crawling should end; if so, go to step 1-9, otherwise go to step 1-3;
Step 1-3: determine whether the distributed task queue is currently empty; if so, enter the Spider sleep phase and perform steps 1-4 and 1-5; otherwise, enter the Spider crawl phase and perform steps 1-6, 1-7, and 1-8;
Step 1-4: block the crawl thread or put the Spider component to sleep;
Step 1-5: the blocked thread is woken; go to step 1-2;
Step 1-6: obtain a task from the distributed queue;
Step 1-7: the Spider executes the crawl task;
Step 1-8: wake blocked crawl threads or sleeping Spider components, then go to step 1-2;
Step 1-9: end state.
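The control flow of Fig. 2 can be sketched as a single-threaded loop. The sketch makes several simplifying assumptions: a plain Python list stands in for the distributed queue, `next_cycle` stands in for sleeping until the Coordinator's next import, the stop flag stands in for the ZooKeeper node set through the Monitor, and the wake-ups of step 1-8 are omitted. All names are hypothetical.

```python
def run_spider(queue, should_stop, crawl_one, sleep_until_woken):
    """Spider-side loop for steps 1-2 to 1-9; returns the number of pages crawled."""
    crawled = 0
    while not should_stop():            # step 1-2: should crawling end?
        if not queue:                   # step 1-3: is the queue empty?
            sleep_until_woken()         # steps 1-4 and 1-5: sleep, then wake
            continue
        url = queue.pop(0)              # step 1-6: obtain a task
        queue.extend(crawl_one(url))    # step 1-7: crawl and enqueue new tasks
        crawled += 1
    return crawled                      # step 1-9: end state


state = {"cycles_left": 2, "stop": False}
queue = []

def next_cycle():
    """Stand-in for sleeping until the Coordinator's next import (step 1-1)."""
    if state["cycles_left"] == 0:
        state["stop"] = True            # stand-in for the Monitor ending the crawl
        return
    state["cycles_left"] -= 1
    queue.append("http://example.com/seed")

LINKS = {"http://example.com/seed": ["http://example.com/a"]}   # toy link graph
total = run_spider(queue, lambda: state["stop"],
                   lambda url: LINKS.get(url, []), next_cycle)
print(total)
```

Each cycle re-imports the seed, so the seed page and the page it links to are crawled once per cycle; this is what makes the crawl incremental rather than one-off.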
Fig. 3 is a flow chart of the Spider crawl phase in the embodiment, described as follows:
Step 2-0: start state of the Spider crawl phase; this step follows step 1-3;
Step 2-1: obtain a task from the distributed queue;
Step 2-2: crawl the corresponding web page for the obtained task and store the result;
Step 2-3: analyze the hyperlinks in the crawled page to obtain a set of new tasks;
Step 2-4: send the new tasks to the distributed BloomFilter for deduplication;
Step 2-5: add the deduplicated new tasks to the distributed task queue;
Step 2-6: determine whether this Spider component has blocked crawl threads; if so, go to step 2-7, otherwise go to step 2-8;
Step 2-7: wake the blocked crawl threads of this Spider component;
Step 2-8: determine whether other Spider components in the cluster are asleep; if so, go to step 2-9, otherwise go to step 2-10;
Step 2-9: wake the sleeping Spider components;
Step 2-10: end state of the Spider crawl phase; step 1-2 is performed next.
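Steps 2-1 to 2-5 can be sketched with the Python standard library. The assumptions here are substantial: a list stands in for the distributed queue, a set stands in for the distributed BloomFilter, `fetch` is an injected callback rather than a real HTTP client, and the wake-up steps 2-6 to 2-9 are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def crawl_step(queue, seen, results, fetch):
    url = queue.pop(0)                    # step 2-1: obtain a task
    html = fetch(url)                     # step 2-2: crawl the page
    results[url] = html                   #           and store the result
    parser = LinkParser(url)              # step 2-3: analyze hyperlinks
    parser.feed(html)
    # step 2-4: deduplicate (the set stands in for the BloomFilter)
    fresh = [u for u in dict.fromkeys(parser.links) if u not in seen]
    seen.update(fresh)
    queue.extend(fresh)                   # step 2-5: enqueue new tasks
    return fresh

# Hypothetical page content standing in for a real fetch.
PAGES = {"http://example.com/":
         '<a href="/a">A</a><a href="/a">A again</a><a href="/">home</a>'}
queue, seen, results = ["http://example.com/"], {"http://example.com/"}, {}
new_urls = crawl_step(queue, seen, results, lambda u: PAGES.get(u, ""))
print(new_urls)
```

The repeated link and the link back to the already-seen start page are both filtered out, so only one new task reaches the queue.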
Fig. 4 is a flow chart of the Spider sleep phase in the embodiment, described as follows:
Step 3-0: start state of the Spider sleep phase; this step follows step 1-3;
Step 3-1: determine whether all crawl threads of this Spider component other than the current one have already blocked; if so, go to step 3-2, otherwise go to step 3-3;
Step 3-2: create a sleep flag node in ZooKeeper; the node indicates that the corresponding Spider component is asleep, and another component that needs to wake this Spider component only has to delete the node;
Step 3-3: block the current crawl thread;
Step 3-4: the crawl thread is now blocked and waits to be woken by another thread;
Step 3-5: the crawl thread is woken by another thread;
Step 3-6: end state of the Spider sleep phase; step 1-2 is performed next.
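The sleep phase maps naturally onto a condition variable shared by the crawl threads of one Spider. In the sketch below a Python set stands in for the sleep flag nodes in ZooKeeper, and `wake_all` both removes the flag and notifies the threads, whereas in the described system the wake-up would be triggered by another component deleting the node. The class and attribute names are hypothetical.

```python
import threading
import time

class SpiderSleepControl:
    def __init__(self, name, thread_count, flags):
        self.name = name
        self.thread_count = thread_count
        self.flags = flags            # stand-in for ZooKeeper sleep flag nodes
        self.cond = threading.Condition()
        self.blocked = 0              # crawl threads currently blocked
        self.wakeups = 0              # incremented on every wake-up

    def sleep(self):
        """Called by a crawl thread that found the task queue empty."""
        with self.cond:
            if self.blocked == self.thread_count - 1:   # step 3-1
                self.flags.add(self.name)               # step 3-2
            self.blocked += 1                           # step 3-3
            generation = self.wakeups
            # step 3-4: wait; the predicate guards against spurious wake-ups
            self.cond.wait_for(lambda: self.wakeups != generation)
            self.blocked -= 1                           # step 3-5

    def wake_all(self):
        """Called when new tasks arrive or a new crawl cycle starts."""
        with self.cond:
            self.flags.discard(self.name)
            self.wakeups += 1
            self.cond.notify_all()


flags = set()
ctl = SpiderSleepControl("spider-1", thread_count=2, flags=flags)
workers = [threading.Thread(target=ctl.sleep) for _ in range(2)]
for worker in workers:
    worker.start()
while True:                           # wait until both crawl threads have blocked
    with ctl.cond:
        if ctl.blocked == 2:
            break
    time.sleep(0.01)
asleep = "spider-1" in flags          # the whole Spider is now marked asleep
ctl.wake_all()
for worker in workers:
    worker.join()
print(asleep, flags, ctl.blocked)
```

The flag is created only when the last running crawl thread is about to block, so other components see a Spider as asleep only when none of its threads can make progress.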
Although embodiments of the invention have been described above with reference to the accompanying drawings, the invention is not limited to the specific embodiments and fields of application described; they are illustrative and instructive, not restrictive. Under the teaching of this specification, a person of ordinary skill in the art may devise many other forms without departing from the scope protected by the claims, and all of these fall within the protection of the invention.

Claims (10)

1. A distributed crawler system, characterized in that the system consists of three parts: ZooKeeper-based distributed services, system components, and databases, wherein the system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger, and a basic crawler component Spider; the databases include a Redis in-memory database; and the Redis in-memory database stores a distributed URL task queue and a distributed BloomFilter; wherein
the ZooKeeper-based distributed services provide distributed coordination services for each system component,
the system monitoring component Monitor is responsible for dynamic configuration and status monitoring of the system,
the coordination component Coordinator is responsible for one or more of: importing seed URLs into the Redis-based distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the log collection component Logger, and detecting and managing cluster nodes,
the log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster,
the basic crawler component Spider is responsible for handling web page crawling tasks,
the Redis-based distributed URL task queue is responsible for storing the task URLs to be crawled, and
the Redis-based distributed BloomFilter is responsible for the URL deduplication requests of all basic crawler components Spider in the cluster.
2. The distributed crawler system according to claim 1, characterized in that the ZooKeeper-based distributed services work in coordination with the system components and provide each system component with one or more of the following distributed services: dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation.
3. The distributed crawler system according to claim 1, characterized in that the system monitoring component Monitor has a Monitor interface through which a user can modify the system configuration parameters stored in ZooKeeper; the coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider in the cluster watch the corresponding data nodes in ZooKeeper, are notified after the content of a data node is modified, and then adjust themselves according to the modified configuration parameters.
4. The distributed crawler system according to claim 3, characterized in that the Monitor interface can also display in real time the system state parameters and component state parameters stored in ZooKeeper.
5. The distributed crawler system according to claim 1, characterized in that the basic crawler component Spider has multiple component kernels, and the crawl strategies of the component kernels are not all the same.
6. The distributed crawler system according to claim 1, characterized in that the basic crawler component Spider is highly extensible, so that a new component kernel can easily be written for a new data source.
7. The distributed crawler system according to claim 1, characterized in that tasks in the distributed URL task queue are distributed in pull mode, in which the basic crawler component Spider pulls the tasks.
8. The distributed crawler system according to claim 1, characterized in that the distributed BloomFilter uses a segmentation scheme to store the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by the basic crawler components Spider.
9. A periodic incremental crawling method, characterized in that it is based on the distributed crawler system according to any one of claims 1 to 8 and comprises: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; each Spider component sleeps or performs periodic incremental crawling according to the state of the distributed URL task queue; when there are no crawl tasks, the Spider component enters the sleep state, and a sleeping Spider component resumes crawling when it is woken by another basic crawler component Spider or by the Coordinator.
10. The periodic incremental crawling method according to claim 9, characterized in that it comprises the following steps:
S1. the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components;
S2. determine whether the system's periodic incremental crawling should end; if so, go to S6, otherwise go to S3;
S3. determine whether the distributed task queue is currently empty; if so, go to S4, otherwise go to S5;
S4. enter the basic crawler component Spider sleep phase, comprising the following steps: a) determine whether all crawl threads of the current Spider component other than the current one have already blocked; if so, go to step b), otherwise go to step c); b) create a sleep flag node in ZooKeeper, the node indicating that the current Spider component is asleep, such that another component that needs to wake the Spider component only has to delete the node; c) block the current crawl thread; d) the crawl thread is blocked and waits to be woken by another thread; e) the crawl thread is woken by another thread and goes to S2;
S5. enter the basic crawler component Spider crawl phase, comprising: a) obtain a crawl task from the distributed URL queue; b) crawl the corresponding web page for the obtained task and store the result; c) analyze the hyperlinks in the crawled page to obtain a set of new tasks; d) send the new tasks to the distributed BloomFilter for deduplication; e) add the deduplicated new tasks to the distributed task queue; f) determine whether the current Spider component has blocked crawl threads; if so, go to step g), otherwise go to step h); g) wake the blocked crawl threads of the current Spider component; h) determine whether other Spider components in the cluster are asleep; if so, wake the corresponding sleeping Spider components; otherwise, go to S2;
S6. end.
CN201710372282.1A 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method Active CN107193960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710372282.1A CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710372282.1A CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Publications (2)

Publication Number Publication Date
CN107193960A (en) 2017-09-22
CN107193960B (en) 2020-11-10

Family

ID=59874541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710372282.1A Active CN107193960B (en) 2017-05-24 2017-05-24 Distributed crawler system and periodic incremental grabbing method

Country Status (1)

Country Link
CN (1) CN107193960B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN103037010A (en) * 2012-12-26 2013-04-10 人民搜索网络股份公司 Distributed network crawler system and catching method thereof
CN103559219A (en) * 2013-10-18 2014-02-05 北京京东尚科信息技术有限公司 Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes
CN105243125A (en) * 2015-09-29 2016-01-13 北京京东尚科信息技术有限公司 PrestoDB cluster running method and apparatus, cluster and data query method and apparatus
CN105447097A (en) * 2015-11-10 2016-03-30 北京北信源软件股份有限公司 Data acquisition method and system
CN105868327A (en) * 2016-03-28 2016-08-17 浪潮软件集团有限公司 Distributed web crawler capturing method based on different updating strategies
KR101664712B1 (en) * 2015-06-19 2016-10-10 이화여자대학교 산학협력단 Bloomfilter query apparatus and method for identifying true positiveness without accessing hashtable
CN106383896A (en) * 2016-09-28 2017-02-08 浪潮软件集团有限公司 Crawler + RocktMQ-based data capturing and distributing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAKUSHEV, A.V.: "Technology for distributed crawling and analysis of big data from social media", 《DYNAMICS OF COMPLICATED SYSTEMS》 *
崔璨: "分布式垂直搜索引擎的研究与设计", 《中国优秀硕士学位论文全文数据库(电子期刊)信息科技辑》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657053A (en) * 2017-10-17 2018-02-02 山东浪潮云服务信息科技有限公司 Crawler implementation method and device
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 Distributed crawler framework based on an in-memory database and implementation method
CN109359231A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 Information crawling method, server and storage medium for a distributed web crawler
CN108376142A (en) * 2018-01-10 2018-08-07 北京思特奇信息技术股份有限公司 Distributed memory database data synchronization method and system
CN108376142B (en) * 2018-01-10 2021-05-14 北京思特奇信息技术股份有限公司 Distributed memory database data synchronization method and system
CN108133041A (en) * 2018-01-11 2018-06-08 四川九洲电器集团有限责任公司 Data collection system and method based on web crawlers and data transfer technology
CN109684058A (en) * 2018-12-18 2019-04-26 成都睿码科技有限责任公司 Efficient crawler platform capable of being linearly expanded for multiple tenants and using method thereof
CN109684058B (en) * 2018-12-18 2022-11-04 成都睿码科技有限责任公司 Efficient crawler platform capable of being linearly expanded for multiple tenants and using method thereof
CN109471979A (en) * 2018-12-20 2019-03-15 北京奇安信科技有限公司 Method, system, device and medium for crawling dynamic pages
CN109948079A (en) * 2019-03-11 2019-06-28 湖南衍金征信数据服务有限公司 Method for distributed crawling of public page data
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method
CN110929126A (en) * 2019-12-02 2020-03-27 杭州安恒信息技术股份有限公司 Distributed crawler scheduling method based on remote procedure call
CN111209460A (en) * 2019-12-27 2020-05-29 青岛海洋科学与技术国家实验室发展中心 Data acquisition system and method based on script crawler framework
CN111274013A (en) * 2020-01-16 2020-06-12 北京思特奇信息技术股份有限公司 Method and system for optimizing timed task scheduling based on memory database in container
CN112381317A (en) * 2020-11-26 2021-02-19 方是哲如管理咨询有限公司 Big data platform for organizational behavior analysis and result prediction

Also Published As

Publication number Publication date
CN107193960B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN107193960A (en) Distributed crawler system and periodic incremental grabbing method
CN103310012B (en) Distributed web crawler system
CN105677918B (en) Distributed crawler architecture based on Kafka and Quartz and implementation method thereof
CN107391719A (en) Distributed stream data processing method and system in a cloud environment
CN104969213B (en) Data stream splitting for low-latency data access
CN102118261B (en) Method and device for data acquisition, and network management equipment
CN106202346B (en) Data loading and cleaning engine, scheduling and storage system
CN103235820B (en) Data storage method and device in a cluster system
CN103970788A (en) Webpage-crawling-based crawler technology
CN109543067A (en) Artificial-intelligence-based system for real-time monitoring and analysis of enterprise production status
CN107943991A (en) Distributed crawler framework based on an in-memory database and implementation method
CN106599043A (en) Middleware used for multilevel database and multilevel database system
KR101617696B1 (en) Method and device for mining data regular expression
US8996677B2 (en) Information processing system and processing method arrangements providing load distribution and leveling on data collection units
CN108520024A (en) Binary cycle crawler system and its operation method based on Spark Streaming
CN107092627A (en) Columnar storage representation of records
CN104965935B (en) Method for updating network monitoring logs
CN108133041A (en) Data collection system and method based on web crawlers and data transfer technology
CN106817253A (en) Method and system for real-time monitoring of and alerting on log files
US20080275742A1 (en) Nested hierarchical rollups by level using a normalized table
CN110007905A (en) Method and system for generating a software development scheme based on big data
CN103870465B (en) Implementation method for a non-invasive database crawler
Gadepally et al. Version 0.1 of the BigDAWG polystore system
CN101499096A (en) Distributed crawler cluster system
Maabreh Optimizing Database Query Performance Using Table Partitioning Techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant