CN110442769A - Distributed data crawling system, method, apparatus, device and storage medium - Google Patents


Info

Publication number
CN110442769A
Authority
CN
China
Prior art keywords
address
task queue
cluster
crawl
crawls
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910717429.5A
Other languages
Chinese (zh)
Inventor
肖淋峰
吴志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lexin Software Technology Co Ltd
Original Assignee
Shenzhen Lexin Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lexin Software Technology Co Ltd
Priority to CN201910717429.5A
Publication of CN110442769A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention disclose a distributed data crawling system, method, apparatus, device and storage medium. The system comprises a task queue cluster and a data crawling cluster. The task queue cluster comprises at least one terminal and is provided with an initial task queue and an intermediate task queue, which are used to store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster comprises at least one terminal and is used to access the task queue cluster to obtain the starting crawl addresses and intermediate crawl addresses, and to crawl target web pages according to those addresses. By providing separate initial and intermediate task queues in the task queue cluster, the system makes it easier to change the task volume during data crawling, reduces the difficulty of resource scheduling, and improves data crawling efficiency.

Description

Distributed data crawling system, method, apparatus, device and storage medium
Technical field
Embodiments of the present invention relate to the field of computer application technology, and in particular to a distributed data crawling system, method, apparatus, device and storage medium.
Background
Society has entered the era of big data. Daily life depends increasingly on data, and data volume is growing explosively. Because data carries a great deal of value, how to obtain data has become an urgent problem.
In the prior art, network data is usually crawled with web crawlers. Common web crawler implementations include: building a crawler system with Python's Requests library or the Trep library; building a single-process, multi-threaded crawler system with the Scrapy framework; and building a memory-based distributed crawler system with the Scrapy framework and a Redis database. These technical solutions suffer from difficult resource scheduling and from a fixed job task volume that cannot be changed during crawling, so data crawling efficiency is poor.
Summary of the invention
The present invention provides a distributed data crawling system, method, apparatus, device and storage medium, so as to reduce the difficulty of resource scheduling and improve the efficiency of crawling network data.
In a first aspect, an embodiment of the present invention provides a distributed data crawling system, the system comprising:
a task queue cluster and a data crawling cluster;
wherein the task queue cluster comprises at least one terminal, an initial task queue and an intermediate task queue are provided in the task queue cluster, and the initial task queue and the intermediate task queue are used to store starting crawl addresses and intermediate crawl addresses, respectively;
the data crawling cluster comprises at least one terminal and is used to access the task queue cluster to obtain the starting crawl addresses and the intermediate crawl addresses, and to crawl target web pages according to the starting crawl addresses and the intermediate crawl addresses.
In a second aspect, an embodiment of the present invention further provides a distributed data crawling method, the method comprising:
when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address, accessing the task queue cluster to obtain the crawl address;
obtaining a target web page according to the crawl address;
obtaining the addresses to be crawled in the target web page, and uploading the addresses to be crawled to the task queue cluster.
In a third aspect, an embodiment of the present invention further provides a distributed data crawling method, the method comprising:
when an access request from the data crawling cluster is detected, sending a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request;
obtaining the addresses to be crawled that the data crawling cluster uploads, and storing the addresses to be crawled in the intermediate task queue.
In a fourth aspect, an embodiment of the present invention further provides a distributed data crawling apparatus, the apparatus comprising:
an address acquisition module, configured to access the task queue cluster to obtain a crawl address when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address;
a web page crawling module, configured to obtain a target web page according to the crawl address;
an address uploading module, configured to obtain the addresses to be crawled in the target web page and upload the addresses to be crawled to the task queue cluster.
In a fifth aspect, an embodiment of the present invention further provides a distributed data crawling apparatus, the apparatus comprising:
an address sending module, configured to send a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request when an access request from the data crawling cluster is detected;
an address storage module, configured to obtain the addresses to be crawled that the data crawling cluster uploads and store the addresses to be crawled in the intermediate task queue.
In a sixth aspect, a device is further provided, the device comprising:
one or more processors;
a memory, configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the distributed data crawling method described in any embodiment of the present invention.
In a seventh aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the program is executed by a processor, it implements the distributed data crawling method described in any embodiment of the present invention.
In the technical solution of the embodiments of the present invention, a task queue cluster and a data crawling cluster are provided. The task queue cluster comprises multiple terminals and is provided with an initial task queue and an intermediate task queue, which are used to store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster comprises multiple terminals and is used to access the task queue cluster to obtain the starting crawl addresses and intermediate crawl addresses and to crawl target web pages according to them. This realizes network data crawling under a distributed architecture, makes it more convenient to change the crawling job task volume while crawling is in progress, and can increase the speed of network data crawling.
Brief description of the drawings
Fig. 1 is an architecture diagram of a distributed data crawling system provided by Embodiment one of the present invention;
Fig. 2 is an architecture diagram of a distributed data crawling system provided by Embodiment two of the present invention;
Fig. 3 is an example diagram of a distributed data crawling system provided by Embodiment two of the present invention;
Fig. 4 is a flowchart of the steps of a distributed data crawling method provided by Embodiment three of the present invention;
Fig. 5 is a flowchart of the steps of a distributed data crawling method provided by Embodiment four of the present invention;
Fig. 6 is a structural diagram of a distributed data crawling apparatus provided by Embodiment five of the present invention;
Fig. 7 is a structural diagram of a distributed data crawling apparatus provided by Embodiment six of the present invention;
Fig. 8 is a structural diagram of a device provided by Embodiment seven of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure. In addition, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
Embodiment one
Fig. 1 is an architecture diagram of a distributed data crawling system provided by Embodiment one of the present invention. This embodiment is applicable to crawling network data. Referring to Fig. 1, the system may comprise a task queue cluster 10 and a data crawling cluster 11.
The task queue cluster 10 comprises at least one terminal 101, and an initial task queue and an intermediate task queue are provided in the task queue cluster 10, used to store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster 11 comprises at least one terminal 111 and is used to access the task queue cluster 10 to obtain the starting crawl addresses and intermediate crawl addresses, and to crawl target web pages according to the starting crawl addresses and the intermediate crawl addresses.
In this embodiment, the task queue cluster 10 may consist of multiple terminals 101 and can be used to run the initial task queue and the intermediate task queue; the multiple terminals may jointly act as the server that runs the two queues. The initial task queue is a queue that stores starting crawl addresses, so its elements may be starting crawl addresses. While the distributed data crawling system is crawling data, the job task volume of the system can be changed by adding or deleting starting crawl addresses in the initial task queue.
Specifically, the elements of the intermediate task queue in the task queue cluster 10 may be intermediate crawl addresses. An intermediate crawl address may be a crawl address that the system obtains while crawling web page data according to a starting crawl address; it may be a sub-address of the starting crawl address and may be generated from the starting crawl address.
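The two-queue arrangement described above can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class name TaskQueueCluster and its methods are hypothetical, and a plain deque stands in for whatever queue service the cluster actually runs. The sketch only shows how adding or deleting starting crawl addresses changes the pending workload.

```python
from collections import deque


class TaskQueueCluster:
    """In-memory stand-in for the task queue cluster (hypothetical name)."""

    def __init__(self):
        self.start_queue = deque()         # starting crawl addresses
        self.intermediate_queue = deque()  # addresses discovered while crawling

    def add_start(self, url):
        """Increase the workload by queuing one more starting address."""
        self.start_queue.append(url)

    def remove_start(self, url):
        """Decrease the workload by withdrawing a starting address not yet handed out."""
        if url in self.start_queue:
            self.start_queue.remove(url)
            return True
        return False

    def pending(self):
        """Number of addresses still waiting in both queues."""
        return len(self.start_queue) + len(self.intermediate_queue)


cluster = TaskQueueCluster()
cluster.add_start("https://example.com/a")
cluster.add_start("https://example.com/b")
cluster.remove_start("https://example.com/a")
print(cluster.pending())  # 1
```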
In this embodiment, the data crawling cluster 11 may comprise multiple terminals 111, and each terminal 111 can crawl network data according to the starting crawl addresses and intermediate crawl addresses it obtains. For example, a crawler program may be deployed on a terminal 111 to crawl data according to the obtained addresses. The data crawling cluster 11 may access the task queue cluster 10 at regular intervals to obtain starting crawl addresses and/or intermediate crawl addresses. When it determines that the initial task queue and the intermediate task queue in the task queue cluster 10 contain starting crawl addresses and/or intermediate crawl addresses, it can obtain those addresses from the two queues.
In the technical solution of this embodiment, an initial task queue and an intermediate task queue are provided in the task queue cluster and store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster accesses the task queue cluster to obtain the starting crawl addresses and intermediate crawl addresses and crawls network data according to them. This realizes network data crawling under a distributed architecture, makes it easier to change the job task volume while crawling is in progress, and improves resource scheduling efficiency and data crawling speed.
Embodiment two
Fig. 2 is an architecture diagram of a distributed data crawling system provided by Embodiment two of the present invention. This embodiment is a more specific form of the foregoing embodiment. Referring to Fig. 2, the distributed data crawling system provided by this embodiment comprises a task queue cluster 20 and a data crawling cluster 21. The task queue cluster 20 comprises at least one terminal, and an initial task queue 202 and an intermediate task queue 201 are provided in the task queue cluster 20, used to store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster 21 comprises at least one terminal and is used to access the task queue cluster to obtain the starting crawl addresses and intermediate crawl addresses, and to crawl target web pages according to the starting crawl addresses and the intermediate crawl addresses.
The task queue cluster 20 is further used to remove duplicate intermediate crawl addresses. The data crawling cluster 21 is further used to obtain the addresses to be crawled in a target web page and upload them to the intermediate task queue 201 of the task queue cluster 20 as intermediate crawl addresses. The initial task queue and the intermediate task queue are implemented with the Kafka message system.
In this embodiment, the data crawling cluster 21 can upload the intermediate crawl addresses generated during crawling to the task queue cluster 20. Before storing a received intermediate crawl address in the intermediate task queue, the task queue cluster 20 can deduplicate it, so that an address already stored in the intermediate task queue is dropped. This ensures that the distributed data crawling system does not perform repeated tasks and thereby improves crawling efficiency. After deduplication, the task queue cluster 20 stores the remaining intermediate crawl addresses in the intermediate task queue 201.
Specifically, the intermediate task queue and the initial task queue in the task queue cluster 20 can be implemented with the Kafka message system, each queue being a topic queue in that system. Starting crawl addresses and intermediate crawl addresses are the messages in the corresponding topic queues, and each terminal in the data crawling cluster 21 can subscribe to the topics of the intermediate task queue and the initial task queue. When the intermediate task queue and the initial task queue contain starting crawl addresses and intermediate crawl addresses, that is, when the topic queues contain messages, the terminals in the data crawling cluster 21 can access the queues to obtain those addresses. When the intermediate task queue and the initial task queue contain no messages, the data crawling cluster 21 can pause automatically.
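The subscribe-and-pause behaviour can be sketched without a message broker. In the following illustration, two standard-library queues stand in for the two topics; the function name next_address, the short wait time and the choice to check the initial queue first are assumptions made for the example, not details from the patent.

```python
import queue


def next_address(start_topic, intermediate_topic, wait_seconds=0.05):
    """Return (topic_name, url), or None when both topics are empty.

    Returning None models the terminal pausing; a real subscriber would
    simply keep blocking until a message arrives.
    """
    for name, topic in (("start", start_topic), ("intermediate", intermediate_topic)):
        try:
            return name, topic.get(timeout=wait_seconds)
        except queue.Empty:
            continue
    return None


start_topic = queue.Queue()
intermediate_topic = queue.Queue()
start_topic.put("https://example.com/")
intermediate_topic.put("https://example.com/page1")

first = next_address(start_topic, intermediate_topic)
second = next_address(start_topic, intermediate_topic)
third = next_address(start_topic, intermediate_topic)  # both empty: pause
```

Here first comes from the initial queue, second from the intermediate queue, and third is None because nothing is left to crawl.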
In the technical solution of this embodiment, an initial task queue and an intermediate task queue are provided in the task queue cluster and store starting crawl addresses and intermediate crawl addresses, respectively. The data crawling cluster accesses the task queue cluster to obtain the starting crawl addresses and intermediate crawl addresses and crawls network data according to them. The data crawling cluster also uploads the intermediate crawl addresses generated during crawling to the task queue cluster, which deduplicates the uploaded addresses and stores the deduplicated addresses in its intermediate task queue. The initial task queue and the intermediate task queue can be implemented with the Kafka message system. This realizes data crawling under a distributed architecture, avoids wasting performance, improves the resource scheduling capability of the distributed data crawling system, and increases the speed of network data crawling.
By way of example, Fig. 3 is an example diagram of a distributed data crawling system provided by Embodiment two of the present invention. Referring to Fig. 3, each terminal 31 in the data crawling cluster may be configured with the scrapy crawler framework, including the crawler engine scrapy engine, the crawler Kafka_spiders and the scheduler Kafka_scheduler. The crawler engine is responsible for passing data between the modules of the scrapy crawler framework; the crawler crawls data according to crawl addresses; the scheduler receives the requests sent by the crawler engine, queues them, and returns them when the crawler engine needs them. Specifically, because scrapy is a single-machine data crawling framework, combining it with the Kafka message system requires modifying the source code of scrapy's spider module so that the starting crawl addresses start_urls are obtained from the Kafka message system, and so that the spider stays suspended while the Kafka topic contains no crawl address url. In this way crawling starts automatically when a crawl address url is present in the Kafka topic and pauses automatically when there is none. Every terminal in the data crawling cluster connects to the same Kafka topic, so all terminals share the crawling load. In the task queue cluster, two message queues can be set up through the Kafka message system: the topic spidername_request serves as intermediate task queue 32 and the topic spidername_start_urls serves as initial task queue 33, and each terminal 31 in the data crawling cluster can subscribe to both message queues. The task queue cluster can also deduplicate the intermediate crawl addresses it receives using the duperfilter algorithm, ensuring that identical intermediate crawl addresses are not stored repeatedly in the topic spidername_request. When a terminal 31 in the data crawling cluster starts crawling data, it can obtain a url from spider_start_urls as the starting crawl address; intermediate crawl addresses obtained during crawling are uploaded to the topic spidername_request, and the terminals in the data crawling cluster can each obtain intermediate crawl addresses from the topic spidername_request and crawl them.
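The load sharing among terminals connected to the same topic can be imitated with threads that draw from one shared queue. This is only a local simulation under stated assumptions: a thread stands in for a terminal, queue.Queue stands in for the Kafka topic, and no Scrapy or Kafka API is used.

```python
import queue
import threading

shared_topic = queue.Queue()  # stand-in for the single shared topic
urls = [f"https://example.com/item/{i}" for i in range(20)]
for url in urls:
    shared_topic.put(url)

handled = []                  # (terminal_id, url) pairs
handled_lock = threading.Lock()


def terminal(terminal_id):
    """Take addresses from the shared topic until it is drained."""
    while True:
        try:
            url = shared_topic.get_nowait()
        except queue.Empty:
            return
        with handled_lock:
            handled.append((terminal_id, url))


workers = [threading.Thread(target=terminal, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every address is handled exactly once, whichever terminal took it.
assert sorted(url for _, url in handled) == sorted(urls)
```

How the addresses are split between the three workers varies from run to run; only the exactly-once property is fixed.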
Embodiment three
Fig. 4 is a flowchart of the steps of a distributed data crawling method provided by Embodiment three of the present invention. This embodiment is applicable to the data crawling cluster of a distributed crawler system, which may consist of multiple terminals. The method can be executed by the distributed data crawling apparatus of the embodiments of the present invention, and the apparatus can be implemented in software and/or hardware. The method of this embodiment specifically comprises the following steps:
Step 301: when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address, access the task queue cluster to obtain the crawl address.
The initial task queue is the queue that stores initial tasks; the terminals in the data crawling cluster start crawling network data according to the crawl addresses in the initial task queue. The intermediate task queue is the queue that stores the new crawl addresses obtained while network data is being crawled. Crawl addresses in the intermediate task queue may be uploaded by terminals in the data crawling cluster and may also be obtained by terminals in the data crawling cluster. It should be understood that the terminal that uploads a crawl address may be the same as or different from the terminal that obtains it.
In this embodiment, each terminal in the data crawling cluster can check the initial task queue and the intermediate task queue in the task queue cluster. When a terminal determines that the initial task queue and the intermediate task queue contain crawl addresses, it can access the task queue cluster and obtain crawl addresses from the two queues. It should be understood that a terminal that has not obtained a crawl address can go back and continue checking the initial task queue and the intermediate task queue.
Step 302: obtain a target web page according to the crawl address.
The crawl address is an address that a terminal in the data crawling cluster has obtained from the task queue cluster; it may be an intermediate crawl address from the intermediate task queue or a starting crawl address from the initial task queue. The terminal crawls the target web page corresponding to the crawl address to obtain data.
Specifically, a terminal of the data crawling cluster that has obtained a crawl address can access the corresponding target web page, download the page once it has been accessed, and analyze the downloaded page to obtain the network data to be crawled.
Step 303: obtain the addresses to be crawled in the target web page, and upload the addresses to be crawled to the task queue cluster.
An address to be crawled is a web page address that appears in the target web page; the page it points to may be a page associated with the target web page or one of its sub-pages.
Specifically, the data crawling cluster can obtain from the target web page the web page addresses associated with it and treat them as addresses to be crawled. After obtaining the addresses to be crawled, the data crawling cluster uploads them to the task queue cluster, which stores them in the intermediate task queue as intermediate crawl addresses.
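Step 303 can be sketched with the Python standard library alone. The example below is an illustration, not the method claimed: it assumes that the addresses to be crawled are the href values of anchor tags, and the names LinkCollector and addresses_to_crawl are hypothetical. Relative links are resolved against the address of the target web page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def addresses_to_crawl(page_url, html):
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


html = '<a href="/news/1">one</a> <a href="https://example.org/x">two</a>'
found = addresses_to_crawl("https://example.com/index.html", html)
# found == ["https://example.com/news/1", "https://example.org/x"]
```

The resulting list is what a terminal would upload to the task queue cluster for storage in the intermediate task queue.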
In the technical solution of this embodiment, when it is detected that the initial task queue and the intermediate task queue provided in the task queue cluster store crawl addresses, the task queue cluster is accessed to obtain a crawl address; the target web page is obtained according to the crawl address; and the addresses to be crawled in the target web page are obtained and uploaded to the task queue cluster. Crawl addresses come from both the initial task queue and the intermediate task queue, and each terminal obtains crawl addresses according to its own needs, so resource scheduling is flexible and the efficiency of crawling network data is improved.
Further, on the basis of the foregoing embodiment, obtaining the target web page according to the crawl address comprises: downloading the target web page according to the crawl address and obtaining the web page content of the target web page; determining a processing rule for the target web page according to the web page content; and extracting the target content according to the processing rule to complete the acquisition of the target web page.
The web page content is the information in the target web page and may include text, pictures, video, audio and so on. A processing rule is a rule by which web page content is crawled, for example a rule for obtaining the text in a table; different web page content can correspond to different processing rules.
In this embodiment, each terminal in the data crawling cluster can access the target web page according to the crawl address and download it. After the download, the terminal determines the web page content contained in the target web page, determines the corresponding processing rule according to that content, and processes the web page content according to the processing rule to obtain the network data, thereby completing the acquisition of the target web page.
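Choosing a processing rule from the page content might look like the sketch below. The two rules, the regular expressions and the test used to choose between them are illustrative assumptions; the patent does not specify how rules are represented, and a production crawler would normally use a real HTML parser rather than regular expressions.

```python
import re


def extract_table_cells(html):
    """Rule for pages whose content sits in a table: return the cell text."""
    cells = re.findall(r"<td\b[^>]*>(.*?)</td>", html, flags=re.S | re.I)
    return [cell.strip() for cell in cells]


def extract_paragraphs(html):
    """Fallback rule: return the text of each paragraph."""
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    return [p.strip() for p in paragraphs]


def choose_rule(html):
    """Pick a processing rule from the page content (the test is illustrative)."""
    if re.search(r"<table\b", html, flags=re.I):
        return extract_table_cells
    return extract_paragraphs


def extract(html):
    return choose_rule(html)(html)


table_result = extract("<table><tr><td> 42 </td><td>ok</td></tr></table>")
text_result = extract("<p>hello</p>")
# table_result == ['42', 'ok'] and text_result == ['hello']
```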
Embodiment four
Fig. 5 is a flowchart of the steps of a distributed data crawling method provided by Embodiment four of the present invention. This embodiment is applicable to the task queue cluster of a distributed crawler system, which may comprise multiple terminals. The method can be executed by the distributed data crawling apparatus of the embodiments of the present invention, and the apparatus can be implemented in software and/or hardware. The method of this embodiment specifically comprises the following steps:
Step 401: when an access request from the data crawling cluster is detected, send a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request.
The access request is the request by which a terminal in the data crawling cluster asks to access the task queue cluster; it may include the identification number of that terminal.
In this embodiment, the task queue cluster can watch for access requests. When it detects one, it can send a starting crawl address and/or an intermediate crawl address to the data crawling cluster terminal that issued the request, as identified by the source of the request. Starting crawl addresses and intermediate crawl addresses are stored in the initial task queue and the intermediate task queue of the task queue cluster. After a starting crawl address or an intermediate crawl address has been sent, it can be removed from the initial task queue or the intermediate task queue, respectively.
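Step 401 can be sketched from the task queue cluster's side as follows. The request format (a dict with a terminal_id key), the batch size and the rule of serving the initial queue before the intermediate queue are assumptions made for illustration only.

```python
from collections import deque

start_queue = deque(["https://example.com/"])
intermediate_queue = deque(["https://example.com/a", "https://example.com/b"])


def handle_access_request(request, batch_size=2):
    """Return up to batch_size addresses for the requesting terminal.

    Addresses are popped as they are handed out, so they leave the queues
    once sent. `request` is a dict with a hypothetical 'terminal_id' key.
    """
    sent = []
    for q in (start_queue, intermediate_queue):
        while q and len(sent) < batch_size:
            sent.append(q.popleft())
    return {"terminal_id": request["terminal_id"], "addresses": sent}


reply = handle_access_request({"terminal_id": "terminal-7"})
# reply["addresses"] == ["https://example.com/", "https://example.com/a"]
```

After the call, the initial queue is empty and only https://example.com/b remains in the intermediate queue.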
Step 402: obtain the addresses to be crawled that the data crawling cluster uploads, and store the addresses to be crawled in the intermediate task queue.
An address to be crawled is a crawl address that the data crawling cluster obtains while crawling data.
Specifically, the task queue cluster can obtain the uploaded addresses to be crawled and store them in its intermediate task queue, where they serve as intermediate crawl addresses.
In the technical solution of this embodiment, when an access request from the data crawling cluster is detected, a starting crawl address and/or an intermediate crawl address is sent to the data crawling cluster according to the access request, and the addresses to be crawled that the data crawling cluster uploads are obtained and stored in the intermediate task queue. This realizes distributed data crawling and improves crawling efficiency.
Further, on the basis of the foregoing embodiment, before the address to be crawled is stored in the intermediate task queue, the method further comprises:
generating verification information for the address to be crawled; judging whether the verification information exists in a verification information library; if so, discarding the address to be crawled; if not, storing the verification information in the verification information library.
The verification information is information generated from the address to be crawled and used to check for repeats; it can be generated by a hash operation. The verification information library is the storage area that holds verification information and may be a Redis database configured in the task queue cluster.
In this embodiment, the task queue cluster can process an address to be crawled to generate its verification information, for example with a hash algorithm. Once the verification information has been obtained, the task queue cluster judges whether it exists in the verification information library. If the verification information is already stored in the library, the address to be crawled has already been stored in the intermediate task queue and need not be stored again. If the verification information is not in the library, the address to be crawled has not yet been stored in the intermediate task queue and can be stored in the intermediate task queue of the task queue cluster.
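The verification step can be sketched as a hash-based duplicate check. The patent names neither the hash function nor the library layout, so SHA-1 and a Python set are assumptions here; the set stands in for the Redis database mentioned above, and the function names are hypothetical.

```python
import hashlib
from collections import deque

verification_library = set()   # stand-in for the Redis-backed library
intermediate_queue = deque()


def verification_info(url):
    """Hash the address to a fixed-length fingerprint (SHA-1 is an assumption)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def store_if_new(url):
    """Queue the address unless its fingerprint has been seen before."""
    info = verification_info(url)
    if info in verification_library:
        return False            # duplicate: discard the address
    verification_library.add(info)
    intermediate_queue.append(url)
    return True


results = [
    store_if_new(u)
    for u in ("https://example.com/a", "https://example.com/b", "https://example.com/a")
]
# results == [True, True, False]
```

Storing fingerprints rather than full addresses keeps the library entries at a fixed size regardless of URL length.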
Embodiment five
Fig. 6 is a structural diagram of a distributed data crawling apparatus provided by Embodiment five of the present invention. The apparatus can execute the distributed data crawling method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to that method. The apparatus can be implemented in software and/or hardware and specifically comprises an address acquisition module 401, a web page crawling module 402 and an address uploading module 403.
The address acquisition module 401 is configured to access the task queue cluster to obtain a crawl address when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address.
The web page crawling module 402 is configured to obtain a target web page according to the crawl address.
The address uploading module 403 is configured to obtain the addresses to be crawled in the target web page and upload the addresses to be crawled to the task queue cluster.
The technical solution of the embodiment of the present invention detects that the starting of task queue cluster setting is appointed by address acquisition module When being engaged in storing in queue and intermediate task queue by crawling address, access task queue cluster acquisition crawls address, web page crawl Module obtains target webpage according to address is crawled, and address uploading module obtains the address to be crawled in target webpage, and will be wait climb Address is taken to upload to task queue cluster, the acquisition source for crawling address includes initial task queue and intermediate task queue, end End is according to oneself needing to obtain to crawl address, and scheduling of resource is flexible, and improve network data crawls efficiency.
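The cooperation of the three modules can be sketched as a single crawl step. This is only an illustration under stated assumptions: the function names `crawl_once`, `fake_fetch`, `split_addresses` and `run_demo` are hypothetical, the two queues are modelled as local `deque` objects rather than a remote task queue cluster, page download is replaced by a dictionary lookup, and checking the initial queue before the intermediate queue is an assumed policy.

```python
from collections import deque


def crawl_once(initial_queue, intermediate_queue, fetch_page, extract_addresses):
    """Run one crawl step; return the crawled address, or None if both queues are empty."""
    # Address acquisition: take a crawl address from whichever queue has one
    # (initial queue first is an assumption, not a rule from the text).
    if initial_queue:
        address = initial_queue.popleft()
    elif intermediate_queue:
        address = intermediate_queue.popleft()
    else:
        return None
    # Web page crawling: obtain the target web page for the crawl address.
    page = fetch_page(address)
    # Address uploading: send addresses found in the page back to the queue.
    for found in extract_addresses(page):
        intermediate_queue.append(found)
    return address


# Hypothetical three-page site used instead of real network access.
FAKE_SITE = {
    "http://example.com/": "http://example.com/a http://example.com/b",
    "http://example.com/a": "",
    "http://example.com/b": "",
}


def fake_fetch(address):
    return FAKE_SITE.get(address, "")


def split_addresses(page):
    return page.split()


def run_demo():
    initial = deque(["http://example.com/"])
    intermediate = deque()
    crawled = []
    while True:
        address = crawl_once(initial, intermediate, fake_fetch, split_addresses)
        if address is None:
            break
        crawled.append(address)
    return crawled
```

In a deployment with several terminals, each terminal would run such a loop independently against the shared queues, which is what allows the crawl to be distributed.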
Further, the web page crawling module includes:
A content acquisition unit, configured to download the target web page according to the crawl address and obtain the web page content of the target web page.
A rule determination unit, configured to determine a processing rule for the target web page according to the web page content.
A content extraction unit, configured to extract target content according to the processing rule, thereby completing the acquisition of the target web page.
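The three units can be sketched as three small functions chained together. The sketch assumes that a "processing rule" is simply a function selected by inspecting the page content; the rule functions `extract_title` and `extract_price`, the selection test in `determine_rule`, and the `pages` dictionary that replaces a real download are all hypothetical.

```python
import re


def acquire_content(address, pages):
    """Content acquisition: 'download' the page (here, a dictionary lookup)."""
    return pages[address]


def extract_title(content):
    """Assumed rule for ordinary pages: take the text of the <title> element."""
    match = re.search(r"<title>(.*?)</title>", content, re.S)
    return match.group(1).strip() if match else None


def extract_price(content):
    """Assumed rule for product pages: take the text of the price element."""
    match = re.search(r'<span class="price">(.*?)</span>', content, re.S)
    return match.group(1).strip() if match else None


def determine_rule(content):
    """Rule determination: choose a processing rule from the web page content."""
    if 'class="price"' in content:
        return extract_price
    return extract_title


def crawl_target(address, pages):
    """Content extraction: apply the selected rule to obtain the target content."""
    content = acquire_content(address, pages)
    rule = determine_rule(content)
    return rule(content)
```

Selecting the rule from the content, rather than from the address, lets one crawler handle pages of different layouts without a per-site configuration.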
Embodiment six
Fig. 7 is a structural schematic diagram of a distributed data crawling device provided by Embodiment six of the present invention. The device can execute the distributed data crawling method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to the executed method. The device may be implemented by software and/or hardware and specifically includes: an address sending module 501 and an address storage module 502.
The address sending module 501 is configured to, when an access request from the data crawling cluster is detected, send a start crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request.
The address storage module 502 is configured to obtain the addresses to be crawled that are uploaded by the data crawling cluster and store the addresses to be crawled in the intermediate task queue.
In the technical solution provided by this embodiment of the present invention, when the address sending module detects an access request from the data crawling cluster, it sends a start crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request; the address storage module obtains the addresses to be crawled that are uploaded by the data crawling cluster and stores them in the intermediate task queue. Distributed data crawling is thereby realized, and the efficiency of data crawling is improved.
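The task queue side can be sketched in the same spirit. This is an illustrative model only: the class `TaskQueueCluster`, its method names and the `source` parameter of the access request are assumptions, and the queues are local `deque` objects rather than a distributed message system.

```python
from collections import deque


class TaskQueueCluster:
    """Hypothetical single-process model of the task queue cluster."""

    def __init__(self, start_addresses):
        self.initial_queue = deque(start_addresses)   # start crawl addresses
        self.intermediate_queue = deque()             # intermediate crawl addresses

    def handle_access_request(self, source="any"):
        """Address sending: return one crawl address for the request, or None.

        `source` is an assumed request field: "initial", "intermediate" or "any".
        """
        if source in ("initial", "any") and self.initial_queue:
            return self.initial_queue.popleft()
        if source in ("intermediate", "any") and self.intermediate_queue:
            return self.intermediate_queue.popleft()
        return None

    def handle_upload(self, addresses):
        """Address storage: store uploaded addresses in the intermediate task queue."""
        for address in addresses:
            self.intermediate_queue.append(address)
        return len(self.intermediate_queue)
```

A crawler that wants only follow-up work could pass `source="intermediate"`, while one with no preference leaves the default, which is one way to read the "and/or" in the description.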
Further, on the basis of the above embodiment of the invention, the device further includes:
A verification generation module, configured to generate verification information for the address to be crawled.
A verification judgment module, configured to judge whether the verification information exists in the verification information library.
A deduplication execution module, configured to discard the address to be crawled if the verification information exists, and to store the verification information in the verification information library if it does not.
Embodiment seven
Fig. 8 is a structural schematic diagram of a device provided by Embodiment seven of the present invention. As shown in Fig. 8, the device includes a processor 70, a memory 71, an input device 72 and an output device 73. The number of processors 70 in the device may be one or more; one processor 70 is taken as an example in Fig. 8. The processor 70, the memory 71, the input device 72 and the output device 73 in the device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 8.
As a computer-readable storage medium, the memory 71 can be used to store software programs, computer-executable programs and modules, such as the program modules corresponding to the distributed data crawling method in the embodiments of the present invention (for example, the address acquisition module 401, the web page crawling module 402 and the address uploading module 403 in the distributed data crawling device, or the address sending module 501 and the address storage module 502). By running the software programs, instructions and modules stored in the memory 71, the processor 70 executes the various functional applications and data processing of the device, that is, implements the distributed data crawling method described above.
The memory 71 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system and the application programs required by at least one function, and the data storage area can store data created through use of the terminal, and the like. In addition, the memory 71 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device. In some examples, the memory 71 may further include memory located remotely from the processor 70, and such remote memory may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The input device 72 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the device. The output device 73 may include a display device such as a display screen.
Embodiment eight
Embodiment eight of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a distributed data crawling method, the method comprising:
when it is detected that the initial task queue and/or the intermediate task queue set in the task queue cluster stores a crawl address, accessing the task queue cluster to obtain the crawl address;
obtaining a target web page according to the crawl address;
obtaining the addresses to be crawled in the target web page, and uploading the addresses to be crawled to the task queue cluster.
Alternatively, the method comprises:
when an access request from the data crawling cluster is detected, sending a start crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request;
obtaining the addresses to be crawled that are uploaded by the data crawling cluster, and storing the addresses to be crawled in the intermediate task queue.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform related operations in the distributed data crawling method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software together with the necessary general-purpose hardware, and of course may also be implemented by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiments of the distributed data crawling device, the units and modules included are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. A distributed data crawling system, characterized by comprising: a task queue cluster and a data crawling cluster;
wherein the task queue cluster comprises at least one terminal, an initial task queue and an intermediate task queue are provided in the task queue cluster, and the initial task queue and the intermediate task queue are respectively used to store start crawl addresses and intermediate crawl addresses;
the data crawling cluster comprises at least one terminal and is used to access the task queue cluster to obtain a start crawl address and an intermediate crawl address, and to crawl target web pages according to the start crawl address and the intermediate crawl address.
2. The system according to claim 1, characterized in that the task queue cluster is further used to remove duplicate intermediate crawl addresses.
3. The system according to claim 1, characterized in that the data crawling cluster is further used to obtain the addresses to be crawled in the target web page and to upload the addresses to be crawled to the intermediate task queue of the task queue cluster as intermediate crawl addresses.
4. The system according to claim 1, characterized in that the initial task queue and the intermediate task queue are implemented by a Kafka message system.
5. A distributed data crawling method, characterized in that it is applied to a data crawling cluster of a distributed crawler system and comprises:
when it is detected that the initial task queue and/or the intermediate task queue set in a task queue cluster stores a crawl address, accessing the task queue cluster to obtain the crawl address;
obtaining a target web page according to the crawl address;
obtaining the addresses to be crawled in the target web page, and uploading the addresses to be crawled to the task queue cluster.
6. The method according to claim 5, characterized in that obtaining the target web page according to the crawl address comprises:
downloading the target web page according to the crawl address and obtaining the web page content of the target web page;
determining a processing rule for the target web page according to the web page content;
extracting target content according to the processing rule, thereby completing the acquisition of the target web page.
7. A distributed data crawling method, characterized in that it is applied to a task queue cluster of a distributed crawler system and comprises:
when an access request from a data crawling cluster is detected, sending a start crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request;
obtaining the addresses to be crawled that are uploaded by the data crawling cluster, and storing the addresses to be crawled in an intermediate task queue.
8. The method according to claim 7, characterized in that, before storing the addresses to be crawled in the intermediate task queue, the method further comprises:
generating verification information for the address to be crawled;
judging whether the verification information exists in a verification information library;
if so, discarding the address to be crawled; if not, storing the verification information in the verification information library.
9. A distributed data crawling device, characterized in that it is applied to a data crawling cluster of a distributed crawler system and comprises:
an address acquisition module, configured to access the task queue cluster to obtain a crawl address when it is detected that the initial task queue and/or the intermediate task queue set in the task queue cluster stores a crawl address;
a web page crawling module, configured to obtain a target web page according to the crawl address;
an address uploading module, configured to obtain the addresses to be crawled in the target web page and upload the addresses to be crawled to the task queue cluster.
10. A distributed data crawling device, characterized in that it is applied to a data crawling cluster of a distributed crawler system and comprises:
an address sending module, configured to, when an access request from the data crawling cluster is detected, send a start crawl address and an intermediate crawl address to the data crawling cluster according to the access request;
an address storage module, configured to obtain the addresses to be crawled that are uploaded by the data crawling cluster and store the addresses to be crawled in an intermediate task queue.
11. A device, characterized in that the device comprises:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the distributed data crawling method according to any one of claims 6 to 8.
12. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the distributed data crawling method according to any one of claims 6 to 8.
CN201910717429.5A 2019-08-05 2019-08-05 Distributed data crawls system, method, apparatus, equipment and storage medium Pending CN110442769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910717429.5A CN110442769A (en) 2019-08-05 2019-08-05 Distributed data crawls system, method, apparatus, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910717429.5A CN110442769A (en) 2019-08-05 2019-08-05 Distributed data crawls system, method, apparatus, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110442769A true CN110442769A (en) 2019-11-12

Family

ID=68433230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910717429.5A Pending CN110442769A (en) 2019-08-05 2019-08-05 Distributed data crawls system, method, apparatus, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110442769A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982184A (en) * 2012-12-26 2013-03-20 福建师范大学 Crawler algorithm for capturing webpage in online shopping mall
CN103475688A (en) * 2013-05-24 2013-12-25 北京网秦天下科技有限公司 Distributed method and distributed system for downloading website data
CN108011931A (en) * 2017-11-22 2018-05-08 用友金融信息技术股份有限公司 Web data acquisition method and web data acquisition system
CN108133041A (en) * 2018-01-11 2018-06-08 四川九洲电器集团有限责任公司 Data collecting system and method based on web crawlers and data transfer technology
US10083222B1 (en) * 2016-03-29 2018-09-25 Sprint Communications Company L.P. Automated categorization of web pages
CN110020046A (en) * 2017-10-20 2019-07-16 中移(苏州)软件技术有限公司 A kind of data grab method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王伟军等: "《大数据分析》", 31 May 2017 *
范晖等: "《Python大数据基础与实战》", 31 July 2019 *
袁津生等: "《21世纪高等学校精品教材 搜索引擎与信息检索教程》", 30 April 2008 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium
CN113254747A (en) * 2021-06-09 2021-08-13 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler
CN113254747B (en) * 2021-06-09 2021-10-15 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191112)