CN110442769A - Distributed data crawling system, method, apparatus, device and storage medium - Google Patents
- Publication number
- CN110442769A (application CN201910717429.5A)
- Authority
- CN
- China
- Prior art keywords
- address
- task queue
- cluster
- crawl
- crawls
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention disclose a distributed data crawling system, method, apparatus, device and storage medium. The system includes a task queue cluster and a data crawling cluster. The task queue cluster includes at least one terminal and is provided with an initial task queue and an intermediate task queue, which are used to store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster includes at least one terminal and is used to access the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses, and to crawl target webpages according to them. By providing a separate initial task queue and intermediate task queue in the task queue cluster, the system makes it easy to change the workload while crawling is in progress, reduces the difficulty of resource scheduling, and improves data crawling efficiency.
Description
Technical field
Embodiments of the present invention relate to the field of computer application technology, and in particular to a distributed data crawling system, method, apparatus, device and storage medium.
Background
Society has entered the era of big data. Human life increasingly depends on data, and the volume of data is growing explosively. Because data carries a great deal of value, how to obtain it has become an urgent problem.
In the prior art, network data is usually crawled with web crawlers. Common implementations include: building a crawler system with Python's Requests library or the Trep library; building a single-process multi-threaded crawler system with the Scrapy framework; and building a memory-based distributed crawler system with the Scrapy framework and a Redis database. These solutions suffer from difficult resource scheduling and from a fixed workload that cannot be changed while crawling is in progress, so data crawling efficiency is poor.
Summary of the invention
The present invention provides a distributed data crawling system, method, apparatus, device and storage medium, so as to reduce the difficulty of resource scheduling and improve the efficiency of crawling network data.
In a first aspect, an embodiment of the invention provides a distributed data crawling system, which includes:
A task queue cluster and a data crawling cluster;
The task queue cluster includes at least one terminal and is provided with an initial task queue and an intermediate task queue, which are used to store initial crawl addresses and intermediate crawl addresses respectively;
The data crawling cluster includes at least one terminal and is used to access the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses, and to crawl target webpages according to the initial crawl addresses and the intermediate crawl addresses.
In a second aspect, an embodiment of the invention further provides a distributed data crawling method, which includes:
When it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores crawl addresses, accessing the task queue cluster to obtain a crawl address;
Obtaining a target webpage according to the crawl address;
Obtaining the addresses to be crawled in the target webpage, and uploading the addresses to be crawled to the task queue cluster.
In a third aspect, an embodiment of the invention further provides a distributed data crawling method, which includes:
When an access request from the data crawling cluster is detected, sending initial crawl addresses and/or intermediate crawl addresses to the data crawling cluster according to the access request;
Obtaining the addresses to be crawled that the data crawling cluster uploads, and storing the addresses to be crawled in the intermediate task queue.
In a fourth aspect, an embodiment of the invention further provides a distributed data crawling apparatus, which includes:
An address acquisition module, configured to access the task queue cluster to obtain a crawl address when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores crawl addresses;
A web page crawling module, configured to obtain a target webpage according to the crawl address;
An address uploading module, configured to obtain the addresses to be crawled in the target webpage and upload the addresses to be crawled to the task queue cluster.
In a fifth aspect, an embodiment of the invention further provides a distributed data crawling apparatus, which includes:
An address sending module, configured to send initial crawl addresses and/or intermediate crawl addresses to the data crawling cluster according to the access request when an access request from the data crawling cluster is detected;
An address storage module, configured to obtain the addresses to be crawled that the data crawling cluster uploads and store the addresses to be crawled in the intermediate task queue.
In a sixth aspect, a device is further provided, which includes:
One or more processors;
A memory, configured to store one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the distributed data crawling method described in any embodiment of the present invention.
In a seventh aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the program is executed by a processor, it implements the distributed data crawling method described in any embodiment of the present invention.
In the technical solution of the embodiments of the present invention, a task queue cluster and a data crawling cluster are provided. The task queue cluster contains multiple terminals and is provided with an initial task queue and an intermediate task queue, which are used to store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster includes multiple terminals and is used to access the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses, and to crawl target webpages according to them. This realizes network data crawling under a distributed architecture, makes it easier to change the crawling workload while crawling is in progress, and can increase the speed at which network data is crawled.
Brief description of the drawings
Fig. 1 is an architecture diagram of a distributed data crawling system provided by embodiment one of the present invention;
Fig. 2 is an architecture diagram of a distributed data crawling system provided by embodiment two of the present invention;
Fig. 3 is an example diagram of a distributed data crawling system provided by embodiment two of the present invention;
Fig. 4 is a flowchart of the steps of a distributed data crawling method provided by embodiment three of the present invention;
Fig. 5 is a flowchart of the steps of a distributed data crawling method provided by embodiment four of the present invention;
Fig. 6 is a structural diagram of a distributed data crawling apparatus provided by embodiment five of the present invention;
Fig. 7 is a structural diagram of a distributed data crawling apparatus provided by embodiment six of the present invention;
Fig. 8 is a structural diagram of a device provided by embodiment seven of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure. In addition, where no conflict arises, the embodiments of the present invention and the features in those embodiments may be combined with one another.
Embodiment one
Fig. 1 is an architecture diagram of a distributed data crawling system provided by embodiment one of the present invention. This embodiment is applicable to crawling network data. Referring to Fig. 1, the system may include a task queue cluster 10 and a data crawling cluster 11.
The task queue cluster 10 includes at least one terminal 101 and is provided with an initial task queue and an intermediate task queue, which are used to store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster 11 includes at least one terminal 111 and is used to access the task queue cluster 10 to obtain the initial crawl addresses and intermediate crawl addresses, and to crawl target webpages according to the initial crawl addresses and the intermediate crawl addresses.
In this embodiment, the task queue cluster 10 may consist of multiple terminals 101 and may be used to run the initial task queue and the intermediate task queue; the terminals may jointly act as the server that runs the initial task queue and the intermediate task queue. The initial task queue may be a queue that stores initial crawl addresses, so its elements may include initial crawl addresses. While the distributed data crawling system is crawling data, the workload of the system can be changed by adding or deleting initial crawl addresses in the initial task queue.
Specifically, the elements of the intermediate task queue in the task queue cluster 10 may be intermediate crawl addresses. An intermediate crawl address may be a crawl address that the distributed data crawling system obtains while crawling web page data according to an initial crawl address. It may be a sub-address of the initial crawl address and may be generated from the initial crawl address.
In this embodiment, the data crawling cluster 11 may include multiple terminals 111, and each terminal 111 may crawl network data according to the initial crawl addresses and intermediate crawl addresses it obtains. For example, a crawler program may be deployed on a terminal 111 and may crawl data according to the initial crawl addresses and intermediate crawl addresses obtained. The data crawling cluster 11 may access the task queue cluster 10 at regular intervals to obtain initial crawl addresses and/or intermediate crawl addresses. When it determines that the initial task queue and the intermediate task queue in the task queue cluster 10 contain initial crawl addresses and/or intermediate crawl addresses, it may obtain those addresses from the initial task queue and the intermediate task queue of the task queue cluster 10.
In the technical solution of this embodiment, an initial task queue and an intermediate task queue are provided in the task queue cluster and store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster accesses the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses and crawls network data according to them. This realizes network data crawling under a distributed architecture, makes it easy to change the workload while network data is being crawled, and improves resource scheduling efficiency and data crawling speed.
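The two-queue arrangement of embodiment one can be sketched in memory as follows. This is only an illustration under stated assumptions: the class name TaskQueueCluster, its method names, and the choice to serve initial crawl addresses before intermediate ones are hypothetical and are not specified by the patent.

```python
from collections import deque


class TaskQueueCluster:
    """In-memory stand-in for the task queue cluster (hypothetical class)."""

    def __init__(self):
        self.initial_queue = deque()       # initial crawl addresses
        self.intermediate_queue = deque()  # intermediate crawl addresses

    def add_initial(self, url):
        """Increase the workload by adding an initial crawl address."""
        self.initial_queue.append(url)

    def remove_initial(self, url):
        """Decrease the workload by deleting a pending initial crawl address."""
        try:
            self.initial_queue.remove(url)
            return True
        except ValueError:
            return False  # the address was not pending

    def add_intermediate(self, url):
        """Store an intermediate crawl address found during crawling."""
        self.intermediate_queue.append(url)

    def next_address(self):
        """Return the next crawl address, or None when both queues are empty.

        Serving the initial queue first is an assumption of this sketch.
        """
        if self.initial_queue:
            return self.initial_queue.popleft()
        if self.intermediate_queue:
            return self.intermediate_queue.popleft()
        return None
```

A terminal of the data crawling cluster would call next_address periodically and treat None as "nothing to crawl yet", while an operator changes the workload with add_initial and remove_initial during a run.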
Embodiment two
Fig. 2 is an architecture diagram of a distributed data crawling system provided by embodiment two of the present invention. This embodiment is a concrete implementation based on the foregoing embodiment. Referring to Fig. 2, the distributed data crawling system provided by this embodiment includes a task queue cluster 20 and a data crawling cluster 21. The task queue cluster 20 includes at least one terminal and is provided with an initial task queue 202 and an intermediate task queue 201, which are used to store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster 21 includes at least one terminal and is used to access the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses, and to crawl target webpages according to the initial crawl addresses and the intermediate crawl addresses.
The task queue cluster 20 is also used to remove duplicate intermediate crawl addresses. The data crawling cluster 21 is also used to obtain the addresses to be crawled in a target webpage and upload them to the intermediate task queue 201 of the task queue cluster 20 as intermediate crawl addresses. The initial task queue and the intermediate task queue are implemented with the Kafka message system.
In this embodiment, the data crawling cluster 21 may upload the intermediate crawl addresses generated during crawling to the task queue cluster 20. Before storing a received intermediate crawl address in the intermediate task queue, the task queue cluster 20 may deduplicate it: an intermediate crawl address that is already stored in the intermediate task queue is removed. This ensures that the distributed data crawling system does not repeat tasks and thereby improves crawling efficiency. After deduplication, the task queue cluster 20 may store the deduplicated intermediate crawl addresses in the intermediate task queue 201.
Specifically, the intermediate task queue and the initial task queue in the task queue cluster 20 may be implemented with the Kafka message system. Each of them may serve as a topic queue in the Kafka message system, and the initial crawl addresses and intermediate crawl addresses may serve as the messages in the corresponding topic queue. Each terminal in the data crawling cluster 21 may subscribe to both the intermediate task queue and the initial task queue. When the initial task queue and the intermediate task queue contain initial crawl addresses and intermediate crawl addresses, that is, when the topic queues contain messages, the terminals in the data crawling cluster 21 may obtain those messages and thus the initial crawl addresses and intermediate crawl addresses. When the intermediate task queue and the initial task queue contain no messages, the data crawling cluster 21 may pause automatically.
In the technical solution of this embodiment, an initial task queue and an intermediate task queue are provided in the task queue cluster and store initial crawl addresses and intermediate crawl addresses respectively. The data crawling cluster accesses the task queue cluster to obtain the initial crawl addresses and intermediate crawl addresses and crawls network data according to them. The data crawling cluster also uploads the intermediate crawl addresses generated during crawling to the task queue cluster, which deduplicates the uploaded intermediate crawl addresses and stores the deduplicated addresses in its intermediate task queue. The initial task queue and the intermediate task queue may be implemented with the Kafka message system. This realizes data crawling under a distributed architecture, prevents wasted performance, improves the resource scheduling capability of the distributed data crawling system, and increases the speed at which network data is crawled.
By way of example, Fig. 3 is an example diagram of a distributed data crawling system provided by embodiment two of the present invention. Referring to Fig. 3, each terminal 31 in the data crawling cluster may be configured with the scrapy crawler framework, including the crawler engine scrapy engine, the crawler Kafka_spiders and the scheduler Kafka_scheduler. The crawler engine is responsible for passing data between the modules of the scrapy crawler framework; the crawler crawls data according to crawl addresses; the scheduler receives the requests sent by the crawler engine, sorts and arranges them, and returns them when the crawler engine needs them.
Specifically, because scrapy is a single-machine data crawling framework, combining it with the Kafka message system requires modifying the source code of scrapy's spider module so that the initial crawl addresses start_urls are obtained from the Kafka message system, and so that the spider stays suspended whenever the Kafka topic contains no crawl address url. Crawling then starts automatically when a crawl address url exists in the Kafka topic and pauses automatically when none exists. Each terminal in the data crawling cluster connects to the same kafka topic, and all terminals share the crawling load.
In the task queue cluster, two message queues may be set up through the Kafka message system: topic spidername_request as intermediate task queue 32 and spidername_start_urls as initial task queue 33. Each terminal 31 in the data crawling cluster may subscribe to both message queues. The task queue cluster may also deduplicate the intermediate crawl addresses obtained for spidername with the duperfilter algorithm, ensuring that the same intermediate crawl address is not stored twice in topic spidername_request. When a terminal 31 in the data crawling cluster starts crawling, it may take a url from spidername_start_urls as the initial crawl address. Intermediate crawl addresses obtained during crawling are uploaded to topic spidername_request, and the terminals in the data crawling cluster may each obtain intermediate crawl addresses from topic spidername_request and crawl them.
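The load sharing described above, where every terminal consumes from the same topic and waits while it is empty, can be imitated on one machine with threads and a shared queue. This is a sketch under stated assumptions, not the modified scrapy spider itself: queue.Queue stands in for the Kafka topic, threads stand in for terminals, and the names run_workers, worker_count and idle_timeout are hypothetical.

```python
import queue
import threading


def run_workers(urls, worker_count=3, idle_timeout=0.2):
    """Let several workers consume one shared queue; return {worker_id: [urls]}."""
    shared = queue.Queue()  # stand-in for a topic such as spidername_request
    for url in urls:
        shared.put(url)

    handled = {i: [] for i in range(worker_count)}

    def worker(worker_id):
        while True:
            try:
                # Block while the queue is empty, like a consumer waiting on a topic.
                url = shared.get(timeout=idle_timeout)
            except queue.Empty:
                # The sketch stops here; a real consumer would keep waiting
                # and resume automatically when new addresses arrive.
                return
            handled[worker_id].append(url)  # the actual crawl would happen here
            shared.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Because each address is removed from the shared queue when a worker takes it, every address is handled by exactly one worker, which is the property that lets all terminals share the crawling load.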
Embodiment three
Fig. 4 is a flowchart of the steps of a distributed data crawling method provided by embodiment three of the present invention. This embodiment is applicable to the data crawling cluster of a distributed crawler system; the data crawling cluster may consist of multiple terminals. The method may be performed by the distributed data crawling apparatus in the embodiments of the present invention, which may be implemented in software and/or hardware. The method of this embodiment specifically includes the following steps:
Step 301: when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores crawl addresses, access the task queue cluster to obtain a crawl address.
The initial task queue may be a queue that stores initial tasks; a terminal in the data crawling cluster may start crawling network data according to a crawl address in the initial task queue. The intermediate task queue may be a queue that stores new crawl addresses obtained while network data is being crawled. The crawl addresses in the intermediate task queue may be uploaded by terminals in the data crawling cluster and may also be obtained by terminals in the data crawling cluster. It should be understood that the terminal that uploads a crawl address and the terminal that obtains it may be the same terminal or different terminals.
In this embodiment, each terminal in the data crawling cluster may check the initial task queue and the intermediate task queue in the task queue cluster. When it determines that crawl addresses exist in the initial task queue and the intermediate task queue, the terminal may access the task queue cluster and obtain the crawl addresses from those queues. It should be understood that when a terminal has not obtained a crawl address, it may go back and continue checking the initial task queue and the intermediate task queue.
Step 302: obtain the target webpage according to the crawl address.
The crawl address may be an address that a terminal in the data crawling cluster obtained from the task queue cluster, and the terminal may crawl the target webpage corresponding to the crawl address to obtain data. The crawl address may be an intermediate crawl address or an initial crawl address from the intermediate task queue or the initial task queue of the task queue cluster.
Specifically, a terminal in the data crawling cluster that has obtained a crawl address may access the corresponding target webpage according to the crawl address, may download the target webpage once it has been accessed, and may analyze the downloaded target webpage to obtain the network data to be crawled.
Step 303: obtain the addresses to be crawled in the target webpage, and upload the addresses to be crawled to the task queue cluster.
An address to be crawled may be a web page address that appears in the target webpage; the webpage it points to may be an associated page or a sub-page of the target webpage.
Specifically, the data crawling cluster may obtain from the target webpage the web page addresses associated with it and treat them as addresses to be crawled. After obtaining the addresses to be crawled, the data crawling cluster may upload them to the task queue cluster, which may store them in the intermediate task queue as intermediate crawl addresses.
In the technical solution of this embodiment, when it is detected that crawl addresses are stored in the initial task queue and the intermediate task queue provided in the task queue cluster, the task queue cluster is accessed to obtain a crawl address; the target webpage is obtained according to the crawl address; and the addresses to be crawled in the target webpage are obtained and uploaded to the task queue cluster. Crawl addresses come from both the initial task queue and the intermediate task queue, and each terminal obtains crawl addresses according to its own needs, so resource scheduling is flexible and the efficiency of crawling network data is improved.
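One pass through steps 301 to 303 can be sketched with the Python standard library. The function crawl_once and its three callables (get_address, fetch, upload) are hypothetical names introduced for illustration; the patent does not prescribe how links are extracted, so treating every <a href> as an address to be crawled is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_once(get_address, fetch, upload):
    """Run steps 301-303 once; return False when no crawl address is available."""
    address = get_address()        # step 301: obtain a crawl address
    if address is None:
        return False               # caller goes back to checking the queues
    html = fetch(address)          # step 302: obtain the target webpage
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.links:   # step 303: upload the addresses to be crawled
        upload(urljoin(address, href))
    return True
```

Passing the three operations in as callables keeps the sketch independent of any particular queue or HTTP client; in a deployment they would wrap the task queue cluster and a real downloader.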
Further, on the basis of the foregoing embodiments, obtaining the target webpage according to the crawl address includes:
Downloading the target webpage according to the crawl address and obtaining the web page content of the target webpage; determining the processing rule for the target webpage according to the web page content; and extracting the target content according to the processing rule, thereby obtaining the target webpage.
The web page content may be the information contained in the target webpage and may include text, pictures, video, audio and so on. A processing rule may be a rule for crawling web page content, such as a rule for obtaining the text in a table; different web page content may correspond to different processing rules.
In this embodiment, each terminal in the data crawling cluster may access and download the target webpage according to the crawl address. After downloading the target webpage, it may determine the web page content that the target webpage contains, determine the corresponding processing rule according to that content, and process the web page content of the target webpage according to the processing rule to obtain the network data, thereby obtaining the target webpage.
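Selecting a processing rule from the web page content can be sketched as a simple dispatch. The two rules and the test for a table element are assumptions chosen to match the "text in a table" example above; the function names are hypothetical, and the regular expressions are a deliberate simplification rather than a robust HTML parser.

```python
import re


def extract_table_text(html):
    """Rule for tabular pages: return the text of every table cell."""
    cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", html, flags=re.S | re.I)
    return [re.sub(r"<[^>]+>", "", cell).strip() for cell in cells]


def extract_plain_text(html):
    """Fallback rule: strip tags and return the remaining text as one item."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [" ".join(text.split())]


def choose_rule(html):
    """Determine the processing rule from the web page content."""
    if re.search(r"<table\b", html, flags=re.I):
        return extract_table_text
    return extract_plain_text


def process_page(html):
    """Apply the rule selected for this page and return the extracted data."""
    return choose_rule(html)(html)
```

Further rules, for example for pictures or video links, could be added to choose_rule without changing the callers.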
Embodiment four
Fig. 5 is a flowchart of the steps of a distributed data crawling method provided by embodiment four of the present invention. This embodiment is applicable to the task queue cluster of a distributed crawler system; the task queue cluster may include multiple terminals. The method may be performed by the distributed data crawling apparatus in the embodiments of the present invention, which may be implemented in software and/or hardware. The method of this embodiment specifically includes the following steps:
Step 401: when an access request from the data crawling cluster is detected, send initial crawl addresses and/or intermediate crawl addresses to the data crawling cluster according to the access request.
The access request may be a request by which a terminal in the data crawling cluster asks to access the task queue cluster, and it may include the identification number of that terminal.
In this embodiment, the task queue cluster may monitor for access requests. When an access request is detected, it may send initial crawl addresses and/or intermediate crawl addresses to the data crawling cluster terminal that the request came from, according to the source of the request. The initial crawl addresses and intermediate crawl addresses may be stored in the initial task queue and the intermediate task queue of the task queue cluster. After they have been sent, the corresponding initial crawl addresses and intermediate crawl addresses may be removed from the initial task queue and the intermediate task queue.
Step 402: obtain the addresses to be crawled that the data crawling cluster uploads, and store the addresses to be crawled in the intermediate task queue.
An address to be crawled may be a crawl address that the data crawling cluster obtained while crawling data.
Specifically, the task queue cluster may obtain the uploaded addresses to be crawled, store them in its intermediate task queue, and treat them as intermediate crawl addresses.
In the technical solution of this embodiment, when an access request from the data crawling cluster is detected, initial crawl addresses and/or intermediate crawl addresses are sent to the data crawling cluster according to the access request, and the addresses to be crawled that the data crawling cluster uploads are obtained and stored in the intermediate task queue. This realizes distributed data crawling and improves crawling efficiency.
Further, on the basis of the foregoing embodiments, before storing the address to be crawled in the intermediate task queue, the method further includes:
Generating verification information for the address to be crawled; determining whether the verification information exists in a verification information library;
If so, discarding the address to be crawled; if not, storing the verification information in the verification information library.
The verification information may be information generated from the address to be crawled and used to check for duplicates; it may be generated by a hash operation. The verification information library may be a storage area that holds verification information and may be a Redis database configured in the task queue cluster.
In this embodiment, the task queue cluster may process the address to be crawled to generate its verification information, for example with a hash algorithm. After obtaining the verification information of the address to be crawled, it may determine whether the verification information exists in the verification information library. If the verification information is already stored in the library, the address to be crawled has already been stored in the intermediate task queue and need not be stored again. If the verification information is not stored in the library, the address to be crawled has not yet been stored in the intermediate task queue and may be stored in the intermediate task queue of the task queue cluster.
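The duplicate check can be sketched as follows. The patent says only that the verification information may come from a hash operation and that the library may be a Redis database; using SHA-1 and a Python set in place of Redis are assumptions of this sketch, and the names DedupFilter and store_uploaded are hypothetical.

```python
import hashlib


class DedupFilter:
    """Hash-based duplicate check; a Python set stands in for the Redis store."""

    def __init__(self):
        self.fingerprints = set()  # the "verification information library"

    @staticmethod
    def fingerprint(url):
        """Generate the verification information for an address (SHA-1 assumed)."""
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def accept(self, url):
        """Return True and record the fingerprint if the address is new."""
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return False           # already queued: discard the address
        self.fingerprints.add(fp)
        return True


def store_uploaded(urls, dedup, intermediate_queue):
    """Store only addresses whose verification information is not yet known."""
    for url in urls:
        if dedup.accept(url):
            intermediate_queue.append(url)
```

Storing fixed-length fingerprints rather than full addresses keeps the library compact, which matters when it lives in an in-memory store shared by the whole task queue cluster.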
Embodiment five
Fig. 6 is a structural diagram of a distributed data crawling apparatus provided by embodiment five of the present invention. The apparatus can execute the distributed data crawling method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to that method. The apparatus may be implemented in software and/or hardware and specifically includes an address acquisition module 401, a web page crawling module 402 and an address uploading module 403.
The address acquisition module 401 is configured to access the task queue cluster to obtain a crawl address when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores crawl addresses.
The web page crawling module 402 is configured to obtain a target webpage according to the crawl address.
The address uploading module 403 is configured to obtain the addresses to be crawled in the target webpage and upload the addresses to be crawled to the task queue cluster.
In the technical solution of this embodiment, when the address acquisition module detects that crawl addresses are stored in the initial task queue and the intermediate task queue provided in the task queue cluster, it accesses the task queue cluster to obtain a crawl address; the web page crawling module obtains the target web page according to the crawl address; and the address uploading module obtains the addresses to be crawled in the target web page and uploads them to the task queue cluster. Because crawl addresses can be obtained from both the initial task queue and the intermediate task queue, each terminal obtains crawl addresses according to its own needs, resource scheduling is flexible, and the efficiency of crawling network data is improved.
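The cooperation of the three modules on a crawling terminal can be sketched as a single step of a worker loop. This is only an illustration under stated assumptions: the queues are plain in-memory `deque` objects rather than a real task queue cluster, the `fetch` callable is injected by the caller, links are found with a simple `href` regular expression, and preferring the initial task queue over the intermediate one is a choice made for the example, not something the text prescribes.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Assumption: links appear as double-quoted href attributes.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def acquire_address(initial_queue, intermediate_queue):
    """Address acquisition: take a crawl address from whichever queue holds one."""
    for queue in (initial_queue, intermediate_queue):
        if queue:
            return queue.popleft()
    return None


def extract_addresses(base_address, html):
    """Find the addresses to be crawled in a page, resolved against the page address."""
    return [urljoin(base_address, href) for href in HREF_PATTERN.findall(html)]


def crawl_once(initial_queue, intermediate_queue, fetch):
    """One iteration: acquire an address, fetch the page, upload the discovered addresses."""
    address = acquire_address(initial_queue, intermediate_queue)
    if address is None:
        return None  # nothing stored in either queue
    html = fetch(address)  # web page crawling
    for found in extract_addresses(address, html):
        intermediate_queue.append(found)  # address uploading
    return address


if __name__ == "__main__":
    pages = {"http://example.com/": '<a href="/a">A</a> <a href="http://example.com/b">B</a>'}
    initial = deque(["http://example.com/"])
    intermediate = deque()
    while crawl_once(initial, intermediate, lambda url: pages.get(url, "")) is not None:
        pass
```

Each terminal can run this loop independently, which is what lets the number of crawling terminals change without reconfiguring the queues.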
Further, the web page crawling module includes:
A content acquisition unit, configured to download the target web page according to the crawl address and obtain the web page content of the target web page.
A rule determination unit, configured to determine the processing rule of the target web page according to the web page content.
A content extraction unit, configured to extract the target content according to the processing rule, thereby completing the acquisition of the target web page.
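The three units above can be sketched as three small functions. The text does not say what a processing rule looks like, so the example assumes, purely for illustration, that a rule is a named regular expression chosen by inspecting the page content; the rule names, the `<article>` and `<title>` markers and the injected `download` callable are all hypothetical.

```python
import re

# Hypothetical processing rules: a rule name mapped to an extraction pattern.
PROCESSING_RULES = {
    "article": re.compile(r"<article>(.*?)</article>", re.DOTALL),
    "default": re.compile(r"<title>(.*?)</title>", re.DOTALL),
}


def acquire_content(address, download):
    """Content acquisition unit: download the page and return its content."""
    return download(address)


def determine_rule(content):
    """Rule determination unit: pick a processing rule from the page content."""
    return "article" if "<article>" in content else "default"


def extract_content(content, rule_name):
    """Content extraction unit: apply the chosen rule, returning None if nothing matches."""
    match = PROCESSING_RULES[rule_name].search(content)
    return match.group(1).strip() if match else None


def crawl_page(address, download):
    """Run the three units in sequence for one crawl address."""
    content = acquire_content(address, download)
    rule_name = determine_rule(content)
    return extract_content(content, rule_name)


if __name__ == "__main__":
    page = "<html><title>News</title><article> Body text </article></html>"
    print(crawl_page("http://example.com/news", lambda address: page))
```

Choosing the rule from the content, rather than from the address alone, lets one crawler handle pages whose layout is not known in advance.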
Embodiment six
Fig. 7 is a structural schematic diagram of a distributed data crawling device provided by Embodiment six of the present invention. The device can execute the distributed data crawling method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to the executed method. The device can be implemented in software and/or hardware and specifically includes: an address sending module 501 and an address storage module 502.
The address sending module 501 is configured to, upon detecting an access request from the data crawling cluster, send a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request.
The address storage module 502 is configured to obtain the addresses to be crawled that are uploaded by the data crawling cluster and to store the addresses to be crawled in the intermediate task queue.
In the technical solution provided by this embodiment, the address sending module, upon detecting an access request from the data crawling cluster, sends a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request, and the address storage module obtains the addresses to be crawled that are uploaded by the data crawling cluster and stores them in the intermediate task queue. This realizes distributed data crawling and improves the efficiency of data crawling.
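The task queue side of this exchange can be sketched with an in-memory class. It is an assumption-laden illustration only: the class name `TaskQueueCluster`, the `max_addresses` parameter and the policy of draining the initial task queue before the intermediate one are invented for the example, and a real cluster would use a distributed message system instead of `deque` objects.

```python
from collections import deque


class TaskQueueCluster:
    """Hypothetical single-process stand-in for the task queue cluster."""

    def __init__(self, starting_addresses):
        self.initial_queue = deque(starting_addresses)  # starting crawl addresses
        self.intermediate_queue = deque()                # intermediate crawl addresses

    def handle_access_request(self, max_addresses=1):
        """Address sending: answer an access request with up to max_addresses addresses."""
        sent = []
        while len(sent) < max_addresses and (self.initial_queue or self.intermediate_queue):
            # Assumption: starting addresses are handed out before intermediate ones.
            source = self.initial_queue if self.initial_queue else self.intermediate_queue
            sent.append(source.popleft())
        return sent

    def store_uploaded(self, addresses):
        """Address storage: keep uploaded addresses in the intermediate task queue."""
        self.intermediate_queue.extend(addresses)


if __name__ == "__main__":
    cluster = TaskQueueCluster(["http://example.com/"])
    print(cluster.handle_access_request())
    cluster.store_uploaded(["http://example.com/a", "http://example.com/b"])
    print(cluster.handle_access_request(max_addresses=5))
```

Keeping the two queues separate means newly discovered addresses never mix with the seed list, so the seed list can be replaced or replayed without touching work already in progress.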
Further, on the basis of the above embodiments, the device further includes:
A verification generation module, configured to generate the verification information of the address to be crawled.
A verification judgment module, configured to judge whether the verification information exists in the verification information library.
A deduplication execution module, configured to discard the address to be crawled if the verification information exists, and otherwise to store the verification information in the verification information library.
Embodiment seven
Fig. 8 is a structural schematic diagram of a device provided by Embodiment seven of the present invention. As shown in Fig. 8, the device includes a processor 70, a memory 71, an input device 72 and an output device 73. The device may have one or more processors 70; Fig. 8 takes one processor 70 as an example. The processor 70, memory 71, input device 72 and output device 73 in the device may be connected by a bus or in other ways; Fig. 8 takes connection by a bus as an example.
The memory 71, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program modules corresponding to the distributed data crawling method in the embodiments of the present invention (for example, the address acquisition module 401, web page crawling module 402 and address uploading module 403 in the distributed data crawling device, or the address sending module 501 and address storage module 502). By running the software programs, instructions and modules stored in the memory 71, the processor 70 executes the various functional applications and data processing of the device, thereby implementing the above distributed data crawling method.
The memory 71 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system and the application program required by at least one function, and the data storage area can store data created according to the use of the terminal, and so on. In addition, the memory 71 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 71 may further include memory located remotely from the processor 70, and such remote memory can be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input device 72 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the device. The output device 73 may include a display device such as a display screen.
Embodiment eight
Embodiment eight of the present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a distributed data crawling method, the method comprising:
when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address, accessing the task queue cluster to obtain the crawl address;
obtaining a target web page according to the crawl address;
obtaining the addresses to be crawled in the target web page, and uploading the addresses to be crawled to the task queue cluster.
Or, the method comprising:
upon detecting an access request from the data crawling cluster, sending a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request;
obtaining the addresses to be crawled that are uploaded by the data crawling cluster, and storing the addresses to be crawled in the intermediate task queue.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above and can also execute the related operations in the distributed data crawling method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiments of the distributed data crawling device, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.
Claims (12)
1. A distributed data crawling system, characterized by comprising: a task queue cluster and a data crawling cluster;
wherein the task queue cluster includes at least one terminal, an initial task queue and an intermediate task queue are provided in the task queue cluster, and the initial task queue and the intermediate task queue are respectively used to store starting crawl addresses and intermediate crawl addresses;
the data crawling cluster includes at least one terminal and is used to access the task queue cluster to obtain the starting crawl addresses and the intermediate crawl addresses, and to crawl target web pages according to the starting crawl addresses and the intermediate crawl addresses.
2. The system according to claim 1, characterized in that the task queue cluster is further used to remove duplicate intermediate crawl addresses.
3. The system according to claim 1, characterized in that the data crawling cluster is further used to obtain the addresses to be crawled in the target web page and to upload the addresses to be crawled, as intermediate crawl addresses, to the intermediate task queue of the task queue cluster.
4. The system according to claim 1, characterized in that the initial task queue and the intermediate task queue are implemented by the Kafka messaging system.
5. A distributed data crawling method, characterized in that it is applied to the data crawling cluster of a distributed crawler system and comprises:
when it is detected that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address, accessing the task queue cluster to obtain the crawl address;
obtaining a target web page according to the crawl address;
obtaining the addresses to be crawled in the target web page, and uploading the addresses to be crawled to the task queue cluster.
6. The method according to claim 5, characterized in that obtaining the target web page according to the crawl address comprises:
downloading the target web page according to the crawl address and obtaining the web page content of the target web page;
determining the processing rule of the target web page according to the web page content;
extracting the target content according to the processing rule, thereby completing the acquisition of the target web page.
7. A distributed data crawling method, characterized in that it is applied to the task queue cluster of a distributed crawler system and comprises:
upon detecting an access request from the data crawling cluster, sending a starting crawl address and/or an intermediate crawl address to the data crawling cluster according to the access request;
obtaining the addresses to be crawled that are uploaded by the data crawling cluster, and storing the addresses to be crawled in the intermediate task queue.
8. The method according to claim 7, characterized in that, before storing the address to be crawled in the intermediate task queue, the method further comprises:
generating the verification information of the address to be crawled;
judging whether the verification information exists in the verification information library;
if so, discarding the address to be crawled; if not, storing the verification information in the verification information library.
9. A distributed data crawling device, characterized in that it is applied to the data crawling cluster of a distributed crawler system and comprises:
an address acquisition module, configured to access the task queue cluster to obtain a crawl address when it detects that the initial task queue and/or the intermediate task queue provided in the task queue cluster stores a crawl address;
a web page crawling module, configured to obtain a target web page according to the crawl address;
an address uploading module, configured to obtain the addresses to be crawled in the target web page and to upload the addresses to be crawled to the task queue cluster.
10. A distributed data crawling device, characterized in that it is applied to the data crawling cluster of a distributed crawler system and comprises:
an address sending module, configured to, upon detecting an access request from the data crawling cluster, send a starting crawl address and an intermediate crawl address to the data crawling cluster according to the access request;
an address storage module, configured to obtain the addresses to be crawled that are uploaded by the data crawling cluster and to store the addresses to be crawled in the intermediate task queue.
11. A device, characterized in that the device comprises:
one or more processors;
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the distributed data crawling method according to any one of claims 6 to 8.
12. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the distributed data crawling method according to any one of claims 6 to 8 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910717429.5A CN110442769A (en) | 2019-08-05 | 2019-08-05 | Distributed data crawls system, method, apparatus, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910717429.5A CN110442769A (en) | 2019-08-05 | 2019-08-05 | Distributed data crawls system, method, apparatus, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110442769A (en) | 2019-11-12 |
Family
ID=68433230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910717429.5A Pending CN110442769A (en) | 2019-08-05 | 2019-08-05 | Distributed data crawls system, method, apparatus, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442769A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199567A (en) * | 2020-09-27 | 2021-01-08 | 深圳市伊欧乐科技有限公司 | Distributed data acquisition method, system, server and storage medium |
CN113254747A (en) * | 2021-06-09 | 2021-08-13 | 南京北斗创新应用科技研究院有限公司 | Geographic space data acquisition system and method based on distributed web crawler |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982184A (en) * | 2012-12-26 | 2013-03-20 | 福建师范大学 | Crawler algorithm for capturing webpage in online shopping mall |
CN103475688A (en) * | 2013-05-24 | 2013-12-25 | 北京网秦天下科技有限公司 | Distributed method and distributed system for downloading website data |
CN108011931A (en) * | 2017-11-22 | 2018-05-08 | 用友金融信息技术股份有限公司 | Web data acquisition method and web data acquisition system |
CN108133041A (en) * | 2018-01-11 | 2018-06-08 | 四川九洲电器集团有限责任公司 | Data collecting system and method based on web crawlers and data transfer technology |
US10083222B1 (en) * | 2016-03-29 | 2018-09-25 | Sprint Communications Company L.P. | Automated categorization of web pages |
CN110020046A (en) * | 2017-10-20 | 2019-07-16 | 中移(苏州)软件技术有限公司 | A kind of data grab method and device |
2019
- 2019-08-05: CN application CN201910717429.5A (CN110442769A), active, Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982184A (en) * | 2012-12-26 | 2013-03-20 | 福建师范大学 | Crawler algorithm for capturing webpage in online shopping mall |
CN103475688A (en) * | 2013-05-24 | 2013-12-25 | 北京网秦天下科技有限公司 | Distributed method and distributed system for downloading website data |
US10083222B1 (en) * | 2016-03-29 | 2018-09-25 | Sprint Communications Company L.P. | Automated categorization of web pages |
CN110020046A (en) * | 2017-10-20 | 2019-07-16 | 中移(苏州)软件技术有限公司 | A kind of data grab method and device |
CN108011931A (en) * | 2017-11-22 | 2018-05-08 | 用友金融信息技术股份有限公司 | Web data acquisition method and web data acquisition system |
CN108133041A (en) * | 2018-01-11 | 2018-06-08 | 四川九洲电器集团有限责任公司 | Data collecting system and method based on web crawlers and data transfer technology |
Non-Patent Citations (3)
Title |
---|
王伟军等: "《大数据分析》", 31 May 2017 * |
范晖等: "《Python大数据基础与实战》", 31 July 2019 * |
袁津生等: "《21世纪高等学校精品教材 搜索引擎与信息检索教程》", 30 April 2008 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199567A (en) * | 2020-09-27 | 2021-01-08 | 深圳市伊欧乐科技有限公司 | Distributed data acquisition method, system, server and storage medium |
CN113254747A (en) * | 2021-06-09 | 2021-08-13 | 南京北斗创新应用科技研究院有限公司 | Geographic space data acquisition system and method based on distributed web crawler |
CN113254747B (en) * | 2021-06-09 | 2021-10-15 | 南京北斗创新应用科技研究院有限公司 | Geographic space data acquisition system and method based on distributed web crawler |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108875091B (en) | Distributed web crawler system with unified management | |
CN106534244B (en) | Scheduling method and device of proxy resources | |
CN109743315A (en) | For Activity recognition method, apparatus, equipment and the readable storage medium storing program for executing of website | |
CN109684370A (en) | Daily record data processing method, system, equipment and storage medium | |
CN107391775A (en) | A kind of general web crawlers model implementation method and system | |
CN109600385B (en) | Access control method and device | |
WO2018157686A1 (en) | Webpage crawling method and apparatus | |
US20190109920A1 (en) | Browser resource pre-pulling method, terminal and storage medium | |
CN107809383A (en) | A kind of map paths method and device based on MVC | |
CN107729564A (en) | A kind of distributed focused web crawler web page crawl method and system | |
CN108011931B (en) | Web data acquisition method and Web data acquisition system | |
CN105607799B (en) | Data processing method and device | |
CN111444412B (en) | Method and device for scheduling web crawler tasks | |
CN110442769A (en) | Distributed data crawls system, method, apparatus, equipment and storage medium | |
CN110535974A (en) | Method for pushing, driving means, equipment and the storage medium of resource to be put | |
CN106776917A (en) | A kind of method and apparatus for obtaining resource file | |
CN105095299A (en) | Picture capturing method and system | |
CN104461702A (en) | Business processing method and business processing device | |
CN106534280A (en) | Data sharing method and device | |
CN111061807A (en) | Distributed data acquisition and analysis system and method, server and medium | |
CN104572945B (en) | A kind of file search method and device based on cloud storage space | |
CN106209992A (en) | A kind of router supports method and the router of RSS subscription task download | |
Shivaprasad et al. | Knowledge discovery from web usage data: An efficient implementation of web log preprocessing techniques | |
CN105989151A (en) | Webpage crawling method and apparatus | |
WO2023275782A1 (en) | Systems and methods for locating devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191112 |