CN109359231A

CN109359231A - Information crawling method, server and storage medium for a distributed web crawler

Info

Publication number: CN109359231A
Application number: CN201711478979.3A
Authority: CN
Inventors: 徐松柏
Original assignee: Guangzhou Tcl Smart Home Technology Co Ltd
Current assignee: Shenzhen TCL New Technology Co Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2019-02-19

Abstract

The present invention provides an information crawling method, server and storage medium for a distributed web crawler. Network URLs are crawled simultaneously using multiple acquired IPs, and each crawled URL is encoded into a key value and stored in a redis cluster; multiple crawler clients simultaneously obtain URLs from the redis cluster and parse target information from the obtained URLs. The information crawling method provided by the present invention uses multiple devices cooperating with multiple IPs to crawl all URLs on the Internet simultaneously, so that useful information is obtained from massive network resources better, faster and more accurately.

Description

Information crawling method, server and storage medium for a distributed web crawler

Technical field

The present invention relates to the field of information technology, and in particular to an information crawling method, server and storage medium for a distributed web crawler.

Background art

At present, according to statistics, the number of web pages on the Internet exceeds 20 billion, and research shows that close to 30% of these pages are duplicates; there are also a large number of dynamic pages. The use of client-side and server-side scripting languages has caused the number of URLs (Uniform Resource Locators) pointing to the same Web (World Wide Web) information to grow exponentially. If a single server is used to crawl the required information from the web pages of the Internet, a large amount of time is needed and the user cannot obtain the required information in time, which causes considerable inconvenience.

Therefore, the prior art needs further improvement.

Summary of the invention

In view of the above shortcomings of the prior art, the object of the present invention is to provide users with an information crawling method, server and storage medium for a distributed web crawler, overcoming the defect in the prior art that required information cannot be crawled quickly from massive network resources.

The first embodiment provided by the present invention is an information crawling method for a distributed web crawler, comprising the following steps:

Crawling network URLs using multiple acquired IPs, and encoding each crawled URL into a key value and storing it in a redis cluster;

Multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs.

Optionally, before the step of crawling network URLs using the multiple acquired IPs, the method further comprises:

Obtaining idle IPs on the network and storing the idle IPs in MongoDB;

Multiple crawler clients obtaining IPs from the MongoDB.

Optionally, the step of encoding the crawled URLs into key values and storing them in the redis cluster further comprises:

Base64-encoding each crawled URL into a key value, and saving the key value and the URL in one-to-one correspondence under a first primary key of the redis cluster.

Optionally, the method further comprises:

Transferring the URLs whose target information has been parsed to a second primary key of the redis cluster.

Optionally, the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further comprises:

The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key a URL that has not been parsed as the target URL;

Parsing the target URL to obtain the target information.

The second embodiment provided by the present invention is a server, wherein the server comprises: a processor, a memory, and an information crawling control program of a distributed web crawler that is stored on the memory and runnable on the processor, wherein the information crawling control program of the distributed web crawler, when executed by the processor, implements the following steps:

Optionally, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:

Multiple crawler clients obtaining IPs from the MongoDB.

Base64-encoding each crawled URL into a key value, and saving the key value and the URL in one-to-one correspondence under the first primary key of the redis cluster;

Parsing the target URL to obtain the target information.

The third embodiment provided by the present invention is a computer-readable storage medium, wherein an information crawling control program of a distributed web crawler is stored on the computer-readable storage medium, and the information crawling control program of the distributed web crawler, when executed by a processor, implements the steps of the information crawling method for the distributed web crawler.

Beneficial effects: the present invention provides an information crawling method, server and storage medium for a distributed web crawler, in which network URLs are crawled using multiple acquired IPs, each crawled URL is encoded into a key value and stored in a redis cluster, and URLs are obtained from the redis cluster and target information is parsed from the obtained URLs. The information crawling method provided by the present invention uses multiple devices cooperating with multiple IPs to crawl all URLs on the Internet simultaneously, so that useful information on the network is obtained better, faster and more accurately.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the information crawling method for a distributed web crawler according to the present invention.

Fig. 2 is a schematic diagram of the control principle of the client browser in a specific application embodiment of the method of the present invention.

Fig. 3 is a schematic structural diagram of the server according to the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer and more explicit, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

Because crawling network resources with a single server takes a large amount of time, the present invention discloses a method in which multiple distributed servers cooperate to crawl network resources. Using MongoDB and redis, multiple servers crawl resources simultaneously, so that target information is obtained quickly from massive network resources.

It should be noted here that MongoDB is a database based on distributed file storage; it is a high-performance, open-source, schema-free document database. Redis is an open-source, log-type key-value database written in ANSI C that supports networking, can run in memory or persist data, and provides APIs (Application Programming Interfaces) for multiple languages. A redis cluster is a set of multiple redis instances.

The first embodiment provided by the present invention is an information crawling method for a distributed web crawler which, as shown in Fig. 1, comprises the following steps:

Step S1: network URLs are crawled using multiple acquired IPs, and each crawled URL is encoded into a key value and stored in a redis cluster.

To crawl URLs more efficiently, multiple IPs are used in this step to acquire URL information simultaneously. In a specific embodiment, the multiple IPs are used as request IPs, or are packaged directly into client browsers, to crawl website URLs.

To store the crawled URLs properly, each crawled URL is encoded into a key value in this step and then stored in the redis cluster. Because the redis cluster keeps only one entry per key value, a duplicate URL is eliminated automatically, which avoids storing the same URL more than once and also avoids repeatedly crawling the information of that URL.
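
The key-value encoding and the resulting deduplication described above can be illustrated with a minimal sketch. It is not the patented implementation: the `UrlStore` class is a hypothetical in-memory stand-in for one hash in the redis cluster, and the function names are assumptions introduced for illustration. With a real redis client the insert would roughly correspond to the `HSETNX` command, which writes a field only if it does not already exist.

```python
import base64


def url_to_key(url: str) -> str:
    """Encode a URL into its base64 key value."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def key_to_url(key: str) -> str:
    """Recover the original URL from a base64 key value."""
    return base64.b64decode(key.encode("ascii")).decode("utf-8")


class UrlStore:
    """Hypothetical in-memory stand-in for one redis hash (key value -> URL)."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def add(self, url: str) -> bool:
        """Store the URL under its key value; return False if it was already stored."""
        key = url_to_key(url)
        if key in self._entries:
            return False  # duplicate URL: the existing entry is kept
        self._entries[key] = url
        return True

    def __len__(self) -> int:
        return len(self._entries)


store = UrlStore()
crawled = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
results = [store.add(url) for url in crawled]
```

Because the key value is derived deterministically from the URL, two producers that crawl the same URL compute the same key, so the second write is recognised as a duplicate without comparing page contents.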

However, many websites block an IP that makes too many consecutive visits, so if any one IP accesses a website too many times, the website will block that IP. Therefore, to be able to access a website repeatedly, this step preferably further comprises: obtaining idle IPs on the network and storing the idle IPs in MongoDB; and multiple crawler clients obtaining IPs from the MongoDB. Because the idle IPs obtained from the network are stored in MongoDB, a crawler client can conveniently obtain a stored IP from MongoDB and, when the IP it has been using is blocked by a website, obtain a new IP from MongoDB to replace it and re-initiate access to that website.
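
The replacement of a blocked IP can be sketched as follows. This is only an illustration under stated assumptions: `IpPool` is a hypothetical in-memory stand-in for the MongoDB collection of idle IPs, the method names are invented for this example, and the addresses come from a documentation-only range.

```python
class IpPool:
    """Hypothetical in-memory stand-in for the MongoDB collection of idle IPs."""

    def __init__(self, idle_ips: list[str]) -> None:
        self._idle = list(idle_ips)
        self._blocked: dict[str, set[str]] = {}  # site -> IPs that site has blocked

    def acquire(self, site: str) -> str:
        """Return the first stored IP that the given site has not blocked."""
        blocked = self._blocked.get(site, set())
        for ip in self._idle:
            if ip not in blocked:
                return ip
        raise LookupError(f"no usable idle IP left for {site}")

    def report_blocked(self, site: str, ip: str) -> None:
        """Record that the site blocked this IP so it is not handed out for that site again."""
        self._blocked.setdefault(site, set()).add(ip)


pool = IpPool(["203.0.113.1:8080", "203.0.113.2:8080"])
first_ip = pool.acquire("example.com")
pool.report_blocked("example.com", first_ip)   # the site started refusing this IP
second_ip = pool.acquire("example.com")        # the client continues with a fresh IP
```

Tracking blocks per site is a design choice of this sketch: an IP blocked by one website can still be handed out for other websites.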

It is conceivable that in this step multiple clients acting as producers cooperate, each crawling URLs with the IP values it has obtained from MongoDB, so that URLs are crawled more efficiently.

Because the MongoDB database is used to store the IPs in this step, distributed storage of the information can be achieved, enabling distributed cooperation in information crawling.

Step S2: multiple crawler clients obtain URLs from the redis cluster and parse target information from the obtained URLs.

After the multiple crawler clients (clients that implement the web crawler function, where the web crawler function may be implemented by a program or script that automatically fetches web information according to certain rules) have stored all the crawled network URLs in the redis cluster, multiple other crawler clients can obtain the stored URLs and, acting as consumers, crawl the web page information of the obtained URLs, compare the web page information with the target task to be processed, and extract the corresponding target information.

To prevent the clients acting as consumers from repeatedly crawling information from the same URL, the step of storing the crawled URLs in the redis cluster further comprises:

Base64-encoding each crawled URL into a key value, and saving the key value and the URL in one-to-one correspondence under the first primary key of the redis cluster.

In this step, different primary keys are used to store the parsed and the unparsed URLs respectively; that is, two different primary keys of the redis cluster store the URLs from which the required information has been crawled and the URLs from which it has not. Consumed and unconsumed URLs can thus be distinguished more easily, which on the one hand prevents clients from parsing a URL repeatedly and increasing the parsing workload, and on the other hand prevents unparsed URLs from being missed.
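
The bookkeeping with two primary keys can be sketched as below. This is a hedged illustration, not the patented code: the class and method names are hypothetical, two plain dictionaries stand in for the two redis primary keys, and the sketch assumes that a parsed URL is moved out of the first key when it is recorded under the second.

```python
class TwoKeyUrlRecords:
    """Hypothetical stand-in for the two redis primary keys."""

    def __init__(self) -> None:
        self.first_key: dict[str, str] = {}   # key value -> URL, not yet parsed
        self.second_key: dict[str, str] = {}  # key value -> URL, already parsed

    def add_crawled(self, key: str, url: str) -> bool:
        """Record a newly crawled URL unless it is already known under either key."""
        if key in self.first_key or key in self.second_key:
            return False
        self.first_key[key] = url
        return True

    def mark_parsed(self, key: str) -> None:
        """Transfer a URL from the first key to the second once it has been parsed."""
        self.second_key[key] = self.first_key.pop(key)


records = TwoKeyUrlRecords()
records.add_crawled("k1", "http://example.com/a")
records.add_crawled("k2", "http://example.com/b")
records.mark_parsed("k1")
recrawled = records.add_crawled("k1", "http://example.com/a")  # crawled again later
```

Checking both keys in `add_crawled` means a URL that is crawled again after being parsed is not queued a second time.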

In a specific embodiment, because multiple clients act as producers crawling network URLs, duplicate URLs may be stored in the redis cluster, which would cause the clients acting as consumers to parse already-parsed URLs again and increase the parsing workload. To avoid this, optionally, the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further comprises:

The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key a URL that has not been parsed as the target URL;

Parsing the target URL to obtain the target information.

The duplicate-check screening in the above steps removes the duplicate URLs that were crawled by different crawler clients and stored under the first primary key and the second primary key of the redis cluster, which avoids assigning and parsing the same URL more than once and improves parsing efficiency.
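
The duplicate-check screening can be sketched as a selection over the two records. The function below is an assumption-laden illustration: the dictionaries stand in for the first and second primary keys, and the `in_progress` set is a hypothetical addition (not described in the source) that keeps two consumers from picking the same URL at the same time.

```python
from typing import Optional


def select_target(
    first_key: dict[str, str],
    second_key: dict[str, str],
    in_progress: set[str],
) -> Optional[tuple[str, str]]:
    """Pick a URL recorded under the first key that has not been parsed or claimed yet."""
    for key, url in first_key.items():
        if key in second_key:
            continue  # already parsed by some consumer
        if key in in_progress:
            continue  # currently being parsed by another consumer
        in_progress.add(key)
        return key, url
    return None  # nothing left to parse


first = {"k1": "http://example.com/a", "k2": "http://example.com/b"}
second = {"k1": "http://example.com/a"}  # k1 has already been parsed
claimed: set[str] = set()

target = select_target(first, second, claimed)
next_target = select_target(first, second, claimed)
```

In a deployment with several machines the claim would need to be atomic on the shared store; the in-process set here only illustrates the screening logic.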

The information crawling method for a distributed web crawler provided by the present invention combines redis distributed storage with automatic management of unique key values and MongoDB non-relational sharded data storage, and uses multiple devices cooperating in a producer-consumer pattern to crawl all URLs on the Internet simultaneously. In this way the crawling speed increases geometrically. Because the MongoDB database has a large storage capacity and supports sharded storage of information, there is no need to worry that the data volume is too large to store; clients can also conveniently obtain different IPs from different storage areas, which facilitates acquiring and storing network information.

With reference to Fig. 2, the information crawling method for a distributed web crawler according to the present invention comprises, in a specific implementation, the following steps:

Step H1: obtain idle IPs from the network.

Step H2, client browser simulation: the previously stored idle IPs are obtained from MongoDB and packaged by the program into a client browser, which prevents the software from being blocked when crawling web pages.
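
One way to read "packaging an idle IP into a client browser" is to route requests through that IP as a proxy while sending browser-like headers. The sketch below makes that assumption explicit; the function name, the User-Agent string and the proxy address are all illustrative and are not taken from the source. It only builds the client and sends no request.

```python
import urllib.request

# Illustrative browser-like User-Agent; any realistic value could be used.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_client(proxy_address: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends requests through the given idle IP (host:port)."""
    proxy_handler = urllib.request.ProxyHandler(
        {
            "http": f"http://{proxy_address}",
            "https": f"http://{proxy_address}",
        }
    )
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default Python User-Agent with a browser-like one.
    opener.addheaders = [("User-Agent", BROWSER_USER_AGENT)]
    return opener


# The address would come from the stored idle IPs; this one is a documentation placeholder.
client = build_client("203.0.113.5:8080")
# A page would then be fetched with client.open(url, timeout=...).
```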

Step H3, link acquisition: multiple crawler clients are started as producers (which store the objects to be processed in a certain area) to traverse all URLs on the Internet; each URL is base64-encoded into a key value and stored in redis, and the unique key values built into redis ensure the uniqueness of all URLs, so that producers on multiple distributed machines cooperate to obtain all URLs.

Step H4, information acquisition: URLs that have not yet been processed are screened out, and the required data is parsed from those URLs; in this way the URLs processed by the crawlers on different devices are also different, which achieves distributed cooperation.

Step H5, URL transfer: multiple consumers (which obtain the tasks to be processed from a certain object) obtain URLs from redis, and each stores the URLs it has processed under another primary key of redis, so that it is known which URLs have already been processed.
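
Steps H3 to H5 follow a producer-consumer pattern, which can be sketched with threads standing in for the separate crawler machines. Everything here is an assumption for illustration: the in-process queue and dictionaries replace redis, the `run_pipeline` name is hypothetical, and "parsing" is reduced to building a string so that the sketch runs without network access.

```python
import base64
import queue
import threading


def run_pipeline(urls_per_producer: list[list[str]], n_consumers: int = 3) -> dict[str, str]:
    """Run producers and consumers; return key value -> parsed info for each unique URL."""
    pending: queue.Queue = queue.Queue()   # stands in for the first primary key
    seen: set[str] = set()                 # unique key values already stored
    processed: dict[str, str] = {}         # stands in for the second primary key
    lock = threading.Lock()

    def producer(urls: list[str]) -> None:
        for url in urls:
            key = base64.b64encode(url.encode("utf-8")).decode("ascii")
            with lock:
                if key in seen:
                    continue  # another producer already stored this URL
                seen.add(key)
            pending.put((key, url))

    def consumer() -> None:
        while True:
            item = pending.get()
            if item is None:
                return  # sentinel: no more URLs will arrive
            key, url = item
            info = f"parsed:{url}"  # placeholder for real page parsing
            with lock:
                processed[key] = info  # record the URL as processed

    producers = [threading.Thread(target=producer, args=(urls,)) for urls in urls_per_producer]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for thread in producers + consumers:
        thread.start()
    for thread in producers:
        thread.join()
    for _ in consumers:
        pending.put(None)  # one sentinel per consumer
    for thread in consumers:
        thread.join()
    return processed


result = run_pipeline(
    [
        ["http://example.com/a", "http://example.com/b"],
        ["http://example.com/b", "http://example.com/c"],
    ]
)
```

Although two producers both crawl `http://example.com/b`, it is queued once and therefore parsed once, which is the effect the unique key values are meant to achieve.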

Because the data volume is very large, the information is stored using MongoDB sharding, which also makes it convenient to read.

The second embodiment provided by the present invention is a server. As shown in Fig. 3, the server 30 comprises: a processor 310, a memory 320, and an information crawling control program of a distributed web crawler that is stored on the memory 320 and runnable on the processor 310, wherein the information crawling control program of the distributed web crawler, when executed by the processor, implements the following steps:

Crawling network URLs using multiple acquired IPs, and encoding each crawled URL into a key value and storing it in a redis cluster; the function implemented is as described in step S1.

Multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs; the function implemented is as described in step S2.

Specifically, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:

Multiple crawler clients obtaining IPs from the MongoDB.

Base64-encoding each crawled URL into a key value, and saving the key value and the URL in one-to-one correspondence under the first primary key of the redis cluster;

Parsing the target URL to obtain the target information.

The server provided by the present invention executes the control program corresponding to the information crawling method provided by the present invention, uses the MongoDB database for distributed storage of the IPs and the crawled information, and uses redis to store the crawled URL information, so that multiple crawler clients cooperate and target information on the network is obtained more quickly.

The storage device 320, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. By running the non-volatile software programs, instructions and modules stored in the storage device 320, the processor 310 executes the various functional applications and data processing of the server, that is, implements the information crawling method for a distributed web crawler of the above method embodiment.

The storage device 320 may include a program storage area and a data storage area, wherein the program storage area can store an operating system and an application program required by at least one function, and the data storage area can store data created during use of the server. In addition, the storage device 320 may include high-speed random access memory and may also include non-volatile storage, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the storage device 320 optionally includes storage devices located remotely from the processor 310, and these remote storage devices can be connected to the server through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The third embodiment provided by the present invention is a computer-readable storage medium, wherein an information crawling control program of a distributed web crawler is stored on the computer-readable storage medium, and the information crawling control program of the distributed web crawler, when executed by a processor, implements the steps of the information crawling method for the distributed web crawler.

The present invention provides an information crawling method, server and storage medium for a distributed web crawler. Idle IPs are obtained from the network, and the multiple idle IPs obtained are stored in a MongoDB database; each idle IP is packaged into a client browser; each packaged client browser, acting as a producer, crawls network URLs and stores the crawled URLs in a redis cluster; multiple client browsers, acting as consumers, obtain URLs from the redis cluster and parse target information from the URLs each has obtained. The information crawling method provided by the present invention uses multiple devices in cooperation to crawl all URLs on the Internet simultaneously, so that useful information on the network is obtained better, faster and more accurately.

It can be understood that those of ordinary skill in the art may make equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept, and all such changes or substitutions shall fall within the protection scope of the appended claims of the present invention.

Claims

1. An information crawling method for a distributed web crawler, characterized by comprising the following steps:

2. The information crawling method for a distributed web crawler according to claim 1, characterized in that, before the step of crawling network URLs using the multiple acquired IPs, the method further comprises:

Multiple crawler clients obtaining IPs from the MongoDB.

3. The information crawling method for a distributed web crawler according to claim 1, characterized in that the step of encoding the crawled URLs into key values and storing them in the redis cluster further comprises:

4. The information crawling method for a distributed web crawler according to claim 1, characterized in that the method further comprises:

Transferring the URLs whose target information has been parsed to a second primary key of the redis cluster.

5. The information crawling method for a distributed web crawler according to claim 4, characterized in that the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further comprises:

Parsing the target URL to obtain the target information.

6. A server, characterized in that the server comprises: a processor, a memory, and an information crawling control program of a distributed web crawler that is stored on the memory and runnable on the processor, wherein the information crawling control program of the distributed web crawler, when executed by the processor, implements the following steps:

7. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:

Multiple crawler clients obtaining IPs from the MongoDB.

8. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:

9. The server according to claim 8, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:

Parsing the target URL to obtain the target information.

10. A computer-readable storage medium, characterized in that an information crawling control program of a distributed web crawler is stored on the computer-readable storage medium, and the information crawling control program of the distributed web crawler, when executed by a processor, implements the steps of the information crawling method for a distributed web crawler according to any one of claims 1 to 5.