CN105930385A

CN105930385A - Data crawling method and system

Info

Publication number: CN105930385A
Application number: CN201610232182.4A
Authority: CN
Inventors: 祝奔
Original assignee: Zhuhai Gotech Intelligent Technology Co Ltd
Current assignee: Zhuhai Gotech Intelligent Technology Co Ltd
Priority date: 2016-04-13
Filing date: 2016-04-13
Publication date: 2016-09-07

Abstract

The invention discloses a data crawling method and system. The method comprises the following steps: obtaining a target URL from a url queue and obtaining the source code of the target URL; saving the source code of the target URL into an html queue, and parsing the final data of the target URL from that source code; determining whether a URL exists in the source code of the target URL; and, if a URL exists, extracting it from the source code and saving it into the url queue. In the embodiment of the invention, source code is obtained for the pre-stored URLs held in the url queue, the URLs extracted from that source code are put back into the url queue, and the final data in the source code is obtained via the html queue. Because pages are accessed in the manner of a browser, anti-crawling measures are bypassed and the specified information can be obtained, so that data is crawled quickly and the cost of crawling is reduced.

Description

Data crawling method and system

Technical field

The present invention relates to the technical field of data crawling, and more particularly to a data crawling method and system.

Background art

In routine web front-end development, it is often necessary to collect large amounts of information from the Internet. Doing this by hand consumes a great deal of labor and time, so a better approach is to write a crawler script to collect the information. A crawler program continually sends http requests to a server; the server has to receive these requests, process them, and finally return data. However, a crawler can also use this principle to attack a server maliciously: multiple programs send http requests to the same server at once, keeping it busy with processing, which degrades server performance and affects server stability. Some servers therefore take measures to prevent crawler programs from accessing their content. An anti-crawler web page may use more than one anti-crawling technique, so crawling its content requires analyzing the site's specific anti-crawling measures and then writing a corresponding workaround in code. If a page uses many different techniques to block crawlers, the crawler program becomes very complex, which directly increases the cost of crawling.

Therefore, how to crawl data is a problem that those skilled in the art need to solve.

Summary of the invention

An object of the invention is to provide a data crawling method and system that crawl data at low cost.

To achieve the above object, embodiments of the invention provide the following technical solutions:

A data crawling method, comprising:

obtaining a target URL from a url queue, and obtaining the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;

storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawl rule;

judging whether a URL exists in the source code of the target URL;

if so, extracting the URL from the source code of the target URL and storing it in the url queue.

Wherein judging whether a URL exists in the source code of the target URL comprises:

judging whether the source code of the target URL contains a URL that satisfies a preset URL rule.

Wherein, before the target URL is obtained from the url queue, the method comprises:

receiving a pre-stored-URL setting instruction, and setting the pre-stored URLs in the url queue according to that instruction.

receiving a thread-count setting instruction; wherein the thread-count setting instruction includes at least: a number of url queue threads, and/or a number of html queue threads.

A data crawling system, comprising:

a target URL acquisition module, configured to obtain a target URL from a url queue;

a source code acquisition module, configured to obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;

a first storing module, configured to store the source code of the target URL in an html queue;

a data parsing module, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawl rule;

a judging module, configured to judge whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module;

the second storing module, configured to extract the URL from the source code of the target URL and store it in the url queue.

Wherein the judging module comprises:

a judging unit, configured to judge whether the source code of the target URL contains a URL that satisfies a preset URL rule.

Wherein the data crawling system comprises:

a first receiving module, configured to receive a pre-stored-URL setting instruction;

a setting module, configured to set the pre-stored URLs in the url queue according to the pre-stored-URL setting instruction.

Wherein the data crawling system comprises:

a second receiving module, configured to receive a thread-count setting instruction; wherein the thread-count setting instruction includes at least: a number of url queue threads, and/or a number of html queue threads.

As the above solutions show, the data crawling method and system provided by the embodiments of the invention comprise: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code according to a predetermined crawl rule; judging whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.

As can be seen, in this embodiment source code is obtained for the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is ultimately obtained via the html queue. Because pages are accessed in the manner of a browser, anti-crawling measures are bypassed directly and the specified information can be obtained; data can not only be crawled quickly, but the cost of crawling is also reduced.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a data crawling method disclosed in an embodiment of the invention;

Fig. 2 is a schematic structural diagram of a data crawling system disclosed in an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The embodiments of the invention disclose a data crawling method and system for crawling data at low cost. Note that the method of this embodiment first requires installing the Selenium tool. Selenium is a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it. It supports the current mainstream browsers and runs on multiple platforms.

Referring to Fig. 1, a data crawling method provided by an embodiment of the invention comprises:

S101: obtain a target URL from the url queue, and obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;

Specifically, prior-art crawlers can only crawl specified information from particular websites, whereas in this embodiment the URLs to be read can be configured through the pre-stored-URL setting instruction, so that data from multiple websites can be crawled.

Specifically, the pre-stored URL in this embodiment may be the home page of a website, for example homepage=http://www.imdb.com/, in which case all pages under that domain can be crawled. To crawl only the content of pages that match a particular rule, a specific page rule can be set instead, provided it follows Python syntax. For example: url="http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm"%number.
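The page rule above relies on Python's `%` string formatting. The following is a minimal sketch of how such a rule might expand into concrete URLs. The helper name `build_urls` and the number range are hypothetical and do not come from the patent; the URL template is copied from the example as printed.

```python
# URL template copied from the example above; "%0.7d" zero-pads the number to 7 digits.
URL_RULE = "http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm"


def build_urls(first, last):
    """Expand the rule for every number in [first, last] (hypothetical helper)."""
    return [URL_RULE % number for number in range(first, last + 1)]


# The resulting list could serve as the pre-stored URLs placed in the url queue.
urls = build_urls(1, 3)
```

Generating the URLs up front keeps the rule itself a plain Python expression, which matches the requirement that the rule follow Python syntax.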

Specifically, the thread-count setting instruction in this embodiment is set by the user according to the host configuration; on an ordinary PC it can be set to 10.

S102: store the source code of the target URL in the html queue, and parse the final data of the target URL from the source code of the target URL according to a predetermined crawl rule;

Specifically, the predetermined crawl rule in this embodiment is a preset rule that pinpoints where the desired information sits in the page; the rules must be set in order. For example:

With a css selector, to find the content whose id is itemprop, set id=itemprop; to find the content whose class is job inside the element whose id is itemprop, set id=itemprop, class=job; and so on. A regular expression can also be used directly to locate content, for example reg='id="itme">(.*?)<'. css selectors and regular expressions can also be used together, but attention must be paid to their order.
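The importance of rule order can be illustrated with a small sketch. It assumes every rule is a regular expression with exactly one capture group and that each rule is applied to the fragments produced by the previous one. The function `apply_rules` and the sample HTML are hypothetical, and the id/class rules of the example are approximated here by regular expressions rather than by a real css selector engine.

```python
import re


def apply_rules(html, rules):
    """Apply regex rules in order; each rule narrows the fragments found so far."""
    fragments = [html]
    for pattern in rules:
        regex = re.compile(pattern, re.S)  # re.S lets "." span line breaks
        fragments = [match for frag in fragments for match in regex.findall(frag)]
    return fragments


SAMPLE = (
    '<div id="itemprop"><span class="job">Actor</span>'
    '<span class="job">Producer</span></div>'
    '<span class="job">Other</span>'
)

# First locate the id=itemprop block, then the class=job content inside it.
ordered = apply_rules(SAMPLE, [r'id="itemprop">(.*?)</div>', r'class="job">(.*?)<'])

# Reversing the order looks for the id inside the job text and finds nothing.
reversed_order = apply_rules(SAMPLE, [r'class="job">(.*?)<', r'id="itemprop">(.*?)</div>'])
```

In this sketch the outer rule has to come first, because each later rule only sees what the earlier rules extracted.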

S103: judge whether a URL exists in the source code of the target URL; if so, extract the URL from the source code of the target URL and store it in the url queue.
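Step S103 can be sketched with the Python standard library as follows. The patent does not say how links are located in the source code, so collecting `href` attributes with `html.parser` is an assumption, as are the names `LinkCollector` and `extract_urls` and the sample rule.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_urls(html, base_url, url_rule):
    """Return the links in html that satisfy the preset URL rule (a regex)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    rule = re.compile(url_rule)
    return [link for link in collector.links if rule.search(link)]


page = '<a href="/name/nm0000123/">A</a><a href="http://example.com/x">B</a>'
found = extract_urls(page, "http://www.imdb.com/", r"^http://www\.imdb\.com/name/")
```

An empty result corresponds to the "no URL exists" branch of the judgment; a non-empty result is what would be stored in the url queue.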

Specifically, this embodiment bypasses anti-crawling measures by obtaining the specified information in the manner of a browser visit. Anti-crawling measures only prevent programs from sending large numbers of requests to the server to crawl page content, so access through a browser succeeds easily.
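A minimal sketch of browser-based access with Selenium's Python bindings is given here. The patent names Selenium but not a particular browser, so the use of Chrome is an assumption, and the helper names `make_driver` and `fetch_source` are hypothetical.

```python
def make_driver():
    """Start a real browser session; assumes Selenium and a Chrome driver are installed."""
    from selenium import webdriver  # imported lazily so the sketch loads without Selenium
    return webdriver.Chrome()


def fetch_source(driver, url):
    """Load the page in the browser and return its source code as the browser sees it."""
    driver.get(url)
    return driver.page_source


# Typical use (requires a local browser, so it is left as a comment):
#   driver = make_driver()
#   try:
#       html = fetch_source(driver, "http://www.imdb.com/")
#   finally:
#       driver.quit()
```

Passing the driver in as a parameter lets one browser session be reused across many URLs, and makes it possible to substitute a stand-in object when testing.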

Specifically, this embodiment can be understood as follows. A pre-stored URL is first read from the url queue and visited to obtain its source code, and the source code is stored in the html queue. Once the html queue receives the source code, the source code is first analyzed to obtain the final data. The source code is also checked for URLs that satisfy the predetermined URL rule; if there are any, they are extracted from the source code and stored in the url queue. The URLs in the url queue continue to be visited until the url queue or the html queue has no data.

As can be seen, while URLs are being visited and the source code obtained from each visit is being saved in the html queue, new URLs are obtained from the source code in the html queue and stored in the url queue. The threads of the two queues thus work in a cycle, which speeds up data crawling.
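The two-queue cycle described above can be sketched with Python's `queue` and `threading` modules. This is an illustration under stated assumptions, not the patented implementation: the `fetch` and `parse` callables stand in for browser access and the crawl rule, the `seen` set (which keeps cyclic links from looping forever) is an addition, the sketch stops once both queues are drained, and all names are hypothetical.

```python
import queue
import re
import threading

LINK_RE = re.compile(r'href="([^"]+)"')  # simplistic link finder, for illustration only


def crawl(seed_urls, fetch, parse, url_rule, url_threads=2, html_threads=2):
    """Run url-queue and html-queue workers until no work is left in either queue."""
    url_queue, html_queue = queue.Queue(), queue.Queue()
    lock = threading.Lock()
    seen = set(seed_urls)   # assumption: never queue the same URL twice
    results = []            # (url, final data) pairs
    pending = [len(seen)]   # items waiting in, or being processed from, either queue
    done = threading.Event()
    if not seen:
        return results

    def finish_one():
        with lock:
            pending[0] -= 1
            if pending[0] == 0:
                done.set()

    def url_worker():
        while True:
            url = url_queue.get()
            try:
                html = fetch(url)          # e.g. a browser-based fetch
                with lock:
                    pending[0] += 1        # count the html item before finishing the url item
                html_queue.put((url, html))
            except Exception:
                pass                       # a failed fetch is simply skipped in this sketch
            finally:
                finish_one()

    def html_worker():
        while True:
            url, html = html_queue.get()
            try:
                data = parse(html)         # apply the predetermined crawl rule
                with lock:
                    results.append((url, data))
                    links = dict.fromkeys(LINK_RE.findall(html))  # drop repeats, keep order
                    new_urls = [u for u in links if url_rule.search(u) and u not in seen]
                    seen.update(new_urls)
                    pending[0] += len(new_urls)
                for new_url in new_urls:
                    url_queue.put(new_url)
            except Exception:
                pass                       # a failed parse is simply skipped in this sketch
            finally:
                finish_one()

    for seed in seen:
        url_queue.put(seed)
    for target in [url_worker] * url_threads + [html_worker] * html_threads:
        threading.Thread(target=target, daemon=True).start()
    done.wait()
    return results
```

The two thread counts correspond to the url queue and html queue thread numbers of the thread-count setting instruction. Counting outstanding items in one shared counter is one of several ways to detect that both queues have run dry; it is used here only because it keeps the termination logic short.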

A data crawling method provided by an embodiment of the invention comprises: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawl rule; judging whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.

A data crawling system provided by an embodiment of the invention is introduced below; the data crawling system described below and the data crawling method described above may be cross-referenced.

Referring to Fig. 2, a data crawling system provided by an embodiment of the invention comprises:

a target URL acquisition module 100, configured to obtain a target URL from a url queue;

a source code acquisition module 200, configured to obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;

a first storing module 300, configured to store the source code of the target URL in an html queue;

a data parsing module 400, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawl rule;

a judging module 500, configured to judge whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module 600;

the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.

A data crawling system provided by an embodiment of the invention comprises: a target URL acquisition module 100, configured to obtain a target URL from a url queue; a source code acquisition module 200, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; a first storing module 300, configured to store the source code of the target URL in an html queue; a data parsing module 400, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawl rule; a judging module 500, configured to judge whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module 600; and the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.

Wherein the judging module comprises:

Wherein the data crawling system comprises:

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data crawling method, characterized by comprising:

2. The data crawling method according to claim 1, characterized in that judging whether a URL exists in the source code of the target URL comprises:

3. The data crawling method according to claim 2, characterized in that, before the target URL is obtained from the url queue, the method comprises:

4. The data crawling method according to claim 3, characterized in that, before the target URL is obtained from the url queue, the method comprises:

5. A data crawling system, characterized by comprising:

6. The data crawling system according to claim 5, characterized in that the judging module comprises:

7. The data crawling system according to claim 6, characterized in that the data crawling system comprises:

8. The data crawling system according to claim 7, characterized in that the data crawling system comprises: