CN105930385A - Data crawling method and system - Google Patents

Publication number
CN105930385A
CN105930385A CN201610232182.4A CN201610232182A CN105930385A CN 105930385 A CN105930385 A CN 105930385A CN 201610232182 A CN201610232182 A CN 201610232182A CN 105930385 A CN105930385 A CN 105930385A
Authority
CN
China
Prior art keywords
network address
url
source code
target network
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610232182.4A
Other languages
Chinese (zh)
Inventor
祝奔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Gotech Intelligent Technology Co Ltd
Original Assignee
Zhuhai Gotech Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-04-13
Filing date
2016-04-13
Publication date
2016-09-07
Application filed by Zhuhai Gotech Intelligent Technology Co Ltd
Priority to CN201610232182.4A
Publication of CN105930385A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a data crawling method and system. The method comprises the following steps: obtaining a target URL from a url queue and obtaining the source code of the target URL; saving the source code of the target URL into an html queue, and parsing the final data of the target URL from that source code; determining whether the source code of the target URL contains a URL; and, if so, extracting the URL from the source code and saving it into the url queue. In the embodiments of the invention, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is obtained through the html queue. Because pages are accessed through a browser, anti-crawling measures are bypassed and the specified information can be obtained, so data is crawled quickly and the cost of crawling is reduced.

Description

Data crawling method and system
Technical field
The present invention relates to the technical field of data crawling, and more particularly to a data crawling method and system.
Background art
In routine web front-end development, it is often necessary to collect large amounts of information from the Internet. Doing this by hand consumes a great deal of labor and time, so a better approach is to write a crawler script to collect the information. A crawler program continuously sends http requests to a server; the server has to receive these requests, process them, and finally return data. However, the same principle can be used to attack a server maliciously: multiple programs send http requests to the same server simultaneously, keeping the server busy, which degrades its performance and affects its stability. Some servers therefore take measures to prevent their content from being accessed by crawler programs. A web page may use more than one anti-crawling technique, so crawling its content requires analyzing the website's specific anti-crawling measures and then writing a corresponding solution in code. If a page combines many anti-crawling techniques, the crawler program becomes very complex, which directly increases the cost of crawling.
Therefore, how to crawl data is a problem that those skilled in the art need to solve.
Summary of the invention
The object of the present invention is to provide a data crawling method and system that crawl data at low cost.
To achieve the above object, the embodiments of the present invention provide the following technical solutions:
A data crawling method, comprising:
obtaining a target URL from a url queue, and obtaining the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
determining whether a URL exists in the source code of the target URL;
if so, extracting the URL from the source code of the target URL and storing it in the url queue.
Wherein determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that meets a preset URL rule.
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and setting the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
A data crawling system, comprising:
a target URL acquisition module, configured to obtain a target URL from a url queue;
a source code acquisition module, configured to obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
a first storing module, configured to store the source code of the target URL in an html queue;
a data parsing module, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module;
the second storing module, configured to extract the URL from the source code of the target URL and store it in the url queue.
Wherein the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that meets a preset URL rule.
Wherein the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction;
a setting module, configured to set the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
As can be seen from the above solutions, the embodiments of the present invention provide a data crawling method and system, comprising: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code according to a predetermined crawling rule; determining whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is ultimately obtained through the html queue. Because pages are accessed in the form of a browser, anti-crawling measures can be bypassed directly and the specified information obtained, which not only crawls data quickly but also reduces the cost of crawling.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data crawling method disclosed in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data crawling system disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The embodiments of the present invention disclose a data crawling method and system for crawling data at low cost. It should be noted that the crawling method of this embodiment first requires the Selenium tool to be installed. Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. It supports the current mainstream browsers and runs on multiple platforms.
Referring to Fig. 1, a data crawling method provided by an embodiment of the present invention comprises:
S101: obtaining a target URL from a url queue, and obtaining the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and setting the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Specifically, crawlers in the prior art can only crawl specified information from particular websites, whereas in this embodiment the URLs to be read can be configured through the pre-stored URL setting instruction, so that data can be crawled from multiple websites.
Specifically, the pre-stored URL in this embodiment may be the home page of a website, for example: homepage = http://www.imdb.com/, in which case all web pages under that domain name can be crawled. If only the content of pages matching a particular rule is wanted, a specific page rule can also be set, but it must conform to Python syntax. For example: url = "http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm" % number.
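The page rule above is ordinary Python %-formatting, so it can be expanded into a list of seed URLs before they are placed in the url queue. The following is only a minimal sketch: the rule string is copied as printed in the text, while the helper name expand_rule and the number range are hypothetical and chosen for illustration.

```python
# Rule string copied as printed in the description; %0.7d zero-pads the number to 7 digits.
URL_RULE = "http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm"


def expand_rule(rule, numbers):
    """Return one URL per number by applying the %-style template (hypothetical helper)."""
    return [rule % number for number in numbers]


# Illustrative range only; a real crawl would choose its own numbers.
seed_urls = expand_rule(URL_RULE, range(1, 4))
for url in seed_urls:
    print(url)
```

Each resulting string could then be stored in the url queue as a pre-stored URL.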
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
Specifically, the thread count setting instruction in this embodiment is set by the user according to the configuration of the host; on an ordinary PC it can be set to 10.
S102: storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
Specifically, the predetermined crawling rule in this embodiment is a preset rule that pinpoints the position of the desired information within the page; the rules must be set in order. For example:
With a CSS selector, to find the content whose id is itemprop, set id=itemprop; to find the content whose class is job inside the element whose id is itemprop, set id=itemprop, class=job; and so on. A regular expression can also be used directly to locate content, for example reg = 'id="itme">(.*?)<'. CSS selectors and regular expressions can also be used in combination, but attention must be paid to their order.
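The reason the rule order matters can be illustrated with a small sketch in which each rule narrows the text matched by the previous one. This sketch uses only regular expressions from the Python standard library in place of CSS selectors; the function name apply_rules, the sample markup, and the two patterns are assumptions made for illustration and are not taken from the patent.

```python
import re


def apply_rules(html, rules):
    """Apply regex rules in order; each rule searches inside the matches of the previous rule."""
    fragments = [html]
    for pattern in rules:
        narrowed = []
        for fragment in fragments:
            # With one capture group, findall returns the captured text of every match.
            narrowed.extend(re.findall(pattern, fragment, flags=re.DOTALL))
        fragments = narrowed
    return [fragment.strip() for fragment in fragments]


# Hypothetical page: two "job" spans inside the itemprop block, one outside it.
SAMPLE = (
    '<div id="itemprop"><span class="job">Actor</span>'
    '<span class="job">Producer</span></div>'
    '<span class="job">ignored</span>'
)
RULES = [
    r'<div id="itemprop">(.*?)</div>',   # first locate the id=itemprop block
    r'<span class="job">(.*?)</span>',   # then locate class=job inside it
]

jobs = apply_rules(SAMPLE, RULES)
print(jobs)
```

Applying the same two rules in the reverse order finds nothing in this sample, because the class=job matches no longer contain the id=itemprop block.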
S103: determining whether a URL exists in the source code of the target URL; if so, extracting the URL from the source code of the target URL and storing it in the url queue.
Wherein determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that meets a preset URL rule.
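One possible way to implement the check in S103 is to collect the href values in the source code, resolve them against the page address, and keep those that match the preset URL rule. The sketch below assumes the rule is expressed as a regular expression; the function name extract_urls, the example.com addresses, and the sample page are hypothetical.

```python
import re
from urllib.parse import urljoin

# Simplified href matcher for illustration; a full HTML parser would be more robust.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def extract_urls(source, base_url, url_rule):
    """Return absolute links in `source` that match `url_rule`, without duplicates."""
    rule = re.compile(url_rule)
    found = []
    for href in HREF_RE.findall(source):
        absolute = urljoin(base_url, href)  # resolve relative links against the page URL
        if rule.search(absolute) and absolute not in found:
            found.append(absolute)
    return found


page = (
    '<a href="/name/1/">A</a> '
    '<a href="http://other.example/x">B</a> '
    '<a href="/name/1/">A again</a>'
)
links = extract_urls(page, "http://www.example.com/", r"^http://www\.example\.com/name/")
print(links)
```

An empty result corresponds to the "no URL exists" branch, in which case nothing is added to the url queue.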
Specifically, this embodiment bypasses anti-crawling measures by obtaining the specified information in the form of browser access. Anti-crawling measures only prevent programs from sending large numbers of requests to the server to crawl page content, so access in the form of a browser succeeds easily.
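Browser-style access with Selenium could look roughly like the sketch below. It assumes the Selenium Python bindings and a Chrome driver are installed; the choice of Chrome and the function names make_driver and fetch_source are illustrative assumptions, since the description only states that Selenium is used.

```python
def make_driver():
    """Start a real browser session (requires Selenium and a matching browser driver)."""
    from selenium import webdriver  # imported lazily so fetch_source can be used without Selenium
    return webdriver.Chrome()


def fetch_source(driver, url):
    """Load `url` in the browser and return the page source the browser holds."""
    driver.get(url)
    return driver.page_source
```

A typical use would be to call make_driver() once, pass the driver to fetch_source for each URL taken from the url queue, and call driver.quit() when crawling ends.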
Specifically, this embodiment can be understood as follows: a pre-stored URL is first read from the url queue and accessed to obtain its source code, and the source code is stored in the html queue. After the html queue receives the source code, the source code is first analyzed to obtain the final data; it is then analyzed to determine whether it contains a URL that meets the predetermined URL rule. If so, the URL is extracted from the source code and stored in the url queue, and the URLs in the url queue continue to be accessed until the url queue or the html queue has no data.
It can be seen that URLs are accessed and the source code obtained from each access is saved in the html queue, while new URLs are obtained from the source code in the html queue and stored in the url queue. The threads of the two queues work in a cycle, which speeds up data crawling.
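The cooperation of the two queues can be sketched with Python's queue and threading modules. This is a sketch under stated assumptions rather than the patented implementation: the function crawl, its parameters, the duplicate filter, and the termination scheme (a URL counts as finished only after its source code has been parsed) are all choices made for this illustration, and the in-memory SITE dictionary stands in for real browser access.

```python
import queue
import re
import threading


def crawl(seed_urls, fetch, parse, extract_urls, url_threads=2, html_threads=2):
    """Run the url-queue / html-queue loop until every discovered URL is processed."""
    url_queue = queue.Queue()    # URLs waiting to be fetched
    html_queue = queue.Queue()   # (url, source) pairs waiting to be parsed
    results = {}                 # url -> parsed final data
    seen = set()                 # URLs already enqueued, to avoid refetching
    lock = threading.Lock()      # guards `results` and `seen`

    def enqueue(url):
        with lock:
            if url in seen:
                return
            seen.add(url)
        url_queue.put(url)

    def url_worker():
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: stop this thread
                break
            try:
                html_queue.put((url, fetch(url)))
            except Exception:
                url_queue.task_done()    # give up on this URL so join() can finish

    def html_worker():
        while True:
            item = html_queue.get()
            if item is None:             # sentinel: stop this thread
                break
            url, source = item
            try:
                data = parse(source)
                with lock:
                    results[url] = data
                for link in extract_urls(source):
                    enqueue(link)
            except Exception:
                pass                     # skip pages that fail to parse
            finally:
                # Mark the URL finished only after its links were enqueued,
                # so the unfinished count cannot reach zero too early.
                url_queue.task_done()

    workers = [threading.Thread(target=url_worker) for _ in range(url_threads)]
    workers += [threading.Thread(target=html_worker) for _ in range(html_threads)]
    for worker in workers:
        worker.start()
    for url in seed_urls:
        enqueue(url)
    url_queue.join()                     # wait until every URL is fetched and parsed
    for _ in range(url_threads):
        url_queue.put(None)
    for _ in range(html_threads):
        html_queue.put(None)
    for worker in workers:
        worker.join()
    return results


# Hypothetical in-memory "site" standing in for real pages.
SITE = {
    "a": 'data=1 <a href="b"> <a href="c">',
    "b": 'data=2 <a href="a">',
    "c": "data=3",
}
results = crawl(
    ["a"],
    fetch=SITE.__getitem__,
    parse=lambda src: re.search(r"data=(\d+)", src).group(1),
    extract_urls=lambda src: re.findall(r'href="(.*?)"', src),
)
print(sorted(results.items()))
```

The url_threads and html_threads parameters play the role of the url queue thread count and html queue thread count from the thread count setting instruction; the description suggests 10 for an ordinary PC.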
A data crawling method provided by an embodiment of the present invention comprises: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code according to a predetermined crawling rule; determining whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is ultimately obtained through the html queue. Because pages are accessed in the form of a browser, anti-crawling measures can be bypassed directly and the specified information obtained, which not only crawls data quickly but also reduces the cost of crawling.
A data crawling system provided by an embodiment of the present invention is described below; the data crawling system described below and the data crawling method described above may be cross-referenced.
Referring to Fig. 2, a data crawling system provided by an embodiment of the present invention comprises:
a target URL acquisition module 100, configured to obtain a target URL from a url queue;
a source code acquisition module 200, configured to obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
a first storing module 300, configured to store the source code of the target URL in an html queue;
a data parsing module 400, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module 500, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module 600;
the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.
A data crawling system provided by an embodiment of the present invention thus comprises: the target URL acquisition module 100, configured to obtain a target URL from a url queue; the source code acquisition module 200, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; the first storing module 300, configured to store the source code of the target URL in an html queue; the data parsing module 400, configured to parse the final data of the target URL from the source code according to a predetermined crawling rule; the judging module 500, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger the second storing module 600; and the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is ultimately obtained through the html queue. Because pages are accessed in the form of a browser, anti-crawling measures can be bypassed directly and the specified information obtained, which not only crawls data quickly but also reduces the cost of crawling.
Wherein the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that meets a preset URL rule.
Wherein the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction;
a setting module, configured to set the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A data crawling method, characterized by comprising:
obtaining a target URL from a url queue, and obtaining the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
determining whether a URL exists in the source code of the target URL;
if so, extracting the URL from the source code of the target URL and storing it in the url queue.
2. The data crawling method according to claim 1, characterized in that determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that meets a preset URL rule.
3. The data crawling method according to claim 2, characterized in that, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and setting the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
4. The data crawling method according to claim 3, characterized in that, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
5. A data crawling system, characterized by comprising:
a target URL acquisition module, configured to obtain a target URL from a url queue;
a source code acquisition module, configured to obtain the source code of the target URL; wherein the URLs stored in the url queue include at least: pre-stored URLs, and/or URLs extracted from source code;
a first storing module, configured to store the source code of the target URL in an html queue;
a data parsing module, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module;
the second storing module, configured to extract the URL from the source code of the target URL and store it in the url queue.
6. The data crawling system according to claim 5, characterized in that the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that meets a preset URL rule.
7. The data crawling system according to claim 6, characterized in that the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction;
a setting module, configured to set the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
8. The data crawling system according to claim 7, characterized in that the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction; wherein the thread count setting instruction includes at least: a url queue thread count, and/or an html queue thread count.
CN201610232182.4A 2016-04-13 2016-04-13 Data crawling method and system Pending CN105930385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610232182.4A CN105930385A (en) 2016-04-13 2016-04-13 Data crawling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610232182.4A CN105930385A (en) 2016-04-13 2016-04-13 Data crawling method and system

Publications (1)

Publication Number Publication Date
CN105930385A true CN105930385A (en) 2016-09-07

Family

ID=56839137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610232182.4A Pending CN105930385A (en) 2016-04-13 2016-04-13 Data crawling method and system

Country Status (1)

Country Link
CN (1) CN105930385A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649720A (en) * 2016-12-22 2017-05-10 北京览群智数据科技有限责任公司 Data processing method and server
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code
CN109271145A (en) * 2018-09-03 2019-01-25 科大国创软件股份有限公司 Fast regular method for customizing based on pythonQT and intelligent algorithm
CN110020076A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 The method and apparatus that web data crawls

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043862A (en) * 2010-12-29 2011-05-04 重庆新媒农信科技有限公司 Directional web data extraction method
CN103577427A (en) * 2012-07-25 2014-02-12 中国移动通信集团公司 Browser kernel based web page crawling method and device and browser containing device
US20140052433A1 (en) * 2012-08-16 2014-02-20 Fujitsu Limited Automatically extracting a model for the behavior of a mobile application
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649720A (en) * 2016-12-22 2017-05-10 北京览群智数据科技有限责任公司 Data processing method and server
CN110020076A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 The method and apparatus that web data crawls
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code
CN109271145A (en) * 2018-09-03 2019-01-25 科大国创软件股份有限公司 Fast regular method for customizing based on pythonQT and intelligent algorithm
CN109271145B (en) * 2018-09-03 2021-12-14 科大国创软件股份有限公司 Quick rule customizing method based on pythonQT and intelligent algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160907