CN105930385A - Data crawling method and system - Google Patents
Data crawling method and system
- Publication number
- CN105930385A (application CN201610232182.4A)
- Authority
- CN
- China
- Prior art keywords
- network address
- url
- source code
- target network
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a data crawling method and system. The method comprises the following steps: obtaining a target URL from a url queue and obtaining the source code of the target URL; saving the source code of the target URL into an html queue, and parsing the final data of the target URL from that source code; determining whether a URL exists in the source code of the target URL; and, if a URL exists, extracting it from the source code of the target URL and saving it into the url queue. According to the embodiment of the invention, source code is obtained from the pre-stored URLs saved in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is obtained through the html queue. Because pages are accessed through a browser, anti-crawling measures are bypassed, the specified information can be obtained, data is crawled quickly, and the cost of crawling is reduced.
Description
Technical field
The present invention relates to the technical field of data crawling, and more particularly to a data crawling method and system.
Background
In day-to-day web front-end development, it is often necessary to collect large amounts of information from the Internet. Doing this by hand consumes a great deal of labour and time, so a better approach is to write a crawler script to collect the information. A crawler program continually sends HTTP requests to a server; the server must receive these requests, process them, and finally return data. However, a crawler can also exploit this principle to attack a server maliciously: multiple programs send HTTP requests to the same server at once, keeping the server busy, degrading its performance, and affecting its stability. Some servers therefore take measures to prevent crawler programs from accessing their content. A page protected against crawlers may use more than one anti-crawling technique, so to crawl its content one must analyse the site's specific anti-crawling measures and then write a corresponding workaround in code. If a page uses many different techniques to block crawlers, the crawler program becomes very complex, which directly increases the cost of crawling.
Therefore, how to crawl data at low cost is a problem that those skilled in the art need to solve.
Summary of the invention
The object of the present invention is to provide a data crawling method and system that crawl data at low cost.
To achieve this object, embodiments of the present invention provide the following technical solution:
A data crawling method, comprising:
obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
determining whether a URL exists in the source code of the target URL; and
if so, extracting the URL from the source code of the target URL and storing it in the url queue.
Wherein determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that satisfies a preset URL rule.
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and configuring the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
A data crawling system, comprising:
a target URL acquisition module, configured to obtain a target URL from a url queue;
a source code acquisition module, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
a first storing module, configured to store the source code of the target URL in an html queue;
a data parsing module, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module; and
the second storing module, configured to extract the URL from the source code of the target URL and store it in the url queue.
Wherein the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that satisfies a preset URL rule.
Wherein the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction; and
a setting module, configured to configure the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
As the above solution shows, the data crawling method and system provided by the embodiments of the present invention comprise: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule; determining whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is obtained through the html queue. Because pages are accessed through a browser, anti-crawling measures are bypassed directly and the specified information is obtained, so that data can be crawled quickly and the cost of crawling is reduced.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data crawling method disclosed in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data crawling system disclosed in an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The embodiments of the present invention disclose a data crawling method and system for crawling data at low cost. Note that the data crawling method of this embodiment first requires the Selenium tool to be installed. Selenium is a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it. It supports the current mainstream browsers and runs on multiple platforms.
Referring to Fig. 1, a data crawling method provided by an embodiment of the present invention comprises:
S101: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and configuring the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Specifically, a prior-art crawler can crawl only specified information from certain websites, whereas in this embodiment the URLs to be read can be configured by reading the pre-stored URL setting instruction, so that data can be crawled from multiple websites.
Specifically, the preset URL in this embodiment may be the home page of a website, for example homepage=http://www.imdb.com/, in which case all pages under that domain can be crawled. If only the content of pages matching a particular rule is wanted, a specific page rule can be set instead, but it must conform to Python syntax. For example: url="http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm" % number.
Wherein, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
Specifically, the thread count in this embodiment is set by the user according to the host's configuration; on a typical PC it can be set to 10.
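The two setting instructions described for S101 (pre-stored URLs and thread counts) can be pictured as a small settings object. The following is only a minimal sketch under stated assumptions: the names `CrawlSettings` and `urls_from_template` are hypothetical and do not appear in the patent, and the default of 10 threads per queue simply mirrors the value suggested for a typical PC. The URL template is the one quoted in the text, filled in with Python's `%` operator.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlSettings:
    """Hypothetical container for the pre-stored URLs and thread counts."""
    seed_urls: list = field(default_factory=list)  # pre-stored URLs
    url_threads: int = 10    # threads serving the url queue
    html_threads: int = 10   # threads serving the html queue

    def __post_init__(self):
        # Reject settings that would leave a queue without any worker.
        if self.url_threads < 1 or self.html_threads < 1:
            raise ValueError("thread counts must be at least 1")


# Page rule quoted in the description; %0.7d pads the number to 7 digits.
URL_TEMPLATE = "http://www.imdb.com/name/num%0.7d/?Ref=numbio_bio_nm"


def urls_from_template(template, numbers):
    """Expand a %-style template into one URL per number."""
    return [template % number for number in numbers]


settings = CrawlSettings(seed_urls=urls_from_template(URL_TEMPLATE, range(1, 4)))
```

Keeping the thread counts next to the seed URLs means one object can be handed to whatever starts the worker threads; how the patent's system actually passes these settings around is not specified.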
S102: storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
Specifically, the predetermined crawling rule in this embodiment is a rule, set in advance, that pinpoints the position of the wanted information within the page; the rules must be given in order. For example, with CSS selectors, to find the content whose id is itemprop, set id=itemprop; to find the content whose class is job within id itemprop, set id=itemprop, class=job; and so on. A regular expression can also be used directly to locate content, for example reg='id="itme">(.*?)<'. CSS selectors and regular expressions can also be combined, but their order must be observed.
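To illustrate why the order of the crawling rules matters, here is a minimal sketch that applies a list of regular expressions one after another, each searching only inside what the previous rule matched. It is an assumption-laden illustration, not the patent's implementation: the function name `apply_rules`, the sample HTML, and the two patterns are hypothetical, each pattern is assumed to contain exactly one capture group, and the CSS-selector half of the description is left out because it would need a third-party parser.

```python
import re


def apply_rules(source, rules):
    """Apply regex rules in order; each rule narrows the previous matches.

    Every pattern is assumed to contain exactly one capture group.
    """
    fragments = [source]
    for pattern in rules:
        regex = re.compile(pattern, re.S)
        # Search each surviving fragment and keep only the captured text.
        fragments = [m for frag in fragments for m in regex.findall(frag)]
    return fragments


# Hypothetical page source: two jobs inside the block of interest, one outside.
html = (
    '<div id="itemprop"><span class="job">Actor</span>'
    '<span class="job">Director</span></div>'
    '<span class="job">ignored</span>'
)
rules = [r'<div id="itemprop">(.*?)</div>', r'class="job">(.*?)<']

jobs = apply_rules(html, rules)              # outer rule first, then inner rule
reversed_jobs = apply_rules(html, rules[::-1])  # same rules, wrong order
```

With the rules in the intended order only the jobs inside the `itemprop` block are returned; reversing them yields nothing, because the outer pattern can no longer match inside the already-extracted job names.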
S103: determining whether a URL exists in the source code of the target URL; if so, extracting the URL from the source code of the target URL and storing it in the url queue.
Wherein determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that satisfies a preset URL rule.
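One way to picture the check in S103 is to collect the links in a page and keep only those that match the preset URL rule. The sketch below uses only the Python standard library; the class `LinkCollector`, the function `extract_matching_urls`, the sample page, and the regular expression used as the URL rule are all hypothetical, since the patent does not say how the rule is expressed.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_matching_urls(source, base_url, url_rule):
    """Return absolute URLs found in source that fully match url_rule."""
    parser = LinkCollector()
    parser.feed(source)
    pattern = re.compile(url_rule)
    absolute = (urljoin(base_url, link) for link in parser.links)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return [url for url in dict.fromkeys(absolute) if pattern.fullmatch(url)]


# Hypothetical page with one matching link, one other path, one other host.
page = (
    '<a href="/name/num0000001/">one</a>'
    '<a href="/title/tt1/">title</a>'
    '<a href="http://other.test/name/num0000002/">elsewhere</a>'
)
rule = r"http://www\.imdb\.com/name/num\d{7}/"
matches = extract_matching_urls(page, "http://www.imdb.com/", rule)
```

An empty result corresponds to the "no URL exists" branch, in which nothing is added to the url queue.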
Specifically, this embodiment obtains the specified information by bypassing anti-crawling measures, that is, by accessing pages through a browser. Anti-crawling measures only prevent programs from sending large numbers of requests to the server to crawl page content, so pages can be accessed easily through a browser.
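Fetching a page's source through a real browser, as described above, might look like the following minimal sketch. It assumes the third-party Selenium package and a matching browser driver are installed; the function name `fetch_source` and the `driver_factory` parameter are hypothetical additions that let the browser be swapped out, and the choice of Chrome as the default is an assumption, since the patent names Selenium but no particular browser.

```python
def fetch_source(url, driver_factory=None):
    """Open url in a browser session and return the rendered page source."""
    if driver_factory is None:
        # Imported lazily so the function can be exercised without Selenium.
        from selenium import webdriver  # third-party; assumed to be installed
        driver_factory = webdriver.Chrome
    driver = driver_factory()
    try:
        driver.get(url)            # navigate as a normal browser would
        return driver.page_source  # HTML of the loaded page
    finally:
        driver.quit()              # always release the browser process
```

Starting and quitting a browser for every URL is simple but slow; a longer-running crawler would more likely keep one browser per worker thread, which is a design choice the patent leaves open.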
Specifically, this embodiment can be understood as follows. A pre-stored URL is first read from the url queue; the pre-stored URL is accessed to obtain its source code, and the source code is stored in the html queue. After the html queue receives the source code, the source code is first analysed to obtain the final data. The source code is also checked for URLs that satisfy the predetermined URL rule; if there are any, they are extracted from the source code and stored in the url queue, and the URLs in the url queue continue to be accessed until the url queue or the html queue contains no data.
It can be seen that URLs are accessed and the source code obtained from each one is saved in the html queue, while new URLs are obtained from the source code in the html queue and stored in the url queue. The threads of the two queues thus work in a cycle, which speeds up data crawling.
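The two-queue cycle described above can be sketched with Python's standard `queue` and `threading` modules. This is an illustration under several assumptions rather than the patent's implementation: the function `crawl`, the in-memory `PAGES` dictionary standing in for real websites, and both regular expressions are hypothetical; the `fetch` callable is assumed to return a string; and the `seen` set that prevents revisiting a URL is an addition not mentioned in the patent, included so that the loop is guaranteed to end.

```python
import queue
import re
import threading

LINK_RE = re.compile(r'href="([^"]+)"')      # hypothetical URL rule
DATA_RE = re.compile(r'id="item">(.*?)<')    # hypothetical crawling rule


def crawl(seed_urls, fetch, url_threads=2, html_threads=2):
    """Run url-queue and html-queue workers until every URL is processed."""
    url_queue, html_queue = queue.Queue(), queue.Queue()
    seeds = list(dict.fromkeys(seed_urls))
    seen, results = set(seeds), {}
    lock = threading.Lock()

    def url_worker():
        while True:
            url = url_queue.get()
            if url is None:               # sentinel: stop this worker
                return
            try:
                source = fetch(url)
            except Exception:
                url_queue.task_done()     # give up on this URL
                continue
            html_queue.put((url, source))

    def html_worker():
        while True:
            item = html_queue.get()
            if item is None:
                return
            url, source = item
            data = DATA_RE.findall(source)
            links = dict.fromkeys(LINK_RE.findall(source))
            with lock:
                results[url] = data
                fresh = [u for u in links if u not in seen]
                seen.update(fresh)
            for u in fresh:               # queue new URLs before finishing this one
                url_queue.put(u)
            url_queue.task_done()         # this URL is now fully processed

    for u in seeds:
        url_queue.put(u)
    workers = [threading.Thread(target=url_worker) for _ in range(url_threads)]
    workers += [threading.Thread(target=html_worker) for _ in range(html_threads)]
    for w in workers:
        w.start()
    url_queue.join()                      # returns once no URL is pending
    for _ in range(url_threads):
        url_queue.put(None)
    for _ in range(html_threads):
        html_queue.put(None)
    for w in workers:
        w.join()
    return results


# Hypothetical three-page site used instead of real network access.
PAGES = {
    "http://example.test/": '<a href="http://example.test/a">A</a><p id="item">home</p>',
    "http://example.test/a": '<a href="http://example.test/">back</a>'
                             '<a href="http://example.test/b">B</a><p id="item">page a</p>',
    "http://example.test/b": '<p id="item">page b</p>',
}
results = crawl(["http://example.test/"], lambda url: PAGES.get(url, ""))
```

A URL is marked done only after its source has been parsed and its new links queued, so `url_queue.join()` cannot return while work is still in flight; this is one possible reading of "until the url queue or the html queue contains no data".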
A data crawling method provided by an embodiment of the present invention comprises: obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule; determining whether a URL exists in the source code of the target URL; and, if so, extracting the URL from the source code of the target URL and storing it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is obtained through the html queue. Because pages are accessed through a browser, anti-crawling measures are bypassed directly and the specified information is obtained, so that data can be crawled quickly and the cost of crawling is reduced.
A data crawling system provided by an embodiment of the present invention is introduced below; the data crawling system described below and the data crawling method described above may be read with reference to each other.
Referring to Fig. 2, a data crawling system provided by an embodiment of the present invention comprises:
a target URL acquisition module 100, configured to obtain a target URL from a url queue;
a source code acquisition module 200, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
a first storing module 300, configured to store the source code of the target URL in an html queue;
a data parsing module 400, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module 500, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module 600; and
the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.
A data crawling system provided by an embodiment of the present invention comprises: a target URL acquisition module 100, configured to obtain a target URL from a url queue; a source code acquisition module 200, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code; a first storing module 300, configured to store the source code of the target URL in an html queue; a data parsing module 400, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule; a judging module 500, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module 600; and the second storing module 600, configured to extract the URL from the source code of the target URL and store it in the url queue.
It can be seen that, in this embodiment, source code is obtained from the pre-stored URLs held in the url queue, the URLs extracted from the source code are put back into the url queue, and the final data in the source code is obtained through the html queue. Because pages are accessed through a browser, anti-crawling measures are bypassed directly and the specified information is obtained, so that data can be crawled quickly and the cost of crawling is reduced.
Wherein the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that satisfies a preset URL rule.
Wherein the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction; and
a setting module, configured to configure the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
Wherein the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. A data crawling method, characterized by comprising:
obtaining a target URL from a url queue, and obtaining the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
storing the source code of the target URL in an html queue, and parsing the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
determining whether a URL exists in the source code of the target URL; and
if so, extracting the URL from the source code of the target URL and storing it in the url queue.
2. The data crawling method according to claim 1, characterized in that determining whether a URL exists in the source code of the target URL comprises:
determining whether the source code of the target URL contains a URL that satisfies a preset URL rule.
3. The data crawling method according to claim 2, characterized in that, before the target URL is obtained from the url queue, the method comprises:
receiving a pre-stored URL setting instruction, and configuring the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
4. The data crawling method according to claim 3, characterized in that, before the target URL is obtained from the url queue, the method comprises:
receiving a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
5. A data crawling system, characterized by comprising:
a target URL acquisition module, configured to obtain a target URL from a url queue;
a source code acquisition module, configured to obtain the source code of the target URL, wherein the URLs stored in the url queue include at least pre-stored URLs and/or URLs extracted from source code;
a first storing module, configured to store the source code of the target URL in an html queue;
a data parsing module, configured to parse the final data of the target URL from the source code of the target URL according to a predetermined crawling rule;
a judging module, configured to determine whether a URL exists in the source code of the target URL and, if so, to trigger a second storing module; and
the second storing module, configured to extract the URL from the source code of the target URL and store it in the url queue.
6. The data crawling system according to claim 5, characterized in that the judging module comprises:
a judging unit, configured to determine whether the source code of the target URL contains a URL that satisfies a preset URL rule.
7. The data crawling system according to claim 6, characterized in that the data crawling system comprises:
a first receiving module, configured to receive a pre-stored URL setting instruction; and
a setting module, configured to configure the pre-stored URLs in the url queue according to the pre-stored URL setting instruction.
8. The data crawling system according to claim 7, characterized in that the data crawling system comprises:
a second receiving module, configured to receive a thread count setting instruction, wherein the thread count setting instruction includes at least a url queue thread count and/or an html queue thread count.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610232182.4A CN105930385A (en) | 2016-04-13 | 2016-04-13 | Data crawling method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610232182.4A CN105930385A (en) | 2016-04-13 | 2016-04-13 | Data crawling method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105930385A true CN105930385A (en) | 2016-09-07 |
Family
ID=56839137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610232182.4A Pending CN105930385A (en) | 2016-04-13 | 2016-04-13 | Data crawling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105930385A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649720A (en) * | 2016-12-22 | 2017-05-10 | 北京览群智数据科技有限责任公司 | Data processing method and server |
CN108664559A (en) * | 2018-03-30 | 2018-10-16 | 中山大学 | A kind of automatic crawling method of website and webpage source code |
CN109271145A (en) * | 2018-09-03 | 2019-01-25 | 科大国创软件股份有限公司 | Fast regular method for customizing based on pythonQT and intelligent algorithm |
CN110020076A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | The method and apparatus that web data crawls |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102043862A (en) * | 2010-12-29 | 2011-05-04 | 重庆新媒农信科技有限公司 | Directional web data extraction method |
CN103577427A (en) * | 2012-07-25 | 2014-02-12 | 中国移动通信集团公司 | Browser kernel based web page crawling method and device and browser containing device |
US20140052433A1 (en) * | 2012-08-16 | 2014-02-20 | Fujitsu Limited | Automatically extracting a model for the behavior of a mobile application |
CN104376063A (en) * | 2014-11-11 | 2015-02-25 | 南京邮电大学 | Multithreading web crawler method based on sort management and real-time information updating system |
-
2016
- 2016-04-13 CN CN201610232182.4A patent/CN105930385A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649720A (en) * | 2016-12-22 | 2017-05-10 | 北京览群智数据科技有限责任公司 | Data processing method and server |
CN110020076A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | The method and apparatus that web data crawls |
CN108664559A (en) * | 2018-03-30 | 2018-10-16 | 中山大学 | A kind of automatic crawling method of website and webpage source code |
CN109271145A (en) * | 2018-09-03 | 2019-01-25 | 科大国创软件股份有限公司 | Fast regular method for customizing based on pythonQT and intelligent algorithm |
CN109271145B (en) * | 2018-09-03 | 2021-12-14 | 科大国创软件股份有限公司 | Quick rule customizing method based on pythonQT and intelligent algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8424004B2 (en) | High performance script behavior detection through browser shimming | |
US20150244670A1 (en) | Browser and method for domain name resolution by the same | |
CN104572777B (en) | Webpage loading method and device based on UIWebView component | |
CN107463641A (en) | System and method for improving the access to search result | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
CN104881608A (en) | XSS vulnerability detection method based on simulating browser behavior | |
US9785710B2 (en) | Automatic crawling of encoded dynamic URLs | |
CN106126747A (en) | Data capture method based on reptile and device | |
CN104881607A (en) | XSS vulnerability detection method based on simulating browser behavior | |
CN105930385A (en) | Data crawling method and system | |
CN106126693A (en) | The sending method of the related data of a kind of webpage and device | |
CN110442815A (en) | Page generation method, system, device and computer readable storage medium | |
CN105528369B (en) | Webpage code-transferring method, device and server | |
CN103248707B (en) | File access method, system and equipment | |
CN102855334A (en) | Browser and method for acquiring domain name system (DNS) resolving data | |
CN106326734A (en) | Method and device for detecting sensitive information | |
US9465814B2 (en) | Annotating search results with images | |
CN106874502A (en) | A kind of method of video search, device and terminal | |
CN103617225B (en) | A kind of associating web pages searching method and system | |
CN110532455A (en) | A kind of Web page picture acquisition methods and system based on Chrome browser | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
CN103347069A (en) | Method and device for realizing network access | |
US10095791B2 (en) | Information search method and apparatus | |
CN103581321B (en) | A kind of creation method of refer chains, device and safety detection method and client | |
CN103092937B (en) | Visual webpage includes detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160907 |