CN108664559A - Method for automatically crawling website page source code

Info

Publication number: CN108664559A
Application number: CN201810297883.5A
Authority: CN
Inventors: 杨智; 陈锭敏
Original assignee: Sun Yat-sen University
Current assignee: Sun Yat-sen University
Priority date: 2018-03-30
Filing date: 2018-03-30
Publication date: 2018-10-16

Abstract

The present invention relates to a method for automatically crawling website page source code. Pages are crawled within a predetermined website, so that the crawled pages are concentrated and share obvious common features, which makes it convenient to write a crawler program for them. Because the pages are crawled within a specific website, the target information is also concentrated, and the required information can be obtained completely and quickly. When crawling page source code, the crawler effectively disguises its web page requests as requests sent by a browser, which prevents the website from identifying it as machine code. A waiting time is set so that the code does not raise an error and stop when the website or the network behaves abnormally and fails to respond to the crawler program; the code can therefore run automatically for a long time to crawl page source code. A proxy IP address database is added so that, when the crawler's IP is blocked and access to the website is denied, the program can switch IP and continue crawling page source code automatically.

Description

Method for automatically crawling website page source code

Technical field

The present invention relates to the technical field of web crawlers, and more particularly to a method for automatically crawling website page source code.

Background art

With the rapid development of Internet technology, the volume of data on the network is growing explosively. This makes it increasingly difficult to find the data we need on the network. Statistically analyzing such diverse, real-time data in order to obtain the valuable information behind it has become very significant. Against this background, big data technology has developed rapidly in recent years and is applied ever more widely across industries. To analyze information using massive data, obtaining the data on the network and storing it becomes particularly important.

At present, when people look for data, they mostly search with a search engine and then browse the resulting websites directly. Although this method is simple and convenient, when the data volume is huge and the data must be stored, it takes a great deal of time and often cannot yield the information we need from the data.

When the data volume is large, the technical solution currently used is to crawl page source code with a web crawler and then extract the required data from it. A web crawler is a program that automatically fetches web pages; the main types are general-purpose web crawlers and focused web crawlers. For example, Chinese invention patent application No. CN201410021505.6 discloses a data acquisition method whose steps are: analyzing the initial data and performing word segmentation to obtain keywords; searching with the keywords to obtain website pages; crawling information from the website pages and matching the page source files against regular expressions to obtain matching results; placing the information in the matching results into the corresponding fields of the corresponding video attribute library, according to the keywords or by comparison with a tag library; performing data analysis, calculating on the basis of the weight of the website pages and the amount of repeated information; and performing editorial confirmation and processing of the automatically filled content. As another example, Chinese invention patent application No. CN201310198598.5 discloses a web page crawling method and system. The method includes: training on sample web pages to obtain data extraction conditions; crawling web pages; parsing each crawled web page into a DOM tree structure to obtain a web page DOM tree; and analyzing the web page DOM tree according to the data extraction conditions to extract the required data.

In the existing technical solutions described above, searching website data with a search engine cannot yield the target data in a short time, because the volume of web data is enormous. When the volume of target data is also very large, searching with a search engine and then finding and downloading items one by one undoubtedly takes a great deal of time, and checking by eye will inevitably miss some data. Using a web crawler to crawl page source code and extract data from it can indeed obtain the target data quickly and accurately, but the existing solutions do not explain further how the page source code is to be crawled. Moreover, when we crawl pages, the website server may have anti-crawler measures in place that directly refuse requests that were obviously initiated by an automated program. The most common anti-crawler measures are to judge whether a request was sent by a browser and whether the request frequency is too high, that is, whether a large number of requests are sent in a short time. These measures often prevent us from successfully crawling the website's page source code.

Therefore, how to enable crawler code to crawl page source code at high frequency within a short time and extract valuable information from it is a problem that urgently needs to be solved.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a method for automatically crawling website page source code that avoids being identified by the website as automated machine code, can run automatically for a long time to crawl page source code, and can continue crawling page source code when the crawler's IP is blocked and access to the website is denied.

To achieve the above object, the technical solution provided by the present invention is as follows.

The method includes the following steps:

S1. Determine the website containing the target information, analyze the website, and determine the pages on which the target information is located and the distinctive common features of those pages;

S2. Load the start page to be crawled and its URL;

S3. Load the common features of the pages to be crawled;

S4. Crawl the source code of the start page and check whether it contains URLs of pages containing target information; if none exists, end the crawler program; if they exist, crawl the source code of the pages containing target information one by one until all pages containing target information have been crawled;

S5. Store the crawled target source code in a specified folder at a specified location, to facilitate later extraction of the target information.
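Steps S2 to S5 can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation: it assumes that the "common feature" of the target pages can be expressed as a regular expression over their URLs, and the names `find_target_urls`, `crawl`, `fetch` and the `page_0000.html` file naming are hypothetical. The `fetch` function is passed in so that the download mechanism can be chosen separately.

```python
import re
from pathlib import Path
from urllib.parse import urljoin


def find_target_urls(start_url, html, feature_pattern):
    """S4 (first half): collect links on the start page whose URL matches the common feature."""
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html)
    seen, targets = set(), []
    for href in hrefs:
        absolute = urljoin(start_url, href)  # resolve relative links against the start page
        if re.search(feature_pattern, absolute) and absolute not in seen:
            seen.add(absolute)
            targets.append(absolute)
    return targets


def crawl(start_url, feature_pattern, fetch, out_dir):
    """Run S2-S5 using an injected fetch(url) -> str function (hypothetical interface)."""
    start_html = fetch(start_url)  # S4: crawl the source code of the start page
    targets = find_target_urls(start_url, start_html, feature_pattern)
    if not targets:  # S4: no page with target information, so end the program
        return []
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(targets):  # S4: crawl the target pages one by one
        path = out / f"page_{index:04d}.html"
        path.write_text(fetch(url), encoding="utf-8")  # S5: store the source for later extraction
        saved.append(path)
    return saved
```

A URL pattern is only one possible way to encode a common feature; a feature found in the page markup itself would need a different matching step.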

Further, when page source code is crawled in step S4, the crawler program is disguised as a computer browser sending web page requests to the website server, which prevents the crawler program from being identified by the website as machine code and denied access.

Further, when the crawler program sends a web page request to the website server, the request headers of the message are modified with reference to the request header parameters that a browser uses when sending requests to the website, so that the crawler's request looks more like a browser access. The parameters of the web page request message that the browser sends to the website can be looked up with the browser's developer tools.

Further, each time the crawler program, disguised as a computer browser, sends a web page request to the website server, it monitors whether the request receives a response. If the first request receives no response, the program enters a waiting period and sends the request again once the waiting time has elapsed. If five consecutive requests receive no response, the proxy IP is replaced, and requests continue to be sent after the replacement.

Further, the proxy IPs are crawled from proxy IP websites by a crawler, and each crawled proxy IP is written to a database after testing confirms that it is usable. When the crawler program's IP needs to be replaced, a usable IP is fetched from the database and written into the crawler program's IP parameter.
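As an illustration of writing a replacement IP into the crawler, the sketch below uses Python's standard `urllib.request.ProxyHandler`. The source does not say which library is used for this step, so this choice is an assumption; the function name `build_proxy_opener` and the address `10.0.0.1:8080` are hypothetical. With urllib, the proxy is configured on the opener that sends the request rather than as a field of the request itself.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return an opener that routes HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({
        "http": f"http://{proxy_address}",
        "https": f"http://{proxy_address}",
    })
    return urllib.request.build_opener(handler)


# Hypothetical proxy address taken from the proxy database.
opener = build_proxy_opener("10.0.0.1:8080")
# opener.open("http://www.example.com/", timeout=10) would now go through the proxy.
```

Replacing the IP then amounts to building a new opener with the next usable address from the database.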

Compared with the prior art, the principle and advantages of this solution are as follows.

First, the website containing the target information is determined, the website is analyzed, and the pages on which the target information is located and the distinctive common features of those pages are determined. Next, the start page to be crawled and its URL, together with the common features of the pages to be crawled, are loaded. Then the source code of the start page is crawled and checked for URLs of pages containing target information; if none exists, the crawler program ends; if they exist, the source code of the pages containing target information is crawled one by one until all such pages have been crawled. Finally, the crawled target source code is stored in a specified folder at a specified location, to facilitate later extraction of the target information.

Specifically, crawling pages within a predetermined website means that the crawled pages are concentrated and share fairly obvious common features, which makes it convenient to write a crawler program for them. Crawling within a specific website also means that the target information is concentrated, so the required information can be obtained completely and quickly.

When page source code is crawled, the crawler program is disguised as a computer browser sending web page requests to the website server, which prevents the crawler program from being identified by the website as machine code and denied access.

A waiting time is set so that the code does not raise an error and stop when the website or the network behaves abnormally and fails to respond to the crawler program; the code can therefore run automatically for a long time to crawl page source code.

A crawler is used to crawl IPs from proxy IP websites, and each crawled IP is tested for usability: if usable, it is written to the database; if not, it is discarded. When the crawler program's IP needs to be replaced, a usable IP is fetched from the database and written into the crawler program's IP parameter. In this way, when the computer's own IP is blocked, using a proxy IP makes the website believe that a different computer is accessing it, so it continues to respond to the crawler's web page requests, which ensures that the crawler program can keep running at high frequency for a long time.

Description of the drawings

Fig. 1 is a flow chart of the method for automatically crawling website page source code according to the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to a specific embodiment:

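The method for automatically crawling website page source code described in this embodiment includes the following steps:

S1. Determine the website containing the target information, analyze the website, and determine the pages on which the target information is located and the distinctive common features of those pages;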

S2. Load the start page to be crawled and its URL. (A URL, or uniform resource locator, is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, and the information it contains indicates where the file is located and how the browser should handle it. Each page on a website is a file stored at a defined location, and that location is expressed by its URL. When a browser accesses a page, it must send the website a request for that page, and the request includes the URL, which specifies the particular page to be accessed. Likewise, when a crawler program crawls page source code, it must first obtain the page's URL before it can send a request to the website and crawl the page's source code.)

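S3. Load the common features of the pages to be crawled;

S4. Crawl the source code of the start page and check whether it contains URLs of pages containing target information; if none exists, end the crawler program; if they exist, crawl the source code of the pages containing target information one by one until all pages containing target information have been crawled;

S5. Store the crawled target source code in a specified folder at a specified location, to facilitate later extraction of the target information.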
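The description of a URL under step S2 can be made concrete with Python's standard `urllib.parse` module. The address used here is a hypothetical example, not one taken from the source.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical start page URL.
start_url = "http://www.example.com/news/list.html?page=2"

parts = urlsplit(start_url)
print(parts.scheme)  # access method: http
print(parts.netloc)  # server that holds the file: www.example.com
print(parts.path)    # location of the file on that server: /news/list.html
print(parts.query)   # extra parameters sent with the request: page=2

# Links found in a page's source are often relative; they must be resolved
# against the page's own URL before the crawler can request them.
detail_url = urljoin(start_url, "detail/1001.html")
print(detail_url)    # http://www.example.com/news/detail/1001.html
```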

Specifically, when page source code is crawled, the crawler program is disguised as a computer browser sending web page requests to the website server, which prevents the crawler program from being identified by the website as machine code and denied access.

The explanation is as follows. When a browser is used to view pages on a website, the computer establishes a reliable connection with the website's server, known as a TCP connection. Once the connection is established, the browser on the computer communicates with the website in order to access its pages. So that all browsers can communicate with every website server, a unified communication protocol is specified, known as the HTTP protocol. Under this protocol, to view a page, a web page request is sent to the website server that hosts it; if the server responds to the request, it sends the page's source code to the computer, and this source code, when opened on the computer, is the page we see. When a web page request needs to be sent, the browser sends a message to the website server. This message carries several important parameters that tell the server what type of browser is being used, which page is requested, and what the computer's IP is.

If Python code simply sends a web page request to the website server, the request message uses Python's default request headers. For example, the default User-Agent value that Python uses is Python-urllib/3.4, whereas the User-Agent value for a browser access is: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.3. The two differ markedly, so the website can easily tell whether machine code or a manually operated browser is accessing it. To prevent the automated program from being refused, when the crawler code sends a message to the website server it modifies the message's request headers with reference to the request header parameters that the browser on the local computer uses when sending requests to the website, so that the crawler's request looks more like a browser access. These parameters can be found by opening the browser on the local computer and pressing F12 to open its developer tools, where the parameters of the web page request messages that the browser sends to the website are shown.
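The difference between the default and the modified request headers can be sketched with Python's standard `urllib.request` module. The User-Agent string is the one quoted in the description; the `Accept` and `Accept-Language` values and the function name `build_browser_like_request` are illustrative assumptions, and in practice the values would be copied from the browser's developer tools.

```python
import urllib.request

# User-Agent value quoted in the description for a browser access.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.3"
)


def build_browser_like_request(url):
    """Build a request whose headers resemble those a browser would send."""
    headers = {
        "User-Agent": BROWSER_USER_AGENT,
        # Assumed example values; copy the real ones from the developer tools.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    }
    return urllib.request.Request(url, headers=headers)


# Headers that urllib would send if nothing were changed.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # Python-urllib/<interpreter version>

request = build_browser_like_request("http://www.example.com/")
print(request.get_header("User-agent"))  # the browser-style value above
```

Passing `request` to `urllib.request.urlopen(request, timeout=10)` would send it with the modified headers. Note that urllib normalizes header names, which is why the lookup key is `User-agent`.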

In step S4, a waiting time is also set. While the crawler program is running normally, it monitors whether each web page request sent to the website server receives a response. If a request receives no response, the program waits for a certain time and then sends the request again. If the request is then answered, the program continues with the next step; if it is still not answered, the program waits for a longer time and sends the request again. If five consecutive requests all receive no response, the IP is replaced and requests continue to be sent.
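The waiting and replacement logic described above might look like the following sketch. Several details are assumptions that the source does not specify: the waiting time grows linearly (2 s, 4 s, 6 s, 8 s), the number of proxy replacements is capped at three so that the loop terminates, and a missing response is represented by an `OSError` (the base class of `urllib.error.URLError` and of socket timeouts). The names `fetch_with_retry`, `fetch` and `next_proxy` are hypothetical.

```python
import time


def fetch_with_retry(url, fetch, next_proxy, proxy=None, max_failures=5,
                     base_wait=2.0, max_proxy_swaps=3, sleep=time.sleep):
    """Return the page source, waiting between failures and changing proxy after five."""
    swaps = 0
    while True:
        for attempt in range(max_failures):
            try:
                return fetch(url, proxy)  # a response arrived: continue with the next step
            except OSError:
                # No response: wait, and wait longer after each consecutive failure.
                if attempt < max_failures - 1:
                    sleep(base_wait * (attempt + 1))
        # Five consecutive requests went unanswered: replace the IP and try again.
        if swaps >= max_proxy_swaps:
            raise RuntimeError(f"no response from {url} after {swaps} proxy changes")
        proxy = next_proxy()
        swaps += 1
```

Catching the error instead of letting it propagate is what keeps the program from stopping when the website or network is temporarily unavailable.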

As described above, the browser sends a message when it sends a web page request to the website server, and the message contains several parameters. One of them is the computer's IP address, which tells the website server who sent the request and where the page should be sent. A crawler program crawls a website's pages at a very high frequency, and one of the most common anti-crawler measures is to judge the request frequency: if a large number of requests arrive within a short time, the IP is blocked for a period first, whether or not a person is behind them. This creates a dilemma: crawling too fast gets the IP blocked, and crawling too slowly takes too much time. The solution is to use proxy IPs: when the current IP is blocked, the value of the IP parameter in the request message is changed. This embodiment builds a database that provides stable, usable proxy IPs to the crawler. Its main functions are: obtaining proxy IPs from proxy websites; testing whether each proxy IP is usable, by accessing a stable website such as Baidu or Tencent through it and checking the status code; writing usable IPs to the database; cleaning the database by removing IPs that are no longer usable; and returning a usable proxy.
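A minimal sketch of such a proxy database is given below, using Python's built-in `sqlite3` module. The source does not name a database engine, so SQLite, the table layout, and the names `ProxyPool` and `check_proxy` are all assumptions. The usability check follows the described method of requesting a stable site through the proxy and inspecting the status code; the test URL is an assumed value.

```python
import sqlite3
import urllib.request


def check_proxy(address, test_url="https://www.baidu.com", timeout=5):
    """Request a stable site through the proxy and report whether the status code is 200."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{address}", "https": f"http://{address}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


class ProxyPool:
    """Store usable proxy addresses and hand them out to the crawler."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS proxies (address TEXT PRIMARY KEY)")

    def add_if_usable(self, address, check=check_proxy):
        """Write the address to the database only if the check confirms it works."""
        if not check(address):
            return False
        self.db.execute("INSERT OR IGNORE INTO proxies (address) VALUES (?)", (address,))
        self.db.commit()
        return True

    def clean(self, check=check_proxy):
        """Remove addresses that no longer pass the check."""
        rows = self.db.execute("SELECT address FROM proxies").fetchall()
        for (address,) in rows:
            if not check(address):
                self.db.execute("DELETE FROM proxies WHERE address = ?", (address,))
        self.db.commit()

    def get(self):
        """Return one stored address chosen at random, or None if the pool is empty."""
        row = self.db.execute(
            "SELECT address FROM proxies ORDER BY RANDOM() LIMIT 1").fetchone()
        return row[0] if row else None
```

The `check` argument can be swapped for a stub when the pool is exercised without network access. Harvesting addresses from proxy websites, the first function listed above, is left out of this sketch.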

This embodiment first determines the target website where the target information is located, then determines which pages of that website carry the target information and what distinctive common features those pages share, and finally uses those features to crawl the page source code. In the existing technical solutions, by contrast, the method of obtaining data is to determine keywords first, search for the pages containing the keywords, crawl those pages, and then extract data from them. The clear difference lies in how the pages are determined. The prior art obtains specific pages by keyword search; because the volume of data on the network is now enormous, a keyword search often returns thousands or tens of thousands of pages. The number of pages the crawler program must crawl is therefore very large, and the crawl takes a long time. Moreover, because the information on the network is so varied, the data obtained by keyword search is often not what people need, which adds to the burden. This embodiment, which has already determined the website where the target information is located and crawls that website's pages directly, solves this problem.

In addition, this embodiment disguises the crawler program as a computer browser sending web page requests to the website server, which prevents the crawler program from being identified by the website as machine code and denied access. Furthermore, a waiting time is set so that the code does not raise an error and stop when the website or the network behaves abnormally and fails to respond to the crawler program; the code can therefore run automatically for a long time to crawl page source code. Finally, a proxy IP address database is added so that, when the crawler's IP is blocked and access to the website is denied, the program can switch IP and continue crawling page source code automatically.

The embodiment described above is only a preferred embodiment of the present invention and does not limit its scope of implementation; any change made according to the form and principle of the present invention shall therefore fall within its scope of protection.

Claims

1. A method for automatically crawling website page source code, characterized in that it includes the following steps:

S1. Determine the website containing the target information, analyze the website, and determine the pages on which the target information is located and the distinctive common features of those pages;

S2. Load the start page to be crawled and its URL;

S3. Load the common features of the pages to be crawled;

S4. Crawl the source code of the start page and check whether it contains URLs of pages containing target information; if none exists, end the crawler program; if they exist, crawl the source code of the pages containing target information one by one until all pages containing target information have been crawled.

2. The method for automatically crawling website page source code according to claim 1, characterized in that, when page source code is crawled in step S4, the crawler program is disguised as a computer browser sending web page requests to the website server, which prevents the crawler program from being identified by the website as machine code and denied access.

3. The method for automatically crawling website page source code according to claim 2, characterized in that, when the crawler program sends a web page request to the website server, the request headers of the message are modified with reference to the request header parameters that a browser uses when sending requests to the website, so that the crawler's request looks more like a browser access; the parameters of the web page request message that the browser sends to the website are looked up with the browser's developer tools.

4. The method for automatically crawling website page source code according to claim 2, characterized in that, each time the crawler program, disguised as a computer browser, sends a web page request to the website server, it monitors whether the request receives a response; if the first request receives no response, the program enters a waiting period and sends the request again once the waiting time has elapsed; if five consecutive requests receive no response, the proxy IP is replaced, and requests continue to be sent after the replacement.

5. The method for automatically crawling website page source code according to claim 4, characterized in that the proxy IPs are crawled from proxy IP websites by a crawler, and each crawled proxy IP is written to a database after testing confirms that it is usable; when the crawler program's IP needs to be replaced, a usable IP is fetched from the database and written into the crawler program's IP parameter.