CN109284434A

CN109284434A - Web page content crawling method, system and storage medium based on the R language

Info

Publication number: CN109284434A
Application number: CN201811061186.6A
Authority: CN
Inventors: 张进虎; 麦家健; 林晨曦
Original assignee: Dongguan Shuihuida Data Co Ltd
Current assignee: Dongguan Shuihuida Data Co Ltd
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2019-01-29

Abstract

The invention discloses a web page content crawling method, system and storage medium based on the R language. The method comprises the following steps: building an R language server; and executing the following steps in the R language server: obtaining the URL of the web page at the initial level and starting a browser; grabbing first page information from the initial-level page; judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page, otherwise proceeding directly to the next step; and directly storing or processing the page information obtained. By applying the R language to crawler technology, the invention can simulate browser functions to solve the problem that asynchronously loaded web page content is inconsistent with the page source code, so that the crawled data is highly usable, encoding problems are less likely to occur, and subsequent data processing is faster. The invention can be widely applied in crawler technology.

Description

Web page content crawling method, system and storage medium based on the R language

Technical field

The present invention relates to crawler technology, and in particular to a web page content crawling method, system and storage medium based on the R language.

Background technique

A web crawler is a program that automatically extracts web pages. It downloads web pages from the World Wide Web for search engines and is an important component of a search engine. A web crawler starts from the URLs of one or several initial pages and obtains the URLs found on those pages; while grabbing web pages, it keeps extracting new URLs from the current page and putting them into a queue until a stop condition of the system is met.

However, with the development of web encryption technology, pages whose asynchronously loaded content is inconsistent with their source code are encountered more and more often during crawling. This increases the difficulty of grabbing information and reduces the usability of web crawlers, so web crawlers need to be improved.

Summary of the invention

In order to solve the above technical problem, the object of the present invention is to provide a web page content crawling method, system and storage medium based on the R language.

The first technical solution adopted by the present invention is:

A web page content crawling method based on the R language, comprising the following steps:

Building an R language server;

Executing a data grabbing step in the R language server;

The data grabbing step includes:

Obtaining the URL of the web page at the initial level and starting a browser;

Grabbing first page information from the initial-level page;

Judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;

Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.

Further, building the R language server specifically includes:

Loading the base packages, database access packages and web scraping packages of the R language;

Configuring the browser driver, the system environment variables and the Selenium service.

Further, grabbing the first page information from the initial-level page specifically includes:

Grabbing set web page elements of the initial-level page as the first page information;

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.

Further, the following step is also executed in the R language server:

Restarting the browser when the number of pages opened by the browser reaches a set threshold.

The second technical solution adopted by the present invention is:

A web page content crawling system based on the R language, comprising:

A building module, for building an R language server;

An R language server, for executing data grabbing;

The R language server includes:

An obtaining module, for obtaining the URL of the web page at the initial level and starting a browser;

A grabbing module, for grabbing first page information from the initial-level page;

A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;

An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.

Further, the building module includes:

A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;

A configuration unit, for configuring the browser driver, the system environment variables and the Selenium service.

Further, the grabbing module is specifically used for:

Grabbing set web page elements of the initial-level page as the first page information;

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.

Further, the R language server further includes a restart module, and the restart module is used for:

Restarting the browser when the number of pages opened by the browser reaches a set threshold.

The third technical solution adopted by the present invention is:

A web page content crawling system based on the R language, comprising:

A memory, for storing a program;

A processor, for loading the program to execute the web page content crawling method based on the R language.

The fourth technical solution adopted by the present invention is:

A storage medium storing a computer program which, when executed by a processor, implements the web page content crawling method based on the R language.

The beneficial effects of the present invention are as follows: by applying the R language to crawler technology, the invention can simulate browser functions to solve the problem that asynchronously loaded web page content is inconsistent with the source code, so that the crawled data is highly usable, encoding problems are less likely to occur, subsequent processing of the data is easier, and the data is processed faster.

Brief description of the drawings

Fig. 1 is a flow chart of the web page content crawling method based on the R language according to the present invention;

Fig. 2 is a flow chart of the data grabbing step according to the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

Referring to Fig. 1, a web page content crawling method based on the R language comprises the following steps:

S100: Build an R language server, which includes loading resource packages and configuring the environment.

S200: Execute a data grabbing step in the R language server.

Referring to Fig. 2, the data grabbing step includes:

S201: Obtain the URL of the web page at the initial level and start a browser. The URL of the initial-level web page may be preconfigured by the user, in which case it is read from a configuration file when this step is executed.

S202: Grab first page information from the initial-level page. This step can grab set web page elements in the page, such as URLs, pictures or text. The user can configure the content to be grabbed according to actual needs.

S203: According to the first page information and/or a set condition, judge whether the content of the next-level page needs to be grabbed; if so, execute step S2031; otherwise, execute step S204.

S2031: Grab second page information from the next-level page.

The user can set the condition for judging whether to grab the content of the next-level page. For example, the next-level page continues to be grabbed when specific information is grabbed from the page at the current level, and is not grabbed otherwise. Alternatively, the user can set the condition as a number of levels to grab; once the number of levels grabbed exceeds the set number, grabbing does not continue downward.

S204: Store the obtained first page information and/or second page information in a database, or perform data processing on the obtained first page information and/or second page information. In this step, the data can be stored first without processing, or stored after preprocessing; the preprocessing includes but is not limited to format conversion, type conversion or data filtering.
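The preprocess-then-store option of step S204 can be sketched as follows. This is only an illustration written in Python rather than R, using an in-memory SQLite database; the table name `pages`, the record fields (`url`, `title`, `views`) and the helper names are assumptions made for the example and do not come from the patent.

```python
import sqlite3

def preprocess(records):
    """Type conversion and data filtering on raw grabbed records (hypothetical fields)."""
    cleaned = []
    for rec in records:
        title = rec.get("title", "").strip()
        if not title:
            continue  # data filtering: drop records that have no title
        cleaned.append({
            "url": rec["url"],
            "title": title,
            "views": int(rec.get("views", "0")),  # type conversion: text -> integer
        })
    return cleaned

def store(conn, records):
    """Write the records into a table, replacing rows that share the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, views) VALUES (:url, :title, :views)",
        records,
    )
    conn.commit()

# Raw page information as it might come out of the grabbing steps (made-up data).
raw = [
    {"url": "https://example.com/a", "title": "  First page ", "views": "12"},
    {"url": "https://example.com/b", "title": "", "views": "3"},
]

conn = sqlite3.connect(":memory:")
store(conn, preprocess(raw))
rows = conn.execute("SELECT url, title, views FROM pages").fetchall()
```

Storing the raw records without calling `preprocess` corresponds to the other option described in S204, where the data is saved first and processed later.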

In this embodiment, the judgment of step S203 can continue to be executed on the next-level page, so that the content of pages at ever deeper levels keeps being crawled.
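The control flow of steps S201 to S2031, with the judgment repeated at each level, might look like the sketch below. It is written in Python rather than R for illustration only. The `fetch` callable stands in for the Selenium-driven browser, and `SITE`, `should_descend` and `max_levels` are hypothetical names; the two conditions shown (specific information found on the current page, and a limit on the number of levels) are the two examples given in the text.

```python
from collections import deque

def crawl(start_url, fetch, should_descend, max_levels):
    """Grab pages level by level, descending only while the set conditions hold."""
    results = {}
    frontier = deque([(start_url, 1)])  # (URL, level); the initial level is 1
    while frontier:
        url, level = frontier.popleft()
        if url in results:
            continue  # this page has already been grabbed
        page = fetch(url)
        results[url] = page["info"]
        # Judgment of S203: descend only if the level limit is not exceeded
        # and the grabbed information satisfies the user-set condition.
        if level < max_levels and should_descend(page["info"]):
            for link in page["links"]:
                frontier.append((link, level + 1))
    return results

# A tiny in-memory "site" used instead of a real browser (made-up data).
SITE = {
    "https://example.com/": {
        "info": "index: catalogue",
        "links": ["https://example.com/item1", "https://example.com/about"],
    },
    "https://example.com/item1": {
        "info": "catalogue item 1",
        "links": ["https://example.com/item1/detail"],
    },
    "https://example.com/about": {"info": "about us", "links": []},
    "https://example.com/item1/detail": {"info": "catalogue detail", "links": []},
}

def fetch(url):
    return SITE[url]

def should_descend(info):
    return "catalogue" in info  # example of "specific information was grabbed"

two_levels = crawl("https://example.com/", fetch, should_descend, max_levels=2)
three_levels = crawl("https://example.com/", fetch, should_descend, max_levels=3)
```

With `max_levels=2` the detail page is never requested; raising the limit to 3 lets the same judgment run again on the second-level page and reach it.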

The R language is a language and operating environment for statistical analysis and plotting. R is free, open-source software belonging to the GNU system, and is an excellent tool for statistical computing and statistical graphics. It provides computer users with a large number of commonly used statistical functions and supports users in writing their own programs to extend its functionality. R itself provides a set of function libraries, which constitute the base R platform. Users can write their own programs to extend R; such programs are generally added to the R platform in the form of R packages. R takes functional programming as its main programming paradigm while also supporting modern program design methods such as object-oriented programming. This embodiment therefore uses the characteristics of the R language to avoid the problem that asynchronously loaded web page content is inconsistent with the source code by simulating browser operations, which greatly reduces the difficulty of grabbing information.

In this embodiment, the R language server is an x86 server that executes R code to implement the web page content crawling function. Because R is open source, the R language server can run on several operating systems, including Windows and Linux.

As a preferred embodiment, step S100 includes:

S101: Load the base packages, database access packages and web scraping packages of the R language.

S102: Configure the browser driver, the system environment variables and the Selenium service. Selenium is a tool for testing web applications that can simulate a user's browser operations.
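The kind of checks involved in S101 and S102 can be sketched as below, in Python rather than R and for illustration only. The package name `hypothetical_scraper_pkg`, the directory `/opt/webdriver` and both helper functions are assumptions for the example; the patent does not name the specific packages or paths, and starting the Selenium service itself is not shown.

```python
import importlib.util
import os

def missing_packages(names):
    """Return the names of required top-level packages that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def prepend_to_path(directory, env):
    """Put the browser-driver directory at the front of PATH in the given environment mapping."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory in parts:
        parts.remove(directory)  # avoid duplicate entries when called twice
    parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

# A plain dict is used instead of os.environ so the sketch has no side effects.
env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
prepend_to_path("/opt/webdriver", env)  # hypothetical driver location

# "json" and "sqlite3" ship with Python; the last name is deliberately made up.
missing = missing_packages(["json", "sqlite3", "hypothetical_scraper_pkg"])
```

Checking for missing packages before starting the crawl lets the server fail early with a clear message instead of partway through a grabbing run.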

As a preferred embodiment, step S202 specifically includes:

Grabbing set web page elements of the initial-level page as the first page information. The set elements can be text, URLs or pictures.

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information. This embodiment uses the characteristics of the R language to simulate interactive browser operations, so that content found through the search box can be grabbed. This embodiment can also grab pages at the same level in a loop.
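Grabbing set elements (text, URLs and pictures) from a page can be sketched as follows, in Python rather than R and for illustration only. In the embodiment the HTML would be the rendered page or search result returned by the simulated browser; here a fixed string stands in for it, and the class name `ElementGrabber` is hypothetical.

```python
from html.parser import HTMLParser

class ElementGrabber(HTMLParser):
    """Collect link URLs, image URLs and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

# Stand-in for the rendered search-result page (made-up markup).
html = '<div><h1>Results</h1><a href="/item/1">Item 1</a><img src="/img/1.png"></div>'

grabber = ElementGrabber()
grabber.feed(html)
grabber.close()
```

Which of the three lists is kept as the first page information would depend on what the user has configured as the set elements.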

Referring to Fig. 2, as a preferred embodiment, in order to avoid browser crashes, the data grabbing step further includes:

S205: Restart the browser when the number of pages opened by the browser reaches a set threshold. By setting a condition for restarting the browser, this embodiment allows the browser to be restarted regularly, which reduces browser crashes.
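One way to realize the restart condition of S205 is to count opened pages in a thin wrapper around the browser, as in the sketch below. It is written in Python rather than R for illustration only; `RestartingBrowser`, `FakeBrowser` and the `get`/`quit` method names are assumptions for the example, with `FakeBrowser` standing in for a real Selenium-driven browser session.

```python
class RestartingBrowser:
    """Wrap a browser and restart it once a set number of pages has been opened."""

    def __init__(self, start_browser, max_pages):
        self._start_browser = start_browser  # callable that launches a new browser
        self._max_pages = max_pages          # the set threshold
        self._browser = start_browser()
        self._pages_opened = 0
        self.restarts = 0

    def open(self, url):
        if self._pages_opened >= self._max_pages:
            # Threshold reached: close the old browser and start a fresh one.
            self._browser.quit()
            self._browser = self._start_browser()
            self._pages_opened = 0
            self.restarts += 1
        self._pages_opened += 1
        return self._browser.get(url)

class FakeBrowser:
    """Stand-in for a real browser session; returns a dummy page for any URL."""

    def get(self, url):
        return "<html>" + url + "</html>"

    def quit(self):
        pass

browser = RestartingBrowser(FakeBrowser, max_pages=3)
pages = [browser.open("https://example.com/page/" + str(i)) for i in range(7)]
```

Here the restart happens just before the next page is opened after the threshold has been reached, so seven page loads with a threshold of three trigger two restarts.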

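A web page content crawling system based on the R language, corresponding to the method of Fig. 1, comprises:

A building module, for building an R language server;

An R language server, for executing data grabbing;

The R language server includes:

An obtaining module, for obtaining the URL of the web page at the initial level and starting a browser;

A grabbing module, for grabbing first page information from the initial-level page;

A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;

An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.

As a preferred embodiment, the building module includes:

A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;

A configuration unit, for configuring the browser driver, the system environment variables and the Selenium service.

As a preferred embodiment, the grabbing module is specifically used for:

Grabbing set web page elements of the initial-level page as the first page information;

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.

As a preferred embodiment, the R language server further includes a restart module, and the restart module is used for:

Restarting the browser when the number of pages opened by the browser reaches a set threshold.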

A web page content crawling system based on the R language, comprising:

A memory, for storing a program;

A processor, for loading the program to execute the web page content crawling method based on the R language corresponding to Fig. 1.

A storage medium storing a computer program which, when executed by a processor, implements the web page content crawling method based on the R language corresponding to Fig. 1.

The step numbers in the above method embodiment are provided only for ease of explanation and do not limit the order of the steps in any way; the execution order of the steps in the embodiment can be adapted according to the understanding of those skilled in the art.

The above is a description of preferred implementations of the present invention, but the invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the invention, and such equivalent modifications or substitutions all fall within the scope defined by the claims of this application.

Claims

1. A web page content crawling method based on the R language, characterized by comprising the following steps:

Building an R language server;

Executing a data grabbing step in the R language server;

The data grabbing step includes:

Obtaining the URL of the web page at the initial level and starting a browser;

Grabbing first page information from the initial-level page;

Judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;

Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.

2. The web page content crawling method based on the R language according to claim 1, characterized in that building the R language server specifically includes:

Loading the base packages, database access packages and web scraping packages of the R language;

Configuring the browser driver, the system environment variables and the Selenium service.

3. The web page content crawling method based on the R language according to claim 1, characterized in that grabbing the first page information from the initial-level page specifically includes:

Grabbing set web page elements of the initial-level page as the first page information;

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.

4. The web page content crawling method based on the R language according to claim 1, characterized in that the data grabbing step further includes:

Restarting the browser when the number of pages opened by the browser reaches a set threshold.

5. A web page content crawling system based on the R language, characterized by comprising:

A building module, for building an R language server;

An R language server, for executing data grabbing;

The R language server includes:

An obtaining module, for obtaining the URL of the web page at the initial level and starting a browser;

A grabbing module, for grabbing first page information from the initial-level page;

A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of the next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;

An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.

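6. The web page content crawling system based on the R language according to claim 5, characterized in that the building module includes:

A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;

A configuration unit, for configuring the browser driver, the system environment variables and the Selenium service.

7. The web page content crawling system based on the R language according to claim 5, characterized in that the grabbing module is specifically used for:

Grabbing set web page elements of the initial-level page as the first page information;

Or

Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.

8. The web page content crawling system based on the R language according to claim 5, characterized in that the R language server further includes a restart module, and the restart module is used for:

Restarting the browser when the number of pages opened by the browser reaches a set threshold.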

9. A web page content crawling system based on the R language, characterized by comprising:

A memory, for storing a program;

A processor, for loading the program to execute the web page content crawling method based on the R language according to any one of claims 1-4.

10. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the web page content crawling method based on the R language according to any one of claims 1-4.