CN109284434A - Web page contents crawling method, system and storage medium based on R language - Google Patents

Web page contents crawling method, system and storage medium based on R language

Info

Publication number
CN109284434A
Authority
CN
China
Prior art keywords
page
language
page information
browser
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811061186.6A
Other languages
Chinese (zh)
Inventor
张进虎
麦家健
林晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Shuihuida Data Co Ltd
Original Assignee
Dongguan Shuihuida Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Shuihuida Data Co Ltd
Priority to CN201811061186.6A
Publication of CN109284434A

Abstract

The invention discloses a web page content crawling method, system and storage medium based on the R language. The method comprises the following steps: building an R language server; and executing the following steps in the R language server: obtaining the URL of the initial-level web page and starting a browser; grabbing first page information from the initial-level page; judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page, otherwise proceeding directly to the next step; and storing or processing the obtained page information. By applying the R language to crawler technology, the invention can simulate browser functions and thereby solve the problem that asynchronously loaded web page content is inconsistent with the page source code. As a result, the grabbed data is highly usable, encoding problems are less likely to occur, and subsequent data processing is faster. The invention can be widely applied to crawler technology.

Description

Web page contents crawling method, system and storage medium based on R language
Technical field
The present invention relates to crawler technology, and in particular to a web page content crawling method, system and storage medium based on the R language.
Background art
A web crawler is a program that automatically retrieves web pages. It downloads pages from the World Wide Web for a search engine and is an important component of search engines. A web crawler starts from the URLs of one or more initial pages and obtains the URLs found on those pages; while grabbing pages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition set by the system is met.
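The queue-driven behaviour described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it is written in Python with the standard library only, the `fetch` callable and the `PAGES` dictionary are hypothetical stand-ins for real HTTP requests, and the stop condition is assumed to be a simple page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text (hypothetical helper)."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:  # assumed stop condition
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:           # enqueue each new URL once
                seen.add(absolute)
                queue.append(absolute)
    return visited


# In-memory stand-in for the network, so the sketch runs without HTTP access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
order = crawl(["http://example.test/"], lambda u: PAGES.get(u, ""))
print(order)
```

The `seen` set keeps a URL from being queued twice, which is what lets the loop terminate on sites whose pages link back to each other.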
However, with the development of web encryption technology, the page crawling process increasingly runs into the problem that asynchronously loaded page content is inconsistent with the page source code. This makes information harder to grab and reduces the usability of web crawlers, so web crawlers need to be improved.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a web page content crawling method, system and storage medium based on the R language.
The first technical solution adopted by the present invention is:
A web page content crawling method based on the R language, comprising the following steps:
Building an R language server;
Executing a data grabbing step in the R language server;
The data grabbing step comprises:
Grabbing first page information from the initial-level page;
Judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
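As a rough illustration of the final step, the sketch below processes grabbed records and then writes them to a database. It makes several assumptions that the patent does not state: SQLite stands in for the unspecified database, the record fields (`url`, `title`, `price`) and the table name `page_info` are hypothetical, and the processing shown is a simple type conversion plus a filter. It is written in Python rather than R so that it runs with the standard library alone.

```python
import sqlite3


def preprocess(record):
    """Type conversion and filtering for one grabbed record (a dict of strings)."""
    title = record.get("title", "").strip()
    if not title:                      # data filtering: drop records without a title
        return None
    price_text = record.get("price", "").replace(",", "")
    try:
        price = float(price_text)      # type conversion: text to number
    except ValueError:
        price = None                   # keep the record, mark the price as unknown
    return (record["url"], title, price)


def store(conn, records):
    """Preprocess the records and insert the surviving rows; returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_info (url TEXT, title TEXT, price REAL)"
    )
    rows = [row for row in map(preprocess, records) if row is not None]
    conn.executemany("INSERT INTO page_info VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")     # in-memory database for the sketch
grabbed = [
    {"url": "http://example.test/1", "title": " Item one ", "price": "1,200.50"},
    {"url": "http://example.test/2", "title": "", "price": "3"},
    {"url": "http://example.test/3", "title": "Item three", "price": "n/a"},
]
stored = store(conn, grabbed)
saved = conn.execute(
    "SELECT url, title, price FROM page_info ORDER BY url"
).fetchall()
print(stored, saved)
```

Storing the raw records instead would simply mean skipping `preprocess` and inserting the grabbed strings unchanged.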
Further, building the R language server specifically comprises:
Loading the base packages, database access packages and web scraping packages of the R language;
Configuring the browser driver, system environment variables and the Selenium service.
Further, grabbing the first page information from the initial-level page specifically comprises:
Grabbing set web page elements of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.
Further, the following step is also executed in the R language server:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
The second technical solution adopted by the present invention is:
A web page content crawling system based on the R language, comprising:
A building module, for building an R language server;
An R language server, for executing data grabbing;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
Further, the building module comprises:
A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, system environment variables and the Selenium service.
Further, the grabbing module is specifically configured to:
Grab set web page elements of the initial-level page as the first page information;
Or
Search in the search box of the initial-level page according to set content, and grab set elements in the search results as the first page information.
Further, the R language server also comprises a restart module, the restart module being configured to:
Restart the browser when the number of pages opened by the browser reaches a set threshold.
The third technical solution adopted by the present invention is:
A web page content crawling system based on the R language, comprising:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language.
The fourth technical solution adopted by the present invention is:
A storage medium storing a computer program, wherein the program, when executed by a processor, implements the web page content crawling method based on the R language.
The beneficial effects of the present invention are as follows: by applying the R language to crawler technology, the invention can simulate browser functions and thereby solve the problem that asynchronously loaded web page content is inconsistent with the page source code. As a result, the grabbed data is highly usable, encoding problems are less likely to occur, subsequent processing of the data is easier, and the data is processed faster.
Brief description of the drawings
Fig. 1 is a flowchart of the web page content crawling method based on the R language according to the present invention;
Fig. 2 is a flowchart of the data grabbing step according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a web page content crawling method based on the R language comprises the following steps:
S100, building an R language server, including loading resource packages and configuring the environment.
S200, executing a data grabbing step in the R language server.
Referring to Fig. 2, the data grabbing step comprises:
S201, obtaining the URL of the initial-level web page and starting a browser. The URL of the initial-level web page can be configured by the user in advance; when this step is executed, it is read from a configuration file.
S202, grabbing first page information from the initial-level page. This step can grab set web page elements in the page, such as URLs, pictures or text. The user can configure the content to be grabbed according to actual needs.
S203, judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, executing step S2031; otherwise, executing step S204.
S2031, grabbing second page information from the next-level page.
The user can set the condition for judging whether to grab the content of the next-level page. For example, if specific information is grabbed from a page at the current level, the crawler continues to grab the pages of the next level; otherwise it does not. Alternatively, the user can set the condition to a number of levels to grab; once that number is exceeded, the crawler stops descending.
S204, storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information. This step can either store the data unprocessed, or preprocess the data first and then store it; the preprocessing includes but is not limited to format conversion, type conversion or data filtering.
In this embodiment, the judgment of step S203 can be executed again on the next-level page, so that the content of ever deeper pages is crawled.
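The level-by-level flow of steps S202 to S204 can be sketched as below. This is an illustrative outline under stated assumptions, not the patent's R code: it is written in Python, and `grab`, `next_urls`, `should_descend` and the `SITE` dictionary are hypothetical stand-ins for the browser-driven grabbing and the user-set condition. Both example conditions from the text are modelled, namely a check on the grabbed information and a cap on the number of levels.

```python
def crawl_levels(start_url, grab, next_urls, should_descend, max_level=3):
    """Grab a page, then descend level by level while the condition holds.

    grab(url)                    -> page information for one page (hypothetical)
    next_urls(info)              -> URLs of next-level pages found in that information
    should_descend(info, level)  -> the user-set condition checked in step S203
    """
    collected = []
    pending = [(start_url, 1)]
    while pending:
        url, level = pending.pop(0)
        info = grab(url)                          # S202 on level 1, S2031 below it
        collected.append((level, url, info))
        # S203: descend only if the level cap is not reached and the condition holds
        if level < max_level and should_descend(info, level):
            pending.extend((u, level + 1) for u in next_urls(info))
    return collected                              # handed on to S204


# Tiny in-memory site so the sketch runs without a browser.
SITE = {
    "list": {"text": "catalogue", "links": ["item-1", "item-2"]},
    "item-1": {"text": "detail 1", "links": ["review-1"]},
    "item-2": {"text": "detail 2", "links": []},
    "review-1": {"text": "review", "links": []},
}
result = crawl_levels(
    "list",
    grab=lambda url: SITE[url],
    next_urls=lambda info: info["links"],
    should_descend=lambda info, level: bool(info["links"]),
    max_level=2,
)
levels = [(level, url) for level, url, _ in result]
print(levels)
```

With `max_level=2` the page `review-1` is never grabbed, which corresponds to the case where the user limits the number of levels; raising the cap lets the same S203 check run again on each deeper page.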
R is a language and environment for statistical analysis and graphics. It is free, open-source software belonging to the GNU system, and an excellent tool for statistical computing and statistical graphics. R provides computer users with a large number of commonly used statistical computing functions, and lets users write their own programs to extend its functionality. R itself ships with a set of function libraries, which make up the base R platform. Users can write their own programs to extend R, and such programs are generally added to the R platform in the form of R packages. R uses functional programming as its main paradigm, while also supporting modern programming approaches such as object-oriented programming. This embodiment therefore makes use of these characteristics of R and simulates browser operations to avoid the problem that asynchronously loaded web page content is inconsistent with the page source code, which greatly reduces the difficulty of grabbing information.
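The mismatch between page source and asynchronously loaded content can be made concrete with a small example. The sketch below is an assumption-laden illustration in Python, not part of the patent: the two HTML strings, the element id `prices` and the `loadPrices` script call are invented, and the "rendered" string merely imitates what a browser's DOM might contain after its scripts have run.

```python
from html.parser import HTMLParser


class TextOfId(HTMLParser):
    """Collects the text inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0           # greater than 0 while inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def text_of(html, target_id):
    parser = TextOfId(target_id)
    parser.feed(html)
    return "".join(parser.text).strip()


# What a plain HTTP request returns: the container is empty, a script fills it later.
raw_source = '<div id="prices"></div><script>loadPrices("/api/prices")</script>'
# What the DOM might look like after a browser has executed the script (illustrative).
rendered_dom = '<div id="prices"><span>42.00</span></div>'

print(repr(text_of(raw_source, "prices")))
print(repr(text_of(rendered_dom, "prices")))
```

A crawler that only parses the raw source finds nothing in the container, whereas one that drives a real browser reads the page after the script has filled it in.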
In this embodiment, the R language server is an x86 server that executes R code to implement the web page content crawling function. Because R is open source, the R language server can run on several operating systems, including Windows and Linux.
As a preferred embodiment, step S100 comprises:
S101, loading the base packages, database access packages and web scraping packages of the R language;
S102, configuring the browser driver, system environment variables and the Selenium service. Selenium is a tool for testing web applications that can simulate a user's operations in a browser.
As a preferred embodiment, step S202 specifically comprises:
Grabbing set web page elements of the initial-level page as the first page information; the set elements can be text, URLs or pictures.
Or
Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information. This embodiment makes use of the characteristics of R to simulate interactive browser operations, and can therefore grab content found through the search box. This embodiment can also grab pages of the same level in a loop.
Referring to Fig. 2, as a preferred embodiment, in order to avoid browser crashes, the data grabbing step further comprises:
S205, restarting the browser when the number of pages opened by the browser reaches a set threshold. By setting a condition for restarting the browser, this embodiment restarts the browser at regular intervals, which reduces browser crashes.
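The restart rule of step S205 amounts to a page counter wrapped around the browser handle. The sketch below shows one way this could look, assuming a hypothetical `start_browser` factory whose result offers `get(url)` and `quit()`. `FakeBrowser` is a stand-in so the example runs without Selenium, and Python is used here instead of R only for that reason.

```python
class RestartingBrowser:
    """Wraps a browser factory and restarts the browser every `threshold` pages."""

    def __init__(self, start_browser, threshold):
        self._start_browser = start_browser    # hypothetical: returns a browser handle
        self._threshold = threshold
        self._browser = start_browser()
        self._pages_opened = 0
        self.restarts = 0

    def open(self, url):
        if self._pages_opened >= self._threshold:   # S205 condition reached
            self._browser.quit()
            self._browser = self._start_browser()
            self._pages_opened = 0
            self.restarts += 1
        self._pages_opened += 1
        return self._browser.get(url)


class FakeBrowser:
    """Stand-in for a real driver; returns a dummy page for any URL."""

    def get(self, url):
        return f"<html>{url}</html>"

    def quit(self):
        pass


browser = RestartingBrowser(FakeBrowser, threshold=3)
pages = [browser.open(f"http://example.test/{i}") for i in range(7)]
print(len(pages), browser.restarts)
```

Checking the counter before each page load, rather than after, means the browser is only restarted when another page is actually about to be opened.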
A web page content crawling system based on the R language, corresponding to the method of Fig. 1, comprises:
A building module, for building an R language server;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
As a preferred embodiment, the building module comprises:
A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, system environment variables and the Selenium service.
As a preferred embodiment, the grabbing module is specifically configured to:
Grab set web page elements of the initial-level page as the first page information;
Or
Search in the search box of the initial-level page according to set content, and grab set elements in the search results as the first page information.
As a preferred embodiment, the system further comprises a restart module, the restart module being configured to:
Restart the browser when the number of pages opened by the browser reaches a set threshold.
A web page content crawling system based on the R language, comprising:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language corresponding to Fig. 1.
A storage medium storing a computer program, wherein the program, when executed by a processor, implements the web page content crawling method based on the R language corresponding to Fig. 1.
The step numbers in the above method embodiment are provided only for ease of explanation and do not limit the order of the steps; the execution order of the steps in the embodiment can be adapted according to the understanding of those skilled in the art.
The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of this application.

Claims (10)

1. A web page content crawling method based on the R language, characterized by comprising the following steps:
Building an R language server;
Executing a data grabbing step in the R language server;
The data grabbing step comprises:
Obtaining the URL of the initial-level web page and starting a browser;
Grabbing first page information from the initial-level page;
Judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
2. The web page content crawling method based on the R language according to claim 1, characterized in that building the R language server specifically comprises:
Loading the base packages, database access packages and web scraping packages of the R language;
Configuring the browser driver, system environment variables and the Selenium service.
3. The web page content crawling method based on the R language according to claim 1, characterized in that grabbing the first page information from the initial-level page specifically comprises:
Grabbing set web page elements of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing set elements in the search results as the first page information.
4. The web page content crawling method based on the R language according to claim 1, characterized in that the data grabbing step further comprises:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
5. A web page content crawling system based on the R language, characterized by comprising:
A building module, for building an R language server;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging and grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
6. The web page content crawling system based on the R language according to claim 5, characterized in that the building module comprises:
A loading unit, for loading the base packages, database access packages and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, system environment variables and the Selenium service.
7. The web page content crawling system based on the R language according to claim 5, characterized in that the grabbing module is specifically configured to:
Grab set web page elements of the initial-level page as the first page information;
Or
Search in the search box of the initial-level page according to set content, and grab set elements in the search results as the first page information.
8. The web page content crawling system based on the R language according to claim 5, characterized in that the R language server further comprises a restart module, the restart module being configured to:
Restart the browser when the number of pages opened by the browser reaches a set threshold.
9. A web page content crawling system based on the R language, characterized by comprising:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language according to any one of claims 1-4.
10. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the web page content crawling method based on the R language according to any one of claims 1-4.
CN201811061186.6A 2018-09-12 2018-09-12 Web page contents crawling method, system and storage medium based on R language Pending CN109284434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811061186.6A CN109284434A (en) 2018-09-12 2018-09-12 Web page contents crawling method, system and storage medium based on R language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811061186.6A CN109284434A (en) 2018-09-12 2018-09-12 Web page contents crawling method, system and storage medium based on R language

Publications (1)

Publication Number Publication Date
CN109284434A true CN109284434A (en) 2019-01-29

Family

ID=65181279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811061186.6A Pending CN109284434A (en) 2018-09-12 2018-09-12 Web page contents crawling method, system and storage medium based on R language

Country Status (1)

Country Link
CN (1) CN109284434A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702743A (en) * 2009-11-04 2010-05-05 中兴通讯股份有限公司 Self-adaption adjustment method of mobile terminal browser and device thereof
CN101902438A (en) * 2009-05-25 2010-12-01 北京启明星辰信息技术股份有限公司 Method and device for automatically identifying web crawlers
US20140236960A1 (en) * 2013-02-19 2014-08-21 Futurewei Technologies, Inc. System and Method for Database Searching
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EASYCHARTS: "《搜狐博客》", 21 November 2017 *
吴小坤: "《数据新闻制作简明教程》", 31 May 2018 *
樊芳芳的博客: "《CSDN博客》", 29 April 2018 *

Similar Documents

Publication Publication Date Title
CN108153594B (en) Resource fragment sorting method of artificial intelligence cloud platform and electronic equipment
US9292359B2 (en) System and method for memory management
JP2021531487A (en) Training of conversation agents using natural language
CN112988185A (en) Cloud application updating method, device and system, electronic equipment and storage medium
CN109740765A (en) A kind of machine learning system building method based on Amazon server
CN110069264A (en) Construct method, apparatus, computer equipment and the storage medium of resource packet
CN104468419A (en) Method and system for recovering configuration of interchanger and interchanger
CN110362341A (en) Business management method, device, equipment and storage medium based on micro services framework
CN112417336B (en) Page display method and device, electronic equipment and storage medium
CN103699653A (en) Method and device for clustering data
CN108897569A (en) The method for cleaning and computer readable storage medium of iOS engineering discarded record
CN109284434A (en) Web page contents crawling method, system and storage medium based on R language
CN116401052A (en) Data processing method, model processing method, electronic equipment and medium
CN107679168B (en) Target website content acquisition method based on java platform
CN110442353A (en) A kind of method, apparatus and electronic equipment of installation kit management
CN103942051A (en) Application deployment method and device based on PAAS platform
CN110597738B (en) Memory release method, terminal and computer readable storage medium
CN106021501A (en) Data storing method and device
CN111045787A (en) Rapid continuous experiment method and system
CN112347394A (en) Method and device for acquiring webpage information, computer equipment and storage medium
CN112540897B (en) Database monitoring method, device, server and medium
CN113535594B (en) Method, device, equipment and storage medium for generating service scene test case
US20230412731A1 (en) Automated interactive voice response interaction using voice prompt classification machine learning frameworks
CN112596855B (en) Container creation method and device
CN109284097A (en) Realize method, equipment, system and the storage medium of complex data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190129