CN109284434A - Web page contents crawling method, system and storage medium based on R language - Google Patents
- Publication number
- CN109284434A (application CN201811061186.6A)
- Authority
- CN
- China
- Prior art keywords
- page
- language
- page information
- browser
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a web page content crawling method, system, and storage medium based on the R language. The method comprises the following steps: building an R language server; and executing the following steps on the R language server: obtaining the URL of the initial-level web page and starting a browser; grabbing first page information from the initial-level page; judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed, and if so, grabbing second page information from the next-level page, otherwise proceeding directly to the next step; and storing or processing the obtained page information. By applying the R language to crawler technology and simulating the functions of a browser, the invention addresses the problem that asynchronously loaded web page content is inconsistent with the source code, so that the crawled data is highly usable, encoding problems are less likely to occur, and subsequent data processing is faster. The invention can be widely applied in crawler technology.
Description
Technical field
The present invention relates to crawler technology, and in particular to a web page content crawling method, system, and storage medium based on the R language.
Background
A web crawler is a program that automatically extracts web pages. It downloads pages from the World Wide Web on behalf of a search engine and is an important component of a search engine. A web crawler starts from the URLs of one or several initial pages and obtains the URLs found on those pages; while grabbing pages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
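The queue-based process described above can be sketched as follows. This is a minimal illustration in Python (the patent itself targets R), and the in-memory `FAKE_WEB` link table, the `crawl` function, and the `max_pages` stop condition are assumptions made for the example rather than anything specified in the source.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}

def crawl(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first until the queue is empty or max_pages is reached."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):  # new URLs extracted from the current page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://example.com/"], lambda u: FAKE_WEB.get(u, []))
```

The `seen` set keeps a page that is linked from several places from being queued twice, which is what lets the loop terminate even when pages link back to each other.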
However, with the development of web page encryption technology, the problem that asynchronously loaded page content is inconsistent with the source code arises more and more often during page crawling. This increases the difficulty of grabbing information and reduces the usability of web crawlers, so web crawlers need to be improved.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a web page content crawling method, system, and storage medium based on the R language.
The first technical solution adopted by the present invention is:
A web page content crawling method based on the R language, comprising the following steps:
Building an R language server;
Executing a data grabbing step on the R language server;
The data grabbing step comprises:
Grabbing first page information from the initial-level page;
Judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
Further, building the R language server specifically comprises:
Loading the base packages, database access packages, and web scraping packages of the R language;
Configuring the browser driver, the system environment variables, and the Selenium service.
Further, grabbing the first page information from the initial-level page specifically comprises:
Grabbing a set web page element of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information.
Further, the method also comprises the following step executed on the R language server:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
The second technical solution adopted by the present invention is:
A web page content crawling system based on the R language, comprising:
A building module, for building an R language server;
An R language server, for executing data grabbing;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging-and-grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
Further, the building module comprises:
A loading unit, for loading the base packages, database access packages, and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, the system environment variables, and the Selenium service.
Further, the grabbing module is specifically used for:
Grabbing a set web page element of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information.
Further, the R language server also comprises a restart module, which is used for:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
The third technical solution adopted by the present invention is:
A web page content crawling system based on the R language, comprising:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language.
The fourth technical solution adopted by the present invention is:
A storage medium storing a computer program which, when executed by a processor, implements the web page content crawling method based on the R language.
The beneficial effects of the present invention are as follows: by applying the R language to crawler technology and simulating the functions of a browser, the invention addresses the problem that asynchronously loaded web page content is inconsistent with the source code, so that the crawled data is highly usable, encoding problems are less likely to occur, subsequent processing of the data is easier, and the data processing speed is improved.
Brief description of the drawings
Fig. 1 is a flowchart of the web page content crawling method based on the R language according to the present invention;
Fig. 2 is a flowchart of the data grabbing step according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a web page content crawling method based on the R language comprises the following steps:
S100. Building an R language server, which includes loading resource packages and configuring the environment.
S200. Executing a data grabbing step on the R language server.
Referring to Fig. 2, the data grabbing step comprises:
S201. Obtaining the URL of the initial-level web page and starting a browser. The URL of the initial-level web page may be preconfigured by the user; when this step is executed, it is read from a configuration file.
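Reading the preconfigured start URL from a configuration file, as step S201 describes, might look like the sketch below. It is written in Python purely for illustration (the patent targets R), and the `[crawler]` section, the `start_url` and `max_depth` keys, and the `load_crawler_config` function are hypothetical names that do not come from the source.

```python
import configparser

# Hypothetical configuration content; in practice this would live in a file
# such as crawler.ini and be loaded with parser.read(path).
CONFIG_TEXT = """
[crawler]
start_url = http://example.com/
max_depth = 2
"""

def load_crawler_config(text):
    """Parse the crawler settings and return them as a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["crawler"]
    return {
        "start_url": section["start_url"],         # URL of the initial-level page
        "max_depth": section.getint("max_depth"),  # assumed level limit
    }

config = load_crawler_config(CONFIG_TEXT)
```

Keeping the URL in a configuration file means the crawl target can be changed without editing the crawler code itself.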
S202. Grabbing first page information from the initial-level page. This step may grab set web page elements in the page, such as URLs, pictures, or text. The user may configure the content to be grabbed according to actual needs.
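As a rough illustration of grabbing user-selected element types (URLs, pictures, text), the sketch below parses an HTML string with Python's standard `html.parser` module. It is an assumption-laden stand-in: in the patent's setting the HTML would come from the simulated browser after the page has rendered, and the `ElementGrabber` class, the `grab_elements` function, and the `wanted` parameter are hypothetical names.

```python
from html.parser import HTMLParser

class ElementGrabber(HTMLParser):
    """Collects link URLs, image URLs, and visible text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links, self.images, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def grab_elements(html, wanted=("links", "images", "text")):
    """Return only the element kinds the user configured in `wanted`."""
    grabber = ElementGrabber()
    grabber.feed(html)
    grabber.close()
    found = {"links": grabber.links, "images": grabber.images, "text": grabber.text}
    return {kind: found[kind] for kind in wanted}

SAMPLE = '<p>Hello</p><a href="/next">Next page</a><img src="/logo.png">'
first_page_info = grab_elements(SAMPLE, wanted=("links", "images"))
```

The `wanted` tuple plays the role of the user's configuration of what to grab; leaving it at its default returns all three kinds.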
S203. Judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, executing step S2031; otherwise, executing step S204.
S2031. Grabbing second page information from the next-level page.
The user may set the condition for judging whether to grab the content of a next-level page. For example, if specific information is grabbed from a page at the current level, the pages of the next level continue to be grabbed; otherwise, they are not. Alternatively, the user may set the condition to the number of levels to grab: once the number of levels grabbed exceeds the set number, no further levels are grabbed.
S204. Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information. This step may either store the data without processing it first, or preprocess the data and then store it. The preprocessing includes, but is not limited to, format conversion, type conversion, and data filtering.
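The preprocess-then-store option of step S204 can be sketched as below, using Python's standard `sqlite3` module with an in-memory database. The patent does not name a database or a schema, so the `pages` table, its columns, the record fields, and the `preprocess` and `store` functions are all assumptions made for illustration.

```python
import sqlite3

def preprocess(records):
    """Filter out rows without a title and convert the view count to an integer."""
    cleaned = []
    for rec in records:
        title = rec.get("title", "").strip()
        if not title:  # data filtering
            continue
        cleaned.append((rec["url"], title, int(rec.get("views", 0))))  # type conversion
    return cleaned

def store(conn, rows):
    """Write the cleaned rows into a hypothetical `pages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, views INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", rows)
    conn.commit()

raw = [
    {"url": "http://example.com/a", "title": " First ", "views": "12"},
    {"url": "http://example.com/b", "title": "", "views": "3"},
]
conn = sqlite3.connect(":memory:")
store(conn, preprocess(raw))
stored = conn.execute("SELECT url, title, views FROM pages").fetchall()
```

Storing the raw data first and processing it later, the other option the step allows, would simply skip the `preprocess` call and keep every field as text.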
In this embodiment, the judgment of step S203 may continue to be executed on the pages of the next level, so that the content of pages at ever lower levels is crawled.
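Steps S202 to S204, with the S203 judgment repeated at each level, can be sketched as a small recursive function. The sketch is in Python for illustration only (the patent targets R with a simulated browser); the `PAGES` table stands in for real page grabs, and `crawl_levels`, `has_marker`, and `max_levels` are hypothetical names. It combines both example conditions from the text: a marker found in the grabbed information and a limit on the number of levels.

```python
# Hypothetical site: each URL maps to the information a grab would return.
PAGES = {
    "/": {"text": "index more", "links": ["/a", "/b"]},
    "/a": {"text": "detail more", "links": ["/a1"]},
    "/b": {"text": "empty", "links": ["/b1"]},
    "/a1": {"text": "leaf more", "links": ["/a2"]},
    "/b1": {"text": "never grabbed", "links": []},
    "/a2": {"text": "never grabbed", "links": []},
}

def crawl_levels(url, fetch_page, should_descend, max_levels, level=1):
    """Grab one page, then decide whether to grab the pages one level below it."""
    info = fetch_page(url)  # S202 / S2031: grab the page information
    results = [(level, url, info["text"])]
    # S203: judge from the page information and/or a set condition (level count).
    if level < max_levels and should_descend(info):
        for child in info["links"]:
            results.extend(
                crawl_levels(child, fetch_page, should_descend, max_levels, level + 1)
            )
    return results  # S204: handed over for storage or processing

def has_marker(info):
    """Example condition: descend only if the page text contains 'more'."""
    return "more" in info["text"]

results = crawl_levels("/", PAGES.__getitem__, has_marker, max_levels=3)
```

Here `/b` is grabbed but not descended into because its text lacks the marker, and `/a2` is never reached because the level limit stops the recursion at level 3.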
R is a language and operating environment for statistical analysis and graphics. R is free, open-source software belonging to the GNU system, and it is an excellent tool for statistical computing and statistical graphics. R provides computer users with a large number of commonly used statistical computing functions, and it supports users in writing their own programs to extend its functionality. R itself provides a set of function libraries, which constitute the original R platform. Users can write their own programs to extend R; such programs are generally added to the R platform in the form of R packages. R takes functional programming as its main programming paradigm, while also supporting modern program design methods such as object-oriented programming. This embodiment therefore uses the characteristics of R and simulates browser operation to avoid the problem that asynchronously loaded web page content is inconsistent with the source code, which substantially reduces the difficulty of grabbing information.
In this embodiment, the R language server is an x86 server that executes R code to implement the web page content crawling function. Because R is open source, the R language server can run on several operating systems, including Windows and Linux.
As a preferred embodiment, step S100 comprises:
S101. Loading the base packages, database access packages, and web scraping packages of the R language;
S102. Configuring the browser driver, the system environment variables, and the Selenium service. Selenium is a tool for testing web applications that can simulate a user's operations in a browser.
As a preferred embodiment, step S202 specifically comprises:
Grabbing a set web page element of the initial-level page as the first page information; the specific element may be text, a URL, or a picture.
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information. This embodiment uses the characteristics of R to simulate interactive operations in the browser, so that content found through the search box can be grabbed. This embodiment can also grab pages at the same level in a loop.
Referring to Fig. 2, as a preferred embodiment, in order to avoid browser crashes, the data grabbing step further comprises:
S205. Restarting the browser when the number of pages opened by the browser reaches a set threshold. By setting a condition for restarting the browser, this embodiment lets the browser restart periodically, which reduces browser crashes.
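The restart rule of step S205 can be sketched as a thin wrapper that counts opened pages. The example is in Python and uses a stand-in `FakeBrowser` class rather than a real Selenium session, so `FakeBrowser`, `RestartingBrowser`, and `max_pages` are hypothetical names. One design choice is assumed here: the restart happens just before the next page is opened once the threshold has been reached, so no page grab is interrupted.

```python
class FakeBrowser:
    """Stand-in for a real browser session; a hypothetical class for illustration."""

    instances = 0  # counts how many browser sessions have been started

    def __init__(self):
        FakeBrowser.instances += 1
        self.closed = False

    def open(self, url):
        return f"<html>{url}</html>"

    def quit(self):
        self.closed = True


class RestartingBrowser:
    """Restarts the underlying browser once `max_pages` pages have been opened."""

    def __init__(self, factory, max_pages):
        self._factory = factory
        self._max_pages = max_pages
        self._browser = factory()
        self._opened = 0
        self.restarts = 0

    def open(self, url):
        if self._opened >= self._max_pages:  # threshold reached: restart first
            self._browser.quit()
            self._browser = self._factory()
            self._opened = 0
            self.restarts += 1
        self._opened += 1
        return self._browser.open(url)


session = RestartingBrowser(FakeBrowser, max_pages=3)
pages = [session.open(f"http://example.com/{i}") for i in range(7)]
```

With a threshold of 3 and 7 pages, the wrapper starts three browser sessions in total, which bounds how much state any single session accumulates.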
A web page content crawling system based on the R language, corresponding to the method of Fig. 1, comprises:
A building module, for building an R language server;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging-and-grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
As a preferred embodiment, the building module comprises:
A loading unit, for loading the base packages, database access packages, and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, the system environment variables, and the Selenium service.
As a preferred embodiment, the grabbing module is specifically used for:
Grabbing a set web page element of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information.
As a preferred embodiment, the system further comprises a restart module, which is used for:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
A web page content crawling system based on the R language comprises:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language corresponding to Fig. 1.
A storage medium stores a computer program which, when executed by a processor, implements the web page content crawling method based on the R language corresponding to Fig. 1.
The step numbers in the above method embodiment are provided only for ease of explanation and do not limit the order of the steps in any way; the execution order of the steps in the embodiment can be adapted according to the understanding of those skilled in the art.
The above is a description of preferred implementations of the present invention, but the present invention is not limited to the embodiments described above. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the present invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of the present application.
Claims (10)
1. A web page content crawling method based on the R language, characterized by comprising the following steps:
Building an R language server;
Executing a data grabbing step on the R language server;
The data grabbing step comprises:
Obtaining the URL of the initial-level web page and starting a browser;
Grabbing first page information from the initial-level page;
Judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
Storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
2. The web page content crawling method based on the R language according to claim 1, characterized in that building the R language server specifically comprises:
Loading the base packages, database access packages, and web scraping packages of the R language;
Configuring the browser driver, the system environment variables, and the Selenium service.
3. The web page content crawling method based on the R language according to claim 1, characterized in that grabbing the first page information from the initial-level page specifically comprises:
Grabbing a set web page element of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information.
4. The web page content crawling method based on the R language according to claim 1, characterized in that the data grabbing step further comprises:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
5. A web page content crawling system based on the R language, characterized by comprising:
A building module, for building an R language server;
The R language server comprises:
An obtaining module, for obtaining the URL of the initial-level web page and starting a browser;
A grabbing module, for grabbing first page information from the initial-level page;
A judging-and-grabbing module, for judging, according to the first page information and/or a set condition, whether the content of a next-level page needs to be grabbed; if so, grabbing second page information from the next-level page; otherwise, proceeding directly to the next step;
An information processing module, for storing the obtained first page information and/or second page information in a database, or performing data processing on the obtained first page information and/or second page information.
6. The web page content crawling system based on the R language according to claim 5, characterized in that the building module comprises:
A loading unit, for loading the base packages, database access packages, and web scraping packages of the R language;
A configuration unit, for configuring the browser driver, the system environment variables, and the Selenium service.
7. The web page content crawling system based on the R language according to claim 5, characterized in that the grabbing module is specifically used for:
Grabbing a set web page element of the initial-level page as the first page information;
Or
Searching in the search box of the initial-level page according to set content, and grabbing a set element in the search results as the first page information.
8. The web page content crawling system based on the R language according to claim 5, characterized in that the R language server further comprises a restart module, which is used for:
Restarting the browser when the number of pages opened by the browser reaches a set threshold.
9. A web page content crawling system based on the R language, characterized by comprising:
A memory, for storing a program;
A processor, for loading the program to execute the web page content crawling method based on the R language according to any one of claims 1-4.
10. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the web page content crawling method based on the R language according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811061186.6A CN109284434A (en) | 2018-09-12 | 2018-09-12 | Web page contents crawling method, system and storage medium based on R language |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811061186.6A CN109284434A (en) | 2018-09-12 | 2018-09-12 | Web page contents crawling method, system and storage medium based on R language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109284434A true CN109284434A (en) | 2019-01-29 |
Family
ID=65181279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811061186.6A Pending CN109284434A (en) | 2018-09-12 | 2018-09-12 | Web page contents crawling method, system and storage medium based on R language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284434A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101702743A (en) * | 2009-11-04 | 2010-05-05 | 中兴通讯股份有限公司 | Self-adaption adjustment method of mobile terminal browser and device thereof |
CN101902438A (en) * | 2009-05-25 | 2010-12-01 | 北京启明星辰信息技术股份有限公司 | Method and device for automatically identifying web crawlers |
US20140236960A1 (en) * | 2013-02-19 | 2014-08-21 | Futurewei Technologies, Inc. | System and Method for Database Searching |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
Non-Patent Citations (3)
Title |
---|
EASYCHARTS: "《搜狐博客》", 21 November 2017 * |
吴小坤: "《数据新闻制作简明教程》", 31 May 2018 * |
樊芳芳的博客: "《CSDN博客》", 29 April 2018 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190129 |