CN107463713A - Method for quickly verifying a CSS selector

Publication number: CN107463713A
Application number: CN201710734682.2A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张超
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2017-08-24
Filing date: 2017-08-24
Publication date: 2017-12-12
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The present invention relates to Java and web technologies. It addresses the problem that, when a focused web crawler is being written, target websites commonly resist crawling by loading information dynamically, which makes it difficult to obtain a working CSS selector quickly. The invention proposes a method for quickly verifying a CSS selector, which can be summarized as follows: web page information is captured with a CSS selector, and the captured information is checked against what is needed. If it does not meet the need, the method judges whether the information is dynamically loaded; if so, a browser kernel is called to re-download and parse the web page source code, the CSS selector is rewritten, the parsed information is captured with the rewritten selector, and the result is checked again. The benefit is that when dynamically loaded page content makes a CSS selector unusable, re-downloading and parsing the target web page through a browser kernel and then rewriting the selector yields an effective CSS selector.

Description

Method for quickly verifying a CSS selector
Technical field
The present invention relates to technology for downloading web page source code with Java, and in particular to CSS selector technology.
Background
A web crawler is a program or script that automatically captures web information according to certain rules; it is also known as a web chaser. Now that networks are developing rapidly, the World Wide Web has become the largest carrier of information, and traditional search engines, as tools that help people retrieve information, have certain limitations: 1. Different users often have different retrieval purposes, yet a search engine returns a large amount of useless information, which wastes resources. 2. A search engine aims for the widest possible coverage of network resources, and as those resources grow richer, its limited resources are increasingly unable to keep up. 3. Search engines are increasingly powerless against the growing variety of network resources; information-dense, structured resources such as pictures, databases, audio and video are beyond what they can handle. 4. Search engines index by keyword, which makes retrieval based on semantic analysis difficult. Focused web crawlers, which capture network resources in a directed way, emerged to solve these problems: a focused web crawler visits network resources selectively according to the target information and so obtains the desired information quickly.
At present, how to write web crawlers efficiently and quickly is a topic of much interest. CSS (Cascading Style Sheets) generally defines how HTML elements are displayed; the layout and appearance of a page can be changed through CSS documents, and a CSS selector selects the HTML elements to which a style applies. When writing a focused web crawler, obtaining a CSS selector quickly and accurately is therefore the key to capturing information. Target websites commonly resist crawling by loading HTML information dynamically, so it is difficult to obtain an effective CSS selector quickly. This patent proposes a method for quickly verifying a CSS selector: a GUI tool quickly verifies whether a CSS selector can capture the HTML information, and when the CSS selector turns out to be invalid, the web page source code is downloaded by calling a browser kernel and the CSS selector expression is rewritten against the downloaded source code, which solves the information-capture failures caused by dynamic page loading.
HttpClient handles requests to web sites well. It is a simple HTTP client that can send HTTP requests and receive HTTP responses, but it does not cache server responses, cannot execute JavaScript embedded in an HTML page, and performs no parsing or processing of page content.
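Because a plain HTTP client returns the page source without running its scripts, content that a script inserts later is simply absent from what the crawler sees. The following is a minimal sketch of one way to notice this, assuming a hypothetical helper named DynamicContentCheck; the heuristic (expected text missing while the source contains script tags) is an illustrative assumption and is not described in the patent.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the patent text. */
final class DynamicContentCheck {

    // Matches the start of any <script ...> tag, ignoring case.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    /**
     * Heuristic guess: the text we expect to capture is missing from the raw
     * source, yet the page contains scripts that could insert it at render time.
     */
    static boolean looksDynamicallyLoaded(String rawHtml, String expectedText) {
        boolean missing = !rawHtml.contains(expectedText);
        boolean hasScript = SCRIPT_TAG.matcher(rawHtml).find();
        return missing && hasScript;
    }
}
```

A page whose price is filled in by a script after loading would be flagged, while a page that already carries the price in its markup would not. The check can produce false positives, since most pages contain scripts, so it is only a hint.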
Summary of the invention
The object of the invention is to provide a method for quickly verifying a CSS selector, solving the problem that, when a focused web crawler is being written, target websites commonly resist crawling by loading information dynamically, which makes it difficult to obtain a CSS selector quickly and effectively.
The technical solution adopted by the invention to solve this problem is a method for quickly verifying a CSS selector, characterized by comprising the following steps:
Step 1, obtain the target URL and a CSS selector from the required web page, and input the target URL;
Step 2, download the web page source code of the target web page with HttpClient and parse it;
Step 3, input the CSS selector and capture the parsed information;
Step 4, judge whether the captured information meets the need; if so, end; if not, go to step 5;
Step 5, judge whether the captured information is dynamically loaded; if so, go to step 6; if not, go to step 3;
Step 6, call a browser kernel to re-download the web page source code of the target web page and parse it;
Step 7, rewrite the CSS selector and input it;
Step 8, capture the parsed information with the rewritten CSS selector;
Step 9, judge whether the captured information meets the need; if so, end; if not, go to step 8.
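The control flow of steps 2 to 9 can be sketched in Java as below. This is a rough illustration under stated assumptions, not the patented implementation: the Toolkit interface and all of its method names are hypothetical stand-ins for the HTTP client, the browser kernel, the parser, and the human judgments in steps 4, 5 and 9. The sketch also assumes that each retry takes the next selector entered by the user and that the loop stops when no more selectors are supplied, whereas the text itself loops back to step 3 or step 8 without a bound.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of the verification loop; names are assumptions. */
final class SelectorVerificationFlow {

    /** Stand-ins for the tools and judgments the steps rely on. */
    interface Toolkit {
        String fetchStatic(String url);                       // step 2: plain HTTP download
        String fetchRendered(String url);                     // step 6: download via a browser kernel
        List<String> capture(String html, String selector);  // steps 3 and 8: apply the selector
        boolean meetsNeed(List<String> captured);             // steps 4 and 9: is the result usable?
        boolean isDynamicallyLoaded(String html, List<String> captured); // step 5
    }

    /** Returns the first selector whose captured information meets the need. */
    static Optional<String> verify(String url, Iterator<String> selectors, Toolkit kit) {
        String html = kit.fetchStatic(url);                   // step 2
        boolean rendered = false;
        while (selectors.hasNext()) {
            String selector = selectors.next();               // step 3 or step 7
            List<String> captured = kit.capture(html, selector);
            if (kit.meetsNeed(captured)) {                    // step 4 or step 9
                return Optional.of(selector);
            }
            // Step 5: switch to the browser kernel once, if the content is dynamic.
            if (!rendered && kit.isDynamicallyLoaded(html, captured)) {
                html = kit.fetchRendered(url);                // step 6
                rendered = true;
            }
        }
        return Optional.empty();                              // ran out of selectors to try
    }
}
```

Keeping the tools behind one interface lets the same loop run against a real downloader and parser or against simple stubs when checking the flow itself.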
Specifically, in step 1, the target URL is obtained from the required web page using the GetText() function.
Further, in step 2, downloading the web page source code of the target web page with HttpClient specifically comprises the following steps:
Step 201, set the HttpClient network connection parameters;
Step 202, establish the HTTP network connection;
Step 203, obtain the web page source code of the target web page with the HttpClient Get method.
Specifically, in step 201, the HttpClient network connection parameters that are set comprise:
A, the request timeout, 2 seconds by default;
B, the timeout for waiting for data, 2 seconds by default;
C, the timeout for waiting when no connection is available, 500 milliseconds by default;
D, the maximum number of connections in the whole connection pool, 200 by default.
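The defaults listed above might be expressed in code roughly as follows. This sketch swaps in the JDK's built-in java.net.http client (Java 11 or later) so that it is self-contained; the patent's HttpClient is probably a different library, and the class name CrawlerHttpSettings is hypothetical. Mapping parameter A to the connect timeout and parameter B to the per-request timeout is an assumption. The JDK client's builder does not expose connection pool settings, so C and D are kept only as named constants.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical holder for the default connection parameters A to D. */
final class CrawlerHttpSettings {

    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(2);     // A: request timeout
    static final Duration DATA_TIMEOUT = Duration.ofSeconds(2);        // B: wait-for-data timeout
    static final Duration POOL_WAIT_TIMEOUT = Duration.ofMillis(500);  // C: not applied by the JDK client
    static final int MAX_POOL_CONNECTIONS = 200;                       // D: not applied by the JDK client

    /** Step 201: build a client with the connect timeout set (assumed mapping of A). */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(REQUEST_TIMEOUT)
                .build();
    }

    /** Step 203: build a GET request with the response timeout set (assumed mapping of B). */
    static HttpRequest newGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(DATA_TIMEOUT)
                .GET()
                .build();
    }
}
```

Sending the request with the client's send method and a string body handler would then return the page source for parsing; that network call is left out so the sketch runs offline.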
Further, in step 2 and step 6, the web page source code of the target web page is parsed with the Jsoup parser.
Specifically, in step 6, the browser kernel is called using Selenium WebDriver.
Further, in step 6, the browser is a Chrome browser.
The beneficial effect of the invention is that, with the above method for quickly verifying a CSS selector, when it is judged that dynamically loaded page information has made a CSS selector unusable, a browser kernel is called to re-download and parse the web page source code of the target web page, and the CSS selector is then rewritten, so that an effective CSS selector is obtained quickly.
Detailed description
The technical solution of the invention is described in detail below with reference to an embodiment.
The method for quickly verifying a CSS selector of the present invention comprises the following steps:
Step 1, obtain the target URL and a CSS selector from the required web page, and input the target URL;
Step 2, download the web page source code of the target web page with HttpClient and parse it;
Step 3, input the CSS selector and capture the parsed information;
Step 4, judge whether the captured information meets the need; if so, end; if not, go to step 5;
Step 5, judge whether the captured information is dynamically loaded; if so, go to step 6; if not, go to step 3;
Step 6, call a browser kernel to re-download the web page source code of the target web page and parse it;
Step 7, rewrite the CSS selector and input it;
Step 8, capture the parsed information with the rewritten CSS selector;
Step 9, judge whether the captured information meets the need; if so, end; if not, go to step 8.
Embodiment
The method for quickly verifying a CSS selector of this embodiment of the invention comprises the following steps:
Step 1, obtain the target URL and a CSS selector from the required web page, and input the target URL;
Step 2, download the web page source code of the target web page with HttpClient and parse it with the Jsoup parser;
Step 3, input the CSS selector and capture the parsed information;
Step 4, judge whether the captured information meets the need; if so, end; if not, go to step 5;
Step 5, judge whether the captured information is dynamically loaded; if so, go to step 6; if not, go to step 3;
Step 6, use Selenium WebDriver to call the Chrome browser kernel to re-download the web page source code of the target web page, and parse it with the Jsoup parser;
Step 7, rewrite the CSS selector and input it;
Step 8, capture the parsed information with the rewritten CSS selector;
Step 9, judge whether the captured information meets the need; if so, end; if not, go to step 8.
In the above method, in step 1, the target URL is obtained from the required web page using the GetText() function.
In step 2, downloading the web page source code of the target web page with HttpClient specifically comprises the following steps:
Step 201, set the HttpClient network connection parameters;
Step 202, establish the HTTP network connection;
Step 203, obtain the web page source code of the target web page with the HttpClient Get method.
In step 201, the HttpClient network connection parameters that are set comprise:
A, the request timeout, 2 seconds by default;
B, the timeout for waiting for data, 2 seconds by default;
C, the timeout for waiting when no connection is available, 500 milliseconds by default;
D, the maximum number of connections in the whole connection pool, 200 by default.
The above parameters can be set according to actual needs.

Claims (7)

1. the method for fast verification CSS selector, it is characterised in that comprise the following steps:
Step 1, obtain target network address and CSS selector from required webpage and input target network address;
Step 2, the webpage source code by HttpClient download target webs and parsing;
Step 3, input CSS selector and the information to parsing capture;
Step 4, judge whether the information of crawl meets needs, if satisfied, then terminating, if not satisfied, then entering step 5;
Whether step 5, the information for judging to capture are dynamic load information, if so, then entering step 6, if it is not, then entering step 3;
Step 6, calling browser kernel re-download webpage source code and the parsing of target web;
Step 7, CSS selector is write again, and input;
Step 8, using the CSS selector write again the information parsed is captured;
Step 9, judge whether the information of crawl meets needs, if satisfied, then terminating, if not satisfied, then entering step 8.
2. the method for fast verification CSS selector according to claim 1, it is characterised in that described from institute in step 1 Need webpage to obtain target network address and use GetText () function.
3. the method for fast verification CSS selector according to claim 1, it is characterised in that described to pass through in step 2 The webpage source code that HttpClient downloads target web specifically includes following steps:
Step 201, setting HttpCliet network connection parameters;
Step 202, establish Http network connections;
Step 203, using HttpClient Get methods obtain target web webpage source code.
4. the method for fast verification CSS selector according to claim 3, it is characterised in that in step 201, The setting of HttpCliet network connection parameters specifically includes:
A, request timed out time, default setting are 2 seconds;
B, data timeout time is waited, default setting is 2 seconds;
C, not enough wait time-out time is connected, default setting is 500 milliseconds;
D, whole connection pool maximum number of connections, default setting 200.
5. the method for fast verification CSS selector according to claim 1, it is characterised in that right in step 2 and step 6 The parsing of the webpage source code of target web has used Jsoup resolvers.
6. the method for fast verification CSS selector according to claim 1, it is characterised in that in step 6, use Selenium Webdriver call browser kernel.
7. the method for the fast verification CSS selector according to claim 1 or 6, it is characterised in that described clear in step 6 Device of looking at is model Chrome browser.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710734682.2A CN107463713A (en) 2017-08-24 2017-08-24 Method for quickly verifying a CSS selector

Publications (1)

Publication Number Publication Date
CN107463713A true CN107463713A (en) 2017-12-12

Family

ID=60550515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710734682.2A Method for quickly verifying a CSS selector 2017-08-24 2017-08-24 (Pending; published as CN107463713A (en))

Country Status (1)

Country Link
CN (1) CN107463713A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2017-12-12