CN112347332A - XPath-based crawler target positioning method - Google Patents

XPath-based crawler target positioning method

Info

Publication number
CN112347332A
Authority
CN
China
Prior art keywords
webpage
xpath
blocks
content
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011287213.9A
Other languages
Chinese (zh)
Inventor
乜鹏
王锐璇
董佳霖
陈楚翘
郑羽辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202011287213.9A
Publication of CN112347332A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention belongs to the fields of web technology and information capture, and in particular relates to a crawler target positioning method based on the webpage path language XPath. The method comprises the following specific steps: step 1, loading the website information and acquiring the webpage corresponding to the website address; step 2, finding the relative position in the webpage of the existing content at the monitored position; step 3, dividing the webpage into blocks, each of which contains the content at the monitored position; and step 4, determining the monitoring range through human-computer interaction. The invention meets users' needs for monitoring and acquiring information (news, notices, and other content) based on their actual requirements. It partitions the webpage based on the tree structure of the webpage and locates the user's requirement precisely through visual presentation and human-computer interaction.

Description

XPath-based crawler target positioning method
Technical Field
The invention belongs to the fields of web technology and information capture, and in particular relates to a crawler target positioning method based on the webpage path language XPath.
Background
As more and more information enters people's lives, obtaining accurate and useful information in time becomes harder. In this era of "information explosion", it is increasingly difficult for individuals to accurately obtain the information that interests them, such as education, shopping, and news.
A method for positioning the crawler target is important for tasks such as information capture and monitoring changes in webpage information. RSS-based website subscription services, which were popular a few years ago, notify users when website content is updated, but they do not use any crawler target positioning method, so their update reminders are not targeted. Most webpage target positioning methods developed later identify a block by its attributes, such as id or class; such methods are limited by the naming conventions of the webpage source code and do not meet users' real needs.
Disclosure of Invention
To address the above problems, the invention provides a crawler target positioning method that is accurate, reasonable, and consistent with the user's intent.
To achieve this purpose, the invention provides the following technical scheme:
An XPath-based crawler target positioning method comprises the following specific steps:
step 1, loading the website information and acquiring the webpage corresponding to the website address;
step 2, finding the relative position in the webpage of the existing content at the monitored position;
step 3, dividing the webpage into blocks, each of which contains the content at the monitored position;
step 4, determining the monitoring range through human-computer interaction.
In a further optimization of the technical scheme, in step 1 the source code of the input webpage is crawled and the target webpage is presented visually as an embedded webpage.
In a further optimization of the technical scheme, step 2 comprises: the user selects a position to monitor in the target webpage and inputs the text content currently at that position; the XPath corresponding to that text content is then found by traversing the DOM tree of the webpage source code.
In a further optimization of the technical scheme, the XPath of the existing content at the monitored position is found in step 2 as follows: the DOM tree nodes of the HTML webpage are traversed, the tree nodes matching the input content are found, and the paths of those nodes are stored.
In a further optimization of the technical scheme, the webpage is partitioned in step 3 as follows: the partitioning is based on the DOM tree structure of the HTML webpage; the nodes on the path from the root node to the leaf node represent all the webpage blocks that contain the existing content at the monitored position, and these blocks are marked in the webpage; partitioning is divided into longitudinal blocking and transverse blocking according to the number of XPaths returned in step 2; when exactly one XPath is returned, there is only one definite position and only longitudinal blocking is needed; when more than one XPath is returned, transverse blocking is performed first and longitudinal blocking second.
In a further optimization of the technical scheme, step 4 specifically comprises:
step 4.1, if no XPath is returned, the content is input again;
step 4.2, if exactly one XPath is returned, the webpage is presented in blocks according to step 3, and the user inputs the number representing the desired webpage block, which identifies the webpage block to be monitored;
step 4.3, if more than one XPath is returned, the first interaction presents the transverse blocks, the second interaction presents the longitudinal blocks, and the user selects the exact monitoring position.
In a further optimization of the technical scheme, step 3 is specifically: based on the existing content at the monitored position and on the webpage structure, the ranges containing the content at the monitored position are delimited in the webpage by longitudinal and transverse blocking and are marked in different colors.
Compared with the prior art, the technical scheme has the following beneficial effects:
The invention meets users' needs for monitoring and acquiring information (news, notices, and other content) based on their actual requirements. It partitions the webpage based on the tree structure of the webpage and locates the user's requirement precisely through visual presentation and human-computer interaction.
Compared with attribute-value-based methods, the method is more general and applies to all HTML-based webpages. In addition, combined with a crawler scheme, the method enables effective information monitoring; it can also be used in a webpage problem-feedback module to pinpoint the location of a problem, saving manpower and material resources. The invention is also user-friendly: it avoids computer jargon, requires no tutorial, and its human-computer interaction process is concise and easy to understand.
Drawings
FIG. 1 is a flow chart of a crawler target location method based on XPath;
FIG. 2 is a simplified web page partition and DOM tree structure diagram;
FIG. 3 is a simplified diagram of a transverse block;
FIG. 4 is a first schematic view of a page;
FIG. 5 is a second schematic page view;
FIG. 6 is a third schematic page view.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to FIG. 1, which is a flowchart of the XPath-based crawler target positioning method, the method specifically includes the following steps:
Step 1, loading the website input by the user
After the user inputs a target website address, the back end downloads the HTML file of the webpage with Python's requests library and then uses regular expressions to parse out the original paths of resources such as CSS, JavaScript, and images referenced in the file. These resources are downloaded and stored according to those paths so that the complete original webpage is presented, rather than bare HTML text. The downloaded target webpage is loaded into the input page so that the user can conveniently view it and make selections.
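The resource-parsing part of step 1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the regular expression, the extension list, the function name extract_resource_urls, and the example.com addresses are all assumptions, and the download itself (done with requests in the text) is omitted so that the sketch runs without network access.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the value of every src="..." or href="..." attribute.
RESOURCE_ATTR = re.compile(r'''(?:src|href)\s*=\s*["']([^"']+)["']''', re.IGNORECASE)
# Hypothetical list of extensions treated as page resources.
RESOURCE_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg")


def extract_resource_urls(html, base_url):
    """Return absolute URLs of CSS, JavaScript and image resources referenced in html."""
    urls = []
    for raw in RESOURCE_ATTR.findall(html):
        absolute = urljoin(base_url, raw)          # resolve relative paths
        path = absolute.split("?", 1)[0].lower()   # ignore any query string
        if path.endswith(RESOURCE_EXT) and absolute not in urls:
            urls.append(absolute)
    return urls


SAMPLE = '''<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
</head><body>
<a href="/news/1.html">headline</a>
<img src="https://cdn.example.com/logo.png?v=2">
</body></html>'''

for url in extract_resource_urls(SAMPLE, "https://example.com/sports/index.html"):
    print(url)
```

Ordinary hyperlinks such as /news/1.html are skipped because they are not resources needed to render the page. A production crawler would more likely use an HTML parser than a regular expression, since attribute syntax varies widely across real pages.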
Step 2, finding the XPath of the existing content of the monitoring position
Determining the relative position in the webpage of the existing content at the monitored position input by the user is one of the key steps of the invention.
In an HTML webpage, the DOM represents the HTML document as a tree. By traversing the nodes of the DOM tree, the method finds the nodes whose content matches the content input by the user, records the paths of those nodes in the HTML code of the webpage, and expresses each path as an XPath. A path expressed in XPath runs from the root node <html> to the node where the content is located in the DOM tree of the HTML webpage. The HTML document, held as a string, is converted into an _Element object with the etree.HTML function of Python's lxml library, so that the xpath method can be used to evaluate XPath expressions, and the find method of etree's ElementTree can be used to search for the matching XPath.
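The traversal described in step 2 can be sketched as below. The text uses lxml; this sketch swaps in the standard-library xml.etree.ElementTree so that it runs without third-party packages, which means the sample page must be well-formed markup (an assumption that real HTML often violates). The function name find_xpaths and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_xpaths(root, needle):
    """Return positional XPaths of all elements whose own text contains needle."""
    matches = []

    def walk(node, path):
        if node.text and needle in node.text:
            matches.append(path)
        # Count same-tag siblings so that an index is added only when it is needed.
        totals = {}
        for child in node:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            walk(child, f"{path}/{step}")

    walk(root, "/" + root.tag)
    return matches


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div><h1>Cup final report</h1></div>
<div><ul><li><a>Cup final interview</a></li><li><a>Transfer news</a></li></ul></div>
</body></html>"""

root = ET.fromstring(PAGE)
print(find_xpaths(root, "Cup final"))   # two positions
print(find_xpaths(root, "Transfer"))    # one position
print(find_xpaths(root, "absent"))      # no position
```

The three calls correspond to the three cases that step 4 distinguishes: several XPaths, exactly one XPath, and none.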
Step 3, presenting the webpage blocks and all selectable monitoring ranges
The invention partitions the webpage in two ways, longitudinal blocking and transverse blocking, and uses the two to handle different situations. Longitudinal blocking separates the parent and child nodes along a single XPath, whereas transverse blocking marks the different positions at which the content appears. The two are defined below.
Step 3.1, longitudinal blocking
In an HTML webpage, the start and end of each tag delimit an element, and the webpage can be partitioned on that basis. The XPath obtained in the previous step represents the path from the root node <html> to the node where the content is located, and each node on the path corresponds to a block in the page. For example, on the Newcastle website, the path of a news headline in the webpage is /html/body/div[9]/div[6]/div[2]/div/div[2]/h1[1]; FIG. 2 shows the corresponding webpage blocks and a simplified DOM tree. Along this path, the webpage range of a child node is a subset of the range of its parent node, and each node represents one block range of the webpage. The invention parses the XPath using Python string slicing.
For visual presentation, a frame is added around each block. Specifically, the XPath is split into its successive prefixes; for example, XPath(content) is split into /html/body, /html/body/div[9], …, /html/body/div[9]/div[6]/div[2]/div/div[2]/h1[1]. Each webpage block is then framed in color through a CSS selector, with a declaration such as {border: red solid thick;}, according to the rules detailed in Table 1, and the modified page is displayed to the user for selection.
Table 1. Specific CSS selector rules
[Table 1 appears as an image in the original publication and is not reproduced here.]
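The splitting of an XPath into nested block paths can be sketched as follows, using the headline path from the text. The function names and the color palette are hypothetical; the actual number-to-color rules are those of Table 1, which is not available here.

```python
def block_paths(xpath):
    """Split an XPath into its successive prefixes, starting at /html/body.

    Each prefix identifies one nested block; later prefixes are contained in earlier ones.
    """
    steps = xpath.strip("/").split("/")
    return ["/" + "/".join(steps[:i]) for i in range(2, len(steps) + 1)]


# Hypothetical palette; the real color rules are given in Table 1 of the patent.
PALETTE = ["red", "orange", "green", "blue", "purple"]


def numbered_blocks(xpath):
    """Pair every nested block with a selection number and a frame color."""
    return [
        (number, path, PALETTE[(number - 1) % len(PALETTE)])
        for number, path in enumerate(block_paths(xpath), start=1)
    ]


HEADLINE = "/html/body/div[9]/div[6]/div[2]/div/div[2]/h1[1]"
for number, path, color in numbered_blocks(HEADLINE):
    print(number, color, path)
```

The number shown next to each block is what the user would type in step 4 to pick the range to monitor.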
Step 3.2, transverse blocking
Transverse blocking marks the different positions at which the input content appears. For example, on the Newcastle website, if the input content is "Liipu", the content appears at more than one position, namely /html/body/div[10]/div[7]/div[1]/div[5]/div[2]/ul/li[11]/a and /html/body/div[10]/div[14]/div[2]/div[1]/li[4]/a. Because such blocks generally do not intersect, this case can be represented by the simplified diagram in FIG. 3.
The partitioning method follows the way HTML is organized: each element corresponds to a range in the webpage, so analyzing the page element by element also divides the webpage into ranges.
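The remark that transverse blocks generally do not intersect can be checked mechanically: two blocks intersect only if one XPath is an ancestor of the other. The sketch below numbers the candidate positions after making that check; all function names are hypothetical, and the two paths are the ones from the example above.

```python
def steps(xpath):
    """Split an XPath into its location steps."""
    return xpath.strip("/").split("/")


def is_ancestor(a, b):
    """True if the node at XPath a strictly contains the node at XPath b."""
    sa, sb = steps(a), steps(b)
    # Compare whole steps, so that div[1] is not mistaken for a prefix of div[10].
    return len(sa) < len(sb) and sb[:len(sa)] == sa


def are_disjoint(xpaths):
    """True if no candidate block is nested inside another candidate block."""
    return not any(is_ancestor(a, b) for a in xpaths for b in xpaths if a != b)


def transverse_blocks(xpaths):
    """Number the candidate positions 1..n for the first round of interaction."""
    if not are_disjoint(xpaths):
        raise ValueError("candidate blocks overlap")
    return dict(enumerate(xpaths, start=1))


FIRST = "/html/body/div[10]/div[7]/div[1]/div[5]/div[2]/ul/li[11]/a"
SECOND = "/html/body/div[10]/div[14]/div[2]/div[1]/li[4]/a"
print(transverse_blocks([FIRST, SECOND]))
```

Comparing step lists rather than raw strings is a deliberate choice here: a plain string-prefix test would wrongly treat /html/body/div[1] as containing /html/body/div[10]/a.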
The method has two advantages. First, using both transverse and longitudinal blocking positions the target in two dimensions, so the crawler target is determined more accurately.
Second, combining the blocking with user interaction brings the selected monitoring range closer to what the user needs.
Step 4, determining the actual monitoring range through human-computer interaction
Human-computer interaction is used to determine the specific position the user wants to monitor: the system presents the candidate blocks produced by partitioning, the user selects the specific range to monitor, and the user's requirement is located precisely through question and answer.
After analyzing the target website and the existing content at the monitored position, the method proceeds differently depending on the number of XPaths returned.
Step 4.1, no XPath is returned in step 2
The user is reminded that the input content does not appear on the website and should be checked, and the process jumps back to step 2 for the content to be input again.
Step 4.2, exactly one XPath is returned in step 2
Longitudinal blocking is performed directly according to step 3 and presented to the user. The user inputs the number representing the desired webpage block, which identifies the webpage block to be monitored.
Step 4.3, more than one XPath is returned in step 2
This means the content appears at more than one position in the webpage, and two rounds of human-computer question and answer are needed. Assuming n (n > 1) XPaths are returned, the first interaction presents the transverse blocks of step 3 and the user selects one of the n separate blocks; the second interaction presents the longitudinal blocks of step 3 for the selected block and the user selects the exact position to be monitored.
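The three cases of step 4 can be sketched as a single selection routine. This is an illustration under stated assumptions: the user's answers are modeled as two callables, pick_position and pick_range, each receiving the number of options and returning a 1-based choice; these names, and the helper nested_blocks, are hypothetical.

```python
def nested_blocks(xpath):
    """Longitudinal blocks: successive prefixes of the XPath, starting at /html/body."""
    steps = xpath.strip("/").split("/")
    return ["/" + "/".join(steps[:i]) for i in range(2, len(steps) + 1)]


def select_block(xpaths, pick_position, pick_range):
    """Return the XPath of the block to monitor, or None if the content was not found."""
    if not xpaths:
        return None                      # step 4.1: the user must input the content again
    if len(xpaths) == 1:
        chosen = xpaths[0]               # step 4.2: a single position, no first question
    else:
        chosen = xpaths[pick_position(len(xpaths)) - 1]   # step 4.3, first interaction
    blocks = nested_blocks(chosen)
    return blocks[pick_range(len(blocks)) - 1]            # longitudinal choice


CANDIDATES = ["/html/body/div[1]/h1", "/html/body/div[2]/ul/li[1]/a"]
# Simulated answers: second position, then third nested block.
print(select_block(CANDIDATES, lambda n: 2, lambda n: 3))
```

In an interactive program the two callables would prompt the user, for instance by reading a number with input(); passing them in as parameters keeps the selection logic testable.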
Specific examples are as follows:
1. user input
Target website url: http:// sports
Monitoring the existing content of the position: plum blossom: the two people are the best chief in the heart of the people who teach the people to give a lot of people
Target keywords: c Rou
Because the "existing content of the monitored position" input by the user is complete, only one XPath is returned: /html/body/div[4]/div[5]/div[2]/div/ul[1]/li[2]/a. As shown in FIG. 4, the first schematic page view, longitudinal blocking can be performed directly.
If the content input by the user is not complete enough, for example only the two characters "Meixi", the content is found at several positions in the webpage and several results are returned. In this case transverse blocking is performed first, as shown in FIG. 5, the second schematic page view.
Suppose the user selects 1; longitudinal blocking is then performed, as shown in FIG. 6, the third schematic page view. The user can now determine the final monitoring position by choosing from 1 to 4. When the two characters "C Rou" appear in the monitored range, the user is reminded.
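Once a block has been chosen, the keyword reminder amounts to checking whether the target keyword occurs in the text of that block. The sketch below again uses the standard-library ElementTree on a well-formed sample page instead of lxml; the function names, the sample page, and the keyword are hypothetical.

```python
import xml.etree.ElementTree as ET


def block_text(root, xpath):
    """Return all text inside the block at an absolute XPath such as /html/body/div[2]."""
    # ElementTree only accepts paths relative to an element, so drop the root step.
    steps = xpath.strip("/").split("/")[1:]
    node = root.find("./" + "/".join(steps)) if steps else root
    return "".join(node.itertext()) if node is not None else ""


def keyword_present(root, xpath, keyword):
    """True if the target keyword currently appears in the monitored block."""
    return keyword in block_text(root, xpath)


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div><h1>Cup final report</h1></div>
<div><ul><li><a>Cup final interview</a></li><li><a>Transfer news</a></li></ul></div>
</body></html>"""

root = ET.fromstring(PAGE)
print(keyword_present(root, "/html/body/div[2]", "Transfer"))
print(keyword_present(root, "/html/body/div[1]", "Transfer"))
```

A monitoring crawler would re-download the page on a schedule and run this check on each fresh copy; choosing a wider block in step 4 makes the reminder fire for more of the page.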
The invention draws the frames of the webpage blocks in different colors, each corresponding to a number, presents the correspondence between numbers and colors on the human-computer interaction interface, obtains the user's target requirement by collecting the number the user feeds back, and thus realizes human-computer communication based on fixed rules.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrases "comprising … …" or "comprising … …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the present numbers; the terms "above", "below", "within" and the like are to be understood as including the number.
Although the embodiments have been described, once the basic inventive concept is obtained, other variations and modifications of these embodiments can be made by those skilled in the art, so that the above embodiments are only examples of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes using the contents of the present specification and drawings, or any other related technical fields, which are directly or indirectly applied thereto, are included in the scope of the present invention.

Claims (7)

1. An XPath-based crawler target positioning method, characterized by comprising the following specific steps:
step 1, loading the website information and acquiring the webpage corresponding to the website address;
step 2, finding the relative position in the webpage of the existing content at the monitored position;
step 3, dividing the webpage into blocks, each of which contains the content at the monitored position;
step 4, determining the monitoring range through human-computer interaction.
2. The XPath-based crawler target positioning method of claim 1, wherein in step 1 the source code of the input webpage is crawled and the target webpage is presented visually as an embedded webpage.
3. The XPath-based crawler target positioning method of claim 1, wherein step 2 comprises selecting a position to monitor in the target webpage and inputting the text content currently at that position; and finding the XPath corresponding to that text content by traversing the DOM tree of the webpage source code.
4. The XPath-based crawler target positioning method of claim 3, wherein the XPath of the existing content at the monitored position is found in step 2 as follows: the DOM tree nodes of the HTML webpage are traversed, the tree nodes matching the input content are found, and the paths of those nodes are stored.
5. The XPath-based crawler target positioning method of claim 3, wherein the webpage is partitioned in step 3 as follows: the partitioning is based on the DOM tree structure of the HTML webpage; the nodes on the path from the root node to the leaf node represent all the webpage blocks that contain the existing content at the monitored position, and these blocks are marked in the webpage; partitioning is divided into longitudinal blocking and transverse blocking according to the number of XPaths returned in step 2; when exactly one XPath is returned, there is only one definite position and only longitudinal blocking is needed; when more than one XPath is returned, transverse blocking is performed first and longitudinal blocking second.
6. The XPath-based crawler target positioning method of claim 3, wherein step 4 specifically comprises:
step 4.1, if no XPath is returned, the content is input again;
step 4.2, if exactly one XPath is returned, the webpage is presented in blocks according to step 3, and the user inputs the number representing the desired webpage block, which identifies the webpage block to be monitored;
step 4.3, if more than one XPath is returned, the first interaction presents the transverse blocks, the second interaction presents the longitudinal blocks, and the user selects the exact monitoring position.
7. The XPath-based crawler target positioning method of claim 1, wherein step 3 is specifically: based on the existing content at the monitored position and on the webpage structure, the ranges containing the content at the monitored position are delimited in the webpage by longitudinal and transverse blocking and are marked in different colors.
CN202011287213.9A 2020-11-17 2020-11-17 XPath-based crawler target positioning method Pending CN112347332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011287213.9A CN112347332A (en) 2020-11-17 2020-11-17 XPath-based crawler target positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011287213.9A CN112347332A (en) 2020-11-17 2020-11-17 XPath-based crawler target positioning method

Publications (1)

Publication Number Publication Date
CN112347332A true CN112347332A (en) 2021-02-09

Family

ID=74364091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011287213.9A Pending CN112347332A (en) 2020-11-17 2020-11-17 XPath-based crawler target positioning method

Country Status (1)

Country Link
CN (1) CN112347332A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8166054B2 (en) * 2008-05-29 2012-04-24 International Business Machines Corporation System and method for adaptively locating dynamic web page elements
CN102831121A (en) * 2011-06-15 2012-12-19 阿里巴巴集团控股有限公司 Method and system for extracting webpage information
US20130024441A1 (en) * 2011-07-22 2013-01-24 Alibaba Group Holding Limited Configuring web crawler to extract web page information
CN104965901A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Method and apparatus for grabbing content of target page
CN107220250A (en) * 2016-03-21 2017-09-29 北大方正集团有限公司 A kind of template configuration method and system
CN106294885A (en) * 2016-10-09 2017-01-04 华东师范大学 A kind of data collection towards isomery webpage and mask method
CN108733405A (en) * 2017-04-13 2018-11-02 富士通株式会社 The method and apparatus that training webpage distribution indicates model
CN107729475A (en) * 2017-10-16 2018-02-23 深圳视界信息技术有限公司 Web page element acquisition method, device, terminal and computer-readable recording medium
CN110110198A (en) * 2017-12-28 2019-08-09 中移(苏州)软件技术有限公司 A kind of method for abstracting web page information and device
CN110134841A (en) * 2018-02-09 2019-08-16 鼎复数据科技(北京)有限公司 The customized real-time method for obtaining website data
CN109325204A (en) * 2018-09-13 2019-02-12 武汉伯远生物科技有限公司 Web page contents extraction method
CN110222251A (en) * 2019-05-27 2019-09-10 浙江大学 A kind of Service encapsulating method based on Web-page segmentation and searching algorithm
CN110390038A (en) * 2019-07-25 2019-10-29 中南民族大学 Segment method, apparatus, equipment and storage medium based on dom tree

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李桐宇: "面向领域的网页内容提取及语义标签生成框架", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
陈晓雷: "自适应Web数据抽取技术研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210209