CN112347332A

CN112347332A - XPath-based crawler target positioning method

Info

Publication number: CN112347332A
Application number: CN202011287213.9A
Authority: CN
Inventors: 乜鹏; 王锐璇; 董佳霖; 陈楚翘; 郑羽辰
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-02-09

Abstract

The invention belongs to the technical fields of computer Web technology and information capture, and particularly relates to a crawler target positioning method based on the web page path language XPath. The method comprises the following specific steps: step 1, loading website information and acquiring the web page corresponding to the website address; step 2, finding the relative position of the existing content in the web page according to the monitoring position; step 3, dividing the web page into blocks, each of which contains the content of the monitoring position; and step 4, determining the monitoring range through human-computer interaction. The invention meets users' actual needs for monitoring and acquiring information (news, notices, and other content). It partitions the web page into blocks based on the page's tree structure, and accurately locates what the user requires by presenting the blocks visually and through human-computer interaction.

Description

XPath-based crawler target positioning method

Technical Field

The invention belongs to the technical field of computer WEB and the technical field of information capture, and particularly relates to a crawler target positioning method based on a webpage path XPath.

Background

As more and more information is brought into people's lives, it becomes more difficult to obtain accurate and effective information in time. In this era of "information explosion", it is becoming increasingly difficult to accurately obtain a variety of information such as education, shopping, news, etc. that are of interest to individuals.

The positioning method of the crawler target is of great significance for information capture, monitoring changes in web page information, and the like. The RSS-based website subscription services that were popular a few years ago notify users when website content is updated, but they do not use a crawler target positioning method, so their update reminders are not targeted. Most web page target positioning methods developed later identify a position block by its attributes, such as id and class; these methods are constrained by the naming conventions of the page source code and do not meet users' real needs.

Disclosure of Invention

To address these problems, the invention provides a crawler target positioning method that locates the target accurately and in accordance with the user's intent.

In order to achieve the purpose, the invention provides the following technical scheme:

A crawler target positioning method based on XPath comprises the following specific steps:

step 1, loading website information and acquiring the web page corresponding to the website address;

step 2, finding the relative position of the existing content in the web page according to the monitoring position;

step 3, dividing the web page into blocks, each of which contains the content of the monitoring position;

step 4, determining the monitoring range through human-computer interaction.

In a further refinement of the technical solution, in step 1 the source code of the entered web page is crawled and the target web page is presented visually as an embedded page.

In a further refinement of the technical solution, step 2 comprises: selecting a position to monitor in the target web page and entering the text content that currently exists at that position; and finding the XPath corresponding to that text content by traversing the DOM tree of the web page source code.

In a further refinement of the technical solution, the specific method for finding the XPath of the existing content at the monitoring position in step 2 is: traversing the DOM tree nodes of the HTML web page, finding the tree nodes that match the entered content, and storing the paths of those nodes.

In a further refinement of the technical solution, the web page is partitioned in step 3 as follows: the partitioning is based on the DOM tree structure of the HTML web page; the path from the root node to a leaf node represents all the web page blocks that contain the existing content of the monitoring position, and these blocks are marked in the web page; blocking is divided into longitudinal blocking and transverse blocking according to the number of XPaths returned in step 2; when exactly one XPath is returned, there is only one definite position and only longitudinal blocking is needed; when more than one XPath is returned, transverse blocking is performed first, followed by longitudinal blocking.

In a further refinement of the technical solution, step 4 specifically comprises:

step 4.1, if the returned XPath is empty, the user enters the content again;

step 4.2, if exactly one XPath is returned, the web page is presented in blocks according to step 3, and the user enters the number of the web page block to be monitored, thereby feeding back the required block;

step 4.3, if more than one XPath is returned, the first interaction presents the transverse blocks and the second presents the longitudinal blocks, and the user selects the exact monitoring position.

In a further refinement of the technical solution, step 3 is specifically: based on the existing content of the monitoring position and the structure of the web page, the ranges containing that content are delimited in the web page by longitudinal and transverse blocking and are marked in different colors.

In contrast to the prior art, the technical solution has the following beneficial effects:

The invention meets users' actual needs for monitoring and acquiring information (news, notices, and other content). It partitions the web page into blocks based on the page's tree structure, and accurately locates what the user requires by presenting the blocks visually and through human-computer interaction.

Compared with attribute-value-based methods, the method is more general and applies to all HTML-based web pages. Combined with a crawler, it enables effective information monitoring; likewise, a problem feedback module for a web page can use it to pinpoint the location of a problem, saving manpower and material resources. The invention is also user-friendly: it avoids specialist computing terminology, requires no tutorial, and its human-computer interaction is concise and easy to understand.

Drawings

FIG. 1 is a flow chart of a crawler target location method based on XPath;

FIG. 2 is a simplified web page partition and DOM tree structure diagram;

FIG. 3 is a simplified diagram of a transverse block;

FIG. 4 is a first schematic diagram of a page;

FIG. 5 is a second schematic diagram of a page;

FIG. 6 is a third schematic diagram of a page.

Detailed Description

To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

Please refer to FIG. 1, a flowchart of an XPath-based crawler target positioning method. The method specifically comprises the following steps:

Step 1, loading the website entered by the user

After the user enters a target website address, the back end downloads the HTML file of the web page using Python's requests library, and then uses regular expressions to extract the original paths of the resources referenced in the file, such as CSS, JavaScript, and images. These resources are downloaded and saved according to their paths, so that the complete original web page is presented rather than bare HTML text. The downloaded target web page is loaded into the input page so that the user can conveniently view it and make a selection.
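The resource-parsing part of step 1 can be sketched as follows. This is a minimal illustration only, not the implementation described above: the pattern RESOURCE_RE and the function extract_resource_urls are hypothetical names, the regular expression covers only link, script, and img tags, and the download itself (performed with requests in the description) is omitted so that the sketch needs nothing beyond the standard library.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the href/src value of <link>, <script> and <img> tags.
RESOURCE_RE = re.compile(
    r'<(?:link|script|img)\b[^>]*?\b(?:href|src)\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def extract_resource_urls(html, base_url):
    """Return the absolute URLs of CSS, JavaScript and image resources, in document order."""
    urls = []
    for raw in RESOURCE_RE.findall(html):
        absolute = urljoin(base_url, raw)  # resolve relative paths against the page URL
        if absolute not in urls:           # keep each resource once
            urls.append(absolute)
    return urls
```

Each returned URL would then be fetched and saved under a local path mirroring its original path, so that the saved HTML file can be displayed together with its styles, scripts, and images.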

Step 2, finding the XPath of the existing content of the monitoring position

Determining the relative position of the existing content in the web page from the monitoring position entered by the user is one of the key steps of the invention.

In an HTML web page, the DOM represents the HTML document as a tree structure. By traversing the nodes of the DOM tree, the method finds the nodes whose content matches the content entered by the user, records their paths in the HTML code of the page, and expresses these paths in XPath. A path expressed in XPath runs from the root node <html> to the node containing the content in the DOM tree of the HTML web page. The HTML document, held as a string, is converted into an _Element object using the etree.HTML function of Python's lxml library, so that the xpath method can be used to evaluate XPath expressions, and the find method of etree's ElementTree can be used to search for the matching XPath.
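The traversal in step 2 can be illustrated with a dependency-free sketch. Note the substitution: the description uses lxml, whereas this example uses the standard library's html.parser so that it runs without third-party packages. The class TextXPathFinder and the function find_xpaths are hypothetical names, and every path step is written with an explicit index, whereas the paths quoted in this description omit the index in places.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore never contain text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextXPathFinder(HTMLParser):
    """Collects the XPath of every element whose text contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []      # path steps of the currently open elements
        self.counts = [{}]   # counts[d]: tag -> number of children seen at depth d
        self.matches = []

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return  # counted as a sibling, but it cannot hold text
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("[")[0] == tag:
                del self.stack[i:]
                del self.counts[i + 1:]
                break

    def handle_data(self, data):
        if self.needle in data and self.stack:
            path = "/" + "/".join(self.stack)
            if path not in self.matches:
                self.matches.append(path)


def find_xpaths(html, needle):
    """Return the XPaths of all elements whose own text contains `needle`."""
    finder = TextXPathFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.matches
```

An empty result corresponds to the case in which the entered content does not appear in the page, one path to a single definite position, and several paths to content that occurs at several positions.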

Step 3, presenting the webpage blocks and all selectable monitoring ranges

The invention divides web page blocking into two types, longitudinal blocking and transverse blocking, and uses the two modes to handle different situations. Longitudinal blocking distinguishes the parent and child nodes along an XPath path; transverse blocking marks the different positions that correspond to different content. The two are defined below.

Step 3.1, longitudinal blocking

In an HTML web page, the start and end of each tag delimit an element, and the web page can be partitioned on that basis. The XPath obtained in the previous step represents the path from the root node <html> to the node containing the content, and each node on the path corresponds to a block in the page. For example, on the Newcastle web, the path of a news headline in the web page is /html/body/div[9]/div[6]/div[2]/div/div[2]/h1[1]; its block diagram and simplified path diagram are shown in FIG. 2, a simplified diagram of web page blocks and the DOM tree structure. Along this path in the DOM tree, the web page range of a child node is a subset of that of its parent node, and each node on the path represents one block range of the web page. The invention parses the XPath using Python's convenient string slicing.
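The slicing of one XPath into nested blocks can be sketched as below. The function name ancestor_paths and the min_depth parameter are assumptions made for illustration; starting at depth 2 reflects the example in this description, in which the outermost block is /html/body.

```python
def ancestor_paths(xpath, min_depth=2):
    """Split an XPath into its prefixes, outermost block first.

    Each prefix is the path of one ancestor element, i.e. one longitudinal block.
    """
    steps = [step for step in xpath.split("/") if step]
    return ["/" + "/".join(steps[:depth])
            for depth in range(min_depth, len(steps) + 1)]
```

Because each prefix is the path of a parent of the next one, the list is ordered from the widest block to the narrowest, which is the element that holds the content itself.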

For visual presentation, a frame is added around each block. The specific technique is to split the XPath path into parts; for example, XPath(content) is split into /html/body, /html/body/div[9], …, /html/body/div[9]/div[6]/div[2]/div/div[2]/h1[1]. The web page blocks are then framed in color through CSS selectors, whose specific rules are detailed in Table 1; each selector applies a style of the form {border: red solid thick;}, and the result is displayed to the user for selection.

TABLE 1 CSS selector specific rules Table
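One possible way to frame a block is to translate its XPath into a CSS selector and attach a border declaration. The sketch below is an assumption about how this could be done, not a reproduction of the rules in Table 1: the names STEP_RE, xpath_to_css, and border_rule are hypothetical, only simple steps of the form tag or tag[n] are handled, and the choice of color is left to the caller.

```python
import re

# One XPath step: a tag name with an optional positional index, e.g. "div" or "div[9]".
STEP_RE = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def xpath_to_css(xpath):
    """Translate a simple absolute XPath into an equivalent CSS selector."""
    parts = []
    for step in xpath.strip("/").split("/"):
        match = STEP_RE.match(step)
        if match is None:
            raise ValueError(f"unsupported XPath step: {step!r}")
        tag, index = match.groups()
        # div[9] selects the ninth div child, which CSS writes as div:nth-of-type(9).
        parts.append(f"{tag}:nth-of-type({index})" if index else tag)
    return " > ".join(parts)


def border_rule(xpath, color):
    """Build a CSS rule that draws a thick solid border around the block."""
    return f"{xpath_to_css(xpath)} {{ border: {color} solid thick; }}"
```

Generating one such rule per block, each with its own color, would give the user a numbered set of colored frames to choose from.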

Step 3.2, transverse blocking

Transverse blocking marks the different positions that correspond to different content. For example, on the Newcastle disease network, if the entered content is "Liipu", the content appears at more than one position, namely /html/body/div[10]/div[7]/div[1]/div[5]/div[2]/ul/li[11]/a and /html/body/div[10]/div[14]/div[2]/div[1]/li[4]/a. Because such blocks generally do not intersect, this case can be represented by the simplified diagram shown in FIG. 3.
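Transverse blocking can be sketched as numbering the matched positions and, if desired, checking that the candidate blocks do not contain one another. The function names transverse_blocks and is_nested are hypothetical, and numbering the candidates from 1 is an assumption about how the choices are shown to the user.

```python
def is_nested(outer, inner):
    """True if the element at `inner` is the element at `outer` or lies inside it."""
    return inner == outer or inner.startswith(outer + "/")


def transverse_blocks(xpaths):
    """Number each matched position from 1 so that the user can pick one."""
    return {number: path for number, path in enumerate(xpaths, start=1)}
```

When no candidate is nested in another, the blocks are disjoint, which matches the observation above that such blocks generally do not intersect.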

The partitioning method is based on the way HTML is organized: each element corresponds to a range in the web page, so analyzing the elements also divides the web page into ranges.

One advantage of the method is that it uses both transverse and longitudinal blocking and locates the target in two dimensions, so the crawler target is determined more accurately.

Another advantage is that it incorporates interaction with the user, so the selected monitoring range is closer to what the user needs.

Step 4, determining the actual monitoring range through human-computer interaction

Human-computer interaction is mainly used to determine the specific position the user wants to monitor: the system presents the candidate blocks, the user selects the specific range to monitor, and the user's requirement is located accurately in question-and-answer form.

After analyzing the target website and the existing content of the monitoring position, the invention performs different operations depending on the number of XPaths returned.

Step 4.1, the XPath returned in step 2 is empty

The user is reminded that the entered content does not appear on the website and should be checked, and the process jumps back to step 2 for the content to be entered again.

Step 4.2, exactly one XPath is returned in step 2

Longitudinal blocking is performed directly according to step 3 and presented to the user; the user enters the number of the web page block to be monitored, thereby feeding back the required block.

Step 4.3, more than one XPath is returned in step 2

In this case the content appears at more than one position in the web page, and two rounds of human-computer question and answer are needed. Assuming n (n > 1) XPaths are returned, the first interaction presents the transverse blocks of step 3 and the user selects one of the n separate blocks; the second interaction presents the longitudinal blocks of step 3 for the selected block, and the user selects the exact position to be monitored.
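The three cases of step 4 can be summarized in a small sketch. It is illustrative only: resolve_target and ancestor_paths are hypothetical names, the user's answers are passed in as plain integers instead of being read interactively, and numbering the longitudinal blocks from the outermost block (1) inwards is an assumption.

```python
def ancestor_paths(xpath, min_depth=2):
    """Prefixes of an XPath, outermost block first (the longitudinal blocks)."""
    steps = [step for step in xpath.split("/") if step]
    return ["/" + "/".join(steps[:depth])
            for depth in range(min_depth, len(steps) + 1)]


def resolve_target(xpaths, longitudinal_choice, transverse_choice=None):
    """Map the user's numbered answers to the XPath of the block to monitor."""
    if not xpaths:
        # Step 4.1: nothing matched, so the content has to be entered again.
        raise LookupError("content not found in the page; please enter it again")
    if len(xpaths) == 1:
        # Step 4.2: one definite position, only the longitudinal choice is needed.
        chosen = xpaths[0]
    else:
        # Step 4.3: first pick one of the n positions, then a longitudinal block.
        if transverse_choice is None:
            raise ValueError("several positions matched; a transverse choice is required")
        chosen = xpaths[transverse_choice - 1]
    return ancestor_paths(chosen)[longitudinal_choice - 1]
```

The returned path identifies the block whose content a crawler would subsequently watch for changes.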

Specific examples are as follows:

1. user input

Target website url: http:// sports

Monitoring the existing content of the position: plum blossom: the two people are the best chief in the heart of the people who teach the people to give a lot of people

Target keywords: c Rou

Because the "existing content of the monitoring position" entered by the user is complete, only one XPath is returned. As shown in FIG. 4, the first schematic page diagram, the returned XPath is /html/body/div[4]/div[5]/div[2]/div/ul[1]/li[2]/a, and longitudinal blocking can be performed directly.

If the "existing content of the monitoring position" entered by the user is not complete enough, for example only the two characters "Meixi", several positions are found in the web page and several results are returned. In this case transverse blocking is performed first, as shown in FIG. 5, the second schematic page diagram.

Suppose the user selects 1; longitudinal blocking is then performed, as shown in FIG. 6, the third schematic page diagram. The user can now determine the final monitoring position by choosing from 1 to 4. When the two characters "C Rou" appear within the monitoring range, the user is reminded.

The invention frames the web page blocks in different colors that correspond to different numbers, intelligently presents the correspondence between the numbers and the colors on the human-computer interaction interface, obtains the user's target requirement by collecting the numbers the user feeds back, and thus realizes human-computer communication based on fixed rules.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrases "comprising … …" or "comprising … …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the present numbers; the terms "above", "below", "within" and the like are to be understood as including the number.

Although the embodiments have been described, once the basic inventive concept is obtained, other variations and modifications of these embodiments can be made by those skilled in the art, so that the above embodiments are only examples of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes using the contents of the present specification and drawings, or any other related technical fields, which are directly or indirectly applied thereto, are included in the scope of the present invention.

Claims

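1. A crawler target positioning method based on XPath, characterized by comprising the following specific steps:

step 1, loading website information and acquiring the web page corresponding to the website address;

step 2, finding the relative position of the existing content in the web page according to the monitoring position;

step 3, dividing the web page into blocks, each of which contains the content of the monitoring position;

step 4, determining the monitoring range through human-computer interaction.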

2. The XPath-based crawler targeting method of claim 1, wherein said step 1 visually presents the destination web page in the form of an embedded web page by crawling the input page source code.

3. The XPath-based crawler targeting method of claim 1, wherein said step 2 comprises selecting a monitored location based on a destination web page and entering existing text at the location; and finding the XPath corresponding to the text content by traversing the DOM tree structure of the webpage source code.

4. A method for XPath-based crawler targeting as recited in claim 3 wherein said step 2 specific method for finding the monitoring location existing content XPath is: traversing DOM tree nodes of the HTML framework webpage, finding tree nodes matched with the input content, and storing paths of the tree nodes.

5. The XPath-based crawler targeting method of claim 3, wherein said step 3 web blocking method is: the webpage blocking technology is based on a DOM tree structure of an HTML framework webpage, a path from a root node to a leaf node represents all webpage blocks containing the existing content of a monitoring position, and the webpage blocks are marked in the webpage; dividing the webpage blocks into longitudinal blocks and transverse blocks according to the number of XPath returned in the step 2; when the XPath number is just 1, only one determined position is shown, and only longitudinal partitioning is needed; when the number of XPath is more than 1, it is necessary to perform transverse blocking first and then perform longitudinal blocking.

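6. An XPath-based crawler targeting method as recited in claim 3, wherein said step 4 specifically comprises:

step 4.1, if the returned XPath is empty, the user enters the content again;

step 4.2, if exactly one XPath is returned, the web page is presented in blocks according to step 3, and the user enters the number of the web page block to be monitored, thereby feeding back the required block;

step 4.3, if more than one XPath is returned, the first interaction presents the transverse blocks and the second presents the longitudinal blocks, and the user selects the exact monitoring position.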

7. The XPath-based crawler target positioning method of claim 1, wherein said step 3 is specifically: according to the existing content of the monitoring position, according to the webpage structure, the range containing the content of the monitoring position is divided in the webpage in a longitudinal blocking mode and a transverse blocking mode, and the range is marked with different colors.