Background
As government information disclosure and open-data efforts expand, more and more government information is published on government websites, forming a large body of public government-website information. Existing government websites are built, maintained and managed by the individual departments of each government, and being able to obtain this public information conveniently and quickly from them is of great value to users.
However, these government websites differ in content and in web page structure, so crawling them requires professional technicians to analyze the structure of each web page in order to locate and crawl the required content. The reason is as follows:
the xpath path of the required content differs from one web page to another and must be analyzed manually before crawling, which costs a great deal of time and labor and makes the work heavy and complex. This approach is clearly inefficient when faced with thousands of government websites.
The present invention relates to the following technical terms:
1. Crawling refers to accessing a website and acquiring information from its web pages, thereby collecting web page data.
2. xpath (XML Path Language) is a language for looking up information in a web page (especially an XML document); it is used to traverse the elements and attributes of the document and to determine the location of a particular part of an XML or Html document.
3. Scrapy is a fast, high-level screen-scraping and web-crawling framework (crawler framework) written in Python, used to crawl websites and extract structured data from web pages. Scrapy has a wide range of applications, including data mining, monitoring and automated testing. In a crawler script based on the Scrapy framework, the most critical step is to identify the xpath path of the required content in the web page so that the specified web page content can be crawled.
4. An internet crawler is a program or script that automatically captures World Wide Web information according to certain rules. There are two main modes:
the first is whole-web crawling, as performed by search engines such as Baidu;
the second is directional crawling for a certain category, that is, crawling specified web page content (the target content of a specified web page).
However, in the directional crawling mode, as described above, the web page layouts of government websites are relatively disorderly. To acquire the xpath path of the specified web page content (the required content in the web page), a professional who already has the url address of the web page must inspect the web page source code and analyze it before the correct xpath path can be obtained.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method and a system for automatically acquiring an xpath path and generating a crawler script.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a method for automatically acquiring an xpath path and generating a crawler script, comprising the following steps:
step 1, acquiring the url address of a webpage, opening the webpage through the url address, and traversing all <a> tags in the webpage;
the <a> tag is used to define a hyperlink;
step 2, extracting the xpath path corresponding to each <a> tag;
step 3, grouping the <a> tags according to the following principle: <a> tags with the same xpath path are placed in one group; then counting the number of <a> tags in each group;
step 4, taking one <a> tag from each group and opening the webpage it links to;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage;
the number of characters refers to the number of characters in the text of the <a> tags;
step 6, selecting the group with the most characters and the fewest <a> tags, and recording the corresponding xpath path;
step 7, generating a crawler script from the recorded xpath path based on the Scrapy framework.
Further, in the method described above, the webpage in step 1 is a webpage containing a subject information list.
Further, in the method described above, step 2 specifically comprises:
step 2.1, acquiring the parent tag of the <a> tag through the jsoup package;
step 2.2, recursively acquiring the parent tag of each parent tag;
step 2.3, stopping when the acquired parent tag is <html>;
step 2.4, concatenating all the acquired parent tags in order to obtain the xpath path of the <a> tag.
Further, in the method described above, step 5 specifically comprises: acquiring all <a> tags and their text() content through jsoup, and counting the number of <a> tags and the number of characters of the text() content.
Further, in the method described above, step 7 specifically comprises: passing the url address of the webpage from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.
A system for automatically acquiring an xpath path and generating a crawler script, for implementing the above method, comprises:
a tag traversal module, configured to open the webpage corresponding to the url address and traverse all <a> tags in the webpage;
an xpath path generation module, configured to extract the xpath path corresponding to each <a> tag;
a tag grouping module, configured to group the <a> tags and count the number of <a> tags in each group;
a linked webpage acquisition module, configured to open the linked webpage according to the <a> tag;
an information counting module, configured to count the number of <a> tags and the number of characters in the webpage;
an xpath path judging module, configured to determine the xpath path corresponding to the group with the most characters and the fewest <a> tags;
and a crawler script generation module, configured to generate a crawler script according to the corresponding xpath path.
Further, in the system described above, the crawler script generation module generates the corresponding Scrapy crawler script from the url address and the xpath path of the webpage, based on the Scrapy framework.
The invention has the following beneficial effects: with this method, the xpath path can be acquired automatically from the url address of the webpage alone, without the participation of professional technicians.
With this method, the xpath path of the required content in the webpage is located automatically and the crawler script is generated, so that crawler management is simplified and automated. Previously, technicians had to write the crawler scripts by hand and could maintain only nearly a hundred crawler scripts a month; with this method, an ordinary person can maintain thousands of crawler scripts, which greatly improves work efficiency.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Scrapy is currently a mainstream crawler framework. To crawl, it needs the unique Html locator, i.e. the xpath path, of the content to be crawled in the webpage (for example, a subject information list). Therefore, one purpose of the invention is to automatically identify the xpath path of the subject information list in the webpage, crawl only the information in that list, and filter out the other information in the webpage.
The method of the invention is mainly based on the following observation: government websites share a relatively common structure for web pages that contain public government-website information. Such a web page largely consists of menu navigation, notification announcements, a subject information list, other navigation, advertisements and other links, and so on. However, the xpath path of the subject information list differs from one web page to another. The method provided by the invention automatically screens out the xpath path of the subject information list, automatically generates a crawler script, and crawls the information in the subject information list.
Fig. 1 is a flowchart of a method for automatically acquiring an xpath path and generating a crawler script according to an embodiment of the present invention. The method mainly includes the following steps:
step 1, acquiring the url address of a webpage, opening the webpage through the url address, and traversing all <a> tags in the webpage;
the <a> tag is used to define a hyperlink;
the webpage is a webpage containing a subject information list; the url addresses of such webpages can be collected and sorted manually, or preset according to the architectures of different government websites;
the specific algorithm for traversing all <a> tags in a webpage is as follows: calling a method of the jsoup package to acquire all <a> tags in the webpage together with their contents; jsoup is a package for parsing web pages, written in Java, which provides DOM-like and CSS-selector-like ways to find and extract content in a document;
step 2, extracting the xpath path corresponding to each <a> tag;
the specific algorithm is as follows:
step 2.1, acquiring the parent tag of the <a> tag through the jsoup package;
step 2.2, recursively acquiring the parent tag of each parent tag;
step 2.3, stopping when the acquired parent tag is <html>;
step 2.4, concatenating all the acquired parent tags in order to obtain the xpath path of the <a> tag;
step 3, grouping the <a> tags according to the following principle: <a> tags with the same xpath path are placed in one group; then counting the number of <a> tags in each group;
step 4, taking one <a> tag from each group and opening the webpage it links to;
because of the grouping in step 3, the linked webpages opened from any <a> tag in a group are treated as equivalent, so one <a> tag per group is sufficient;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage;
the number of characters refers to the number of characters in the text of the <a> tags;
the specific algorithm is as follows: acquiring all <a> tags and their text() content through jsoup, and counting the number of <a> tags and the number of characters of the text() content;
step 6, selecting the group with the most characters and the fewest <a> tags, and recording the corresponding xpath path;
step 7, generating a crawler script from the recorded xpath path based on the Scrapy framework, where the crawler script is a Scrapy crawler script;
the specific algorithm is as follows: passing the url address of the webpage from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script;
a Scrapy crawler script needs only the url address and the xpath path of the webpage, and its other content is essentially fixed, so the script can be generated through the Scrapy framework once the url and the xpath are obtained.
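Because only the url and the xpath vary between scripts, step 7 can be illustrated as filling a fixed template. The following is a minimal sketch under stated assumptions, not the generator used by the invention: the template text, the class name TopicListSpider, the function generate_spider and the example url are all hypothetical, while scrapy.Spider, start_urls, response.xpath and response.urljoin are standard Scrapy names.

```python
from string import Template

# Hypothetical fixed part of a Scrapy spider; only name, url and xpath vary.
SPIDER_TEMPLATE = Template('''\
import scrapy


class TopicListSpider(scrapy.Spider):
    name = $name
    start_urls = [$url]

    def parse(self, response):
        # Each node selected by the xpath is one <a> tag of the list.
        for link in response.xpath($xpath):
            yield {
                "title": link.xpath("string(.)").get(),
                "href": response.urljoin(link.xpath("@href").get()),
            }
''')


def generate_spider(name, url, xpath):
    """Return the text of a spider script for the given url and xpath."""
    # repr() produces valid Python string literals, so quotes inside the
    # xpath (for example @id="newslist") do not break the generated code.
    return SPIDER_TEMPLATE.substitute(
        name=repr(name), url=repr(url), xpath=repr(xpath)
    )


if __name__ == "__main__":
    # The url below is a placeholder, not an address from the source.
    print(generate_spider(
        "gov_news",
        "http://www.example.gov.cn/news/",
        '//*[@id="newslist"]/ul/li[1]/span[1]/a',
    ))
```

The generated text can be saved as a .py file inside a Scrapy project; generating text rather than building a spider object keeps the sketch independent of whether Scrapy is installed where the generator runs.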
Verification on hundreds of government web pages shows that, for 80% of the web pages containing public government-website information, the method can automatically obtain the xpath path corresponding to the subject information list and generate a crawler script. The average time consumed is about 1 minute, which shows that the method is feasible and efficient.
However, the method does not cover the special architectures of some websites. For these, the algorithms of steps 5 and 6 can be improved case by case, by summarizing the architectural characteristics of such websites, so that the xpath path is acquired accurately. For example, on some web pages the list contains only a few links while the other links are numerous; such special cases are outside the scope of the present invention and are not described in detail.
One embodiment is as follows.
Fig. 3 shows an example of a web page of a government website. The following can be seen in this example:
menu navigation at the top,
other navigation on the left-hand side,
advertisements and other links in the lower portion,
a subject information list in the middle.
The url address of the web page is known.
The method comprises the following specific steps:
step 1, after the webpage is opened, traversing all <a> tags in the webpage;
the webpage contains the following <a> tags:
<a href="default/news/news/news/nid/5701">
<a href="/defaults/news/news/nid/5693" title="establishing a period government procurement project video monitoring system special equipment procurement bid notice">
<a href="default/news/news/news/nid/5692">
(there are too many tags to enumerate them all)
step 2, extracting the xpath path corresponding to each <a> tag; the xpath paths are as follows:
//*[@id="nav_right"]/ul[2]/li[1]/a
//*[@id="newslist"]/ul/li[1]/span[1]/a
//*[@id="newstype_list"]/dl/dt[1]/a
//*[@id="footer"]/div[2]/ul/li[1]/a
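The parent-walking idea of step 2 can be sketched with the Python standard library instead of jsoup. This is an illustrative substitute, not the jsoup-based procedure of the source: the class AnchorPathCollector and the function anchor_paths are hypothetical names, and the sketch produces plain tag chains such as /html/body/div/ul/li/a, without the id predicates and positional indexes seen in the paths listed above. It also assumes reasonably well-formed Html in which end tags are not omitted.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AnchorPathCollector(HTMLParser):
    """Record, for every <a> tag, the chain of its ancestors from <html>."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.anchors = []   # list of (path, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            # Joining the open tags is equivalent to walking up the parents
            # of the <a> tag until <html> and connecting them in order.
            path = "/" + "/".join(self.stack)
            self.anchors.append((path, dict(attrs).get("href")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def anchor_paths(html):
    """Return (path, href) for every <a> tag in the given Html text."""
    collector = AnchorPathCollector()
    collector.feed(html)
    return collector.anchors
```

Keeping a stack of open tags while parsing gives the same ancestor chain as a recursive parent lookup, but needs no document tree in memory.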
step 3, grouping the <a> tags by identical xpath path; the grouping results are as follows:
Web page content | Xpath path | Number of <a> tags
Menu navigation | //*[@id="nav_right"]/ul[2]/li[1]/a | 6
Subject information list | //*[@id="newslist"]/ul/li[1]/span[1]/a | 15
Other navigation | //*[@id="newstype_list"]/dl/dt[1]/a | 5
Advertisements and other links | //*[@id="footer"]/div[2]/ul/li[1]/a | 5
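The grouping of step 3 amounts to bucketing the <a> tags by their xpath string. A minimal sketch follows, assuming each tag is represented as an (xpath, href) pair; the function name group_by_xpath and that pair representation are hypothetical and not taken from the source.

```python
from collections import defaultdict


def group_by_xpath(anchors):
    """Group (xpath, href) pairs so that identical xpaths share one group."""
    groups = defaultdict(list)
    for xpath, href in anchors:
        groups[xpath].append(href)
    return dict(groups)


def group_sizes(groups):
    """Return the number of <a> tags in each group."""
    return {xpath: len(hrefs) for xpath, hrefs in groups.items()}
```

Any member of a group can then serve as the representative link opened in step 4, for example the first href of each list.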
step 4, taking one <a> tag from each group and opening the webpage it links to; in this embodiment there are four groups, and the four corresponding web pages are shown in fig. 4;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage; the statistical results are as follows:
Web page content | Number of <a> tags | Number of web page characters
Menu navigation | 42 | 222
Subject information list | 33 | 1607
Other navigation | 59 | 836
Advertisements and other links | 44 | 283
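The counting in step 5 can be sketched with the Python standard library in place of jsoup. The text describes the character count both as the text of the <a> tags and as the number of web page characters, so this sketch reports both figures and leaves the choice open. The names PageStats and page_stats are hypothetical, and skipping script and style content is an assumption of the sketch.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count <a> tags, characters inside <a> tags, and all text characters."""

    def __init__(self):
        super().__init__()
        self.a_count = 0       # number of <a> tags
        self.a_chars = 0       # characters of text inside <a> tags
        self.page_chars = 0    # characters of all text in the page
        self._a_depth = 0      # greater than 0 while inside an <a> tag
        self._skip_depth = 0   # greater than 0 inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_count += 1
            self._a_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # code and style rules are not page text
        n = len(data.strip())
        self.page_chars += n
        if self._a_depth:
            self.a_chars += n


def page_stats(html):
    """Return (a_count, a_chars, page_chars) for the given Html text."""
    stats = PageStats()
    stats.feed(html)
    return stats.a_count, stats.a_chars, stats.page_chars
```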
step 6, selecting the group with the most characters and the fewest <a> tags, and recording the corresponding xpath path;
according to the table, the subject information list group satisfies the condition of having the most characters (1607) and the fewest <a> tags (33), so its path is the required xpath path;
step 7, generating a crawler script according to the xpath path corresponding to the subject information list.
In this way, the xpath path of the subject information list is determined from the number of web page characters and the number of <a> tags of each group, so that ordinary personnel can complete crawler management without professional technicians having to identify the path.
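The selection rule can be sketched as a ranking over the per-group statistics. The text does not state how the two criteria are combined when no single group is best on both, so this sketch assumes the character count is compared first and a smaller number of <a> tags breaks ties. The function name pick_subject_list is hypothetical; the figures are those of the embodiment.

```python
def pick_subject_list(stats):
    """Pick the xpath whose linked page has the most characters.

    stats maps xpath -> (a_count, char_count). Ties on char_count are
    broken in favour of fewer <a> tags (an assumption of this sketch).
    """
    return max(stats, key=lambda xp: (stats[xp][1], -stats[xp][0]))


# Figures taken from the two tables of the embodiment.
EMBODIMENT_STATS = {
    '//*[@id="nav_right"]/ul[2]/li[1]/a': (42, 222),
    '//*[@id="newslist"]/ul/li[1]/span[1]/a': (33, 1607),
    '//*[@id="newstype_list"]/dl/dt[1]/a': (59, 836),
    '//*[@id="footer"]/div[2]/ul/li[1]/a': (44, 283),
}

if __name__ == "__main__":
    print(pick_subject_list(EMBODIMENT_STATS))
```

In the embodiment the subject information list is best on both criteria at once, so the choice of tie-breaking order does not affect the result there.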
Corresponding to the method shown in fig. 1, an embodiment of the present invention further provides a system for automatically acquiring an xpath path and generating a crawler script. As shown in fig. 2, the system includes:
a tag traversal module, configured to open the webpage corresponding to the url address and traverse all <a> tags in the webpage;
an xpath path generation module, configured to extract the xpath path corresponding to each <a> tag;
a tag grouping module, configured to group the <a> tags and count the number of <a> tags in each group;
a linked webpage acquisition module, configured to open the linked webpage according to the <a> tag;
an information counting module, configured to count the number of <a> tags and the number of characters in the webpage;
an xpath path judging module, configured to determine the xpath path corresponding to the group with the most characters and the fewest <a> tags;
and a crawler script generation module, configured to generate a crawler script according to the corresponding xpath path.
On the basis of this technical scheme, the crawler script generation module generates the corresponding Scrapy crawler script from the url address and the xpath path of the webpage, based on the Scrapy framework.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.