Background Art
As open-data initiatives advance and the disclosure of government information expands, more and more government information is published on government websites, forming a massive body of public government information. Existing government websites are built and maintained by the various departments of governments at all levels. Being able to obtain public information from these websites easily and quickly would be of great value to users.
However, the content and page structure of these government websites differ. When a current internet crawler (also called a web crawler, or simply a crawler) crawls a government website, a professional technician must analyze the structure of each web page in order to locate the required content and crawl it. This is because:
The xpath path of the required content differs from web page to web page, so the xpath path of the required content has to be worked out manually for each crawl. This obviously takes a great deal of time and labor; the workload is heavy and the work is tedious. Faced with thousands of government websites, this approach is clearly inefficient.
The present invention involves the following technical terms:
1. Crawling: accessing a website and obtaining information from its web pages, that is, collecting web page data.
2. xpath: a language, used by crawlers, for locating information in a web page (especially an XML document) by traversing its elements and attributes. xpath is the XML path language; it can also be applied to Html documents to identify a particular part of the document.
3. Scrapy: a fast, high-level screen-scraping and web-crawling framework (a crawler framework developed in Python), used to crawl web sites and extract structured data from their pages. Scrapy is widely used, for example for data mining, monitoring and automated testing. In a crawler script based on the Scrapy framework, the most critical step is identifying the xpath path of the required content in the web page, so that the specified web page content can be crawled.
4. Internet crawler: a program or script that automatically crawls web information according to certain rules. It works mainly in two ways:
The first is whole-web crawling, as performed by search engines such as Baidu;
The second is directed crawling aimed at a particular category. Directed crawling means crawling specified web page content (the targeted content of a specified web page).
For directed crawling, however, as noted above, the page layouts of government websites are rather disorderly. To obtain the xpath path of the specified web page content (the required content in the web page), a professional technician, given the url of the web page, must inspect the page source and analyze it to arrive at the correct xpath path.
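To make the manual step concrete, the following minimal sketch shows what evaluating a hand-written xpath path against a page looks like. The page fragment, the id value `newslist` and the link targets are hypothetical, and the sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real government pages would need a tolerant Html parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a government list page.
PAGE = """
<html><body>
  <div id="nav"><ul><li><a href="/home">Home</a></li></ul></div>
  <div id="newslist"><ul>
    <li><span><a href="/news/1">Procurement notice A</a></span></li>
    <li><span><a href="/news/2">Procurement notice B</a></span></li>
  </ul></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# The path a technician would write by hand after reading the page source.
# ElementTree understands this limited XPath form (attribute predicate on *).
links = root.findall(".//*[@id='newslist']/ul/li/span/a")

titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]
print(titles, hrefs)
```

Only the two links inside the list container are selected; the navigation link is ignored. Finding that one path by hand for every site is the cost described above.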
Summary of the Invention
In view of the defects in the prior art, the object of the present invention is to provide a method and system for automatically obtaining xpath paths to generate crawler scripts. With this method, public information on government websites can be crawled based on the Scrapy framework, the xpath path of the required content in a web page can be parsed automatically, and the level of automation of crawler management is raised.
To achieve the above object, the present invention adopts the following technical solution:
A method for automatically obtaining xpath paths to generate crawler scripts, comprising the following steps:
Step 1, obtain the url of a web page, open the web page via the url, and traverse all <a> tags in the web page;
The <a> tag is used to define a hyperlink;
Step 2, extract the xpath path corresponding to each <a> tag;
Step 3, group the <a> tags according to the following principle: tags with the same xpath path are placed in one group; then count the number of <a> tags in each group;
Step 4, take one <a> tag from each group and open the web page it links to;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page;
The word count refers to the word count of the <a> tags;
Step 6, select the group with the highest word count and the lowest number of <a> tags, and record its xpath path;
Step 7, based on the Scrapy framework, generate a crawler script from the recorded xpath path.
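Steps 3 and 4 amount to a group-by on the xpath path followed by picking one representative link per group. The sketch below illustrates this under stated assumptions: the input is a list of (xpath path, href) pairs already produced by step 2, the sample paths and hrefs are hypothetical, and the first link of each group is taken as the representative (the method only requires that one be taken).

```python
from collections import defaultdict


def group_by_xpath(links):
    """Step 3: group (xpath, href) pairs so equal xpath paths share a group."""
    groups = defaultdict(list)
    for xpath, href in links:
        groups[xpath].append(href)
    return dict(groups)


def summarize(groups):
    """Return {xpath: (number of <a> tags, representative href for step 4)}."""
    return {xpath: (len(hrefs), hrefs[0]) for xpath, hrefs in groups.items()}


# Hypothetical output of step 2 for a small page.
links = [
    ("/html/body/div/ul/li/a", "/about"),
    ("/html/body/div/div/ul/li/span/a", "/news/1"),
    ("/html/body/div/div/ul/li/span/a", "/news/2"),
    ("/html/body/div/div/ul/li/span/a", "/news/3"),
    ("/html/body/div/ul/li/a", "/contact"),
]

summary = summarize(group_by_xpath(links))
for xpath, (count, representative) in summary.items():
    print(xpath, count, representative)
```

Each representative href is the link that step 4 would open.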
Further, in the method for automatically obtaining xpath paths to generate crawler scripts as described above, the web page in step 1 is a web page containing a subject information list.
Further, in the method for automatically obtaining xpath paths to generate crawler scripts as described above, step 2 specifically comprises:
Step 2.1, obtain the parent tag of the <a> tag using the jsoup package;
Step 2.2, recurse, that is, obtain the parent tag of each parent tag in turn;
Step 2.3, stop when the parent tag obtained is <html>;
Step 2.4, concatenate all the parent tags obtained, in order, to give the xpath path of the <a> tag.
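The parent-walking procedure of steps 2.1 to 2.4 can be sketched as follows. This is an illustration only and not the jsoup implementation the text refers to: it swaps in Python's standard `html.parser` and keeps a stack of open tags, which yields the same chain of ancestors from <html> down to each <a> tag. The class name `AnchorPathParser`, the list of void elements and the sample page are assumptions, and the sketch presumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class AnchorPathParser(HTMLParser):
    """Record, for every <a> tag, the chain of tag names from <html> to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # (path, href) for each <a> tag encountered

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        if tag == "a":
            path = "/" + "/".join(self.stack)
            self.paths.append((path, dict(attrs).get("href")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


SAMPLE = ('<html><body><div><ul>'
          '<li><a href="/x">X</a></li><li><a href="/y">Y</a></li>'
          '</ul></div><p><a href="/z">Z</a><br></p></body></html>')

parser = AnchorPathParser()
parser.feed(SAMPLE)
parser.close()
paths = parser.paths
print(paths)
```

Because the path built this way carries no positional indices, sibling links in the same list share one path, which is what allows them to fall into one group later.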
Further, in the method for automatically obtaining xpath paths to generate crawler scripts as described above, step 5 specifically comprises: obtaining all <a> tags and their corresponding text() content using jsoup, and counting the number of <a> tags and the word count of the text() content.
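A minimal sketch of the step 5 statistics follows. It again uses Python's standard `html.parser` in place of jsoup, and it makes two assumptions that the text does not fix: the word count is taken as the number of characters of text inside <a> elements, and whitespace is not counted. The names `AnchorStats` and `anchor_stats` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorStats(HTMLParser):
    """Count <a> tags and the characters of text found inside them."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside an <a> element
        self.a_count = 0
        self.char_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_count += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            # Assumption: whitespace does not count towards the word count.
            self.char_count += len("".join(data.split()))


def anchor_stats(html):
    """Return (number of <a> tags, character count of their text)."""
    parser = AnchorStats()
    parser.feed(html)
    parser.close()
    return parser.a_count, parser.char_count


stats = anchor_stats('<p>intro text</p><a href="/1">Hello world</a> <a href="/2">abc</a>')
print(stats)
```

For Chinese pages, counting characters rather than space-separated words is the natural reading of the word count, which is why the sketch measures string length.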
Further, in the method for automatically obtaining xpath paths to generate crawler scripts as described above, step 7 specifically comprises: passing the url of the web page from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.
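One way to picture step 7 is as filling a fixed script template with the two values that vary. The sketch below is an assumption about how this could be done, not the procedure the text specifies: the template, the spider class name `ListSpider`, the function `render_spider` and the example url are all hypothetical. The generated script uses standard Scrapy spider conventions (`scrapy.Spider`, `start_urls`, `parse`, `response.xpath`), and rendering it does not require Scrapy to be installed.

```python
from string import Template

# Fixed part of the crawler script; only name, url and xpath vary.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ListSpider(scrapy.Spider):
    name = $name
    start_urls = [$url]

    def parse(self, response):
        # Yield one item per link found at the recorded xpath path.
        for link in response.xpath($xpath):
            yield {
                "title": link.xpath("string(.)").get(),
                "url": response.urljoin(link.xpath("@href").get(default="")),
            }
''')


def render_spider(name, url, xpath):
    """Return the source of a Scrapy crawler script for one list page."""
    # repr() quotes the values safely, including xpath paths containing quotes.
    return SPIDER_TEMPLATE.substitute(
        name=repr(name), url=repr(url), xpath=repr(xpath))


xpath = '//*[@id="newslist"]/ul/li/span/a'
source = render_spider("gov_list", "http://example.gov/list", xpath)
compile(source, "gov_list_spider.py", "exec")  # syntax check only
print(source)
```

Saving the rendered text to a file gives a script that could then be run with Scrapy's own tooling.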
A system for automatically obtaining xpath paths to generate crawler scripts, for implementing the above method, comprising:
an <a> tag traversal module, for opening the web page at the url and traversing all <a> tags in the web page;
an xpath path generation module, for extracting the xpath path corresponding to each <a> tag;
a tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;
a linked web page acquisition module, for opening the web page linked from an <a> tag;
an information statistics module, for counting the number of <a> tags and the word count in a web page;
an xpath path discrimination module, for determining the xpath path of the group with the highest word count and the lowest number of <a> tags;
a crawler script generation module, for generating a crawler script from the corresponding xpath path.
Further, in the system for automatically obtaining xpath paths to generate crawler scripts as described above, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, from the url and xpath path of the web page.
The beneficial effects of the present invention are as follows: with this method, the xpath path can be obtained automatically from nothing more than the url of the web page, without the involvement of a professional technician.
The method automatically locates the xpath path of the required content in a web page and generates the crawler script, making crawler management simpler and automated. Previously, crawler scripts had to be written by professional technicians, and only about a hundred crawler scripts could be maintained in a month; with this method, an ordinary person can maintain thousands of crawler scripts, which greatly improves work efficiency.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Scrapy is currently the mainstream crawler framework. Before it can crawl, it needs the unique Html locator, that is, the xpath path, of the required content in the web page (the content to be crawled, for example a subject information list). One object of the present invention is therefore to identify automatically the xpath path at which the subject information list is located in a web page, to crawl only the information in the subject information list, and to filter out the other information in the web page.
The main idea of the method of the present invention is as follows: broadly speaking, the web pages of government websites that contain public government information share a certain common structure. A basic web page of this kind mostly includes: menu navigation, notices and announcements, a subject information list, other navigation, advertisements and other links. However, the xpath path of the subject information list differs from web page to web page. The method provided by the present invention automatically screens out the xpath path of the subject information list, automatically generates a crawler script, and crawls the information in the subject information list.
Fig. 1 shows the flow chart of a method for automatically obtaining xpath paths to generate crawler scripts provided in a specific embodiment of the present invention. The method mainly comprises the following steps:
Step 1, obtain the url of a web page, open the web page via the url, and traverse all <a> tags in the web page;
The <a> tag is used to define a hyperlink;
The web page is a web page containing a subject information list. The urls of such web pages can be collected and organized manually, or preset according to the structure of the different government websites;
All <a> tags in the web page are traversed as follows: call the methods of the jsoup package to obtain all <a> tags in the web page and their content. jsoup is a package for parsing web pages, developed in Java, that finds and extracts content in a document in a manner similar to DOM and CSS selectors;
Step 2, extract the xpath path corresponding to each <a> tag;
The specific algorithm is as follows:
Step 2.1, obtain the parent tag of the <a> tag using the jsoup package;
Step 2.2, recurse, that is, obtain the parent tag of each parent tag in turn;
Step 2.3, stop when the parent tag obtained is <html>;
Step 2.4, concatenate all the parent tags obtained, in order, to give the xpath path of the <a> tag;
Step 3, group the <a> tags according to the following principle: tags with the same xpath path are placed in one group, that is, the <a> tags whose xpath paths are identical form one group; then count the number of <a> tags in each group;
Step 4, take one <a> tag from each group and open the web page it links to;
Owing to the grouping in step 3, the linked web pages opened from the <a> tags within any one group are all of the same kind, so opening one per group is sufficient;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page;
The word count refers to the word count of the <a> tags;
The specific algorithm is as follows: obtain all <a> tags and their corresponding text() content using jsoup, and count the number of <a> tags and the word count of the text() content;
The number of <a> tags is the count of <a> tags;
The word count is the number of characters of text;
Step 6, select the group with the highest word count and the lowest number of <a> tags, and record its xpath path;
Step 7, based on the Scrapy framework, generate a crawler script from the recorded xpath path; the crawler script here is a Scrapy crawler script;
The specific algorithm is as follows: pass the url of the web page from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script;
A Scrapy crawler script needs only the url and xpath path of the web page; the rest of its content is essentially fixed, so once the url and xpath path have been obtained the Scrapy crawler script can be generated through the Scrapy framework.
The method of the present invention was verified on more than a hundred government web pages. For 80% of the web pages containing public government information, it was able to open the page, analyze it automatically and obtain the xpath path of the subject information list, and so generate the crawler script. The method takes about 1 minute on average, which shows that it is feasible and reasonably efficient.
However, some websites do have unusual structures, and for these the method still cannot obtain the xpath path with complete accuracy. In such cases, the algorithms of steps 5 and 6 can be improved by summarizing the structural features of these websites, so that the xpath path is obtained accurately. For example, on some web pages the list contains only a few links while the other links are comparatively numerous. Such special cases are not discussed further in the present invention and are not described in detail here.
A specific embodiment follows.
Fig. 3 shows an example web page from a government website. Visible in the example are:
the menu navigation at the top,
the other navigation on the left,
the advertisements and other links at the bottom,
the subject information list in the middle.
The url of this web page is known.
The specific steps are as follows:
Step 1, after opening the web page, traverse all <a> tags in the web page;
The web page contains the following <a> tags:
<a href="/defaults/news/news/nid/5701" title="The 2nd Foreign Language Inst. of Beijing training institute five-in-one multifunctional laboratory construction government procurement project award announcement">
<a href="/defaults/news/news/nid/5693" title="Beijing prison renovation and expansion security system construction phase government procurement project, video surveillance system special equipment purchase award announcement">
<a href="/defaults/news/news/nid/5692" title="State-owned Assets Supervision and Administration Commission of the Beijing Municipal People's Government, funded enterprises 2017-2019 annual accounts audit, consulting and quality inspection government procurement project award announcement">
(there are too many tags to list them all)
Step 2, extract the xpath path corresponding to each <a> tag; the xpath paths of the <a> tags are as follows:
//*[@id="nav_right"]/ul[2]/li[1]/a
//*[@id="newslist"]/ul/li[1]/span[1]/a
//*[@id="newstype_list"]/dl/dt[1]/a
//*[@id="footer"]/div[2]/ul/li[1]/a
Step 3, group the <a> tags whose xpath paths are identical; the grouping result is shown in the following table:
Web page content | xpath path | Number of <a> tags
Menu navigation | //*[@id="nav_right"]/ul[2]/li[1]/a | 6
Subject information list | //*[@id="newslist"]/ul/li[1]/span[1]/a | 15
Other navigation | //*[@id="newstype_list"]/dl/dt[1]/a | 5
Advertisements and other links | //*[@id="footer"]/div[2]/ul/li[1]/a | 5
Step 4, take one <a> tag from each group and open the web page it links to. This specific embodiment has four groups, so four corresponding web pages are opened, as shown in Fig. 4;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page; the statistics are shown in the following table:
Web page content | Number of <a> tags | Web page word count
Menu navigation | 42 | 222
Subject information list | 33 | 1607
Other navigation | 59 | 836
Advertisements and other links | 44 | 283
Step 6, select the group with the highest word count and the lowest number of <a> tags, and record its xpath path;
According to the table above, the group that satisfies "highest word count and lowest number of <a> tags" is the one whose xpath path corresponds to the subject information list;
Step 7, generate the crawler script from the xpath path of the subject information list.
As can be seen, the xpath path of the subject information list is determined from the web page word count and the number of <a> tags of each group, so no professional technician is needed for the identification and an ordinary person can carry out crawler management.
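The selection in step 6 can be checked against the figures of this embodiment with a few lines of code. The sketch below encodes the four groups from the tables above; the tuple type `GroupStats`, the function `pick_list_group` and the tie-breaking rule (rank by word count first, then by fewer <a> tags) are assumptions, since the text does not say what to do when the two criteria disagree. In this embodiment they agree.

```python
from typing import NamedTuple


class GroupStats(NamedTuple):
    label: str
    xpath: str
    a_count: int      # number of <a> tags on the linked web page
    word_count: int   # word count of the linked web page


# Figures taken from the tables of this embodiment.
STATS = [
    GroupStats("Menu navigation", '//*[@id="nav_right"]/ul[2]/li[1]/a', 42, 222),
    GroupStats("Subject information list", '//*[@id="newslist"]/ul/li[1]/span[1]/a', 33, 1607),
    GroupStats("Other navigation", '//*[@id="newstype_list"]/dl/dt[1]/a', 59, 836),
    GroupStats("Advertisements and other links", '//*[@id="footer"]/div[2]/ul/li[1]/a', 44, 283),
]


def pick_list_group(stats):
    """Step 6: highest word count, fewer <a> tags as the tie-breaker."""
    return max(stats, key=lambda s: (s.word_count, -s.a_count))


best = pick_list_group(STATS)
fewest_links = min(STATS, key=lambda s: s.a_count)
print(best.label, best.xpath)
```

Here the group with the highest word count is also the one with the fewest <a> tags, so both criteria point to the subject information list.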
Corresponding to the method shown in Fig. 1, an embodiment of the present invention also provides a system for automatically obtaining xpath paths to generate crawler scripts. As shown in Fig. 2, the system comprises:
an <a> tag traversal module, for opening the web page at the url and traversing all <a> tags in the web page;
an xpath path generation module, for extracting the xpath path corresponding to each <a> tag;
a tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;
a linked web page acquisition module, for opening the web page linked from an <a> tag;
an information statistics module, for counting the number of <a> tags and the word count in a web page;
an xpath path discrimination module, for determining the xpath path of the group with the highest word count and the lowest number of <a> tags;
a crawler script generation module, for generating a crawler script from the corresponding xpath path.
On the basis of the above technical solution, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, from the url and xpath path of the web page.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.