CN107943838A

CN107943838A - Method and system for automatically obtaining xpath to generate crawler scripts

Info

Publication number: CN107943838A
Application number: CN201711034452.1A
Authority: CN
Inventors: 姬永杰; 陈国强; 王长勇; 任建新
Original assignee: Beijing Great Digital Science And Technology Development Co Ltd
Current assignee: Beijing Great Digital Science And Technology Development Co Ltd
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2018-04-20
Anticipated expiration: 2037-10-30
Also published as: CN107943838B

Abstract

The invention discloses a method and system for automatically obtaining an xpath and generating crawler scripts. The method comprises the following steps: (1) open a web page via its url address and traverse all <a> tags in the page; (2) extract the xpath path corresponding to each <a> tag; (3) place tags with identical xpath paths in one group, then count the number of <a> tags in each group; (4) take one <a> tag from each group and open the web page it links to; (5) for each web page opened in step 4, count the number of <a> tags and the number of characters in the page; (6) select the group with the most characters and the fewest <a> tags, and record its corresponding xpath path; (7) based on the Scrapy framework, generate a crawler script from the corresponding xpath path. Based on the Scrapy framework, the method can crawl public information from government websites, automatically parse the xpath path of the required content in a web page, and raise the level of automation of crawler management.

Description

Method and system for automatically obtaining xpath to generate crawler scripts

Technical field

The present invention relates to the technical field of web crawlers, and in particular to a method and system for automatically obtaining an xpath and generating crawler scripts. Here, "xpath" refers to an xpath path.

Background

As government information disclosure and data openness continue to expand, more and more government information is published on government websites, forming a massive body of public government information. Government websites are established, maintained, and managed by departments at every level of government; obtaining their public information conveniently and quickly would bring great value to users.

However, the content of these websites varies and their page structures differ. When a current Internet crawler (also called a web crawler, or simply a crawler) crawls a government website, a professional technician must analyze the structure of each page in order to locate and crawl the required content. This is because:

The xpath path of the required content differs from page to page and must be parsed manually before crawling. This clearly takes a great deal of time and manpower; the workload is heavy and the work tedious. Faced with thousands of government websites, this approach is clearly inefficient.

The present invention involves the following technical terms:

1. Crawling refers to accessing a website and obtaining information from its web pages, thereby collecting web page data.

2. XPath is a language for locating information in a web page (especially an XML document), used to traverse the elements and attributes of the page. As used here, XPath serves as a path language for HTML: it identifies a particular part of an HTML document.

3. Scrapy is a fast, high-level screen scraping and web crawling framework (a crawler framework developed in Python), used to crawl websites and extract structured data from their pages. Scrapy is widely used, including for data mining, monitoring, and automated testing. In a crawler script based on the Scrapy framework, the most critical step is identifying the xpath path of the required content in the web page, so that the specified content can be crawled.

4. An Internet crawler is a program or script that automatically crawls web information according to certain rules. It mainly works in two ways:

The first is whole-web crawling, as done by search engines such as Baidu;

The second is targeted crawling of a certain category; targeted crawling means crawling specified web page content (the required content of a given page).

For targeted crawling, however, as noted above, the page layouts of government websites are rather disorderly. To obtain the xpath path of the specified content (the required content in the page), a professional technician must, given the url address of the page, inspect the page source and analyze it before the correct xpath path is obtained.

Summary of the invention

In view of the defects in the prior art, the object of the present invention is to provide a method and system for automatically obtaining an xpath and generating crawler scripts. Based on the Scrapy framework, the method can crawl public information from government websites, automatically parse the xpath path of the required content in a web page, and raise the level of automation of crawler management.

To achieve the above object, the present invention adopts the following technical solution:

A method for automatically obtaining an xpath and generating crawler scripts, comprising the following steps:

Step 1: obtain the url address of a web page, open the web page via the url address, and traverse all <a> tags in the web page;

The <a> tag is used to define a hyperlink;

Step 2: extract the xpath path corresponding to each <a> tag;

Step 3: group the <a> tags by the following principle: tags with identical xpath paths are placed in one group; then count the number of <a> tags in each group;

Step 4: take one <a> tag from each group and open the web page it links to;

Step 5: for each web page opened in step 4, count the number of <a> tags and the number of characters in the page;

The number of characters refers to the number of text characters of the <a> tags;

Step 6: select the group with the most characters and the fewest <a> tags, and record its corresponding xpath path;

Step 7: based on the Scrapy framework, generate a crawler script from the corresponding xpath path.

Further, in the method described above, the web page in step 1 is a web page containing a subject information list.

Further, in the method described above, step 2 specifically comprises:

Step 2.1: use the jsoup package to obtain the parent tag of the <a> tag;

Step 2.2: recurse, that is, obtain the parent tag of each parent tag in turn;

Step 2.3: stop when the parent tag obtained is <html>;

Step 2.4: join all the parent tags obtained, in order, to yield the xpath path of the <a> tag.

Further, in the method described above, step 5 specifically comprises: obtaining all <a> tags and their text() content via jsoup, and counting the number of <a> tags and the number of characters in the text() content.

Further, in the method described above, step 7 specifically comprises: sending the url address of the web page from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.

A system for automatically obtaining an xpath and generating crawler scripts, used to implement the above method, comprising:

an <a> tag traversal module, for opening the web page at the url address and traversing all <a> tags in the page;

an xpath path generation module, for extracting the xpath path corresponding to each <a> tag;

a tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;

a linked page acquisition module, for opening the linked web page according to an <a> tag;

an information statistics module, for counting the number of <a> tags and the number of characters in a web page;

an xpath path discrimination module, for determining the xpath path of the group with the most characters and the fewest <a> tags;

a crawler script generation module, for generating a crawler script from the corresponding xpath path.

Further, in the system described above, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.

The beneficial effects of the present invention are as follows: with this method, the xpath path can be obtained automatically from the url address of the web page alone, without the participation of a professional technician.

The method automatically locates the xpath path of the required content in a web page and generates the crawler script, making crawler management simpler and automated. Previously, crawler scripts had to be written by a professional technician, who could maintain only about a hundred crawler scripts per month; with this method, an ordinary person can maintain thousands of crawler scripts, greatly improving work efficiency.

Brief description of the drawings

Fig. 1 is a flow chart of a method for automatically obtaining an xpath and generating crawler scripts, provided in a specific embodiment of the present invention;

Fig. 2 is a structural diagram of a system for automatically obtaining an xpath and generating crawler scripts, provided in a specific embodiment of the present invention;

Fig. 3 is an example web page from a government website;

Fig. 4 is a schematic diagram of the web pages opened via the <a> tags in the page shown in Fig. 3.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

Scrapy is currently a mainstream crawler framework. Before it can crawl, it needs the unique HTML locator, that is, the xpath path, of the required content in the web page (the content to be crawled, for example the subject information list). One object of the present invention is therefore to automatically identify the xpath path of the subject information list in a web page, so that only the information in the subject information list is crawled and the other information in the page is filtered out.

The main idea of the method is as follows. Web pages on government websites that contain public government information share certain structural commonalities. A basic page of this kind mostly includes: menu navigation, notices and announcements, a subject information list, other navigation, advertisements, and other links. The xpath path of the subject information list, however, differs from page to page. The method provided by the present invention automatically screens out the xpath path of the subject information list, automatically generates a crawler script, and crawls the information in the subject information list.

Fig. 1 shows the flow chart of a method for automatically obtaining an xpath and generating crawler scripts, provided in a specific embodiment of the present invention. The method mainly includes the following steps:

The <a> tag is used to define a hyperlink;

The web page is a web page containing a subject information list. The url addresses of such pages can be collected and organized manually, or preset according to the architecture of different government websites;

The specific algorithm for traversing all <a> tags in the web page is as follows: call the methods of the jsoup package to obtain all <a> tags in the page and their content. jsoup is a web page parsing package developed in Java; it finds and extracts content in a document using DOM-like methods and CSS selectors.
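The traversal in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it swaps jsoup for Python's standard-library `html.parser`, and the names `AnchorCollector`, `collect_anchors`, and `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href and visible text of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []        # list of (href, text) tuples
        self._href = None        # href of the <a> currently open, if any
        self._text = []          # text fragments seen inside that <a>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def collect_anchors(html):
    """Return (href, text) for each <a> tag, in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Hypothetical sample page standing in for a downloaded government web page.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
  <ul id="newslist"><li><a href="/news/1.html">Notice one</a></li></ul>
</body></html>
"""
anchors = collect_anchors(SAMPLE_HTML)
```

In a real run the HTML would come from opening the url address in step 1; the sample string only keeps the sketch self-contained.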

Step 2: extract the xpath path corresponding to each <a> tag;

The specific algorithm is as follows:

Step 2.1: use the jsoup package to obtain the parent tag of the <a> tag;

Step 2.3: stop when the parent tag obtained is <html>;

Step 2.4: join all the parent tags obtained, in order, to yield the xpath path of the <a> tag.
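One way to picture the parent-chain construction of step 2 is the sketch below. It is an assumption-laden stand-in: instead of calling jsoup to walk upward from each <a> tag, it keeps a stack of the currently open tags while parsing with Python's `html.parser`, which yields the same chain of ancestors from <html> down to <a>. The path uses tag names only (no `@id` predicates or positional indices), and `AnchorPathParser` and `anchor_paths` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never become parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AnchorPathParser(HTMLParser):
    """Record the chain of ancestor tags, from <html> down, for each <a>."""

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of currently open tags (the ancestors)
        self.paths = []        # one path string per <a> tag, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._open_tags.append(tag)
        if tag == "a":
            # Join all ancestors in order, ending with the <a> tag itself.
            self.paths.append("/" + "/".join(self._open_tags))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass


def anchor_paths(html):
    parser = AnchorPathParser()
    parser.feed(html)
    parser.close()
    return parser.paths


sample = ('<html><body><div><ul>'
          '<li><a href="/1">x</a></li><li><a href="/2">y</a></li>'
          '</ul></div><p><a href="/3">z</a></p></body></html>')
paths = anchor_paths(sample)
```

Here the two list links share the path `/html/body/div/ul/li/a`, while the link inside the paragraph gets `/html/body/p/a`, which is what makes grouping by path meaningful.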

Step 3: group the <a> tags by the following principle: tags with identical xpath paths are placed in one group, that is, <a> tags whose xpath paths are identical form one group; then count the number of <a> tags in each group.
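The grouping and counting of step 3 amounts to a dictionary keyed by path. A minimal sketch, assuming each <a> tag has already been reduced to a hypothetical `(xpath, href)` pair; the function name and sample data are illustrative only:

```python
from collections import defaultdict


def group_by_path(anchor_records):
    """Group hrefs by the xpath path of the <a> tag they came from."""
    groups = defaultdict(list)
    for path, href in anchor_records:
        groups[path].append(href)
    return dict(groups)


# Hypothetical (xpath, href) pairs produced by steps 1 and 2.
records = [
    ("/html/body/div/ul/li/a", "/news/1.html"),
    ("/html/body/div/ul/li/a", "/news/2.html"),
    ("/html/body/div/ul/li/a", "/news/3.html"),
    ("/html/body/div/a", "/home"),
    ("/html/body/div/a", "/about"),
]
groups = group_by_path(records)

# Number of <a> tags per group, as counted at the end of step 3.
counts = {path: len(hrefs) for path, hrefs in groups.items()}
```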

Step 4: take one <a> tag from each group and open the web page it links to;

Because of the grouping in step 3, the linked web pages opened by the <a> tags within any one group are all of the same kind, so one tag per group suffices.
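Picking one representative link per group, as in step 4, might look like the sketch below. It only resolves the link to an absolute address with the standard-library `urljoin`; actually downloading the page is left out. The rule of skipping `#` and `javascript:` links, the function name, and the example addresses are assumptions, not part of the described method.

```python
from urllib.parse import urljoin


def representative_links(page_url, groups):
    """Return one absolute link per xpath group (the first usable href)."""
    reps = {}
    for path, hrefs in groups.items():
        # Assumption: in-page anchors and script links are not worth opening.
        usable = [h for h in hrefs
                  if h and not h.startswith(("#", "javascript:"))]
        if usable:
            reps[path] = urljoin(page_url, usable[0])
    return reps


# Hypothetical list page address and groups from step 3.
page_url = "http://www.example.gov.cn/list/index.html"
groups = {
    "/html/body/ul/li/a": ["../news/1.html", "../news/2.html"],
    "/html/body/div/a": ["#", "/about"],
}
reps = representative_links(page_url, groups)
```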

The number of characters refers to the number of text characters of the <a> tags;

The specific algorithm is as follows: obtain all <a> tags and their text() content via jsoup, and count the number of <a> tags and the number of characters in the text() content;

The number of <a> tags is the count of <a> tags;

The number of characters is the character count of the text.
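The counting in step 5 can be sketched as below. The text describes the word (character) count as that of the <a> tags' text() content, while the embodiment's table labels it as a count for the web page, so this sketch reports both and leaves the choice open. It uses Python's `html.parser` rather than jsoup, ignores whitespace and script/style content by assumption, and `PageStats` and `page_stats` are hypothetical names.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count <a> tags and non-whitespace text characters in a page."""

    def __init__(self):
        super().__init__()
        self.a_count = 0        # number of <a> tags
        self.anchor_chars = 0   # characters of text inside <a> tags
        self.page_chars = 0     # characters of all visible text
        self._anchor_depth = 0
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_count += 1
            self._anchor_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        n = len("".join(data.split()))  # drop all whitespace before counting
        self.page_chars += n
        if self._anchor_depth:
            self.anchor_chars += n


def page_stats(html):
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.a_count, parser.anchor_chars, parser.page_chars


sample = ('<html><body><h1>Notice</h1><p>Full text of the notice.</p>'
          '<a href="/">Home</a><script>var x = 1;</script></body></html>')
a_count, anchor_chars, page_chars = page_stats(sample)
```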

Step 7: based on the Scrapy framework, generate a crawler script from the corresponding xpath path; the crawler script is a Scrapy crawler script;

The specific algorithm is as follows: send the url address of the web page from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script;

A Scrapy crawler script needs only the url address and the xpath path of the web page; its other content is essentially fixed. Therefore, once the url and xpath are obtained, the Scrapy crawler script can be generated through the Scrapy framework.
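Because only the url address and the xpath path vary, script generation can be pictured as filling a fixed template. The sketch below is one possible reading, not the generator described in the source: the template layout, the yielded item fields, and the names `SPIDER_TEMPLATE` and `generate_spider` are hypothetical, while `scrapy.Spider`, `start_urls`, `response.xpath`, and `response.urljoin` in the generated text are Scrapy's own API. Generating the text does not require Scrapy to be installed.

```python
from string import Template

# Fixed part of the spider; only the ${...} placeholders change per site.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${class_name}(scrapy.Spider):
    name = ${name}
    start_urls = [${url}]

    def parse(self, response):
        # Extract only the links found under the recorded xpath.
        for link in response.xpath(${xpath}):
            yield {
                "title": link.xpath("string(.)").get(),
                "url": response.urljoin(link.xpath("@href").get()),
            }
''')


def generate_spider(name, url, xpath):
    """Return the source text of a Scrapy spider for one list page."""
    return SPIDER_TEMPLATE.substitute(
        class_name=name.title().replace("_", "") + "Spider",
        name=repr(name),     # repr() yields a safely quoted Python literal
        url=repr(url),
        xpath=repr(xpath),
    )


# Hypothetical url address (step 1) and xpath path (step 6).
script = generate_spider(
    "gov_news",
    "http://www.example.gov.cn/list/index.html",
    '//*[@id="newslist"]/ul/li/span/a',
)
```

The resulting text could be written to a `.py` file in a Scrapy project's spiders directory and run with `scrapy crawl gov_news`.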

The method of the invention was verified on more than one hundred government web pages. For 80% of the pages containing public government information, it automatically analyzed and obtained the xpath path of the subject information list and generated a crawler script. The method takes about 1 minute on average, which demonstrates that it is feasible and reasonably efficient.

However, some websites do have unusual architectures for which the method cannot obtain the xpath path with complete accuracy. In such cases the algorithms of steps 5 and 6 can be improved, by summarizing the architectural characteristics of those websites, so that the xpath path is obtained accurately. For example, in some pages the list contains only a few links while other links are comparatively numerous. Such special cases are not described in detail here.

A specific embodiment follows.

Fig. 3 shows an example web page from a government website. The following are visible in the example:

the menu navigation at the top,

other navigation on the left,

advertisements and other links at the bottom,

the subject information list in the middle.

The url address of the web page is known.

The specific steps are as follows:

Step 1: after opening the web page, traverse all <a> tags in the page;

The web page contains the following <a> tags:

(there are too many tags to list them all)

Step 2: extract the xpath path corresponding to each <a> tag; the xpath paths corresponding to the <a> tags are as follows:

//*[@id="nav_right"]/ul[2]/li[1]/a

//*[@id="newslist"]/ul/li[1]/span[1]/a

//*[@id="newstype_list"]/dl/dt[1]/a

//*[@id="footer"]/div[2]/ul/li[1]/a

Step 3: group the <a> tags by identical xpath path; the grouping results are shown in the following table:

Web page content	Xpath path	Number of <a> tags
Menu navigation	//*[@id="nav_right"]/ul[2]/li[1]/a	6
Subject information list	//*[@id="newslist"]/ul/li[1]/span[1]/a	15
Other navigation	//*[@id="newstype_list"]/dl/dt[1]/a	5
Advertisements and other links	//*[@id="footer"]/div[2]/ul/li[1]/a	5

Step 4: take one <a> tag from each group and open the web page it links to. There are four groups in this specific embodiment, so four web pages are opened, as shown in Fig. 4;

Step 5: for each web page opened in step 4, count the number of <a> tags and the number of characters in the page; the statistics are shown in the following table:

Web page content	Number of <a> tags	Number of characters in page
Menu navigation	42	222
Subject information list	33	1607
Other navigation	59	836
Advertisements and other links	44	283

According to the table above, the group that satisfies "most characters and fewest <a> tags" is the one whose xpath path corresponds to the subject information list.
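The selection over the embodiment's figures can be reproduced with a short sketch. The text does not say how "most characters" and "fewest <a> tags" are combined when they disagree, so the rule used here, the highest ratio of characters to <a> tags, is an assumption, as is the name `pick_content_xpath`. The numbers are those of the two tables in this embodiment.

```python
# xpath path of each group -> (<a> tags, characters) of its linked page,
# taken from the statistics table of this embodiment.
stats = {
    '//*[@id="nav_right"]/ul[2]/li[1]/a': (42, 222),
    '//*[@id="newslist"]/ul/li[1]/span[1]/a': (33, 1607),
    '//*[@id="newstype_list"]/dl/dt[1]/a': (59, 836),
    '//*[@id="footer"]/div[2]/ul/li[1]/a': (44, 283),
}


def pick_content_xpath(stats):
    """Pick the group whose linked page has the most text per <a> tag.

    Assumed scoring rule: characters divided by <a> tag count.
    """
    def score(path):
        a_count, chars = stats[path]
        return chars / max(a_count, 1)   # guard against pages with no links
    return max(stats, key=score)


best = pick_content_xpath(stats)
```

With these figures the subject information list wins on both criteria at once (1607 characters, 33 tags), so any reasonable combination rule gives the same result here.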

Step 7: generate the crawler script from the xpath path corresponding to the subject information list.

As can be seen, the xpath path of the subject information list is determined from the character count and <a> tag count of the web page linked from each group. No professional technician is needed for the identification, and an ordinary person can carry out crawler management.

Corresponding to the method shown in Fig. 1, an embodiment of the present invention also provides a system for automatically obtaining an xpath and generating crawler scripts. As shown in Fig. 2, the system includes:

On the basis of the above technical solution, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A method for automatically obtaining an xpath and generating crawler scripts, comprising the following steps:

Step 1: obtain the url address of a web page, open the web page via the url address, and traverse all <a> tags in the web page;

The <a> tag is used to define a hyperlink;

Step 2: extract the xpath path corresponding to each <a> tag;

Step 3: group the <a> tags by the following principle: tags with identical xpath paths are placed in one group; then count the number of <a> tags in each group;

Step 4: take one <a> tag from each group and open the web page it links to;

The number of characters refers to the number of text characters of the <a> tags;

2. The method for automatically obtaining an xpath and generating crawler scripts according to claim 1, characterized in that: in step 1, the web page is a web page containing a subject information list.

3. The method for automatically obtaining an xpath and generating crawler scripts according to claim 1, characterized in that step 2 specifically comprises:

Step 2.1: use the jsoup package to obtain the parent tag of the <a> tag;

Step 2.3: stop when the parent tag obtained is <html>;

4. The method for automatically obtaining an xpath and generating crawler scripts according to claim 1, characterized in that step 5 specifically comprises: obtaining all <a> tags and their text() content via jsoup, and counting the number of <a> tags and the number of characters in the text() content.

5. The method for automatically obtaining an xpath and generating crawler scripts according to claim 1, characterized in that step 7 specifically comprises: sending the url address of the web page from step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.

6. A system for automatically obtaining an xpath and generating crawler scripts, comprising:

an xpath path discrimination module, for determining the xpath path of the group with the most characters and the fewest <a> tags;

7. The system for automatically obtaining an xpath and generating crawler scripts according to claim 6, characterized in that: the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.