CN105955984A

CN105955984A - Network data searching method based on crawler mode

Info

Publication number: CN105955984A
Application number: CN201610242327.9A
Authority: CN
Inventors: 郑建宾; 李忠洵; 华锦芝; 高峰
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2016-04-19
Filing date: 2016-04-19
Publication date: 2016-09-21

Abstract

The invention proposes a network data searching method based on a crawler mode. The method includes: performing a configuration operation to set a configuration file; performing an initialization operation based on the configuration file to generate a regular-expression-based directory page and target page filter and a target page path template; and performing a formal crawling operation based on the directory page and target page filter and the target page template, so as to acquire data from confirmed target web pages and store the data. The disclosed method has high universality and searching accuracy, and is convenient to use.

Description

Network data searching method based on a crawler mode

Technical field

The present invention relates to data searching methods and, more particularly, to a network data searching method based on a crawler mode.

Background art

At present, as computer and network applications become increasingly widespread and the types of services in different fields become increasingly rich, searching the massive data contained in the web pages of websites to obtain target data of interest is becoming more and more important.

In existing technical solutions, data is generally collected from the web pages of websites by means of a topic-oriented focused crawler.

However, the above existing technical solutions have the following problem: because a single crawler program can only process page data of a particular structure type on a particular individual site, its universality is poor; that is, different crawler programs must be designed and written for different requirements, which raises costs and makes maintenance difficult.

Accordingly, there is the following need: to provide a network data searching method based on a crawler mode that has high universality and searching accuracy and is convenient to use.

Summary of the invention

In order to solve the problems of the above prior art, the present invention proposes a network data searching method based on a crawler mode that has high universality and searching accuracy and is convenient to use.

The object of the present invention is achieved through the following technical solutions:

A network data searching method based on a crawler mode, the method comprising the following steps:

(A1) performing a configuration operation to set a configuration file;

(A2) performing an initialization operation based on the configuration file, to generate a regular-expression-based directory page and target page filter and a target page path template;

(A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.

In the solution disclosed above, preferably, step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the link entry that triggers the crawling operation.

In the solution disclosed above, preferably, step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with target pages, so as to assist the confirmation of target pages, wherein the second configuration file is further used to define the logical relationship between the keywords.

In the solution disclosed above, preferably, step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.

In the solution disclosed above, preferably, step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information and other auxiliary configuration information.

In the solution disclosed above, preferably, step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their contents, of the web page indicated by the first of the URLs set in the third configuration file.

In the solution disclosed above, preferably, step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed, storing any two of those data items in the corresponding tag content.

In the solution disclosed above, preferably, step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "sub-tag" marker at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "sub-tag" marker should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.

In the solution disclosed above, preferably, step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and predetermined URL regular expression generation rules, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.

In the solution disclosed above, preferably, step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N pages in a manner combining breadth-first and best-first, and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N.

In the solution disclosed above, preferably, the clustering rule comprises: grouping URLs in the URL list that have the same host part into the same subset, each subset constituting a URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs; and, using the URL regular expression generation rules, generating one URL regular expression for each URL class produced by aggregating each URL cluster.

In the solution disclosed above, preferably, step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate, then generating a tree structure based on the elements in each of the two HTML files, and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.

In the solution disclosed above, preferably, step (A2) further comprises determining the exact path of each required tag as follows: after the target page path template is determined, reading each tag to be crawled from the fifth configuration file in turn and, using the tag content of that tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, saving that leaf node and the path of that node into the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes from the bottom up and performing the same action. Here, each time the traversal backtracks by one node: if the node has no same-named node among its siblings at the same level, its node name is saved directly in the node path; if there are same-named nodes at the same level, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are saved in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes at that level is saved, so that the node can be distinguished from the others and located exactly.

In the solution disclosed above, preferably, step (A3) further comprises: starting from the link entry stored in the first configuration file, performing the crawling operation in a manner combining breadth-first and best-first, and using the directory page and target page filter to filter out unneeded URLs and retain the URLs of the directory pages and target pages to be analyzed, wherein the retained URLs satisfy the target page regular expression and/or the directory page regular expression, and the pages indicated by those URLs each contain at least one of the one or more keywords; and analyzing, based on the exact paths of the required tags, the pages indicated by the retained URLs of the target pages to be analyzed, to obtain the target tag data of the required tags.

In the solution disclosed above, preferably, step (A3) further comprises: storing the target tag data and the exception data associated with the crawling operation.

The network data searching method based on a crawler mode disclosed by the present invention has the following advantages: high universality and searching accuracy, and convenience of use.

Brief description of the drawings

The technical features and advantages of the present invention will be better understood by those skilled in the art in conjunction with the accompanying drawing, in which:

Fig. 1 is a flow chart of a network data searching method based on a crawler mode according to an embodiment of the present invention.

Detailed description of the invention

Fig. 1 is a flow chart of a network data searching method based on a crawler mode according to an embodiment of the present invention. As shown in Fig. 1, the network data searching method based on a crawler mode disclosed by the present invention comprises the following steps: (A1) performing a configuration operation to set a configuration file; (A2) performing an initialization operation based on the configuration file, to generate a regular-expression-based directory page and target page filter and a target page path template; (A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the link entry that triggers the crawling operation (for example, the link entry is an initial page or a directory page related to the target pages).

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with target pages, so as to assist the confirmation of target pages (for example, the keywords are one or several keywords that appear in the source files of nearly all intended target pages and can roughly locate the target page content), wherein the second configuration file is further used to define the logical relationship between the keywords (for example, "AND", "OR", and so on).
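The keyword check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `page_matches_keywords`, the plain substring test, and the restriction to a single "AND" or "OR" relation over all keywords are hypothetical choices, since the text does not specify how the logical relationship is stored or evaluated.

```python
def page_matches_keywords(page_source: str, keywords: list[str], relation: str = "OR") -> bool:
    """Check a page source against the configured keywords.

    relation is assumed to be either "AND" (every keyword must occur)
    or "OR" (at least one keyword must occur).
    """
    hits = [keyword in page_source for keyword in keywords]
    if relation == "AND":
        return all(hits)
    if relation == "OR":
        return any(hits)
    raise ValueError(f"unsupported keyword relation: {relation}")
```

A richer configuration could nest such relations, but the text gives no detail on that, so the sketch stops at a flat list.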

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information (such as the domain name) and other auxiliary configuration information (such as the maximum number of pages to pre-crawl in the pre-crawling operation of the initialization phase).

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their contents, of the web page indicated by the first of the URLs set in the third configuration file (for example, the tag name can be chosen by the user according to the data itself, while the tag content must be copied from the target page or entered manually, and the stored data must remain consistent in form with the data on the page).

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed (such as comments), storing any two of those data items in the corresponding tag content.

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "sub-tag" marker at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "sub-tag" marker should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.
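To make the five configuration files more concrete, the following sketch gathers their contents into one in-memory structure. The class name `CrawlerConfig`, every field name, the `max_precrawl_pages` key, and the example.com URLs are hypothetical; the text does not define a file format, and sub-tag entries are left out because their separator symbol is not specified.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerConfig:
    seed_urls: list[str]               # first file: link entries that trigger the crawl
    keywords: list[str]                # second file: keywords tied to target pages
    keyword_relation: str              # second file: logical relation, e.g. "AND" / "OR"
    confirmed_target_urls: list[str]   # third file: user-confirmed target page URLs
    basic: dict[str, object]           # fourth file: domain name, pre-crawl limit, ...
    # fifth file: tag name -> sample contents copied from the first confirmed page;
    # a tag with a variable number of items (e.g. comments) stores any two of them
    tags: dict[str, list[str]] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject configurations that the described steps could not work with."""
        if not self.seed_urls:
            raise ValueError("at least one seed URL is required")
        if len(self.confirmed_target_urls) < 2:
            raise ValueError("at least two confirmed target page URLs are required")
        if self.keyword_relation not in ("AND", "OR"):
            raise ValueError("keyword_relation must be 'AND' or 'OR'")


example = CrawlerConfig(
    seed_urls=["http://example.com/"],
    keywords=["price"],
    keyword_relation="OR",
    confirmed_target_urls=["http://example.com/item/1.htm", "http://example.com/item/2.htm"],
    basic={"domain": "example.com", "max_precrawl_pages": 10000},
    tags={"title": ["Blue kettle"], "comments": ["Works well", "Arrived quickly"]},
)
example.validate()
```

The two-URL minimum in `validate` mirrors the third configuration file's requirement; the other checks are assumptions added for illustration.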

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and predetermined URL regular expression generation rules, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N (for example, N = 10000) pages in a manner combining breadth-first and best-first (for example, the best-first mode can be realized through the setting of URL weights and the judgment of keywords), and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N (for example, N = 5; this threshold is a separate value from the page count above).
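The directory page definition above lends itself to a short sketch: collect the links on a page, keep those matching the target page regular expression, and compare the count with the threshold. The names `LinkCollector` and `is_directory_page` are hypothetical, and counting distinct URLs rather than every occurrence is an assumption the text does not settle.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def is_directory_page(html: str, base_url: str, target_regex: str, threshold: int = 5) -> bool:
    """True if the page links to more than `threshold` distinct target page URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    targets = {url for url in collector.links if re.fullmatch(target_regex, url)}
    return len(targets) > threshold
```

The default threshold of 5 is the example value given in the text.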

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A2) further comprises: dividing any URL into three parts, namely host, path and query, wherein the path part consists of a series of directories and the query part consists of a series of key-value pairs (for example, if the URL is http://news.qq.com/a/20150415/044667.htm?tu_biz=1.114.1.0&du=1, then the host part is news.qq.com, the path part is /a/20150415/044667.htm, and the query part is tu_biz=1.114.1.0&du=1); on this basis, the URL regular expression generation rules are as follows:

Parameter form	Generation rule
Host part	Retain the original host string directly
Parameters at the same position in the same class are identical (except pure digits)	Retain the original form directly
Parameters at the same position in the same class are all letters	[a-zA-Z]+
Parameters at the same position in the same class are all pure digits	[0-9]+
Parameters at the same position in the same class are all dates, e.g. 2015-04-27	[0-9]+-[0-9]+-[0-9]+
Parameters at the same position in the same class are combinations of letters, digits and "_"	\w+
Parameters at the same position in the same class are combinations of letters, digits, "_", "-" and "."	[-\w\.]+
Following the path	(.[a-zA-Z]+)?/?
When none of the above forms match, use the wildcard	.*
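The rule table above can be read as an ordered list of tests applied to the values found at one position across a URL class. The sketch below applies it to the host and path parts only; the query part is omitted for brevity, and the names `RULES`, `generalize` and `url_class_to_regex`, the stripping of a trailing file extension before comparison, and the fallback for paths of unequal depth are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# (pattern every value must fully match, regex fragment emitted), in table order
RULES = [
    (r"[a-zA-Z]+", r"[a-zA-Z]+"),
    (r"[0-9]+", r"[0-9]+"),
    (r"[0-9]+-[0-9]+-[0-9]+", r"[0-9]+-[0-9]+-[0-9]+"),
    (r"\w+", r"\w+"),
    (r"[-\w.]+", r"[-\w\.]+"),
]
PATH_SUFFIX = r"(.[a-zA-Z]+)?/?"  # copied as written in the table's "following the path" row


def generalize(values: list[str]) -> str:
    """Turn the values at one position of a URL class into a regex fragment."""
    if len(set(values)) == 1 and not re.fullmatch(r"[0-9]+", values[0]):
        return re.escape(values[0])  # identical and not pure digits: keep as is
    for test, emitted in RULES:
        if all(re.fullmatch(test, value) for value in values):
            return emitted
    return ".*"  # nothing matched: wildcard


def url_class_to_regex(urls: list[str]) -> str:
    """Generate one regular expression for a class of similar URLs."""
    parts = [urlsplit(url) for url in urls]
    prefix = parts[0].scheme + "://" + re.escape(parts[0].netloc)  # host kept directly
    segment_lists = []
    for part in parts:
        segments = [s for s in part.path.split("/") if s]
        if segments:
            # the extension is covered by PATH_SUFFIX, so drop it before comparing
            segments[-1] = re.sub(r"\.[a-zA-Z]+$", "", segments[-1])
        segment_lists.append(segments)
    if len({len(segments) for segments in segment_lists}) != 1:
        return prefix + "/.*"  # positions cannot be aligned: fall back to the wildcard
    path = "/".join(generalize(list(column)) for column in zip(*segment_lists))
    return prefix + "/" + path + PATH_SUFFIX


pattern = url_class_to_regex([
    "http://news.qq.com/a/20150415/044667.htm",
    "http://news.qq.com/a/20150416/012345.htm",
])
```

Here the directory `a` is identical in both URLs and is kept literally, while the date-like and numeric segments are pure digits and are generalized to `[0-9]+`.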

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, the clustering rule comprises: grouping URLs in the URL list that have the same host part into the same subset, each subset constituting a URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs (that is, the distances between these URLs are all no greater than a preset threshold; for example, the distance between URLs can be calculated from the distances, i.e. the degrees of similarity, between their respective host, path and query parts); and, using the URL regular expression generation rules, generating one URL regular expression for each URL class produced by aggregating each URL cluster.
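A minimal sketch of the two-stage clustering follows: URLs are first grouped by host, then each cluster is split into classes whose members are pairwise within a distance threshold. The distance function here is purely hypothetical (the text only says that per-part distances may be used), as are the names `cluster_by_host`, `url_distance` and `aggregate` and the threshold value 0.25.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def cluster_by_host(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs with the same host part into one URL cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[urlsplit(url).netloc].append(url)
    return dict(clusters)


def _kind(segment: str) -> str:
    """Coarse shape of a path segment; literal words must match exactly."""
    if segment.isdigit():
        return "digits"
    if segment.isalpha():
        return "alpha:" + segment
    return "mixed"


def url_distance(a: str, b: str) -> float:
    """Assumed distance in [0, 1]: mean of a path distance and a query-key distance."""
    pa, pb = urlsplit(a), urlsplit(b)
    if pa.netloc != pb.netloc:
        return 1.0
    sa = [s for s in pa.path.split("/") if s]
    sb = [s for s in pb.path.split("/") if s]
    if len(sa) != len(sb):
        path_d = 1.0  # different depth counts as maximally different
    else:
        path_d = sum(_kind(x) != _kind(y) for x, y in zip(sa, sb)) / max(len(sa), 1)
    ka = {key for key, _ in parse_qsl(pa.query)}
    kb = {key for key, _ in parse_qsl(pb.query)}
    query_d = len(ka ^ kb) / max(len(ka | kb), 1)
    return (path_d + query_d) / 2


def aggregate(cluster: list[str], threshold: float = 0.25) -> list[list[str]]:
    """Greedy aggregator: a URL joins a class only if it is close to every member."""
    classes: list[list[str]] = []
    for url in cluster:
        for members in classes:
            if all(url_distance(url, member) <= threshold for member in members):
                members.append(url)
                break
        else:
            classes.append([url])
    return classes
```

Requiring closeness to every existing member keeps all pairwise distances within a class at or below the threshold, which matches the parenthetical condition in the text; other aggregation strategies would satisfy it as well.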

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate, then generating a tree structure based on the elements in each of the two HTML files, and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.
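The template construction above can be sketched with the standard library's HTML parser: each page becomes a tree, and the template keeps only the nodes found in both. The text does not say when two nodes count as identical, so this sketch assumes "same tag name and same attributes under matched parents" and keeps the text of the first page; `Node`, `TreeBuilder`, `parse_html` and `common_tree` are hypothetical names.

```python
from html.parser import HTMLParser

# elements that never have a closing tag
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a simple element tree, tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # close the nearest open element with this tag name, if there is one
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse_html(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def common_tree(a: Node, b: Node) -> Node:
    """New tree holding only the nodes of `a` that have a counterpart in `b`."""
    merged = Node(a.tag, a.attrs.items())
    merged.text = a.text  # keep the first page's text for later content lookup
    unused = list(b.children)
    for child in a.children:
        for other in unused:
            if child.tag == other.tag and child.attrs == other.attrs:
                unused.remove(other)
                merged.children.append(common_tree(child, other))
                break
    return merged
```

For instance, if the first page has an advertisement `div` that the second page lacks, that node is dropped from the template, while the shared title `div` and paragraph survive.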

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A2) further comprises determining the exact path of each required tag (i.e., a general path template) as follows: after the target page path template is determined, reading each tag to be crawled from the fifth configuration file in turn and, using the tag content of that tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, saving that leaf node and the path of that node into the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes from the bottom up and performing the same action. Here, each time the traversal backtracks by one node: if the node has no same-named node among its siblings at the same level, its node name is saved directly in the node path; if there are same-named nodes at the same level, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are saved in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes at that level is saved, so that the node can be distinguished from the others and located exactly.
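The path determination above might look like the following sketch, which works on a well-formed tree loaded with `xml.etree.ElementTree`. Writing the path in an XPath-like notation is an assumption (the text does not prescribe a path syntax), as are the names `step_for` and `exact_path`; the sketch returns the first matching node only rather than filling a queue.

```python
import xml.etree.ElementTree as ET


def step_for(node, parent):
    """One path step: bare name, name plus distinguishing attribute, or name plus position."""
    if parent is None:
        return node.tag
    same = [sibling for sibling in parent if sibling.tag == node.tag]
    if len(same) == 1:
        return node.tag  # no same-named sibling: the name alone is enough
    for name, value in node.attrib.items():
        if all(s is node or s.attrib.get(name) != value for s in same):
            return f'{node.tag}[@{name}="{value}"]'  # attribute tells it apart
    return f"{node.tag}[{same.index(node) + 1}]"  # fall back to relative position


def exact_path(root, content):
    """Find the node holding `content` (leaves first, then bottom-up) and return its path."""
    parent = {child: p for p in root.iter() for child in p}

    def depth(node):
        d = 0
        while node in parent:
            node, d = parent[node], d + 1
        return d

    def matches(node):
        values = [node.text or ""] + list(node.attrib.values())
        return any(content in value for value in values)

    leaves = [n for n in root.iter() if len(n) == 0]
    inner = sorted((n for n in root.iter() if len(n) > 0), key=depth, reverse=True)
    hit = next((n for n in leaves + inner if matches(n)), None)
    if hit is None:
        return None
    steps = []
    node = hit
    while node is not None:  # backtrack from the hit up to the root
        steps.append(step_for(node, parent.get(node)))
        node = parent.get(node)
    return "/" + "/".join(reversed(steps))


tree = ET.fromstring(
    "<html><body><div class='nav'>menu</div><div class='price'>42 USD</div>"
    "<ul><li>a</li><li>b</li></ul></body></html>"
)
price_path = exact_path(tree, "42 USD")
item_path = exact_path(tree, "b")
```

In this example the price `div` is distinguished from its sibling by its `class` attribute, while the second `li` has no attributes and is therefore identified by its position.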

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A3) further comprises: starting from the link entry stored in the first configuration file, performing the crawling operation in a manner combining breadth-first and best-first (for example, the best-first mode can be realized through the setting of URL weights and the judgment of keywords), and using the directory page and target page filter to filter out unneeded URLs and retain the URLs of the directory pages and target pages to be analyzed, wherein the retained URLs satisfy the target page regular expression and/or the directory page regular expression, and the pages indicated by those URLs each contain at least one of the one or more keywords; and analyzing, based on the exact paths of the required tags, the pages indicated by the retained URLs of the target pages to be analyzed, to obtain the target tag data (i.e., the tag content) of the required tags.
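One possible reading of "combining breadth-first and best-first" is a priority queue ordered by URL weight first and crawl depth second; the sketch below follows that reading. The function `crawl` and its parameters `fetch`, `extract_links` and `score` are hypothetical injection points (so that no network access is needed to try it), and exempting the seed page from the keyword check is an assumption.

```python
import heapq
import re
from itertools import count


def crawl(seed, fetch, extract_links, target_re, directory_re, keywords, score, limit=1000):
    """Return the URLs of target pages reached from `seed`.

    fetch(url) -> page text or None; extract_links(page, url) -> list of URLs;
    score(url) -> weight, where a higher weight is visited earlier.
    """
    order = count()
    # heap entries: (-weight, depth, insertion order, url)
    frontier = [(0.0, 0, next(order), seed)]
    seen = {seed}
    targets = []
    while frontier and limit > 0:
        _, depth, _, url = heapq.heappop(frontier)
        limit -= 1
        page = fetch(url)
        if page is None:
            continue
        # every retained page except the seed must contain at least one keyword
        if depth > 0 and not any(keyword in page for keyword in keywords):
            continue
        if re.fullmatch(target_re, url):
            targets.append(url)
        for link in extract_links(page, url):
            if link in seen:
                continue
            # the filter: keep only directory page and target page URLs
            if re.fullmatch(target_re, link) or re.fullmatch(directory_re, link):
                seen.add(link)
                heapq.heappush(frontier, (-score(link), depth + 1, next(order), link))
    return targets
```

Among URLs of equal weight the shallower one is popped first, which gives breadth-first behaviour; a higher weight overrides depth, which gives the best-first behaviour.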

Preferably, in the network data searching method based on a crawler mode disclosed by the present invention, step (A3) further comprises: storing the target tag data and the exception data associated with the crawling operation.

Therefore, the network data searching method based on a crawler mode disclosed by the present invention has the following advantages: high universality and searching accuracy, and convenience of use.

Although the present invention has been described by way of the above preferred embodiments, its implementation is not limited to those embodiments. It should be appreciated that those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope.

Claims

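1. A network data searching method based on a crawler mode, the method comprising the following steps:

(A1) performing a configuration operation to set a configuration file;

(A2) performing an initialization operation based on the configuration file, to generate a regular-expression-based directory page and target page filter and a target page path template;

(A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.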

2. The network data searching method based on a crawler mode according to claim 1, characterized in that step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the link entry that triggers the crawling operation.

3. The network data searching method based on a crawler mode according to claim 2, characterized in that step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with target pages, so as to assist the confirmation of target pages, wherein the second configuration file is further used to define the logical relationship between the keywords.

4. The network data searching method based on a crawler mode according to claim 3, characterized in that step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.

5. The network data searching method based on a crawler mode according to claim 4, characterized in that step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information and other auxiliary configuration information.

6. The network data searching method based on a crawler mode according to claim 5, characterized in that step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their contents, of the web page indicated by the first of the URLs set in the third configuration file.

7. The network data searching method based on a crawler mode according to claim 6, characterized in that step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed, storing any two of those data items in the corresponding tag content.

8. The network data searching method based on a crawler mode according to claim 7, characterized in that step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "sub-tag" marker at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "sub-tag" marker should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.

9. The network data searching method based on a crawler mode according to claim 8, characterized in that step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and predetermined URL regular expression generation rules, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.

10. The network data searching method based on a crawler mode according to claim 9, characterized in that step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N pages in a manner combining breadth-first and best-first, and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N.

11. The network data searching method based on a crawler mode according to claim 10, characterized in that the clustering rule comprises: grouping URLs in the URL list that have the same host part into the same subset, each subset constituting a URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs; and, using the URL regular expression generation rules, generating one URL regular expression for each URL class produced by aggregating each URL cluster.

12. The network data searching method based on a crawler mode according to claim 11, characterized in that step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate, then generating a tree structure based on the elements in each of the two HTML files, and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.

13. The network data searching method based on a crawler mode according to claim 12, characterized in that step (A2) further comprises determining the exact path of each required tag as follows: after the target page path template is determined, reading each tag to be crawled from the fifth configuration file in turn and, using the tag content of that tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, saving that leaf node and the path of that node into the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes from the bottom up and performing the same action, wherein, each time the traversal backtracks by one node: if the node has no same-named node among its siblings at the same level, its node name is saved directly in the node path; if there are same-named nodes at the same level, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are saved in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes at that level is saved, so that the node can be distinguished from the others and located exactly.

14. The network data searching method based on a crawler mode according to claim 13, characterized in that step (A3) further comprises: starting from the link entry stored in the first configuration file, performing the crawling operation in a manner combining breadth-first and best-first, and using the directory page and target page filter to filter out unneeded URLs and retain the URLs of the directory pages and target pages to be analyzed, wherein the retained URLs satisfy the target page regular expression and/or the directory page regular expression, and the pages indicated by those URLs each contain at least one of the one or more keywords; and analyzing, based on the exact paths of the required tags, the pages indicated by the retained URLs of the target pages to be analyzed, to obtain the target tag data of the required tags.

15. The network data searching method based on a crawler mode according to claim 14, characterized in that step (A3) further comprises: storing the target tag data and the exception data associated with the crawling operation.