CN105760545A

CN105760545A - Configuration rule based website data search method

Info

Publication number: CN105760545A
Application number: CN201610152001.7A
Authority: CN
Inventors: 赵海兵
Original assignee: Hunan Zingrow Information Technology Co Ltd
Current assignee: Hunan Zingrow Information Technology Co Ltd
Priority date: 2016-03-17
Filing date: 2016-03-17
Publication date: 2016-07-13

Abstract

The invention relates to the field of vertical search, in particular to a website data search method based on configuration rules. The method comprises the following steps: configuring an entrance rule, a link rule and a detail rule for the website to be searched; parsing the entrance rule to obtain the entrance URL of the website, the link rule associated with the entrance rule, and the parameters used when accessing the website; parsing the link rule associated with the entrance rule to obtain a link rule grammar and the detail rule associated with the link rule; and parsing the detail rule associated with the link rule to obtain detail rule grammars used to collect the content of a page. With this method, developers no longer need to write a crawler system; they can collect data from a website simply by writing configuration rules for it. Writing website rules is simpler than writing a crawler system directly and is easier to maintain, so development and maintenance costs for enterprises can be greatly reduced.

Description

Website data search method based on configurable rules

Technical field

The present invention relates to the field of vertical search, and specifically to a website data search method based on configurable rules.

Background technology

A vertical search engine is a specialized search engine for a particular industry. It is a subdivision and extension of the general search engine: it integrates a particular class of information from the web page library, extracts the required data for a targeted field, processes it, and returns it to the user in some form. Vertical search was proposed as a new search service model in response to the shortcomings of general search engines, such as an overwhelming volume of information, imprecise queries, and insufficient depth. It provides valuable information and related services for a specific field, a specific group of users, or a specific need. Its characteristics are "specialized, refined, and deep", with an industry focus; compared with the massive, disordered information of a general search engine, a vertical search engine is more focused, specific, and in-depth. At present, in the field of vertical search, websites vary widely, so capturing data from different websites requires writing one (or several) crawler programs for each website, because no single program can cover all websites.

Summary of the invention

To address the above technical problem, the present invention provides a website data search method based on configurable rules. The method searches by parsing rules: once a set of collection rules has been configured for a website, the content of that website can be collected without modifying the search engine system. The system is thereby reused, which can greatly reduce the development and maintenance costs of enterprises working in this field.

The technical solution adopted by the present invention is a website data search method based on configurable rules, which comprises the following steps in sequence:

(1) configure an entrance rule, a link rule and a detail rule for the website to be searched;

(2) parse the entrance rule to obtain the entrance URL of the website, the link rule associated with the entrance rule, and the parameters used when accessing the website;

(3) parse the link rule associated with the entrance rule to obtain a link rule grammar for parsing the pages of the website and the detail rule associated with the link rule;

(4) parse the detail rule associated with the link rule to obtain several detail rule grammars for collecting content from the pages of the website, and thereby collect the content of the pages.
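
The chain of associations in steps (1) to (4) can be sketched as a small data model. This is only an illustrative sketch, not the implementation described by the invention: the class names, field names, the `resolve` helper and the grammar string are all hypothetical, chosen to mirror the entrance rule, link rule and detail rule of the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DetailRule:
    rule_id: int
    # Hypothetical layout: field name -> selector that collects that block of text.
    sub_rules: Dict[str, str]


@dataclass
class LinkRule:
    rule_id: int
    grammar: str          # how to find links on a page (format is an assumption)
    detail_rule_id: int   # detail rule associated with this link rule


@dataclass
class EntranceRule:
    entrance_url: str
    link_rule_ids: List[int]  # one entrance rule may reference several link rules
    access_params: Dict[str, str] = field(default_factory=dict)


def resolve(
    entrance: EntranceRule,
    link_rules: Dict[int, LinkRule],
    detail_rules: Dict[int, DetailRule],
) -> List[Tuple[str, str, Dict[str, str]]]:
    """Follow the associations: entrance rule -> link rules -> detail rules."""
    plan = []
    for link_id in entrance.link_rule_ids:
        link = link_rules[link_id]
        detail = detail_rules[link.detail_rule_id]
        plan.append((entrance.entrance_url, link.grammar, detail.sub_rules))
    return plan


# Illustrative configuration; IDs and the grammar string are made up.
link_rules = {1: LinkRule(1, "body >> a >> href", 2)}
detail_rules = {2: DetailRule(2, {"title": "h1", "body": "#artibody"})}
entrance = EntranceRule("http://www.sina.com.cn", [1], {"use_proxy": "false"})
plan = resolve(entrance, link_rules, detail_rules)
```

Keeping the three rule types as plain data, linked only by IDs, is what lets a new website be added by writing configuration rather than code.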

Preferably, the link rule has a structure of one or more layers. After a link rule is executed, the URLs of the target links are obtained. If the URL of a target link is the URL of a detail page, the link rule uses a single-layer structure; if it is not, the link rule uses a multi-layer structure and is executed recursively.

Preferably, the link rule grammar includes a CSS matching rule, a target node rule and a node attribute rule. The CSS matching rule applies CSS selector syntax to select the required HTML nodes and yields a node list; the target node rule further matches that node list and yields a set of nodes containing links; the node attribute rule matches the attributes of those link-containing nodes and yields a URL list.

Preferably, the link rule grammar also includes a pre-CSS filtering rule, which applies CSS selector syntax to select the unwanted nodes in the HTML.

Preferably, the link rule grammar also includes a post-CSS filtering rule, which applies CSS selector syntax to select unwanted nodes and compares them with the nodes selected by the CSS matching rule. If a node is identical to one selected by the CSS matching rule, that selected node is removed; if a node is a child of a node selected by the CSS matching rule, the corresponding child node is deleted from the selected node.

Preferably, the link rule grammar also includes a regular expression rule, which further modifies the above link-containing nodes.

Preferably, each detail rule grammar consists of several collection sub-rules, and each collection sub-rule corresponds to collecting one block of text on the website.

Preferably, each collection sub-rule includes a CSS matching rule, which is used to match nodes in the HTML; the text contained in the matched nodes is taken as the final result.

Preferably, each collection sub-rule also includes a post-CSS filtering rule, which uses CSS selector syntax to select unwanted nodes and compares them with the nodes matched by the CSS matching rule; if a node is identical, the node selected by the CSS matching rule is removed.

Preferably, each collection sub-rule also includes a regular expression rule, which is used to further modify the text contained in the above nodes.

As can be seen from the above, the method frees developers from writing crawler systems. A developer only needs to write configuration rules for a given website to collect data from it. Writing website rules is much simpler than writing a crawler system directly and is also much easier to maintain, which can greatly reduce development and maintenance costs for enterprises.

Detailed description of the invention

The present invention is described in detail below. The illustrative embodiments and their descriptions are intended to explain the present invention and not to limit it.

In the website data search method based on configurable rules, the steps are performed in sequence. First, an entrance rule, a link rule and a detail rule are configured for the website to be searched. The entrance rule is then parsed to obtain the entrance URL of the website, the link rule associated with the entrance rule, and the parameters used when accessing the website. The entrance URL tells the search engine where to enter the website; it is the first URL of the website that the search engine accesses. The entrance is usually the home page of the website, but it may also be a specific page. The link rule associated with the entrance rule tells the search engine which link rule to use to collect the links on this page. Entrance rules and link rules have a one-to-many relationship, so multiple link rules can be configured here; when multiple link rules are configured, the search engine executes them iteratively in the order in which they are configured. The parameters that can be set for accessing the website mainly include: whether to use a proxy server when accessing the site, the access frequency of requests, the number of executions per run, the request header parameters used when accessing the website, and business-related characters used to identify the website.

Next, the link rule associated with the entrance rule is parsed to obtain a link rule grammar for parsing the pages of the website and the detail rule associated with the link rule. The link rule grammar is the core of the link rule: by parsing it, the search engine knows how to obtain the links on the page. After parsing is complete, the search engine obtains a URL list for further collection; to collect the actual content behind these URLs, the detail rule of the next stage must be parsed. Entrance rules and link rules have a one-to-many relationship, so multiple link rules can be configured here; when multiple link rules are configured, the search engine executes them iteratively in the configured order and stops as soon as one rule parses successfully.
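
The "try each configured link rule in order until one succeeds" behaviour can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not define what counts as a successful parse, so a non-empty URL list is treated as success here, and the two example rules (`news_links`, `any_links`) are hypothetical stand-ins written with regular expressions.

```python
import re
from typing import Callable, List, Optional, Sequence

# A link rule is modelled as a function from page HTML to a list of URLs.
LinkRuleFn = Callable[[str], List[str]]


def first_successful(rules: Sequence[LinkRuleFn], page_html: str) -> Optional[List[str]]:
    """Try rules in configured order; return the first non-empty URL list."""
    for rule in rules:
        urls = rule(page_html)
        if urls:  # assumption: a non-empty result means the rule parsed successfully
            return urls
    return None  # no configured rule produced any link


def news_links(html: str) -> List[str]:
    """Hypothetical rule: hrefs ending in digits followed by .shtml."""
    return re.findall(r'href="([^"]*\d+\.shtml)"', html)


def any_links(html: str) -> List[str]:
    """Hypothetical fallback rule: every href on the page."""
    return re.findall(r'href="([^"]+)"', html)


page = '<a href="/about.html">About</a>'
result = first_successful([news_links, any_links], page)
```

Here the first rule finds nothing on the sample page, so the second rule supplies the result.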

The purpose of a link rule is to tell the capture program how to collect the links on a website; by parsing the rule, the capture program can collect those links. This application designs the rule with a layered structure: a link rule may have one layer or multiple layers. Each time one layer of the rule has been executed, the search engine obtains the URLs of a set of target links. If the URLs of the target links are detail URLs, a single-layer structure is sufficient, and the URLs are passed directly to the detail rule for collection. If the target URLs are not detail URLs, a multi-layer structure is used and executed recursively. When a multi-layer structure is configured, the search engine treats each URL in the target URL list of the previous layer as a new entrance and matches it with the link rule of the next layer, obtaining a new set of target URLs; this continues until the last layer has been executed.
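
The recursive, layer-by-layer execution can be illustrated with a small sketch. It is not the patented implementation: the in-memory `SITE` dictionary stands in for real page fetching, and `run_layers` and `links_on` are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory "site": page URL -> links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/china", "/world"],
    "/china": ["/china/1.shtml", "/china/2.shtml"],
    "/world": ["/world/3.shtml"],
}

# One layer of a link rule, modelled as: page URL -> target URLs.
Layer = Callable[[str], List[str]]


def run_layers(entrance_url: str, layers: List[Layer]) -> List[str]:
    """Apply link-rule layers in order; each layer's output URLs feed the next layer."""
    if not layers:
        # No layers left: this URL is handed to the detail rule.
        return [entrance_url]
    targets: List[str] = []
    for url in layers[0](entrance_url):
        # Each target URL of this layer becomes a new entrance for the next layer.
        targets.extend(run_layers(url, layers[1:]))
    return targets


def links_on(url: str) -> List[str]:
    """Stand-in for executing one link-rule layer against a fetched page."""
    return SITE.get(url, [])


# Two layers: home page -> channel pages -> detail pages.
detail_urls = run_layers("/", [links_on, links_on])
```

With a single layer the same function returns the channel pages directly, which corresponds to the single-layer case in the text.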

First, note that a link rule grammar consists of the following parts: a pre-CSS filtering rule, a CSS matching rule, a post-CSS filtering rule, a regular expression rule, a target node rule, and a node attribute rule. The function of each rule is as follows:

Pre-CSS filtering rule: applies CSS selector syntax to select the nodes we do not want in the HTML and removes them from the HTML DOM.

CSS matching rule: applies CSS selector syntax to select the HTML nodes we need.

Post-CSS filtering rule: applies CSS selector syntax to select the nodes we do not need and compares them with the nodes selected in the previous step. If a node is identical to a node selected in the previous step, that node is removed; if it is a child of a node selected in the previous step, the corresponding child node is deleted from that node.

Target node rule: applies CSS selector syntax to the nodes remaining after filtering to select the corresponding nodes, such as <A> tag nodes.

Node attribute rule: applied to the nodes selected in the previous step to select the URL held in an attribute of each node, such as the HREF attribute of an A node.

Regular expression rule: uses a regular expression to process and modify the result further, for example when part of a link needs to be replaced.

After the collection engine reads the link rule of this layer, it collects the links on the page in the following order:

1. Use the pre-CSS filtering rule to filter out the HTML content we do not want; if no pre-CSS filtering rule is configured, skip this step.

2. Use the CSS matching rule to match nodes in the HTML; the result should be a node list.

3. Use the post-CSS filtering rule to filter out the nodes we do not want from the result of the previous step; if no post-CSS filtering rule is configured, skip this step.

4. Use the target node rule to further match the result of the previous step; the result is a set of nodes containing links.

5. Use the node attribute rule to match the attributes of all the link-containing nodes from the previous step; the result is a URL list.

6. Use the regular expression rule to further modify the result of the previous step; if no regular expression rule is configured, skip this step.
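
The six steps above can be sketched end to end. This is a hedged illustration, not the engine described in the text: Python's standard library has no CSS selector engine, so the sketch swaps in the limited XPath subset of `xml.etree.ElementTree` for CSS selectors and therefore only accepts well-formed markup. The function names, the sample page and the `example.com` prefix are all assumptions.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def remove_nodes(root: ET.Element, unwanted: List[ET.Element]) -> None:
    """Detach each unwanted element from its parent in the tree."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for node in unwanted:
        parent = parent_of.get(node)
        if parent is not None:
            parent.remove(node)


def collect_links(
    page: str,
    match: str,
    target: str,
    attr: str,
    pre_filter: Optional[str] = None,
    post_filter: Optional[str] = None,
    regex: Optional[Tuple[str, str]] = None,
) -> List[str]:
    root = ET.fromstring(page)
    if pre_filter:  # step 1: drop unwanted content from the document
        remove_nodes(root, root.findall(pre_filter))
    matched = root.findall(match)  # step 2: node list
    if post_filter:  # step 3: drop identical nodes, prune unwanted children
        unwanted = root.findall(post_filter)
        unwanted_ids = {id(node) for node in unwanted}
        matched = [n for n in matched if id(n) not in unwanted_ids]
        remove_nodes(root, unwanted)
    link_nodes = [a for node in matched for a in node.findall(target)]  # step 4
    urls = [a.get(attr) for a in link_nodes if a.get(attr)]  # step 5
    if regex:  # step 6: optional rewrite of each URL
        pattern, replacement = regex
        urls = [re.sub(pattern, replacement, url) for url in urls]
    return urls


PAGE = """<html><body>
<div class="ad"><a href="/ad/9999.shtml">Ad</a></div>
<div id="news">
  <a href="/c/1001.shtml">Story</a>
  <span class="more"><a href="/more/">More</a></span>
</div>
</body></html>"""

urls = collect_links(
    PAGE,
    pre_filter=".//div[@class='ad']",
    match=".//div[@id='news']",
    post_filter=".//span[@class='more']",
    target=".//a",
    attr="href",
    regex=(r"^/", "http://example.com/"),
)
```

A production version would use a real HTML parser with CSS selector support; the order of operations is the part this sketch is meant to show.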

For websites that require multi-layer link rules, a multi-layer grammar is used:

{{layer-1 rule}}{{layer-2 rule}}...{{layer-n rule}}. In this way, rules with an unlimited number of layers can be implemented.
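
Splitting such a multi-layer grammar into its layers can be sketched with a regular expression. This assumes each layer is wrapped in double braces with no nested braces, as the notation above suggests; the `split_layers` name and the selector strings in the example are hypothetical, since the text does not specify what a layer contains.

```python
import re
from typing import List


def split_layers(grammar: str) -> List[str]:
    """Return the rule text of each {{...}} layer, in configured order."""
    # Non-greedy match so that each pair of double braces yields one layer.
    layers = re.findall(r"\{\{(.*?)\}\}", grammar, flags=re.DOTALL)
    return [layer.strip() for layer in layers]


# Illustrative two-layer grammar; the layer contents are made up.
layers = split_layers("{{ul.nav a}}{{div.list a}}")
```

A rule whose own text contains `}}` would need an escaping convention, which this sketch does not attempt.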

Next, the detail rule associated with the link rule is parsed to obtain several detail rule grammars for collecting content from the pages of the website, and the content of the pages is thereby collected. The purpose of a detail rule is to tell the capture program how to collect the text on a website; with this rule, the capture program can collect the website's information. Specifically, each detail rule grammar consists of several collection sub-rules, and each collection sub-rule corresponds to collecting one block of text on the website. How many collection sub-rules to configure depends on the business need. For example, collecting news usually requires the following collection sub-rules: a title sub-rule, a body text sub-rule, an author sub-rule, a publication time sub-rule, and so on. Configuring a detail rule therefore means configuring one or more collection sub-rules and combining them. A collection sub-rule consists of the following three parts: a CSS matching rule, a post-CSS filtering rule, and a regular expression rule. Their functions are as follows:

CSS matching rule: applies CSS selector syntax to select the HTML nodes we need.

Post-CSS filtering rule: applies CSS selector syntax to select the nodes we do not need and compares them with the nodes selected in the previous step; if a node is identical to a node selected in the previous step, that selection is removed.

Regular expression rule: uses a regular expression to process and modify the result further, for example when part of a link needs to be replaced.

After the collection engine reads the rule of this layer, it collects the information on the page in the following order:

1. Use the CSS matching rule to match nodes in the HTML; the result should be a node list.

2. Use the post-CSS filtering rule to filter out the nodes we do not want from the result of the previous step; if no post-CSS filtering rule is configured, skip this step.

3. Check whether a regular expression rule is configured. If not, the text contained in the matched nodes is the final result; if so, the text contained in the nodes is further modified, and the modified text is the final result.
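
The three steps of a collection sub-rule can be sketched in the same spirit. As before, this is an assumption-laden illustration: the limited XPath subset of `xml.etree.ElementTree` stands in for CSS selectors, the input must be well-formed, and `collect_text` and the sample article are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def collect_text(
    page: str,
    match: str,
    post_filter: Optional[str] = None,
    regex: Optional[Tuple[str, str]] = None,
) -> List[str]:
    root = ET.fromstring(page)
    matched = root.findall(match)  # step 1: node list
    if post_filter:  # step 2: drop matched nodes that are also selected as unwanted
        unwanted_ids = {id(node) for node in root.findall(post_filter)}
        matched = [n for n in matched if id(n) not in unwanted_ids]
    # Without a regular expression rule, the node text is the final result.
    texts = ["".join(node.itertext()).strip() for node in matched]
    if regex:  # step 3: optional modification of the collected text
        pattern, replacement = regex
        texts = [re.sub(pattern, replacement, text) for text in texts]
    return texts


ARTICLE = """<html><body>
<h1>Sample headline</h1>
<p class="time">Published: 2015-11-30 10:00</p>
<p class="ad">Sponsored</p>
</body></html>"""

# Hypothetical "publication time" sub-rule: match paragraphs, drop the ad, strip the label.
published = collect_text(
    ARTICLE,
    match=".//p",
    post_filter=".//p[@class='ad']",
    regex=(r"^Published:\s*", ""),
)
```

Running one such sub-rule per field (title, body text, author, publication time) and combining the results gives the record for a page.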

The following example, which collects news from the pages of the Sina website, illustrates how the rules are configured:

1. Configure the entrance rule.

The entrance URL is set to the Sina home page: http://www.sina.com.cn.

Configure the ID of the associated next-stage link rule: 1 (the actual ID is whatever the system assigns; 1 is used here for illustration).

2. Configure the link rule.

Set the link rule grammar. Here we need to collect all the news links on the home page. By observation, we found that the URLs of news links essentially all end with digits followed by .shtml, so the rule can be configured as follows:

The CSS matching rule can be set to: body

This means that, following CSS syntax, the BODY node of the whole document is taken as our target, because all the links are contained in the BODY node.

The post-CSS filtering rule can be set to: a:not(a[href~=\d{4}.shtml])

Following CSS syntax, this rule filters out the links in the BODY tag whose URLs do not contain digits followed by .shtml. This ensures that the results are all news links rather than links to non-news pages such as second-level directories.

The target node rule can be set to: a

Following CSS syntax, this takes all the a tags in the BODY node as target nodes.

The node attribute rule can be set to: href

This means that the href attribute of each a tag is taken as the result URL.

Configure the ID of the associated next-stage collection rule: 2 (the actual ID is whatever the system assigns; 2 is used here for illustration).
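
The effect of the filtering rule in this link rule can be checked with a plain regular expression. This sketch assumes the pattern inside the rule is meant as four digits followed by .shtml; the dot is escaped here, which is slightly stricter than the configured pattern. The second URL in the list is a made-up example of a directory link.

```python
import re

# Assumed reading of the pattern in the filtering rule: four digits, then ".shtml".
NEWS_URL = re.compile(r"\d{4}\.shtml")

links = [
    "http://news.sina.com.cn/c/2015-11-30/doc-ifxmainy1476425.shtml",
    "http://news.sina.com.cn/china/",  # hypothetical second-level directory link
]

# Links that do not match are the ones the filtering rule removes.
kept = [url for url in links if NEWS_URL.search(url)]
removed = [url for url in links if not NEWS_URL.search(url)]
```

Only the article URL survives, which is the behaviour the configuration is aiming for.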

3. Configure the detail rule.

Suppose the content we need to collect consists of the title, body text and publication time of a news item, and that the news links collected in the previous step include this URL: http://news.sina.com.cn/c/2015-11-30/doc-ifxmainy1476425.shtml. To collect this URL, the collection rules can be configured as follows:

Title CSS matching rule: h1

Following CSS syntax, this rule selects the node whose tag is h1. Because this CSS matching rule is already enough to collect the news title, no filtering rule or regular expression rule needs to be configured for the title.

Body text CSS matching rule: #artibody

Following CSS syntax, this rule selects the node whose ID is artibody. Likewise, because this CSS rule is already enough to collect the body text of the news item, no additional filtering rule or regular expression rule needs to be configured for the body text.

Publication time CSS matching rule: #navtimeSource

Following CSS syntax, this rule selects the node whose ID is navtimeSource. Because this node contains the time information, the publication time is collected, and no additional filtering rule or regular expression rule needs to be configured for the publication time.
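
The three matching rules of this example (h1, #artibody, #navtimeSource) can be tried out with a small standard-library sketch. It is only an approximation: the `TextCollector` class supports just a bare tag name or `#id`, it assumes properly nested tags, and the sample markup is invented for illustration rather than taken from a real Sina page.

```python
from html.parser import HTMLParser
from typing import Dict, List

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextCollector(HTMLParser):
    """Collect the text inside the first element matching `tag` or `#id`."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.selector = selector
        self.depth = 0       # greater than 0 while inside the matched element
        self.done = False    # True once the first match has been closed
        self.parts: List[str] = []

    def _matches(self, tag, attrs) -> bool:
        if self.selector.startswith("#"):
            return dict(attrs).get("id") == self.selector[1:]
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they do not affect depth
        if self.depth:
            self.depth += 1
        elif not self.done and self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def collect(page: str, rules: Dict[str, str]) -> Dict[str, str]:
    """Apply one matching rule per field and return the collected text."""
    record = {}
    for name, selector in rules.items():
        parser = TextCollector(selector)
        parser.feed(page)
        parser.close()
        record[name] = "".join(parser.parts).strip()
    return record


SAMPLE = """<html><body>
<h1>Sample headline</h1>
<span id="navtimeSource">2015-11-30 10:00</span>
<div id="artibody"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

RULES = {"title": "h1", "body": "#artibody", "publish_time": "#navtimeSource"}
record = collect(SAMPLE, RULES)
```

Because each field needs only a matching rule here, the configuration reduces to a three-entry mapping from field name to selector.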

The technical solutions provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the embodiments, and the description of the above embodiments is intended only to help in understanding those principles. For a person of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the embodiments of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A website data search method based on configurable rules, comprising the following steps in sequence:

2. The website data search method based on configurable rules according to claim 1, characterized in that: the link rule has a structure of one or more layers; after a link rule is executed, the URLs of the target links are obtained; if the URL of a target link is the URL of a detail page, the link rule uses a single-layer structure; if the URL of a target link is not the URL of a detail page, the link rule uses a multi-layer structure and is executed recursively.

3. The website data search method based on configurable rules according to claim 1, characterized in that: the link rule grammar includes a CSS matching rule, a target node rule and a node attribute rule, wherein the CSS matching rule applies CSS selector syntax to select the required HTML nodes and yields a node list; the target node rule further matches the node list and yields a set of nodes containing links; and the node attribute rule matches the attributes of the link-containing nodes and yields a URL list.

4. The website data search method based on configurable rules according to claim 3, characterized in that: the link rule grammar also includes a pre-CSS filtering rule, which applies CSS selector syntax to select the unwanted nodes in the HTML.

5. The website data search method based on configurable rules according to claim 3, characterized in that: the link rule grammar also includes a post-CSS filtering rule, which applies CSS selector syntax to select unwanted nodes and compares them with the nodes selected by the CSS matching rule; if a node is identical, the node selected by the CSS matching rule is removed; if a node is a child of a node selected by the CSS matching rule, the corresponding child node is deleted from the node selected by the CSS matching rule.

6. The website data search method based on configurable rules according to claim 1, characterized in that: the link rule grammar also includes a regular expression rule, which further modifies the above link-containing nodes.

7. The website data search method based on configurable rules according to claim 1, characterized in that: each detail rule grammar consists of several collection sub-rules, and each collection sub-rule corresponds to collecting one block of text on the website.

8. The website data search method based on configurable rules according to claim 7, characterized in that: each collection sub-rule includes a CSS matching rule, which is used to match nodes in the HTML, and the text contained in the matched nodes is taken as the final result.

9. The website data search method based on configurable rules according to claim 7, characterized in that: each collection sub-rule also includes a post-CSS filtering rule, which uses CSS selector syntax to select unwanted nodes and compares them with the nodes matched by the CSS matching rule; if a node is identical, the node selected by the CSS matching rule is removed.

10. The website data search method based on configurable rules according to claim 7, characterized in that: each collection sub-rule also includes a regular expression rule, which is used to further modify the text contained in the above nodes.