CN103970898A

CN103970898A - Method and device for extracting information based on multistage rule base

Info

Publication number: CN103970898A
Application number: CN201410227611.XA
Authority: CN
Inventors: 张可; 柴毅; 马号; 刘建环; 田甜
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2014-05-27
Filing date: 2014-05-27
Publication date: 2014-08-06

Abstract

A method for extracting information based on a multistage rule base comprises the following steps: (1) the URL addresses of web pages are obtained; (2) the web pages corresponding to the URL addresses are downloaded; (3) a tree structure diagram of each web page is obtained; (4) web page clustering is conducted: web pages are selected from the pages to be clustered to serve as a training set, and the clustering rules of the web pages are defined by a machine learning method; (5) the search results are extracted; (6) the information is collected and displayed. Because the web page tree structure diagram is obtained in step (3) and the web pages are clustered in step (4), the recall of the retrieved information is effectively increased. The clustering rules are generated automatically by machine learning from the training set, so manual clustering is not needed, the degree of search automation is effectively increased, and the method is suitable for large-scale use while recall is guaranteed. The device for extracting information based on the multistage rule base provides a hardware foundation for the information extraction process, is low in cost, and is suitable for large-scale use.

Description

Information extraction method and device based on a multistage rule base

Technical field

The present invention relates to the technical field of computer search engines, and in particular to an information extraction method and device.

Background technology

With the spread and application of computers and networks, the whole world has entered the era of big information, in which information search engines have become an indispensable key technology. Current information search engines use the following four kinds of information search methods:

1. Information extraction based on HTML structure. This technique extracts information according to the structural features of HTML: through the tree structure of the DOM model, extracting information from a web page becomes equivalent to extracting node information from the tree. Shortcoming: when the page changes too much, information can no longer be extracted.

2. Web information extraction based on natural language. This technique ignores the web page structure and markup tags, and analyzes the page text only according to the relationships inherent in natural language itself. Shortcoming: extraction is slow, and when a web document with multiple subjects is processed, extraction easily fails if the subjects are not first divided into blocks.

3. Information extraction based on ontology. An ontology consists of the related concepts, attributes, relations, constraints, and terms of a domain. This technique mainly uses the ontology's description of the data in that domain and, without considering the page structure of the web document, extracts information only according to the semantic features of the data. Shortcoming: although the method is flexible and adaptable, its degree of automation is low.

4. Information extraction based on wrapper learning. A professional Internet developer analyzes the structure of a website and then hand-codes a wrapper program; each wrapper can serve only one class of web pages. Shortcoming: a large number of web pages requires the analysis of a large number of structures, and many websites have complicated structures. Even for professionals, writing each wrapper takes a great deal of time, and much effort is spent on website structure analysis and program debugging.

Summarizing the four approaches above: methods that depend little on HTML document structure are highly automated, but they cannot handle web pages with complicated structures, their extraction accuracy is low, and their practicality is poor. Methods that depend heavily on HTML document structure can handle complicated pages, but their degree of automation is low. In short, extraction that relies on manual participation is accurate but poorly automated, while highly automated extraction usually suffers from low accuracy and poor practicality.

Summary of the invention

One object of the present invention is to provide an information extraction method based on a multistage rule base. It can complete information search and extraction without manual clustering, which significantly improves the degree of automation of the search engine; at the same time, it automatically analyzes and clusters the web page information found, which significantly improves the recall of the information.

This object is achieved by the following technical scheme, which includes the following steps:

1) A search keyword is input, and all web page URL addresses related to the keyword are obtained;

2) According to the URL addresses obtained in step 1), the corresponding web pages are downloaded;

3) The web pages downloaded in step 2) are preprocessed to obtain web page tree structure diagrams;

4) According to the tree structure diagrams obtained in step 3), web page clustering is carried out: web pages are chosen from the pages to be clustered as a training set, web page templates are obtained by a machine learning method, and the clustering rules of the web pages are defined;

5) Search results are extracted: according to the input keyword, XPath rules are used to locate nodes, and XSLT rules are then used to extract the information;

6) According to the results extracted in step 5), the information extracted from the different types of web pages is gathered and displayed.

Further, "related" in step 1) means identical or similar to the keyword.

Further, the download method in step 2) is a web crawler download method.

Further, in the web page preprocessing of step 3), the specific method for obtaining the web page tree structure diagram is:

3-1) Web page cleanup is performed on the web pages downloaded in step 2): HTML text that does not meet the standard is converted to text that meets the XML standard, and illegal characters and escape errors are removed;

3-2) DOM parsing is performed on the result of step 3-1): the XML-compliant text is parsed into a document object (Document);

3-3) The web page structure is displayed graphically: the document object is displayed graphically as a DOM tree, so that the web page structure can be analyzed and the information of the main nodes extracted through the tree structure.

Further, in step 3-2), the XML-compliant text is parsed using the DOM4j or jdom toolkit.

Further, the specific method for generating the clustering rules in step 4) is:

4-1) Web page similarity is calculated: the tree path matching algorithm is used to calculate the similarity between web pages, forming a similarity matrix;

4-2) The web pages are clustered by a clustering algorithm: the algorithm is agglomerative hierarchical clustering, the inter-cluster distance is calculated by the average linkage method, and the input of the average linkage method is the similarity matrix formed in step 4-1).

Further, the specific calculation formulas for step 4-1) and step 4-2) are:

sim(h_{i}, h_{j}) = \left( \frac{\sum_{k=1}^{pn(h_{i})} sim(p_{ik}, bp(p_{ik}))}{pn(h_{i})} + \frac{\sum_{k=1}^{pn(h_{j})} sim(p_{jk}, bp(p_{jk}))}{pn(h_{j})} \right) \div 2

Where h_{i} denotes the set of all tree paths of web page i; p_{ik} is one tree path in h_{i}; bp(p_{ik}) denotes the best matching path of p_{ik} in h_{j}, and likewise bp(p_{jk}) denotes the best matching path of p_{jk} in h_{i}; sim(h_{i}, h_{j}) denotes the similarity of the two web pages; pn(h_{i}) denotes the total number of tree paths in h_{i}, and pn(h_{j}) the total number of tree paths in h_{j}. The range of the web page structure similarity is [0, 1]; the closer the value is to 1, the more similar the structures of the two web pages.
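As a worked illustration of the similarity formula, assume (hypothetically, these numbers are not from the source) that h_{i} has 2 tree paths whose best-match similarities are 1.0 and 0.5, and that h_{j} has 3 tree paths whose best-match similarities are 1.0, 0.5, and 0.0:

```latex
% Hypothetical values: pn(h_i) = 2, pn(h_j) = 3
sim(h_{i}, h_{j}) = \left( \frac{1.0 + 0.5}{2} + \frac{1.0 + 0.5 + 0.0}{3} \right) \div 2
                  = (0.75 + 0.5) \div 2
                  = 0.625
```

The result lies in [0, 1], as stated above; averaging both directions keeps the measure symmetric in the two pages.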

d_{avg}(c_{i}, c_{j}) = \frac{1}{n_{i} n_{j}} \sum_{p \in c_{i}} \sum_{p' \in c_{j}} | p - p' |

Where n_{i} is the number of objects in cluster c_{i}, and n_{j} is the number of objects in cluster c_{j}.
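As a worked illustration of the average linkage formula, assume (hypothetically) a cluster c_{i} = {a, b} and a cluster c_{j} = {c}, with pairwise distances |a - c| = 0.2 and |b - c| = 0.4:

```latex
% Hypothetical values: n_i = 2, n_j = 1
d_{avg}(c_{i}, c_{j}) = \frac{1}{2 \cdot 1} (0.2 + 0.4) = 0.3
```

The inter-cluster distance is simply the mean of all cross-cluster pairwise distances.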

Further, the XSLT rules in step 5) are obtained from a template web page by a rule generation algorithm, whose input is the parent node of the information block and whose output is an XSLT rule.

Another object of the present invention is to provide an information extraction device based on a multistage rule base. It can realize fully automated information search, and it analyzes and clusters the web page information found, which significantly improves the recall of the information.

This object is achieved by the following technical scheme: the device includes a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base;

The URL address acquisition module obtains the URL addresses of the related web pages according to the search keyword and sends the URL address information to the web page code acquisition module;

The web page code acquisition module downloads the web pages according to the URL address information and sends the downloaded web page information to the web page preprocessing module;

The web page preprocessing module preprocesses the web page information to obtain web page tree structure diagrams and sends them to the web page clustering module;

The web page clustering module clusters the web pages represented by the tree structure diagrams according to the information in the web page clustering rule base and sends the clustered web page information to the web page information extraction module; the information in the web page clustering rule base is generated by the clustering rule building module;

The web page information extraction module extracts information from the clustered web pages and sends the extracted information to the information display module; the information extraction rule base supplies the information extraction rules to the web page information extraction module, and the rules in the information extraction rule base are generated by the information extraction rule building module;

The information display module displays the information sent by the web page information extraction module.
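The data flow between the modules described above can be sketched roughly as follows. This is only an illustrative sketch: the class `ExtractionDevice`, its field names, and the idea of representing each module as a callable are assumptions made for the example, not something the source specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExtractionDevice:
    # Each field stands in for one module of the device (hypothetical names).
    get_urls: Callable[[str], List[str]]          # URL address acquisition module
    download: Callable[[str], str]                # web page code acquisition module
    preprocess: Callable[[str], object]           # web page preprocessing module
    cluster: Callable[[List[object]], List[int]]  # clustering module (consults the clustering rule base)
    # Information extraction rule base: one extraction rule per cluster label.
    extraction_rules: Dict[int, Callable[[object], dict]] = field(default_factory=dict)

    def run(self, keyword):
        """Run the pipeline and return the records handed to the display module."""
        trees = [self.preprocess(self.download(url)) for url in self.get_urls(keyword)]
        labels = self.cluster(trees)
        records = []
        for tree, label in zip(trees, labels):
            rule = self.extraction_rules.get(label)
            if rule is not None:  # pages whose cluster has no rule are skipped
                records.append(rule(tree))
        return records
```

Passing the modules in as callables keeps the sketch independent of any particular crawler, parser, or clustering implementation.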

Owing to the above technical scheme, the present invention has the following advantages:

The information extraction method based on a multistage rule base of the present invention realizes information extraction in six steps: 1) obtaining web page URL addresses; 2) downloading the web pages corresponding to the URL addresses; 3) obtaining web page tree structure diagrams; 4) carrying out web page clustering, in which web pages are chosen from the pages to be clustered as a training set, web page templates are obtained by a machine learning method, and the clustering rules are defined; 5) extracting the search results; 6) gathering and displaying the information. Because the web page tree structure is generated in step 3) and the web pages are clustered in step 4), the recall of the retrieved information is effectively improved. The clustering rules in step 4) are generated automatically by machine learning from the training set, so no manual clustering is needed; this effectively improves the degree of search automation and makes the method suitable for large-scale use while recall is guaranteed. The information extraction device based on a multistage rule base of the present invention provides a hardware foundation for the information extraction flow; it is inexpensive and suitable for large-scale use.

Other advantages, objects, and features of the present invention will be set forth to some extent in the following description, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the present invention may be realized and obtained by the following description and claims.

Description of the drawings

The drawings of the present invention are described as follows.

Fig. 1 is a schematic flow diagram of the information extraction of the present invention;

Fig. 2 is a schematic structural diagram of the device of the present invention.

Embodiment

The invention is further described below with reference to the drawings and embodiments.

An information extraction method based on a multistage rule base comprises the following specific steps:

1) URL address acquisition. First, web pages related to the search keyword are searched for by means of a query sequence, and the URL addresses of the web pages are obtained. The URL addresses obtained here comprise all URL addresses related to the query sequence; that is, a large number of addresses rather than a single address.

2) Web page download. Web crawler technology is used to download the code of the related web pages from the acquired URL addresses.

3) Web page preprocessing. The acquired web pages are processed to obtain standard DOM trees. This comprises web page cleanup, DOM parsing, and graphical display of the web page structure.

Web page cleanup means repairing the HTML page and converting it into a standard-compliant XML document. Because HTML does not strictly follow the XHTML standard, a page may contain illegal characters and escape errors; web page cleanup mainly corrects these errors to avoid parse errors.
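The cleanup step can be sketched with Python's standard library as follows. This is a minimal sketch under stated assumptions, not the cleanup tool used by the invention: the class `XmlRepairer`, the function `clean_html`, and the short `VOID_TAGS` list are hypothetical, and the sketch assumes the page has a single root element and no duplicate attributes.

```python
from html import escape
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML (deliberately incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XmlRepairer(HTMLParser):
    """Re-emits loosely written HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired document
        self.stack = []  # tags that are currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f' {name}="{escape(name if value is None else value, quote=True)}"'
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every tag left open inside the one being closed.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Escape bare '&', '<' and '>' so the text is legal XML.
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.stack:  # close anything still open at end of input
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def clean_html(html_text):
    repairer = XmlRepairer()
    repairer.feed(html_text)
    return repairer.result()
```

For instance, `clean_html("<div><p>a & b<br><b>bold</div>")` closes the unclosed `p` and `b` elements, self-closes `br`, and escapes the bare ampersand, so the result can be handed to an XML parser.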

DOM parsing means parsing the XML-format text into a document object (Document); for example, the parsing tools DOM4j or jdom can be used to parse the XML-format text and obtain the document object.

Graphical display of the web page structure means displaying the document object graphically as a DOM tree, so that the web page structure can be analyzed and the information of the main nodes extracted through the tree structure.
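The parsing step and the tree paths used later for clustering can be sketched as follows. The source names DOM4j and jdom (Java toolkits); this sketch substitutes Python's standard `xml.etree.ElementTree` as a stand-in, and the helper name `tree_paths` and the choice of root-to-leaf tag sequences as "tree paths" are assumptions.

```python
import xml.etree.ElementTree as ET


def tree_paths(xml_text):
    """Return every root-to-leaf tag path of a well-formed XML page."""
    root = ET.fromstring(xml_text)  # parse the text into a document tree
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:            # leaf node: record the complete path
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths
```

Each page is thus reduced to a list of tag tuples such as `("html", "body", "ul", "li")`, which is a convenient input for a path-matching similarity measure.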

4) Web page clustering. A portion of the web pages to be clustered is chosen as a training set; web page templates are obtained by a machine learning method, and the clustering rules of the web pages are defined. Specifically:

Choice of similarity calculation method: the average linkage method needs a similarity matrix to obtain the inter-cluster distance, so the similarity between web pages must be calculated first. The similarity calculation method adopted by the present invention is the tree path matching algorithm, which has lower complexity and takes less time than the tree edit distance algorithm.

Choice of clustering algorithm: the web page clustering algorithm adopted here is agglomerative hierarchical clustering, with the inter-cluster distance measured by the average linkage method. Clustering terminates when the distance between any two clusters is greater than a given threshold Q.

The similarity formula is as follows:

sim(h_{i}, h_{j}) = \left( \frac{\sum_{k=1}^{pn(h_{i})} sim(p_{ik}, bp(p_{ik}))}{pn(h_{i})} + \frac{\sum_{k=1}^{pn(h_{j})} sim(p_{jk}, bp(p_{jk}))}{pn(h_{j})} \right) \div 2

Where h_{i} denotes the set of all tree paths of web page i; p_{ik} is one tree path in h_{i}; bp(p_{ik}) denotes the best matching path of p_{ik} in h_{j}, and likewise bp(p_{jk}) denotes the best matching path of p_{jk} in h_{i}; sim(h_{i}, h_{j}) denotes the similarity of the two web pages; pn(h_{i}) denotes the total number of tree paths in h_{i}, and pn(h_{j}) the total number of tree paths in h_{j}.

The average linkage formula is as follows:
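The page similarity formula can be sketched in Python as follows. The source does not define how the similarity of two individual paths, sim(p, bp(p)), is computed; this sketch assumes `difflib.SequenceMatcher.ratio()` over tag sequences as a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def path_sim(p, q):
    """Similarity of two tag paths in [0, 1] (assumed measure, not from the source)."""
    return SequenceMatcher(None, p, q).ratio()


def best_match_sim(p, others):
    """sim(p, bp(p)): similarity between p and its best matching path in `others`."""
    return max((path_sim(p, q) for q in others), default=0.0)


def page_sim(h_i, h_j):
    """Average the best-match similarities in both directions, then halve the sum."""
    if not h_i or not h_j:
        return 0.0
    left = sum(best_match_sim(p, h_j) for p in h_i) / len(h_i)
    right = sum(best_match_sim(p, h_i) for p in h_j) / len(h_j)
    return (left + right) / 2


def similarity_matrix(pages):
    """Pairwise page similarities; `pages` is a list of path lists."""
    n = len(pages)
    return [[page_sim(pages[a], pages[b]) for b in range(n)] for a in range(n)]
```

Here `h_i` and `h_j` are lists of tag tuples, one tuple per tree path; a page compared with itself yields 1.0, and pages with no tags in common yield 0.0.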

d_{avg}(c_{i}, c_{j}) = \frac{1}{n_{i} n_{j}} \sum_{p \in c_{i}} \sum_{p' \in c_{j}} | p - p' |
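Agglomerative clustering with average linkage and the threshold Q can be sketched as follows. The source feeds a similarity matrix into a distance formula without stating the conversion; this sketch assumes distance = 1 - similarity, and the names `average_linkage`, `cluster_pages`, and `q_threshold` are hypothetical.

```python
def average_linkage(c_i, c_j, dist):
    """d_avg: mean pairwise distance between the members of two clusters."""
    total = sum(dist[p][q] for p in c_i for q in c_j)
    return total / (len(c_i) * len(c_j))


def cluster_pages(sim, q_threshold):
    """Merge the closest clusters until every pair is farther apart than Q."""
    n = len(sim)
    # Assumption: the distance between two pages is 1 - similarity.
    dist = [[1.0 - sim[a][b] for b in range(n)] for a in range(n)]
    clusters = [[k] for k in range(n)]  # start with one cluster per page
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > q_threshold:  # termination condition from the text
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The function returns clusters as lists of page indices into the similarity matrix. A small Q keeps pages apart and a large Q merges everything, so Q controls how many page types the extraction rules must cover.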

5) Web page information extraction. For the different types of web pages obtained by web page clustering, specific information extraction rules are applied to extract the web page information.

Obtaining the information extraction rules: the information extraction rules are described in XSLT, and XPath is used to locate precisely the information nodes to be extracted in the XHTML document. Because rules defined in a fully automatic manner have lower accuracy, the rules here are obtained with manual intervention. For example, for list-type web pages, a template web page that reflects the structural features of this class is first chosen, XPath is used to locate the parent node of the information block in the template web page, and the information extraction rule is then obtained by a rule extraction algorithm. The input of this algorithm is the parent node of the information block, and its output is an XSLT file.

The XSLT rules are generated as follows: the extraction rules are obtained from template web pages by a rule generation algorithm, so each type of web page has its own corresponding XSLT rule. The rule generation algorithm is an existing program whose input is the parent node of the information block and whose output is an XSLT rule. A template web page is a page that has the typical structure of a class of web pages and reflects the characteristic features of that class.
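The source does not disclose the rule generation algorithm itself, so the following is only a hypothetical sketch of what such a program might produce: given the XPath of the information-block parent node, it emits an XSLT 1.0 stylesheet that loops over the block's items. The function `generate_xslt_rule`, its parameters, and the `results`/`record` output element names are all assumptions.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def generate_xslt_rule(block_parent_xpath, item_name, fields):
    """Build an XSLT stylesheet (as text) for one class of web pages.

    block_parent_xpath: XPath locating the parent node of the information block.
    item_name: tag of each repeated item under that parent (e.g. "li").
    fields: maps an output element name to an XPath relative to the item.
    """
    ET.register_namespace("xsl", XSL_NS)
    sheet = ET.Element(f"{{{XSL_NS}}}stylesheet", {"version": "1.0"})
    template = ET.SubElement(sheet, f"{{{XSL_NS}}}template", {"match": "/"})
    results = ET.SubElement(template, "results")
    loop = ET.SubElement(
        results,
        f"{{{XSL_NS}}}for-each",
        {"select": f"{block_parent_xpath}/{item_name}"},
    )
    record = ET.SubElement(loop, "record")
    for out_name, rel_xpath in fields.items():
        field = ET.SubElement(record, out_name)
        ET.SubElement(field, f"{{{XSL_NS}}}value-of", {"select": rel_xpath})
    return ET.tostring(sheet, encoding="unicode")
```

For a list-type template page, a call such as `generate_xslt_rule("//ul[@id='results']", "li", {"title": "a", "link": "a/@href"})` yields a stylesheet that could be stored in the information extraction rule base. Applying it to a page requires an XSLT 1.0 processor, which Python's standard library does not include.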

6) Information display. After information extraction from the web pages is completed, the information extracted from the different types of web pages is gathered and displayed.

Existing information extraction methods based on web page structure are accurate but relatively poorly automated. The present method aims to improve the degree of automation and the recall of information extraction while maintaining a certain extraction accuracy. It proposes cluster analysis of all web pages returned for the query sequence, which improves recall. It also proposes extracting content from the different types of clustered web pages according to different information extraction methods, which improves the degree of automation; and because specific extraction rules are applied to each specific class of web pages, extraction accuracy is also improved to some extent.

An information extraction device based on a multistage rule base includes a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base;

The information extraction device based on a multistage rule base of the present invention provides a hardware foundation for the information extraction flow; it is inexpensive and suitable for large-scale use.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical scheme of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical scheme of the present invention may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims

1. An information extraction method based on a multistage rule base, characterized in that the method comprises the following steps:

2. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that "related" in step 1) means identical or similar to the keyword.

3. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the download method in step 2) is a web crawler download method.

4. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that, in the web page preprocessing of step 3), the specific method for obtaining the web page tree structure diagram is:

5. The information extraction method based on a multistage rule base as claimed in claim 4, characterized in that in step 3-2) the XML-compliant text is parsed using the DOM4j or jdom toolkit.

6. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the specific method for generating the clustering rules in step 4) is:

7. The information extraction method based on a multistage rule base as claimed in claim 6, characterized in that the specific calculation formulas for step 4-1) and step 4-2) are:

sim(h_{i}, h_{j}) = \left( \frac{\sum_{k=1}^{pn(h_{i})} sim(p_{ik}, bp(p_{ik}))}{pn(h_{i})} + \frac{\sum_{k=1}^{pn(h_{j})} sim(p_{jk}, bp(p_{jk}))}{pn(h_{j})} \right) \div 2

Where h_{i} denotes the set of all tree paths of web page i; p_{ik} is one tree path in h_{i}; bp(p_{ik}) denotes the best matching path of p_{ik} in h_{j}, and likewise bp(p_{jk}) denotes the best matching path of p_{jk} in h_{i}; sim(h_{i}, h_{j}) denotes the similarity of the two web pages; pn(h_{i}) denotes the total number of tree paths in h_{i}, and pn(h_{j}) the total number of tree paths in h_{j}. The range of the web page structure similarity is [0, 1]; the closer the value is to 1, the more similar the structures of the two web pages;

d_{avg}(c_{i}, c_{j}) = \frac{1}{n_{i} n_{j}} \sum_{p \in c_{i}} \sum_{p' \in c_{j}} | p - p' |

8. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the XSLT rules in step 5) are obtained from a template web page by a rule generation algorithm, whose input is the parent node of the information block and whose output is an XSLT rule.

9. A device for carrying out information extraction by the method of any one of claims 1 to 8, characterized in that the device includes a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base.