CN110471645A

CN110471645A - Template-based adaptive Web page data extraction method and system

Info

Publication number: CN110471645A
Application number: CN201810436651.3A
Authority: CN
Inventors: 李艳霞; 刘鹏; 刘学
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2019-11-19

Abstract

The invention discloses a template-based adaptive Web page data extraction method and system. The method comprises: step 1) establishing a data extraction template library containing several templates; step 2) fetching the HTML source code of a Web page and constructing the page's DOM tree; step 3) extracting the Web page URL and matching it in turn against the templates in the data extraction template library; if matching succeeds, selecting the template with the highest matching degree as the optimal data extraction template and going to step 5); otherwise, going to step 4); step 4) formulating a new template from the data that failed to match, adding it to the data extraction template library, and going to step 3); step 5) extracting data according to the optimal data extraction template; if the extraction succeeds completely, the extraction ends; otherwise, going to step 6); step 6) adaptively modifying the optimal data extraction template and then extracting data. The method can reduce manual intervention in the data extraction process.

Description

Template-based adaptive Web page data extraction method and system

Technical field

The present invention relates to the technical field of adaptive data extraction from Web pages, and in particular to a template-based adaptive Web page data extraction method and system.

Background technique

Web data extraction is an important step in Web data mining. It extracts semi-structured data from Web pages according to some method and saves it in a structured format, for example as an XML file or in a database. Traditional Web data extraction methods mostly target one specific class of information source and consist mainly of a series of predefined extraction rules and the code that executes those rules. They do not make full use of the structural features of page data and they impose requirements on page structure: if the page structure changes dynamically, data cannot be extracted accurately and the extraction fails.

Web data extraction techniques can be divided into techniques based on the page DOM structure, techniques based on statistical theory, and techniques based on page visual features. Of these, techniques based on the page DOM structure are the most widely applied.

Current research based on the page DOM structure mostly focuses on derivation for specific pages: the instance paths in the tree that correspond to data objects are generated from the features of a certain class of Web pages. Such approaches cannot adapt when the page structure changes; even a very small change still requires manual analysis and modification.

Summary of the invention

The object of the invention is to overcome a defect of existing research based on the page DOM structure, namely that it mostly derives, for specific pages, the instance paths in the tree that correspond to data objects from the features of a certain class of Web pages, cannot adapt when the page structure changes, and still requires manual analysis and modification even when the change is very small. To that end, the invention provides an adaptive Web page data extraction method and system based on extraction templates.

To achieve the above object, the present invention provides an adaptive Web page data extraction method based on extraction templates, the method comprising:

Step 1) establishing a data extraction template library containing several data extraction templates;

Step 2) fetching the HTML source code of a Web page and constructing the page's DOM tree from it;

Step 3) extracting the Web page URL and matching it in turn against the templates in the data extraction template library; if matching succeeds, selecting the template with the highest matching degree as the optimal data extraction template and going to step 5); otherwise, going to step 4);

Step 4) formulating a new template from the data that failed to match, adding it to the data extraction template library, and going to step 3);

Step 5) extracting data according to the optimal data extraction template; if the extraction succeeds completely, the extraction ends; otherwise, going to step 6);

Step 6) adaptively modifying the optimal data extraction template and then extracting data, after which the extraction ends.
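The control flow of steps 3) to 6) can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `run_extraction` and the five callables passed to it (`match`, `build_template`, `extract`, `adapt`, plus the `library` list) are hypothetical placeholders for the operations that the steps describe, and the single retry after creating a new template is an assumption made to keep the sketch from looping forever.

```python
from typing import Any, Callable, List, Optional, Tuple

Template = Any  # hypothetical: any object that represents one extraction template


def run_extraction(
    url: str,
    dom: Any,
    library: List[Template],
    match: Callable[[str, Template], Optional[float]],
    build_template: Callable[[str, Any], Template],
    extract: Callable[[Any, Template], Tuple[dict, bool]],
    adapt: Callable[[Any, Template], Template],
) -> dict:
    """Run steps 3) to 6) for one page whose DOM tree is already built."""
    best = None
    for _attempt in range(2):  # assumption: retry step 3) once after step 4)
        # Step 3): `match` returns a matching degree, or None when matching fails.
        scored = [(match(url, t), t) for t in library]
        scored = [(s, t) for s, t in scored if s is not None]
        if scored:
            best = max(scored, key=lambda pair: pair[0])[1]
            break
        # Step 4): nothing matched, so add a new template and go back to step 3).
        library.append(build_template(url, dom))
    if best is None:
        raise RuntimeError("no template matched, even after adding a new one")

    # Step 5): extract with the optimal template.
    data, complete = extract(dom, best)
    if complete:
        return data

    # Step 6): adaptively modify the template, then extract again.
    data, _ = extract(dom, adapt(dom, best))
    return data
```

Passing the operations in as callables keeps the sketch independent of any particular template format or DOM library.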

As an improvement of the above method, the data extraction template of step 1) comprises an address block and a data block, wherein the address block comprises <site>, which indicates the website from which data is extracted, and <url>, which indicates the address of the page from which data is extracted; the data block comprises <xpaths>, which indicates the set of XPath path expressions for the page data to be extracted, and <rule>, which indicates the data search rules;

<data> indicates the data to be extracted and consists of multiple <node> tags. Within a <node> tag, <nodeId> indicates the identifier of the extracted data and <title> indicates the meaning of the extracted data. Within a <rule> tag, <keyword> indicates the keyword rule, <tag> indicates the HTML tag rule, and <context> indicates the context rule and contains two tags, <content> and <distance>, which are respectively the context and its distance from the current node; <font> contains two tags, <color> and <size>, which are respectively the font color and the font size.

As an improvement of the above method, the data search rules include a keyword search rule, an HTML tag search rule, and a contextual search rule;

The keyword search rule is: if the text information corresponding to the target data is unique within the Web page, that text information is added to the corresponding <keyword> tag in the template as the keyword rule. The keyword relevance d_key(n_txt, m_key) is defined as:

where n_txt is the text information corresponding to the data of a node in the DOM tree, and m_key is the value of the corresponding <keyword> tag in the template;

The HTML tag search rule is: if the HTML tag information corresponding to the target data is distinctive within the Web page, that HTML tag information is added to the corresponding <tag> tag in the template as the HTML tag rule. The HTML tag relevance d_tag(n_tag, m_tag) is defined as:

where n_tag is the HTML tag information corresponding to the data of a node in the DOM tree, |n_tag| is the number of times n_tag occurs in the DOM tree, and m_tag is the value of the corresponding <tag> tag in the template;

The contextual search rule is: if the data to be extracted is not easy to search for but has a context that is easy to search for, the search for the target data can be converted into a search for its context. Once the context is found, the target data is located from the position of the context. The context relevance d_com(n_dist, m_dist) is defined as:

where n_dist is the distance between the data of a node in the DOM tree and the corresponding context, and m_dist is the value of the corresponding <distance> tag in the template.

As an improvement of the above method, step 2) specifically comprises:

Step 2-1) fetching the HTML source code of the Web page;

Step 2-2) normalizing the fetched HTML source code so that it complies with Web standards, using semantic tags so that structure, presentation, and behavior are separated;

Step 2-3) constructing the page's DOM tree based on the data structure of DOM nodes.

As an improvement of the above method, the matching degree of step 3) is calculated as:

where w is the Web page, m is the data extraction template, url_w is the URL of the Web page, and url_m is the data of the <url> tag in the data extraction template; S(url_w) denotes the set of strings obtained by splitting the corresponding URL on "/", |S(url_w) ∩ S(url_m)| denotes the length of the part common to url_w and url_m, and |min(S(url_w), S(url_m))| denotes the length of the smaller of the two sets.
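The formula for R(w, m) is not reproduced in this text. One reading that is consistent with the two quantities defined above, offered here only as an assumption, is the ratio of the number of shared "/"-separated strings to the size of the smaller string set:

```latex
R(w, m) = \frac{\lvert S(url_w) \cap S(url_m) \rvert}{\min\bigl(\lvert S(url_w) \rvert,\ \lvert S(url_m) \rvert\bigr)}
```

Under this reading R(w, m) lies between 0 and 1 and equals 1 when every string of the shorter URL also occurs in the other URL.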

As an improvement of the above method, a successful match in step 3) means that the domain name of the Web page URL is identical to the <site> tag data in the data extraction template and the matching degree is greater than a specified threshold t, i.e. R(w, m) > t.

As an improvement of the above method, step 6) is specifically:

According to the target data search rules, the evaluation values of the nodes in the DOM tree are calculated in bottom-up order. If a node whose evaluation value is greater than the search threshold is encountered, the search succeeds: the node data is taken as the target data, and the XPath expression of the node is added to the XPath queue of the corresponding data item in the template.

As an improvement of the above method, the target data search rules include a keyword search rule, an HTML tag search rule, and a contextual search rule.

A template-based adaptive Web page data extraction system comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the above method when executing the program.

The advantages of the present invention are as follows:

The method of the invention not only defines corresponding extraction rules when formulating an extraction template, but also defines adaptive search rules according to the text features, HTML tag features, contextual features, and visual features of the page data. It can effectively improve extraction efficiency and effectively reduce manual intervention in the extraction process.

Brief description of the drawings

Fig. 1 is a flow chart of the template-based adaptive Web page data extraction method of the invention.

Specific embodiment

The invention is now described in detail with reference to the drawing.

The invention performs template-based adaptive Web data extraction, so the template library must be prepared before template matching is carried out. The details of a data extraction template are as follows:

Here, <site> indicates the website from which data is extracted, <url> indicates the address of the page from which data is extracted, and <data> indicates the data to be extracted and consists of multiple <node> tags. Within a <node> tag, <nodeId> indicates the identifier of the extracted data, <title> indicates the meaning of the extracted data, <xpaths> indicates the set of XPath path expressions for the page data to be extracted, and <rule> indicates the data search rules. Within a <rule> tag, <keyword> indicates the keyword rule, <tag> indicates the HTML tag rule, and <context> indicates the context rule and contains two tags, <content> and <distance>, which are respectively the context and its distance from the current node; <font> contains two tags, <color> and <size>, which are respectively the font color and the font size.

As shown in Fig. 1, the present invention provides a template-based adaptive Web page data extraction method, the method comprising:

Step 1) establishing a data extraction template library;

Each data extraction template consists mainly of two parts, an address block and a data block. The address block contains <site>, which indicates the website from which data is extracted, and <url>, which indicates the address of the page from which data is extracted. The data block contains <xpaths>, which indicates the set of XPath path expressions for the page data to be extracted, and <rule>, which indicates the data search rules.

<site> indicates the website from which data is extracted, <url> indicates the address of the page from which data is extracted, and <data> indicates the data to be extracted and consists of multiple <node> tags;

Within a <node> tag, <nodeId> indicates the identifier of the extracted data, <title> indicates the meaning of the extracted data, <xpaths> indicates the set of XPath path expressions for the page data to be extracted, and <rule> indicates the data search rules;

Within a <rule> tag, <keyword> indicates the keyword rule, <tag> indicates the HTML tag rule, and <context> indicates the context rule and contains two tags, <content> and <distance>, which are respectively the context and its distance from the current node; <font> contains two tags, <color> and <size>, which are respectively the font color and the font size.
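The template listing itself is not reproduced in this text. The sketch below shows one way a template with the tags described above might be written and loaded. The tag names `<site>`, `<url>`, `<data>`, `<node>`, `<nodeId>`, `<title>`, `<xpaths>`, `<rule>`, `<keyword>`, `<tag>`, `<context>`, `<content>`, `<distance>`, `<font>`, `<color>`, and `<size>` come from the description; the `<template>` root element, the `<xpath>` child element, the exact nesting, all sample values, and the `load_template` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical template; the nesting and the values are illustrative only.
TEMPLATE_XML = """
<template>
  <site>example.com</site>
  <url>http://example.com/news/2018/item.html</url>
  <data>
    <node>
      <nodeId>1</nodeId>
      <title>headline</title>
      <xpaths>
        <xpath>/html/body/div[1]/h1</xpath>
      </xpaths>
      <rule>
        <keyword>Headline</keyword>
        <tag>h1</tag>
        <context>
          <content>Published</content>
          <distance>1</distance>
        </context>
        <font>
          <color>#000000</color>
          <size>24</size>
        </font>
      </rule>
    </node>
  </data>
</template>
"""


def load_template(xml_text):
    """Parse one template into plain dictionaries (address block + data block)."""
    root = ET.fromstring(xml_text.strip())
    nodes = []
    for node in root.iterfind("data/node"):
        nodes.append({
            "nodeId": node.findtext("nodeId"),
            "title": node.findtext("title"),
            # The XPath queue for this data item.
            "xpaths": [x.text for x in node.iterfind("xpaths/xpath")],
            "keyword": node.findtext("rule/keyword"),
            "tag": node.findtext("rule/tag"),
            "context": node.findtext("rule/context/content"),
            "distance": int(node.findtext("rule/context/distance", default="0")),
        })
    return {
        "site": root.findtext("site"),
        "url": root.findtext("url"),
        "nodes": nodes,
    }
```

Keeping `xpaths` as a list mirrors the "XPath queue" that the adaptive modification step appends to.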

The data search rules of the data block include a keyword search rule, an HTML tag search rule, and a contextual search rule, wherein:

The keyword search rule is: if the text information corresponding to the target data is unique within the Web page, that text information is added to the corresponding <keyword> tag in the template as the keyword rule. The keyword relevance d_key(n_txt, m_key) may be defined as:

The HTML tag search rule is: if the HTML tag information corresponding to the target data is distinctive within the Web page, that HTML tag information is added to the corresponding <tag> tag in the template as the HTML tag rule. The HTML tag relevance d_tag(n_tag, m_tag) is defined as:

The contextual search rule is: if the data to be extracted is not easy to search for but has a context that is easy to search for, the search for the target data can be converted into a search for its context. Once the context is found, the target data is located from the position of the context. The context relevance may be defined as:

Step 2) fetching the HTML source code of the Web page to be extracted, normalizing it, and constructing the page's DOM tree; this specifically comprises:

Step 2-1) writing a function that fetches page source code, and fetching the HTML source code of the Web page;

Step 2-2) normalizing the fetched HTML code so that it complies with Web standards, using semantic tags so that structure, presentation, and behavior are separated;

Step 2-3) studying the data structure of DOM nodes and constructing the page's DOM tree.
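A minimal sketch of step 2-3) using only the Python standard library is given below. It is an illustration under stated assumptions, not the patent's DOM builder: the `Node` and `TreeBuilder` classes and the `build_dom` function are hypothetical names, fetching the page (step 2-1) is omitted, and the normalization of step 2-2) is reduced to tolerating void elements and unmatched end tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM node: tag name, parent link, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_dom(html):
    """Build a simple DOM tree from an HTML string and return its root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A production system would more likely rely on a full HTML5 parser, since real pages often contain markup that this simple builder only partly repairs.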

Step 3) extracting the Web page URL and matching it in turn against the templates in the data extraction template library; if matching succeeds, selecting the template with the highest matching degree as the optimal data extraction template and going to step 4); if template matching is unsuccessful, going to step 6);

Similar pages under one website are usually generated from the same page template and are very similar in appearance, content distribution, and layout structure. Their defining trait is that the trunk structure of their DOM trees is identical; only the data filling the leaf nodes differs. The matching degree between the page to be extracted and an extraction template is obtained by calculating the similarity between the URL of the page to be extracted and the <url> tag data in the extraction template.

The matching degree R(w, m) may be defined as:

where w is the page to be extracted, m is the extraction template, url_w is the URL of the page to be extracted, and url_m is the data of the <url> tag in the extraction template; S(url_w) denotes the set of strings obtained by splitting the corresponding URL on "/", |S(url_w) ∩ S(url_m)| denotes the length of the part common to url_w and url_m, and |min(S(url_w), S(url_m))| denotes the length of the smaller of the two sets;

When the domain name of the URL of the page to be extracted is identical to the <site> tag data in the data extraction template and the matching degree is greater than the specified threshold t, i.e. R(w, m) > t, the page to be extracted matches the data extraction template successfully.
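Template matching and selection can be sketched as below. Because the formula for R(w, m) is not reproduced in this text, the ratio used here (shared "/"-separated strings divided by the size of the smaller set) is an assumed reading of the definitions above. The function names, the dictionary layout of a template, the default threshold of 0.5, and the use of the URL host name as the "domain name" are likewise assumptions.

```python
from urllib.parse import urlsplit


def url_segments(url):
    """S(url): the set of non-empty strings obtained by splitting on '/'."""
    return {part for part in url.split("/") if part}


def matching_degree(page_url, template_url):
    """Assumed R(w, m): shared segments over the size of the smaller set."""
    s_w, s_m = url_segments(page_url), url_segments(template_url)
    smaller = min(len(s_w), len(s_m))
    return len(s_w & s_m) / smaller if smaller else 0.0


def select_template(page_url, templates, t=0.5):
    """Return the matching template with the highest degree, or None.

    A template matches when its `site` equals the page's host name and
    the matching degree exceeds the threshold t (assumed default 0.5).
    """
    host = urlsplit(page_url).hostname
    scored = [
        (matching_degree(page_url, m["url"]), m)
        for m in templates
        if m["site"] == host
    ]
    scored = [(r, m) for r, m in scored if r > t]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

For example, `http://example.com/news/2018/a.html` and a template URL `http://example.com/news/2018/b.html` share four of five segments, which gives a degree of 0.8 under this reading.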

Step 4) extracting data according to the optimal data extraction template; if the extraction succeeds completely, the extraction ends; otherwise, going to step 5);

Step 5) adaptively modifying the optimal data extraction template: according to the search rules in the template, the evaluation values of the nodes in the DOM tree are calculated in bottom-up order; if a node whose evaluation value is greater than the search threshold is encountered, the search succeeds, the node data is taken as the target data, and the XPath expression of the node is added to the XPath queue of the corresponding data item in the template; data is then extracted, and the extraction ends;

This step mainly evaluates each node of the DOM tree, calculating whether the data on each node can be extracted successfully. The evaluation value then determines whether the template needs small-scale modification.
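The bottom-up search of this step might look like the sketch below. It is an assumed interpretation: "bottom-up" is implemented as a post-order traversal (children before parents), the evaluation function is passed in as the hypothetical callable `score`, and the `Node` class, `xpath_of`, `bottom_up`, and `adapt_item` are illustrative names that do not appear in the source.

```python
class Node:
    """Minimal DOM node for the sketch: tag, direct text, children, parent."""

    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def xpath_of(node):
    """Build a positional XPath such as /html/body[1]/div[2] for a node."""
    parts = []
    while node.parent is not None:
        same_tag = [c for c in node.parent.children if c.tag == node.tag]
        parts.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = node.parent
    parts.append(node.tag)  # the root has no positional index
    return "/" + "/".join(reversed(parts))


def bottom_up(node):
    """Yield nodes in post-order, so leaves come before their ancestors."""
    for child in node.children:
        yield from bottom_up(child)
    yield node


def adapt_item(root, item, score, threshold):
    """Search for the target of one data item and extend its XPath queue.

    Returns the node text on success, or None if no node beats the threshold.
    """
    for node in bottom_up(root):
        if score(node, item) > threshold:
            item["xpaths"].append(xpath_of(node))
            return node.text
    return None
```

Stopping at the first node above the threshold follows the wording "if a node ... is encountered, the search succeeds"; keeping the best-scoring node instead would be an equally plausible design.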

The three rules of target data search are:

If the text information corresponding to the target data is unique within the Web page, that text information is added to the corresponding <keyword> tag in the template as the keyword rule. The keyword relevance d_key(n_txt, m_key) may be defined as:

where n_txt is the text information corresponding to the data of a node in the DOM tree, and m_key is the value of the corresponding <keyword> tag in the template;

If the HTML tag information corresponding to the target data is distinctive within the Web page, that HTML tag information is added to the corresponding <tag> tag in the template as the HTML tag rule. The HTML tag relevance d_tag(n_tag, m_tag) may be defined as:

where n_tag is the HTML tag information corresponding to the data of a node in the DOM tree, |n_tag| is the number of times n_tag occurs in the DOM tree, and m_tag is the value of the corresponding <tag> tag in the template;

Context denotes fixed information near the target data in the page. The visual features of Web pages show that the information in a page is divided into blocks by semantics, and that semantically similar information is visually close together in the page. If the data to be extracted is not easy to search for but has a context that is easy to search for, the search for the target data can be converted into a search for its context. Once the context is found, the target data is located from the position of the context. The context relevance may be defined as:

where n_dist is the distance between the data of a node in the DOM tree and the corresponding context, and m_dist is the value of the corresponding <distance> tag in the template.
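The three relevance formulas are not reproduced in this text, so the functions below are purely illustrative stand-ins that use the same inputs as the definitions above. Every formula in this sketch is an assumption: a containment test for the keyword, the reciprocal of the tag's occurrence count for the HTML tag (so rarer tags score higher), and a score that decays with the difference between the observed and the expected distance for the context.

```python
def keyword_relevance(n_txt, m_key):
    """Assumed d_key: 1.0 if the template keyword occurs in the node text."""
    return 1.0 if m_key and m_key in n_txt else 0.0


def tag_relevance(n_tag, tag_count, m_tag):
    """Assumed d_tag: 1 / |n_tag| when the tags agree, else 0.0.

    tag_count is |n_tag|, the number of times n_tag occurs in the DOM tree.
    """
    if n_tag != m_tag or tag_count <= 0:
        return 0.0
    return 1.0 / tag_count


def context_relevance(n_dist, m_dist):
    """Assumed d_com: 1.0 when the distances agree, decaying as they differ."""
    return 1.0 / (1.0 + abs(n_dist - m_dist))
```

How the three values are combined into a single node evaluation value is not stated in the text, so no combination is shown here.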

Step 6) formulating a new template from the data that failed to match, adding it to the template library, and going to step 2).

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent replacements of the technical solution of the invention that do not depart from its spirit and scope shall all be covered by the scope of the claims of the invention.

Claims

1. A template-based adaptive Web page data extraction method, the method comprising:

Step 3) extracting the Web page URL and matching it in turn against the templates in the data extraction template library; if matching succeeds, selecting the template with the highest matching degree as the optimal data extraction template and going to step 5); otherwise, going to step 4);

Step 5) extracting data according to the optimal data extraction template; if the extraction succeeds completely, the extraction ends; otherwise, going to step 6);

2. The template-based adaptive Web page data extraction method according to claim 1, characterized in that the data extraction template of step 1) comprises an address block and a data block, wherein the address block comprises <site>, which indicates the website from which data is extracted, and <url>, which indicates the address of the page from which data is extracted; the data block comprises <xpaths>, which indicates the set of XPath path expressions for the page data to be extracted, and <rule>, which indicates the data search rules;

<data> indicates the data to be extracted and consists of multiple <node> tags; within a <node> tag, <nodeId> indicates the identifier of the extracted data and <title> indicates the meaning of the extracted data; within a <rule> tag, <keyword> indicates the keyword rule, <tag> indicates the HTML tag rule, and <context> indicates the context rule and contains two tags, <content> and <distance>, which are respectively the context and its distance from the current node; <font> contains two tags, <color> and <size>, which are respectively the font color and the font size.

3. The template-based adaptive Web page data extraction method according to claim 2, characterized in that the data search rules include a keyword search rule, an HTML tag search rule, and a contextual search rule;

The keyword search rule is: if the text information corresponding to the target data is unique within the Web page, that text information is added to the corresponding <keyword> tag in the template as the keyword rule; the keyword relevance d_key(n_txt, m_key) is defined as:

where n_txt is the text information corresponding to the data of a node in the DOM tree, and m_key is the value of the corresponding <keyword> tag in the template;

The HTML tag search rule is: if the HTML tag information corresponding to the target data is distinctive within the Web page, that HTML tag information is added to the corresponding <tag> tag in the template as the HTML tag rule; the HTML tag relevance d_tag(n_tag, m_tag) is defined as:

where n_tag is the HTML tag information corresponding to the data of a node in the DOM tree, |n_tag| is the number of times n_tag occurs in the DOM tree, and m_tag is the value of the corresponding <tag> tag in the template;

The contextual search rule is: if the data to be extracted is not easy to search for but has a context that is easy to search for, the search for the target data can be converted into a search for its context; once the context is found, the target data is located from the position of the context; the context relevance d_com(n_dist, m_dist) is defined as:

4. The template-based adaptive Web page data extraction method according to claim 1, characterized in that step 2) specifically comprises:

Step 2-1) fetching the HTML source code of the Web page;

Step 2-2) normalizing the fetched HTML source code so that it complies with Web standards, using semantic tags so that structure, presentation, and behavior are separated;

Step 2-3) constructing the page's DOM tree based on the data structure of DOM nodes.

5. The template-based adaptive Web page data extraction method according to claim 1 or 4, characterized in that the matching degree of step 3) is calculated as:

where w is the Web page, m is the data extraction template, url_w is the URL of the Web page, and url_m is the data of the <url> tag in the data extraction template; S(url_w) denotes the set of strings obtained by splitting the corresponding URL on "/", |S(url_w) ∩ S(url_m)| denotes the length of the part common to url_w and url_m, and |min(S(url_w), S(url_m))| denotes the length of the smaller of the two sets.

6. The template-based adaptive Web page data extraction method according to claim 5, characterized in that a successful match in step 3) means that the domain name of the Web page URL is identical to the <site> tag data in the data extraction template and the matching degree is greater than a specified threshold t, i.e. R(w, m) > t.

7. The template-based adaptive Web page data extraction method according to claim 1, characterized in that step 6) is specifically:

According to the target data search rules, the evaluation values of the nodes in the DOM tree are calculated in bottom-up order; if a node whose evaluation value is greater than the search threshold is encountered, the search succeeds, the node data is taken as the target data, and the XPath expression of the node is added to the XPath queue of the corresponding data item in the template.

8. The template-based adaptive Web page data extraction method according to claim 7, characterized in that the target data search rules include a keyword search rule, an HTML tag search rule, and a contextual search rule.

9. A template-based adaptive Web page data extraction system, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when executing the program.