CN103559199A

CN103559199A - Web information extraction method and web information extraction device

Info

Publication number: CN103559199A
Application number: CN201310455343.2A
Authority: CN
Inventors: 刘旭东; 孙海龙; 周子龙; 张日崇
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2013-09-29
Filing date: 2013-09-29
Publication date: 2014-02-05
Anticipated expiration: 2033-09-29
Also published as: CN103559199B

Abstract

The invention provides a web information extraction method and a web information extraction device. The web information extraction method includes: correspondingly structuring multiple first DOM trees according to multiple sample web pages with attributes marked, and structuring a decision-making tree according to the first DOM trees; correspondingly structuring multiple second DOM trees according to multiple sample web pages with attributes unmarked, and optimizing the decision-making tree according to the second DOM trees; extracting structured information of to-be-extracted web pages according to the optimized decision-making tree; wherein the sample web pages with the attributes marked as well as the sample web pages with the attributes unmarked and the to-be-extracted web pages belong to the same field. Due to the fact that the decision-making tree is structured according to the sample web pages with the attributes marked and optimized according to the sample web pages with the attributes unmarked, and structuring and optimizing of the decision-making tree do not just rely on layout styles of the web pages, the optimized decision-making tree is applicable to web information extraction of various layout styles in the same field. Therefore, the web information extraction method and the web information extraction device are timesaving and laborsaving.

Description

Webpage information extraction method and device

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for extracting webpage information.

Background

Web page Information Extraction (Web ie), that is, taking a Web page as an Information source, and extracting target Information from the Information source. Most of data on a webpage are described by using a semi-structured Hypertext Markup Language (HTML), but due to lack of description of the data, an application program cannot directly analyze and utilize massive information on the webpage, and great waste of resources is caused. The purpose of extracting the webpage information is to extract target information hidden in a semi-structured HTML (hypertext markup language) page, express the target information in a more structured and semantic clear form, and provide convenience for a user to inquire data in the webpage and an application program to directly utilize the data in the webpage.

In the face of massive and varied data, the conventional webpage information extraction method is mainly a manual template configuration method based on webpage source codes, and the method is used for extracting information by manually configuring different templates for webpages with different layouts and matching the corresponding templates with the webpages to be extracted.

However, the prior art has the following defects: even for sites in the same field, due to the fact that the number of webpages is large, and the layout style is diversified and diversified, a large number of templates need to be configured manually, and the configured templates need to be changed frequently to ensure the accuracy of information extraction, and time and labor are wasted.

Disclosure of Invention

The invention provides a method and a device for extracting webpage information, which are used for solving the problems that time and labor are wasted when a manual template configuration method is adopted to extract webpage information for sites in the same field in the prior art.

In one aspect, the present invention provides a method for extracting web page information, including:

constructing a plurality of corresponding DOM trees of the first document object type according to the sample webpages with labeled attributes, and constructing a decision tree according to the DOM trees;

constructing a plurality of corresponding second DOM trees according to the sample webpages without the marked attributes, and optimizing the decision tree according to the plurality of second DOM trees;

extracting the structural information of the webpage to be extracted according to the optimized decision tree;

the plurality of sample web pages with labeled attributes, the plurality of sample web pages without labeled attributes and the web page to be extracted belong to the same field of web pages.

In another aspect, the present invention provides a web page information extraction apparatus, including:

the construction module is used for constructing a plurality of corresponding DOM trees of the first document object type according to the sample webpages with the labeled attributes and constructing a decision tree according to the DOM trees;

the optimization module is used for constructing a plurality of corresponding second DOM trees according to the plurality of sample webpages without the marked attributes and optimizing the decision tree according to the plurality of second DOM trees;

the extraction module is used for extracting the structural information of the webpage to be extracted according to the optimized decision tree;

According to the webpage information extraction method provided by the invention, the decision tree is constructed according to a plurality of sample webpages with labeled attributes, and the decision tree is optimized according to a plurality of sample webpages without labeled attributes.

Drawings

Fig. 1 is a schematic flowchart of an embodiment of a method for extracting web page information according to the present invention;

FIG. 2 is a schematic flowchart of a webpage information extraction method according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of a DOM tree node hierarchical clustering process;

FIG. 4 is a schematic diagram of a decision tree structure;

fig. 5 is a schematic structural diagram of an embodiment of a web page information extraction apparatus provided in the present invention.

Detailed Description

The technical solution of the present invention is further described in detail by the following specific embodiments and the accompanying drawings.

Fig. 1 is a schematic flowchart of an embodiment of a web page information extraction method provided by the present invention. As shown in fig. 1, the method may specifically include:

s101, constructing a plurality of corresponding DOM trees of the first document object type according to the sample webpages marked with the node attributes, and constructing a decision tree according to the DOM trees.

Specifically, the internet is a huge knowledge base, and a large amount of valuable information is contained in the web pages. The internet contains a number of areas, such as: news, shopping, social, forums, blogs, etc. With the rapid development of internet technology, the functions in the web pages become stronger, and the websites in various fields have the characteristics of multiple information categories and large information amount. The different domains correspond to respective attributes, such as: the news webpage contains attributes such as title, main body, user comment, related news and the like; the shopping webpage contains attributes such as commodity names, prices, parameters, user comments, related commodities and the like.

The sample web pages marked with the attributes are manually marked with the attributes, namely, a small number of sample web pages are manually marked with the attributes such as titles, subjects, user comments, related news and the like for news web pages. And constructing a first DOM tree according to the sample webpage with the labeled attributes, wherein the attribute of each node in the first DOM tree can be determined according to the labeled attributes of the sample webpage with the labeled attributes.

The web page is described by a semi-structured HTML language, and due to the loose nature of the HTML language and the various habits of web page developers, the format of the HTML page is very disordered, and many web pages have HTML coding problems, such as: improper label placement, lack of a closed label, non-escaping special characters, etc. In the process of building the DOM tree, the HTML code needs to be cleaned: (1) the tags with errors are corrected so that the tags are used in pairs, i.e. there is a start tag and an end tag (denoted by < > and </> respectively). (2) And eliminating useless labels and contents. For example: < | is an HTML annotation; < style > </style > is an internal style text; < script > </script > is a JavaScript script. The tag < head >. the tag < head > contains no text content and can be culled. The DOM tree is also known as a tag tree.

The decision tree is constructed according to the constructed first DOM tree, and specifically, the decision tree can be constructed by using various existing algorithms, such as C4.5 algorithm, CLS algorithm, ID3 algorithm, CHAID algorithm, CART algorithm or FACT algorithm. Decision trees are a typical classification method, where data is first processed, readable rules and decisions are generated using an inductive algorithm, and then new data is analyzed using the decisions. In essence, a decision tree is a process of classifying data through a series of rules. The decision tree has the advantages that: the reasoning process is easy to understand, and the decision reasoning process can be expressed in the form of If Then; the reasoning process completely depends on the value characteristics of the attribute variables; attribute variables which do not contribute to the target variable can be automatically ignored, and reference is provided for judging the importance of the attribute variables and reducing the number of the variables.

S102, constructing a plurality of corresponding second DOM trees according to the sample webpages without the marked node attributes, and optimizing the decision tree according to the plurality of second DOM trees.

Specifically, a second DOM tree is constructed according to a large number of sample webpages without labeled attributes, and the attribute of each node in the second DOM tree is uncertain. The process of building the DOM tree is described in relation to step S101. And judging the classification of each node in the second DOM tree according to the characteristics of each node in the second DOM tree and the characteristics of each node in the first DOM tree, and adding each node in the second DOM tree to the decision tree so as to dynamically optimize the decision tree.

And S103, extracting the structural information of the webpage to be extracted according to the optimized decision tree.

Specifically, according to the optimized decision tree, extracting the structural information of the webpage to be extracted according to the attributes.

It should be noted that, in this embodiment, the sample web page with labeled attributes, the sample web page without labeled attributes, and the web page to be extracted belong to the same field.

With the continuous development and maturity of various technologies in the web2.0 era, Mashup mixed as a new internet application development mode attracts the attention of numerous application developers. More and more enterprises or individuals choose to use the existing external resources to complete the construction of the application in a combined processing mode, so that the core business value is efficiently realized, flexible and changeable market conditions are quickly adapted, and scene application programs meeting the respective requirements are built. However, most of the content sources of Mashup Application integration are third-party open data and Application Programming Interfaces (APIs), which have some limitations in practical Application, and the capability of integrating web page information is insufficient, so that the requirement of personalized web page information integration by a user cannot be met.

By the webpage information extraction method described in this embodiment, a large number of webpages can be used as data service sources, webpage structural information corresponding to domain attributes can be accurately extracted from webpages in different domains, and the rich multisource webpage structural information sources are aggregated through mashup application to form a comprehensive webpage information integration service, so that personalized requirements of different users can be better met.

According to the webpage information extraction method provided by the embodiment, the decision tree is constructed according to the plurality of sample webpages with labeled attributes, and the decision tree is optimized according to the plurality of sample webpages without labeled attributes.

Fig. 2 is a schematic flowchart of a webpage information extraction method according to another embodiment of the present invention. As shown in fig. 2, the method may specifically include:

s201, constructing a plurality of corresponding DOM trees of the first document object type according to the sample webpages marked with the attributes.

Specifically, refer to the related description in step 101, and are not described herein again.

S202, performing hierarchical clustering on each node of each first DOM tree according to the layout characteristics of each node of each first DOM tree.

In particular, different nodes in the DOM tree have different characteristics with which the nodes can be distinguished from multiple perspectives. The node features are divided into two types, namely page-level and site-level, from the aspect of obtaining the node features, the page-level features (such as layout features and character features) can be obtained only through a single webpage, and the site-level features (such as semantic features) need to be obtained through comprehensive comparison and analysis of a plurality of webpages of the same site.

The following detailed description is made for the layout features, character features, and semantic features, respectively:

1. layout features

The layout features belong to page-level features and can be directly obtained through DOM nodes.

(1) DOM path (path)

DOM path refers to the path from the root to the node, which accurately points to the node's position in the DOM tree.

(2) Label (R)

There are many types of tags, and different tags have different roles. For example: < div > refers to a block; < p > refers to a paragraph; < ul > and < li > are used in combination to refer to a list.

(3) Label attributes

The label attribute is as follows: < div id = "yom-ad-LREC" class = "yom-adyom-ad-LREC" >, id and class are two attributes of the div tag, where the class attribute value is related to CSS, which determines how the font, color, background color, etc. of the node is displayed in the web page.

2. Character features

The character features belong to page-level features and can be directly obtained through the content of DOM nodes.

(1) Node content

Node content refers to the sum of all the textual content of the node itself and the node children.

(2) Anchor Text specific gravity (the probability of the Anchor Text, ATP for short)

ATP_iIs node N_iNode content anchor text character number and node N_iNode N and the total number of characters in the node contents of_iThe ratio of the sum of the numbers of the labels is shown in formula (1):

{ATP}_{i} = \frac{{ATC}_{i}}{C_{i} + T_{i}} - - - (1)

wherein, C_iIs node N_iTotal number of characters in the node contents, T_iIs node N_iNumber of medium tags, i.e. ATC_iIs node N_iThe number of anchor text characters in the node contents of (1).

(3) Number of words

The number of words is the number of words obtained by segmenting the contents of the node by using a delimiter such as {',.

(4) Ending punctuation

The end punctuation refers to the punctuation mark at the end of the respective text content of the node itself and all the child nodes.

(5) Total number of punctuation marks

The total number of punctuation marks refers to the total number of various punctuation marks in the content of the node.

3. Semantic features

The semantic features are site-level features and can be obtained only by comparing and analyzing a plurality of webpages of the same site.

(1) Content Rate of Change (CV)

In the same site, the node contents of the nodes with the same DOM path and node attribute in different webpages may be the same, may be similar, and may be completely different. The content change rate CV is a ratio of the number of different node contents in the class node to the total number of compared nodes in a plurality of web pages at the same site, as shown in formula (2):

CV = \frac{UE}{CMP} - - - (2)

wherein, the UE is the number of nodes with different node contents, and CMP is the total number of compared nodes.

(2) Suffix and suffix

Suffix and prefix refer to a sub-string of a string starting position (prefix) and ending position (suffix).

(3) Subtitle with small title

A subtitle refers to a prefix-like text in a matrix block in a web page that indicates the nature of the content of the matrix block.

In this step, the layout characteristics of the nodes are adopted, and split hierarchical clustering is performed on each node in the first DOM tree according to the layout similarity, specifically:

the webpage structure is divided through the tags and tag attributes of all nodes of the DOM tree, and various webpage layouts and visual effects are presented. Each layer of nodes in the DOM tree has only one clustering opportunity, breadth-first traversal is performed from top to bottom of a root node, and the hierarchical clustering of splitting is performed by taking the similarity of the layout characteristics of all nodes as a main measurement index, so that the clustering with the maximum intra-class similarity, namely the minimum intra-class similarity, and the maximum inter-class similarity, namely the maximum inter-class dispersion is finally generated. The method specifically comprises the following steps:

s2021, generating a first node layout feature vector corresponding to each node of each first DOM tree according to the layout features of each node of each first DOM tree;

s2022, calculating a first intra-class dissimilarity and a first inter-class dispersion of each node of each first DOM tree according to the first node layout feature vector corresponding to each node of each first DOM tree;

s2023, according to the first intra-class dissimilarity and the first inter-class dispersion, performing hierarchical clustering on each node of each first DOM tree under the optimization conditions that the first intra-class dissimilarity is minimum and the first inter-class dispersion is maximum.

Specifically, assume that a node is located at a certain level of the first DOM tree, and is classified into c (c < = a), each node is a p-dimensional vector and represents p layout features for node clustering: { label, label attribute, number of child node layers, number of child nodes in layer 1, number of child node labels in layer 1, number of child nodes in layer 2, and label of child node in layer 2.

Assuming that the best cluster is C { C1, C2.. Cc }, each cluster Ci contains n_iA node (i =1, 2.. c). The layout similarity between any two nodes xi and xj is measured by the absolute distance d (xi, xj), as shown in equation (3):

1. degree of intra-class dissimilarity:

in class Ci, the average distance of any node to the other ni-1 nodes is Dj,

and if the average value of each node Dj is obtained, the node dissimilarity Ri in the Ci class is as follows:

<math> <mrow> <msub> <mi>R</mi> <mi>i</mi> </msub> <mo>=</mo> <mfrac> <mn>1</mn> <msub> <mi>n</mi> <mi>i</mi> </msub> </mfrac> <munderover> <mi>Σ</mi> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <msub> <mi>n</mi> <mi>i</mi> </msub> </munderover> <mo>|</mo> <msub> <mi>D</mi> <mi>j</mi> </msub> <mo>-</mo> <msub> <mover> <mi>d</mi> <mo>&OverBar;</mo> </mover> <msub> <mi>n</mi> <mi>i</mi> </msub> </msub> <mo>|</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>4</mn> <mo>)</mo> </mrow> </mrow> </math>

wherein,

<math> <mrow> <msub> <mi>D</mi> <mi>j</mi> </msub> <mo>=</mo> <mfrac> <mn>1</mn> <mrow> <msub> <mi>n</mi> <mi>i</mi> </msub> <mo>-</mo> <mn>1</mn> </mrow> </mfrac> <munder> <mi>Σ</mi> <mrow> <mi>x</mi> <mo>&Element;</mo> <msub> <mi>C</mi> <mi>i</mi> </msub> </mrow> </munder> <mi>d</mi> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>j</mi> </msub> <mo>,</mo> <mi>x</mi> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>5</mn> <mo>)</mo> </mrow> </mrow> </math>

<math> <mrow> <msub> <mover> <mi>d</mi> <mo>&OverBar;</mo> </mover> <msub> <mi>n</mi> <mi>i</mi> </msub> </msub> <mo>=</mo> <mfrac> <mn>1</mn> <msub> <mi>n</mi> <mi>i</mi> </msub> </mfrac> <munderover> <mi>Σ</mi> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <msub> <mi>n</mi> <mi>i</mi> </msub> </munderover> <msub> <mi>D</mi> <mi>j</mi> </msub> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>6</mn> <mo>)</mo> </mrow> </mrow> </math>

2. degree of inter-class dispersion

The inter-class dispersion Ti between the Ci class and other c-1 classes is measured by the layout distance of two nodes with the closest distance between the two classes:

3. index of clustering performance

For good clustering, the more similar the samples within the same class, the better, and the smaller the sample similarity between different classes, the better. Therefore, the smaller the intra-class dissimilarity, the greater the inter-class dispersion, and the better the clustering result. The clustering performance index P is:

fig. 3 is a schematic diagram of a DOM tree node hierarchical clustering process, and as shown in fig. 3, the split hierarchical clustering starts from the root node body, and is judged layer by layer from top to bottom, and c corresponding to the layer is taken to cluster when the clustering performance index P is maximum. Li6 is grouped into a class C2.2.1, for example.

S203, determining a first candidate node of each first DOM tree according to the clustering result.

If all child nodes corresponding to the node are gathered into one type, determining the node as a first candidate node; and if the node has no corresponding child node and a node which belongs to the same father node as the node but is not clustered with the node into one class exists, determining the node as a first candidate node.

Specifically, as shown in fig. 3, if the same layer fails to be aggregated into a node of the same class, there will be no association between their child nodes; if all nodes under a certain branch are grouped into one type, taking the branch node as a candidate node, for example, all child nodes under C2.2 are grouped into one type, and therefore, taking C2.2 as the candidate node; if there is no corresponding child node for a node, and there is a node that belongs to the same parent node as the node but is not clustered with the node, then the node is taken as a candidate node, e.g., C1.1 without child nodes therebelow, and there is a node C1.2 that belongs to the same parent node C1 but is not clustered with the node, so that the node C1.1 is taken as a candidate node. The candidate nodes finally determined in fig. 3 are: c1.1, C1.2, C2.2, C2.1.2, C2.1.3, C2.3.1, C2.1.1.1, and C2.1.1.2.

Optionally, after the candidate nodes are determined, the homologous noise reduction processing may be performed on the first candidate node, so as to remove the noise node in the first candidate node.

Particularly, with the rapid development of internet technology, noise in web pages is increasingly abundant. In addition to traditional navigation, copyright notices, fixed advertisements, a great deal of dynamic noise is now present in web pages, such as: hot spot recommendations for websites and dynamically generated advertisements. The existing webpage denoising method usually eliminates anchor texts as noise, and is easy to cause the mistaken elimination of valuable anchor text information such as related news, commodity recommendation and the like. In order to ensure the accuracy of webpage information extraction, a DOM tree candidate node homologous noise reduction method is provided for eliminating the noise nodes, and the specific process is as follows:

s2031, three pages (pageA 1/pageA 2/pageB) are obtained, and the three pages are all from the same template at the same site. pageA1 is a page obtained from a different request for the same news than pageA2, and pageB is another news page than a.

S2032, adopting breadth-first traversal, selecting candidate nodes with the same DOM path and label attributes in the three pages, and calculating the node contents.

S2033, judging and denoising according to conditions: for pageA1, pageA2, and pageB, if the A node content is equal to the B node content, then the node is a noise node; for pageA1 and pageA2, an A1 node is a noise node if the node content is not equal to the A2 node content.

It should be noted here that the process of manually labeling the node attribute may also be performed after the candidate node is determined.

S204, counting the multiple characteristics of each first candidate node, and constructing a decision tree according to the counting result so as to classify each first candidate node, wherein the category number of the classification result is the same as that of the attribute of the first candidate node.

Wherein the plurality of features includes layout features and at least one of the following features: character features and semantic features.

Specifically, n features F such as layout, characters and semantics of each candidate node₁，F₂，......，F_nCounting is carried out, each candidate node corresponds to an n-dimensional feature vector X = [ X ]₁，x₂，......，x_n]. Based on the statistical results, the decision tree can be constructed by using various existing algorithms, such as the C4.5 algorithm, the CLS algorithm, the ID3 algorithm, the CHAID algorithm, the CART algorithm, or the FACT algorithm. Fig. 4 is a schematic structural diagram of a decision tree, and as shown in fig. 4, if m attribute classifications exist in the field of the sample, the classification result of the decision tree in the first step is m classifications, each leaf in the decision tree corresponds to one classification, one classification corresponds to a node set, and each node in the set is also an n-dimensional vector. Wherein V1, V2, V3 and V4 are classification feature rules, C₁，C₂，......，C_mIs the classification result.

Among them, the C4.5 algorithm was developed by Quinlan in 1993 with further optimization improvement based on the ID3 algorithm. The C4.5 algorithm is a classification decision tree algorithm in the machine learning algorithm, which selects a decision rule with an information gain rate, and the core algorithm thereof is the ID3 algorithm. The method inherits all advantages of the ID3 algorithm, adds functions of discretization of continuous data, processing of unknown rules, rule generation and the like on the basis of the ID3, and overcomes the defects of the ID3 algorithm.

(1) Information gain

In the decision tree classification problem, the information gain is the difference between the information of the decision tree before and after the decision tree is subjected to the rule selection partition. The larger the certain rule information Gain, the larger the amount of information provided for classification, and the smaller the uncertainty for classification after selection, the information Gain (S, F) is as shown in equation (9):

<math> <mrow> <mfenced open='' close=''> <mtable> <mtr> <mtd> <mi>Gain</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>,</mo> <mi>F</mi> <mo>)</mo> </mrow> <mo>=</mo> <mi>Rntropy</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>)</mo> </mrow> <mo>-</mo> <mi>ExpectedEntropy</mi> <mrow> <mo>(</mo> <msub> <mi>S</mi> <mi>F</mi> </msub> <mo>)</mo> </mrow> </mtd> </mtr> <mtr> <mtd> <mo>=</mo> <mo>-</mo> <munderover> <mi>Σ</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>m</mi> </munderover> <msub> <mi>p</mi> <mi>i</mi> </msub> <msub> <mi>log</mi> <mn>2</mn> </msub> <mrow> <mo>(</mo> <msub> <mi>p</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <munder> <mi>Σ</mi> <mrow> <mi>v</mi> <mo>&Element;</mo> <mi>VofF</mi> </mrow> </munder> <mi>p</mi> <mrow> <mo>(</mo> <mi>v</mi> <mo>)</mo> </mrow> <msub> <mi>p</mi> <mi>vj</mi> </msub> <msub> <mi>log</mi> <mn>2</mn> </msub> <mrow> <mo>(</mo> <msub> <mi>p</mi> <mi>vj</mi> </msub> <mo>)</mo> </mrow> </mtd> </mtr> </mtable> </mfenced> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>9</mn> <mo>)</mo> </mrow> </mrow> </math>

wherein, Encopy (S) is the original entropy of the classification system, ExpectedEncopy (SF) is the condition after the feature F is fixedEntropy. m represents the number of classes in the sample set, p_iRepresenting the probability of the ith class, dividing a sample set S into V subsets according to V different values of a certain characteristic, wherein the V subsets are S1, S2_vjRepresenting the probability of the jth class in SV.

(2) Splitting information:

the splitting information is used for measuring the breadth and uniformity of feature rule splitting data, and actually is the entropy of S on each value of the test feature. The Split information Split (S, F) is shown in equation (10):

<math> <mrow> <mi>Split</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>,</mo> <mi>F</mi> <mo>)</mo> </mrow> <mo>=</mo> <mo>-</mo> <munder> <mi>Σ</mi> <mrow> <mi>v</mi> <mo>&Element;</mo> <mi>V</mi> </mrow> </munder> <mfrac> <mrow> <mo>|</mo> <msub> <mi>S</mi> <mi>v</mi> </msub> <mo>|</mo> </mrow> <mrow> <mo>|</mo> <mi>S</mi> <mo>|</mo> </mrow> </mfrac> <mo>*</mo> <msub> <mi>log</mi> <mn>2</mn> </msub> <mrow> <mo>(</mo> <mfrac> <mrow> <mo>|</mo> <msub> <mi>S</mi> <mi>v</mi> </msub> <mo>|</mo> </mrow> <mrow> <mo>|</mo> <mi>S</mi> <mo>|</mo> </mrow> </mfrac> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>10</mn> <mo>)</mo> </mrow> </mrow> </math>

(3) information gain ratio

The information gain ratio represents the proportion of useful information produced by the decision tree branches. Thus, the larger this value, the more useful information a branch contains. F is characterized, S is a sample set, and the information Gain ratio is based on the information Gain (S, F) and the Split information Split (S, F), as shown in equation (11):

GainRatio (F, S) = \frac{Gain (S, F)}{Split (S, F)} - - - (11)

s205, constructing a plurality of corresponding second DOM trees according to the sample webpages without the marked attributes.

Specifically, a second DOM tree is constructed according to a large number of sample webpages which are not subjected to attribute manual labeling, and the attribute of each node in the first DOM tree is uncertain. The process of building the DOM tree is described in step 101, and is not described herein.

S206, performing hierarchical clustering on each node of each second DOM tree according to the layout characteristics of each node of each second DOM tree. The method specifically comprises the following steps:

s2061, generating second node layout feature vectors corresponding to the nodes of the second DOM trees according to the layout features of the nodes of the second DOM trees;

s2062, calculating the second intra-class dissimilarity and the second inter-class divergence of each node of each second DOM tree according to the second node layout feature vector corresponding to each node of each second DOM tree;

s2063, according to the dissimilarity degree in the second class and the dispersion degree between the second classes, carrying out hierarchical clustering on each node of each second DOM tree under the condition that the dissimilarity degree in the first class is minimum and the dispersion degree between the first classes is maximum.

S207, determining a second candidate node of each second DOM tree according to the clustering result.

If all the child nodes corresponding to the node are gathered into one type, determining the node as a second candidate node; and if the node has no corresponding child node and a node which belongs to the same father node as the node but is not clustered with the node into one class exists, determining the node as a second candidate node.

In steps S206, S2061 to S2063 and S207, reference is made to the relevant descriptions in steps S202, S2021 to S2023 and S203, respectively, and the description is omitted here for brevity.

Optionally, after the second candidate node is determined, the second candidate node is subjected to homologous denoising processing, and a noise node in the second candidate node is removed. For a specific process, refer to the above description about removing noise nodes by the DOM tree candidate node homologous noise reduction method, which is not described herein again.

And sequentially carrying out the following operations on each second candidate node:

s208, counting a plurality of characteristics of the second candidate node to be classified, judging the classification of the classification result of the second candidate node according to the counting result, and adding the second candidate node into the corresponding classification in the decision tree according to the judgment result.

Specifically, the plurality of characteristics counted for the second candidate node to be classified are the same as the plurality of characteristics counted for the first candidate node. Such as layout, character, and semantic features.

The method for judging the classification result of the second candidate node according to the statistical result specifically comprises the following steps:

s2081, generating a second candidate node feature vector corresponding to the second candidate node according to the statistical result;

s2082, calculating a central node feature vector corresponding to a central node of each category of the classification result, wherein the central node feature vector is an arithmetic average value of first candidate node feature vectors corresponding to all first candidate nodes in the category and second candidate node feature vectors corresponding to all second candidate nodes;

s2083, judging the classification of the classification result of the second candidate node by adopting a weighted minimum distance classifier algorithm according to the second candidate node feature vector corresponding to the second candidate node and the center node feature vector corresponding to each center node; wherein each weight in the weighted minimum distance classifier algorithm is an information gain rate of each of the plurality of features.

Specifically, in the constructed decision tree, each classification corresponds to a node set, each node in the set corresponds to an n-dimensional vector, and the kth class C is calculated_k(k = 1.... m) the arithmetic mean of the feature vectors of the first candidate nodes corresponding to all the first candidate nodes and the feature vectors of the second candidate nodes corresponding to all the second candidate nodes, and the calculation result is the arithmetic mean representing the category C_kN-dimensional vector C corresponding to the central node of (k =1,.. m)_k=[C_k1，C_k2，......，C_kn]。

The second candidate node X to be classified also corresponds to an n-dimensional eigenvector X = [ X ]₁，X₂，......，X_n]By calculating it with each C_k=[C_k1，C_k2，......，C_kn]A distance D (X, C) therebetween_k) Then, X is judged to belong to the category closest to the X. Since the features used for data classification have different roles in practical applications, a weighted distance formula is used for calculating the minimum distance, as shown in formula (12):

D ({X, C}_{k}) = \sqrt{w_{1} {(x_{1} - C_{k 1})}^{2} + w_{2} {(x_{2} - C_{k 2})}^{2} + . . . + w_{n} {(x_{n} - C_{kn})}^{2}} - - - (12)

wherein, the weight value w_i(i =1,... n) represents the feature F_i(i =1,..., n) is important for the minimum distance classification, as the magnitude of the classification impact is reasonably chosen. For the decision tree constructed by C4.5 algorithm, since the information gain rate is used to select the classification rule, it will be specific hereSign F_iThe information gain rate of (i =1,,...., n) is taken as its corresponding weight w_i（i=1，，......，n）。

After completing the classification of a second candidate node, the following steps are performed to dynamically optimize the decision tree:

s209, the information gain ratio of each of the plurality of features is recalculated.

If the information gain rate and the ranking of each feature in the plurality of features are changed, reconstructing a decision tree, and updating each weight in the weighted minimum distance classifier algorithm and a central node feature vector corresponding to a central node of each category of the classification result;

if the information gain rate of each feature in the plurality of features changes but the ranking does not change, updating each weight in the weighted minimum distance classifier algorithm and a central node feature vector corresponding to a central node of each category of the classification result;

if the information gain rate of each feature in the plurality of features is converged, recalculating the feature vector of the central node corresponding to the central node of each category of the classification result, and if the feature vector of the central node corresponding to the central node of each category of the classification result is not changed, stopping counting the plurality of features of the next second candidate node.

Specifically, because fewer labeled sample pages are adopted when the decision tree is constructed, the constructed decision tree may have larger deviation and cannot well distinguish each candidate node, so that after each second candidate node is classified, the information gain rate of each feature needs to be recalculated, dynamic optimization is performed according to the size and ranking condition of the information gain rate of each feature, and the decision tree is dynamically optimized and adjusted by using the information gain rate along with the increase of nodes in each classification set until the information gain rate of each feature converges and each C has a smaller number of labeled sample pages_kAnd is not changed. The specific process is as follows:

inputting: sample set with unlabeled attributes

And (3) outputting: optimal decision tree

01: while (sample set is not empty)

02: selecting a sample page, calculating its candidate nodes and features

03: judging the category of each candidate node by using a minimum distance classifier

04: calculating the information gain rate of each feature F1, F2, … …, Fn

05: if (all information gain rate and rank change)

06: generating a new decision tree

07: updating each weight of the minimum distance classifier

08: else if (the gain rate of each information is just a change in magnitude)

09: updating each weight of the minimum distance classifier

10: else (judging whether optimization is finished)

11: if (convergence of information gain ratio of each feature)&&Each C_kInvariable)

12: after the optimization is finished, outputting an optimal decision tree

13：end if

14：end if

15：end while

16: if (not getting the optimal decision tree)

17: adding sample pages, jumping to step 1, and continuing optimization

18：end if

And S210, extracting the structural information of the webpage to be extracted according to the optimized decision tree.

Fig. 5 is a schematic structural diagram of an embodiment of a web page information extraction apparatus provided in the present invention. As shown in fig. 5, the web page information extraction apparatus in this embodiment is a specific main body for executing the web page information extraction method, and may specifically include: a construction module 51, an optimization module 52 and an extraction module 53, wherein:

the building module 51 is configured to build a plurality of corresponding DOM trees of the first document object type according to the plurality of sample webpages with labeled attributes, and build a decision tree according to the plurality of DOM trees;

specifically, the attribute-labeled sample web pages are manually labeled on a small number of sample web pages, for example, for news web pages, the attributes such as titles, subjects, user comments, and related news are manually labeled. The building module 51 builds a first DOM tree according to the sample web page with the labeled attributes, and the attribute of each node in the first DOM tree can be determined according to the labeled attributes of the sample web page with the labeled attributes.

During the construction of the DOM tree, the construction module 51 needs to clean the HTML code: (1) the tags with errors are corrected so that the tags are used in pairs, i.e. there is a start tag and an end tag (denoted by < > and </> respectively). (2) And eliminating useless labels and contents. For example: < | is an HTML annotation; < style > </style > is an internal style text; < script > </script > is a JavaScript script. The tag < head >. the tag < head > contains no text content and can be culled. The DOM tree is also known as a tag tree.

The decision tree is constructed according to the constructed first DOM tree, and specifically, the decision tree can be constructed by using various existing algorithms, such as C4.5 algorithm, CLS algorithm, ID3 algorithm, CHAID algorithm, CART algorithm or FACT algorithm.

The optimization module 52 is configured to construct a plurality of corresponding second DOM trees according to the plurality of sample webpages without attributes labeled, and optimize the decision tree according to the plurality of second DOM trees;

specifically, the optimization module 52 constructs a second DOM tree according to a large number of sample web pages without labeled attributes, where the attribute of each node in the second DOM tree is uncertain. The process of the optimization module 52 building the DOM tree refers to the relevant description of the DOM tree built by the building module 51. The optimization module 52 determines the attribute of each node in the second DOM tree according to the characteristics of each node in the second DOM tree and the characteristics of each node in the first DOM tree, and adds each node in the second DOM tree to the decision tree to dynamically optimize the decision tree.

And the extraction module 53 is configured to extract the structural information of the web page to be extracted according to the optimized decision tree.

Specifically, the extraction module 53 extracts the structured information of the web page to be extracted according to the attributes according to the optimized decision tree.

Further, the building module 51 may be specifically configured to:

and carrying out hierarchical clustering on each node of each first DOM tree according to the layout characteristics of each node of each first DOM tree. Specifically, the method comprises the following steps:

generating a first node layout feature vector corresponding to each node of each first DOM tree according to the layout features of each node of each first DOM tree; calculating the first intra-class dissimilarity and the first inter-class dispersion of each node of each first DOM tree according to the first node layout characteristic vector corresponding to each node of each first DOM tree; and performing hierarchical clustering on each node of each first DOM tree according to the first intra-class dissimilarity and the first inter-class dispersion under the condition of the minimum intra-class dissimilarity and the maximum first inter-class dispersion.

Assuming that the best cluster is C { C1, C2.. Cc }, each cluster Ci contains n_iA node (i =1, 2.. c). According toAnd (3) to (8) judging layer by layer from top to bottom from the root node of the DOM tree, and clustering c corresponding to the layer when the clustering performance index P is maximum.

Further, the building module 51 may be specifically configured to:

and determining a first candidate node of each first DOM tree according to the clustering result.

Optionally, after the candidate node is determined, the constructing module 51 may be further configured to:

and carrying out homologous noise reduction processing on the first candidate node, and removing the noise node in the first candidate node. Specifically, the method comprises the following steps:

three pages (pageA 1/pageA 2/pageB) are obtained, and all the three pages come from the same site and the same template. pageA1 is a page obtained from a different request for the same news than pageA2, and pageB is another news page than a.

And adopting breadth-first traversal, selecting candidate nodes with the same DOM path and label attributes in the three pages, and calculating the node contents.

And (3) judging and denoising according to conditions: for pageA1, pageA2, and pageB, if the A node content is equal to the B node content, then the node is a noise node; for pageA1 and pageA2, an A1 node is a noise node if the node content is not equal to the A2 node content.

Further, the building module 51 may be specifically configured to:

and counting the multiple characteristics of each first candidate node, and constructing a decision tree according to the counting result so as to classify each first candidate node, wherein the category number of the classification result is the same as that of the attribute of the first candidate node.

Specifically, n features F such as layout, characters and semantics of each candidate node₁，F₂，......，F_nCounting is carried out, each candidate node corresponds to an n-dimensional feature vector X = [ X ]₁，x₂，......，x_n]. Based on the statistical results, the decision tree can be constructed by using various existing algorithms, such as the C4.5 algorithm, the CLS algorithm, the ID3 algorithm, the CHAID algorithm, the CART algorithm, or the FACT algorithm. If the field of the sample is m attribute classifications, the classification result of the decision tree in the first step is m classifications, each leaf in the decision tree corresponds to one classification, one classification corresponds to a node set, and each node in the set is also an n-dimensional vector.

Further, the optimization module 52 may be specifically configured to:

and performing hierarchical clustering on each node of each second DOM tree according to the layout characteristics of each node of each second DOM tree. Specifically, the method comprises the following steps:

generating a second node layout feature vector corresponding to each node of each second DOM tree according to the layout features of each node of each second DOM tree;

calculating the dissimilarity degree in the second class and the dispersion degree between the second classes of the nodes of each second DOM tree according to the second node layout characteristic vector corresponding to each node of each second DOM tree;

and performing hierarchical clustering on each node of each second DOM tree according to the second intra-class dissimilarity and the second inter-class dispersion under the optimization conditions of the minimum intra-class dissimilarity and the maximum inter-class dispersion.

Specifically, the process of performing hierarchical clustering on each node of each second DOM tree by the optimization module 52 may refer to the related description of the process of performing hierarchical clustering on each node of each first DOM tree by the construction module 51, which is not described herein again.

Further, the optimization module 52 may be specifically configured to:

and determining a second candidate node of each second DOM tree according to the clustering result.

Specifically, the process of determining the second candidate node of each second DOM tree by the optimization module 52 may refer to the related description of the process of determining the first candidate node of each first DOM tree by the construction module 51, and is not described herein again.

Optionally, after the second candidate node is determined, the optimization module 52 may be further configured to:

and carrying out homologous noise reduction processing on the second candidate node, and removing the noise node in the second candidate node. For a specific process, refer to the description of the building module 51 about removing noise nodes by the DOM tree candidate node homologous noise reduction method, which is not described herein again.

Further, the optimization module 52 may be specifically configured to:

and sequentially carrying out the following operations on each second candidate node: and counting the plurality of characteristics of the second candidate node, judging the category of the classification result of the second candidate node according to the counting result, and adding the second candidate node into the corresponding category in the decision tree according to the judgment result.

Specifically, the plurality of characteristics counted by the optimization module 52 for the second candidate node are the same as the plurality of characteristics counted by the construction module for the first candidate node. Such as layout, character, and semantic features.

Further, the optimization module 52 may be specifically configured to:

generating a second candidate node feature vector corresponding to the second candidate node according to the statistical result; calculating a central node feature vector corresponding to a central node of each category of the classification result, wherein the central node feature vector is an arithmetic average value of first candidate node feature vectors corresponding to all first candidate nodes and second candidate node feature vectors corresponding to all second candidate nodes in the category; judging the category of the classification result of the second candidate node by adopting a weighted minimum distance classifier algorithm according to the second candidate node feature vector corresponding to the second candidate node and the central node feature vector corresponding to each central node; wherein each weight in the weighted minimum distance classifier algorithm is an information gain rate of each of the plurality of features.

Specifically, in the constructed decision tree, each classification corresponds to a node set, each node in the set corresponds to an n-dimensional vector, and the optimization module 52 calculates the kth class C_k(k = 1.... m) the arithmetic mean of the feature vectors of the first candidate nodes corresponding to all the first candidate nodes and the feature vectors of the second candidate nodes corresponding to all the second candidate nodes, and the calculation result is the arithmetic mean representing the category C_kN-dimensional vector C corresponding to the central node of (k =1,.. m)_k=[C_k1，C_k2，......，C_kn]。

The second candidate node X to be classified also corresponds to an n-dimensional eigenvector X = [ X ]₁，X₂，......，X_n]The optimization module 52 calculates this and each C using equation (12)_k=[C_k1，C_k2，......，C_kn]A distance D (X, C) therebetween_k) Then, X is judged to belong to the category closest to the X.

After completing the classification of a second candidate node, optimization module 52 may be further configured to:

recalculating the information gain rate of each of the plurality of features, if the information gain rate and the ranking of each of the plurality of features are changed, reconstructing a decision tree, and updating each weight in the weighted minimum distance classifier algorithm and a central node feature vector corresponding to a central node of each category of the classification result; or,

if the information gain rate of each feature in the plurality of features changes but the ranking does not change, updating each weight in the weighted minimum distance classifier algorithm and a central node feature vector corresponding to a central node of each category of the classification result; or,

Specifically, because the labeled sample pages used in the construction of the decision tree are fewer, the constructed decision tree may have a larger deviation and cannot distinguish each candidate node well, and therefore, after each classification of a second candidate node is completed, the optimization module 52 needs to recalculate the information gain rate of each feature, perform dynamic optimization according to the magnitude and ranking condition of the information gain rate of each feature, and adjust the decision tree by using the information gain rate dynamic optimization with the increase of nodes in each classification set until the information gain rate of each feature converges and each C has a smaller number of labeled sample pages_kAnd is not changed.

By the webpage information extraction device described in this embodiment, a large number of webpages can be used as data service sources, webpage structural information corresponding to domain attributes can be accurately extracted from webpages in different domains, and the rich multisource webpage structural information sources are aggregated through mashup application to form comprehensive webpage information integration service, so that personalized requirements of different users can be better met.

The web page information extraction device provided by the embodiment constructs the decision tree according to the plurality of sample web pages with labeled attributes, and optimizes the decision tree according to the plurality of sample web pages without labeled attributes.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for extracting web page information is characterized by comprising the following steps:

the plurality of sample web pages with labeled attributes, the plurality of sample web pages without labeled attributes and the web page to be extracted belong to the same field.

2. The method of claim 1, wherein said building a decision tree from said plurality of first DOM trees comprises:

performing hierarchical clustering on each node of each first DOM tree according to the layout characteristics of each node of each first DOM tree, and determining a first candidate node of each first DOM tree according to a clustering result;

and counting a plurality of characteristics of each first candidate node, and constructing the decision tree according to a counting result so as to classify each first candidate node, wherein the category number of the classification result is the same as that of the attribute of the first candidate node.

3. The method of claim 2, wherein optimizing the decision tree according to the plurality of second DOM trees comprises:

performing hierarchical clustering on each node of each second DOM tree according to the layout characteristics of each node of each second DOM tree, and determining a second candidate node of each second DOM tree according to a clustering result;

sequentially performing the following operations on each second candidate node: and counting the plurality of characteristics of the second candidate node, judging the category of the classification result to which the second candidate node belongs according to the counting result, and adding the second candidate node to the corresponding category in the decision tree according to the judging result.

4. The method according to claim 3, wherein said hierarchically clustering the nodes of each of the first DOM trees according to the layout characteristics of the nodes of each of the first DOM trees comprises:

generating a first node layout feature vector corresponding to each node of each first DOM tree according to the layout features of each node of each first DOM tree;

calculating the first intra-class dissimilarity and the first inter-class dispersion of each node of each first DOM tree according to the first node layout feature vector corresponding to each node of each first DOM tree;

performing hierarchical clustering on each node of each first DOM tree according to the first intra-class dissimilarity and the first inter-class dispersion under the condition that the first intra-class dissimilarity is minimum and the first inter-class dispersion is maximum;

the hierarchical clustering of the nodes of each second DOM tree according to the layout characteristics of the nodes of each second DOM tree comprises:

calculating the second intra-class dissimilarity and the second inter-class dispersion of each node of each second DOM tree according to the second node layout feature vector corresponding to each node of each second DOM tree;

and according to the second intra-class dissimilarity and the second inter-class dispersion, performing hierarchical clustering on each node of each second DOM tree under the condition that the second intra-class dissimilarity is minimum and the second inter-class dispersion is maximum.

5. The method according to claim 3, wherein said determining a first candidate node of each of said first DOM trees according to the clustering result comprises:

if all child nodes corresponding to the node are gathered into one type, determining the node as the first candidate node; or,

if the node has no corresponding child node and a node which belongs to the same father node as the node but does not belong to the same class as the node exists, determining the node as the first candidate node;

the determining a second candidate node of each second DOM tree according to the clustering result includes:

if all the child nodes corresponding to the node are gathered into one type, determining the node as the second candidate node; or,

and if the node has no corresponding child node and a node which belongs to the same father node as the node but does not form a cluster with the node exists, determining the node as the second candidate node.

6. The method of claim 3, wherein the plurality of features includes the layout feature and at least one of the following: character features and semantic features.

7. The method of claim 3, wherein the determining the classification of the classification result to which the second candidate node belongs according to the statistical result comprises:

generating a second candidate node feature vector corresponding to the second candidate node according to the statistical result;

calculating a central node feature vector corresponding to a central node of each category of the classification result, wherein the central node feature vector is an arithmetic average value of first candidate node feature vectors corresponding to all the first candidate nodes and second candidate node feature vectors corresponding to all the second candidate nodes in the category;

judging the category of the classification result to which the second candidate node belongs by adopting a weighted minimum distance classifier algorithm according to the second candidate node feature vector corresponding to the second candidate node and the center node feature vector corresponding to each center node; wherein each weight in the weighted minimum distance classifier algorithm is an information gain rate of each of the plurality of features.

8. The method of claim 7, wherein after adding the second candidate node to the corresponding category in the decision tree according to the determination result, further comprising:

recalculating the information gain rate of each of the plurality of features, if the information gain rate and the ranking of each of the plurality of features are changed, reconstructing the decision tree, and updating each weight in the weighted minimum distance classifier algorithm and a central node feature vector corresponding to a central node of each category of the classification result; or,

if the information gain rate of each feature in the plurality of features is converged, recalculating a central node feature vector corresponding to the central node of each category of the classification result, and if the central node feature vector corresponding to the central node of each category of the classification result is not changed, stopping counting the plurality of features of the next second candidate node.

9. The method of claim 3, wherein prior to counting the plurality of features of each of the first candidate nodes, further comprising:

carrying out homologous noise reduction processing on the first candidate node, and removing a noise node in the first candidate node;

before the performing the following operations on each second candidate node in turn, the method further includes:

and carrying out homologous noise reduction processing on the second candidate node, and removing the noise node in the second candidate node.

10. An apparatus for extracting web page information, comprising: