CN114528459A - Semantic-based webpage information extraction method and system - Google Patents


Info

Publication number
CN114528459A
CN114528459A (application CN202210044347.0A)
Authority
CN
China
Prior art keywords
nodes
target
information
skeleton
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210044347.0A
Other languages
Chinese (zh)
Inventor
郭岩
王之威
刘杨昊
刘悦
薛源海
俞晓明
沈华伟
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210044347.0A priority Critical patent/CN114528459A/en
Publication of CN114528459A publication Critical patent/CN114528459A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/90: Details of database functions independent of the retrieved data types
                        • G06F16/95: Retrieval from the web
                            • G06F16/951: Indexing; Web crawling techniques
                            • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
                                • G06F16/986: Document structures and storage, e.g. HTML extensions
                    • G06F16/30: Information retrieval of unstructured textual data
                        • G06F16/35: Clustering; Classification
                            • G06F16/353: Clustering; Classification into predefined classes
                • G06F40/00: Handling natural language data
                    • G06F40/10: Text processing
                        • G06F40/12: Use of codes for handling textual entities
                            • G06F40/14: Tree-structured documents
                                • G06F40/146: Coding or compression of tree-structured data
                    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a semantic-based web page information extraction method, comprising the following steps: acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page, and extracting the web page information contained in the target information tree. The invention also provides a semantic-based web page information extraction system and a data processing device for realizing the semantic-based web page information extraction.

Description

Semantic-based webpage information extraction method and system
Technical Field
The invention belongs to the technical field of network information, and particularly relates to a method and a system for extracting webpage information.
Background
We have now fully entered the internet era, and every day tens of thousands of pieces of information of various types can be seen, distributed across different web pages. Among the different types of information in a web page there is much invalid, redundant information that is useless to the reader. The content of an original web page is therefore usually put through a second round of information processing to obtain useful structured information; such effective structured data has great application value, and the whole process is called web page information extraction.
Which information in a web page is key information is determined by the specific task. Taking a news page as an example, a reader generally pays attention to specific information such as the news content, publication time, author, comments and pictures, while other interfering information in the page, such as advertisements, copyright statements, entry links to other pages and visual effects, is usually not the focus of attention. Text information in a web page, especially body text such as news text and blog text, is of great significance for many practical application scenarios, and deeper mining and analysis can be performed on this data; for example, it can serve as an input source for downstream tasks such as public opinion analysis and related recommendation. The input data of such downstream tasks is in a structured, more standard data format, so a method that efficiently extracts valuable structured information from the semi-structured data of web pages has extremely important research significance and value.
Existing web page information extraction methods can be divided into two categories: rule-based methods and machine-learning-based methods. Rule-based methods are varied and can be further classified into static-template-based methods, wrapper induction methods, automatic template generation methods and heuristic methods.
Rule-based methods mostly generate a web page template or extraction rules, either by manual design or by automatically computing rule templates from similar web pages, and then extract with the template or rules. Static-template-based methods require templates to be written manually; different templates must be written for different web pages, a template becomes invalid once the page is updated, maintenance cost is high, and the user must have some programming ability. Wrapper induction methods mainly induce an extraction template for the corresponding website from a set of user-labeled web pages, and rely on manual labeling. Automatic template generation methods assume that the set of pages to be extracted is generated from one template, analyze structurally similar pages and generate the template automatically; they require no prior knowledge of the pages, but template generation is inefficient and inaccurate for complicated pages. Heuristic methods mostly search for similar repeated strings in the DOM tree to locate data regions, analyze structural features automatically, and extract according to heuristic information. Machine-learning-based methods usually treat the HTML source code of a web page as a sequence, manually screen features, and label the sequence with a machine learning model for extraction. Rule-based methods require different rules or templates for different web pages, and web page updates cause rule failure.
Machine-learning-based methods model with manually selected features, which cannot accurately express the basic content and theme of a web page, so these methods cannot extract arbitrary web pages.
Both rule-based and machine-learning methods face the following challenges. On one hand, web page layouts are highly varied and the combinations of pictures and text differ greatly, so it is difficult to find an extraction rule or model that suits all web pages; on the other hand, the structure of a web page may change continually as the website is updated, and the statistical characteristics of the page change with it, so extraction rules or models that previously performed well often fail after the page structure is updated.
Given these two challenges, it is necessary to find the characteristics of a web page that do not change with layout, architecture and the like, and to model the essence of the web page, namely its semantics. The essence of the existing methods is to compute over manually selected web page features; these features consider only the statistical properties of the text or tags, cannot learn word semantics from context, and cannot express sentence semantics, so these methods cannot adapt to structural changes of a web page and extraction fails.
Disclosure of Invention
In order to solve the above problems, the present invention provides a semantic-based web page information extraction method, which comprises: acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page, and extracting the web page information contained in the target information tree.
The webpage information extraction method adopts a BERT pre-training language model as the classification model.
The web page information extraction method further comprises a step of offline learning for the BERT pre-training language model: parse a known web page into a known DOM tree, obtain the known skeleton nodes of the known DOM tree, and label all known skeleton nodes, marking the known skeleton nodes related to a downstream task as known core information nodes and the known skeleton nodes unrelated to the downstream task as non-core information nodes; split each known skeleton node by sentence to obtain known skeleton sub-nodes, whose labels are the same as the labels of the corresponding known skeleton nodes; randomly divide all known skeleton sub-nodes into a training set and a validation set, set several groups of initial parameters, adjust each group of initial parameters with the training set, and verify the classification accuracy of each parameter-adjusted BERT pre-training language model with the validation set; and select the BERT pre-training language model with the highest classification accuracy as the classification model.
In the web page information extraction method of the invention, the BERT pre-training language model constructs the classification model

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

uses the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ],

and performs parameter learning with

γ* = argmin_γ L(γ);

where s_ij^k is a known skeleton sub-node, T is the target task semantics, ŷ_ij^k is the predicted label of s_ij^k, CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, N_i is the number of skeleton nodes in the known DOM tree of known web page W_i, M_ij is the number of child nodes after skeleton node s_ij is split, and y_ij^k is the true label of skeleton sub-node s_ij^k.
The invention also provides a semantic-based web page information extraction system, comprising: a node splitting module for acquiring target skeleton sub-nodes, which acquires a target DOM tree of a target web page and splits the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; a node classification module for classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and an information extraction module for clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page and extracting the web page information contained in the target information tree.
The webpage information extraction system adopts a BERT pre-training language model as the classification model.
The web page information extraction system of the invention further comprises: an offline learning module for performing offline learning on the BERT pre-training language model; a node labeling module for parsing a known web page into a known DOM tree, obtaining the known skeleton nodes of the known DOM tree, labeling all known skeleton nodes (marking the known skeleton nodes related to a downstream task as known core information nodes and the known skeleton nodes unrelated to the downstream task as non-core information nodes), and splitting each known skeleton node by sentence to obtain known skeleton sub-nodes whose labels are the same as the labels of the corresponding known skeleton nodes; and a model adjustment module for randomly dividing all known skeleton sub-nodes into a training set and a validation set, setting several groups of initial parameters, adjusting each group of initial parameters with the training set, verifying the classification accuracy of each parameter-adjusted BERT pre-training language model with the validation set, and selecting the BERT pre-training language model with the highest classification accuracy as the classification model.
In the web page information extraction system of the invention, the BERT pre-training language model constructs the classification model

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

uses the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ],

and performs parameter learning with

γ* = argmin_γ L(γ);

where s_ij^k is a known skeleton sub-node, T is the target task semantics, ŷ_ij^k is the predicted label of s_ij^k, CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, N_i is the number of skeleton nodes in the known DOM tree of known web page W_i, M_ij is the number of child nodes after skeleton node s_ij is split, and y_ij^k is the true label of skeleton sub-node s_ij^k.
The present invention also provides a computer-readable storage medium storing computer-executable instructions, which when executed, implement the aforementioned semantic-based web page information extraction method.
The invention also provides a data processing device, which comprises the computer-readable storage medium, and when a processor of the data processing device calls and executes the computer-executable instructions in the computer-readable storage medium, the semantic-based webpage information extraction is realized.
Drawings
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention.
FIG. 2 is a schematic diagram of the BERT model offline learning of the present invention.
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
During research on web page information extraction, the inventor found that the text a downstream task needs to extract and the noise information differ in semantic essence, while existing web page information extraction technology does not model text semantics. Existing methods need to manually extract features of each sub-node as model input; they consider only the statistical characteristics of the text, cannot learn word semantics from context, and cannot express sentence semantics, so the classifier's results are unstable on synonymous sentences with different structures. To reduce classification error, the classifier must be able to model the semantics of the text. The inventors investigated and found that the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) can learn text semantics well. BERT is an attention-based bidirectional language modeling method. It directly reuses the Encoder module of the Transformer architecture and has bidirectional encoding capability and strong feature extraction capability. The BERT framework has two steps: pre-training and fine-tuning. During pre-training, BERT performs self-supervised learning through Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to obtain a semantic representation of each word; during fine-tuning, only a small amount of task-specific data needs to be input, so the semantics can be adjusted to the specific downstream task and the domain knowledge of the downstream task is learned.
BERT adopts a self-attention mechanism: each word is represented as a weighted combination of the words in the context of its sentence, and a fully connected layer then learns the relevance between the current word sequence and the node category, yielding the relevance between the node category and the sentence context. In the present invention, the pre-trained BERT is used directly, and fine-tuning and prediction are performed on this basis. In addition, in order to apply the model to extraction for any text-data-mining downstream task, the inventor also provides two stages, offline learning and online extraction: the offline learning stage learns the domain knowledge of the downstream task by fine-tuning the pre-trained language model BERT; the online extraction stage classifies the skeleton sub-nodes of the web page with the fine-tuned BERT and optimizes the classification result with a path clustering algorithm to obtain semantically consistent main web page information. Furthermore, in order to improve the robustness of web page information extraction, the inventor also provides a Path Clustering (PC) algorithm that, on top of the semantic modeling and extraction of the web page's text information, optimizes the extraction result through the modeled structural information of the web page.
The invention aims to solve the problem in the prior art that extraction accuracy drops sharply after a web page is updated, and provides a semantic-based web page information extraction method and system. The extraction method comprises: splitting the text nodes of the web page DOM tree, the split nodes serving as the input of the web page information extraction model, which refines the classification granularity and improves classification accuracy; classifying the text in the web page with a BERT pre-trained language model, which models the semantics of the web page text and improves the accuracy of web page text classification; optimizing the classification result with a Path Clustering algorithm, which improves the quality of single-page extraction and the robustness of core information extraction; separating the parameter-learning and extraction stages of the model with an offline-learning/online-extraction framework, in which a large number of extraction models are pre-trained for different downstream tasks in the offline learning stage and the corresponding model is selected directly for extraction in the online extraction stage according to the specific downstream task, the two stages together allowing the model to be applied to extraction for any text-data-mining downstream task; and adopting the web page information extraction evaluation index CA to measure the extraction quality of a single web page and the proportion of web pages whose extraction quality reaches the standard.
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention. As shown in fig. 1, preprocessing is performed first: the input HTML is parsed into a DOM tree and the skeleton nodes are split to obtain skeleton sub-nodes. The sub-nodes are then input to the BERT fine-tuned in the offline learning stage for classification, which marks whether each skeleton sub-node is a core information sub-node. Finally, a path clustering algorithm clusters all skeleton sub-nodes marked as core information sub-nodes to obtain the path of the core root node, from which the final extraction result is obtained.
Specifically, the semantic-based webpage information extraction method specifically comprises the following steps:
step S1, dividing nodes;
For web page W_i, i = 1, ..., N (N being the total number of web pages), the plain-text content that is independent of typesetting, font size and the like is defined as the skeleton information S_i of the web page. Each node in the DOM tree represents one HTML tag (or tag pair) or a text item inside a tag; all text in the HTML appears as leaf nodes of the DOM tree, and a leaf node storing text is called a skeleton node, denoted s_ij, j = 1, ..., N_i, where j indexes the j-th skeleton node of web page W_i in the backward traversal order of the DOM tree and N_i is the number of skeleton nodes in the DOM tree of web page W_i. Because the skeleton nodes of a web page correspond one-to-one to the web page's texts, the skeleton information S_i of web page W_i is represented by the N_i skeleton nodes of its DOM tree:

S_i = [s_i1, s_i2, ..., s_iN_i].

Accordingly, the web page information extraction task is converted into a classification problem over the web page's skeleton nodes.
Step S2, splitting nodes;
Some skeleton nodes of a web page contain only extremely short sentences, while others contain paragraphs made up of many sentences. If skeleton nodes were classified directly, the input granularity would therefore be uneven, reducing the accuracy of the classification model. To solve this problem, the coarse-grained nodes need to be split. Specifically, the text in each skeleton node is divided by sentence, so that one skeleton node is split into one or more child nodes. The child nodes inherit the characteristics of the skeleton node before splitting: they have the same ancestor nodes as the node before splitting, are siblings of one another, and their order is the order of the sentences before splitting. The child nodes obtained by splitting s_ij are called the skeleton sub-nodes of s_ij, denoted s_ij^k, k = 1, ..., M_ij, where M_ij is the number of child nodes after the skeleton node is split. The skeleton information S_i can thus be represented not only by all skeleton nodes but also as the sequence of all skeleton sub-nodes:

S_i = [s_i1^1, ..., s_i1^{M_i1}, s_i2^1, ..., s_iN_i^{M_iN_i}].

Therefore, the web page information extraction problem reduces to the classification of all skeleton sub-nodes in the web page.
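The node splitting of step S2 can be sketched with Python's standard library alone. The class and function names and the sentence-boundary heuristic below are illustrative assumptions, not taken from the patent:

```python
import re
from html.parser import HTMLParser

class SkeletonParser(HTMLParser):
    """Collects the text leaf nodes ("skeleton nodes") of an HTML page."""
    def __init__(self):
        super().__init__()
        self.skeleton_nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:                       # only non-empty text leaves qualify
            self.skeleton_nodes.append(text)

def split_by_sentence(node_text):
    """Split one skeleton node into sentence-level sub-nodes.
    The boundary set (., !, ? and CJK counterparts) is a rough
    heuristic; the patent does not fix a particular splitter."""
    parts = re.split(r'(?<=[.!?。！？])\s*', node_text)
    return [p for p in parts if p]

parser = SkeletonParser()
parser.feed("<html><body><p>First sentence. Second sentence!</p>"
            "<div>Short label</div></body></html>")
sub_nodes = [split_by_sentence(s) for s in parser.skeleton_nodes]
# sub_nodes[0] == ['First sentence.', 'Second sentence!']
# sub_nodes[1] == ['Short label']
```

Each sub-node keeps its parent's position in the tree, so the list order here mirrors the sentence order before splitting, as the step requires.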
Step S3, classifying nodes;
Denote the classification model as f(s_ij^k, T). Given a classification threshold θ, when f(s_ij^k, T) > θ, the sub-node s_ij^k is semantically related to the target task T and is marked as a core information sub-node; otherwise it is marked as a non-core information sub-node. The core information C_i of web page W_i is then expressed as:

C_i = [ s_ij^k | f(s_ij^k, T) > θ, j ∈ [1, N_i], k ∈ [1, M_ij] ], i = 1, ..., N.

After web page information extraction has been converted into the classification of skeleton sub-nodes, the classification model f is obtained by using the pre-trained BERT directly and performing fine-tuning and prediction on that basis. The model input is the text sequence in each skeleton sub-node s_ij^k. Since this is a sequence classification task, only the output CLS needs to be classified. The classification model is constructed as

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

using the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ];

the objective function is minimized and parameter learning is performed as follows:

γ* = argmin_γ L(γ),

where CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, and y_ij^k is the true label of s_ij^k.
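The thresholded decision of step S3 can be sketched as follows. The BERT scorer is replaced by a stub (the patent's fine-tuned model is not reproduced here), and all names are illustrative:

```python
from typing import Callable, List, Tuple

def extract_core_subnodes(
    sub_nodes: List[str],
    scorer: Callable[[str], float],  # stands in for f(s_ij^k, T), e.g. fine-tuned BERT
    theta: float = 0.5,              # classification threshold θ
) -> Tuple[List[str], List[str]]:
    """Mark each skeleton sub-node as core (score > θ) or non-core."""
    core, non_core = [], []
    for node in sub_nodes:
        (core if scorer(node) > theta else non_core).append(node)
    return core, non_core

# Toy scorer: pretend longer sentences are more likely to be body text.
toy_scorer = lambda text: min(1.0, len(text) / 40)
core, non_core = extract_core_subnodes(
    ["Breaking news: the full story follows in detail below.", "Login", "Share"],
    toy_scorer, theta=0.5)
# core contains the long sentence; "Login" and "Share" fall below θ
```

In the patent's pipeline the scorer would be the fine-tuned BERT's σ(γ · CLS(·)) output; any callable returning a score in [0, 1] fits this interface.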
Step S4, clustering paths;
the input of the path clustering algorithm is a node path which is classified into core information child nodes by BERT, the output is a path, a node corresponding to the path is called a core root node, and a sub-tree taking the core root node as a root is called a core information tree. And according to the core information tree, all the skeleton sub-nodes in the core information tree are marked as core information sub-nodes again, and other skeleton sub-nodes are marked as non-core information sub-nodes again to serve as final results after the classification results are corrected.
Memory skeleton sub-node sij kIs a sequence of
Figure BDA0003471554730000084
Wherein
Figure BDA0003471554730000085
Indicating the length of the path, i.e. from the root node to the node sij kNumber of nodes on the path, elements in the sequence
Figure BDA0003471554730000086
Indicating that the node in the path with index t is the next child of its parent.
The path clustering algorithm comprises the following steps: (1) inputting a node path to be clustered, initializing a queue for recording the path clustering result, and aligning the path from a root node; (2) clustering layer by layer from the root node, and recording the total number of the elements in the ith layer as LiThe greatest amountElement of many is denoted as emaxThe number thereof is recorded as
Figure BDA0003471554730000087
Setting a path clustering coefficient alpha (alpha is more than or equal to 0 and less than or equal to 1); (3) when in use
Figure BDA0003471554730000088
When it is, the current emaxAdding into queue and keeping the element of the layer as emaxContinuing to perform next-layer clustering on the reserved paths; (4) when in use
Figure BDA0003471554730000089
When the iteration is terminated, the iteration is returned to the queue, and the nodes in the queue arrange the represented paths in sequence.
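The four steps above can be sketched directly in Python. This is a minimal reading of the layer-by-layer procedure, with function and variable names chosen for illustration:

```python
from collections import Counter
from typing import List

def path_clustering(paths: List[List[int]], alpha: float = 0.8) -> List[int]:
    """Layer-by-layer path clustering (a sketch of the patent's PC algorithm).

    Each path is the child-index sequence from the DOM root to a sub-node
    classified as core information. At every layer the dominant child index
    e_max is kept if its share is at least alpha; the returned prefix is the
    path of the core root node."""
    result = []
    layer = 0
    while paths:
        elems = [p[layer] for p in paths if len(p) > layer]
        if not elems:
            break
        e_max, count = Counter(elems).most_common(1)[0]
        if count / len(elems) < alpha:   # consensus too weak: terminate
            break
        result.append(e_max)             # add e_max to the result queue
        paths = [p for p in paths if len(p) > layer and p[layer] == e_max]
        layer += 1
    return result

# Three core paths agree on the prefix [0, 1]; at the next layer they diverge.
print(path_clustering([[0, 1, 2], [0, 1, 3], [0, 1, 2, 5]], alpha=0.8))  # → [0, 1]
```

With α = 0.8 the clustering stops as soon as fewer than 80% of the surviving paths share a child index, so the result is the deepest node under which the core sub-nodes are still concentrated.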
In the web page information extraction method, a BERT pre-trained language model is adopted as the classification model, so the method further includes an offline learning step in which the skeleton information of a large number of web pages is used to fine-tune the BERT pre-trained language model so that it learns the domain knowledge of the downstream task. As shown in fig. 2, the specific process is as follows:

Select the HTML source code of a large number of web pages and parse it into DOM trees to obtain all skeleton nodes of the pages; manually label the text nodes, marking the text nodes related to the downstream task as core information nodes (positive samples) and the text nodes unrelated to the downstream task as non-core information nodes (negative samples);

Split each skeleton node by sentence to obtain skeleton sub-nodes, whose labels are the same as the labels of the text nodes before splitting;

Randomly divide all skeleton sub-nodes into a training set and a validation set; select several groups of parameters for BERT and, for each group, fine-tune BERT with the training set and verify the classification accuracy with the validation set; select the parameter combination that performs best on the validation set, together with the corresponding fine-tuned BERT, as the result of offline learning, and save the model and parameters locally for the online extraction part to call.
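The split-train-validate-select loop of the offline learning stage can be sketched as below. `train_and_eval` stands in for the actual BERT fine-tuning, and the 80/20 split ratio is an assumption not fixed by the patent:

```python
import random
from typing import Callable, Dict, List, Tuple

def offline_learning(samples: List[Tuple[str, int]],
                     param_grid: List[Dict],
                     train_and_eval: Callable[[Dict, list, list], Tuple[object, float]],
                     seed: int = 0):
    """Random train/validation split, then keep the parameter group whose
    fine-tuned model has the best validation accuracy."""
    data = samples[:]
    random.Random(seed).shuffle(data)     # random division of the sub-nodes
    cut = int(0.8 * len(data))            # assumed 80/20 split
    train, val = data[:cut], data[cut:]
    best_model, best_params, best_acc = None, None, -1.0
    for params in param_grid:
        model, acc = train_and_eval(params, train, val)
        if acc > best_acc:                # best performance on validation set
            best_model, best_params, best_acc = model, params, acc
    return best_model, best_params, best_acc

# Stub trainer: pretend validation accuracy depends only on the learning rate.
stub = lambda p, tr, va: (("bert", p["lr"]), {2e-5: 0.91, 5e-5: 0.88}[p["lr"]])
model, params, acc = offline_learning(
    [("sentence %d" % i, i % 2) for i in range(10)],
    [{"lr": 2e-5}, {"lr": 5e-5}], stub)
# params == {"lr": 2e-5}, the group with the higher validation accuracy
```

The returned model and parameter group are what the patent describes saving locally for the online extraction stage to call.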
In addition, in order to accurately evaluate the effect of web page information extraction, the invention also provides the CA evaluation index.

Define the accuracy of web page information extraction as CA. Given an extraction-quality threshold h (taken as h = 0.9 in the embodiment of the invention), let F1_i be the F1-score on test page W_i. When F1_i > h for test page W_i, the extraction is recorded as a success, with a_i = 1; otherwise the extraction is recorded as a failure, with a_i = 0. Then

CA = (1/N) Σ_{i=1}^{N} a_i,

where N is the total number of test pages. CA measures the extraction effect of the model at the granularity of the web page level; through CA one can directly see the proportion of web pages that reach the required extraction quality.
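The CA index is a straightforward per-page aggregate; a minimal sketch, where a page counts as a success when its F1-score exceeds h (the function name is illustrative):

```python
def ca_metric(f1_scores, h=0.9):
    """CA index: the fraction of test pages whose per-page F1-score
    exceeds the extraction-quality threshold h (h = 0.9 in the
    patent's embodiment)."""
    if not f1_scores:
        return 0.0
    a = [1 if f1 > h else 0 for f1 in f1_scores]   # a_i per test page
    return sum(a) / len(a)                          # mean of the a_i

# Four test pages, three of which pass the h = 0.9 bar.
print(ca_metric([0.95, 0.91, 0.97, 0.72]))  # → 0.75
```

Because the comparison is strict, a page scoring exactly F1 = h counts as a failure, matching the F1_i > h condition in the text.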
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention. As shown in fig. 3, the present invention further provides a data processing apparatus, which includes a processor and a computer-readable storage medium, wherein the processor retrieves and executes executable instructions in the computer-readable storage medium to perform information extraction on a web page; the computer readable storage medium stores executable instructions, and when the executable instructions are executed by the processor, the semantic-based webpage information extraction method is realized. It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing associated hardware (e.g., a processor) and the program may be stored in a readable storage medium, such as a read-only memory, a magnetic or optical disk, etc. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example, by an integrated circuit, or in software, for example, by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific form of combination of hardware and software.
The semantic-based web page information extraction method creatively adds the text semantics of the web page as input information for modeling, learns domain knowledge from web pages related to downstream tasks, extracts the core information of the web page, and provides effective structured input data for different downstream tasks. Compared with rule-based methods, it does not require designing different rules or extraction templates for different web pages, and does not suffer from rule failure caused by web page revisions; compared with machine-learning-based methods, it abandons the selection and modeling of traditional features (statistical features of DOM tree nodes), pays more attention to the semantics of the web page text, and improves the robustness of extraction and the generalization capability across different downstream tasks.
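The three-step pipeline described above (split skeleton nodes into sentence-level sub-nodes, classify sub-nodes by task semantics, cluster node paths into an information tree) can be sketched with only the standard library. The sentence splitter, the keyword "classifier", and the densest-cluster rule below are simplified placeholders standing in for the patent's sentence splitting, BERT classification, and node-path clustering; all names are illustrative.

```python
# Illustrative sketch of the extraction pipeline, standard library only.
import re
from html.parser import HTMLParser

class SkeletonNodeCollector(HTMLParser):
    """Collects (DOM path, text) pairs for text-bearing nodes."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))

def extract(html, is_core):
    # Step 1: split each skeleton node's text into sentence-level sub-nodes.
    collector = SkeletonNodeCollector()
    collector.feed(html)
    sub_nodes = [(path, s) for path, text in collector.nodes
                 for s in re.split(r"(?<=[.!?])\s+", text) if s]
    # Step 2: classify sub-nodes; keep those relevant to the target task
    # (a real system would use the BERT classifier here).
    info_nodes = [(path, s) for path, s in sub_nodes if is_core(s)]
    # Step 3: cluster by node path; keep the densest cluster as the info tree.
    clusters = {}
    for path, s in info_nodes:
        clusters.setdefault(path, []).append(s)
    return max(clusters.values(), key=len) if clusters else []

html = "<html><body><div><p>BERT extracts info. Menu link.</p></div></body></html>"
print(extract(html, lambda s: "info" in s.lower()))  # ['BERT extracts info.']
```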
The above embodiments are only for illustrating the invention and are not to be construed as limiting it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; therefore, all equivalent technical solutions also fall within the scope of the invention, which is defined by the claims.

Claims (10)

1. A webpage information extraction method based on semantics is characterized by comprising the following steps:
acquiring a target DOM tree of a target webpage, splitting a target skeleton node of the target DOM tree according to sentences to obtain a target skeleton sub-node of the target DOM tree;
classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree;
and clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage, and extracting webpage information contained in the target information tree.
2. The method for extracting web page information of claim 1, wherein a BERT pre-trained language model is used as the classification model.
3. The method for extracting web page information according to claim 2, further comprising the step of performing offline learning on the BERT pre-trained language model:
analyzing a known web page into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, and labeling all the known skeleton nodes: the known skeleton nodes related to a downstream task are labeled as known core information nodes, and the known skeleton nodes unrelated to the downstream task are labeled as non-core information nodes; splitting each known skeleton node according to sentences to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of the corresponding known skeleton node;
randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, and verifying with the verification set the classification precision of each BERT pre-trained language model whose parameter adjustment is completed; and selecting the BERT pre-trained language model with the highest classification precision as the classification model.
4. The method for extracting web page information of claim 3, wherein the BERT pre-trained language model is:

ŷ_ijk = CLS(s_ijk, T; γ),

and a cross-entropy loss function:

L = − Σ_i Σ_{j=1..N_i} Σ_{k=1..M_ij} [ y_ijk · log ŷ_ijk + (1 − y_ijk) · log(1 − ŷ_ijk) ]

is used to carry out parameter learning;

wherein s_ijk is a known skeleton sub-node, T is the target task semantic, ŷ_ijk is the predicted label of the known skeleton sub-node s_ijk, CLS is a unit in the output layer of the BERT pre-trained language model, γ is the weight of the BERT pre-trained language model, N_i is the number of known skeleton nodes of the known DOM tree of known web page W_i, M_ij is the number of sub-nodes obtained after splitting the j-th known skeleton node, and y_ijk is the real tag of the skeleton sub-node s_ijk.
5. A semantic-based web page information extraction system is characterized by comprising:
the node splitting module is used for acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree according to sentences to obtain the target skeleton sub-nodes of the target DOM tree;
the node classification module is used for classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain the target information sub-nodes of the target DOM tree;
and the information extraction module is used for clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage and extracting webpage information contained in the target information tree.
6. The web page information extraction system according to claim 5, wherein a BERT pre-trained language model is used as the classification model.
7. The web page information extraction system according to claim 6, further comprising:
the offline learning module is used for performing offline learning on the BERT pre-training language model; specifically comprises
the node marking module is used for analyzing a known web page into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, and marking all the known skeleton nodes: the known skeleton nodes related to the downstream task are marked as known core information nodes, and the known skeleton nodes unrelated to the downstream task are marked as non-core information nodes; each known skeleton node is split according to sentences to obtain known skeleton sub-nodes, wherein the mark of each known skeleton sub-node is the same as the mark of the corresponding known skeleton node;
the model adjusting module is used for randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, verifying with the verification set the classification accuracy of each BERT pre-trained language model whose parameter adjustment is completed, and selecting the BERT pre-trained language model with the highest classification accuracy as the classification model.
8. The system for extracting web page information of claim 7, wherein the BERT pre-trained language model is:

ŷ_ijk = CLS(s_ijk, T; γ),

and a cross-entropy loss function:

L = − Σ_i Σ_{j=1..N_i} Σ_{k=1..M_ij} [ y_ijk · log ŷ_ijk + (1 − y_ijk) · log(1 − ŷ_ijk) ]

is used to carry out parameter learning;

wherein s_ijk is a known skeleton sub-node, T is the target task semantic, ŷ_ijk is the predicted label of the known skeleton sub-node s_ijk, CLS is a unit in the output layer of the BERT pre-trained language model, γ is the weight of the BERT pre-trained language model, N_i is the number of known skeleton nodes of the known DOM tree of known web page W_i, M_ij is the number of sub-nodes obtained after splitting the j-th known skeleton node, and y_ijk is the real tag of the skeleton sub-node s_ijk.
9. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the semantic-based web page information extraction method according to any one of claims 1 to 4.
10. A data processing apparatus comprising the computer-readable storage medium of claim 9, wherein the semantic-based extraction of web page information is performed when the processor of the data processing apparatus retrieves and executes the computer-executable instructions of the computer-readable storage medium.
CN202210044347.0A 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system Pending CN114528459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044347.0A CN114528459A (en) 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system


Publications (1)

Publication Number Publication Date
CN114528459A (en) 2022-05-24

Family

ID=81621550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210044347.0A Pending CN114528459A (en) 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system

Country Status (1)

Country Link
CN (1) CN114528459A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576710A (en) * 2024-01-15 2024-02-20 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117576710B (en) * 2024-01-15 2024-05-28 西湖大学 Method and device for generating natural language text based on graph for big data analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination