CN114528459A - Semantic-based webpage information extraction method and system - Google Patents


Info

Publication number
CN114528459A
CN114528459A (application CN202210044347.0A)
Authority
CN
China
Prior art keywords
nodes
target
information
skeleton
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210044347.0A
Other languages
Chinese (zh)
Inventor
郭岩
王之威
刘杨昊
刘悦
薛源海
俞晓明
沈华伟
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210044347.0A priority Critical patent/CN114528459A/en
Publication of CN114528459A publication Critical patent/CN114528459A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/90: Details of database functions independent of the retrieved data types
                        • G06F16/95: Retrieval from the web
                            • G06F16/951: Indexing; Web crawling techniques
                            • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
                                • G06F16/986: Document structures and storage, e.g. HTML extensions
                    • G06F16/30: Information retrieval of unstructured textual data
                        • G06F16/35: Clustering; Classification
                            • G06F16/353: Clustering; Classification into predefined classes
                • G06F40/00: Handling natural language data
                    • G06F40/10: Text processing
                        • G06F40/12: Use of codes for handling textual entities
                            • G06F40/14: Tree-structured documents
                                • G06F40/146: Coding or compression of tree-structured data
                    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a semantic-based web page information extraction method, comprising the following steps: acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page, and extracting the web page information contained in the target information tree. The invention also provides a semantic-based web page information extraction system and a data processing device for realizing the semantic-based web page information extraction.

Description

Semantic-based webpage information extraction method and system
Technical Field
The invention belongs to the technical field of network information, and particularly relates to a method and a system for extracting webpage information.
Background
We have now fully entered the internet era, and every day tens of thousands of pieces of information of various types can be seen, distributed across different web pages. Among the different types of information in a web page there is much invalid, redundant information that is useless to the reader. The content of an original web page is therefore usually put through a second round of information processing to obtain useful structured information; such effective structured data has great application value, and the whole process is called web page information extraction.
Which information in a web page is key information is determined by the specific task. Taking a news page as an example, a reader generally pays attention to specific information such as the news content, publication time, author, comments and pictures, while other interfering information in the page, such as advertisements, copyright statements, entry links to other pages and visual effects, is usually not the focus of attention. Text information in a web page, especially body text such as news text and blog text, is of great significance for many practical application scenarios, and deeper mining and analysis can be performed on this data; for example, it can serve as an input source for downstream tasks such as public opinion analysis and related recommendation. The input data of such downstream tasks is in a structured, more standard data format, so a method that efficiently extracts valuable structured information from the semi-structured data of web pages has extremely important research significance and value.
Existing web page information extraction methods can be divided into two categories: rule-based methods and machine-learning-based methods. Rule-based methods are varied and can be further classified into static-template-based methods, wrapper induction methods, automatic template generation methods and heuristic methods.
Rule-based methods mostly generate a web page template or extraction rules, either by manual design or by automatically computing rule templates from similar web pages, and then extract with the template or rules. Static-template-based methods require templates to be written manually; different templates must be written for different web pages, a template becomes invalid once the page is updated, maintenance cost is high, and the user must have some programming ability. Wrapper induction methods mainly induce an extraction template for the corresponding website from a set of user-labeled web pages, and rely on manual labeling. Automatic template generation methods assume that the set of pages to be extracted is generated from one template, analyze structurally similar pages and generate the template automatically; they require no prior knowledge of the pages, but template generation is inefficient and inaccurate for complicated pages. Heuristic methods mostly search for similar repeated strings in the DOM tree to locate data regions, analyze structural features automatically, and extract according to heuristic information. Machine-learning-based methods usually treat the HTML source code of a web page as a sequence, manually screen features, and label the sequence with a machine learning model for extraction. Rule-based methods require different rules or templates for different web pages, and web page updates cause rule failure.
Machine-learning-based methods model with manually selected features, which cannot accurately express the basic content and theme of a web page, so these methods cannot extract arbitrary web pages.
Both rule-based and machine-learning methods face the following challenges. On one hand, web page layouts are highly varied and the combinations of pictures and text differ greatly, so it is difficult to find an extraction rule or model that suits all web pages; on the other hand, the structure of a web page may change continually as the website is updated, and the statistical characteristics of the page change with it, so extraction rules or models that previously performed well often fail after the page structure is updated.
Given these two challenges, it is necessary to find the characteristics of a web page that do not change with layout, architecture and the like, and to model the essence of the web page, namely its semantics. The essence of the existing methods is to compute over manually selected web page features; these features consider only the statistical properties of the text or tags, cannot learn word semantics from context, and cannot express sentence semantics, so these methods cannot adapt to structural changes of a web page and extraction fails.
Disclosure of Invention
In order to solve the above problems, the present invention provides a semantic-based web page information extraction method, which comprises: acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page, and extracting the web page information contained in the target information tree.
The webpage information extraction method adopts a BERT pre-training language model as the classification model.
The web page information extraction method further comprises a step of offline learning for the BERT pre-training language model: parse a known web page into a known DOM tree, obtain the known skeleton nodes of the known DOM tree, and label all known skeleton nodes, marking the known skeleton nodes related to a downstream task as known core information nodes and the known skeleton nodes unrelated to the downstream task as non-core information nodes; split each known skeleton node by sentence to obtain known skeleton sub-nodes, whose labels are the same as the labels of the corresponding known skeleton nodes; randomly divide all known skeleton sub-nodes into a training set and a validation set, set several groups of initial parameters, adjust each group of initial parameters with the training set, and verify the classification accuracy of each parameter-adjusted BERT pre-training language model with the validation set; and select the BERT pre-training language model with the highest classification accuracy as the classification model.
In the web page information extraction method of the invention, the BERT pre-training language model constructs the classification model

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

uses the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ],

and performs parameter learning with

γ* = argmin_γ L(γ);

where s_ij^k is a known skeleton sub-node, T is the target task semantics, ŷ_ij^k is the predicted label of s_ij^k, CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, N_i is the number of skeleton nodes in the known DOM tree of known web page W_i, M_ij is the number of child nodes after skeleton node s_ij is split, and y_ij^k is the true label of skeleton sub-node s_ij^k.
The invention also provides a semantic-based web page information extraction system, comprising: a node splitting module for acquiring target skeleton sub-nodes, which acquires a target DOM tree of a target web page and splits the target skeleton nodes of the target DOM tree by sentence to obtain the target skeleton sub-nodes of the target DOM tree; a node classification module for classifying all target skeleton sub-nodes according to the target task semantics with a classification model to obtain the target information sub-nodes of the target DOM tree; and an information extraction module for clustering the node paths formed by all target information sub-nodes to obtain the target information tree of the target web page and extracting the web page information contained in the target information tree.
The webpage information extraction system adopts a BERT pre-training language model as the classification model.
The web page information extraction system of the invention further comprises: an offline learning module for performing offline learning on the BERT pre-training language model; a node labeling module for parsing a known web page into a known DOM tree, obtaining the known skeleton nodes of the known DOM tree, labeling all known skeleton nodes (marking the known skeleton nodes related to a downstream task as known core information nodes and the known skeleton nodes unrelated to the downstream task as non-core information nodes), and splitting each known skeleton node by sentence to obtain known skeleton sub-nodes whose labels are the same as the labels of the corresponding known skeleton nodes; and a model adjustment module for randomly dividing all known skeleton sub-nodes into a training set and a validation set, setting several groups of initial parameters, adjusting each group of initial parameters with the training set, verifying the classification accuracy of each parameter-adjusted BERT pre-training language model with the validation set, and selecting the BERT pre-training language model with the highest classification accuracy as the classification model.
In the web page information extraction system of the invention, the BERT pre-training language model constructs the classification model

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

uses the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ],

and performs parameter learning with

γ* = argmin_γ L(γ);

where s_ij^k is a known skeleton sub-node, T is the target task semantics, ŷ_ij^k is the predicted label of s_ij^k, CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, N_i is the number of skeleton nodes in the known DOM tree of known web page W_i, M_ij is the number of child nodes after skeleton node s_ij is split, and y_ij^k is the true label of skeleton sub-node s_ij^k.
The present invention also provides a computer-readable storage medium storing computer-executable instructions, which when executed, implement the aforementioned semantic-based web page information extraction method.
The invention also provides a data processing device, which comprises the computer-readable storage medium, and when a processor of the data processing device calls and executes the computer-executable instructions in the computer-readable storage medium, the semantic-based webpage information extraction is realized.
Drawings
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention.
FIG. 2 is a schematic diagram of the BERT model offline learning of the present invention.
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
During research on web page information extraction, the inventor found that the text a downstream task needs to extract and the noise information differ in semantic essence, while existing web page information extraction technology does not model text semantics. Existing methods need to manually extract features of each sub-node as model input; they consider only the statistical characteristics of the text, cannot learn word semantics from context, and cannot express sentence semantics, so the classifier's results are unstable on synonymous sentences with different structures. To reduce classification error, the classifier must be able to model the semantics of the text. The inventors investigated and found that the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) can learn text semantics well. BERT is an attention-based bidirectional language modeling method. It directly reuses the Encoder module of the Transformer architecture and has bidirectional encoding capability and strong feature extraction capability. The BERT framework has two steps: pre-training and fine-tuning. During pre-training, BERT performs self-supervised learning through Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to obtain a semantic representation of each word; during fine-tuning, only a small amount of task-specific data needs to be input, so the semantics can be adjusted to the specific downstream task and the domain knowledge of the downstream task is learned.
BERT adopts a self-attention mechanism: each word is represented as a weighted combination of the words in the context of its sentence, and a fully connected layer then learns the relevance between the current word sequence and the node category, yielding the relevance between the node category and the sentence context. In the present invention, the pre-trained BERT is used directly, and fine-tuning and prediction are performed on this basis. In addition, in order to apply the model to extraction for any text-data-mining downstream task, the inventor also provides two stages, offline learning and online extraction: the offline learning stage learns the domain knowledge of the downstream task by fine-tuning the pre-trained language model BERT; the online extraction stage classifies the skeleton sub-nodes of the web page with the fine-tuned BERT and optimizes the classification result with a path clustering algorithm to obtain semantically consistent main web page information. Furthermore, in order to improve the robustness of web page information extraction, the inventor also provides a Path Clustering (PC) algorithm that, on top of the semantic modeling and extraction of the web page's text information, optimizes the extraction result through the modeled structural information of the web page.
The invention aims to solve the problem in the prior art that extraction accuracy drops sharply after a web page is updated, and provides a semantic-based web page information extraction method and system. The extraction method comprises: splitting the text nodes of the web page DOM tree, the split nodes serving as the input of the web page information extraction model, which refines the classification granularity and improves classification accuracy; classifying the text in the web page with a BERT pre-trained language model, which models the semantics of the web page text and improves the accuracy of web page text classification; optimizing the classification result with a Path Clustering algorithm, which improves the quality of single-page extraction and the robustness of core information extraction; separating the parameter-learning and extraction stages of the model with an offline-learning/online-extraction framework, in which a large number of extraction models are pre-trained for different downstream tasks in the offline learning stage and the corresponding model is selected directly for extraction in the online extraction stage according to the specific downstream task, the two stages together allowing the model to be applied to extraction for any text-data-mining downstream task; and adopting the web page information extraction evaluation index CA to measure the extraction quality of a single web page and the proportion of web pages whose extraction quality reaches the standard.
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention. As shown in fig. 1, preprocessing is performed first: the input HTML is parsed into a DOM tree and the skeleton nodes are split to obtain skeleton sub-nodes. The sub-nodes are then input to the BERT fine-tuned in the offline learning stage for classification, which marks whether each skeleton sub-node is a core information sub-node. Finally, a path clustering algorithm clusters all skeleton sub-nodes marked as core information sub-nodes to obtain the path of the core root node, from which the final extraction result is obtained.
Specifically, the semantic-based webpage information extraction method specifically comprises the following steps:
step S1, dividing nodes;
For web page W_i, i = 1, ..., N (N being the total number of web pages), the plain-text content that is independent of typesetting, font size and the like is defined as the skeleton information S_i of the web page. Each node in the DOM tree represents one HTML tag (or tag pair) or a text item inside a tag; all text in the HTML appears as leaf nodes of the DOM tree, and a leaf node storing text is called a skeleton node, denoted s_ij, j = 1, ..., N_i, where j indexes the j-th skeleton node of web page W_i in the backward traversal order of the DOM tree and N_i is the number of skeleton nodes in the DOM tree of web page W_i. Because the skeleton nodes of a web page correspond one-to-one to the web page's texts, the skeleton information S_i of web page W_i is represented by the N_i skeleton nodes of its DOM tree:

S_i = [s_i1, s_i2, ..., s_iN_i].

Accordingly, the web page information extraction task is converted into a classification problem over the web page's skeleton nodes.
Step S2, splitting nodes;
Some skeleton nodes of a web page contain only extremely short sentences, while others contain paragraphs made up of many sentences. If skeleton nodes were classified directly, the input granularity would therefore be uneven, reducing the accuracy of the classification model. To solve this problem, the coarse-grained nodes need to be split. Specifically, the text in each skeleton node is divided by sentence, so that one skeleton node is split into one or more child nodes. The child nodes inherit the characteristics of the skeleton node before splitting: they have the same ancestor nodes as the node before splitting, are siblings of one another, and their order is the order of the sentences before splitting. The child nodes obtained by splitting s_ij are called the skeleton sub-nodes of s_ij, denoted s_ij^k, k = 1, ..., M_ij, where M_ij is the number of child nodes after the skeleton node is split. The skeleton information S_i can thus be represented not only by all skeleton nodes but also as the sequence of all skeleton sub-nodes:

S_i = [s_i1^1, ..., s_i1^{M_i1}, s_i2^1, ..., s_iN_i^{M_iN_i}].

Therefore, the web page information extraction problem reduces to the classification of all skeleton sub-nodes in the web page.
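The node splitting of step S2 can be sketched with Python's standard library alone. The class and function names and the sentence-boundary heuristic below are illustrative assumptions, not taken from the patent:

```python
import re
from html.parser import HTMLParser

class SkeletonParser(HTMLParser):
    """Collects the text leaf nodes ("skeleton nodes") of an HTML page."""
    def __init__(self):
        super().__init__()
        self.skeleton_nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:                       # only non-empty text leaves qualify
            self.skeleton_nodes.append(text)

def split_by_sentence(node_text):
    """Split one skeleton node into sentence-level sub-nodes.
    The boundary set (., !, ? and CJK counterparts) is a rough
    heuristic; the patent does not fix a particular splitter."""
    parts = re.split(r'(?<=[.!?。！？])\s*', node_text)
    return [p for p in parts if p]

parser = SkeletonParser()
parser.feed("<html><body><p>First sentence. Second sentence!</p>"
            "<div>Short label</div></body></html>")
sub_nodes = [split_by_sentence(s) for s in parser.skeleton_nodes]
# sub_nodes[0] == ['First sentence.', 'Second sentence!']
# sub_nodes[1] == ['Short label']
```

Each sub-node keeps its parent's position in the tree, so the list order here mirrors the sentence order before splitting, as the step requires.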
Step S3, classifying nodes;
Denote the classification model as f(s_ij^k, T). Given a classification threshold θ, when f(s_ij^k, T) > θ, the sub-node s_ij^k is semantically related to the target task T and is marked as a core information sub-node; otherwise it is marked as a non-core information sub-node. The core information C_i of web page W_i is then expressed as:

C_i = [ s_ij^k | f(s_ij^k, T) > θ, j ∈ [1, N_i], k ∈ [1, M_ij] ], i = 1, ..., N.

After web page information extraction has been converted into the classification of skeleton sub-nodes, the classification model f is obtained by using the pre-trained BERT directly and performing fine-tuning and prediction on that basis. The model input is the text sequence in each skeleton sub-node s_ij^k. Since this is a sequence classification task, only the output CLS needs to be classified. The classification model is constructed as

ŷ_ij^k = f(s_ij^k, T) = σ(γ · CLS(s_ij^k)),

using the cross-entropy loss function

L(γ) = − Σ_{i=1}^{N} Σ_{j=1}^{N_i} Σ_{k=1}^{M_ij} [ y_ij^k log ŷ_ij^k + (1 − y_ij^k) log(1 − ŷ_ij^k) ];

the objective function is minimized and parameter learning is performed as follows:

γ* = argmin_γ L(γ),

where CLS is the unit in the output layer of the BERT pre-training language model, γ is the weight of the BERT pre-training language model, and y_ij^k is the true label of s_ij^k.
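The thresholded decision of step S3 can be sketched as follows. The BERT scorer is replaced by a stub (the patent's fine-tuned model is not reproduced here), and all names are illustrative:

```python
from typing import Callable, List, Tuple

def extract_core_subnodes(
    sub_nodes: List[str],
    scorer: Callable[[str], float],  # stands in for f(s_ij^k, T), e.g. fine-tuned BERT
    theta: float = 0.5,              # classification threshold θ
) -> Tuple[List[str], List[str]]:
    """Mark each skeleton sub-node as core (score > θ) or non-core."""
    core, non_core = [], []
    for node in sub_nodes:
        (core if scorer(node) > theta else non_core).append(node)
    return core, non_core

# Toy scorer: pretend longer sentences are more likely to be body text.
toy_scorer = lambda text: min(1.0, len(text) / 40)
core, non_core = extract_core_subnodes(
    ["Breaking news: the full story follows in detail below.", "Login", "Share"],
    toy_scorer, theta=0.5)
# core contains the long sentence; "Login" and "Share" fall below θ
```

In the patent's pipeline the scorer would be the fine-tuned BERT's σ(γ · CLS(·)) output; any callable returning a score in [0, 1] fits this interface.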
Step S4, clustering paths;
the input of the path clustering algorithm is a node path which is classified into core information child nodes by BERT, the output is a path, a node corresponding to the path is called a core root node, and a sub-tree taking the core root node as a root is called a core information tree. And according to the core information tree, all the skeleton sub-nodes in the core information tree are marked as core information sub-nodes again, and other skeleton sub-nodes are marked as non-core information sub-nodes again to serve as final results after the classification results are corrected.
Memory skeleton sub-node sij kIs a sequence of
Figure BDA0003471554730000084
Wherein
Figure BDA0003471554730000085
Indicating the length of the path, i.e. from the root node to the node sij kNumber of nodes on the path, elements in the sequence
Figure BDA0003471554730000086
Indicating that the node in the path with index t is the next child of its parent.
The path clustering algorithm comprises the following steps: (1) inputting a node path to be clustered, initializing a queue for recording the path clustering result, and aligning the path from a root node; (2) clustering layer by layer from the root node, and recording the total number of the elements in the ith layer as LiThe greatest amountElement of many is denoted as emaxThe number thereof is recorded as
Figure BDA0003471554730000087
Setting a path clustering coefficient alpha (alpha is more than or equal to 0 and less than or equal to 1); (3) when in use
Figure BDA0003471554730000088
When it is, the current emaxAdding into queue and keeping the element of the layer as emaxContinuing to perform next-layer clustering on the reserved paths; (4) when in use
Figure BDA0003471554730000089
When the iteration is terminated, the iteration is returned to the queue, and the nodes in the queue arrange the represented paths in sequence.
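The four steps above can be sketched directly in Python. This is a minimal reading of the layer-by-layer procedure, with function and variable names chosen for illustration:

```python
from collections import Counter
from typing import List

def path_clustering(paths: List[List[int]], alpha: float = 0.8) -> List[int]:
    """Layer-by-layer path clustering (a sketch of the patent's PC algorithm).

    Each path is the child-index sequence from the DOM root to a sub-node
    classified as core information. At every layer the dominant child index
    e_max is kept if its share is at least alpha; the returned prefix is the
    path of the core root node."""
    result = []
    layer = 0
    while paths:
        elems = [p[layer] for p in paths if len(p) > layer]
        if not elems:
            break
        e_max, count = Counter(elems).most_common(1)[0]
        if count / len(elems) < alpha:   # consensus too weak: terminate
            break
        result.append(e_max)             # add e_max to the result queue
        paths = [p for p in paths if len(p) > layer and p[layer] == e_max]
        layer += 1
    return result

# Three core paths agree on the prefix [0, 1]; at the next layer they diverge.
print(path_clustering([[0, 1, 2], [0, 1, 3], [0, 1, 2, 5]], alpha=0.8))  # → [0, 1]
```

With α = 0.8 the clustering stops as soon as fewer than 80% of the surviving paths share a child index, so the result is the deepest node under which the core sub-nodes are still concentrated.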
In the web page information extraction method, a BERT pre-trained language model is adopted as the classification model, so the method further includes an offline learning step in which the skeleton information of a large number of web pages is used to fine-tune the BERT pre-trained language model so that it learns the domain knowledge of the downstream task. As shown in fig. 2, the specific process is as follows:

Select the HTML source code of a large number of web pages and parse it into DOM trees to obtain all skeleton nodes of the pages; manually label the text nodes, marking the text nodes related to the downstream task as core information nodes (positive samples) and the text nodes unrelated to the downstream task as non-core information nodes (negative samples);

Split each skeleton node by sentence to obtain skeleton sub-nodes, whose labels are the same as the labels of the text nodes before splitting;

Randomly divide all skeleton sub-nodes into a training set and a validation set; select several groups of parameters for BERT and, for each group, fine-tune BERT with the training set and verify the classification accuracy with the validation set; select the parameter combination that performs best on the validation set, together with the corresponding fine-tuned BERT, as the result of offline learning, and save the model and parameters locally for the online extraction part to call.
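The split-train-validate-select loop of the offline learning stage can be sketched as below. `train_and_eval` stands in for the actual BERT fine-tuning, and the 80/20 split ratio is an assumption not fixed by the patent:

```python
import random
from typing import Callable, Dict, List, Tuple

def offline_learning(samples: List[Tuple[str, int]],
                     param_grid: List[Dict],
                     train_and_eval: Callable[[Dict, list, list], Tuple[object, float]],
                     seed: int = 0):
    """Random train/validation split, then keep the parameter group whose
    fine-tuned model has the best validation accuracy."""
    data = samples[:]
    random.Random(seed).shuffle(data)     # random division of the sub-nodes
    cut = int(0.8 * len(data))            # assumed 80/20 split
    train, val = data[:cut], data[cut:]
    best_model, best_params, best_acc = None, None, -1.0
    for params in param_grid:
        model, acc = train_and_eval(params, train, val)
        if acc > best_acc:                # best performance on validation set
            best_model, best_params, best_acc = model, params, acc
    return best_model, best_params, best_acc

# Stub trainer: pretend validation accuracy depends only on the learning rate.
stub = lambda p, tr, va: (("bert", p["lr"]), {2e-5: 0.91, 5e-5: 0.88}[p["lr"]])
model, params, acc = offline_learning(
    [("sentence %d" % i, i % 2) for i in range(10)],
    [{"lr": 2e-5}, {"lr": 5e-5}], stub)
# params == {"lr": 2e-5}, the group with the higher validation accuracy
```

The returned model and parameter group are what the patent describes saving locally for the online extraction stage to call.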
In addition, in order to accurately evaluate the effect of web page information extraction, the invention also provides the CA evaluation index.

Define the accuracy of web page information extraction as CA. Given an extraction-quality threshold h (taken as h = 0.9 in the embodiment of the invention), let F1_i be the F1-score on test page W_i. When F1_i > h for test page W_i, the extraction is recorded as a success, with a_i = 1; otherwise the extraction is recorded as a failure, with a_i = 0. Then

CA = (1/N) Σ_{i=1}^{N} a_i,

where N is the total number of test pages. CA measures the extraction effect of the model at the granularity of the web page level; through CA one can directly see the proportion of web pages that reach the required extraction quality.
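The CA index is a straightforward per-page aggregate; a minimal sketch, where a page counts as a success when its F1-score exceeds h (the function name is illustrative):

```python
def ca_metric(f1_scores, h=0.9):
    """CA index: the fraction of test pages whose per-page F1-score
    exceeds the extraction-quality threshold h (h = 0.9 in the
    patent's embodiment)."""
    if not f1_scores:
        return 0.0
    a = [1 if f1 > h else 0 for f1 in f1_scores]   # a_i per test page
    return sum(a) / len(a)                          # mean of the a_i

# Four test pages, three of which pass the h = 0.9 bar.
print(ca_metric([0.95, 0.91, 0.97, 0.72]))  # → 0.75
```

Because the comparison is strict, a page scoring exactly F1 = h counts as a failure, matching the F1_i > h condition in the text.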
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention. As shown in fig. 3, the present invention further provides a data processing apparatus, which includes a processor and a computer-readable storage medium, wherein the processor retrieves and executes executable instructions in the computer-readable storage medium to perform information extraction on a web page; the computer readable storage medium stores executable instructions, and when the executable instructions are executed by the processor, the semantic-based webpage information extraction method is realized. It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing associated hardware (e.g., a processor) and the program may be stored in a readable storage medium, such as a read-only memory, a magnetic or optical disk, etc. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example, by an integrated circuit, or in software, for example, by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific form of combination of hardware and software.
The semantic-based web page information extraction method creatively adds the text semantics of the web page as input information for modeling, learns domain knowledge from web pages related to downstream tasks, extracts the core information of the web page, and provides effective structured input data for different downstream tasks. Compared with rule-based methods, it does not require designing different rules or extraction templates for different web pages, and does not suffer from rule failure caused by web page revisions; compared with machine-learning-based methods, it abandons the selection and modeling of traditional features (statistical features of DOM tree nodes), pays more attention to the semantics of the web page text, and improves the robustness of extraction and the generalization capability across different downstream tasks.
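The three-step pipeline described above (split skeleton nodes into sentence-level sub-nodes, classify sub-nodes by task semantics, cluster node paths into an information tree) can be sketched with only the standard library. The sentence splitter, the keyword "classifier", and the densest-cluster rule below are simplified placeholders standing in for the patent's sentence splitting, BERT classification, and node-path clustering; all names are illustrative.

```python
# Illustrative sketch of the extraction pipeline, standard library only.
import re
from html.parser import HTMLParser

class SkeletonNodeCollector(HTMLParser):
    """Collects (DOM path, text) pairs for text-bearing nodes."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))

def extract(html, is_core):
    # Step 1: split each skeleton node's text into sentence-level sub-nodes.
    collector = SkeletonNodeCollector()
    collector.feed(html)
    sub_nodes = [(path, s) for path, text in collector.nodes
                 for s in re.split(r"(?<=[.!?])\s+", text) if s]
    # Step 2: classify sub-nodes; keep those relevant to the target task
    # (a real system would use the BERT classifier here).
    info_nodes = [(path, s) for path, s in sub_nodes if is_core(s)]
    # Step 3: cluster by node path; keep the densest cluster as the info tree.
    clusters = {}
    for path, s in info_nodes:
        clusters.setdefault(path, []).append(s)
    return max(clusters.values(), key=len) if clusters else []

html = "<html><body><div><p>BERT extracts info. Menu link.</p></div></body></html>"
print(extract(html, lambda s: "info" in s.lower()))  # ['BERT extracts info.']
```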
The above embodiments are only for illustrating the invention and are not to be construed as limiting it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; therefore, all equivalent technical solutions also fall within the scope of the invention, which is defined by the claims.

Claims (10)

1. A webpage information extraction method based on semantics is characterized by comprising the following steps:
acquiring a target DOM tree of a target webpage, splitting a target skeleton node of the target DOM tree according to sentences to obtain a target skeleton sub-node of the target DOM tree;
classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree;
and clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage, and extracting webpage information contained in the target information tree.
2. The method for extracting web page information of claim 1, wherein a BERT pre-trained language model is used as the classification model.
3. The method for extracting web page information according to claim 2, further comprising the step of performing offline learning on the BERT pre-trained language model:
analyzing a known web page into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, and labeling all the known skeleton nodes: the known skeleton nodes related to a downstream task are labeled as known core information nodes, and the known skeleton nodes unrelated to the downstream task are labeled as non-core information nodes; splitting each known skeleton node according to sentences to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of the corresponding known skeleton node;
randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, and verifying with the verification set the classification precision of each BERT pre-trained language model whose parameter adjustment is completed; and selecting the BERT pre-trained language model with the highest classification precision as the classification model.
4. The method for extracting web page information of claim 3, wherein the BERT pre-trained language model is:

ŷ_ijk = CLS(s_ijk, T; γ),

and a cross-entropy loss function:

L = − Σ_i Σ_{j=1..N_i} Σ_{k=1..M_ij} [ y_ijk · log ŷ_ijk + (1 − y_ijk) · log(1 − ŷ_ijk) ]

is used to carry out parameter learning;

wherein s_ijk is a known skeleton sub-node, T is the target task semantic, ŷ_ijk is the predicted label of the known skeleton sub-node s_ijk, CLS is a unit in the output layer of the BERT pre-trained language model, γ is the weight of the BERT pre-trained language model, N_i is the number of known skeleton nodes of the known DOM tree of known web page W_i, M_ij is the number of sub-nodes obtained after splitting the j-th known skeleton node, and y_ijk is the real tag of the skeleton sub-node s_ijk.
5. A semantic-based web page information extraction system is characterized by comprising:
the node splitting module is used for acquiring a target DOM tree of a target web page, and splitting the target skeleton nodes of the target DOM tree according to sentences to obtain the target skeleton sub-nodes of the target DOM tree;
the node classification module is used for classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain the target information sub-nodes of the target DOM tree;
and the information extraction module is used for clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage and extracting webpage information contained in the target information tree.
6. The web page information extraction system according to claim 5, wherein a BERT pre-trained language model is used as the classification model.
7. The web page information extraction system according to claim 6, further comprising:
the offline learning module is used for performing offline learning on the BERT pre-training language model; specifically comprises
the node marking module is used for analyzing a known web page into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, and marking all the known skeleton nodes: the known skeleton nodes related to the downstream task are marked as known core information nodes, and the known skeleton nodes unrelated to the downstream task are marked as non-core information nodes; each known skeleton node is split according to sentences to obtain known skeleton sub-nodes, wherein the mark of each known skeleton sub-node is the same as the mark of the corresponding known skeleton node;
the model adjusting module is used for randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, verifying with the verification set the classification accuracy of each BERT pre-trained language model whose parameter adjustment is completed, and selecting the BERT pre-trained language model with the highest classification accuracy as the classification model.
8. The system for extracting web page information of claim 7, wherein the BERT pre-trained language model is:

ŷ_ijk = CLS(s_ijk, T; γ),

and a cross-entropy loss function:

L = − Σ_i Σ_{j=1..N_i} Σ_{k=1..M_ij} [ y_ijk · log ŷ_ijk + (1 − y_ijk) · log(1 − ŷ_ijk) ]

is used to carry out parameter learning;

wherein s_ijk is a known skeleton sub-node, T is the target task semantic, ŷ_ijk is the predicted label of the known skeleton sub-node s_ijk, CLS is a unit in the output layer of the BERT pre-trained language model, γ is the weight of the BERT pre-trained language model, N_i is the number of known skeleton nodes of the known DOM tree of known web page W_i, M_ij is the number of sub-nodes obtained after splitting the j-th known skeleton node, and y_ijk is the real tag of the skeleton sub-node s_ijk.
9. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the semantic-based web page information extraction method according to any one of claims 1 to 4.
10. A data processing apparatus comprising the computer-readable storage medium of claim 9, wherein the semantic-based extraction of web page information is performed when the processor of the data processing apparatus retrieves and executes the computer-executable instructions of the computer-readable storage medium.
CN202210044347.0A 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system Pending CN114528459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044347.0A CN114528459A (en) 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system


Publications (1)

Publication Number Publication Date
CN114528459A (en) 2022-05-24

Family

ID=81621550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210044347.0A Pending CN114528459A (en) 2022-01-14 2022-01-14 Semantic-based webpage information extraction method and system

Country Status (1)

Country Link
CN (1) CN114528459A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576710A (en) * 2024-01-15 2024-02-20 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117576710B (en) * 2024-01-15 2024-05-28 西湖大学 Method and device for generating natural language text based on graph for big data analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination