CN114528459A - Semantic-based webpage information extraction method and system - Google Patents
Semantic-based webpage information extraction method and system
- Publication number
- CN114528459A (application CN202210044347.0A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- target
- information
- skeleton
- sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a semantic-based webpage information extraction method comprising the following steps: acquiring a target DOM tree of a target webpage, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain target skeleton sub-nodes of the target DOM tree; classifying all the target skeleton sub-nodes according to the target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree; and clustering the node paths of all the target information sub-nodes to obtain a target information tree of the target webpage, and extracting the webpage information contained in the target information tree. The invention also provides a semantic-based webpage information extraction system and a data processing device for implementing the semantic-based webpage information extraction.
Description
Technical Field
The invention belongs to the technical field of network information, and particularly relates to a method and a system for extracting webpage information.
Background
We have now fully entered the internet era, and every day a vast amount of information of many kinds is published across different web pages. Much of the information in a web page is redundant and useless to the reader. The content of an original web page usually has to be processed a second time to obtain useful structured information; such structured data has great application value, and this whole process is called web page information extraction.
Which information in a web page counts as key information depends on the specific task. Taking a news web page as an example, a reader generally pays attention to specific information such as the news content, publishing time, authors, comments and pictures, while other interfering information in the page, such as advertisements, copyright statements, links to other pages and visual effects, is usually not the focus of attention. Text information in a web page, especially body text such as news text and blog text, is of great significance in many practical application scenarios; deeper mining and analysis can be performed on such data, for example by using it as the input of downstream tasks such as public opinion analysis and associated recommendation. Because downstream tasks require input in a structured, more standardized data format, a method that efficiently extracts valuable structured information from the semi-structured data of web pages has important research significance and value.
Existing webpage information extraction methods fall into two categories: rule-based methods and machine learning-based methods. Rule-based methods are varied and can be further classified into static template-based methods, wrapper induction methods, automatic template generation methods and heuristic methods.
Rule-based methods mostly generate a webpage template or extraction rules, either designed manually or computed automatically from similar web pages, and then extract with that template or those rules. Static template-based methods require templates to be written manually; different templates must be written for different web pages, a template becomes invalid once the web page is updated, the maintenance cost is high, and the user must have some programming ability. Wrapper induction methods generalize from a set of web pages labeled by the user to generate an extraction template for the corresponding website, and therefore depend on manual labeling of web pages. Automatic template generation methods assume that the set of web pages to be extracted was generated from a single template, analyze web pages with similar structure and generate the template automatically; they require no prior knowledge of the web pages, but for complex web pages template generation is inefficient and inaccurate. Heuristic methods mostly search for similar repeated strings in the DOM tree to locate data areas, analyze structural features automatically, and extract according to heuristic information. Machine learning-based methods usually treat the HTML source code of a web page as a sequence and extract by labeling that sequence with a machine learning model built on manually screened features. Rule-based methods thus need different rules or extraction templates for different web pages, and web page updates cause the rules to fail.
Machine learning-based methods build their models on manually selected features, which cannot accurately express the basic content and theme of a web page, so these methods cannot be applied to arbitrary web pages.
Both rule-based and machine learning-based methods face the following challenges. On the one hand, web page layouts are highly varied and the ways pictures and text are combined differ greatly, so it is difficult to find extraction rules or models suitable for all web pages. On the other hand, the structure of a web page may change continually as the website is updated, and the statistical characteristics of the page change with it, so extraction rules or models that previously performed well often fail after the page structure is updated.
Given these two challenges, it is necessary to find characteristics of a web page that do not change with its layout, architecture and the like, and to model the essence of the web page, namely its semantics. Existing methods essentially compute over manually selected web page features; these features capture only the statistical properties of text or tags, cannot learn the semantics of a word from its context, and cannot express the semantics of a sentence. As a result these methods cannot adapt to structural changes of the web page, and extraction fails.
Disclosure of Invention
In order to solve the above problems, the present invention provides a semantic-based web page information extraction method, which includes: acquiring a target DOM tree of a target webpage, and splitting the target skeleton nodes of the target DOM tree by sentence to obtain target skeleton sub-nodes of the target DOM tree; classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree; and clustering the node paths of all the target information sub-nodes to obtain a target information tree of the target webpage, and extracting the webpage information contained in the target information tree.
The webpage information extraction method adopts a BERT pre-training language model as the classification model.
The webpage information extraction method further comprises a step of performing offline learning on the BERT pre-trained language model: parsing a known webpage into a known DOM tree, obtaining the known skeleton nodes of the known DOM tree, and labeling all the known skeleton nodes, wherein known skeleton nodes related to a downstream task are labeled as known core information nodes and known skeleton nodes unrelated to the downstream task are labeled as non-core information nodes; splitting each known skeleton node by sentence to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of its corresponding known skeleton node; randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, and verifying with the verification set the classification precision of each BERT pre-trained language model whose parameters have been adjusted; and selecting the BERT pre-trained language model with the highest classification precision as the classification model.
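In the webpage information extraction method of the invention, parameter learning of the BERT pre-trained language model is carried out as follows:
using the cross-entropy loss function:
$$L(\gamma) = -\sum_{i=1}^{N}\sum_{j=1}^{N_i}\sum_{k=1}^{M_{ij}}\left[y_{ij}^{k}\log f(s_{ij}^{k},T) + (1-y_{ij}^{k})\log\bigl(1-f(s_{ij}^{k},T)\bigr)\right]$$
parameter learning is carried out by minimizing $L(\gamma)$ with respect to $\gamma$; wherein $s_{ij}^{k}$ is a known skeleton sub-node, $T$ is the target task semantics, $f(s_{ij}^{k},T)$ is the classification output for $s_{ij}^{k}$ computed from CLS, CLS is a unit in the output layer of the BERT pre-trained language model, $\gamma$ is the weight of the BERT pre-trained language model, $N$ is the number of known web pages, $N_i$ is the number of known skeleton nodes in the known DOM tree of known web page $W_i$, $M_{ij}$ is the number of sub-nodes obtained by splitting $s_{ij}$, and $y_{ij}^{k}$ is the true label of skeleton sub-node $s_{ij}^{k}$.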
The invention also provides a semantic-based webpage information extraction system, comprising: a node splitting module for acquiring a target DOM tree of a target webpage and splitting the target skeleton nodes of the target DOM tree by sentence to obtain target skeleton sub-nodes of the target DOM tree; a node classification module for classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree; and an information extraction module for clustering the node paths of all the target information sub-nodes to obtain a target information tree of the target webpage and extracting the webpage information contained in the target information tree.
The webpage information extraction system adopts a BERT pre-training language model as the classification model.
The webpage information extraction system of the invention further comprises: an offline learning module for performing offline learning on the BERT pre-trained language model; a node marking module for parsing a known webpage into a known DOM tree, obtaining the known skeleton nodes of the known DOM tree, and labeling all the known skeleton nodes, wherein known skeleton nodes related to a downstream task are labeled as known core information nodes and known skeleton nodes unrelated to the downstream task are labeled as non-core information nodes, and for splitting each known skeleton node by sentence to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of its corresponding known skeleton node; and a model adjusting module for randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters with the training set, verifying with the verification set the classification precision of each BERT pre-trained language model whose parameters have been adjusted, and selecting the BERT pre-trained language model with the highest classification precision as the classification model.
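In the webpage information extraction system of the invention, parameter learning of the BERT pre-trained language model is carried out as follows:
using the cross-entropy loss function:
$$L(\gamma) = -\sum_{i=1}^{N}\sum_{j=1}^{N_i}\sum_{k=1}^{M_{ij}}\left[y_{ij}^{k}\log f(s_{ij}^{k},T) + (1-y_{ij}^{k})\log\bigl(1-f(s_{ij}^{k},T)\bigr)\right]$$
parameter learning is carried out by minimizing $L(\gamma)$ with respect to $\gamma$; wherein $s_{ij}^{k}$ is a known skeleton sub-node, $T$ is the target task semantics, $f(s_{ij}^{k},T)$ is the classification output for $s_{ij}^{k}$ computed from CLS, CLS is a unit in the output layer of the BERT pre-trained language model, $\gamma$ is the weight of the BERT pre-trained language model, $N$ is the number of known web pages, $N_i$ is the number of known skeleton nodes in the known DOM tree of known web page $W_i$, $M_{ij}$ is the number of sub-nodes obtained by splitting $s_{ij}$, and $y_{ij}^{k}$ is the true label of skeleton sub-node $s_{ij}^{k}$.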
The present invention also provides a computer-readable storage medium storing computer-executable instructions, which when executed, implement the aforementioned semantic-based web page information extraction method.
The invention also provides a data processing device, which comprises the computer-readable storage medium, and when a processor of the data processing device calls and executes the computer-executable instructions in the computer-readable storage medium, the semantic-based webpage information extraction is realized.
Drawings
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention.
FIG. 2 is a schematic diagram of the BERT model offline learning of the present invention.
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
While researching webpage information extraction, the inventors found that the text a downstream task needs to extract differs in semantic essence from noise information, and that existing webpage information extraction techniques do not model text semantics. Existing methods must manually extract features of each sub-node as model input; they consider only the statistical features of the text, cannot learn the semantics of words from context, and cannot express the semantics of sentences, so the classifier gives unstable results for sentences that are synonymous but structured differently. To reduce classification error, the classifier must be able to model the semantics of the text. The inventors' investigation found that the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) learns text semantics well. BERT is an attention-based bidirectional language modeling method. It directly adopts the Encoder module of the Transformer architecture and has bidirectional encoding capability and strong feature extraction capability. The BERT framework has two steps: pre-training and fine-tuning. During pre-training, BERT performs self-supervised learning through Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to obtain a semantic representation of each word; during fine-tuning, only a small amount of task-specific data needs to be input for the semantics to be adjusted to the specific downstream task, so that the domain knowledge of the downstream task is learned.
BERT adopts a self-attention mechanism in which each word is represented as a weighted combination of the words in the context of its sentence; a fully connected layer then learns the association between the current word sequence and the node category, yielding the relevance between the node category and the sentence context. In the present invention, the pre-trained BERT is used directly, and fine-tuning and prediction are performed on that basis. In addition, so that the model can be applied to extraction for any text data mining downstream task, the inventors provide two stages, offline learning and online extraction: the offline learning stage learns the domain knowledge of the downstream task by fine-tuning the pre-trained language model BERT; the online extraction stage uses the fine-tuned BERT to classify the skeleton sub-nodes of the webpage and optimizes the classification result with a path clustering algorithm to obtain semantically consistent main webpage information. To improve the robustness of the model in webpage information extraction, the inventors also provide the Path Clustering (PC) algorithm, which, on top of the semantic modeling and extraction of webpage text, optimizes the semantic extraction result by modeling the structural information of the webpage.
The invention aims to solve the problem in the prior art that extraction accuracy drops sharply after a webpage is updated, and provides a semantic-based webpage information extraction method and system. The extraction method includes: splitting the text nodes of the webpage DOM tree and using the split nodes as input to the webpage information extraction model, which refines the classification granularity and improves classification accuracy; classifying the text in the webpage with a BERT pre-trained language model, which models the semantics of the webpage text and improves the accuracy of webpage text classification; optimizing the classification result with a path clustering algorithm (Path Clustering), which improves both the quality of extraction from a single webpage and the robustness of core information extraction; separating the parameter learning and extraction stages of the model with an offline learning / online extraction framework, in which a large number of extraction models for different downstream tasks are trained in advance during offline learning and the model matching the specific downstream task is selected directly during online extraction, so that together the two stages allow extraction for any text data mining downstream task; and adopting a webpage information extraction evaluation index, CA, to measure the extraction quality of single webpages and the proportion of webpages whose extraction quality reaches the standard.
FIG. 1 is a flow chart of the semantic-based web page information extraction method of the present invention. As shown in fig. 1, preprocessing is performed first: the input HTML is parsed into a DOM tree and the skeleton nodes are split to obtain skeleton sub-nodes. The sub-nodes are then input to the BERT fine-tuned in the offline learning stage for classification, which marks whether each skeleton sub-node is a core information sub-node. Finally, all skeleton sub-nodes marked as core information sub-nodes are clustered with the path clustering algorithm to obtain the path of the core root node, from which the final extraction result is obtained.
Specifically, the semantic-based webpage information extraction method specifically comprises the following steps:
Step S1, dividing nodes;
For web page $W_i$, $i = 1, \dots, N$, where $N$ is the total number of web pages, the plain text content that is independent of layout, font size and the like is defined as the skeleton information $S_i$ of the web page. Each node in the DOM tree represents one HTML tag, a pair of HTML tags, or a text item inside a tag. All text in the HTML appears as leaf nodes of the DOM tree; the leaf nodes storing text are called skeleton nodes and denoted $s_{ij}$, $j = 1, \dots, N_i$, where $s_{ij}$ is the $j$-th skeleton node of web page $W_i$, numbered in post-order (backward) traversal order of the DOM tree, and $N_i$ is the number of skeleton nodes in the DOM tree of $W_i$. Because the skeleton nodes of a web page correspond one-to-one to its texts, the skeleton information $S_i$ of web page $W_i$ is represented by the $N_i$ skeleton nodes of its DOM tree as $S_i = [s_{i1}, s_{i2}, \dots, s_{iN_i}]$. Accordingly, the webpage information extraction task is converted into a classification problem over the skeleton nodes of the web page.
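As an illustration of step S1, the following is a minimal sketch of collecting skeleton nodes (text leaves) together with their child-index paths, using only Python's standard `html.parser`. The class name `SkeletonNodeCollector`, the function `extract_skeleton_nodes`, the choice to skip `script`/`style` content and whitespace-only text, and the path encoding as a tuple of child indices are all assumptions made for this example, not details given in the source.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible page text (assumption of this sketch).
SKIP_TAGS = {"script", "style"}


class SkeletonNodeCollector(HTMLParser):
    """Collects non-empty text leaves with their child-index paths from the root."""

    def __init__(self):
        super().__init__()
        self.path = []            # child index of each open element, root first
        self.child_counts = [0]   # children seen so far at each open level
        self.tags = []            # names of the currently open elements
        self.nodes = []           # collected (path, text) pairs

    def _next_index(self):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        return index

    def handle_starttag(self, tag, attrs):
        index = self._next_index()
        if tag in VOID_TAGS:
            return  # counted as a child, but it has no subtree
        self.path.append(index)
        self.child_counts.append(0)
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return  # stray closing tag: ignore
        # Close everything up to and including the matching open element.
        while self.tags:
            closed = self.tags.pop()
            self.path.pop()
            self.child_counts.pop()
            if closed == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in SKIP_TAGS for t in self.tags):
            return
        index = self._next_index()
        self.nodes.append((tuple(self.path + [index]), text))


def extract_skeleton_nodes(html):
    """Return the skeleton nodes of a page as a list of (path, text) in document order."""
    collector = SkeletonNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes
```

Each returned path lists, from the root down, which child of its parent every node on the way to the text leaf is; the same encoding is convenient later when node paths are clustered.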
Step S2, splitting nodes;
Some skeleton nodes of a webpage contain only extremely short sentences, while others contain paragraphs made up of several sentences, so classifying skeleton nodes directly suffers from uneven input granularity, which reduces the accuracy of the classification model. To solve this problem, coarse-grained nodes need to be split. Specifically, the text in each skeleton node is divided by sentence, so that each skeleton node is split into one or more sub-nodes. The sub-nodes inherit the characteristics of the skeleton node before splitting: they have the same ancestor nodes as the node before splitting, they are siblings of one another, and their order is the order of the sentences before splitting. The sub-nodes obtained by splitting $s_{ij}$ are called the skeleton sub-nodes of $s_{ij}$ and denoted $s_{ij}^{k}$, $k = 1, \dots, M_{ij}$, where $M_{ij}$ is the number of sub-nodes obtained by splitting skeleton node $s_{ij}$. The skeleton information $S_i$ can thus be represented not only by all the skeleton nodes but also by the sequence of all skeleton sub-nodes, namely:
$$S_i = [\,s_{ij}^{k} \mid j \in [1, N_i],\ k \in [1, M_{ij}]\,]$$
Therefore, the webpage information extraction problem is reduced to the classification of all skeleton sub-nodes in the webpage.
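The sentence-level split of step S2 can be sketched as follows. The source does not say how sentence boundaries are detected; this example assumes a simple punctuation-based rule (ASCII `.`, `!`, `?` followed by whitespace, or the CJK marks `。！？`), and the names `split_into_sentences` and `split_skeleton_nodes` are hypothetical. Each sub-node keeps the path of its parent skeleton node, reflecting that sub-nodes share the ancestors of the node they were split from.

```python
import re

# Assumed boundary rule: split after ASCII end punctuation followed by whitespace,
# or directly after CJK end punctuation (which is usually not followed by a space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_into_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def split_skeleton_nodes(skeleton_nodes):
    """Turn (path, text) skeleton nodes into (path, k, sentence) sub-nodes.

    k is the 1-based position of the sentence inside its skeleton node,
    so sentence order is preserved.
    """
    sub_nodes = []
    for path, text in skeleton_nodes:
        for k, sentence in enumerate(split_into_sentences(text), start=1):
            sub_nodes.append((path, k, sentence))
    return sub_nodes
```

A punctuation rule of this kind will mis-split abbreviations and decimals written with a following space; a production system would likely substitute a proper sentence segmenter.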
Step S3, classifying nodes;
Denote the classification model as $f(s_{ij}^{k}, T)$. Given a classification threshold $\theta$, when $f(s_{ij}^{k}, T) > \theta$, $s_{ij}^{k}$ is semantically related to the target task $T$ and is marked as a core information sub-node; otherwise it is marked as a non-core information sub-node. The core information $C_i$ of web page $W_i$ is then expressed as:
$$C_i = [\,s_{ij}^{k} \mid f(s_{ij}^{k}, T) > \theta,\ j \in [1, N_i],\ k \in [1, M_{ij}]\,],\quad i = 1, \dots, N.$$
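The threshold rule that defines the core information can be written as a short filter. In this sketch the scoring function `score` is a stand-in for the classification model and is supplied by the caller; the function name `select_core_sub_nodes` and the default threshold of 0.5 are assumptions for illustration only.

```python
def select_core_sub_nodes(sub_nodes, score, theta=0.5):
    """Keep the sub-nodes whose score is strictly greater than the threshold theta.

    sub_nodes: any sequence of sub-node objects, in page order.
    score:     callable mapping a sub-node to a relevance score in [0, 1];
               here it stands in for the fine-tuned classifier.
    """
    return [node for node in sub_nodes if score(node) > theta]
```

Because the comparison is strict, a sub-node scoring exactly `theta` is treated as non-core, matching the `>` in the definition of the core information.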
After webpage information extraction has been converted into the classification of skeleton sub-nodes, the classification model $f(s_{ij}^{k}, T)$ is obtained by taking the pre-trained BERT directly and performing fine-tuning and prediction on that basis. The model input is the text sequence of each skeleton sub-node $s_{ij}^{k}$. Since this is a sequence classification task, only the output CLS needs to be classified. The classification model is constructed as:
$$f(s_{ij}^{k}, T) = P\bigl(y_{ij}^{k} = 1 \mid \mathrm{CLS};\ \gamma\bigr)$$
using the cross-entropy loss function:
$$L(\gamma) = -\sum_{i=1}^{N}\sum_{j=1}^{N_i}\sum_{k=1}^{M_{ij}}\left[y_{ij}^{k}\log f(s_{ij}^{k},T) + (1-y_{ij}^{k})\log\bigl(1-f(s_{ij}^{k},T)\bigr)\right]$$
the objective function is minimized and parameter learning is performed as follows:
$$\gamma^{*} = \arg\min_{\gamma} L(\gamma)$$
where CLS is a unit in the output layer of the BERT pre-trained language model, $\gamma$ is the weight of the BERT pre-trained language model, and $y_{ij}^{k}$ is the true label of $s_{ij}^{k}$.
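To make the training objective concrete, here is a small sketch of a cross-entropy loss over labeled skeleton sub-nodes. It assumes the standard binary form, summed over all sub-nodes, with the model output interpreted as the probability that a sub-node is core information; the exact form used in the source is not recoverable from the text, and the function name and the clipping constant `eps` are choices made for this example.

```python
import math


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Summed binary cross-entropy.

    probabilities: predicted probability of the positive (core) class per sub-node.
    labels:        true labels per sub-node, 1 for core and 0 for non-core.
    eps:           clipping constant that keeps log() away from zero.
    """
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Fine-tuning then amounts to adjusting the model weights so that this quantity decreases on the training set; in practice a deep learning framework would compute the loss and its gradients.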
Step S4, clustering paths;
The input of the path clustering algorithm is the set of paths of the nodes classified by BERT as core information sub-nodes; the output is a single path. The node corresponding to that path is called the core root node, and the sub-tree rooted at the core root node is called the core information tree. Based on the core information tree, all skeleton sub-nodes inside it are re-marked as core information sub-nodes and all other skeleton sub-nodes are re-marked as non-core information sub-nodes; this corrected classification is the final result.
Denote the path of skeleton sub-node $s_{ij}^{k}$ as a sequence $p_{ij}^{k} = [e_1, e_2, \dots, e_l]$, where $l$ is the length of the path, i.e. the number of nodes on the path from the root node to node $s_{ij}^{k}$, and the element $e_t$ indicates which child of its parent the node at index $t$ in the path is.
The path clustering algorithm comprises the following steps:
(1) input the node paths to be clustered, initialize a queue for recording the path clustering result, and align the paths from the root node;
(2) cluster layer by layer starting from the root node: denote the total number of elements in the $i$-th layer as $L_i$, the most frequent element as $e_{\max}$ and its count as $L_{e_{\max}}$, and set a path clustering coefficient $\alpha$ ($0 \le \alpha \le 1$);
(3) when $L_{e_{\max}} / L_i \ge \alpha$, add the current $e_{\max}$ to the queue, keep the paths whose element in this layer is $e_{\max}$, and continue with next-layer clustering on the kept paths;
(4) when $L_{e_{\max}} / L_i < \alpha$, terminate the iteration and return the queue; the elements in the queue, in order, give the output path.
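The layer-by-layer clustering can be sketched in a few lines. The continuation test is garbled in the source text, so this example assumes it is "share of the most frequent element in the current layer is at least alpha"; it also assumes paths are tuples of child indices aligned at the root, and that paths shorter than the current layer are ignored. The function names `cluster_paths` and `relabel_by_core_root` are hypothetical.

```python
from collections import Counter


def cluster_paths(paths, alpha=0.6):
    """Return the path of the core root node as a tuple of child indices.

    paths: child-index paths of the sub-nodes classified as core information.
    alpha: path clustering coefficient in [0, 1] (assumed test: share >= alpha).
    """
    kept = [tuple(p) for p in paths]
    core_path = []
    depth = 0
    while kept:
        layer = [p[depth] for p in kept if len(p) > depth]
        if not layer:
            break  # every kept path has ended
        e_max, count = Counter(layer).most_common(1)[0]
        if count / len(layer) < alpha:
            break  # no dominant branch at this layer: stop here
        core_path.append(e_max)
        # Only paths that follow the dominant branch take part in the next layer.
        kept = [p for p in kept if len(p) > depth and p[depth] == e_max]
        depth += 1
    return tuple(core_path)


def relabel_by_core_root(all_paths, core_path):
    """Mark a sub-node as core (True) exactly when its path lies under the core root."""
    n = len(core_path)
    return [tuple(p[:n]) == tuple(core_path) for p in all_paths]
```

With a low `alpha` the loop keeps following the largest branch and can descend to a single leaf, while a high `alpha` stops near the root; the coefficient therefore trades precision of the extracted sub-tree against coverage. Re-labeling by the core root both removes stray positives outside the sub-tree and recovers sub-nodes inside it that the classifier missed.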
In the above webpage information extraction method, the classification model is a BERT pre-trained language model, so the method further comprises an offline learning step in which the skeleton information of a large number of webpages is used to fine-tune the BERT pre-trained language model so that it learns the domain knowledge of the downstream task. As shown in fig. 2, the specific process is as follows:
selecting the HTML source code of a large number of webpages, parsing it into DOM trees to obtain all skeleton nodes of the webpages, and manually labeling the skeleton nodes, marking those related to the downstream task as core information nodes (positive samples) and those unrelated to the downstream task as non-core information nodes (negative samples);
splitting each skeleton node by sentence to obtain skeleton sub-nodes, each of which carries the same label as the skeleton node it was split from;
randomly dividing all skeleton sub-nodes into a training set and a verification set, selecting a plurality of groups of parameters for BERT, fine-tuning BERT on the training set for each group of parameters and verifying the classification precision on the verification set, selecting the parameter combination that performs best on the verification set together with the corresponding fine-tuned BERT as the result of offline learning, and storing the model and parameters locally for use by the online extraction part.
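The model selection loop of the offline learning stage can be outlined as below. This is a sketch under stated assumptions: `fine_tune` is a hypothetical callable supplied by the caller that stands in for fine-tuning BERT on the training set with one group of parameters and returns a function mapping text to a predicted label; the 20% validation share, the fixed seed and all function names are illustrative choices not taken from the source.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=0):
    """Randomly split labeled (text, label) samples into training and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def accuracy(model, samples):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for text, label in samples if model(text) == label)
    return correct / len(samples)


def select_best_model(samples, parameter_grid, fine_tune, seed=0):
    """Fine-tune once per parameter group and keep the best model on the validation set.

    fine_tune(train_samples, params) -> model, where model(text) returns 0 or 1.
    Returns (validation_accuracy, params, model) for the winning group.
    """
    train, validation = train_validation_split(samples, seed=seed)
    best = None
    for params in parameter_grid:
        model = fine_tune(train, params)
        score = accuracy(model, validation)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best
```

The winning model and its parameters are what would be stored locally for the online extraction stage; ties are resolved in favor of the earlier parameter group.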
In addition, in order to accurately evaluate the extraction effect of the webpage information, the invention also provides CA index evaluation:
The accuracy of webpage information extraction is defined as CA. Given an extraction quality threshold $h$ (in the embodiment of the present invention, $h = 0.9$), let $F1_i$ be the F1-score of test page $W_i$. When $F1_i > h$, the extraction for test page $W_i$ is recorded as successful and $a_i = 1$; otherwise the extraction is recorded as failed and $a_i = 0$. Then
$$CA = \frac{1}{N}\sum_{i=1}^{N} a_i$$
where $N$ is the total number of test pages. CA measures the extraction effect of the model at web-page granularity, and gives an intuitive view of the proportion of web pages that meet the extraction quality requirement.
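The CA index is straightforward to compute once a per-page F1-score is available. In the sketch below, `ca_score` follows the definition directly; `f1_score` is included only for completeness and assumes that a page's extraction result and its reference are compared as sets of sub-node identifiers, which the source does not specify.

```python
def f1_score(extracted, reference):
    """F1 between two collections of item identifiers (assumed set comparison)."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def ca_score(page_f1_scores, h=0.9):
    """Share of test pages whose F1-score is strictly greater than the threshold h."""
    if not page_f1_scores:
        return 0.0
    successes = sum(1 for f1 in page_f1_scores if f1 > h)
    return successes / len(page_f1_scores)
```

For example, `ca_score([0.95, 0.9, 0.5, 1.0])` counts two successful pages out of four, since a page with F1 exactly equal to `h` does not pass the strict comparison.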
FIG. 3 is a schematic diagram of a data processing apparatus according to the present invention. As shown in fig. 3, the present invention further provides a data processing apparatus, which includes a processor and a computer-readable storage medium, wherein the processor retrieves and executes executable instructions in the computer-readable storage medium to perform information extraction on a web page; the computer readable storage medium stores executable instructions, and when the executable instructions are executed by the processor, the semantic-based webpage information extraction method is realized. It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing associated hardware (e.g., a processor) and the program may be stored in a readable storage medium, such as a read-only memory, a magnetic or optical disk, etc. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example, by an integrated circuit, or in software, for example, by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific form of combination of hardware and software.
The semantic-based webpage information extraction method of the present invention creatively adds the text semantics of the webpage as input information for modeling, learns domain knowledge from webpages related to downstream tasks, extracts the core information of the webpage, and provides effective structured input data for different downstream tasks. Compared with rule-based methods, the method does not require designing different rules or extraction templates for different webpages, and is not affected by rule failure caused by webpage updates and revisions. Compared with machine-learning-based methods, the method abandons the selection and modeling of traditional features (the statistical features of DOM tree nodes) and focuses instead on the semantics of the webpage text, which improves extraction robustness and generalization across different downstream tasks.
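The extraction flow of the method (splitting skeleton nodes into sentences, classifying the resulting sub-nodes by task semantics, and grouping the node paths of the information sub-nodes) can be sketched in Python as follows. This is an illustrative sketch only: the `(path, text)` node representation, the regular-expression sentence splitter, the keyword-based `keyword_classifier` that stands in for the BERT classification model, and the grouping of paths by parent element are all assumptions made for the example, not the disclosed implementation.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed representation: a skeleton node is a (DOM path, text) pair.
Node = Tuple[str, str]

# Naive sentence pattern (an assumption): text up to ., !, ? or a CJK terminator.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_into_sub_nodes(nodes: List[Node]) -> List[Node]:
    """Split each skeleton node into sentence-level sub-nodes that keep its path."""
    sub_nodes = []
    for path, text in nodes:
        for sentence in SENTENCE.findall(text):
            if sentence.strip():
                sub_nodes.append((path, sentence.strip()))
    return sub_nodes


def extract_information(nodes: List[Node],
                        is_relevant: Callable[[str], bool]) -> List[str]:
    """Classify sub-nodes, group relevant ones by parent path, return the largest group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path, sentence in split_into_sub_nodes(nodes):
        if is_relevant(sentence):  # stand-in for the classification model
            parent = path.rsplit("/", 1)[0]  # simple stand-in for path clustering
            groups[parent].append(sentence)
    if not groups:
        return []
    return max(groups.values(), key=len)


def keyword_classifier(sentence: str) -> bool:
    """Hypothetical classifier: a keyword test in place of a fine-tuned BERT model."""
    return any(k in sentence.lower() for k in ("bert", "classif", "cluster"))


# Hypothetical page with two content paragraphs, a widget, and an ad
page = [
    ("/html/body/div[1]/p[1]", "The model uses BERT. It classifies sentences."),
    ("/html/body/div[1]/p[2]", "Paths of relevant nodes are clustered."),
    ("/html/body/div[2]/a[1]", "Sign in to comment."),
    ("/html/body/div[3]/span[1]", "BERT ad banner."),
]
extracted = extract_information(page, keyword_classifier)
print(extracted)
```

In this example the ad banner is misclassified as relevant by the keyword test, but it is dropped because its path falls outside the largest group; this is one way path grouping can compensate for isolated classification errors.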
The above embodiments only illustrate the invention and are not to be construed as limiting it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, which is defined by the claims.
Claims (10)
1. A webpage information extraction method based on semantics is characterized by comprising the following steps:
acquiring a target DOM tree of a target webpage, splitting a target skeleton node of the target DOM tree according to sentences to obtain a target skeleton sub-node of the target DOM tree;
classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree;
and clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage, and extracting webpage information contained in the target information tree.
2. The method for extracting web page information of claim 1, wherein a BERT pre-trained language model is used as the classification model.
3. The method for extracting web page information according to claim 2, further comprising the step of performing offline learning on the BERT pre-trained language model:
parsing a known webpage into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, labeling information of all the known skeleton nodes, labeling the known skeleton nodes related to a downstream task as known core information nodes, and labeling the known skeleton nodes unrelated to the downstream task as non-core information nodes; splitting each known skeleton node according to sentences to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of its corresponding known skeleton node;
randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, respectively adjusting each group of initial parameters by the training set, and respectively verifying the classification precision of each BERT pre-training language model for completing parameter adjustment by the test set; and selecting the BERT pre-training language model with the highest classification precision as the classification model.
4. The method for extracting web page information of claim 3, wherein the BERT pre-training language model is:
using a cross entropy loss function:
wherein the loss is taken over the known skeleton sub-nodes; T is the target task semantics; CLS is a unit in the output layer of the BERT pre-trained language model; γ is the weight of the BERT pre-trained language model; $N_i$ is the number of known skeleton nodes of the known DOM tree of the known webpage $W_i$; $M_{ij}$ is the number of sub-nodes obtained after the j-th known skeleton node is split; and the target of each skeleton sub-node is its real label.
5. A semantic-based web page information extraction system is characterized by comprising:
the node splitting module is used for acquiring target skeleton sub-nodes: acquiring a target DOM tree of a target webpage, and splitting a target skeleton node of the target DOM tree according to sentences to obtain a target skeleton sub-node of the target DOM tree;
the node classification module is used for classifying all the target skeleton sub-nodes according to target task semantics by using a classification model to obtain target information sub-nodes of the target DOM tree;
and the information extraction module is used for clustering node paths formed by all the target information sub-nodes to obtain a target information tree of the target webpage and extracting webpage information contained in the target information tree.
6. The web page information extraction system according to claim 5, wherein a BERT pre-trained language model is used as the classification model.
7. The web page information extraction system according to claim 6, further comprising:
the offline learning module is used for performing offline learning on the BERT pre-trained language model, and specifically comprises:
the node marking module is used for parsing the known webpage into a known DOM tree, obtaining known skeleton nodes of the known DOM tree, labeling the information of all the known skeleton nodes, labeling the known skeleton nodes related to the downstream task as known core information nodes, and labeling the known skeleton nodes unrelated to the downstream task as non-core information nodes; splitting each known skeleton node according to sentences to obtain known skeleton sub-nodes, wherein the label of each known skeleton sub-node is the same as the label of its corresponding known skeleton node;
the model adjusting module is used for randomly dividing all the known skeleton sub-nodes into a training set and a verification set, setting a plurality of groups of initial parameters, adjusting each group of initial parameters by the training set respectively, and verifying the classification accuracy of each BERT pre-training language model for completing parameter adjustment by the test set respectively; and selecting the BERT pre-training language model with the highest classification precision as the classification model.
8. The system for extracting web page information of claim 7, wherein the BERT pre-trained language model is:
using a cross entropy loss function:
wherein the loss is taken over the known skeleton sub-nodes; T is the target task semantics; CLS is a unit in the output layer of the BERT pre-trained language model; γ is the weight of the BERT pre-trained language model; $N_i$ is the number of known skeleton nodes of the known DOM tree of the known webpage $W_i$; $M_{ij}$ is the number of sub-nodes obtained after the j-th known skeleton node is split; and the target of each skeleton sub-node is its real label.
9. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the semantic-based web page information extraction method according to any one of claims 1 to 4.
10. A data processing apparatus comprising the computer-readable storage medium of claim 9, wherein the semantic-based extraction of web page information is performed when the processor of the data processing apparatus retrieves and executes the computer-executable instructions of the computer-readable storage medium.
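To make the cross-entropy loss recited in claims 4 and 8 more concrete, the following Python sketch sums a binary cross-entropy over every sentence-level sub-node of every skeleton node of every known webpage. It is an illustration under stated assumptions: the binary form (core versus non-core information), the unnormalized sum, the nested-list data layout, and the name `cross_entropy_loss` are chosen for the example, and the predicted probabilities are supplied directly rather than computed by a BERT model.

```python
import math
from typing import List, Tuple

# Assumed layout: pages -> skeleton nodes -> sub-nodes, each sub-node being
# (predicted probability of "core information", real label in {0, 1}).
SubNode = Tuple[float, int]
Page = List[List[SubNode]]


def cross_entropy_loss(pages: List[Page], eps: float = 1e-12) -> float:
    """Sum binary cross-entropy over all sub-nodes of all known pages."""
    total = 0.0
    for page in pages:              # index i: known webpages
        for node in page:           # index j: skeleton nodes of page i
            for p, y in node:       # index k: sub-nodes of skeleton node j
                p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
                total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data: one page, one skeleton node split into two sub-nodes
pages = [[[(0.5, 1), (0.5, 0)]]]
loss = cross_entropy_loss(pages)
print(loss)
```

A prediction of 0.5 for both sub-nodes contributes ln 2 each, so the loss shrinks toward zero only as the predicted probabilities approach the real labels.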
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210044347.0A CN114528459A (en) | 2022-01-14 | 2022-01-14 | Semantic-based webpage information extraction method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114528459A (en) | 2022-05-24 |
Family
ID=81621550
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210044347.0A Pending CN114528459A (en) | 2022-01-14 | 2022-01-14 | Semantic-based webpage information extraction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528459A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576710A (en) * | 2024-01-15 | 2024-02-20 | 西湖大学 | Method and device for generating natural language text based on graph for big data analysis |
CN117576710B (en) * | 2024-01-15 | 2024-05-28 | 西湖大学 | Method and device for generating natural language text based on graph for big data analysis |
Similar Documents
Publication | Title |
---|---|
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium |
Khan et al. | A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material |
CN103646112B (en) | Dependency parsing field self-adaption method based on web search |
CN110609983B (en) | Structured decomposition method for policy file |
US20220414463A1 (en) | Automated troubleshooter |
CN113196277A (en) | System for retrieving natural language documents |
CN115098634B (en) | Public opinion text emotion analysis method based on semantic dependency relationship fusion characteristics |
CN113609838B (en) | Document information extraction and mapping method and system |
CN113191148A (en) | Rail transit entity identification method based on semi-supervised learning and clustering |
CN113360582B (en) | Relation classification method and system based on BERT model fusion multi-entity information |
CN114528459A (en) | Semantic-based webpage information extraction method and system |
JP2013054607A (en) | Rearrangement rule learning device, method and program, and translation device, method and program |
CN110377753B (en) | Relation extraction method and device based on relation trigger word and GRU model |
CN116483314A (en) | Automatic intelligent activity diagram generation method |
Lin et al. | Chinese story generation of sentence format control based on multi-channel word embedding and novel data format |
CN116384387A (en) | Automatic combination and examination method and device |
CN115840815A (en) | Automatic abstract generation method based on pointer key information |
CN114490937A (en) | Comment analysis method and device based on semantic perception |
CN114330350A (en) | Named entity identification method and device, electronic equipment and storage medium |
CN113705207A (en) | Grammar error recognition method and device |
Chen | Identification of Grammatical Errors of English Language Based on Intelligent Translational Model |
CN112101001A (en) | Method and system for judging similarity of unstructured texts |
US11995394B1 (en) | Language-guided document editing |
CN111259159A (en) | Data mining method, device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||