CN116628303A - Semi-structured webpage attribute value extraction method and system based on prompt learning - Google Patents

Semi-structured webpage attribute value extraction method and system based on prompt learning

Info

Publication number
CN116628303A
Authority
CN
China
Prior art keywords
text
task
node
prompt
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310462355.1A
Other languages
Chinese (zh)
Inventor
曹聪
冯佳丽
曹亚男
袁方方
李保珂
卢毓海
刘燕兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202310462355.1A
Publication of CN116628303A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semi-structured webpage attribute value extraction method and system based on prompt learning, relating to the field of the Internet. The method first retrieves a DOM tree perspective prompt for each variable node using a DOM tree simplification algorithm, then designs a task template containing a task description to obtain a template perspective prompt, and finally introduces a pre-trained language model based on an encoder-decoder structure. Taking "prompting" as the core operation, the method comprehensively analyzes the characteristics of the domain data and of the target task, designs prompt information from two perspectives, and fills and fuses the dual-perspective prompts through the template. Prompt learning thereby guides the pre-trained language model to learn the task jointly at the semantic level and the task level, achieving an effective combination of the pre-trained language model and the attribute value extraction task and excellent model performance in scenarios where domain-labeled data is scarce.

Description

Semi-structured webpage attribute value extraction method and system based on prompt learning
Technical Field
The invention relates to the field of the Internet, and in particular to a semi-structured webpage attribute value extraction method and system based on prompt learning.
Background
There are a huge number of semi-structured web pages on the Internet describing entities. The contents of these pages are often carefully edited and audited and contain a great deal of high-quality entity attribute information. As valuable production data, this attribute information is widely used in scientific research and real business scenarios and helps improve downstream task performance. However, because semi-structured web pages exist at huge scale and their content is complex and layouts variable, extracting structured attribute information from them is a very challenging task. At present, methods for extracting attribute values from semi-structured web pages fall mainly into the following three categories:
firstly, wrapper-based semi-structured webpage attribute value extraction methods, i.e., methods that construct extraction rules or templates based on shallow regularities of the target data in web pages. For each website, the pages on the site typically share the same or a similar layout structure, a characteristic that gives the data in the pages a certain regularity of presentation. Target data can therefore be extracted by formulating rules according to the discovered regularities. The document "Hammer J, McHugh J, Garcia-Molina H. Semistructured data: the TSIMMIS experience. British Computer Society, 1997" first discovers regularities in features such as the HTML tags and XPath paths of the target data by analyzing the web page source code, and then designs a series of sequentially executed instructions according to these regularities, thereby accurately extracting the target data. The document "Li Ping, Zhu Jianbo, Zhou Lixin, et al. An online-shopping information extraction method based on rapid template construction [J]. Computer Applications, 2014, 34(3): 733-737" designs a set of search/locating and operation languages for rapidly constructing web page information extraction templates and determining extraction rules. During extraction, the method first matches the URL of a web page to the corresponding information extraction template, then parses the page content with the template, extracts the target data fields, and forms structured data.
Secondly, semi-structured webpage attribute value extraction methods based on visual features, which start from the characteristics of web page layout and consider visual features such as the size, proportion and relative layout of text blocks in a page. To help readers quickly understand and locate page content, web pages usually divide and arrange information in a visual structure, so analyzing and understanding page content from the perspective of visual features is an important approach. The literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011" proposes a method based on rendering features that uses visual features such as the size and position of node text blocks in the DOM tree to measure the distance between web page blocks, which is then used to extract attribute values from the page. The literature "Kumar A, Morabia K, Wang J, et al. CoVA: Context-aware Visual Attention for Webpage Information Extraction [J]. arXiv e-prints, 2021" proposes CoVA, a neural network model that encodes visual features and comprises a representation network (CNNs, pooling layers and positional encoders) and a graph attention network. The model first uses the representation network to learn visual features such as the size and spatial position of each node's content block in the web page screenshot, and then applies the graph attention network to propagate and aggregate the visual features of the nodes, thereby enhancing the visual representation of each node.
Thirdly, semi-structured webpage attribute value extraction methods based on DOM tree features, which mainly focus on features of the underlying web page DOM tree, such as node XPath, node HTML tags, node text content and node associations. Since the rendered appearance of a page is closely tied to how its HTML source is written, and the web page DOM tree is used to parse the HTML document, using features of the underlying DOM tree to locate or understand page content is an important approach. The literature "Lin B Y, Sheng Y, Vo N, et al. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [J]. ACM, 2020" builds a two-stage model, FreeDOM, to infer the attribute types of DOM tree nodes. In the first stage, the model mainly extracts local features of each node for node classification, including the node's text features, preceding semantic features, HTML tag, content format and other discrete features. In the second stage, the model learns dependencies between nodes by constructing node pairs, thereby correcting the first-stage classification results. SimpDOM, proposed in the literature "Zhou Y, Sheng Y, Vo N, et al. Simplified DOM Trees for Transferable Attribute Extraction from the Web [J]. 2021", simplifies FreeDOM's two-stage design into a single integrated attribute value extraction model: after retrieving the context of the current node from the DOM tree, it serializes and concatenates the text features of the node and its context nodes to obtain an enhanced representation of the node's semantics. The method also introduces discrete features such as node XPath and node HTML tags to further improve generalization. This enables SimpDOM to be trained on seed websites and to extract from new websites.
Early wrapper-based methods require only a small number of labeled web pages per website, but building and periodically updating templates for every website incurs substantial labor and time costs and limits their practicality. Feature-learning-based methods do not depend on templates; instead they use a few seed websites to generalize the model to other websites in the same domain. However, each seed website still requires thousands of annotated pages, so a large amount of page-level data collection and annotation remains necessary; otherwise the scarcity of annotation data limits the learning ability of the model.
Disclosure of Invention
The invention aims to provide a semi-structured webpage attribute value extraction method and system based on prompt learning, which address the problem that scarce domain annotation data limits the learning capacity of existing models, extract attribute information from semi-structured web pages, and generalize across websites.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a semi-structured webpage attribute value extraction method based on prompt learning comprises the following steps:
for a semi-structured web page, the text nodes in the web page DOM tree comprise fixed nodes and variable nodes, and for each variable node a semantic-level DOM tree perspective prompt is retrieved;
designing a task template that adds a task description to the text content of each variable node, so as to obtain a task-level template perspective prompt;
rewriting the text content of each variable node and its DOM tree perspective prompt through the task template, and masking the label mapping text in the task template, so as to fuse the DOM tree perspective prompt and the template perspective prompt into a dual-perspective prompt;
using a pre-trained language model based on an encoder-decoder structure, inputting the dual-perspective prompt to the encoder and predicting the text at the mask position with the decoder, wherein the predicted text consists of vocabulary words of the pre-trained language model, and determining the attribute type according to the mapping relation between texts and predefined attributes;
training the pre-trained language model by calculating the matching probability between the text output by the decoder and the label mapping text, computing a loss function from the matching probability, and optimizing the loss function;
and, for a semi-structured web page to be processed, predicting the attribute types of its variable nodes with the trained pre-trained language model: obtaining the normalized probability output at each position of the decoder, calculating a score for each attribute type from these probabilities, and taking the attribute type with the highest score as the prediction result.
Further, for each variable node, the step of retrieving its semantic-level DOM tree perspective prompt comprises:
firstly, for every node, finding its ancestor nodes according to its XPath and putting the node into the set corresponding to each of its ancestors;
then, for a variable node, backtracking upward from its nearest ancestor: if the set corresponding to the current ancestor contains a fixed node whose distance to the variable node meets the requirement, stopping and taking that fixed node as the semantic prompt of the DOM tree perspective; otherwise continuing to the next higher ancestor and repeating the process.
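For concreteness, the following Python sketch illustrates this retrieval procedure on an lxml-parsed DOM tree. It is a minimal illustration rather than the patent's implementation: the function name, the use of lxml, the choice of the first fixed node under a shared ancestor, and the simplified distance criterion (only the variable node's distance to the candidate ancestor is bounded) are assumptions.

```python
from collections import defaultdict
from lxml import html  # pip install lxml

def find_dom_prompt(var_node, fixed_nodes, max_dist=2):
    """Return the text of a nearby fixed node to serve as the DOM tree
    perspective prompt of `var_node`, or "" if none is found within range."""
    # Step 1: index every fixed node under each of its ancestors.
    under = defaultdict(list)
    for f in fixed_nodes:
        for anc in f.iterancestors():
            under[anc].append(f)

    # Step 2: backtrack from the nearest ancestor of the variable node upward.
    for dist, anc in enumerate(var_node.iterancestors(), start=1):
        if dist > max_dist:    # the shared ancestor would be too far from var_node
            break
        if under.get(anc):     # a fixed node exists under this ancestor
            # Simplification: take the first candidate found under this ancestor.
            return under[anc][0].text_content().strip()
    return ""

# Illustrative usage (fixed/variable node lists assumed to be identified already):
# tree = html.fromstring(open("page.html").read())
# prompt = find_dom_prompt(variable_node, fixed_node_list, max_dist=2)
```

Consistent with the implementation details given later, `max_dist` plays the role of the constant D and defaults to 2.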
Further, when prediction is performed with the pre-trained language model based on the encoder-decoder structure, a mask is placed in advance at the blank position of the task template as a label placeholder, and the pre-trained language model then learns the content information and the length information of the mask position through training.
Further, the loss function is a log-likelihood loss function, and optimizing the loss function means minimizing the negative log-likelihood over all variable nodes.
Further, when calculating the scores, each predicted attribute type name is converted into its label mapping text, and the score is normalized by the length of the label mapping text.
A semi-structured webpage attribute value extraction system based on prompt learning comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above method when executing the program.
The invention has the following advantages:
the invention provides a semi-structured webpage attribute value extraction method based on prompt learning, which firstly searches DOM tree visual angle prompts of variable nodes according to a DOM tree simplification algorithm, and can improve the domain knowledge retrieval capability of a pre-training language model. And then designing a task template containing task description to obtain template visual angle prompt information, so as to excite task understanding capability of the pre-training language model. Finally, as the pre-training language model is rich in priori knowledge and semantic expression, the problem of task information deficiency can be effectively relieved, the pre-training language model based on an encoder-decoder structure is introduced, the prompting is used as a core operation, the field data characteristics and the target task characteristics are comprehensively analyzed, the prompting information of two visual angles is designed, the dual-visual-angle prompting information is filled and fused through a template, the pre-training language model is guided to perform task learning in a combined manner of prompting learning at a semantic level and a task level, the effective combination of the pre-training language model and an attribute value extraction task is realized, and the excellent model performance under the field labeling data scarcity scene is realized.
Drawings
Fig. 1 is a framework diagram of a semi-structured web page attribute value extraction method based on prompt learning.
FIG. 2 is a general flow chart of semi-structured web page attribute value extraction.
Fig. 3 is a graph of the ablation experimental result of the semi-structured web page attribute value extraction method based on prompt learning.
Detailed Description
In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.
As described in the literature "Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018", the rich prior knowledge and semantic expressions of a pre-trained language model can effectively alleviate the lack of task information. The invention therefore introduces a pre-trained language model based on an encoder-decoder structure and adopts the prompt learning paradigm to effectively combine the model with the attribute value extraction task. The method first retrieves the semantic-level DOM tree perspective prompt of each variable node with a DOM tree simplification algorithm, then designs a task template containing a task description to obtain the task-level template perspective prompt, and finally fills and fuses the dual-perspective prompt information through the template, guiding the pre-trained language model to learn the task jointly at the semantic level and the task level. This improves the domain-knowledge retrieval ability and the task-understanding ability of the pre-trained language model so as to cope with page-level few-shot extraction scenarios.
As shown in FIG. 1, the semi-structured webpage attribute value extraction method and system based on prompt learning provided by the invention (hereinafter referred to as EDDVPL) comprises five components: DOM tree perspective prompt construction, template perspective prompt and dual-perspective prompt construction, label word mapping, model training, and model inference. Each component is described in detail below.
(1) DOM tree perspective prompt construction
In semi-structured web pages, the contexts corresponding to different structural relationships provide different clues for the classification of variable nodes, and these characteristics generalize across different semi-structured web pages. The descriptive context of a variable node explicitly and intuitively indicates, at the semantic level, which type of information the node content carries, and therefore has important reference value. Taking Fig. 2 as an example, when a variable node contains "Vicky Jenson", its descriptive context "Director" hints at the definition of that node. The goal of this component is therefore to retrieve a descriptive context for each variable node x as its DOM tree perspective prompt.
The text nodes in a web page DOM tree can be divided into fixed nodes N_f and variable nodes N_v: fixed nodes stay consistent across different pages of the same website, whereas the content of variable nodes changes frequently. Since attribute values typically differ between pages, EDDVPL narrows the nodes to be classified (also referred to as candidate nodes) down to the variable nodes. After the variable nodes are determined, EDDVPL retrieves a DOM tree perspective prompt for each variable node x. A DOM tree is a hierarchical structure made up of nodes that originate from the root node and extend downward layer by layer; for each variable node x, all nodes on the path from the root to x (excluding x itself) are its ancestors. Given the set of text nodes, the DOM tree perspective prompt of x is a fixed node x_DOM ∈ N_f such that the distances from x and x_DOM to their lowest common ancestor do not exceed a constant D; among the fixed nodes satisfying this condition, x_DOM is the one found first when backtracking upward from x, so the prompt contains at most one node.
(2) Template perspective prompt and dual-perspective prompt construction
As described in the literature "Liu P, Yuan W, Fu J, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [J]. 2021", a task template in prompt learning adds a task description (i.e., a task-level template perspective prompt) to the original text, prompting and guiding the model as to what it should do next. With such a task-level prompt, the model can quickly retrieve and fully exploit the task-related knowledge learned during pre-training, which makes solving the task easier.
For any variable node x ∈ N_v in a web page, denote its text content by x_text and its DOM tree perspective prompt by x_DOM. EDDVPL first designs a task template T and then rewrites x_text and x_DOM according to the template, so that the semantic-level DOM tree perspective prompt and the task-level template perspective prompt are fused into a dual-perspective prompt. The dual-perspective prompt input of the model can thus be expressed as x_prompt = T(x_text, x_DOM). The task template designed by EDDVPL embeds the node text and its DOM tree perspective prompt into a natural-language task description and reserves a masked blank position for the attribute type to be predicted. The goal of EDDVPL is that, when x_prompt is fed to the encoder, the decoder predicts the content of the blank position, i.e., the vocabulary text corresponding to the attribute type of x.
(3) Label word mapping
After obtaining the dual-perspective prompt combining the DOM tree perspective and the template perspective, the pre-trained language model needs to predict the text z that is masked at the blank (i.e., mask) position of the task template. A one-to-one label mapping function φ is therefore required between the predefined attribute set Y and the set Z of texts to be predicted:
φ: Y → Z, with φ(y) ∈ Z for every attribute name y ∈ Y,
i.e., the function φ maps any attribute name y in the predefined attribute set to a text φ(y) composed of vocabulary words of the pre-trained language model. EDDVPL implements label word mapping by removing meaningless symbols within attribute names or by adding qualifiers to attribute names.
It is worth noting that, in classification tasks, prompt learning based on an autoencoding pre-trained language model must first fix the label mapping text in the task template as a sequence of mask tokens and then obtain the final result by predicting the token at each mask position. However, when a new sample appears, the length of the text to be predicted is unknown; the model then cannot determine the number of masks and cannot perform classification.
To solve this problem, the literature "Schick T, Schütze H. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference [J]. 2020" and the literature "Han X, Zhao W, Ding N, et al. PTR: Prompt tuning with rules for text classification [J]. AI Open, 2022, 3: 182-192" attempt to fix the text length after each label is mapped, but this leads to the loss or corruption of semantic information. The invention therefore employs a pre-trained language model based on an encoder-decoder structure to avoid this problem: after a mask is placed in the task template as the label placeholder, a model of this structure can learn both the content and the length of the mask position during training, and can thus generate predictions of arbitrary length at the decoder, solving the problem of inconsistent label lengths.
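A minimal sketch of such a label mapping function is shown below; the concrete clean-up rules and qualifiers are illustrative assumptions, the only requirement being that the mapping is one-to-one and yields text made of vocabulary words.

```python
def label_mapping(attribute_name: str) -> str:
    """phi: map a predefined attribute name to label text made of ordinary
    vocabulary words, e.g. by stripping symbols or adding a qualifier."""
    cleaned = attribute_name.replace("_", " ").replace("-", " ").strip().lower()
    qualifiers = {"isbn": "isbn number"}   # illustrative qualifier only
    return qualifiers.get(cleaned, cleaned)

# Label map for the movie vertical of SWDE (attribute set taken from Table 1 below):
LABEL_MAP = {a: label_mapping(a) for a in
             ["title", "director", "genre", "mpaa rating"]}
```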
(4) Model training
Denote the pre-trained language model as M. Given the dual-perspective prompt input x_prompt and the prediction target, i.e. the label mapping text φ(y), and denoting the output sequence of the decoder as o, the matching probability between the decoded output and the label mapping text (i.e., the original label mapping text that was masked) is calculated as:
P(φ(y) | x_prompt) = ∏_t P(o_t = φ_t(y) | o_<t, x_prompt)
where P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, and o_<t denotes the decoded sequence to the left of the t-th word. The training goal is to minimize the negative log-likelihood over all variable nodes, giving the loss function:
L = − Σ_{x ∈ N_v} Σ_t log P(o_t = φ_t(y) | o_<t, x_prompt)
where L denotes the loss function, N_v denotes the set of variable nodes, and the logarithm base is e.
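With a Hugging Face T5 model (T5-base is used in the experiments below), this objective is the ordinary sequence-to-sequence cross-entropy over the label mapping tokens. The sketch below shows one way to compute it and is not taken from the patent; the function name and example strings are illustrative.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def nll_loss(prompts, label_texts):
    """Negative log-likelihood of the label mapping texts given the
    dual-perspective prompts (token-averaged over the batch)."""
    enc = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(label_texts, padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # exclude padding from the loss
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask, labels=labels)
    return out.loss   # = -log P(phi(y) | x_prompt), averaged over label tokens

# loss = nll_loss(['... The attribute type of this text is <extra_id_0>.'],
#                 ['director'])
# loss.backward()   # then step an optimizer such as torch.optim.AdamW
```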
(5) Model inference
When classifying a new sample, the normalized probability output at each position of the decoder is first obtained, and then a score is calculated for each attribute type y ∈ Y:
score_y = (1/|φ(y)|) Σ_{t=1}^{|φ(y)|} P(o_t = φ_t(y))
where score_y denotes the score of attribute type y and P(o_t = φ_t(y)) is the probability of the word φ_t(y) in the normalized probability output of the decoder at step t. To avoid biasing predictions towards attribute types with longer label texts, the method follows the literature "Chen Y, Harbecke D, Hennig L. Multilingual Relation Classification via Efficient and Effective Prompting [J]. arXiv preprint arXiv:2210.13838, 2022" and normalizes the score by the length |φ(y)| of the label mapping text (i.e., the summed probabilities in the above equation are divided by |φ(y)|). Finally, the attribute type with the highest score is selected as the prediction result.
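One straightforward way to realize this scoring rule with an encoder-decoder model is to force-decode each candidate label text and average its per-token probabilities, as in the sketch below (an illustration consistent with the formula above, not the patent's code).

```python
import torch
import torch.nn.functional as F
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

@torch.no_grad()
def predict_attribute(prompt, label_map):
    """Score each attribute type by the length-normalized sum of the decoder's
    probabilities for its label mapping tokens and return the best one."""
    enc = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for attr, label_text in label_map.items():
        lab = tokenizer(label_text, return_tensors="pt").input_ids
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask, labels=lab)
        probs = F.softmax(out.logits, dim=-1)                  # (1, |phi(y)|, vocab)
        token_probs = probs[0, torch.arange(lab.size(1)), lab[0]]
        scores[attr] = token_probs.sum().item() / lab.size(1)  # divide by |phi(y)|
    return max(scores, key=scores.get), scores
```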
This section describes detailed embodiments of the above disclosure. It first introduces the dataset, evaluation metric, comparison methods and implementation details used in the experiments; it then reports comparative experiments on the public SWDE dataset proposed in the literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011"; finally, the effectiveness of each part of the proposed method is verified through ablation experiments.
(1) Introduction to data set
The SWDE (Structured Web Data Extraction) public dataset contains 8 vertical domains, each consisting of 10 websites annotated with 3 to 5 attributes. Detailed statistics of the dataset are shown in Table 1.
Table 1. SWDE dataset statistics
Field    Number of websites    Number of web pages    Attributes
auto 10 17923 model,price,engine,fuel economy
book 10 20000 title,author,isbn,publisher,publish date
camera 10 5258 model,price,manufacturer
job 10 20000 title,company,location,date posted
movie 10 20000 title,director,genre,mpaa rating
nbaplayer 10 4405 name,team,height,weight
restaurant 10 20000 name,address,phone,cuisine
university 10 16705 name,phone,website,type
In the experiments, EDDVPL re-partitions the dataset. For each domain, k pages (k = 10, 50 or 100) are taken from each website in that domain to construct the training set, and the remaining pages form the test set. The training and test set sizes for each domain under different values of k are shown in Table 2.
Table 2. Training and test data for each domain under different values of k
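A tiny helper reflecting this per-website split might look as follows; whether the k training pages are sampled randomly is not stated in the text, so the shuffle is an assumption.

```python
import random

def split_site_pages(pages, k, seed=0):
    """Split one website's pages into k training pages and the remaining test pages."""
    pages = list(pages)
    random.Random(seed).shuffle(pages)   # assumption: random choice of the k pages
    return pages[:k], pages[k:]          # (train, test)
```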
(2) Evaluation metric
Following the evaluation metric of existing studies, the experiments compute the page-level F1 score to evaluate the proposed method. Specifically, for each attribute, precision is the number of pages from which the target attribute value is correctly extracted divided by the number of pages from which that attribute is extracted, and recall is the number of pages from which the target attribute value is correctly extracted divided by the number of pages that contain the target attribute value (see the literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011"). The page-level F1 score is the harmonic mean of precision and recall.
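The metric can be written down directly from these definitions; the sketch below is a plain restatement, with the argument layout chosen for illustration only.

```python
def page_level_f1(predictions, ground_truth):
    """Page-level precision/recall/F1 for one attribute.
    predictions: {page_id: extracted_value or None}
    ground_truth: {page_id: true_value} for pages that contain the attribute."""
    extracted = {p for p, v in predictions.items() if v is not None}
    correct = {p for p in extracted
               if p in ground_truth and predictions[p] == ground_truth[p]}
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```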
(3) Comparison methods
To demonstrate the advantage of the proposed method for attribute value extraction in the page-level few-shot scenario, SimpDOM and DOM2R-Graph are selected as comparison methods. Specifically, SimpDOM uses the underlying DOM tree structure and avoids rendering-based features; it enhances node representations by retrieving the context of DOM tree nodes and capturing discrete features. Because these features are consistent across websites, SimpDOM can be trained on several seed websites and then extract from other unseen websites. DOM2R-Graph (see the literature "Feng J, Cao C, Yuan F, et al. DOM2R-Graph: A Web Attribute Extraction Architecture with Relation-Aware Heterogeneous Graph Transformer [C]// Neural Information Processing: 29th International Conference, ICONIP 2022, Virtual Event, November 22-26, 2022, Proceedings, Part I. Cham: Springer International Publishing, 2023: 468-479") simplifies the web page DOM tree and models it as a heterogeneous graph, and obtains fine-grained node representations by capturing the influence of contextual structural relationships on semantic interactions in the graph, thereby improving extraction. Because text semantics and contextual structural relationships are features that generalize across websites, DOM2R-Graph performs attribute value extraction across websites well.
(4) Implementation details
In the data preprocessing stage, EDDVPL first parses the HTML source code of each web page with the LXML library to obtain the page's DOM tree structure. It then distinguishes fixed nodes from variable nodes using a heuristic algorithm based on the literature "Lin B Y, Sheng Y, Vo N, et al. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [J]. ACM, 2020". The maximum distance D from a variable node and its DOM tree perspective prompt to their lowest common ancestor is set to 2.
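The sketch below shows one common way to realize these two preprocessing steps with lxml: parse each page into a DOM tree and treat a text node (keyed by its XPath) whose text is identical across the site's pages as fixed, otherwise as variable. This is an assumed heuristic consistent with the description, not necessarily the exact rule of the cited FreeDOM work.

```python
from collections import defaultdict
from lxml import html

def split_fixed_variable(pages_html):
    """Heuristically split one website's text nodes (keyed by XPath) into
    fixed and variable nodes by comparing their texts across pages."""
    texts_by_xpath = defaultdict(set)
    for page in pages_html:
        tree = html.fromstring(page)
        root = tree.getroottree()
        for el in tree.iter():
            if isinstance(el.tag, str) and el.text and el.text.strip():
                texts_by_xpath[root.getpath(el)].add(el.text.strip())
    fixed = {xp for xp, vals in texts_by_xpath.items() if len(vals) == 1}
    variable = set(texts_by_xpath) - fixed
    return fixed, variable
```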
In the model training stage, EDDVPL uses T5-base provided by the Hugging Face Transformers library as the pre-trained language model; the training batch size is set to 16, the learning rate to 0.0002, and the number of training epochs to 20.
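Expressed with the Hugging Face Trainer API, this configuration corresponds roughly to the sketch below; the dataset contents and output directory are placeholders, and the real training data would be the tokenized dual-perspective prompts and label mapping texts.

```python
from datasets import Dataset
from transformers import (T5TokenizerFast, T5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Placeholder training data: dual-perspective prompts and label mapping texts.
raw = Dataset.from_dict({
    "prompt": ['The field "Director" describes the text "Vicky Jenson". '
               'The attribute type of this text is <extra_id_0>.'],
    "label": ["director"],
})

def tokenize(batch):
    features = tokenizer(batch["prompt"], truncation=True)
    features["labels"] = tokenizer(batch["label"]).input_ids
    return features

train_dataset = raw.map(tokenize, batched=True, remove_columns=["prompt", "label"])

args = Seq2SeqTrainingArguments(
    output_dir="eddvpl-t5-base",        # hypothetical output directory
    per_device_train_batch_size=16,     # batch size 16
    learning_rate=2e-4,                 # learning rate 0.0002
    num_train_epochs=20,                # 20 training epochs
    save_strategy="no",
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```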
(5) Comparative experimental results
To fully verify the effectiveness of EDDVPL, this section conducts experiments on the SWDE dataset. The results of EDDVPL and the comparison methods on training sets of different sizes are shown in Table 3, where k denotes the number of web pages used for training from each website.
Table 3. Experimental results of EDDVPL and the comparison methods on the SWDE dataset
From the results in the table, it can be seen that when k = 10, SimpDOM performs better than DOM2R-Graph: compared with DOM2R-Graph, which focuses only on complex features such as semantics and structure, SimpDOM is rich in discrete features that provide a certain basis for inference when labeled data is extremely scarce. Under this data setting, EDDVPL significantly outperforms all of the above methods. This is because SimpDOM and DOM2R-Graph rely on task data to learn complex semantic knowledge or web page structures that are consistent across websites, and very little training data does not give them sufficient opportunity to learn. For EDDVPL, the following two aspects determine its superior performance:
1) The pre-trained language model contains more comprehensive semantic expressions and rich prior knowledge, which provides a good basis for the model to infer node attribute types with little training data.
2) EDDVPL quickly guides the pre-trained language model to understand what the task requires by constructing a task template, and introduces the DOM tree perspective prompt to help the model, at the semantic level, identify and activate relevant domain knowledge, thereby effectively combining the pre-trained language model with the task objective and the domain data. Even with only a few training web pages, the model can therefore quickly make full use of the limited data.
As k increases, the existing methods can further learn web page features that are more specific to the domain and the task, so their generalization ability improves and the gap with EDDVPL gradually narrows. Nevertheless, relying on its stronger task understanding and the informative DOM tree prompts, EDDVPL still achieves equal or better results across the domains.
(6) Ablation experiment results
To verify the effectiveness of each part of the EDDVPL design, this section conducts ablation experiments on the public SWDE dataset. Specifically, two variant models are designed to demonstrate the contribution of the prompts from the two perspectives; they are described as follows:
1) To demonstrate the important role of the semantic-level DOM tree perspective prompt, this variant removes the DOM tree perspective prompt and performs template filling using only the node's own text (referred to as Template-view).
2) To demonstrate the effectiveness of the task-level template perspective prompt, this variant removes the task template and feeds the node's own text and the DOM tree perspective prompt to the model as a concatenated sequence (referred to as DOM-view).
When 10 web pages from each website are used as the training set (i.e., k = 10), the ablation results for each domain are shown in Fig. 3. As can be seen from Fig. 3, both Template-view and DOM-view degrade relative to the full model. When the DOM tree perspective prompt is removed, the model understands the task objective but cannot quickly retrieve the relevant knowledge because little domain text information is available; when the task-level template perspective prompt is removed, the model possesses only the relevant semantic knowledge while the task requirement remains vague, making rapid convergence difficult. These results show that the prompts from both perspectives provide positive guidance to the model, and also indicate that a pre-trained language model can make maximal use of its existing knowledge and of limited data only when it is tightly combined with the task and the domain.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, and that modifications and equivalents may be made thereto by those skilled in the art, which modifications and equivalents are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (9)

1. A semi-structured webpage attribute value extraction method based on prompt learning, characterized by comprising the following steps:
for a semi-structured web page, the text nodes in the web page DOM tree comprise fixed nodes and variable nodes, and for each variable node a semantic-level DOM tree perspective prompt is retrieved;
designing a task template that adds a task description to the text content of each variable node, so as to obtain a task-level template perspective prompt;
rewriting the text content of each variable node and its DOM tree perspective prompt through the task template, and masking the label mapping text in the task template, so as to fuse the DOM tree perspective prompt and the template perspective prompt into a dual-perspective prompt;
using a pre-trained language model based on an encoder-decoder structure, inputting the dual-perspective prompt to the encoder and predicting the text at the mask position with the decoder, wherein the predicted text consists of vocabulary words of the pre-trained language model, and determining the attribute type according to the mapping relation between texts and predefined attributes;
training the pre-trained language model by calculating the matching probability between the text output by the decoder and the label mapping text, computing a loss function from the matching probability, and optimizing the loss function;
and, for a semi-structured web page to be processed, predicting the attribute types of its variable nodes with the trained pre-trained language model: obtaining the normalized probability output at each position of the decoder, calculating a score for each attribute type from these probabilities, and taking the attribute type with the highest score as the prediction result.
2. The method of claim 1, wherein the step of retrieving the semantic-level DOM tree perspective prompt of each variable node comprises:
firstly, for every node, finding its ancestor nodes according to its XPath and putting the node into the set corresponding to each of its ancestors;
then, for a variable node, backtracking upward from its nearest ancestor: if the set corresponding to the current ancestor contains a fixed node whose distance to the variable node meets the requirement, stopping and taking that fixed node as the semantic prompt of the DOM tree perspective; otherwise continuing to the next higher ancestor and repeating the process.
3. The method of claim 1, wherein, when prediction is performed with the pre-trained language model based on the encoder-decoder structure, a mask is placed in advance at the blank position of the task template as a label placeholder, and the pre-trained language model then learns the content information and the length information of the mask position through training.
4. The method of claim 1, wherein the matching probability between the text output by the decoder and the label mapping text is calculated as:
P(φ(y) | T(x_text, x_DOM)) = ∏_t P(o_t = φ_t(y) | o_<t, T(x_text, x_DOM))
wherein P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, y denotes the attribute type, o_<t denotes the decoded sequence to the left of the t-th word, T denotes the task template, x_text denotes the text content of variable node x, and x_DOM denotes the DOM tree perspective prompt of variable node x.
5. The method of claim 1, wherein the loss function is a log-likelihood loss function, and optimizing the loss function means minimizing the negative log-likelihood over all variable nodes.
6. The method of claim 5, wherein the loss function is calculated as:
L = − Σ_{x ∈ N_v} Σ_t log P(o_t = φ_t(y) | o_<t, T(x_text, x_DOM))
wherein L denotes the loss function, P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, y denotes the attribute type, o_<t denotes the decoded sequence to the left of the t-th word, the logarithm base is e, T denotes the task template, x_text denotes the text content of variable node x, x_DOM denotes the DOM tree perspective prompt of variable node x, and N_v denotes the set of variable nodes.
7. The method of claim 1, wherein, when calculating the scores, each predicted attribute type name is converted into its label mapping text and the score is normalized by the length of the label mapping text.
8. The method of claim 7, wherein the score is calculated as:
score_y = (1/|φ(y)|) Σ_{t=1}^{|φ(y)|} P(o_t = φ_t(y))
wherein score_y denotes the score of attribute type y, P(o_t = φ_t(y)) denotes the probability of the word φ_t(y) in the normalized probability output of the decoder at step t, and o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively.
9. A semi-structured webpage attribute value extraction system based on prompt learning, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the program, implements the steps of the method of any one of claims 1-8.
CN202310462355.1A 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning Pending CN116628303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310462355.1A CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310462355.1A CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Publications (1)

Publication Number Publication Date
CN116628303A true CN116628303A (en) 2023-08-22

Family

ID=87640792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310462355.1A Pending CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Country Status (1)

Country Link
CN (1) CN116628303A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994098A (en) * 2023-09-27 2023-11-03 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN116994098B (en) * 2023-09-27 2023-12-05 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement

Similar Documents

Publication Publication Date Title
US11520812B2 (en) Method, apparatus, device and medium for determining text relevance
Taheriyan et al. Learning the semantics of structured data sources
US7676465B2 (en) Techniques for clustering structurally similar web pages based on page features
US7680858B2 (en) Techniques for clustering structurally similar web pages
Liao et al. Unsupervised approaches for textual semantic annotation, a survey
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN105404674A (en) Knowledge-dependent webpage information extraction method
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN116628303A (en) Semi-structured webpage attribute value extraction method and system based on prompt learning
Wei et al. Online education recommendation model based on user behavior data analysis
Huang et al. Design and implementation of oil and gas information on intelligent search engine based on knowledge graph
Jannach et al. Automated ontology instantiation from tabular web sources—the AllRight system
Ye et al. Learning object models from semistructured web documents
CN115982390B (en) Industrial chain construction and iterative expansion development method
Swe Intelligent information retrieval within digital library using domain ontology
Sabri et al. WEIDJ: Development of a new algorithm for semi-structured web data extraction
Angrosh et al. Ontology-based modelling of related work sections in research articles: Using crfs for developing semantic data based information retrieval systems
Sijin et al. Fuzzy conceptualization of the search queries
Carme et al. The lixto project: Exploring new frontiers of web data extraction
Swe Concept Based Intelligent Information Retrieval within Digital Library
Liu et al. Research on adaptive wrapper in deep web data extraction
Wang et al. PAREI: A progressive approach for Web API recommendation by combining explicit and implicit information
Shin et al. Deep-learning-based image tagging for semantic image annotation
Li et al. Multi-strategies Integrated Information Extraction for Scholar Profiling Task
Jia et al. Leveraging Large Language Models for Semantic Query Processing in a Scholarly Knowledge Graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination