CN116628303A - Semi-structured webpage attribute value extraction method and system based on prompt learning - Google Patents

Semi-structured webpage attribute value extraction method and system based on prompt learning

Info

Publication number
CN116628303A
Authority
CN
China
Prior art keywords
text
task
node
prompt
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310462355.1A
Other languages
Chinese (zh)
Inventor
曹聪
冯佳丽
曹亚男
袁方方
李保珂
卢毓海
刘燕兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202310462355.1A
Publication of CN116628303A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semi-structured webpage attribute value extraction method and system based on prompt learning, relating to the field of the Internet. The method first retrieves a DOM tree perspective prompt for each variable node using a DOM tree simplification algorithm, then designs a task template containing a task description to obtain a template perspective prompt, and finally introduces a pre-trained language model based on an encoder-decoder structure. Taking "prompting" as the core operation, the method comprehensively analyzes the characteristics of the domain data and of the target task, designs prompt information from two perspectives, and fills and fuses the dual-perspective prompts through the template. Prompt learning thereby guides the pre-trained language model to learn the task jointly at the semantic level and the task level, achieving an effective combination of the pre-trained language model and the attribute value extraction task and excellent model performance in scenarios where domain-labeled data is scarce.

Description

Semi-structured webpage attribute value extraction method and system based on prompt learning
Technical Field
The invention relates to the field of the Internet, and in particular to a semi-structured webpage attribute value extraction method and system based on prompt learning.
Background
There are a huge number of semi-structured web pages on the Internet describing entities. The contents of these pages are often carefully edited and audited and contain a great deal of high-quality entity attribute information. As valuable production data, this attribute information is widely used in scientific research and real business scenarios and helps improve downstream task performance. However, because semi-structured web pages exist at huge scale and their content is complex and layouts variable, extracting structured attribute information from them is a very challenging task. At present, methods for extracting attribute values from semi-structured web pages fall mainly into the following three categories:
firstly, wrapper-based semi-structured webpage attribute value extraction methods, i.e., methods that construct extraction rules or templates based on shallow regularities of the target data in web pages. For each website, the pages on the site typically share the same or a similar layout structure, a characteristic that gives the data in the pages a certain regularity of presentation. Target data can therefore be extracted by formulating rules according to the discovered regularities. The document "Hammer J, McHugh J, Garcia-Molina H. Semistructured data: the TSIMMIS experience. British Computer Society, 1997" first discovers regularities in features such as the HTML tags and XPath paths of the target data by analyzing the web page source code, and then designs a series of sequentially executed instructions according to these regularities, thereby accurately extracting the target data. The document "Li Ping, Zhu Jianbo, Zhou Lixin, et al. An online-shopping information extraction method based on rapid template construction [J]. Computer Applications, 2014, 34(3): 733-737" designs a set of search/locating and operation languages for rapidly constructing web page information extraction templates and determining extraction rules. During extraction, the method first matches the URL of a web page to the corresponding information extraction template, then parses the page content with the template, extracts the target data fields, and forms structured data.
Secondly, semi-structured webpage attribute value extraction methods based on visual features, which start from the characteristics of web page layout and consider visual features such as the size, proportion and relative layout of text blocks in a page. To help readers quickly understand and locate page content, web pages usually divide and arrange information in a visual structure, so analyzing and understanding page content from the perspective of visual features is an important approach. The literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011" proposes a method based on rendering features that uses visual features such as the size and position of node text blocks in the DOM tree to measure the distance between web page blocks, which is then used to extract attribute values from the page. The literature "Kumar A, Morabia K, Wang J, et al. CoVA: Context-aware Visual Attention for Webpage Information Extraction [J]. arXiv e-prints, 2021" proposes CoVA, a neural network model that encodes visual features and comprises a representation network (CNNs, pooling layers and positional encoders) and a graph attention network. The model first uses the representation network to learn visual features such as the size and spatial position of each node's content block in the web page screenshot, and then applies the graph attention network to propagate and aggregate the visual features of the nodes, thereby enhancing the visual representation of each node.
Thirdly, semi-structured webpage attribute value extraction methods based on DOM tree features, which mainly focus on features of the underlying web page DOM tree, such as node XPath, node HTML tags, node text content and node associations. Since the rendered appearance of a page is closely tied to how its HTML source is written, and the web page DOM tree is used to parse the HTML document, using features of the underlying DOM tree to locate or understand page content is an important approach. The literature "Lin B Y, Sheng Y, Vo N, et al. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [J]. ACM, 2020" builds a two-stage model, FreeDOM, to infer the attribute types of DOM tree nodes. In the first stage, the model mainly extracts local features of each node for node classification, including the node's text features, preceding semantic features, HTML tag, content format and other discrete features. In the second stage, the model learns dependencies between nodes by constructing node pairs, thereby correcting the first-stage classification results. SimpDOM, proposed in the literature "Zhou Y, Sheng Y, Vo N, et al. Simplified DOM Trees for Transferable Attribute Extraction from the Web [J]. 2021", simplifies FreeDOM's two-stage design into a single integrated attribute value extraction model: after retrieving the context of the current node from the DOM tree, it serializes and concatenates the text features of the node and its context nodes to obtain an enhanced representation of the node's semantics. The method also introduces discrete features such as node XPath and node HTML tags to further improve generalization. This enables SimpDOM to be trained on seed websites and to extract from new websites.
Early wrapper-based methods require only a small number of labeled web pages per website, but building and periodically updating templates for every website incurs substantial labor and time costs and limits their practicality. Feature-learning-based methods do not depend on templates; instead they use a few seed websites to generalize the model to other websites in the same domain. However, each seed website still requires thousands of annotated pages, so a large amount of page-level data collection and annotation remains necessary; otherwise the scarcity of annotation data limits the learning ability of the model.
Disclosure of Invention
The invention aims to provide a semi-structured webpage attribute value extraction method and system based on prompt learning, which address the problem that scarce domain annotation data limits the learning capacity of existing models, extract attribute information from semi-structured web pages, and generalize across websites.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a semi-structured webpage attribute value extraction method based on prompt learning comprises the following steps:
for a semi-structured web page, the text nodes in the web page DOM tree comprise fixed nodes and variable nodes, and for each variable node a semantic-level DOM tree perspective prompt is retrieved;
designing a task template that adds a task description to the text content of each variable node, so as to obtain a task-level template perspective prompt;
rewriting the text content of each variable node and its DOM tree perspective prompt through the task template, and masking the label mapping text in the task template, so as to fuse the DOM tree perspective prompt and the template perspective prompt into a dual-perspective prompt;
using a pre-trained language model based on an encoder-decoder structure, inputting the dual-perspective prompt to the encoder and predicting the text at the mask position with the decoder, wherein the predicted text consists of vocabulary words of the pre-trained language model, and determining the attribute type according to the mapping relation between texts and predefined attributes;
training the pre-trained language model by calculating the matching probability between the text output by the decoder and the label mapping text, computing a loss function from the matching probability, and optimizing the loss function;
and, for a semi-structured web page to be processed, predicting the attribute types of its variable nodes with the trained pre-trained language model: obtaining the normalized probability output at each position of the decoder, calculating a score for each attribute type from these probabilities, and taking the attribute type with the highest score as the prediction result.
Further, for each variable node, the step of retrieving its semantic-level DOM tree perspective prompt comprises:
firstly, for every node, finding its ancestor nodes according to its XPath and putting the node into the set corresponding to each of its ancestors;
then, for a variable node, backtracking upward from its nearest ancestor: if the set corresponding to the current ancestor contains a fixed node whose distance to the variable node meets the requirement, stopping and taking that fixed node as the semantic prompt of the DOM tree perspective; otherwise continuing to the next higher ancestor and repeating the process.
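For concreteness, the following Python sketch illustrates this retrieval procedure on an lxml-parsed DOM tree. It is a minimal illustration rather than the patent's implementation: the function name, the use of lxml, the choice of the first fixed node under a shared ancestor, and the simplified distance criterion (only the variable node's distance to the candidate ancestor is bounded) are assumptions.

```python
from collections import defaultdict
from lxml import html  # pip install lxml

def find_dom_prompt(var_node, fixed_nodes, max_dist=2):
    """Return the text of a nearby fixed node to serve as the DOM tree
    perspective prompt of `var_node`, or "" if none is found within range."""
    # Step 1: index every fixed node under each of its ancestors.
    under = defaultdict(list)
    for f in fixed_nodes:
        for anc in f.iterancestors():
            under[anc].append(f)

    # Step 2: backtrack from the nearest ancestor of the variable node upward.
    for dist, anc in enumerate(var_node.iterancestors(), start=1):
        if dist > max_dist:    # the shared ancestor would be too far from var_node
            break
        if under.get(anc):     # a fixed node exists under this ancestor
            # Simplification: take the first candidate found under this ancestor.
            return under[anc][0].text_content().strip()
    return ""

# Illustrative usage (fixed/variable node lists assumed to be identified already):
# tree = html.fromstring(open("page.html").read())
# prompt = find_dom_prompt(variable_node, fixed_node_list, max_dist=2)
```

Consistent with the implementation details given later, `max_dist` plays the role of the constant D and defaults to 2.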
Further, when prediction is performed with the pre-trained language model based on the encoder-decoder structure, a mask is placed in advance at the blank position of the task template as a label placeholder, and the pre-trained language model then learns the content information and the length information of the mask position through training.
Further, the loss function is a log-likelihood loss function, and optimizing the loss function means minimizing the negative log-likelihood over all variable nodes.
Further, when calculating the scores, each predicted attribute type name is converted into its label mapping text, and the score is normalized by the length of the label mapping text.
A semi-structured webpage attribute value extraction system based on prompt learning comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above method when executing the program.
The invention has the following advantages:
the invention provides a semi-structured webpage attribute value extraction method based on prompt learning, which firstly searches DOM tree visual angle prompts of variable nodes according to a DOM tree simplification algorithm, and can improve the domain knowledge retrieval capability of a pre-training language model. And then designing a task template containing task description to obtain template visual angle prompt information, so as to excite task understanding capability of the pre-training language model. Finally, as the pre-training language model is rich in priori knowledge and semantic expression, the problem of task information deficiency can be effectively relieved, the pre-training language model based on an encoder-decoder structure is introduced, the prompting is used as a core operation, the field data characteristics and the target task characteristics are comprehensively analyzed, the prompting information of two visual angles is designed, the dual-visual-angle prompting information is filled and fused through a template, the pre-training language model is guided to perform task learning in a combined manner of prompting learning at a semantic level and a task level, the effective combination of the pre-training language model and an attribute value extraction task is realized, and the excellent model performance under the field labeling data scarcity scene is realized.
Drawings
Fig. 1 is a framework diagram of a semi-structured web page attribute value extraction method based on prompt learning.
FIG. 2 is a general flow chart of semi-structured web page attribute value extraction.
Fig. 3 is a graph of the ablation experimental result of the semi-structured web page attribute value extraction method based on prompt learning.
Detailed Description
In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.
As described in the literature "Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018", the rich prior knowledge and semantic expressions of a pre-trained language model can effectively alleviate the lack of task information. The invention therefore introduces a pre-trained language model based on an encoder-decoder structure and adopts the prompt learning paradigm to effectively combine the model with the attribute value extraction task. The method first retrieves the semantic-level DOM tree perspective prompt of each variable node with a DOM tree simplification algorithm, then designs a task template containing a task description to obtain the task-level template perspective prompt, and finally fills and fuses the dual-perspective prompt information through the template, guiding the pre-trained language model to learn the task jointly at the semantic level and the task level. This improves the domain-knowledge retrieval ability and the task-understanding ability of the pre-trained language model so as to cope with page-level few-shot extraction scenarios.
As shown in FIG. 1, the semi-structured webpage attribute value extraction method and system based on prompt learning provided by the invention (hereinafter referred to as EDDVPL) comprises five components: DOM tree perspective prompt construction, template perspective prompt and dual-perspective prompt construction, label word mapping, model training, and model inference. Each component is described in detail below.
(1) DOM tree perspective prompt construction
In semi-structured web pages, the contexts corresponding to different structural relationships provide different clues for the classification of variable nodes, and these characteristics generalize across different semi-structured web pages. The descriptive context of a variable node explicitly and intuitively indicates, at the semantic level, which type of information the node content carries, and therefore has important reference value. Taking Fig. 2 as an example, when a variable node contains "Vicky Jenson", its descriptive context "Director" hints at the definition of that node. The goal of this component is therefore to retrieve a descriptive context for each variable node x as its DOM tree perspective prompt.
The text nodes in a web page DOM tree can be divided into fixed nodes N_f and variable nodes N_v: fixed nodes stay consistent across different pages of the same website, whereas the content of variable nodes changes frequently. Since attribute values typically differ between pages, EDDVPL narrows the nodes to be classified (also referred to as candidate nodes) down to the variable nodes. After the variable nodes are determined, EDDVPL retrieves a DOM tree perspective prompt for each variable node x. A DOM tree is a hierarchical structure made up of nodes that originate from the root node and extend downward layer by layer; for each variable node x, all nodes on the path from the root to x (excluding x itself) are its ancestors. Given the set of text nodes, the DOM tree perspective prompt of x is a fixed node x_DOM ∈ N_f such that the distances from x and x_DOM to their lowest common ancestor do not exceed a constant D; among the fixed nodes satisfying this condition, x_DOM is the one found first when backtracking upward from x, so the prompt contains at most one node.
(2) Template perspective prompt and dual-perspective prompt construction
As described in the literature "Liu P, Yuan W, Fu J, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [J]. 2021", a task template in prompt learning adds a task description (i.e., a task-level template perspective prompt) to the original text, prompting and guiding the model as to what it should do next. With such a task-level prompt, the model can quickly retrieve and fully exploit the task-related knowledge learned during pre-training, which makes solving the task easier.
For any variable node x ∈ N_v in a web page, denote its text content by x_text and its DOM tree perspective prompt by x_DOM. EDDVPL first designs a task template T and then rewrites x_text and x_DOM according to the template, so that the semantic-level DOM tree perspective prompt and the task-level template perspective prompt are fused into a dual-perspective prompt. The dual-perspective prompt input of the model can thus be expressed as x_prompt = T(x_text, x_DOM). The task template designed by EDDVPL embeds the node text and its DOM tree perspective prompt into a natural-language task description and reserves a masked blank position for the attribute type to be predicted. The goal of EDDVPL is that, when x_prompt is fed to the encoder, the decoder predicts the content of the blank position, i.e., the vocabulary text corresponding to the attribute type of x.
(3) Label word mapping
After obtaining the dual-perspective prompt combining the DOM tree perspective and the template perspective, the pre-trained language model needs to predict the text z that is masked at the blank (i.e., mask) position of the task template. A one-to-one label mapping function φ is therefore required between the predefined attribute set Y and the set Z of texts to be predicted:
φ: Y → Z, with φ(y) ∈ Z for every attribute name y ∈ Y,
i.e., the function φ maps any attribute name y in the predefined attribute set to a text φ(y) composed of vocabulary words of the pre-trained language model. EDDVPL implements label word mapping by removing meaningless symbols within attribute names or by adding qualifiers to attribute names.
It is worth noting that, in classification tasks, prompt learning based on an autoencoding pre-trained language model must first fix the label mapping text in the task template as a sequence of mask tokens and then obtain the final result by predicting the token at each mask position. However, when a new sample appears, the length of the text to be predicted is unknown; the model then cannot determine the number of masks and cannot perform classification.
To solve this problem, the literature "Schick T, Schütze H. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference [J]. 2020" and the literature "Han X, Zhao W, Ding N, et al. PTR: Prompt tuning with rules for text classification [J]. AI Open, 2022, 3: 182-192" attempt to fix the text length after each label is mapped, but this leads to the loss or corruption of semantic information. The invention therefore employs a pre-trained language model based on an encoder-decoder structure to avoid this problem: after a mask is placed in the task template as the label placeholder, a model of this structure can learn both the content and the length of the mask position during training, and can thus generate predictions of arbitrary length at the decoder, solving the problem of inconsistent label lengths.
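A minimal sketch of such a label mapping function is shown below; the concrete clean-up rules and qualifiers are illustrative assumptions, the only requirement being that the mapping is one-to-one and yields text made of vocabulary words.

```python
def label_mapping(attribute_name: str) -> str:
    """phi: map a predefined attribute name to label text made of ordinary
    vocabulary words, e.g. by stripping symbols or adding a qualifier."""
    cleaned = attribute_name.replace("_", " ").replace("-", " ").strip().lower()
    qualifiers = {"isbn": "isbn number"}   # illustrative qualifier only
    return qualifiers.get(cleaned, cleaned)

# Label map for the movie vertical of SWDE (attribute set taken from Table 1 below):
LABEL_MAP = {a: label_mapping(a) for a in
             ["title", "director", "genre", "mpaa rating"]}
```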
(4) Model training
Denote the pre-trained language model as M. Given the dual-perspective prompt input x_prompt and the prediction target, i.e. the label mapping text φ(y), and denoting the output sequence of the decoder as o, the matching probability between the decoded output and the label mapping text (i.e., the original label mapping text that was masked) is calculated as:
P(φ(y) | x_prompt) = ∏_t P(o_t = φ_t(y) | o_<t, x_prompt)
where P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, and o_<t denotes the decoded sequence to the left of the t-th word. The training goal is to minimize the negative log-likelihood over all variable nodes, giving the loss function:
L = − Σ_{x ∈ N_v} Σ_t log P(o_t = φ_t(y) | o_<t, x_prompt)
where L denotes the loss function, N_v denotes the set of variable nodes, and the logarithm base is e.
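With a Hugging Face T5 model (T5-base is used in the experiments below), this objective is the ordinary sequence-to-sequence cross-entropy over the label mapping tokens. The sketch below shows one way to compute it and is not taken from the patent; the function name and example strings are illustrative.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def nll_loss(prompts, label_texts):
    """Negative log-likelihood of the label mapping texts given the
    dual-perspective prompts (token-averaged over the batch)."""
    enc = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(label_texts, padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # exclude padding from the loss
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask, labels=labels)
    return out.loss   # = -log P(phi(y) | x_prompt), averaged over label tokens

# loss = nll_loss(['... The attribute type of this text is <extra_id_0>.'],
#                 ['director'])
# loss.backward()   # then step an optimizer such as torch.optim.AdamW
```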
(5) Model inference
When classifying a new sample, the normalized probability output at each position of the decoder is first obtained, and then a score is calculated for each attribute type y ∈ Y:
score_y = (1/|φ(y)|) Σ_{t=1}^{|φ(y)|} P(o_t = φ_t(y))
where score_y denotes the score of attribute type y and P(o_t = φ_t(y)) is the probability of the word φ_t(y) in the normalized probability output of the decoder at step t. To avoid biasing predictions towards attribute types with longer label texts, the method follows the literature "Chen Y, Harbecke D, Hennig L. Multilingual Relation Classification via Efficient and Effective Prompting [J]. arXiv preprint arXiv:2210.13838, 2022" and normalizes the score by the length |φ(y)| of the label mapping text (i.e., the summed probabilities in the above equation are divided by |φ(y)|). Finally, the attribute type with the highest score is selected as the prediction result.
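One straightforward way to realize this scoring rule with an encoder-decoder model is to force-decode each candidate label text and average its per-token probabilities, as in the sketch below (an illustration consistent with the formula above, not the patent's code).

```python
import torch
import torch.nn.functional as F
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

@torch.no_grad()
def predict_attribute(prompt, label_map):
    """Score each attribute type by the length-normalized sum of the decoder's
    probabilities for its label mapping tokens and return the best one."""
    enc = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for attr, label_text in label_map.items():
        lab = tokenizer(label_text, return_tensors="pt").input_ids
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask, labels=lab)
        probs = F.softmax(out.logits, dim=-1)                  # (1, |phi(y)|, vocab)
        token_probs = probs[0, torch.arange(lab.size(1)), lab[0]]
        scores[attr] = token_probs.sum().item() / lab.size(1)  # divide by |phi(y)|
    return max(scores, key=scores.get), scores
```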
This section describes detailed embodiments of the above disclosure. It first introduces the dataset, evaluation metric, comparison methods and implementation details used in the experiments; it then reports comparative experiments on the public SWDE dataset proposed in the literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011"; finally, the effectiveness of each part of the proposed method is verified through ablation experiments.
(1) Introduction to data set
The SWDE (Structured Web Data Extraction) public dataset contains 8 vertical domains, each consisting of 10 websites annotated with 3 to 5 attributes. Detailed statistics of the dataset are shown in Table 1.
Table 1. SWDE dataset statistics
Field    Number of websites    Number of web pages    Attributes
auto 10 17923 model,price,engine,fuel economy
book 10 20000 title,author,isbn,publisher,publish date
camera 10 5258 model,price,manufacturer
job 10 20000 title,company,location,date posted
movie 10 20000 title,director,genre,mpaa rating
nbaplayer 10 4405 name,team,height,weight
restaurant 10 20000 name,address,phone,cuisine
university 10 16705 name,phone,website,type
In the experiments, EDDVPL re-partitions the dataset. For each domain, k pages (k = 10, 50 or 100) are taken from each website in that domain to construct the training set, and the remaining pages form the test set. The training and test set sizes for each domain under different values of k are shown in Table 2.
Table 2. Training and test data for each domain under different values of k
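A tiny helper reflecting this per-website split might look as follows; whether the k training pages are sampled randomly is not stated in the text, so the shuffle is an assumption.

```python
import random

def split_site_pages(pages, k, seed=0):
    """Split one website's pages into k training pages and the remaining test pages."""
    pages = list(pages)
    random.Random(seed).shuffle(pages)   # assumption: random choice of the k pages
    return pages[:k], pages[k:]          # (train, test)
```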
(2) Evaluation metric
Following the evaluation metric of existing studies, the experiments compute the page-level F1 score to evaluate the proposed method. Specifically, for each attribute, precision is the number of pages from which the target attribute value is correctly extracted divided by the number of pages from which that attribute is extracted, and recall is the number of pages from which the target attribute value is correctly extracted divided by the number of pages that contain the target attribute value (see the literature "Hao Q, Cai R, Pang Y, et al. From one tree to a forest: a unified solution for structured web data extraction [C]// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2011"). The page-level F1 score is the harmonic mean of precision and recall.
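The metric can be written down directly from these definitions; the sketch below is a plain restatement, with the argument layout chosen for illustration only.

```python
def page_level_f1(predictions, ground_truth):
    """Page-level precision/recall/F1 for one attribute.
    predictions: {page_id: extracted_value or None}
    ground_truth: {page_id: true_value} for pages that contain the attribute."""
    extracted = {p for p, v in predictions.items() if v is not None}
    correct = {p for p in extracted
               if p in ground_truth and predictions[p] == ground_truth[p]}
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```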
(3) Comparison methods
To demonstrate the advantage of the proposed method for attribute value extraction in the page-level few-shot scenario, SimpDOM and DOM2R-Graph are selected as comparison methods. Specifically, SimpDOM uses the underlying DOM tree structure and avoids rendering-based features; it enhances node representations by retrieving the context of DOM tree nodes and capturing discrete features. Because these features are consistent across websites, SimpDOM can be trained on several seed websites and then extract from other unseen websites. DOM2R-Graph (see the literature "Feng J, Cao C, Yuan F, et al. DOM2R-Graph: A Web Attribute Extraction Architecture with Relation-Aware Heterogeneous Graph Transformer [C]// Neural Information Processing: 29th International Conference, ICONIP 2022, Virtual Event, November 22-26, 2022, Proceedings, Part I. Cham: Springer International Publishing, 2023: 468-479") simplifies the web page DOM tree and models it as a heterogeneous graph, and obtains fine-grained node representations by capturing the influence of contextual structural relationships on semantic interactions in the graph, thereby improving extraction. Because text semantics and contextual structural relationships are features that generalize across websites, DOM2R-Graph performs attribute value extraction across websites well.
(4) Implementation details
In the data preprocessing stage, EDDVPL first parses the HTML source code of each web page with the LXML library to obtain the page's DOM tree structure. It then distinguishes fixed nodes from variable nodes using a heuristic algorithm based on the literature "Lin B Y, Sheng Y, Vo N, et al. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [J]. ACM, 2020". The maximum distance D from a variable node and its DOM tree perspective prompt to their lowest common ancestor is set to 2.
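The sketch below shows one common way to realize these two preprocessing steps with lxml: parse each page into a DOM tree and treat a text node (keyed by its XPath) whose text is identical across the site's pages as fixed, otherwise as variable. This is an assumed heuristic consistent with the description, not necessarily the exact rule of the cited FreeDOM work.

```python
from collections import defaultdict
from lxml import html

def split_fixed_variable(pages_html):
    """Heuristically split one website's text nodes (keyed by XPath) into
    fixed and variable nodes by comparing their texts across pages."""
    texts_by_xpath = defaultdict(set)
    for page in pages_html:
        tree = html.fromstring(page)
        root = tree.getroottree()
        for el in tree.iter():
            if isinstance(el.tag, str) and el.text and el.text.strip():
                texts_by_xpath[root.getpath(el)].add(el.text.strip())
    fixed = {xp for xp, vals in texts_by_xpath.items() if len(vals) == 1}
    variable = set(texts_by_xpath) - fixed
    return fixed, variable
```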
In the model training stage, EDDVPL uses T5-base provided by the Hugging Face Transformers library as the pre-trained language model; the training batch size is set to 16, the learning rate to 0.0002, and the number of training epochs to 20.
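Expressed with the Hugging Face Trainer API, this configuration corresponds roughly to the sketch below; the dataset contents and output directory are placeholders, and the real training data would be the tokenized dual-perspective prompts and label mapping texts.

```python
from datasets import Dataset
from transformers import (T5TokenizerFast, T5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Placeholder training data: dual-perspective prompts and label mapping texts.
raw = Dataset.from_dict({
    "prompt": ['The field "Director" describes the text "Vicky Jenson". '
               'The attribute type of this text is <extra_id_0>.'],
    "label": ["director"],
})

def tokenize(batch):
    features = tokenizer(batch["prompt"], truncation=True)
    features["labels"] = tokenizer(batch["label"]).input_ids
    return features

train_dataset = raw.map(tokenize, batched=True, remove_columns=["prompt", "label"])

args = Seq2SeqTrainingArguments(
    output_dir="eddvpl-t5-base",        # hypothetical output directory
    per_device_train_batch_size=16,     # batch size 16
    learning_rate=2e-4,                 # learning rate 0.0002
    num_train_epochs=20,                # 20 training epochs
    save_strategy="no",
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```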
(5) Comparative experimental results
To fully verify the effectiveness of EDDVPL, this section conducts experiments on the SWDE dataset. The results of EDDVPL and the comparison methods on training sets of different sizes are shown in Table 3, where k denotes the number of web pages used for training from each website.
Table 3. Experimental results of EDDVPL and the comparison methods on the SWDE dataset
From the results in the table, it can be seen that when k = 10, SimpDOM performs better than DOM2R-Graph: compared with DOM2R-Graph, which focuses only on complex features such as semantics and structure, SimpDOM is rich in discrete features that provide a certain basis for inference when labeled data is extremely scarce. Under this data setting, EDDVPL significantly outperforms all of the above methods. This is because SimpDOM and DOM2R-Graph rely on task data to learn complex semantic knowledge or web page structures that are consistent across websites, and very little training data does not give them sufficient opportunity to learn. For EDDVPL, the following two aspects determine its superior performance:
1) The pre-trained language model contains more comprehensive semantic expressions and rich prior knowledge, which provides a good basis for the model to infer node attribute types with little training data.
2) EDDVPL quickly guides the pre-trained language model to understand what the task requires by constructing a task template, and introduces the DOM tree perspective prompt to help the model, at the semantic level, identify and activate relevant domain knowledge, thereby effectively combining the pre-trained language model with the task objective and the domain data. Even with only a few training web pages, the model can therefore quickly make full use of the limited data.
As k increases, the existing methods can further learn web page features that are more specific to the domain and the task, so their generalization ability improves and the gap with EDDVPL gradually narrows. Nevertheless, relying on its stronger task understanding and the informative DOM tree prompts, EDDVPL still achieves equal or better results across the domains.
(6) Ablation experiment results
To verify the effectiveness of each part of the EDDVPL design, this section conducts ablation experiments on the public SWDE dataset. Specifically, two variant models are designed to demonstrate the contribution of the prompts from the two perspectives; they are described as follows:
1) To demonstrate the important role of the semantic-level DOM tree perspective prompt, this variant removes the DOM tree perspective prompt and performs template filling using only the node's own text (referred to as Template-view).
2) To demonstrate the effectiveness of the task-level template perspective prompt, this variant removes the task template and feeds the node's own text and the DOM tree perspective prompt to the model as a concatenated sequence (referred to as DOM-view).
When 10 web pages from each website are used as the training set (i.e., k = 10), the ablation results for each domain are shown in Fig. 3. As can be seen from Fig. 3, both Template-view and DOM-view degrade relative to the full model. When the DOM tree perspective prompt is removed, the model understands the task objective but cannot quickly retrieve the relevant knowledge because little domain text information is available; when the task-level template perspective prompt is removed, the model possesses only the relevant semantic knowledge while the task requirement remains vague, making rapid convergence difficult. These results show that the prompts from both perspectives provide positive guidance to the model, and also indicate that a pre-trained language model can make maximal use of its existing knowledge and of limited data only when it is tightly combined with the task and the domain.
Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, and that modifications and equivalents may be made thereto by those skilled in the art, which modifications and equivalents are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (9)

1. A semi-structured webpage attribute value extraction method based on prompt learning, characterized by comprising the following steps:
for a semi-structured web page, the text nodes in the web page DOM tree comprise fixed nodes and variable nodes, and for each variable node a semantic-level DOM tree perspective prompt is retrieved;
designing a task template that adds a task description to the text content of each variable node, so as to obtain a task-level template perspective prompt;
rewriting the text content of each variable node and its DOM tree perspective prompt through the task template, and masking the label mapping text in the task template, so as to fuse the DOM tree perspective prompt and the template perspective prompt into a dual-perspective prompt;
using a pre-trained language model based on an encoder-decoder structure, inputting the dual-perspective prompt to the encoder and predicting the text at the mask position with the decoder, wherein the predicted text consists of vocabulary words of the pre-trained language model, and determining the attribute type according to the mapping relation between texts and predefined attributes;
training the pre-trained language model by calculating the matching probability between the text output by the decoder and the label mapping text, computing a loss function from the matching probability, and optimizing the loss function;
and, for a semi-structured web page to be processed, predicting the attribute types of its variable nodes with the trained pre-trained language model: obtaining the normalized probability output at each position of the decoder, calculating a score for each attribute type from these probabilities, and taking the attribute type with the highest score as the prediction result.
2. The method of claim 1, wherein the step of retrieving the semantic-level DOM tree perspective prompt of each variable node comprises:
firstly, for every node, finding its ancestor nodes according to its XPath and putting the node into the set corresponding to each of its ancestors;
then, for a variable node, backtracking upward from its nearest ancestor: if the set corresponding to the current ancestor contains a fixed node whose distance to the variable node meets the requirement, stopping and taking that fixed node as the semantic prompt of the DOM tree perspective; otherwise continuing to the next higher ancestor and repeating the process.
3. The method of claim 1, wherein, when prediction is performed with the pre-trained language model based on the encoder-decoder structure, a mask is placed in advance at the blank position of the task template as a label placeholder, and the pre-trained language model then learns the content information and the length information of the mask position through training.
4. The method of claim 1, wherein the matching probability between the text output by the decoder and the label mapping text is calculated as:
P(φ(y) | T(x_text, x_DOM)) = ∏_t P(o_t = φ_t(y) | o_<t, T(x_text, x_DOM))
wherein P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, y denotes the attribute type, o_<t denotes the decoded sequence to the left of the t-th word, T denotes the task template, x_text denotes the text content of variable node x, and x_DOM denotes the DOM tree perspective prompt of variable node x.
5. The method of claim 1, wherein the loss function is a log-likelihood loss function, and optimizing the loss function means minimizing the negative log-likelihood over all variable nodes.
6. The method of claim 5, wherein the loss function is calculated as:
L = − Σ_{x ∈ N_v} Σ_t log P(o_t = φ_t(y) | o_<t, T(x_text, x_DOM))
wherein L denotes the loss function, P denotes the matching probability function, o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively, y denotes the attribute type, o_<t denotes the decoded sequence to the left of the t-th word, the logarithm base is e, T denotes the task template, x_text denotes the text content of variable node x, x_DOM denotes the DOM tree perspective prompt of variable node x, and N_v denotes the set of variable nodes.
7. The method of claim 1, wherein, when calculating the scores, each predicted attribute type name is converted into its label mapping text and the score is normalized by the length of the label mapping text.
8. The method of claim 7, wherein the score is calculated as:
score_y = (1/|φ(y)|) Σ_{t=1}^{|φ(y)|} P(o_t = φ_t(y))
wherein score_y denotes the score of attribute type y, P(o_t = φ_t(y)) denotes the probability of the word φ_t(y) in the normalized probability output of the decoder at step t, and o_t and φ_t(y) denote the t-th word of the output sequence o and of the label mapping text φ(y) respectively.
9. A semi-structured webpage attribute value extraction system based on prompt learning, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the program, implements the steps of the method of any one of claims 1-8.
CN202310462355.1A 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning Pending CN116628303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310462355.1A CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310462355.1A CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Publications (1)

Publication Number Publication Date
CN116628303A true CN116628303A (en) 2023-08-22

Family

ID=87640792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310462355.1A Pending CN116628303A (en) 2023-04-26 2023-04-26 Semi-structured webpage attribute value extraction method and system based on prompt learning

Country Status (1)

Country Link
CN (1) CN116628303A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994098A (en) * 2023-09-27 2023-11-03 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN116994098B (en) * 2023-09-27 2023-12-05 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement

Similar Documents

Publication Publication Date Title
US11520812B2 (en) Method, apparatus, device and medium for determining text relevance
Taheriyan et al. Learning the semantics of structured data sources
US7676465B2 (en) Techniques for clustering structurally similar web pages based on page features
US7680858B2 (en) Techniques for clustering structurally similar web pages
Liao et al. Unsupervised approaches for textual semantic annotation, a survey
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN105404674A (en) Knowledge-dependent webpage information extraction method
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN116628303A (en) Semi-structured webpage attribute value extraction method and system based on prompt learning
Wei et al. Online education recommendation model based on user behavior data analysis
Huang et al. Design and implementation of oil and gas information on intelligent search engine based on knowledge graph
Jannach et al. Automated ontology instantiation from tabular web sources—the AllRight system
Ye et al. Learning object models from semistructured web documents
CN115982390B (en) Industrial chain construction and iterative expansion development method
Swe Intelligent information retrieval within digital library using domain ontology
Sabri et al. WEIDJ: Development of a new algorithm for semi-structured web data extraction
Angrosh et al. Ontology-based modelling of related work sections in research articles: Using crfs for developing semantic data based information retrieval systems
Sijin et al. Fuzzy conceptualization of the search queries
Carme et al. The lixto project: Exploring new frontiers of web data extraction
Swe Concept Based Intelligent Information Retrieval within Digital Library
Liu et al. Research on adaptive wrapper in deep web data extraction
Wang et al. PAREI: A progressive approach for Web API recommendation by combining explicit and implicit information
Shin et al. Deep-learning-based image tagging for semantic image annotation
Li et al. Multi-strategies Integrated Information Extraction for Scholar Profiling Task
Jia et al. Leveraging Large Language Models for Semantic Query Processing in a Scholarly Knowledge Graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination