CN110968761B - Webpage structured data self-adaptive extraction method

Info

Publication number: CN110968761B
Application number: CN201911196582.4A
Authority: CN
Inventors: 陈星; 郭莹楠; 杨植; 郑勇杰; 陈晓娜
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2022-07-08
Anticipated expiration: 2039-11-29
Also published as: CN110968761A; WO2021103557A1

Abstract

The invention relates to a self-adaptive extraction method for webpage structured data. An extraction template is first encapsulated, and the template is then used to judge whether the structure of a target webpage has changed. If the structure has not changed, the data is located in the target webpage according to the path of the data stored in the extraction template. If the structure has changed, the similarity between the designated area of the extraction template and every area of the target webpage is calculated, the area with the highest similarity is taken as the candidate area, and the data items in the candidate area are mapped: the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty is calculated, and each data item is mapped to the node with the highest similarity. The invention can still correctly extract the target data after the structure of the webpage changes.

Description

Self-adaptive extraction method for webpage structured data

Technical Field

The invention relates to the field of extraction of web page structured data of the Internet of things, in particular to a self-adaptive extraction method of web page structured data.

Background

The Internet is a huge repository of resources. The number of Web pages has reached hundreds of billions and keeps growing at an incredible speed every hour. The rapid development of the Internet has caused information to grow explosively, and the Web, as the main carrier of Internet information, is full of information of all kinds. To collect the useful information contained in Web pages, various Web data extraction techniques have been proposed.

However, current Web data extraction techniques generally target one specific Web page structure. When a Web page is iteratively updated, its structure may change, so that the page information can no longer be extracted or wrong information is extracted.

Disclosure of Invention

In view of the above, the present invention provides a method for adaptively extracting web page structured data, which can still correctly extract target data after a web page structure changes.

The invention is realized by adopting the following scheme: a webpage structured data self-adaptive extraction method comprises the following steps:

encapsulating an extraction template, and judging whether the structure of the target webpage has changed according to the extraction template; if the structure has not changed, finding the data in the target webpage according to the path of the data in the extraction template; if the structure of the target webpage has changed, calculating the similarity between the designated area of the extraction template and all areas of the target webpage, taking the area with the highest similarity as the candidate area, and mapping the data items in the candidate area, wherein the similarity between the node corresponding to each data item and all nodes of the target webpage whose text content is not empty is calculated, and each data item is mapped to the node with the highest similarity.

Further, encapsulating the extraction template specifically comprises the following steps:

step S11: inputting the target webpage, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the webpage, and parses it to generate a DOM tree;

step S12: finding, in the DOM tree, the designated subtree containing the data to be extracted according to the input labeling information;

step S13: crawling the information of the subtree and storing it as a file, Template, in a specific format; the Template contains Json and DOMTree, wherein Json is the structured representation of the data to be extracted from the specific area of the webpage, and DOMTree is the DOM subtree of that specific area.

Further, in step S13, the Json is expressed as:

Json＝<name₁:value₁,name₂:value₂,...,name_n:value_n>；

in the formula, name_i is the name of the data to be extracted, and value_i is the data value corresponding to that name;

the DOMTree is expressed as:

DOMTree＝<Node₁,Node₂,…,Node_n>；

in the formula, Node_i is a node of the tree, wherein Node₁ is the root node of the subtree;

one Node in the given DOM tree is represented as:

Node＝<tag,Father,Child,xpath,text,Attri>；

in the formula, tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the feature attribute of the node;

the feature attribute Attri of a given node is expressed as:

Attri＝<id,class,x,y,w,h>；

in the formula, id is the id of the node's tag, class is the class name of the node's tag, x is the distance between the node and the left edge of the page, y is the distance between the node and the top of the page, w is the width of the area occupied by the node in the page, and h is the height of the area occupied by the node in the page;

given a path xpath of a Node, it is represented as a sequence:

path＝</tag₁[x₁]/tag₂[x₂]/…/tag_n[x_n]>；

where tag_i denotes a tag name on the path, and x_i indicates that the node is the x_i-th node at the same level of the DOM tree.
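The template structures defined above can be pictured as plain record types. The following is a minimal sketch, not the patent's own implementation: the Python class layout, the field name `cls` (standing in for `class`) and the helper `parse_xpath` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Attri:
    """Feature attributes of a node: <id, class, x, y, w, h>."""
    id: str = ""
    cls: str = ""    # "class" is a Python keyword, so the field is named cls
    x: float = 0.0   # distance from the left edge of the page
    y: float = 0.0   # distance from the top of the page
    w: float = 0.0   # width of the area occupied by the node
    h: float = 0.0   # height of the area occupied by the node


@dataclass
class Node:
    """One DOM node: <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    # Excluded from repr/compare to avoid recursing through parent links.
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    child: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    """Extraction template: the labelled data plus the subtree holding it."""
    json: Dict[str, str]   # name_i -> value_i
    dom_tree: List[Node]   # dom_tree[0] is the root node of the subtree


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag1[x1]/tag2[x2]' into [(tag1, x1), (tag2, x2)]."""
    steps = []
    for step in xpath.strip("/").split("/"):
        match = re.fullmatch(r"([^\[\]]+)\[(\d+)\]", step)
        if match is None:
            raise ValueError(f"malformed path step: {step!r}")
        steps.append((match.group(1), int(match.group(2))))
    return steps
```

For example, `parse_xpath("/html[1]/body[1]/div[3]")` yields the three steps `("html", 1)`, `("body", 1)` and `("div", 3)`.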

Further, the step of judging whether the structure of the target webpage is changed according to the extracted template specifically includes:

reading the json string and the information of all nodes of the subtree from the extraction template and parsing the node information into a DOM tree; calling the JS script to extract the information of all nodes in the target page and parsing it to generate a DOM tree;

locating, in the DOM tree of the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold value, the structure of the target webpage is considered unchanged; otherwise, the structure of the target webpage is considered changed.

Further, calculating the similarity between the designated area of the extraction template and all areas of the target webpage, and taking the area with the highest similarity as the candidate area, specifically comprises the following steps:

step S21: determining the path similarity between the designated area and each area in the target webpage;

step S22: determining the structural similarity between the designated area and each area in the target webpage;

step S23: determining the text similarity between the designated area and each area in the target webpage;

step S24: for each area in the target webpage, weighting the path similarity, the structural similarity and the text similarity between that area and the designated area according to preset weights to obtain the total similarity between the area and the designated area, and selecting the area with the highest total similarity as the candidate area.
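Step S24 amounts to a weighted sum followed by an argmax over the areas. A minimal sketch follows; the weight values, the function names and the use of an xpath string as the area key are illustrative assumptions, since the text only says the weights are preset.

```python
from typing import Dict, Tuple

Sims = Tuple[float, float, float]   # (path, structural, text) similarity


def total_similarity(sims: Sims, weights: Sims = (0.3, 0.3, 0.4)) -> float:
    """Weighted sum of the three similarities (weights are assumed values)."""
    return sum(s * w for s, w in zip(sims, weights))


def pick_candidate(
    area_sims: Dict[str, Sims], weights: Sims = (0.3, 0.3, 0.4)
) -> Tuple[str, float]:
    """Return the area with the highest total similarity and its score."""
    totals = {area: total_similarity(s, weights) for area, s in area_sims.items()}
    best = max(totals, key=totals.get)
    return best, totals[best]


# Similarities of two areas of the target page to the designated area.
areas = {
    "/html[1]/body[1]/div[1]": (0.9, 0.2, 0.1),
    "/html[1]/body[1]/div[2]": (0.6, 0.8, 0.9),
}
best_area, best_score = pick_candidate(areas)
```

With these sample numbers the second area wins, because its structural and text similarities outweigh the first area's higher path similarity.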

Further, mapping the data items in the candidate area, calculating the similarity between the node corresponding to each data item and all nodes of the target webpage whose text content is not empty, and mapping each data item to the node with the highest similarity, specifically comprises the following steps:

step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area;

step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;

step S23: calculating the text similarity between each data item in the designated area and each data item in the candidate area;

step S24: for each data item in the designated area, weighting the path similarity, the structural similarity and the text similarity from steps S21 to S23 according to preset weights to obtain the total similarity between that data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area.

Compared with the prior art, the invention has the following beneficial effects: according to the method, the characteristic values of all areas of the webpage are extracted through page rendering, and information such as the DOM tree structure of the webpage and the text similarity is combined, so that the target data can still be correctly extracted after the webpage structure is changed.

Drawings

FIG. 1 is a schematic diagram of the method of an embodiment of the present invention.

Fig. 2 is example 1 of a JS script called by the system according to an embodiment of the present invention, where Algorithm 1 is a crawler script and Algorithm 2 is a search tree algorithm.

Fig. 3 is example 2 of a JS script called by the system according to the embodiment of the present invention, where Algorithm 3 is an in-region data item matching algorithm.

Fig. 4 is an example of a web page before and after updating according to an embodiment of the present invention, where (a) is before updating the web page and (b) is after updating the web page.

Fig. 5 is a schematic diagram of an extraction result of the method of the embodiment.

Detailed Description

The invention is further explained by the following embodiments in conjunction with the drawings.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

As shown in fig. 1, the present embodiment provides a method for adaptively extracting web page structured data, including the following steps:

Preferably, the encapsulation of the extraction template can be described as follows: the URL of the page holding the information to be extracted, the labeling information of the data to be extracted and the name of the extraction template are input; the system finds the corresponding block on the webpage according to the labeled content, encapsulates the feature values and labeling information of the block into the extraction template, and stores it as a file in a specific format. The step that follows encapsulation is the adaptive Web data extraction process. The required template information is first read from the extraction template, including the information to be extracted from the specific area of the old page and the feature attributes of that area. The page is parsed into a DOM tree with an existing crawler tool and its feature attributes are acquired. The area at the path specified by the extraction template is then located in the current page, and the similarity of the two areas is calculated to judge whether they are similar. If the similarity is greater than a specified threshold, the structure of the page has not changed and the specified information is extracted; if the similarity is smaller than the specified threshold, the structure of the webpage has changed and adaptive matching between the new and old webpages is performed. The adaptive matching can be divided into two stages: target area matching and intra-area data item mapping. Both stages comprise path similarity calculation, structural similarity calculation and text similarity calculation; computing the similarity between nodes from these three aspects improves the accuracy of the adaptation.

In this embodiment, encapsulating the extraction template specifically includes the following steps:

Preferably, in step S11, the algorithm of the JS script called by the system is as shown in fig. 2.

In this embodiment, in step S13, Json is expressed as:

Json＝<name₁:value₁,name₂:value₂,...,name_n:value_n>；

the DOMTree is expressed as:

DOMTree＝<Node₁,Node₂,…,Node_n>；

one Node in the given DOM tree is represented as:

Node＝<tag,Father,Child,xpath,text,Attri>；

the feature attribute Attri of a given node is expressed as:

Attri＝<id,class,x,y,w,h>；

given a path xpath of a Node, it is represented as a sequence:

path＝</tag₁[x₁]/tag₂[x₂]/…/tag_n[x_n]>；

where tag_i denotes a tag name on the path, and x_i indicates that the node is the x_i-th node at the same level of the DOM tree.

Preferably, after the extraction template is obtained, the data required from a webpage can be extracted by inputting the name of the template and the URL of the target webpage. The adaptive Web data extraction process can be divided into 3 steps:

1. Read the json string and the information of all nodes of the subtree from the extraction template and parse the node information into a DOM tree; call the JS script to extract the information of all nodes in the target page and parse it to generate a DOM tree.

2. Locate, in the DOM tree of the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judge whether the structure has changed between the two subtrees. If the similarity is greater than a specified threshold, the structure of the webpage has not changed, and the data is found in the target webpage according to the path of the data in the extraction template; if the similarity is smaller than the specified threshold, the adaptive matching stage starts.

3. In the adaptive stage, calculate the similarity between the designated area of the extraction template and all areas of the target webpage. The similarity calculation covers path similarity, structural similarity and text similarity, and the total similarity is the weighted average of the three. The area with the highest similarity is taken as the candidate area, and the data items in that area are mapped: the similarity between the node corresponding to each data item and all nodes of the target webpage whose text content is not empty is calculated, again as a weighted average of path similarity, structural similarity and text similarity, and each data item is mapped to the node with the highest similarity.
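The control flow of these three steps can be sketched as a single function. This is an illustrative outline under stated assumptions: the subtree similarity is passed in as a precomputed number, `read_by_path` and `adaptive_match` are hypothetical callbacks standing in for the crawler and the adaptive matching stage, and the equality case (similarity exactly at the threshold), which the text leaves open, is treated as "changed".

```python
from typing import Callable, Dict, Optional


def extract(
    template_paths: Dict[str, str],
    subtree_similarity: float,
    threshold: float,
    read_by_path: Callable[[str], Optional[str]],
    adaptive_match: Callable[[], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Extract each named data item from the target page.

    template_paths maps a data name to the path stored in the template.
    """
    if subtree_similarity > threshold:
        # Structure unchanged: reuse the paths stored in the template.
        paths = template_paths
    else:
        # Structure changed: ask the adaptive stage for new paths.
        paths = adaptive_match()
    return {name: read_by_path(path) for name, path in paths.items()}


# A page modelled as a path -> text mapping, before and after an update.
old_page = {"/div[1]/span[1]": "3.5"}
new_page = {"/div[2]/p[1]": "3.5"}
stored = {"score": "/div[1]/span[1]"}

unchanged = extract(stored, 0.95, 0.8, old_page.get, lambda: {})
changed = extract(stored, 0.40, 0.8, new_page.get,
                  lambda: {"score": "/div[2]/p[1]"})
```

In both calls the value `"3.5"` is recovered; in the second it is found only because the adaptive stage supplies the new path.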

In this embodiment, the determining whether the structure of the target webpage is changed according to the extracted template specifically includes:

locating, in the DOM tree of the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold value, the structure of the target webpage is considered unchanged; otherwise, the structure of the target webpage is considered changed.

In this embodiment, regarding target area matching: a web page can be structurally divided into several areas, that is, the DOM tree of the web page is divided into several subtrees. The extraction template stores the feature attributes of the designated area crawled from the web page in advance, together with the structure of the data to be extracted. Comparing area similarity means calculating the similarity between all feature values and attributes of the designated area stored in the template and the feature values and attributes of every area of the input web page; the area with the highest similarity is regarded as the designated area after the iterative update of the web page.

Specifically, calculating the similarity between the designated area of the extraction template and all areas of the target webpage, and taking the area with the highest similarity as the candidate area, comprises the following steps:

Step S21: determining the path similarity between the designated area and each area in the target webpage. Observation of a large number of webpages before and after iterative updates shows that, even when the webpage structure changes, most sub-blocks of a webpage only move near their original position, so the path similarity of two areas can serve as an indicator of area similarity. A formula is constructed with the DOM tree paths of the root nodes of the two areas as its two variables. A DOM tree path is generally regarded as the sequence of tags from the root node to a leaf node. The traditional DOM tree path matching model computes the similarity of path sequences by sequence matching only and ignores the position of the tree path within the DOM tree, which is clearly unrealistic: the calculated similarity cannot truly and effectively reflect the actual similarity. Therefore, the present embodiment proposes an improved path similarity calculation method. For two tree paths:

xpath_i＝＜/tagName₁[x₁]/tagName₂[x₂]/.../tagName_n[x_n]＞，

xpath_tar＝＜/tagName₁[x₁]/tagName₂[x₂]/.../tagName_n[x_n]＞，

the DOM tree path similarity between them is defined as follows:

sim(xpath_i,xpath_tar)＝st(xpath_i,xpath_tar)*ω₁+sp(xpath_i,xpath_tar)*(1-ω₁)；

wherein st(xpath_i, xpath_tar) represents the similarity of the tag sequences of the two tree paths, path_i(tagName_i) ∩ path_tar(tagName_j) represents the length of the longest common tag sequence of the two paths starting from the root node, and len(path_i) represents the length of the tag sequence of path_i;

sp(xpath_i, xpath_tar) represents the position similarity of the two tree paths, and is computed from the number of nodes of the two paths that have the same sequence number at the same level within the longest common tag sequence starting from the root node.

The path similarity is composed of two parts, st(xpath_i, xpath_tar) and sp(xpath_i, xpath_tar), which reflect the tag sequence information and the position information respectively. ω₁ is the weight between the two, with a value range of 0 to 1; the importance of the two parts in the path similarity can be adjusted by changing ω₁.
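The two components can be illustrated with a short sketch. The exact normalisation of st and sp is not reproduced in this text, so the code assumes both are divided by the length of the longer path; the function names and the default weight of 0.5 are likewise assumptions made for illustration.

```python
import re
from typing import List, Tuple

Step = Tuple[str, int]   # (tag name, sequence number at that level)


def parse(xpath: str) -> List[Step]:
    """Turn '/html[1]/body[1]' into [('html', 1), ('body', 1)]."""
    return [(tag, int(idx))
            for tag, idx in re.findall(r"/([^/\[\]]+)\[(\d+)\]", xpath)]


def common_prefix_len(a: List[Step], b: List[Step]) -> int:
    """Length of the longest common tag sequence starting from the root."""
    n = 0
    for (tag_a, _), (tag_b, _) in zip(a, b):
        if tag_a != tag_b:
            break
        n += 1
    return n


def path_similarity(xpath_i: str, xpath_tar: str, w1: float = 0.5) -> float:
    """sim = st * w1 + sp * (1 - w1), with assumed normalisation."""
    a, b = parse(xpath_i), parse(xpath_tar)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    common = common_prefix_len(a, b)
    # Nodes in the common prefix whose sequence numbers also agree.
    same_index = sum(1 for k in range(common) if a[k][1] == b[k][1])
    st = common / longest       # tag sequence similarity (assumed form)
    sp = same_index / longest   # position similarity (assumed form)
    return st * w1 + sp * (1 - w1)


moved = path_similarity("/html[1]/body[1]/div[2]/ul[1]",
                        "/html[1]/body[1]/div[3]/ul[1]")
```

In this example the tag sequences are identical (st = 1.0) but one of the four sequence numbers differs (sp = 0.75), so `moved` is 0.875 with equal weights. A pure sequence match would have reported the two paths as identical.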

Step S22: determining the structural similarity between the designated area and each area in the target webpage. The structural similarity between areas considers both a virtual structure and a real structure, namely the structure of the DOM tree and the visual structure of the rendered webpage, and therefore consists of two parts: the tree structure similarity, and the coordinates and size of the area in the webpage. The tree structure similarity covers whether the parent nodes are consistent, a comparison of the total number of nodes contained in the tree, and a comparison of the DOM tree heights; the coordinates and size of an area in the webpage cover its height, its width, its distance from the top of the page and its distance from the left side of the page. For two areas, the structural similarity between them is defined as follows:

sim(treestru_i,treestru_tar)＝st(T_i,T_tar)*ω+sp(T_i,T_tar)*(1-ω)；

wherein st(T_i, T_tar) represents the similarity of the DOM tree structures of the web page, equal(root_i, root_tar) indicates whether the root nodes of the two areas are consistent, T_i(node) represents the total number of nodes contained in T_i, and H(T_i) represents the height of T_i, i.e. the number of node levels of the DOM tree. ω_i (i = 1, 2, 3) are the weights between them, each with a value range of 0 to 1.

sp(T_i, T_tar) represents the similarity of the size and coordinates of the two areas in the entire page, height(T_i) represents the height of the area, width(T_i) represents the width of the area, top(T_i) represents the distance of the area represented by T_i from the top of the page, and left(T_i) represents the distance of the area represented by T_i from the left side of the page. ω_i (i = 1, 2, 3, 4) are the weights between them, each with a value range of 0 to 1.

The structural similarity between areas is composed of two parts, st(T_i, T_tar) and sp(T_i, T_tar), which represent the DOM tree structure information and the graphical interface layout information respectively. ω is the weight between the two, with a value range of 0 to 1; the importance of the two parts in the structural similarity can be adjusted by changing ω.
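A small sketch can make the two-part structure concrete. The individual formulas for st and sp are not reproduced in this text, so the code below assumes a min/max ratio for every numeric comparison, compares root nodes by tag name, and uses equal weights; the `Region` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Region:
    root_tag: str      # tag of the area's root node
    node_count: int    # total number of nodes in the subtree
    tree_height: int   # number of node levels in the subtree
    height: float      # rendered height
    width: float       # rendered width
    top: float         # distance from the top of the page
    left: float        # distance from the left side of the page


def ratio(a: float, b: float) -> float:
    """Closeness of two non-negative values in [0, 1] (assumed form)."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def tree_similarity(r: Region, t: Region,
                    w: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """st: root consistency, node count and tree height."""
    same_root = 1.0 if r.root_tag == t.root_tag else 0.0
    return (w[0] * same_root
            + w[1] * ratio(r.node_count, t.node_count)
            + w[2] * ratio(r.tree_height, t.tree_height))


def layout_similarity(r: Region, t: Region,
                      w: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """sp: height, width, top offset and left offset."""
    return (w[0] * ratio(r.height, t.height) + w[1] * ratio(r.width, t.width)
            + w[2] * ratio(r.top, t.top) + w[3] * ratio(r.left, t.left))


def structural_similarity(r: Region, t: Region, omega: float = 0.5) -> float:
    """sim = st * omega + sp * (1 - omega)."""
    return tree_similarity(r, t) * omega + layout_similarity(r, t) * (1 - omega)


old = Region("div", 20, 4, 200.0, 400.0, 100.0, 50.0)
new = Region("div", 10, 4, 100.0, 400.0, 100.0, 50.0)
score = structural_similarity(old, new)
```

Here the updated area has half as many nodes and half the height of the old one, while everything else is unchanged, so both components drop below 1 but stay high.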

Step S23: determining the text similarity between the designated area and each area in the target webpage. Text similarity is also a measure of the similarity between areas, and the present embodiment uses the synonym forest to calculate the similarity between words. The synonym forest classifies words semantically and organizes them into a five-level tree structure, in which each group of synonyms is given an eight-character code. The structure covers synonymy, hypernymy and hyponymy relations between word senses. At the fifth level, words are grouped, and one character is appended to the end of the code to mark whether the words in the group are synonyms ("="), related words of the same class ("#"), or whether the group contains only one word ("@"). With this encoding rule, the present embodiment calculates the similarity of Chinese text using the following algorithm. The text within an area may be viewed as a sentence composed of several words, so calculating text similarity is essentially calculating sentence similarity. Thus, the present embodiment obtains the text similarity from the word similarity sim(w, text_tar) using the following formula:

sim(w,text)＝max(sim(w,word₁),...,sim(w,word_k))，

where w is a word and text is all the text in an area, containing k words; sim(w, word_i) is the similarity of two words. text_i and text_tar are all the text in the two areas, defined as text_i＝{w_i,1, w_i,2, ..., w_i,m} and text_tar＝{w_tar,1, w_tar,2, ..., w_tar,n}, where m and n are the numbers of words into which text_i and text_tar are split. Text similarity combines two metrics: the similarity of the text contents, and a comparison of the total text lengths of the two areas. ω is the weight between the two parts, with a value range of 0 to 1; the importance of the two parts in the text similarity can be adjusted by changing ω.
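The word-to-text maximum and the two-part text similarity can be sketched as follows. Several pieces are assumptions: the content score is taken as the average of sim(w, text_tar) over the words of text_i, the length comparison is a min/max ratio of word counts, and `exact_match` is only a stand-in for a synonym forest lookup. A real Chinese pipeline would also need a word segmenter, which is outside this sketch.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Stand-in word similarity; a synonym forest lookup would go here."""
    return 1.0 if a == b else 0.0


def word_to_text(w: str, text: Sequence[str],
                 word_sim: WordSim = exact_match) -> float:
    """sim(w, text) = max(sim(w, word_1), ..., sim(w, word_k))."""
    return max((word_sim(w, other) for other in text), default=0.0)


def text_similarity(text_i: Sequence[str], text_tar: Sequence[str],
                    omega: float = 0.5,
                    word_sim: WordSim = exact_match) -> float:
    """Combine content similarity and length comparison (assumed forms)."""
    if not text_i or not text_tar:
        return 0.0
    # Content: average best match of each word of text_i in text_tar.
    content = sum(word_to_text(w, text_tar, word_sim) for w in text_i) / len(text_i)
    # Length: ratio of the shorter word count to the longer one.
    m, n = len(text_i), len(text_tar)
    length = min(m, n) / max(m, n)
    return content * omega + length * (1 - omega)


same = text_similarity(["department", "ophthalmology"],
                       ["department", "ophthalmology"])
partial = text_similarity(["a", "b"], ["a", "c", "d", "e"])
```

With exact matching, identical word lists score 1.0, while the second pair scores 0.5: half of the words match and the shorter text has half the length of the longer one.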

Wherein, the calculation of the total similarity adopts the following formula:

Through the above calculation, the area with the highest similarity to the target area is obtained and regarded as the suspected target area. If its similarity is greater than a certain threshold, data item matching within the area is performed next; if the similarity is smaller than the threshold, the target area cannot be found in the updated webpage. First, the nodes in an area whose text content is not empty are taken to form the set of nodes to be matched, defined as follows:

Items＝<node₁,node₂,...,node_k>；

The set contains k nodes node_i. Fig. 3 shows the data item matching algorithm of the present embodiment.

In this embodiment, mapping the data items in the candidate area, calculating the similarity between the node corresponding to each data item and all nodes of the target webpage whose text content is not empty, and mapping each data item to the node with the highest similarity, specifically includes the following steps:

Step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area. The path similarity between data items is calculated with the formula defined above, except that the parameter path here is the path within the area rather than the path within the whole webpage, that is:

path＝xpath-xpath_root；

wherein xpath_root is the path of the root node of the area. The calculation formula of the path similarity of the data item is as follows:
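The subtraction path = xpath - xpath_root can be read as stripping the area root's path from the front of the node's path. A minimal sketch under that reading (the function name and the error handling are assumptions):

```python
def relative_path(xpath: str, xpath_root: str) -> str:
    """Return the part of xpath that lies below xpath_root."""
    root = xpath_root.rstrip("/")
    if xpath == root:
        return ""   # the node is the area root itself
    # Require a step boundary so '/div[3]' is not taken as a prefix of '/div[31]'.
    if not xpath.startswith(root + "/"):
        raise ValueError(f"{xpath!r} is not inside {xpath_root!r}")
    return xpath[len(root):]


before = relative_path("/html[1]/body[1]/div[3]/ul[1]/li[2]",
                       "/html[1]/body[1]/div[3]")
after = relative_path("/html[1]/body[1]/div[5]/ul[1]/li[2]",
                      "/html[1]/body[1]/div[5]")
```

Both calls return `/ul[1]/li[2]`: when an area moves within the page but keeps its internal layout, the in-area paths of its data items stay identical, which is why the comparison is made on relative paths.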

Step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area. The structural similarity between data items mainly considers the tag attributes of the page and the relative position within the area. The tag attributes cover whether the tag names are consistent, whether the tag ids are consistent, and whether the font family, font size and font color in the CSS style are consistent; the relative position within the area covers a comparison of the distance from the top of the page and the distance from the left side of the page. The structural similarity of two data items is defined as follows:

wherein the similarity of the tag attributes of the data items is built from the following terms: equal(tagName_i, tagName_tar) indicates whether the tag names are consistent, 1 if so and 0 otherwise; equal(id_i, id_tar) indicates whether the tag ids are consistent, 1 if so and 0 otherwise; equal(font-family_i, font-family_tar), equal(font-size_i, font-size_tar) and equal(font-color_i, font-color_tar) indicate whether the font families, the font sizes and the font colors are consistent respectively, 1 if so and 0 otherwise. ω_i are the weights between them, each with a value range of 0 to 1.

The similarity of the relative positions of the data items within the area is built from the distance of the data item from the top of the page and its distance from the left side of the page; the two are of equal importance, so each has a weight of 0.5.
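The equal(...) terms and the position terms can be combined as in the sketch below. Only the 0/1 meaning of equal and the 0.5/0.5 position weights come from the text; the equal attribute weights of 0.2, the min/max ratio for distances, the weight `omega` between the two parts and the `Item` record are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    tag_name: str
    id: str
    font_family: str
    font_size: str
    font_color: str
    top: float    # distance from the top of the page
    left: float   # distance from the left side of the page


def equal(a: str, b: str) -> float:
    """1 if the two values are consistent, otherwise 0."""
    return 1.0 if a == b else 0.0


def ratio(a: float, b: float) -> float:
    """Closeness of two non-negative distances in [0, 1] (assumed form)."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def tag_attr_similarity(a: Item, b: Item, w: float = 0.2) -> float:
    """Five equal(...) terms with an assumed weight of 0.2 each."""
    return w * (equal(a.tag_name, b.tag_name) + equal(a.id, b.id)
                + equal(a.font_family, b.font_family)
                + equal(a.font_size, b.font_size)
                + equal(a.font_color, b.font_color))


def position_similarity(a: Item, b: Item) -> float:
    """Top and left distances weighted 0.5 each, as stated in the text."""
    return 0.5 * ratio(a.top, b.top) + 0.5 * ratio(a.left, b.left)


def item_structural_similarity(a: Item, b: Item, omega: float = 0.5) -> float:
    """Assumed combination of the two parts with weight omega."""
    return tag_attr_similarity(a, b) * omega + position_similarity(a, b) * (1 - omega)


old_item = Item("span", "score", "Arial", "14px", "#333", 10.0, 20.0)
new_item = Item("p", "score", "Arial", "16px", "#333", 20.0, 20.0)
item_score = item_structural_similarity(old_item, new_item)
```

In this example three of the five attributes still agree (0.6) and only the vertical offset has changed (0.75), giving a combined score of 0.675 under the assumed weights.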

Step S23: calculating the text similarity between each data item in the designated area and each data item in the candidate area; the text similarity calculation for the data items also uses the formula defined above, except that the text content text contains only the text content of a single data item, rather than all the text within the entire region.

sim(w,nodetext)＝max(sim(w,word₁),...,sim(w,word_s))，

where nodetext is the text contained in a single node, containing s words, and sim(w, word_i) is the similarity of two words. nodetext_i and nodetext_tar are the texts in two nodes, defined as nodetext_i＝{w_i,1, w_i,2, ..., w_i,p} and nodetext_tar＝{w_tar,1, w_tar,2, ..., w_tar,q}, where p and q are the numbers of words into which nodetext_i and nodetext_tar are split. Text similarity combines two metrics: the similarity of the text contents, and a comparison of the total text lengths of the two nodes. ω is the weight between the two parts, with a value range of 0 to 1; the importance of the two parts in the text similarity can be adjusted by changing ω.

Step S24: for each data item in the designated area, the path similarity, the structural similarity and the text similarity from steps S21 to S23 are weighted according to preset weights to obtain the total similarity between that data item and each data item in the candidate area, and the data item with the highest total similarity is selected as the data item in the candidate area corresponding to the data item in the designated area. That is, the similarity between every data item in the area and the data items of the specific area specified by the configuration file is calculated, and the three measures obtained are combined with certain weights into the total similarity, giving the following formula:
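The mapping in step S24 can be sketched as a loop over the template's data items, each choosing its best-scoring candidate. The weight values, the function name `map_items` and the idea of passing the three similarities through a callback are assumptions; data items are identified here by their in-area paths.

```python
from typing import Callable, Dict, Sequence, Tuple

Sims = Tuple[float, float, float]   # (path, structural, text) similarity


def map_items(
    template_items: Sequence[str],
    candidate_items: Sequence[str],
    similarities: Callable[[str, str], Sims],
    weights: Sims = (0.3, 0.3, 0.4),
) -> Dict[str, str]:
    """Map each template data item to its most similar candidate item."""
    if not candidate_items:
        return {}
    mapping = {}
    for item in template_items:
        def total(cand: str, item: str = item) -> float:
            # Weighted sum of the three similarities for this pair.
            return sum(s * w for s, w in zip(similarities(item, cand), weights))
        mapping[item] = max(candidate_items, key=total)
    return mapping


# Precomputed similarities for two template items and two candidates.
table = {
    ("/span[1]", "/p[1]"): (0.5, 0.6, 1.0),
    ("/span[1]", "/p[2]"): (0.5, 0.6, 0.1),
    ("/span[2]", "/p[1]"): (0.4, 0.5, 0.0),
    ("/span[2]", "/p[2]"): (0.4, 0.5, 0.9),
}
mapping = map_items(["/span[1]", "/span[2]"], ["/p[1]", "/p[2]"],
                    lambda a, b: table[(a, b)])
```

Each template item is matched independently, so nothing in this sketch prevents two items from selecting the same candidate; whether the original algorithm in Fig. 3 guards against that is not stated in the text.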

In particular, to better illustrate the effect of the embodiment, Fig. 4 shows an example of a web page before and after an update. The structure of the web page has changed considerably: the position and size of the target area have changed, and the data content to be extracted has changed as well. With a traditional web page data extraction algorithm, the target block needed by the user can no longer be located after the web page structure changes, and the correspondence between the data items of the new and old versions cannot be found, which hinders large-scale data extraction. The present embodiment aims to monitor changes of the web page in real time and adaptively adjust the extraction template to follow the updates of the web page.

The feasibility of the method of this embodiment is discussed with respect to this example, in which the doctor's information is extracted from the web page. First, the URL of the website, the annotated json data {'Recommendation heat (overall)': '3.5', 'Thank-you letters': '1', 'Gifts': '0', 'Department': 'Department of Ophthalmology, Affiliated Hospital of Southwest Medical University', 'Specialty': 'correction and prevention of keratopathy, corneal refractive surgery, uveal disease, ametropia, ocular laser examination and treatment', 'Brief introduction': 'Zheng, female, associate chief physician, associate professor, Master of Medicine, member of the Chinese Medical Association, engaged in clinical medical treatment, teaching and scientific research for over 10 years.'} and the name 'sector' of the extraction template are input into the system, and the corresponding extraction template is obtained by running it. When the information of the area is to be extracted, the name of the extraction template and the webpage URL are input. The system parses the webpage into a DOM tree storing the information of all nodes of the page, locates the area at the path specified by the extraction template in the current webpage, and calculates the similarity of the areas to judge whether the two areas are similar. If the similarity is greater than the specified threshold, the structure of the webpage has not changed and the specified information is extracted; if the similarity is smaller than the specified threshold, the structure of the webpage has changed, and adaptive matching of the new and old webpages and updating of the extraction template are performed. The information extracted before and after the update of the web page shown in Fig. 4 is shown in Fig. 5, where (a) is the data extracted before the update and (b) is the data extracted after the update. It can be seen from the figure that the method of the embodiment can still effectively extract data when the structure of the webpage has changed considerably.

In summary, when formulating the extraction template, the method provided by this embodiment defines not only the corresponding extraction rule but also an adaptive matching rule based on the text features, HTML tag features, visual features and DOM tree structure features of the page data. A web page is matched against the corresponding extraction template, and data is extracted according to the extraction rule once the match succeeds; if the page has changed and the xpath expression fails, the data is located again according to the adaptive matching rule and the xpath is updated. Experimental results show that the method has high accuracy and effectively reduces manual intervention in the extraction process.

The foregoing describes preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.

Claims

1. A webpage structured data self-adaptive extraction method is characterized by comprising the following steps:

encapsulating the extraction template, judging whether the structure of the target webpage is changed or not according to the extraction template, and if not, finding the data in the target webpage according to the path of the data in the extraction template; if the structure of the target webpage is changed, calculating the similarity between the designated area of the extraction template and all areas of the target webpage, taking the area with the highest similarity as a candidate area, mapping data items in the candidate area, and calculating the similarity between the node corresponding to each data item and all nodes with non-empty text content in the target webpage, wherein each data item corresponds to the node with the highest similarity;

the specific step of judging whether the structure of the target webpage is changed according to the extraction template is as follows:

finding, in the DOM tree of the target webpage, the subtree under the path of the root node of the area specified by the extraction template, and judging whether the structure has changed by comparing the two subtrees: if the similarity of the two subtrees is greater than a specified threshold value, the structure of the target webpage is not changed; otherwise, the structure of the target webpage is changed;

calculating the similarity between the specified area of the extraction template and all areas of the target webpage and taking the area with the highest similarity as the candidate area specifically comprises the following step:

step S24: for each region in the target webpage, respectively performing weighted calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity of the region and a specified region, and selecting the region with the highest total similarity as a candidate region;

mapping the data items in the candidate area, calculating the similarity between the node corresponding to each data item and all nodes with non-empty text content in the target webpage, and matching each data item to the node with the highest similarity specifically comprises the following steps:

step S25: calculating the path similarity between each data item in the designated area and each data item in the candidate area;

step S26: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;

step S27: calculating the text similarity between each data item in the designated area and each data item in the candidate area;

step S28: for each data item in the designated area, respectively performing weighted calculation on the path similarity, the structure similarity and the text similarity in the steps S25 to S27 according to preset weights to obtain the total similarity between the data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area.
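The weighted selection used in steps S24 and S28 can be sketched as follows. This is a minimal illustration under stated assumptions: the weight values, the dictionary layout (`xpath`, `tags`, `text`) and the use of `difflib.SequenceMatcher` for each partial similarity are hypothetical; the claim only requires that the three similarities be combined with preset weights and that the highest total be selected.

```python
from difflib import SequenceMatcher

# Hypothetical weights; the claim only states that the weights are preset.
WEIGHTS = {"path": 0.3, "structure": 0.3, "text": 0.4}


def ratio(a, b):
    """Generic sequence similarity in [0, 1], used here as a stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def total_similarity(a, b, weights=WEIGHTS):
    """Weighted sum of path, structure and text similarity between two items."""
    parts = {
        "path": ratio(a["xpath"].split("/"), b["xpath"].split("/")),
        "structure": ratio(a["tags"], b["tags"]),
        "text": ratio(a["text"], b["text"]),
    }
    return sum(weights[key] * parts[key] for key in weights)


def best_match(target, candidates, weights=WEIGHTS):
    """Pick the candidate with the highest total similarity to the target."""
    return max(candidates, key=lambda c: total_similarity(target, c, weights))


target = {"xpath": "/html/body/div[1]/p[2]", "tags": ["p"],
          "text": "Department: Ophthalmology"}
candidates = [
    {"xpath": "/html/body/section[1]/ul[1]/li[1]", "tags": ["li"], "text": "Zheng"},
    {"xpath": "/html/body/section[1]/ul[1]/li[2]", "tags": ["li"],
     "text": "Department: Cardiology"},
]
matched = best_match(target, candidates)
```

In this sample both candidates share the same path and structure scores, so the text similarity decides the match. The same `best_match` helper applies to regions (step S24) and to data items (step S28), differing only in what the compared dictionaries describe.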

2. The method for adaptively extracting web page structured data according to claim 1, wherein the encapsulating extraction template specifically comprises the following steps:

3. The method for adaptively extracting web page structured data according to claim 2, wherein in step S13, Json is expressed as:

Json＝<name₁:value₁,name₂:value₂,...,name_n:value_n>；

the DOMTree is expressed as:

DOMTree＝<Node₁,Node₂,…,Node_n>；

one Node in the given DOM tree is represented as:

Node＝<tag,Father,Child,xpath,text,Attri>；

given a characteristic attribute Attri of a node, it is expressed as:

Attri＝<id,class,x,y,w,h>；

given a path xpath of a Node, it is represented as a sequence:

xpath＝</tag₁[x₁]/tag₂[x₂]/…/tag_n[x_n]>；

in the formula, tag_i indicates the tag name on the path, and x_i indicates the position of that node among the nodes at the same level of the DOM tree.
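The Node, Attri and xpath tuples above can be modelled, for illustration only, with Python dataclasses. The field types, the default values, the rename of `class` to `cls` and the helper `parse_xpath` are assumptions introduced for this sketch and are not part of the claim.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Attri:
    """Attri = <id, class, x, y, w, h>; types are assumed."""
    id: str = ""
    cls: str = ""  # "class" is a Python keyword, hence the rename
    x: int = 0
    y: int = 0
    w: int = 0
    h: int = 0


@dataclass
class Node:
    """Node = <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    child: List["Node"] = field(default_factory=list)


# One step of the form tag[x], e.g. "div[2]".
STEP = re.compile(r"([A-Za-z][\w-]*)\[(\d+)\]")


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag_1[x_1]/.../tag_n[x_n]' into (tag, position) pairs."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = STEP.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported step: {part!r}")
        steps.append((match.group(1), int(match.group(2))))
    return steps


node = Node(tag="p", xpath="/html[1]/body[1]/div[2]/p[3]", text="Ophthalmology")
steps = parse_xpath(node.xpath)
```

Keeping the path as a list of (tag, position) pairs makes it straightforward to compare two paths step by step when a path similarity is computed.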