CN110968761B - Webpage structured data self-adaptive extraction method - Google Patents


Publication number
CN110968761B
Authority
CN
China
Prior art keywords
node
similarity
area
data
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911196582.4A
Other languages
Chinese (zh)
Other versions
CN110968761A (en)
Inventor
陈星�
郭莹楠
杨植
郑勇杰
陈晓娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911196582.4A priority Critical patent/CN110968761B/en
Publication of CN110968761A publication Critical patent/CN110968761A/en
Priority to PCT/CN2020/101247 priority patent/WO2021103557A1/en
Application granted granted Critical
Publication of CN110968761B publication Critical patent/CN110968761B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a self-adaptive method for extracting structured data from webpages. An extraction template is first encapsulated, and the template is used to judge whether the structure of the target webpage has changed. If the structure has not changed, the data are located in the target webpage by the paths recorded in the extraction template. If the structure has changed, the similarity between the designated area of the extraction template and every area of the target webpage is calculated, and the area with the highest similarity is taken as the candidate area. The data items are then mapped within the candidate area: the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty is calculated, and each data item is mapped to the node with the highest similarity. The invention can still extract the target data correctly after the structure of the webpage changes.

Description

Self-adaptive extraction method for webpage structured data
Technical Field
The invention relates to the field of extraction of web page structured data of the Internet of things, in particular to a self-adaptive extraction method of web page structured data.
Background
The Internet is a huge resource bank. The number of Web pages has reached hundreds of billions and keeps growing rapidly every hour. The rapid development of the Internet has led to explosive growth of information, and the Web, as the main carrier of Internet information, is full of information of all kinds. To collect the useful information contained in Web pages, various Web data extraction techniques have been proposed.
However, current Web data extraction techniques generally target one specific Web page structure. When a Web page is iteratively updated and its structure changes, the page information can no longer be extracted, or wrong information is extracted.
Disclosure of Invention
In view of the above, the present invention provides a method for adaptively extracting web page structured data, which can still correctly extract target data after a web page structure changes.
The invention is realized by adopting the following scheme: a webpage structured data self-adaptive extraction method comprises the following steps:
encapsulating an extraction template, and judging from the extraction template whether the structure of the target webpage has changed; if the structure has not changed, locating the data in the target webpage by the paths of the data recorded in the extraction template; if the structure has changed, calculating the similarity between the designated area of the extraction template and every area of the target webpage, taking the area with the highest similarity as the candidate area, and mapping the data items within the candidate area, that is, calculating the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty, and mapping each data item to the node with the highest similarity.
Further, encapsulating the extraction template specifically comprises the following steps:
step S11: inputting the target webpage, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the webpage, and parses it to generate a DOM tree;
step S12: finding, according to the input labeling information, the designated subtree of the DOM tree that contains the data to be extracted;
step S13: crawling the information of the subtree and storing it as a file Template in a specific format comprising Json and DOMTree, wherein Json is the structured representation of the data to be extracted from the specific area of the webpage, and DOMTree is the DOM subtree of that area.
Further, in step S13, the Json is expressed as:
Json = <name_1: value_1, name_2: value_2, ..., name_n: value_n>;
where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
the DOMTree is expressed as:
DOMTree = <Node_1, Node_2, ..., Node_n>;
where Node_i is a node of the tree and Node_1 is the root node of the subtree;
a node Node in the DOM tree is expressed as:
Node = <tag, Father, Child, xpath, text, Attri>;
where tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the feature attributes of the node;
the feature attributes Attri of a node are expressed as:
Attri = <id, class, x, y, w, h>;
where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance from the node to the left edge of the page, y is the distance from the node to the top of the page, w is the width of the area the node occupies in the page, and h is the height of that area;
the path xpath of a node is expressed as the sequence:
path = </tag_1[x_1]/tag_2[x_2]/.../tag_n[x_n]>;
where tag_i is a tag name on the path and x_i indicates that the node is the x_i-th node at its level of the DOM tree.
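The template structures defined above can be sketched as plain data classes. This is only an illustrative sketch: the class and field names mirror the tuples in the text, while the Python types, the default values, the field name cls (used because class is a reserved word in Python) and the helper add_child are assumptions that do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attri:
    """Feature attributes of a node: Attri = <id, class, x, y, w, h>."""
    id: str = ""
    cls: str = ""     # the "class" attribute; renamed because class is a keyword
    x: float = 0.0    # distance from the left edge of the page
    y: float = 0.0    # distance from the top of the page
    w: float = 0.0    # width of the area occupied in the page
    h: float = 0.0    # height of the area occupied in the page


@dataclass
class Node:
    """Node = <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    # The parent link is excluded from repr and comparison to avoid cycles.
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    child: List["Node"] = field(default_factory=list)

    def add_child(self, node: "Node") -> "Node":
        """Hypothetical helper: attach a child and set its parent link."""
        node.father = self
        self.child.append(node)
        return node


@dataclass
class Template:
    """Extraction template: labelled name/value pairs plus the subtree nodes."""
    name: str
    json: Dict[str, str]
    dom_tree: List[Node]  # dom_tree[0] is the root node of the subtree
```

A template for a single labelled value would then hold the pair in json and the root node followed by its descendants in dom_tree.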
Further, judging from the extraction template whether the structure of the target webpage has changed specifically comprises:
reading the json string and all node information of the subtree from the extraction template and parsing the node information into a DOM tree; calling the JS script to extract the information of all nodes in the target page, and parsing it to generate a DOM tree;
finding, in the DOM tree of the target page, the subtree located at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold, the structure of the target webpage is considered unchanged; otherwise, it is considered changed.
Further, the calculating the similarity between the specified area of the extracted template and all areas of the target webpage, and taking the area with the highest similarity as a candidate area specifically comprises the following steps:
step S21: judging the path similarity between the designated area and each area in the target webpage;
step S22: judging the structural similarity between the designated area and each area in the target webpage;
step S23: judging the text similarity between the designated area and each area in the target webpage;
step S24: and for each region in the target webpage, respectively carrying out weighting calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity between the region and the specified region, and selecting the region with the highest total similarity as a candidate region.
Further, mapping the data items within the candidate area, in which the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty is calculated and each data item is mapped to the node with the highest similarity, specifically comprises the following steps:
step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area;
step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;
step S23: calculating text similarity between each data item in the designated area and each data item in the candidate area;
step S24: for each data item in the designated area, weighting the path similarity, the structural similarity and the text similarity obtained in steps S21 to S23 by preset weights to obtain the total similarity between that data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area that corresponds to the data item in the designated area.
Compared with the prior art, the invention has the following beneficial effects: according to the method, the characteristic values of all areas of the webpage are extracted through page rendering, and information such as the DOM tree structure of the webpage and the text similarity is combined, so that the target data can still be correctly extracted after the webpage structure is changed.
Drawings
FIG. 1 is a schematic diagram of the method of an embodiment of the present invention.
Fig. 2 is example 1 of the JS scripts called by the system according to an embodiment of the present invention, where Algorithm 1 is a crawler script and Algorithm 2 is a tree search algorithm.
Fig. 3 is example 2 of the JS scripts called by the system according to an embodiment of the present invention, where Algorithm 3 is the in-region data item matching algorithm.
Fig. 4 is an example of a web page before and after updating according to an embodiment of the present invention, where (a) is before updating the web page and (b) is after updating the web page.
Fig. 5 is a schematic diagram of an extraction result of the method of the embodiment.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in fig. 1, the present embodiment provides a method for adaptively extracting web page structured data, including the following steps:
encapsulating an extraction template, and judging from the extraction template whether the structure of the target webpage has changed; if the structure has not changed, locating the data in the target webpage by the paths of the data recorded in the extraction template; if the structure has changed, calculating the similarity between the designated area of the extraction template and every area of the target webpage, taking the area with the highest similarity as the candidate area, and mapping the data items within the candidate area, that is, calculating the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty, and mapping each data item to the node with the highest similarity.
Preferably, the encapsulation of the extraction template can be described as follows: the URL of the page containing the information to be extracted, the labeling information of the data to be extracted and the name of the extraction template are input; the system finds the corresponding block on the webpage according to the labeled content, and encapsulates the feature values of the block together with the labeling information into an extraction template, which is stored as a file in a specific format. The step after encapsulation is the adaptive Web data extraction. The required template information is first read from the extraction template, including the information to be extracted from the specific area of the old page and the feature attributes of that area. The current page is parsed into a DOM tree with an existing crawler tool and its feature attributes are acquired. The area of the current page located at the path of the area specified by the extraction template is found, and the similarity of the two areas is calculated to judge whether they are similar. If they are similar, the structure of the page has not changed and the specified information is extracted. If the similarity is smaller than a specified threshold, the structure of the webpage is considered changed, and adaptive matching between the new and old webpages is performed. The adaptive matching can be divided into two stages: target area matching and intra-area data item mapping. Both stages comprise path similarity calculation, structural similarity calculation and text similarity calculation; computing the similarity between nodes from these three aspects improves the accuracy of the adaptation.
In this embodiment, encapsulating the extraction template specifically comprises the following steps:
step S11: inputting the target webpage, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the webpage, and parses it to generate a DOM tree;
step S12: finding, according to the input labeling information, the designated subtree of the DOM tree that contains the data to be extracted;
step S13: crawling the information of the subtree and storing it as a file Template in a specific format comprising Json and DOMTree, wherein Json is the structured representation of the data to be extracted from the specific area of the webpage, and DOMTree is the DOM subtree of that area.
Preferably, in step S11, the algorithm of the JS script called by the system is as shown in fig. 2.
In this embodiment, in step S13, Json is expressed as:
Json = <name_1: value_1, name_2: value_2, ..., name_n: value_n>;
where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
the DOMTree is expressed as:
DOMTree = <Node_1, Node_2, ..., Node_n>;
where Node_i is a node of the tree and Node_1 is the root node of the subtree;
a node Node in the DOM tree is expressed as:
Node = <tag, Father, Child, xpath, text, Attri>;
where tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the feature attributes of the node;
the feature attributes Attri of a node are expressed as:
Attri = <id, class, x, y, w, h>;
where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance from the node to the left edge of the page, y is the distance from the node to the top of the page, w is the width of the area the node occupies in the page, and h is the height of that area;
the path xpath of a node is expressed as the sequence:
path = </tag_1[x_1]/tag_2[x_2]/.../tag_n[x_n]>;
where tag_i is a tag name on the path and x_i indicates that the node is the x_i-th node at its level of the DOM tree.
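For the similarity calculations that follow it is convenient to split a stored path into its (tag, index) steps. The sketch below assumes that every step has the form tag[index], as in the sequence above; the function name parse_xpath and the regular expression are illustrative choices, not part of the patent.

```python
import re
from typing import List, Tuple

# One path step of the form tag[index], for example div[2].
_STEP = re.compile(r"^([A-Za-z][\w-]*)\[(\d+)\]$")


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag_1[x_1]/.../tag_n[x_n]' into a list of (tag, index) pairs."""
    trimmed = xpath.strip("/")
    if not trimmed:
        return []
    steps = []
    for part in trimmed.split("/"):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported path step: {part!r}")
        steps.append((match.group(1), int(match.group(2))))
    return steps
```

Paths that omit the index or use other XPath features are rejected by this sketch rather than guessed at.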
Preferably, once the extraction template has been obtained, the required data can be extracted from a webpage by inputting the name of the template and the URL of the target webpage. The adaptive Web data extraction can be divided into 3 steps: 1. Read the json string and all node information of the subtree from the extraction template and parse the node information into a DOM tree; call the JS script to extract the information of all nodes in the target page, and parse it to generate a DOM tree. 2. Find, in the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judge whether the structure has changed between the two subtrees. If the similarity is greater than a specified threshold, the structure of the webpage has not changed, and the data are located in the target webpage by the paths of the data in the extraction template; if the similarity is smaller than the specified threshold, the adaptive matching stage starts. 3. In the adaptive stage, calculate the similarity between the designated area of the extraction template and every area of the target webpage. The similarity comprises path similarity, structural similarity and text similarity, and the total similarity is their weighted average. The area with the highest similarity is taken as the candidate area, and the data items are mapped within it. The similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty is calculated, again as a weighted average of path similarity, structural similarity and text similarity, and each data item is mapped to the node with the highest similarity.
In this embodiment, judging from the extraction template whether the structure of the target webpage has changed specifically comprises:
reading the json string and all node information of the subtree from the extraction template and parsing the node information into a DOM tree; calling the JS script to extract the information of all nodes in the target page, and parsing it to generate a DOM tree;
finding, in the DOM tree of the target page, the subtree located at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold, the structure of the target webpage is considered unchanged; otherwise, it is considered changed.
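The decision described above can be sketched as a small dispatch function. The sketch assumes that the subtree similarity has already been computed and that the target page is available as a mapping from node path to text; the function name extract_with_template and its parameters are hypothetical, and the adaptive stage is passed in as a callable rather than implemented here.

```python
from typing import Callable, Dict, Optional


def extract_with_template(
    data_paths: Dict[str, str],
    page_text_by_path: Dict[str, str],
    subtree_similarity: float,
    threshold: float,
    adaptive_match: Callable[[], Dict[str, Optional[str]]],
) -> Dict[str, Optional[str]]:
    """Extract by stored path if the structure is unchanged, else adapt."""
    if subtree_similarity > threshold:
        # Structure considered unchanged: look each data item up by its path.
        return {name: page_text_by_path.get(path)
                for name, path in data_paths.items()}
    # Structure considered changed: defer to the adaptive matching stage.
    return adaptive_match()
```

Keeping the adaptive stage behind a callable makes the threshold test easy to check in isolation.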
In this embodiment, regarding target area matching, a webpage can be divided structurally into several areas, that is, the DOM tree of the webpage is divided into several subtrees. The extraction template stores the feature attributes of the designated area, crawled from the webpage in advance, together with the structure of the data to be extracted. Comparing area similarity means calculating the similarity between all feature values and attributes of the designated area stored in the template and the feature values and attributes of every area of the input webpage; the area with the highest similarity is regarded as the designated area after the iterative update of the webpage.
Specifically, calculating the similarity between the designated area of the extraction template and every area of the target webpage and taking the area with the highest similarity as the candidate area comprises the following steps:
step S21: judging the path similarity between the designated area and each area in the target webpage. Observation of a large number of webpages before and after iterative updates shows that, even when the webpage structure changes, most sub-blocks of a webpage only move near their original position, so the path similarity of two areas can serve as an indicator of area similarity. The formula is constructed with the DOM tree paths of the root nodes of the two areas as its two variables. A webpage DOM tree path is usually regarded as the sequence of tags from the root node to a leaf node. The traditional DOM tree path matching model calculates the similarity of path sequences by path matching and considers only sequence matching, ignoring the position of the tree path within the DOM tree. This does not reflect reality, and the resulting similarity cannot truly and effectively reflect how similar the paths actually are. This embodiment therefore proposes an improved path similarity calculation. For two tree paths:
xpath_i = </tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
xpath_tar = </tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
the DOM tree path similarity between them is defined as follows:
sim(xpath_i, xpath_tar) = st(xpath_i, xpath_tar) * ω_1 + sp(xpath_i, xpath_tar) * (1 - ω_1);
where
[Formula image not reproduced: definition of st(xpath_i, xpath_tar)]
represents the similarity of the tag sequences of the tree paths; path_i(tagName_i) ∩ path_tar(tagName_j) denotes the length of the longest common tag sequence of the two paths starting from the root node, and len(path_i) denotes the length of the tag sequence of path_i;
[Formula image not reproduced: definition of sp(xpath_i, xpath_tar)]
represents the position similarity of the two tree paths; it uses the number of nodes that have the same sequence number at the same level in the longest common tag sequence of the two paths starting from the root node.
The path similarity consists mainly of the two parts st(path_i, path_tar) and sp(path_i, path_tar), which reflect the tag sequence and the position information respectively. ω_1 is the weight between the two, with a value range of 0 to 1; changing ω_1 adjusts the importance of the two parts in the path similarity.
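A minimal sketch of the path similarity follows. Only the weighted combination st * ω_1 + sp * (1 - ω_1) is taken from the text. Because the exact formulas for st and sp are not reproduced here, the sketch assumes that both are normalized by the length of the longer path: st as the length of the common tag prefix over that length, and sp as the number of prefix steps with equal sibling index over that length. The function names and the default weight are likewise assumptions.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (tag name, sibling index) for one path step


def common_prefix_len(a: List[Step], b: List[Step]) -> int:
    """Length of the longest common tag sequence starting from the root."""
    n = 0
    for (tag_a, _), (tag_b, _) in zip(a, b):
        if tag_a != tag_b:
            break
        n += 1
    return n


def path_similarity(a: List[Step], b: List[Step], w1: float = 0.5) -> float:
    """sim = st * w1 + sp * (1 - w1); normalization by max length is assumed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty paths are treated as identical
    k = common_prefix_len(a, b)
    st = k / longest
    # Count steps inside the common prefix whose sibling index also matches.
    same_pos = sum(1 for (_, ia), (_, ib) in zip(a[:k], b[:k]) if ia == ib)
    sp = same_pos / longest
    return st * w1 + sp * (1 - w1)
```

With these assumptions, two paths that share all tags but differ in one sibling index out of four score 1.0 for st and 0.75 for sp.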
Step S22: judging the structural similarity between the designated area and each area in the target webpage. The structural similarity between areas mainly considers a virtual structure and a real structure, namely the structure of the DOM tree and the visual structure of the rendered webpage. It therefore consists of two parts: the tree structure similarity, and the coordinates and size of the area in the webpage. The tree structure similarity covers whether the parent nodes are consistent, a comparison of the total number of nodes in the trees, and a comparison of the DOM tree heights. The coordinates and size of an area in the webpage cover its height, its width, its distance from the top of the page and its distance from the left side of the page. For two areas, the structural similarity between them is defined as follows:
sim(treestru_i, treestru_tar) = st(T_i, T_tar) * ω + sp(T_i, T_tar) * (1 - ω);
where
[Formula image not reproduced: definition of st(T_i, T_tar)]
represents the similarity of the DOM tree structures; equal(root_i, root_tar) indicates whether the root nodes of the two areas are consistent, T_i(node) denotes the total number of nodes contained in T_i, and H(T_i) denotes the height of T_i, that is, the number of node levels of the DOM tree. ω_i (i = 1, 2, 3) are the weights between them, each with a value range of 0 to 1.
Further,
[Formula image not reproduced: definition of sp(T_i, T_tar)]
represents the similarity of the size and coordinates of the two areas in the whole page; height(T_i) denotes the height of the area, width(T_i) denotes the width of the area, top(T_i) denotes the distance of the area represented by T_i from the top of the page, and left(T_i) denotes its distance from the left side of the page. ω_i (i = 1, 2, 3, 4) are the weights between them, each with a value range of 0 to 1.
The structural similarity between areas consists mainly of the two parts st(T_i, T_tar) and sp(T_i, T_tar), which represent the DOM tree structure information and the graphical interface layout information respectively. ω is the weight between the two, with a value range of 0 to 1; changing ω adjusts the importance of the two parts in the structural similarity.
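The structural similarity between areas can be sketched as below. The outer combination st * ω + sp * (1 - ω) and the lists of compared quantities follow the text. How two magnitudes are compared is not reproduced here, so the sketch assumes a min/max ratio, and it assumes that the inner weights sum to 1; the Region class, the ratio helper and all default weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Region:
    root_tag: str      # tag of the region's root node
    node_count: int    # total number of nodes in the subtree
    tree_height: int   # number of node levels in the subtree
    height: float      # rendered height
    width: float       # rendered width
    top: float         # distance from the top of the page
    left: float        # distance from the left side of the page


def ratio(a: float, b: float) -> float:
    """Assumed comparison of two non-negative magnitudes: min/max, 1.0 if both are 0."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def structure_similarity(
    r: Region,
    t: Region,
    w: float = 0.5,
    wt: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
    wp: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """sim = st * w + sp * (1 - w), with st over the tree and sp over the layout."""
    st = (wt[0] * float(r.root_tag == t.root_tag)
          + wt[1] * ratio(r.node_count, t.node_count)
          + wt[2] * ratio(r.tree_height, t.tree_height))
    sp = (wp[0] * ratio(r.height, t.height)
          + wp[1] * ratio(r.width, t.width)
          + wp[2] * ratio(r.top, t.top)
          + wp[3] * ratio(r.left, t.left))
    return st * w + sp * (1 - w)
```

The min/max ratio keeps every term in the range 0 to 1, so the result stays in that range whenever the weights are in range and sum to 1.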
Step S23: judging the text similarity between the designated area and each area in the target webpage. Text similarity is also a measure of the similarity between areas, and this embodiment uses the synonym forest to calculate the similarity between words. The synonym forest classifies words semantically and organizes them into a five-level tree structure, in which each group of synonyms is given an eight-digit code. The structure captures synonymy as well as hypernym and hyponym relations between word senses. At the fifth level the words are grouped, and one character is appended to the code to mark whether the words of the group are synonyms ("="), words of the same class ("#"), or whether the group contains only one word ("@"). Based on this encoding rule, this embodiment calculates the similarity of Chinese text as follows. The text within an area can be viewed as a sentence composed of several words, so calculating text similarity is essentially calculating sentence similarity. This embodiment therefore obtains the text similarity from the word-to-text similarity sim(w, text_tar):
sim(w, text) = max(sim(w, word_1), ..., sim(w, word_k)),
[Formula image not reproduced: definition of the text similarity between text_i and text_tar]
where w is a word and text is all the text in an area, containing k words; sim(w, word_i) is the similarity of two words. text_i and text_tar are all the text in the two areas, defined as text_i = {w_i,1, w_i,2, ..., w_i,m} and text_tar = {w_tar,1, w_tar,2, ..., w_tar,n}, where m and n are the numbers of words into which text_i and text_tar are split. The text similarity contains two metrics: the similarity of the text contents, and a comparison of the lengths of all the text in the two areas. ω is the weight between the two parts, with a value range of 0 to 1; changing ω adjusts the importance of the two parts in the text similarity.
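The text similarity can be sketched with a pluggable word similarity. The word-to-text rule sim(w, text) = max over the words of the text is taken from the formula above, as is the weighting of a content part against a length part. The sketch assumes that the content part is the mean of sim(w, text_tar) over the words of text_i and that the length part is a min/max ratio of word counts. The exact_match function is only a stand-in for the synonym forest similarity, which is not implemented here.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Stand-in word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def word_to_text(word: str, text: Sequence[str],
                 word_sim: WordSim = exact_match) -> float:
    """sim(w, text) = max(sim(w, word_1), ..., sim(w, word_k))."""
    return max((word_sim(word, other) for other in text), default=0.0)


def text_similarity(text_i: Sequence[str], text_tar: Sequence[str],
                    w: float = 0.5, word_sim: WordSim = exact_match) -> float:
    """Weighted mix of content similarity and length comparison (both assumed)."""
    if not text_i and not text_tar:
        return 1.0
    if not text_i or not text_tar:
        return 0.0
    content = sum(word_to_text(x, text_tar, word_sim) for x in text_i) / len(text_i)
    m, n = len(text_i), len(text_tar)
    length = min(m, n) / max(m, n)
    return content * w + length * (1 - w)
```

Any function that returns a value between 0 and 1 for two words can be passed as word_sim, so a thesaurus-based similarity can replace the stand-in without changing the rest.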
Step S24: and for each region in the target webpage, respectively carrying out weighting calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity between the region and the specified region, and selecting the region with the highest total similarity as a candidate region.
The total similarity is calculated with the following formula:
[Formula image not reproduced: total similarity as the weighted combination of the path, structural and text similarities]
This calculation yields the area with the highest similarity to the target area, which is regarded as the suspected target area. If its similarity is greater than a certain threshold, the matching of data items within the area is calculated next; if the similarity is smaller than the threshold, the target area cannot be found in the updated webpage. First, the nodes in an area whose text content is not empty are taken to form the set of nodes to be matched, defined as:
Items = <node_1, node_2, ..., node_k>;
the set contains k nodes node_i. Fig. 3 shows the algorithm for data item matching in this embodiment.
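Candidate selection and the construction of Items can be sketched as follows. The weighted sum of the three similarities, the choice of the highest-scoring area, the threshold test and the filter on non-empty text follow the text; the function names, the equal default weights and the representation of nodes as (xpath, text) pairs are assumptions made for the sketch.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def total_similarity(path_sim: float, struct_sim: float, text_sim: float,
                     weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted combination of the three similarities (weights are assumed)."""
    return weights[0] * path_sim + weights[1] * struct_sim + weights[2] * text_sim


def pick_candidate(scores: Dict[str, Tuple[float, float, float]],
                   threshold: float,
                   weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> Optional[str]:
    """Return the id of the best area, or None if it does not pass the threshold."""
    best_id, best = None, float("-inf")
    for region_id, parts in scores.items():
        score = total_similarity(*parts, weights=weights)
        if score > best:
            best_id, best = region_id, score
    return best_id if best > threshold else None


def items_to_match(nodes: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (xpath, text) nodes whose text content is not empty."""
    return [(xpath, text) for xpath, text in nodes if text.strip()]
```

Returning None when the best area fails the threshold corresponds to the case in which the target area cannot be found in the updated page.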
In this embodiment, mapping the data items within the candidate area, in which the similarity between the node corresponding to each data item and every node of the target webpage whose text content is not empty is calculated and each data item is mapped to the node with the highest similarity, specifically comprises the following steps:
step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area. The path similarity between data items is calculated with the path similarity formula constructed above, except that the parameter path here is the path within the area rather than the path in the whole webpage, that is:
path = xpath - xpath_root
where xpath_root is the path of the root node of the area. The path similarity of the data items is calculated as:
[Formula image not reproduced: path similarity of the data items]
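The subtraction path = xpath - xpath_root can be read as removing the area root's path as a prefix. The sketch below takes that reading as an assumption; the function name relative_path and the choice to raise an error when the node lies outside the area are illustrative.

```python
def relative_path(xpath: str, xpath_root: str) -> str:
    """Return the part of xpath below xpath_root (prefix removal is assumed)."""
    root = xpath_root.rstrip("/")
    if xpath == root:
        return ""  # the node is the root of the area itself
    # Require a step boundary after the prefix so that div[2] does not match div[20].
    if not xpath.startswith(root + "/"):
        raise ValueError("node path does not lie inside the area")
    return xpath[len(root):]
```

The resulting in-area path keeps the same tag[index] step format, so it can be compared with the same path similarity as whole-page paths.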
step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area. The structural similarity between data items mainly considers the tag attributes of the page and the relative position within the area. The tag attributes cover whether the tag names are consistent, whether the tag ids are consistent, and whether the font type, font size and color in the css style are consistent. The relative position within the area covers a comparison of the distances from the top of the page and of the distances from the left side of the page. The structural similarity of two data items is defined as follows:
[Formula image not reproduced: structural similarity of two data items]
where
[Formula image not reproduced: tag attribute similarity of the data items]
represents the similarity of the tag attributes of the data items. equal(tagName_i, tagName_tar) indicates whether the tag names are consistent, 1 if so and 0 otherwise; equal(id_i, id_tar) indicates whether the tag ids are consistent, 1 if so and 0 otherwise. equal(font-family_i, font-family_tar), equal(font-size_i, font-size_tar) and equal(font-color_i, font-color_tar) indicate whether the font types, the font sizes and the colors are consistent respectively, 1 if so and 0 otherwise. ω_i are the weights between them, each with a value range of 0 to 1.
[Formula image not reproduced: relative position similarity of the data items]
represents the similarity of the relative positions of the data items within the area, in terms of the distance of a data item from the top of the page and its distance from the left side of the page. The two are equally important, so each has a weight of 0.5.
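The structural similarity of two data items can be sketched as below. The five equal(...) indicators and the 0.5 weights on the top and left comparisons follow the text. The sketch assumes that the tag attribute part and the position part are combined as st * ω + sp * (1 - ω), by analogy with the area-level formula, that the attribute weights are equal, and that distances are compared by a min/max ratio; the Item class and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Item:
    tag_name: str
    id: str
    font_family: str
    font_size: str
    font_color: str
    top: float    # distance from the top of the page
    left: float   # distance from the left side of the page


def ratio(a: float, b: float) -> float:
    """Assumed comparison of two non-negative distances: min/max, 1.0 if both are 0."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def item_structure_similarity(
    a: Item, b: Item, w: float = 0.5,
    wa: Sequence[float] = (0.2, 0.2, 0.2, 0.2, 0.2),
) -> float:
    """Tag attribute indicators mixed with relative position (mix is assumed)."""
    flags = (a.tag_name == b.tag_name, a.id == b.id,
             a.font_family == b.font_family, a.font_size == b.font_size,
             a.font_color == b.font_color)
    st = sum(weight * float(flag) for weight, flag in zip(wa, flags))
    # Top and left are stated to be equally important: 0.5 each.
    sp = 0.5 * ratio(a.top, b.top) + 0.5 * ratio(a.left, b.left)
    return st * w + sp * (1 - w)
```

In this sketch an item that keeps its tag name and fonts but changes its id and color still scores 0.6 on the attribute part.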
Step S23: calculating the text similarity between each data item in the designated area and each data item in the candidate area; the text similarity calculation for the data items also uses the formula defined above, except that the text content text contains only the text content of a single data item, rather than all the text within the entire region.
$sim(word, nodeText) = \max\big(sim(word, word_1), \ldots, sim(word, word_s)\big)$,
Figure GDA0003656223790000152
where $nodeText$ is the text contained in a single node and consists of $s$ words, and $sim(word, word_i)$ is the similarity of two words. $nodeText_i$ and $nodeText_{tar}$ are the texts of the two nodes, defined as $nodeText_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,p}\}$ and $nodeText_{tar} = \{w_{tar,1}, w_{tar,2}, \ldots, w_{tar,q}\}$, where $p$ and $q$ are the numbers of words into which $nodeText_i$ and $nodeText_{tar}$ are split. Text similarity comprises two metrics: the similarity of the text contents and a comparison of the text lengths. $\omega$ is the weight between the two parts, with a value in the range 0 to 1; changing $\omega$ adjusts the relative importance of the two parts in the text similarity.
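The word-to-text maximum and the two-part text similarity can be illustrated with a short Python sketch. Several parts are assumptions rather than the patent's definitions: the word similarity uses the standard library's `difflib.SequenceMatcher` as a stand-in, texts are split on whitespace (Chinese text would need a word segmenter), the content score averages the per-word maxima, and the length comparison is taken as the ratio of the shorter word count to the longer one. The default `omega=0.7` is likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def word_sim(a: str, b: str) -> float:
    """Stand-in word similarity in [0, 1]; not the patent's measure."""
    return SequenceMatcher(None, a, b).ratio()


def word_to_text_sim(word: str, node_words: List[str]) -> float:
    """sim(word, nodeText): the best match of `word` among the node's words."""
    return max((word_sim(word, w) for w in node_words), default=0.0)


def content_sim(words_i: List[str], words_tar: List[str]) -> float:
    """Assumed aggregation: average of the per-word maxima."""
    if not words_i:
        return 0.0
    return sum(word_to_text_sim(w, words_tar) for w in words_i) / len(words_i)


def length_sim(words_i: List[str], words_tar: List[str]) -> float:
    """Assumed length comparison: shorter word count over longer word count."""
    p, q = len(words_i), len(words_tar)
    return 1.0 if max(p, q) == 0 else min(p, q) / max(p, q)


def text_similarity(text_i: str, text_tar: str, omega: float = 0.7) -> float:
    """Weighted mix of content similarity and length similarity."""
    words_i, words_tar = text_i.split(), text_tar.split()
    return (omega * content_sim(words_i, words_tar)
            + (1 - omega) * length_sim(words_i, words_tar))
```

Raising `omega` makes the score depend more on the words themselves, and lowering it makes the score depend more on whether the two texts have a similar length.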
Step S24: for each data item in the designated area, the path similarity, structural similarity, and text similarity obtained in steps S21 to S23 are weighted according to preset weights to obtain the total similarity between that data item and each data item in the candidate area, and the candidate data item with the highest total similarity is selected as the one corresponding to the data item in the designated area. That is, the similarity between every data item in the candidate area and the data items in the specific area specified by the configuration file is calculated, and the three measures are combined with given weights into the total similarity by the following formula:
Figure GDA0003656223790000161
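The weighted combination and the selection of the best candidate can be sketched as below. The weight values `(0.3, 0.3, 0.4)` and the function names are hypothetical; the source states only that the three similarities are combined with preset weights and that the candidate with the highest total is chosen.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float, float]  # (path, structural, text) similarity


def total_similarity(path_sim: float, struct_sim: float, text_sim: float,
                     weights: Scores = (0.3, 0.3, 0.4)) -> float:
    """Weighted sum of the three similarities (weights are assumed values)."""
    w_path, w_struct, w_text = weights
    return w_path * path_sim + w_struct * struct_sim + w_text * text_sim


def best_match(candidates: Dict[str, Scores]) -> str:
    """Return the key of the candidate data item with the highest total."""
    return max(candidates, key=lambda key: total_similarity(*candidates[key]))
```

For example, given `{"div[1]": (0.9, 0.2, 0.1), "div[2]": (0.5, 0.8, 0.9)}`, the totals are 0.37 and 0.75, so `best_match` selects `"div[2]"` even though `"div[1]"` has the more similar path.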
In particular, to better illustrate the effect of the embodiment, fig. 4 shows an example of a web page before and after an update. The structure of the web page changes greatly: the position and size of the target area change, and the data content to be extracted changes as well. With a traditional web page data extraction algorithm, the target block needed by the user cannot be located after the page structure changes, and the correspondence between the data items of the new and old versions of the page cannot be found, which hinders large-scale data extraction. This embodiment aims to monitor changes of the web page in real time and to adaptively adjust the extraction template to follow updates of the page.
The feasibility of the method of this embodiment is discussed using this example, in which the doctor's information is extracted from the web page. First, the url of the website, the annotated json data {'recommendation heat (comprehensive)': '3.5', 'thank-you letters': '1', 'gifts': '0', 'department': 'Department of Ophthalmology, Affiliated Hospital of Southwest Medical University', 'specialty': 'correction and prevention of keratopathy, corneal refractive surgery, uveal disease, ametropia, ocular laser examination and treatment', 'brief introduction': 'Zheng, female, associate chief physician, associate professor, Master of Medicine, member of the Chinese Medical Association, engaged in clinical medical treatment, teaching and scientific research for over 10 years.'}, and the name 'sector' of the extraction template are input into the system, which generates the corresponding extraction template. When the information of the region is to be extracted, the name of the extraction template and the web page url are input. The system parses the web page into a DOM tree storing all node information of the page, finds the region of the current page under the region path specified by the extraction template, and calculates the similarity of the two regions. If the similarity is greater than the specified threshold, the structure of the web page has not changed and the specified information is extracted; if the similarity is smaller than the specified threshold, the structure of the web page has changed, and adaptive matching of the new and old pages and updating of the extraction template are performed. The information extracted before and after the update of the web page shown in fig. 4 is shown in fig. 5, where (a) is the data extracted before the update and (b) is the data extracted after the update.
It can be seen from the figure that the method of the embodiment can still effectively extract data under the condition that the structure of the webpage is greatly changed.
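The extraction flow walked through in this example (compare regions, extract directly if the structure is unchanged, otherwise re-match and update the template) can be outlined in Python. This is a simplified sketch, not the embodiment's implementation: a page is modelled as a dictionary from region path to a dictionary of item path to text, and `region_similarity`, `adaptive_match`, the template keys, and the default `threshold=0.8` are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, Dict, Tuple

Region = Dict[str, str]      # item path -> text (toy model of a region)
Page = Dict[str, Region]     # region path -> region (toy model of a page)


def extract(template: dict, page: Page,
            region_similarity: Callable[[Region, Region], float],
            adaptive_match: Callable[[dict, Page], dict],
            threshold: float = 0.8) -> Tuple[Dict[str, str], dict]:
    """Return (extracted data, template), updating the template if needed."""
    region = page.get(template["region_path"])
    unchanged = (region is not None
                 and region_similarity(template["region"], region) > threshold)
    if not unchanged:
        # Structure changed: re-locate the region and update the template.
        template = adaptive_match(template, page)
        region = page[template["region_path"]]
    data = {name: region[path] for name, path in template["items"].items()}
    return data, template
```

Passing the similarity and matching steps as functions keeps the control flow visible on its own; in a fuller version they would be built from the path, structural, and text similarities of steps S21 to S24.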
In summary, when formulating the extraction template, the method provided by this embodiment defines not only the corresponding extraction rule but also an adaptive matching rule based on the text features, HTML tag features, visual features, and DOM tree structure features of the page data. The web page is matched against the corresponding extraction template, and after a successful match the data are extracted according to the extraction rule; if the page has changed and the xpath expression fails, the data are searched for again according to the adaptive matching rule and the xpath is updated. Experimental results show that the method has high accuracy and effectively reduces manual intervention in the extraction process.
The foregoing is directed to preferred embodiments of the present invention. Other and further embodiments of the invention may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change, or adaptation of the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.

Claims (3)

1. A webpage structured data self-adaptive extraction method is characterized by comprising the following steps:
encapsulating an extraction template, judging whether the structure of the target webpage is changed or not according to the extraction template, and if not, finding the data in the target webpage according to the path of the data in the extraction template; if the structure of the target webpage is changed, calculating the similarity between the designated area of the extraction template and all areas of the target webpage, taking the area with the highest similarity as a candidate area, mapping data items in the candidate area, and calculating the similarity between the node corresponding to each data item and all nodes with non-empty text contents in the target webpage, wherein each data item corresponds to the node with the highest similarity;
the specific step of judging whether the structure of the target webpage is changed according to the extraction template is as follows:
reading all node information of json strings and subtrees in the extracted template, analyzing the node information into a DOM tree, calling a JS script to extract all node information in a target page, and analyzing and generating the DOM tree;
finding subtrees under the path of the DOM tree root node generated by extracting the template, judging whether the structures of the two subtrees are changed, and if the similarity of the two subtrees is greater than a specified threshold value, the structure of the target webpage is not changed; otherwise, the structure of the target webpage is changed;
the method for calculating the similarity between the specified area of the extracted template and all areas of the target webpage and taking the area with the highest similarity as the candidate area specifically comprises the following steps of:
step S21: judging the path similarity between the designated area and each area in the target webpage;
step S22: judging the structural similarity between the designated area and each area in the target webpage;
step S23: judging the text similarity between the designated area and each area in the target webpage;
step S24: for each region in the target webpage, respectively performing weighted calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity of the region and a specified region, and selecting the region with the highest total similarity as a candidate region;
the mapping of the data items in the candidate area is performed, similarity calculation is performed on the node corresponding to each data item and all nodes with text contents not empty in the target webpage, and the node with the highest similarity corresponding to each data item specifically comprises the following steps:
step S25: calculating the path similarity between each data item in the designated area and each data item in the candidate area;
step S26: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;
step S27: calculating the text similarity between each data item in the designated area and each data item in the candidate area;
step S28: and for each data item in the designated area, respectively performing weighted calculation on the path similarity, the structure similarity and the text similarity in the steps S25 to S27 according to preset weights to obtain the total similarity between the data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area.
2. The method for adaptively extracting web page structured data according to claim 1, wherein the encapsulating extraction template specifically comprises the following steps:
step S11: inputting a target webpage, data to be extracted and the name of an extraction template, calling a JS script by a system to extract information of all nodes in the webpage, and analyzing and generating a DOM tree;
step S12: finding out a designated sub-tree containing data to be extracted in the DOM tree according to the input labeling information;
step S13: crawling the information of the subtree and storing it as a file Template in a specific format, wherein Json represents the structured representation of the data to be extracted from the specific area of the webpage, and DOMTree represents the DOM tree subtree of the specific area of the webpage.
3. The method for adaptively extracting web page structural data according to claim 2, wherein in step S13, Json represents:
$Json = \langle name_1 : value_1,\ name_2 : value_2,\ \ldots,\ name_n : value_n \rangle$;
in the formula, $name_i$ is the name of the data to be extracted, and $value_i$ is the data value corresponding to the data name;
the DOMTree is expressed as:
$DOMTree = \langle Node_1, Node_2, \ldots, Node_n \rangle$;
in the formula, $Node_i$ is a node of the tree, wherein $Node_1$ is the root node of the subtree;
one Node in the given DOM tree is represented as:
$Node = \langle tag, Father, Child, xpath, text, Attri \rangle$;
in the formula, tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the characteristic attribute of the node;
given a characteristic attribute Attri of a node, it is expressed as:
$Attri = \langle id, class, x, y, w, h \rangle$;
in the formula, id is the page id of the node tag, class is the class name of the node tag, x is the distance between the node and the left border of the page, y is the distance between the node and the top of the page, w is the width of the area occupied by the node in the page, and h is the height of the area occupied by the node in the page;
given a path xpath of a Node, it is represented as a sequence:
$xpath = \langle /tag_1[x_1]/tag_2[x_2]/\ldots/tag_n[x_n] \rangle$;
in the formula, $tag_i$ indicates the tag name on the path, and $x_i$ indicates that the node is the $x_i$-th node at the same level of the DOM tree.
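As an illustration of the structures defined in claim 3, the tuples can be mirrored by Python dataclasses. This is a sketch only: the field names follow the claim, but the Python types, the renaming of `class` to `cls` (a reserved word in Python), the grouping of Json and DOMTree into a `Template` class, and the `parse_xpath` helper are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Attri:
    """Attri = <id, class, x, y, w, h>."""
    id: str = ""
    cls: str = ""    # the class name of the node's tag
    x: float = 0.0   # distance from the left border of the page
    y: float = 0.0   # distance from the top of the page
    w: float = 0.0   # width of the area occupied by the node
    h: float = 0.0   # height of the area occupied by the node


@dataclass
class Node:
    """Node = <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    # Excluded from repr/compare to avoid recursion through parent links.
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    child: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    """Assumed grouping of Json and DOMTree into one stored template."""
    json: Dict[str, str]   # name_i -> value_i
    dom_tree: List[Node]   # dom_tree[0] is the root of the subtree


_STEP = re.compile(r"^([A-Za-z][\w-]*)\[(\d+)\]$")


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag1[x1]/tag2[x2]/...' into (tag, index) pairs."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported xpath step: {part!r}")
        steps.append((match.group(1), int(match.group(2))))
    return steps
```

For instance, `parse_xpath("/html[1]/body[1]/div[2]")` yields `[("html", 1), ("body", 1), ("div", 2)]`, which is the sequence form that a path comparison between two nodes would operate on.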
CN201911196582.4A 2019-11-29 2019-11-29 Webpage structured data self-adaptive extraction method Active CN110968761B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911196582.4A CN110968761B (en) 2019-11-29 2019-11-29 Webpage structured data self-adaptive extraction method
PCT/CN2020/101247 WO2021103557A1 (en) 2019-11-29 2020-07-10 Adaptive extraction method for webpage structured data

Publications (2)

Publication Number Publication Date
CN110968761A CN110968761A (en) 2020-04-07
CN110968761B true CN110968761B (en) 2022-07-08

Also Published As

Publication number Publication date
CN110968761A (en) 2020-04-07
WO2021103557A1 (en) 2021-06-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant