CN110968761B - Webpage structured data self-adaptive extraction method - Google Patents
- Publication number
- CN110968761B (application CN201911196582.4A)
- Authority
- CN
- China
- Prior art keywords
- node
- similarity
- area
- data
- data item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention relates to a self-adaptive extraction method for web page structured data. An extraction template is first packaged, and the template is used to judge whether the structure of a target web page has changed. If the structure has not changed, the data is located in the target web page according to the path of the data recorded in the extraction template. If the structure has changed, the similarity between the designated area of the extraction template and every area of the target web page is calculated, the area with the highest similarity is taken as the candidate area, and the data items are mapped within the candidate area: the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty is calculated, and each data item is mapped to the node with the highest similarity. The invention can still correctly extract the target data after the structure of the web page changes.
Description
Technical Field
The invention relates to the field of extraction of web page structured data of the Internet of things, in particular to a self-adaptive extraction method of web page structured data.
Background
The Internet is a huge repository of resources. The number of Web pages has reached hundreds of billions and keeps growing every hour at an incredible speed. The rapid development of the Internet has caused information to grow explosively, and the Web, as the main carrier of Internet information, is full of information of all kinds. Various Web data extraction techniques have been proposed to collect the useful information contained in Web pages.
However, current Web data extraction techniques generally target only a specific Web page structure. When a Web page is iteratively updated, its structure may change, so that the page information can no longer be extracted or wrong information is extracted.
Disclosure of Invention
In view of the above, the present invention provides a method for adaptively extracting web page structured data, which can still correctly extract target data after a web page structure changes.
The invention is realized by adopting the following scheme: a webpage structured data self-adaptive extraction method comprises the following steps:
packaging an extraction template, and judging according to the extraction template whether the structure of the target web page has changed; if the structure has not changed, locating the data in the target web page according to the path of the data recorded in the extraction template; if the structure has changed, calculating the similarity between the designated area of the extraction template and every area of the target web page, taking the area with the highest similarity as the candidate area, and mapping the data items within the candidate area, that is, calculating the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty, and mapping each data item to the node with the highest similarity.
Further, the encapsulation extraction template specifically comprises the following steps:
step S11: inputting the target web page, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the web page, and parses it to generate a DOM tree;
step S12: finding, according to the input annotation information, the designated subtree of the DOM tree that contains the data to be extracted;
step S13: crawling the information of the subtree and storing it as a file Template in a specific format, comprising Json and DOMTree, where Json is the structured representation of the data to be extracted from the specific area of the web page, and DOMTree is the DOM subtree of that area.
Further, in step S13, Json is expressed as:
$\mathit{Json} = \langle \mathit{name}_1 : \mathit{value}_1,\ \mathit{name}_2 : \mathit{value}_2,\ \ldots,\ \mathit{name}_n : \mathit{value}_n \rangle$;
where $\mathit{name}_i$ is the name of a data item to be extracted and $\mathit{value}_i$ is the data value corresponding to that name;
DOMTree is expressed as:
$\mathit{DOMTree} = \langle \mathit{Node}_1,\ \mathit{Node}_2,\ \ldots,\ \mathit{Node}_n \rangle$;
where $\mathit{Node}_i$ is a node of the tree and $\mathit{Node}_1$ is the root node of the subtree;
a node in the DOM tree is expressed as:
$\mathit{Node} = \langle \mathit{tag},\ \mathit{Father},\ \mathit{Child},\ \mathit{xpath},\ \mathit{text},\ \mathit{Attri} \rangle$;
where tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the feature attributes of the node;
the feature attributes Attri of a node are expressed as:
$\mathit{Attri} = \langle \mathit{id},\ \mathit{class},\ x,\ y,\ w,\ h \rangle$;
where id is the page id of the node tag, class is the class name of the node tag, x is the distance between the node and the left edge of the page, y is the distance between the node and the top of the page, w is the width of the area occupied by the node in the page, and h is the height of the area occupied by the node in the page;
the path xpath of a node is expressed as a sequence:
$\mathit{path} = \langle /\mathit{tag}_1[x_1]/\mathit{tag}_2[x_2]/\ldots/\mathit{tag}_n[x_n] \rangle$;
where $\mathit{tag}_i$ is a tag name on the path and $x_i$ indicates that the node is the $x_i$-th node at its level of the DOM tree.
Further, the step of judging whether the structure of the target webpage is changed according to the extracted template specifically includes:
reading the Json string and the information of all nodes of the subtree from the extraction template and parsing the node information into a DOM tree; calling a JS script to extract the information of all nodes in the target page, and parsing it to generate a DOM tree;
locating, in the DOM tree of the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold, the structure of the target web page has not changed; otherwise, the structure of the target web page is considered to have changed.
Further, the calculating the similarity between the specified area of the extracted template and all areas of the target webpage, and taking the area with the highest similarity as a candidate area specifically comprises the following steps:
step S21: judging the path similarity between the designated area and each area in the target webpage;
step S22: judging the structural similarity between the designated area and each area in the target webpage;
step S23: judging the text similarity between the designated area and each area in the target webpage;
step S24: and for each region in the target webpage, respectively carrying out weighting calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity between the region and the specified region, and selecting the region with the highest total similarity as a candidate region.
Further, mapping the data items within the candidate region, that is, calculating the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty and mapping each data item to the node with the highest similarity, specifically comprises the following steps:
step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area;
step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;
step S23: calculating text similarity between each data item in the designated area and each data item in the candidate area;
step S24: for each data item in the designated area, performing weighted calculation on the path similarity, the structure similarity and the text similarity obtained in steps S21 to S23 according to preset weights to obtain the total similarity between that data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area.
Compared with the prior art, the invention has the following beneficial effects: according to the method, the characteristic values of all areas of the webpage are extracted through page rendering, and information such as the DOM tree structure of the webpage and the text similarity is combined, so that the target data can still be correctly extracted after the webpage structure is changed.
Drawings
FIG. 1 is a schematic diagram of the method of an embodiment of the present invention.
Fig. 2 is a first example of the JS script called by the system according to an embodiment of the present invention, where Algorithm 1 is a crawler script and Algorithm 2 is a tree search algorithm.
Fig. 3 is a second example of the JS script called by the system according to an embodiment of the present invention, where Algorithm 3 is the in-region data item matching algorithm.
Fig. 4 is an example of a web page before and after updating according to an embodiment of the present invention, where (a) is before updating the web page and (b) is after updating the web page.
Fig. 5 is a schematic diagram of an extraction result of the method of the embodiment.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in fig. 1, the present embodiment provides a method for adaptively extracting web page structured data, including the following steps:
packaging an extraction template, and judging according to the extraction template whether the structure of the target web page has changed; if the structure has not changed, locating the data in the target web page according to the path of the data recorded in the extraction template; if the structure has changed, calculating the similarity between the designated area of the extraction template and every area of the target web page, taking the area with the highest similarity as the candidate area, and mapping the data items within the candidate area, that is, calculating the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty, and mapping each data item to the node with the highest similarity.
Preferably, the packaging of the extraction template can be described as follows: the URL of the page containing the information to be extracted, the annotation information of the data to be extracted, and the name of the extraction template are input; the system finds the corresponding block on the web page according to the annotation content, and packages the feature values and annotation information of the block into an extraction template stored as a file in a specific format. The steps after the template is packaged constitute the Web data self-adaptive extraction process, which can be described as follows. First, the required template information is read from the extraction template, including the information to be extracted from the specific area of the old page and the feature attributes of that area. The target page is then parsed into a DOM tree with an existing crawler tool and its feature attributes are acquired. The area at the path of the template-specified area is located in the current page, and the similarity of the two areas is calculated. If the similarity is greater than a specified threshold, the page structure has not changed and the specified information is extracted; if the similarity is smaller than the threshold, the web page structure has changed and adaptive matching between the new and old pages is performed. The adaptive matching is divided into two stages: target region matching and intra-region data item mapping. Both stages comprise path similarity calculation, structure similarity calculation and text similarity calculation; computing the similarity between nodes from these three aspects improves the accuracy of the adaptation.
In this embodiment, the package extraction template specifically includes the following steps:
step S11: inputting the target web page, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the web page, and parses it to generate a DOM tree;
step S12: finding, according to the input annotation information, the designated subtree of the DOM tree that contains the data to be extracted;
step S13: crawling the information of the subtree and storing it as a file Template in a specific format, comprising Json and DOMTree, where Json is the structured representation of the data to be extracted from the specific area of the web page, and DOMTree is the DOM subtree of that area.
Preferably, in step S11, the algorithm of the JS script called by the system is as shown in fig. 2.
In this embodiment, in step S13, Json is expressed as:
$\mathit{Json} = \langle \mathit{name}_1 : \mathit{value}_1,\ \mathit{name}_2 : \mathit{value}_2,\ \ldots,\ \mathit{name}_n : \mathit{value}_n \rangle$;
where $\mathit{name}_i$ is the name of a data item to be extracted and $\mathit{value}_i$ is the data value corresponding to that name;
DOMTree is expressed as:
$\mathit{DOMTree} = \langle \mathit{Node}_1,\ \mathit{Node}_2,\ \ldots,\ \mathit{Node}_n \rangle$;
where $\mathit{Node}_i$ is a node of the tree and $\mathit{Node}_1$ is the root node of the subtree;
a node in the DOM tree is expressed as:
$\mathit{Node} = \langle \mathit{tag},\ \mathit{Father},\ \mathit{Child},\ \mathit{xpath},\ \mathit{text},\ \mathit{Attri} \rangle$;
where tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the feature attributes of the node;
the feature attributes Attri of a node are expressed as:
$\mathit{Attri} = \langle \mathit{id},\ \mathit{class},\ x,\ y,\ w,\ h \rangle$;
where id is the page id of the node tag, class is the class name of the node tag, x is the distance between the node and the left edge of the page, y is the distance between the node and the top of the page, w is the width of the area occupied by the node in the page, and h is the height of the area occupied by the node in the page;
the path xpath of a node is expressed as a sequence:
$\mathit{path} = \langle /\mathit{tag}_1[x_1]/\mathit{tag}_2[x_2]/\ldots/\mathit{tag}_n[x_n] \rangle$;
where $\mathit{tag}_i$ is a tag name on the path and $x_i$ indicates that the node is the $x_i$-th node at its level of the DOM tree.
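The tuple notation above can be sketched with plain Python dataclasses. This is only an illustration: the field name `cls`, the choice to store Father and Child as xpath strings, the `to_template` helper and the JSON dictionary layout are all assumptions, not the file format the method actually uses.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional
import json


@dataclass
class Attri:
    """Feature attributes of a node: Attri = <id, class, x, y, w, h>."""
    id: str = ""
    cls: str = ""    # "class" is a Python keyword, so the field is named cls
    x: float = 0.0   # distance from the left edge of the page
    y: float = 0.0   # distance from the top of the page
    w: float = 0.0   # width of the area occupied by the node
    h: float = 0.0   # height of the area occupied by the node


@dataclass
class Node:
    """Node = <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    father: Optional[str] = None                    # assumption: parent stored by xpath
    child: List[str] = field(default_factory=list)  # assumption: children stored by xpath


def to_template(data: Dict[str, str], nodes: List[Node]) -> str:
    """Serialize the Json part and the DOMTree part; nodes[0] is the subtree root."""
    return json.dumps({"Json": data, "DOMTree": [asdict(n) for n in nodes]})
```

Storing parent and child links as xpath strings keeps the structure free of reference cycles, so it serializes directly; an in-memory tree with object references would work equally well.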
Preferably, after the extraction template is obtained, the data required from a web page can be extracted by inputting the name of the template and the URL of the target web page. The Web data self-adaptive extraction process can be divided into 3 steps: 1. The Json string and the information of all nodes of the subtree are read from the extraction template and parsed into a DOM tree; the JS script is called to extract the information of all nodes in the target page, which is parsed to generate a DOM tree. 2. The subtree at the path of the root node of the template DOM tree is located in the target page, and the two subtrees are compared to judge whether the structure has changed: if the similarity is greater than a specified threshold, the structure of the web page has not changed, and the data is located in the target web page according to the path of the data in the extraction template; if the similarity is smaller than the specified threshold, the adaptive matching stage starts. 3. In the adaptive stage, the similarity between the designated area of the extraction template and every area of the target web page is calculated. The similarity calculation comprises path similarity, structure similarity and text similarity, and the total similarity is the weighted average of the three. The area with the highest similarity is taken as the candidate area, and the data items within the area are mapped: the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty is calculated, again as a weighted average of path similarity, structure similarity and text similarity, and each data item is mapped to the node with the highest similarity.
In this embodiment, the determining whether the structure of the target webpage is changed according to the extracted template specifically includes:
reading the Json string and the information of all nodes of the subtree from the extraction template and parsing the node information into a DOM tree; calling a JS script to extract the information of all nodes in the target page, and parsing it to generate a DOM tree;
locating, in the DOM tree of the target page, the subtree at the path of the root node of the DOM tree generated from the extraction template, and judging whether the structure has changed between the two subtrees: if the similarity of the two subtrees is greater than a specified threshold, the structure of the target web page has not changed; otherwise, the structure of the target web page is considered to have changed.
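A minimal sketch of this change check follows, assuming the page is held as a nested dictionary with `tag` and `children` keys and that the bracketed index counts same-tag siblings as in standard XPath (the text only says "the $x_i$-th node at its level"). The function names, the default threshold of 0.8 and the pluggable `similarity` callable are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, int]


def parse_path(xpath: str) -> List[Step]:
    """Split '/html[1]/body[1]/div[2]' into [('html', 1), ('body', 1), ('div', 2)]."""
    return [(tag, int(idx)) for tag, idx in re.findall(r"/([^/\[\]]+)\[(\d+)\]", xpath)]


def find_subtree(root: Dict, xpath: str) -> Optional[Dict]:
    """Walk a nested {'tag', 'children'} tree along the path; None if a step is missing."""
    steps = parse_path(xpath)
    if not steps or steps[0][0] != root["tag"]:
        return None
    node = root
    for tag, idx in steps[1:]:
        same_tag = [c for c in node.get("children", []) if c["tag"] == tag]
        if idx < 1 or idx > len(same_tag):
            return None
        node = same_tag[idx - 1]
    return node


def structure_unchanged(template_root: Dict, target_page: Dict, xpath: str,
                        similarity: Callable[[Dict, Dict], float],
                        threshold: float = 0.8) -> bool:
    """True when the subtree at the template path exists and is similar enough."""
    candidate = find_subtree(target_page, xpath)
    return candidate is not None and similarity(template_root, candidate) > threshold
```

When the path no longer resolves in the target page, the sketch treats the structure as changed, which sends the caller to the adaptive matching stage.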
In this embodiment, regarding target area matching: a web page can be structurally divided into several areas, that is, the DOM tree of the web page is divided into several subtrees. The extraction template stores the feature attributes of the designated area crawled from the web page in advance, together with the structure of the data to be extracted. Comparing area similarity means calculating the similarity between all the feature values and attributes of the designated area stored in the template and the feature values and attributes of every area of the input web page; the area with the highest similarity is regarded as the designated area after the iterative update of the web page.
Specifically, calculating the similarity between the designated area of the extraction template and every area of the target web page, and taking the area with the highest similarity as the candidate area, comprises the following steps:
step S21: judging the path similarity between the designated area and each area in the target web page. Observation of a large number of web pages before and after iterative updates shows that, even when the page structure changes, most sub-blocks of a page only move near their original position, so the path similarity of two areas can be used as an indicator of area similarity. The DOM tree paths of the root nodes of the two areas are taken as the two variables of the formula. A web page DOM tree path is usually regarded as the sequence of tags from the root node to a leaf node. The traditional DOM tree path matching model calculates the similarity of the path sequences by sequence matching only and ignores the position of the tree path in the DOM tree; this does not match reality, and the resulting similarity does not effectively reflect the actual similarity. Therefore, the present embodiment proposes an improved path similarity calculation method. For two tree paths:
$\mathit{xpath}_i = \langle /\mathit{tagName}_1[x_1]/\mathit{tagName}_2[x_2]/\ldots/\mathit{tagName}_n[x_n] \rangle$,
$\mathit{xpath}_{tar} = \langle /\mathit{tagName}_1[x_1]/\mathit{tagName}_2[x_2]/\ldots/\mathit{tagName}_n[x_n] \rangle$,
the DOM tree path similarity between them is defined as follows:
$\mathit{sim}(\mathit{xpath}_i, \mathit{xpath}_{tar}) = s_t(\mathit{xpath}_i, \mathit{xpath}_{tar}) \cdot \omega_1 + s_p(\mathit{xpath}_i, \mathit{xpath}_{tar}) \cdot (1 - \omega_1)$;
where $s_t$ (formula not reproduced in this text) represents the similarity of the tag sequences of the two tree paths; in it, $\mathit{path}_i(\mathit{tagName}_i) \cap \mathit{path}_{tar}(\mathit{tagName}_j)$ denotes the length of the longest common tag sequence of the two paths starting from the root node, and $\mathit{len}(\mathit{path}_i)$ denotes the length of the tag sequence of $\mathit{path}_i$. $s_p$ (formula not reproduced in this text) represents the position similarity of the two tree paths; it is based on the number of nodes in the longest common tag sequence starting from the root node that have the same sequence number at the same level in both paths.
The path similarity thus consists of two parts, $s_t(\mathit{xpath}_i, \mathit{xpath}_{tar})$ and $s_p(\mathit{xpath}_i, \mathit{xpath}_{tar})$, which reflect the tag sequence and the position information respectively. $\omega_1$ is the weight between the two, with a value range of 0 to 1; the importance of each part in the path similarity can be adjusted by changing $\omega_1$.
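The weighted combination of tag-sequence similarity and position similarity can be sketched as below. Because the defining formulas for $s_t$ and $s_p$ are not available in this text, the normalizations used here are assumptions: $s_t$ is taken as the common-prefix length divided by the longer path length, and $s_p$ as the share of common-prefix steps whose index also matches. The function names and the default weight of 0.5 are likewise hypothetical.

```python
import re
from typing import List, Tuple


def parse_path(xpath: str) -> List[Tuple[str, int]]:
    """Split '/html[1]/body[1]/div[2]' into [('html', 1), ('body', 1), ('div', 2)]."""
    return [(t, int(i)) for t, i in re.findall(r"/([^/\[\]]+)\[(\d+)\]", xpath)]


def path_similarity(xpath_i: str, xpath_tar: str, w1: float = 0.5) -> float:
    """sim = s_t * w1 + s_p * (1 - w1), with assumed definitions of s_t and s_p."""
    a, b = parse_path(xpath_i), parse_path(xpath_tar)
    if not a or not b:
        return 0.0
    # Longest common tag sequence starting from the root (tags only, indices ignored).
    common = 0
    for (tag_a, _), (tag_b, _) in zip(a, b):
        if tag_a != tag_b:
            break
        common += 1
    # Steps inside that common prefix whose index at the level also matches.
    same_pos = sum(1 for k in range(common) if a[k][1] == b[k][1])
    s_t = common / max(len(a), len(b))          # assumed normalization
    s_p = same_pos / common if common else 0.0  # assumed normalization
    return s_t * w1 + s_p * (1 - w1)
```

With these assumptions, a block that only moves to the neighbouring sibling position keeps a full tag-sequence score and loses only part of the position score, which matches the observation that updated blocks tend to stay near their original location.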
Step S22: judging the structural similarity between the designated area and each area in the target web page. The structural similarity between areas considers both a virtual structure and a real structure, namely the structure of the DOM tree and the visual layout of the rendered page. It therefore consists of two parts: the tree structure similarity, and the coordinates and size of the area in the web page. The tree structure similarity covers whether the parent nodes are consistent, a comparison of the total number of nodes contained in the tree, and a comparison of the DOM tree height. The coordinates and size of an area in the web page cover its height, its width, its distance from the top of the page, and its distance from the left side of the page. For two regions, the structural similarity between them is defined as follows:
$\mathit{sim}(\mathit{treestru}_i, \mathit{treestru}_{tar}) = s_t(T_i, T_{tar}) \cdot \omega + s_p(T_i, T_{tar}) \cdot (1 - \omega)$;
where $s_t(T_i, T_{tar})$ (formula not reproduced in this text) represents the similarity of the web page DOM tree structure; in it, $\mathit{equal}(\mathit{root}_i, \mathit{root}_{tar})$ indicates whether the root nodes of the two areas are consistent, $T_i(\mathit{node})$ denotes the total number of nodes contained in $T_i$, and $H(T_i)$ denotes the height of $T_i$, that is, the number of node levels of the DOM tree. $\omega_i$ ($i = 1, 2, 3$) are the weights between these terms, each ranging from 0 to 1.
$s_p(T_i, T_{tar})$ (formula not reproduced in this text) represents the similarity of the size and coordinates of the two regions in the entire page; in it, $\mathit{height}(T_i)$ denotes the height of the area, $\mathit{width}(T_i)$ denotes the width of the area, $\mathit{top}(T_i)$ denotes the distance of the area represented by $T_i$ from the top of the page, and $\mathit{left}(T_i)$ denotes its distance from the left side of the page. $\omega_i$ ($i = 1, 2, 3, 4$) are the weights between these terms, each ranging from 0 to 1.
The structural similarity between regions thus consists of two parts, $s_t(T_i, T_{tar})$ and $s_p(T_i, T_{tar})$, which represent the DOM tree structure information and the graphical interface layout information respectively. $\omega$ is the weight between the two, with a value range of 0 to 1; the importance of each part in the structural similarity can be adjusted by changing $\omega$.
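The two-part structural similarity can be illustrated as follows. The formulas for $s_t$ and $s_p$ are not available in this text, so this sketch assumes that each numeric feature is compared by the ratio of the smaller to the larger value and that the inner weights sum to 1. The `Region` class, the `closeness` helper, the use of a tag name as a stand-in for the root comparison, and all default weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Region:
    root_tag: str     # stand-in for the root/parent node comparison
    node_count: int   # total number of nodes in the subtree
    tree_height: int  # number of node levels in the subtree
    height: float     # rendered height
    width: float      # rendered width
    top: float        # distance from the top of the page
    left: float       # distance from the left side of the page


def closeness(a: float, b: float) -> float:
    """Ratio of the smaller to the larger value, in [0, 1]; 1.0 when equal."""
    if a == b:
        return 1.0
    lo, hi = min(a, b), max(a, b)
    return lo / hi if lo > 0 else 0.0


def structure_similarity(r: Region, t: Region, w: float = 0.5,
                         wt: Tuple[float, float, float] = (0.4, 0.3, 0.3),
                         wp: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """sim = s_t * w + s_p * (1 - w), with assumed definitions of s_t and s_p."""
    # DOM tree part: root consistency, node count, tree height.
    s_t = (wt[0] * (1.0 if r.root_tag == t.root_tag else 0.0)
           + wt[1] * closeness(r.node_count, t.node_count)
           + wt[2] * closeness(r.tree_height, t.tree_height))
    # Layout part: size and coordinates of the rendered area.
    s_p = (wp[0] * closeness(r.height, t.height) + wp[1] * closeness(r.width, t.width)
           + wp[2] * closeness(r.top, t.top) + wp[3] * closeness(r.left, t.left))
    return s_t * w + s_p * (1 - w)
```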
Step S23: judging the text similarity between the designated area and each area in the target web page. Text similarity is also a measurement factor of the similarity between regions, and this embodiment uses the synonym forest to calculate the similarity between words. The synonym forest classifies words semantically and organizes them into a five-level tree structure, in which each unit of synonyms is assigned an eight-character code. The structure includes synonymy relations, hypernymy (higher-level) relations and hyponymy relations of word senses. At the fifth level, words are grouped, and one character is appended to the end of the code to mark that the words in the group are synonyms ("="), related words of the same class ("#"), or that the group contains only one word ("@"). With this encoding rule, the present embodiment calculates the similarity of Chinese text with the following algorithm. The text within a region can be viewed as a sentence composed of several words, so calculating text similarity is essentially calculating sentence similarity. The present embodiment therefore obtains the text similarity from the word-to-text similarity $\mathit{sim}(w, \mathit{text}_{tar})$, defined as:
$\mathit{sim}(w, \mathit{text}) = \max(\mathit{sim}(w, \mathit{word}_1), \ldots, \mathit{sim}(w, \mathit{word}_k))$,
where $w$ is a word, text is all the text in a region and contains $k$ words, and $\mathit{sim}(w, \mathit{word}_i)$ is the similarity of two words. $\mathit{text}_i$ and $\mathit{text}_{tar}$ are all the text in the two regions, defined as $\mathit{text}_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,m}\}$ and $\mathit{text}_{tar} = \{w_{tar,1}, w_{tar,2}, \ldots, w_{tar,n}\}$, where $m$ and $n$ are the numbers of words into which $\mathit{text}_i$ and $\mathit{text}_{tar}$ are split. The text similarity (formula not reproduced in this text) contains two metrics: the similarity of the text contents, and a comparison of the total text lengths of the two areas. $\omega$ is the weight between the two parts, with a value range of 0 to 1; the importance of each part in the text similarity can be adjusted by changing $\omega$.
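A small sketch of the word-to-text maximum and of a two-part text similarity follows. Several things here are assumptions: `exact_match` is a placeholder for the synonym-forest word similarity, the content score is taken as the average of the word-to-text maxima, the length score as the ratio of word counts, and the default weight is arbitrary. Only the `max` rule itself comes from the formula above.

```python
from typing import Callable, List

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Placeholder word similarity; a synonym-forest lookup would go here."""
    return 1.0 if a == b else 0.0


def word_to_text(word: str, text: List[str], word_sim: WordSim = exact_match) -> float:
    """sim(w, text): the best similarity between w and any word of text."""
    return max((word_sim(word, other) for other in text), default=0.0)


def text_similarity(text_i: List[str], text_tar: List[str], w: float = 0.7,
                    word_sim: WordSim = exact_match) -> float:
    """Weighted mix of content similarity and length comparison (assumed forms)."""
    if not text_i or not text_tar:
        return 0.0
    # Content part: average best match of each word of text_i against text_tar.
    content = sum(word_to_text(x, text_tar, word_sim) for x in text_i) / len(text_i)
    # Length part: ratio of the shorter word count to the longer one.
    length = min(len(text_i), len(text_tar)) / max(len(text_i), len(text_tar))
    return content * w + length * (1 - w)
```

The inputs are lists of already segmented words; for Chinese text, a word segmentation step would have to run first, which the sketch leaves out.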
Step S24: for each region in the target web page, the path similarity, the structure similarity and the text similarity between that region and the designated region are combined by weighted calculation according to preset weights to obtain the total similarity between the region and the designated region, and the region with the highest total similarity is selected as the candidate region.
The total similarity is calculated by a weighted formula (not reproduced in this text).
Through this calculation, the area with the highest similarity to the target area is obtained and regarded as the suspected target area. If its similarity is greater than a certain threshold, data item matching within the area is performed next; if the similarity is smaller than the threshold, the target area cannot be found in the updated web page. First, the nodes in an area whose text content is not empty are taken to form the set of nodes to be matched, defined as follows:
$\mathit{Items} = \langle \mathit{node}_1,\ \mathit{node}_2,\ \ldots,\ \mathit{node}_k \rangle$;
The set comprises $k$ nodes $\mathit{node}_i$. Fig. 3 shows the algorithm for data item matching in this embodiment.
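Candidate selection and the construction of the node set can be sketched as below. The weighted sum is an assumed form of the total similarity, since the formula is not available in this text; the weights, the threshold of 0.6, the function names and the dictionary-based node layout are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Scores = Tuple[float, float, float]  # (path, structure, text) similarity


def total_similarity(path_sim: float, struct_sim: float, text_sim: float,
                     weights: Scores = (0.3, 0.4, 0.3)) -> float:
    """Weighted sum of the three similarity factors (assumed form)."""
    w_path, w_struct, w_text = weights
    return path_sim * w_path + struct_sim * w_struct + text_sim * w_text


def pick_candidate(scores: Dict[str, Scores], threshold: float = 0.6) -> Optional[str]:
    """Return the region key with the highest total, or None if it is below threshold."""
    if not scores:
        return None
    best = max(scores, key=lambda k: total_similarity(*scores[k]))
    return best if total_similarity(*scores[best]) > threshold else None


def nodes_to_match(nodes: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Items: the nodes of a region whose text content is not empty."""
    return [n for n in nodes if n.get("text", "").strip()]
```

Returning `None` models the case where the target area cannot be found in the updated page, so the caller can report a failure instead of extracting from an unrelated region.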
In this embodiment, mapping the data items within the candidate region, that is, calculating the similarity between the node corresponding to each data item and every node of the target web page whose text content is not empty and mapping each data item to the node with the highest similarity, specifically comprises the following steps:
step S21: calculating the path similarity between each data item in the designated area and each data item in the candidate area; the path similarity between data items is calculated with the path similarity formula constructed above, except that the parameter path here is the path within the area rather than the path in the whole web page, that is:
$\mathit{path} = \mathit{xpath} - \mathit{xpath}_{root}$;
where $\mathit{xpath}_{root}$ is the path of the root node of the area. The path similarity of the data items is then calculated on these intra-area paths (the formula is not reproduced in this text).
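The subtraction of the region root path can be read as stripping a path prefix. The helper below is a sketch under that reading; the function name and the choice to raise an error for nodes outside the region are assumptions.

```python
def relative_path(xpath: str, xpath_root: str) -> str:
    """path = xpath - xpath_root: drop the region root prefix from a node path."""
    if xpath == xpath_root:
        return ""  # the node is the region root itself
    prefix = xpath_root.rstrip("/") + "/"
    if not xpath.startswith(prefix):
        raise ValueError(f"{xpath!r} is not inside the region rooted at {xpath_root!r}")
    # Keep the leading slash so the result is still a sequence of /tag[x] steps.
    return "/" + xpath[len(prefix):]
```

For example, a node at `/html[1]/body[1]/div[2]/ul[1]/li[3]` in a region rooted at `/html[1]/body[1]/div[2]` gets the intra-area path `/ul[1]/li[3]`, so two regions that moved within the page can still yield identical item paths.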
Step S22: calculating the structural similarity between each data item in the designated area and each data item in the candidate area. The structural similarity between data items mainly considers two factors: the tag attributes of the page and the relative position within the region. The tag attributes comprise whether the tag names are consistent, whether the tag ids are consistent, and whether the font family, font size, and font color in the CSS style are consistent. The relative position within a region compares the distance from the top of the page and the distance from the left side of the page. The structural similarity of two data items is built from the following two terms.
The first term indicates the similarity of the tag attributes of the data items. $\mathrm{equal}(tagName_i, tagName_{tar})$ indicates whether the tag names are consistent: it is 1 if they are and 0 otherwise. $\mathrm{equal}(id_i, id_{tar})$ indicates whether the tag ids are consistent: it is 1 if they are and 0 otherwise. $\mathrm{equal}(\text{font-family}_i, \text{font-family}_{tar})$, $\mathrm{equal}(\text{font-size}_i, \text{font-size}_{tar})$ and $\mathrm{equal}(\text{font-color}_i, \text{font-color}_{tar})$ indicate whether the font families, font sizes, and font colors are consistent, respectively: each is 1 if they are and 0 otherwise. $\omega_i$ is the weight of each of these comparisons, with a value in the range 0 to 1.
The second term represents the relative position of the data item within the region. It has two components, the distance of the data item from the top of the page and its distance from the left side of the page; these are of equal importance, so each is given a weight of 0.5.
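The two terms can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula of the embodiment: the equal weights of 0.2, the `closeness` measure used to compare distances, and the names `Item`, `tag_similarity`, and `position_similarity` are all hypothetical; only the equality indicators and the 0.5/0.5 split between the two distances come from the text.

```python
from dataclasses import dataclass


@dataclass
class Item:
    tag_name: str
    tag_id: str
    font_family: str
    font_size: str
    font_color: str
    top: float   # distance from the top of the page (assumed non-negative)
    left: float  # distance from the left side of the page (assumed non-negative)


ATTRS = ("tag_name", "tag_id", "font_family", "font_size", "font_color")


def tag_similarity(a: Item, b: Item, weights=(0.2, 0.2, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of equal(...) indicators; the weight values are assumed."""
    return sum(w * (getattr(a, name) == getattr(b, name))
               for name, w in zip(ATTRS, weights))


def closeness(u: float, v: float) -> float:
    """Assumed distance comparison: 1 when equal, approaching 0 as they diverge."""
    biggest = max(u, v)
    return 1.0 if biggest == 0 else 1.0 - abs(u - v) / biggest


def position_similarity(a: Item, b: Item) -> float:
    """Top and left distances count equally, each with weight 0.5."""
    return 0.5 * closeness(a.top, b.top) + 0.5 * closeness(a.left, b.left)


old_item = Item("span", "name", "Arial", "14px", "#333", top=120, left=40)
new_item = Item("span", "title", "Arial", "16px", "#333", top=150, left=40)
print(tag_similarity(old_item, new_item), position_similarity(old_item, new_item))
```

How the two terms are merged into a single structural similarity is not shown here, since the sketch does not assume a particular combination.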
Step S23: calculating the text similarity between each data item in the designated area and each data item in the candidate area. The text similarity of the data items also uses the formula defined above, except that the text content contains only the text of a single data item rather than all the text within the entire region.
$\mathrm{sim}(w, nodeText) = \max(\mathrm{sim}(w, word_1), \ldots, \mathrm{sim}(w, word_s))$,
where $nodeText$ is the text contained in a single node and consists of $s$ words, and $\mathrm{sim}(w, word_i)$ is the similarity of two words. $nodeText_i$ and $nodeText_{tar}$ are the texts of the two nodes, defined as $nodeText_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,p}\}$ and $nodeText_{tar} = \{w_{tar,1}, w_{tar,2}, \ldots, w_{tar,q}\}$, where $p$ and $q$ are the numbers of words into which $nodeText_i$ and $nodeText_{tar}$ are split. Text similarity contains two metrics: the similarity of the text contents and a comparison of the text lengths. $\omega$ is the weight between the two parts, with a value in the range 0 to 1; changing $\omega$ adjusts the importance of each part in the text similarity.
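A minimal Python sketch of this step follows. Only the max-over-words rule and the weighting by ω are taken from the text. The word-level similarity (here `difflib.SequenceMatcher`), the averaging over words, the length ratio, and whitespace word splitting are assumptions made for illustration, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def word_sim(a: str, b: str) -> float:
    """Assumed word similarity: character-level matching ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def sim_word_to_node(word: str, node_words: list) -> float:
    """sim(w, nodeText): best similarity between `word` and any word of the node."""
    return max((word_sim(word, w) for w in node_words), default=0.0)


def content_similarity(words_i: list, words_tar: list) -> float:
    """Assumed aggregation: average best-match score over the words of node i."""
    if not words_i:
        return 0.0
    return sum(sim_word_to_node(w, words_tar) for w in words_i) / len(words_i)


def length_similarity(text_i: str, text_tar: str) -> float:
    """Assumed length comparison: shorter length divided by longer length."""
    longest = max(len(text_i), len(text_tar))
    return 1.0 if longest == 0 else min(len(text_i), len(text_tar)) / longest


def text_similarity(text_i: str, text_tar: str, omega: float = 0.5) -> float:
    """omega weights content similarity against length similarity."""
    content = content_similarity(text_i.split(), text_tar.split())
    return omega * content + (1 - omega) * length_similarity(text_i, text_tar)


print(text_similarity("eye clinic", "eye clinic department"))
```

Setting `omega` closer to 1 makes the score depend mostly on word content, while values closer to 0 emphasize the length comparison.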
Step S24: for each data item in the designated area, performing a weighted calculation on the path similarity, the structural similarity, and the text similarity from steps S21 to S23 according to preset weights to obtain the total similarity between that data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area. In other words, the similarity between every data item in the region and the data items in the specific region specified by the configuration file is calculated, and the three measurement factors are combined according to the preset weights into the total similarity.
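The weighted selection in step S24 can be sketched as below. The sketch assumes the combination is a plain weighted sum; the weight values, the candidate names, and the similarity scores are made up for illustration.

```python
def total_similarity(path_sim, struct_sim, text_sim, weights=(0.3, 0.3, 0.4)):
    """Weighted combination of the three factors; the weight values are assumed."""
    w_path, w_struct, w_text = weights
    return w_path * path_sim + w_struct * struct_sim + w_text * text_sim


def best_match(candidate_scores):
    """Pick the candidate data item with the highest total similarity.

    `candidate_scores` maps a candidate identifier to its
    (path, structure, text) similarities with one designated data item.
    """
    return max(candidate_scores,
               key=lambda name: total_similarity(*candidate_scores[name]))


# Hypothetical scores of three candidate items against one designated item.
scores = {
    "li[1]": (0.9, 0.6, 0.2),
    "li[2]": (0.7, 0.8, 0.9),
    "li[3]": (0.4, 0.5, 0.3),
}
print(best_match(scores))
```

Repeating `best_match` for every data item in the designated area yields the mapping between old and new data items.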
In particular, to better illustrate the effect of this embodiment, Fig. 4 shows an example of a web page before and after an update. The structure of the web page has changed greatly, the position and size of the target area have changed, and the data content to be extracted has also changed. With a traditional web page data extraction algorithm, the target block needed by the user cannot be located after the web page structure changes, and the correspondence between the data items of the new and old versions of the page cannot be found, which hinders large-scale data extraction. This embodiment aims to monitor changes to the web page in real time and adaptively adjust the extraction template to follow the update.
The feasibility of the method of this embodiment is discussed using this example, in which the doctor's information is extracted from the web page. First, the url of the website, the annotated json data, and the name 'sector' of the extraction template are input into the system, which produces the corresponding extraction template. The annotated json data is: {'recommendation heat (integrated)': '3.5', 'thank-you letters': '1', 'gifts': '0', 'department': 'Ophthalmology Department, Affiliated Hospital of Southwest Medical University', 'specialties': 'correction and prevention of keratopathy, corneal refractive surgery, uveal disease, ametropia, ocular laser examination and treatment', 'brief introduction': 'Zheng, female, associate chief physician, associate professor, Master of Medicine, member of the Chinese Medical Association, engaged in clinical practice, teaching, and scientific research for over 10 years.'}. When the information of the region is to be extracted, the name of the extraction template and the web page url are input. The system parses the web page into a DOM tree that stores the information of all nodes of the page, finds the region located at the region path specified by the extraction template in the current web page, and calculates the similarity of the two regions to judge whether they are similar. If they are, the structure of the web page has not changed and the specified information is extracted; if the similarity is smaller than the specified threshold, the structure of the web page has changed, and adaptive matching between the new and old web pages and updating of the extraction template are performed. The information extracted before and after the update of the web page in Fig. 4 is shown in Fig. 5, where (a) is the data extracted before the update and (b) is the data extracted after the update.
It can be seen from the figure that the method of the embodiment can still effectively extract data under the condition that the structure of the webpage is greatly changed.
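The extraction flow described for this example can be outlined in Python as follows. This is a control-flow sketch only: the template layout (a dict with `region_path` and `region` keys), the threshold value, and the four callables `find_region`, `region_similarity`, `extract`, and `rematch` are hypothetical placeholders for the steps named in the text.

```python
def extract_adaptive(template, page, *, find_region, region_similarity,
                     extract, rematch, threshold=0.8):
    """Extract data, falling back to adaptive matching when the page has changed.

    Returns the extracted data and the template that was finally used.
    """
    # Look up the region at the path recorded in the extraction template.
    region = find_region(page, template["region_path"])
    if region is not None and region_similarity(template["region"], region) > threshold:
        # Structure unchanged: extract with the stored paths.
        return extract(page, template), template
    # Structure changed: match old and new regions and data items, then
    # update the template before extracting.
    updated = rematch(template, page)
    return extract(page, updated), updated
```

Passing the individual steps in as functions keeps the sketch independent of any particular DOM parser or similarity implementation.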
In summary, when formulating the extraction template, the method provided by this embodiment defines not only the corresponding extraction rule but also an adaptive matching rule based on the text features, HTML tag features, visual features, and DOM tree structure features of the page data. The web page is matched against the corresponding extraction template, and after a successful match the data is extracted according to the extraction rule; if the page has changed and the xpath expression fails, the data is located again according to the adaptive matching rule and the xpath is updated. Experimental results show that the method has high accuracy and effectively reduces manual intervention in the extraction process.
The foregoing describes preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification or equivalent change made to the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
Claims (3)
1. A webpage structured data self-adaptive extraction method is characterized by comprising the following steps:
packaging the extraction template, judging whether the structure of the target webpage is changed or not according to the extraction template, and if not, finding the data in the target webpage according to the path of the data in the extraction template; if the structure of the target webpage is changed, calculating the similarity between the designated area of the extracted template and all areas of the target webpage, taking the area with the highest similarity as a candidate area, mapping data items in the candidate area, calculating the similarity between the node corresponding to each data item and all nodes with text contents not empty in the target webpage, wherein each data item corresponds to the node with the highest similarity;
the specific step of judging whether the structure of the target webpage is changed according to the extraction template is as follows:
reading all node information of json strings and subtrees in the extracted template, analyzing the node information into a DOM tree, calling a JS script to extract all node information in a target page, and analyzing and generating the DOM tree;
finding subtrees under the path of the DOM tree root node generated by extracting the template, judging whether the structures of the two subtrees are changed, and if the similarity of the two subtrees is greater than a specified threshold value, the structure of the target webpage is not changed; otherwise, the structure of the target webpage is changed;
the method for calculating the similarity between the specified area of the extracted template and all areas of the target webpage and taking the area with the highest similarity as the candidate area specifically comprises the following steps of:
step S21: judging the path similarity between the designated area and each area in the target webpage;
step S22: judging the structural similarity between the designated area and each area in the target webpage;
step S23: judging the text similarity between the designated area and each area in the target webpage;
step S24: for each region in the target webpage, respectively performing weighted calculation on the path similarity among the regions, the structure similarity among the regions and the text similarity among the regions according to preset weights to obtain the total similarity of the region and a specified region, and selecting the region with the highest total similarity as a candidate region;
the step of mapping the data items in the candidate area, calculating the similarity between the node corresponding to each data item and all nodes with non-empty text content in the target webpage, and assigning each data item the node with the highest similarity specifically comprises the following steps:
step S25: calculating the path similarity between each data item in the designated area and each data item in the candidate area;
step S26: calculating the structural similarity between each data item in the designated area and each data item in the candidate area;
step S27: calculating the text similarity between each data item in the designated area and each data item in the candidate area;
step S28: and for each data item in the designated area, respectively performing weighted calculation on the path similarity, the structure similarity and the text similarity in the steps S25 to S27 according to preset weights to obtain the total similarity between the data item and each data item in the candidate area, and selecting the data item with the highest total similarity as the data item in the candidate area corresponding to the data item in the designated area.
2. The method for adaptively extracting web page structured data according to claim 1, wherein the encapsulating extraction template specifically comprises the following steps:
step S11: inputting a target webpage, data to be extracted and the name of an extraction template, calling a JS script by a system to extract information of all nodes in the webpage, and analyzing and generating a DOM tree;
step S12: finding out a designated sub-tree containing data to be extracted in the DOM tree according to the input labeling information;
step S13: crawling the information of the subtree and storing it in a file Template of a specific format, wherein Json is the structured representation of the data to be extracted from the specific area of the webpage, and DOMTree is the DOM subtree of that specific area.
3. The method for adaptively extracting web page structured data according to claim 2, wherein in step S13, Json is expressed as:
$Json = \langle name_1 : value_1, name_2 : value_2, \ldots, name_n : value_n \rangle$;
in the formula, $name_i$ is the name of the data to be extracted, and $value_i$ is the data value corresponding to the data name;
the DOMTree is expressed as:
$DOMTree = \langle Node_1, Node_2, \ldots, Node_n \rangle$;
in the formula, $Node_i$ is a node of the tree, wherein $Node_1$ is the root node of the subtree;
one Node in the given DOM tree is expressed as:
$Node = \langle tag, Father, Child, xpath, text, Attri \rangle$;
in the formula, tag is the tag name of the node, Father is the parent node of the node, Child is the list of child nodes of the node, xpath is the path of the node, text is the text content of the node, and Attri is the characteristic attribute of the node;
the characteristic attribute Attri of a given node is expressed as:
$Attri = \langle id, class, x, y, w, h \rangle$;
in the formula, id is the page id of the node tag, class is the class name of the node tag, x is the distance between the node and the left border of the page, y is the distance between the node and the top of the page, w is the width of the area occupied by the node in the page, and h is the height of the area occupied by the node in the page;
the path xpath of a given Node is expressed as the sequence:
$xpath = \langle /tag_1[x_1]/tag_2[x_2]/\ldots/tag_n[x_n] \rangle$;
in the formula, $tag_i$ is the tag name at step $i$ of the path, and $x_i$ indicates that the node is the $x_i$-th node among the nodes at the same level of the DOM tree.
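The template structure recited in claim 3 can be mirrored by a small set of Python data classes. This is an illustrative sketch only: the field names follow the claim, while the class layout, the `cls` spelling (since `class` is a reserved word in Python), and the `parse_xpath` helper are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Attri:
    id: str
    cls: str    # the `class` attribute of the tag
    x: float    # distance from the left border of the page
    y: float    # distance from the top of the page
    w: float    # width of the occupied area
    h: float    # height of the occupied area


@dataclass
class Node:
    tag: str
    xpath: str
    text: str
    attri: Attri
    # Excluded from repr/eq so parent-child cycles do not recurse forever.
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    child: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    json: Dict[str, str]   # name_i -> value_i of the data to extract
    dom_tree: List[Node]   # dom_tree[0] is the root node of the subtree


_STEP = re.compile(r"([^/\[\]]+)\[(\d+)\]")


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag_1[x_1]/.../tag_n[x_n]' into (tag, index) pairs."""
    return [(tag, int(index)) for tag, index in _STEP.findall(xpath)]


print(parse_xpath("/html[1]/body[1]/div[3]"))
```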
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911196582.4A CN110968761B (en) | 2019-11-29 | 2019-11-29 | Webpage structured data self-adaptive extraction method |
PCT/CN2020/101247 WO2021103557A1 (en) | 2019-11-29 | 2020-07-10 | Adaptive extraction method for webpage structured data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110968761A (en) | 2020-04-07 |
CN110968761B (en) | 2022-07-08 |
Family
ID=70032195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911196582.4A Active CN110968761B (en) | 2019-11-29 | 2019-11-29 | Webpage structured data self-adaptive extraction method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110968761B (en) |
WO (1) | WO2021103557A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110968761B (en) * | 2019-11-29 | 2022-07-08 | 福州大学 | Webpage structured data self-adaptive extraction method |
CN113626028B (en) * | 2020-05-07 | 2024-06-14 | 腾讯科技(深圳)有限公司 | Page element mapping method and device |
CN111932536B (en) * | 2020-09-29 | 2021-03-05 | 平安国际智慧城市科技股份有限公司 | Method and device for verifying lesion marking, computer equipment and storage medium |
CN112632421B (en) * | 2020-12-25 | 2022-05-10 | 杭州电子科技大学 | Self-adaptive structured document extraction method |
WO2023002366A1 (en) * | 2021-07-19 | 2023-01-26 | Web Data Works Ltd. | SYSTEM AND METHOD FOR EFFICIENTLY IDENTIFYING AND SEGMENTING PRODUCT WEBPAGES ON AN eCOMMERCE WEBSITE |
US20230019515A1 (en) | 2021-07-19 | 2023-01-19 | Web Data Works Ltd. | System and Method for Efficiently Identifying and Segmenting Product Webpages on an eCommerce Website |
CN115062206B (en) * | 2022-05-30 | 2023-04-07 | 上海弘玑信息技术有限公司 | Webpage element searching method and electronic equipment |
CN114969478A (en) * | 2022-05-30 | 2022-08-30 | 上海弘玑信息技术有限公司 | Webpage structure detection method, equipment and readable storage medium |
GB2621144A (en) * | 2022-08-02 | 2024-02-07 | Nchain Licensing Ag | Wrapped encryption |
CN117972179A (en) * | 2024-01-05 | 2024-05-03 | 深圳中泓在线股份有限公司 | Directional data acquisition normalization method, system and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073654A (en) * | 2009-11-20 | 2011-05-25 | 富士通株式会社 | Methods and equipment for generating and maintaining web content extraction template |
CN102193944A (en) * | 2010-03-12 | 2011-09-21 | 三星电子(中国)研发中心 | Method for extracting webpage subject contents |
JP2012059212A (en) * | 2010-09-13 | 2012-03-22 | Nippon Telegr & Teleph Corp <Ntt> | Extraction apparatus, extraction method and extraction program |
CN102890681A (en) * | 2011-07-20 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and system for generating webpage structure template |
CN109325204A (en) * | 2018-09-13 | 2019-02-12 | 武汉伯远生物科技有限公司 | Web page contents extraction method |
CN109344355A (en) * | 2018-09-26 | 2019-02-15 | 北京因特睿软件有限公司 | Automatic returning detection and Block- matching adaptive approach and device for Web evolution |
CN110083754A (en) * | 2019-04-23 | 2019-08-02 | 重庆紫光华山智安科技有限公司 | The self-adapting data abstracting method of structure change webpage |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090125529A1 (en) * | 2007-11-12 | 2009-05-14 | Vydiswaran V G Vinod | Extracting information based on document structure and characteristics of attributes |
US8893294B1 (en) * | 2014-01-21 | 2014-11-18 | Shape Security, Inc. | Flexible caching |
CN110968761B (en) * | 2019-11-29 | 2022-07-08 | 福州大学 | Webpage structured data self-adaptive extraction method |
Non-Patent Citations (1)
Title |
---|
"基于网页聚类的正文信息提取方法";王一洲;《小型微型计算机系统》;20180115;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN110968761A (en) | 2020-04-07 |
WO2021103557A1 (en) | 2021-06-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||