WO2021103557A1 - Adaptive extraction method for structured web page data - Google Patents

Adaptive extraction method for structured web page data

Info

Publication number
WO2021103557A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
similarity
area
data
path
Prior art date
Application number
PCT/CN2020/101247
Other languages
English (en)
French (fr)
Inventor
陈星
郭莹楠
杨植
郑勇杰
陈晓娜
Original Assignee
福州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福州大学
Publication of WO2021103557A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The invention relates to the field of structured web page data extraction for the Internet of Things, and in particular to an adaptive extraction method for structured web page data.
  • The Internet is a huge resource library.
  • The number of web pages has reached the hundreds of billions and continues to grow rapidly every hour.
  • The rapid development of the Internet has caused explosive growth of information.
  • The Web, as the main carrier of Internet information, is flooded with all kinds of information.
  • To collect the useful information contained in web pages, various web data extraction techniques have been proposed.
  • Current web data extraction techniques generally target only a specific web page structure.
  • When a web page is updated iteratively, its structure may change, so that the page information can no longer be extracted or the wrong information is extracted.
  • The purpose of the present invention is to provide an adaptive extraction method for structured web page data that can still extract the target data correctly after the web page structure changes.
  • The present invention adopts the following scheme: an adaptive extraction method for structured web page data, including the following steps:
  • Encapsulate an extraction template, then judge from the extraction template whether the structure of the target web page has changed. If it has not changed, find the data in the target web page by the data paths stored in the extraction template. If it has changed, calculate the similarity between the designated area of the extraction template and every area of the target web page, take the area with the highest similarity as the candidate area, and map the data items within the candidate area: the node of each data item is compared with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity.
  • Encapsulating the extraction template specifically includes the following steps:
  • Step S11: Input the target web page, the data to be extracted, and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the page and parses it to generate a DOM tree;
  • Step S12: Find the designated subtree of the DOM tree that contains the data to be extracted, according to the input annotation information;
  • Step S13: Crawl the information of that subtree and store it as a file in a specific format, Template=<Json,DOMTree>, where Json is the structured representation of the data to be extracted from the specific area of the web page and DOMTree is the DOM subtree of that area.
  • In step S13, the Json is expressed as:
  • Json=<name_1:value_1,name_2:value_2,...,name_n:value_n>;
  • where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
  • The DOMTree is expressed as:
  • DOMTree=<Node_1,Node_2,...,Node_n>;
  • where Node_i is a node of the tree and Node_1 is the root node of the subtree;
  • A node Node of the DOM tree is expressed as:
  • Node=<tag,Father,Child,xpath,text,Attri>;
  • where tag is the tag name of the node, Father is its parent node, Child is the list of its child nodes, xpath is its path, text is its text content, and Attri is its characteristic attributes;
  • The characteristic attributes Attri of a node are expressed as:
  • Attri=<id,class,x,y,w,h>;
  • where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance between the node and the left border of the page, y is the distance between the node and the top of the page, w is the width of the area the node occupies in the page, and h is the height of that area;
  • The path xpath of a node is expressed as a sequence:
  • path=</tag_1[x_1]/tag_2[x_2]/.../tag_n[x_n]>;
  • where tag is a tag name on the path and x_i indicates that the node is the x_i-th node at its level of the DOM tree.
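  • As an illustration only, the template tuples described above could be held in memory roughly as follows. This is a minimal Python sketch, not the patent's file format: the class and field names simply mirror the tuples, `cls` stands in for `class`, and `parse_xpath` is a hypothetical helper that splits a path into (tag, index) steps.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Attri:
    """Characteristic attributes <id, class, x, y, w, h> of a node."""
    id: str = ""
    cls: str = ""   # "class" is reserved in Python, so it is renamed here
    x: float = 0.0  # distance from the left border of the page
    y: float = 0.0  # distance from the top of the page
    w: float = 0.0  # width of the occupied area
    h: float = 0.0  # height of the occupied area


@dataclass
class Node:
    """One DOM node <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    father: Optional["Node"] = None
    child: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    """Extraction template <Json, DOMTree>; dom_tree[0] is the subtree root."""
    json: Dict[str, str]
    dom_tree: List[Node]


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag_1[x_1]/.../tag_n[x_n]' into (tag, index) steps."""
    steps = []
    for step in xpath.strip("/").split("/"):
        match = re.fullmatch(r"([^\[\]/]+)(?:\[(\d+)\])?", step)
        if match is None:
            raise ValueError(f"unsupported path step: {step!r}")
        # Assumption: a step without an explicit index is the first sibling.
        steps.append((match.group(1), int(match.group(2) or 1)))
    return steps
```

  • Keeping the path as a list of (tag, index) pairs makes the later tag-sequence and position comparisons straightforward.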
  • Judging whether the structure of the target web page has changed according to the extraction template specifically includes:
  • Read the JSON string and all subtree node information from the extraction template and parse them into a DOM tree; call the JS script to extract the information of all nodes in the target page and parse it to generate a DOM tree. Following the path of the root node of the template's DOM tree, find the subtree under the same path in the target page and judge whether the structure of the two subtrees has changed: if the similarity of the two subtrees is greater than the specified threshold, the structure of the target web page has not changed; otherwise it is considered to have changed.
  • Calculating the similarity between the designated area of the extraction template and all areas of the target web page, and taking the area with the highest similarity as the candidate area, specifically includes the following steps:
  • Step S21: Determine the path similarity between the designated area and each area in the target web page;
  • Step S22: Determine the structural similarity between the designated area and each area in the target web page;
  • Step S23: Determine the text similarity between the designated area and each area in the target web page;
  • Step S24: For each area in the target web page, weight the path similarity, the structural similarity and the text similarity between that area and the designated area by preset weights to obtain their total similarity, and select the area with the highest total similarity as the candidate area.
  • Mapping the data items within the candidate area, in which the node of each data item is compared with every node of the target web page whose text content is not empty and each data item is mapped to the node with the highest similarity, specifically includes the following steps:
  • Step S21: Calculate the path similarity between the data items in the designated area and those in the candidate area;
  • Step S22: Calculate the structural similarity between the data items in the designated area and those in the candidate area;
  • Step S23: Calculate the text similarity between the data items in the designated area and those in the candidate area;
  • Step S24: For each data item in the designated area, weight the path similarity, structural similarity and text similarity from steps S21 to S23 by preset weights to obtain its total similarity with each data item in the candidate area, and select the candidate data item with the highest total similarity as the one corresponding to that data item.
  • Compared with the prior art, the present invention has the following beneficial effect: it extracts the feature values of each area of the web page through page rendering and combines them with the DOM tree structure of the page, text similarity and other information, so that the target data can still be extracted correctly after the web page structure changes.
  • Fig. 1 is a schematic diagram of the principle of the method according to an embodiment of the present invention.
  • Fig. 2 is example 1 of a JS script called by the system according to an embodiment of the present invention, where Algorithm1 is a crawler script and Algorithm2 is a tree search algorithm.
  • Fig. 3 is example 2 of a JS script called by the system according to the embodiment of the present invention, where Algorithm3 is the algorithm for matching data items within an area.
  • Fig. 4 is an example of a web page before and after updating in an embodiment of the present invention, where (a) is before the web page is updated, and (b) is after the web page is updated.
  • FIG. 5 is a schematic diagram of the extraction result of the method of this embodiment.
  • As shown in Fig. 1, this embodiment provides an adaptive extraction method for structured web page data, which includes the following steps:
  • Encapsulate an extraction template, then judge from the extraction template whether the structure of the target web page has changed. If it has not changed, find the data in the target web page by the data paths stored in the extraction template. If it has changed, calculate the similarity between the designated area of the extraction template and every area of the target web page, take the area with the highest similarity as the candidate area, and map the data items within it: the node of each data item is compared with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity.
  • The encapsulation of the extraction template can be described as follows: the URL of the page to extract from, the annotation information of the data to be extracted, and the name of the extraction template are input; the system finds the corresponding block on the web page according to the annotated content, then encapsulates the feature values and tag information of that block into an extraction template stored as a file in a specific format.
  • The step after encapsulation is the adaptive extraction of web data. The required template information is first read from the extraction template, including the information to be extracted from the specific area of the old page and the characteristic attributes of that area. An existing crawler tool parses the web page into a DOM tree and obtains the characteristic attributes of the page. The area under the path specified by the extraction template is located in the current web page and its similarity is calculated to judge whether the two areas are similar. If they are similar, the web page structure has not changed and the specified information is extracted; if the similarity is less than the specified threshold, the web page structure has changed and adaptive matching of the old and new web pages is performed.
  • The adaptive matching of new and old web pages is divided into two stages: target area matching and data item mapping within the area. Both stages include path similarity, structural similarity and text similarity calculations; the similarity between nodes is computed from these three aspects together to improve the accuracy of adaptation.
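  • In this embodiment, encapsulating the extraction template specifically includes the following steps:
  • Step S11: Input the target web page, the data to be extracted, and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the page and parses it to generate a DOM tree;
  • Step S12: Find the designated subtree of the DOM tree that contains the data to be extracted, according to the input annotation information;
  • Step S13: Crawl the information of that subtree and store it as a file in a specific format, Template=<Json,DOMTree>, where Json is the structured representation of the data to be extracted from the specific area of the web page and DOMTree is the DOM subtree of that area.
  • In step S11, the algorithm of the JS script called by the system is shown in Fig. 2.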
  • Once the extraction template has been obtained, the required data can be extracted by inputting the template name and the URL of the target web page. The adaptive extraction of web data can be divided into 3 steps:
  • 1. Read the JSON string and all subtree node information from the extraction template and parse them into a DOM tree; call the JS script to extract the information of all nodes in the target page and parse it to generate a DOM tree.
  • 2. Following the path of the root node of the template's DOM tree, find the subtree under that path in the target page and judge whether the two subtree structures have changed. If the similarity is greater than the specified threshold, the web page structure has not changed and the data is found in the target page by the data paths in the extraction template; if the similarity is less than the specified threshold, the adaptive matching stage begins.
  • 3. The adaptive stage first calculates the similarity between the designated area of the extraction template and all areas of the target web page. The similarity calculation covers path similarity, structural similarity and text similarity, and the total similarity is their weighted average. The area with the highest similarity is the candidate area, and the data items within it are then mapped: the node of each data item is compared with every node of the target web page whose text content is not empty, again by a weighted average of path, structural and text similarity, and each data item is mapped to the node with the highest similarity.
  • In this embodiment, judging whether the structure of the target web page has changed according to the extraction template is specifically as follows:
  • Following the path of the root node of the template's DOM tree, find the subtree under the same path in the target page and judge whether the structure of the two subtrees has changed: if the similarity of the two subtrees is greater than the specified threshold, the structure of the target web page has not changed; otherwise it is considered to have changed.
  • Regarding target area matching, a web page can be divided structurally into several areas, that is, the DOM tree of the web page is divided into several subtrees. The extraction template stores the characteristic attributes of the designated area, crawled from the web page in advance, together with the structure of the data to be extracted.
  • Area similarity comparison calculates the similarity between all feature values and attributes of the designated area stored in the extraction template and the feature values and attributes of every area of the input web page.
  • The area with the highest similarity is regarded as the designated area after the iterative update of the web page.
  • Calculating the similarity between the designated area of the extraction template and all areas of the target web page, and selecting the area with the highest similarity as the candidate area, specifically includes the following steps:
  • Step S21: Determine the path similarity between the designated area and each area in the target web page. Observation of a large number of web pages before and after iterative updates shows that, even when the page structure changes, most sub-blocks of a page only move near their original location. The path similarity of two areas can therefore serve as one indicator of area similarity. The DOM tree paths of the root nodes of the two areas are taken as the two variables of the formula.
  • A DOM tree path is usually regarded as the sequence of tags from the root node to a leaf node.
  • The traditional DOM tree path matching model calculates the similarity of path sequences by sequence matching alone and ignores the position at which the tree path appears in the DOM tree. This does not match reality, and the resulting similarity cannot reflect the actual similarity faithfully. This embodiment therefore proposes an improved path similarity calculation. For two tree paths:
  • xpath_i=</tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
  • xpath_tar=</tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
  • sim(xpath_i,xpath_tar)=st(xpath_i,xpath_tar)*α_1+sp(xpath_i,xpath_tar)*(1-α_1);
  • where st(path_i,path_tar) is the tag sequence similarity of the two tree paths, in which path_i(tagName_i)∩path_tar(tagName_j) denotes the length of the longest common tag sequence of the two paths starting from the root node and len(path_i) denotes the tag sequence length of path_i; sp(path_i,path_tar) is the position similarity of the two tree paths, based on the number of nodes with the same level sequence number within the longest common tag sequence starting from the root node of the two paths.
  • Path similarity is thus composed of st(path_i,path_tar) and sp(path_i,path_tar), which reflect the tag sequence and the position information respectively; α_1 is the weight between them, with a value range of 0-1. Changing α_1 adjusts the importance of the two parts in path similarity.
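  • The exact expressions for st and sp did not survive in this text, so the following Python sketch fills them in under stated assumptions: st is taken as the common-prefix length divided by the longer path length, and sp as the fraction of common-prefix steps whose sibling index also matches. The function name `path_similarity` and the default `weight` are hypothetical.

```python
from typing import List, Tuple

Path = List[Tuple[str, int]]  # (tag name, index among siblings) per level


def path_similarity(path_i: Path, path_tar: Path, weight: float = 0.5) -> float:
    """Weighted mix of tag-sequence similarity (st) and position similarity (sp)."""
    # Length of the longest common tag sequence starting from the root node.
    common = 0
    for (tag_a, _), (tag_b, _) in zip(path_i, path_tar):
        if tag_a != tag_b:
            break
        common += 1
    if common == 0:
        return 0.0
    # Assumed normalization: divide by the longer of the two tag sequences.
    st = common / max(len(path_i), len(path_tar))
    # Steps of the common sequence whose sibling index is also identical.
    same_pos = sum(1 for k in range(common) if path_i[k][1] == path_tar[k][1])
    sp = same_pos / common
    return st * weight + sp * (1 - weight)
```

  • With this choice, a block that moved from `div[2]` to `div[3]` under the same ancestors keeps a full tag-sequence score and loses only part of its position score.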
  • Step S22: Determine the structural similarity between the designated area and each area in the target web page. The structural similarity between areas considers both the virtual structure and the real structure, that is, the structure of the DOM tree and the visual layout of the web page, and so consists of two parts: tree structure similarity, and the coordinates and size of the area in the web page.
  • Tree structure similarity covers whether the parent nodes are the same, the total number of nodes in the tree, and the height of the DOM tree.
  • The coordinates and size of the area cover the height and width of the two areas, their distance from the top of the page, and their distance from the left side of the page. For two areas, the structural similarity is defined as:
  • sim(treestru_i,treestru_tar)=st(T_i,T_tar)*β+sp(T_i,T_tar)*(1-β);
  • where height(T_i) is the height of the area represented by T_i, width(T_i) is its width, top(T_i) is its distance from the top of the page, and left(T_i) is its distance from the left side of the page.
  • The structural similarity between areas is thus composed of st(T_i,T_tar) and sp(T_i,T_tar), which reflect the DOM tree structure information and the graphical layout information respectively; β is the weight between them, with a value range of 0-1. Changing β adjusts the importance of the two parts in structural similarity.
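  • A rough Python sketch of the area structural similarity follows. The component formulas are not recoverable from this text, so the sketch assumes each numeric feature is compared by a min/max ratio and the components are averaged; `Region`, `ratio` and `region_structure_similarity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Region:
    parent_tag: str   # tag of the parent node of the area's root
    node_count: int   # total number of nodes in the subtree
    tree_height: int  # height of the subtree
    top: float        # distance from the top of the page
    left: float       # distance from the left side of the page
    width: float
    height: float


def ratio(a: float, b: float) -> float:
    """Assumed closeness measure: 1.0 when equal, smaller/larger otherwise."""
    if a == b:
        return 1.0
    larger = max(a, b)
    return min(a, b) / larger if larger > 0 else 0.0


def region_structure_similarity(r_i: Region, r_tar: Region, weight: float = 0.5) -> float:
    # st: DOM tree structure (same parent, node count, tree height).
    st = (float(r_i.parent_tag == r_tar.parent_tag)
          + ratio(r_i.node_count, r_tar.node_count)
          + ratio(r_i.tree_height, r_tar.tree_height)) / 3
    # sp: rendered layout (height, width, top offset, left offset).
    sp = (ratio(r_i.height, r_tar.height) + ratio(r_i.width, r_tar.width)
          + ratio(r_i.top, r_tar.top) + ratio(r_i.left, r_tar.left)) / 4
    return st * weight + sp * (1 - weight)
```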
  • Step S23: Determine the text similarity between the designated area and each area in the target web page. Text similarity is also a measure of the similarity between areas.
  • This embodiment uses the Tongyici Cilin synonym thesaurus (literally "synonym word forest") to calculate the similarity between words.
  • The thesaurus classifies words semantically and organizes them into a five-level tree structure, in which each group of synonyms is identified by an eight-digit code.
  • The structure captures synonym, superordinate and subordinate relations between word senses.
  • This embodiment uses the following algorithm to calculate the similarity of Chinese text.
  • The text in an area can be regarded as a sentence consisting of several words.
  • Calculating text similarity is therefore essentially calculating sentence similarity, and this embodiment obtains the text similarity from the word-to-text similarity sim(w,text_tar):
  • sim(w,text)=max(sim(w,word_1),...,sim(w,word_k)),
  • where w is a word, text is all the text in the area, containing k words, sim(w,word_i) is the similarity of two words, and m and n are the numbers of words into which text_i and text_tar are split, respectively.
  • The text similarity combines two factors: the similarity of the text content and the comparison of the total text lengths of the two areas.
  • γ is the weight between them, with a value range of 0-1. Changing γ adjusts the importance of the two parts in text similarity.
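  • The sketch below illustrates this text similarity in Python under explicit assumptions: the thesaurus-based word similarity is replaced by a placeholder exact-match function (`word_sim`), the content score is assumed to average the word-to-text maxima in both directions over m + n words, and the length factor is assumed to be a shorter/longer ratio. None of these specifics is confirmed by the text above.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def word_sim(a: str, b: str) -> float:
    """Placeholder for a thesaurus-based word similarity (exact match only)."""
    return 1.0 if a == b else 0.0


def word_to_text(word: str, words: Sequence[str], sim: WordSim = word_sim) -> float:
    """sim(w, text): best similarity between w and any word of the text."""
    return max((sim(word, other) for other in words), default=0.0)


def text_similarity(words_i: Sequence[str], words_tar: Sequence[str],
                    weight: float = 0.8, sim: WordSim = word_sim) -> float:
    m, n = len(words_i), len(words_tar)
    if m == 0 or n == 0:
        return 1.0 if m == n else 0.0
    # Content similarity, averaged over the words of both texts (assumed form).
    content = (sum(word_to_text(w, words_tar, sim) for w in words_i)
               + sum(word_to_text(w, words_i, sim) for w in words_tar)) / (m + n)
    # Length comparison: shorter total length over longer total length (assumed).
    len_i = sum(len(w) for w in words_i)
    len_tar = sum(len(w) for w in words_tar)
    longest = max(len_i, len_tar)
    length = min(len_i, len_tar) / longest if longest else 1.0
    return content * weight + length * (1 - weight)
```

  • A real implementation would pass a thesaurus-backed function as `sim` after segmenting the Chinese text into words.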
  • Step S24: For each area in the target web page, the path similarity, the structural similarity and the text similarity between that area and the designated area are weighted by preset weights to obtain their total similarity, and the area with the highest total similarity is selected as the candidate area. The total similarity is calculated as:
  • sim(tree_i,tree_tar)=sim(xpath_i,xpath_tar)*ω_1+sim(treestru_i,treestru_tar)*ω_2+sim(text_i,text_tar)*ω_3
  • In this way the area with the highest similarity to the target area is obtained and regarded as the suspected target area. If its similarity is greater than a certain threshold, the next step is data item matching within the area; if the similarity is less than the threshold, the target area cannot be found in the updated web page. For data item matching, the set of nodes to be matched is first defined as the nodes in the area whose text content is not empty.
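  • The weighted combination and threshold check can be sketched as follows. The weights and the threshold are illustrative values only, and `total_similarity` and `pick_candidate` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

Scores = Tuple[float, float, float]  # (path, structure, text) similarity


def total_similarity(scores: Scores,
                     weights: Tuple[float, float, float] = (0.3, 0.4, 0.3)) -> float:
    """Weighted sum of the three similarities using preset weights."""
    return sum(score * weight for score, weight in zip(scores, weights))


def pick_candidate(area_scores: Dict[str, Scores],
                   threshold: float = 0.6) -> Optional[str]:
    """Return the area with the highest total similarity, or None when even
    the best area does not exceed the threshold (target area not found)."""
    if not area_scores:
        return None
    best = max(area_scores, key=lambda area: total_similarity(area_scores[area]))
    return best if total_similarity(area_scores[best]) > threshold else None
```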
  • Figure 3 shows the data item matching algorithm of this embodiment.
  • Mapping the data items within the candidate area, in which the node of each data item is compared with every node of the target web page whose text content is not empty and each data item is mapped to the node with the highest similarity, specifically includes the following steps:
  • Step S21: Calculate the path similarity between the data items in the designated area and those in the candidate area. It is calculated with the formula constructed above; the difference is that the path used here is the path within the area rather than the path within the whole web page, that is, the node's path relative to xpath_root, the path of the root node of the area.
  • The path similarity of data items is calculated on these paths within the area.
  • Step S22: Calculate the structural similarity between the data items in the designated area and those in the candidate area. It considers two things: the tag attributes in the page and the relative position within the area.
  • The tag attributes cover whether the tag name, the tag id, and the font family, font size and font color in the CSS style are the same; the relative position within the area covers the comparison of the distance from the top and the distance from the left.
  • In the similarity of the tag attributes of data items, equal(tag_i,tag_tar) indicates whether the tag names are consistent: 1 if they are, otherwise 0; equal(id_i,id_tar) indicates whether the tag ids are consistent: 1 if they are, otherwise 0.
  • equal(font-family_i,font-family_tar), equal(font-size_i,font-size_tar) and equal(font-color_i,font-color_tar) indicate whether the font family, font size and font color are consistent: 1 if they are, otherwise 0.
  • λ_i are the weights among them, each with a value range of 0-1.
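  • The tag attribute part of this comparison is a weighted sum of 0/1 indicators and can be sketched as below. Nodes are represented as plain dictionaries for brevity, equal weights are an assumption, and `tag_attribute_similarity` is a hypothetical name; the relative position part is omitted.

```python
from typing import Dict, Optional

ATTRIBUTE_KEYS = ("tag", "id", "font-family", "font-size", "font-color")


def equal(a: Optional[str], b: Optional[str]) -> float:
    """Indicator used in the text: 1 if the two values are consistent, else 0."""
    return 1.0 if a == b else 0.0


def tag_attribute_similarity(node_i: Dict[str, str], node_tar: Dict[str, str],
                             weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of equal(...) over tag name, id and font properties."""
    if weights is None:
        # Assumption: equal weights when none are preset.
        weights = {key: 1 / len(ATTRIBUTE_KEYS) for key in ATTRIBUTE_KEYS}
    return sum(weights[key] * equal(node_i.get(key), node_tar.get(key))
               for key in ATTRIBUTE_KEYS)
```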
  • Step S23: Calculate the text similarity between the data items in the designated area and those in the candidate area. It also uses the formula defined above; the difference is that the text here contains only the text content of a single data item rather than all the text in the area.
  • sim(w,nodetext)=max(sim(w,word_1),...,sim(w,word_s)),
  • where nodetext is the text contained in a single node, containing s words,
  • sim(w,word_i) is the similarity of two words,
  • and p and q are the numbers of words into which nodetext_i and nodetext_tar are split, respectively.
  • The text similarity combines two factors: the similarity of the text content and the comparison of the lengths of the two texts. γ is the weight between them, with a value range of 0-1. Changing γ adjusts the importance of the two parts in text similarity.
  • Step S24: For each data item in the designated area, the path similarity, structural similarity and text similarity from steps S21 to S23 are weighted by preset weights to obtain its total similarity with each data item in the candidate area, and the candidate data item with the highest total similarity is selected as the one corresponding to that data item. That is, every data item in the area is compared with the data items of the specific area specified by the configuration file, and the three factors obtained above are combined with certain weights, which gives the following formula:
  • sim(node_i,node_tar)=sim(path_i,path_tar)*ω_1+sim(nodestru_i,nodestru_tar)*ω_2+sim(nodetext_i,nodetext_tar)*ω_3
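  • Once a total node similarity is available, the mapping itself is a per-item arg-max. The Python sketch below takes the similarity as a parameter; `map_data_items` and the toy similarity used in the usage example are hypothetical and stand in for the weighted formula above.

```python
from typing import Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")


def map_data_items(template_items: Dict[str, T],
                   candidate_nodes: List[T],
                   similarity: Callable[[T, T], float]) -> Dict[str, Optional[T]]:
    """Map each template data item to the candidate node with the highest
    total similarity; items map to None when there are no candidates."""
    mapping: Dict[str, Optional[T]] = {}
    for name, item in template_items.items():
        if not candidate_nodes:
            mapping[name] = None
            continue
        mapping[name] = max(candidate_nodes, key=lambda node: similarity(item, node))
    return mapping


def toy_similarity(a: dict, b: dict) -> float:
    """Toy stand-in: half the score for the same tag, half for the same text."""
    return 0.5 * (a["tag"] == b["tag"]) + 0.5 * (a["text"] == b["text"])
```

  • For example, `map_data_items({"title": {"tag": "h1", "text": "Phone"}}, nodes, toy_similarity)` returns, for "title", whichever entry of `nodes` scores highest.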
  • Figure 4 is an example of a web page before and after an update. The structure of the page has changed substantially: the location and size of the target area have both changed, and the content of the data to be extracted has changed as well. The method of this embodiment can nevertheless still locate the target area and output the one-to-one correspondence of the data items. A traditional web data extraction algorithm, by contrast, can neither locate the required target block after the page structure changes nor find the correspondence between the data items of the new and old pages, which hinders large-scale data extraction. This embodiment aims to monitor web page changes in real time and adaptively adjust the extraction template to follow page updates.
  • The system parses the web page into a DOM tree that stores all the node information of the page, then, following the area path specified by the extraction template, finds the area under that path in the current web page and calculates the similarity between the two areas to determine whether they are similar. If they are similar, the web page structure has not changed and the specified information is extracted; if the similarity is less than the specified threshold, the web page structure has changed, the new and old web pages are adaptively matched, and the extraction template is updated.
  • The method proposed in this embodiment defines the corresponding extraction rules when formulating extraction templates and also defines adaptive matching rules based on the text features, HTML tag features, visual features and DOM tree structure features of the page data.
  • A web page is matched with its corresponding extraction template. If the page has not changed, data is extracted according to the extraction rules; if the page has changed and the xpath expression has become invalid, the data is located again according to the adaptive matching rules and the xpath is updated.
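  • The extract-or-rematch behaviour described here can be sketched in Python as follows. The page lookup and the adaptive matcher are passed in as callables, because their concrete form is not given in this text; `extract`, `lookup` and `rematch` are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple


def extract(template_paths: Dict[str, str],
            lookup: Callable[[str], Optional[str]],
            rematch: Callable[[str], Optional[str]]
            ) -> Tuple[Dict[str, Optional[str]], Dict[str, str]]:
    """Extract each data item by its stored xpath; when the xpath no longer
    resolves, ask the adaptive matcher for a new xpath and update the template."""
    data: Dict[str, Optional[str]] = {}
    updated_paths = dict(template_paths)
    for name, xpath in template_paths.items():
        value = lookup(xpath)
        if value is None:                 # stored xpath is invalid on this page
            new_xpath = rematch(name)     # adaptive matching rules
            if new_xpath is not None:
                updated_paths[name] = new_xpath
                value = lookup(new_xpath)
        data[name] = value
    return data, updated_paths
```

  • Returning the updated paths alongside the data lets the caller persist the adjusted template so that later runs start from the new xpath.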
  • Experimental results show that this method has a high accuracy rate and effectively reduces manual intervention in the extraction process.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

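An adaptive extraction method for structured web page data. An extraction template is first encapsulated, and whether the structure of the target web page has changed is judged according to the extraction template. If it has not changed, the data in the target web page is found by the data paths in the extraction template. If the structure of the target web page has changed, the similarity between the designated area of the extraction template and all areas of the target web page is calculated, the area with the highest similarity is taken as the candidate area, and the data items within the candidate area are mapped: the node of each data item is compared with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity. The method can still extract the target data correctly after the web page structure changes.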

Description

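Adaptive extraction method for structured web page data
Technical Field
The present invention relates to the field of structured web page data extraction for the Internet of Things, and in particular to an adaptive extraction method for structured web page data.
Background Art
The Internet is a huge resource library. The number of web pages has reached the hundreds of billions and keeps growing at an astonishing rate every hour. The rapid development of the Internet has caused explosive growth of information, and the Web, as the main carrier of Internet information, is flooded with all kinds of information. To collect the useful information contained in web pages, various Web data extraction techniques have been proposed.
Technical Problem
However, current Web data extraction techniques generally target only a specific web page structure. When a web page is updated iteratively, its structure may change, so that the page information can no longer be extracted or the wrong information is extracted.
Technical Solution
In view of this, the purpose of the present invention is to provide an adaptive extraction method for structured web page data that can still extract the target data correctly after the web page structure changes.
The present invention is implemented with the following scheme: an adaptive extraction method for structured web page data, including the following steps:
Encapsulate an extraction template, then judge from the extraction template whether the structure of the target web page has changed. If it has not changed, find the data in the target web page by the data paths in the extraction template. If it has changed, calculate the similarity between the designated area of the extraction template and all areas of the target web page, take the area with the highest similarity as the candidate area, and map the data items within the candidate area: the node of each data item is compared with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity.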
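Further, encapsulating the extraction template specifically includes the following steps:
Step S11: Input the target web page, the data to be extracted, and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the page and parses it to generate a DOM tree;
Step S12: Find the designated subtree of the DOM tree that contains the data to be extracted, according to the input annotation information;
Step S13: Crawl the information of that subtree and store it as a file in a specific format, Template=<Json,DOMTree>, where Json is the structured representation of the data to be extracted from the specific area of the web page and DOMTree is the DOM subtree of that area.
Further, in step S13, the Json is expressed as:
Json=<name_1:value_1,name_2:value_2,...,name_n:value_n>;
where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
The DOMTree is expressed as:
DOMTree=<Node_1,Node_2,...,Node_n>;
where Node_i is a node of the tree and Node_1 is the root node of the subtree;
A node Node of the DOM tree is expressed as:
Node=<tag,Father,Child,xpath,text,Attri>;
where tag is the tag name of the node, Father is its parent node, Child is the list of its child nodes, xpath is its path, text is its text content, and Attri is its characteristic attributes;
The characteristic attributes Attri of a node are expressed as:
Attri=<id,class,x,y,w,h>;
where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance between the node and the left border of the page, y is the distance between the node and the top of the page, w is the width of the area the node occupies in the page, and h is the height of that area;
The path xpath of a node is expressed as a sequence:
path=</tag_1[x_1]/tag_2[x_2]/.../tag_n[x_n]>;
where tag is a tag name on the path and x_i indicates that the node is the x_i-th node at its level of the DOM tree.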
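Further, judging whether the structure of the target web page has changed according to the extraction template is specifically:
Read the JSON string and all subtree node information from the extraction template and parse them into a DOM tree; call the JS script to extract the information of all nodes in the target page and parse it to generate a DOM tree;
Following the path of the root node of the template's DOM tree, find the subtree under that path in the target page and judge whether the two subtree structures have changed: if the similarity of the two subtrees is greater than the specified threshold, the structure of the target web page has not changed; otherwise it is considered to have changed.
Further, calculating the similarity between the designated area of the extraction template and all areas of the target web page, and taking the area with the highest similarity as the candidate area, specifically includes the following steps:
Step S21: Determine the path similarity between the designated area and each area in the target web page;
Step S22: Determine the structural similarity between the designated area and each area in the target web page;
Step S23: Determine the text similarity between the designated area and each area in the target web page;
Step S24: For each area in the target web page, weight the path similarity, the structural similarity and the text similarity between that area and the designated area by preset weights to obtain their total similarity, and select the area with the highest total similarity as the candidate area.
Further, mapping the data items within the candidate area, in which the node of each data item is compared with every node of the target web page whose text content is not empty and each data item is mapped to the node with the highest similarity, specifically includes the following steps:
Step S21: Calculate the path similarity between the data items in the designated area and those in the candidate area;
Step S22: Calculate the structural similarity between the data items in the designated area and those in the candidate area;
Step S23: Calculate the text similarity between the data items in the designated area and those in the candidate area;
Step S24: For each data item in the designated area, weight the path similarity, structural similarity and text similarity from steps S21 to S23 by preset weights to obtain its total similarity with each data item in the candidate area, and select the candidate data item with the highest total similarity as the one corresponding to that data item.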
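Beneficial Effects
Compared with the prior art, the present invention has the following beneficial effect: it extracts the feature values of each area of the web page through page rendering and combines them with the DOM tree structure of the page, text similarity and other information, so that the target data can still be extracted correctly after the web page structure changes.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the principle of the method according to an embodiment of the present invention.
Fig. 2 is example 1 of a JS script called by the system according to an embodiment of the present invention, where Algorithm1 is a crawler script and Algorithm2 is a tree search algorithm.
Fig. 3 is example 2 of a JS script called by the system according to an embodiment of the present invention, where Algorithm3 is the algorithm for matching data items within an area.
Fig. 4 is an example of a web page before and after an update in an embodiment of the present invention, where (a) is before the update and (b) is after the update.
Fig. 5 is a schematic diagram of the extraction result of the method of this embodiment.
Detailed Description of the Embodiments
The present invention is further described below with reference to the drawings and embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which this application belongs.
It should also be noted that the terms used here are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to this application. As used here, unless the context clearly indicates otherwise, the singular form is intended to include the plural form as well; in addition, when the terms "comprise" and/or "include" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
As shown in Fig. 1, this embodiment provides an adaptive extraction method for structured web page data, including the following steps:
Encapsulate an extraction template, then judge from the extraction template whether the structure of the target web page has changed. If it has not changed, find the data in the target web page by the data paths in the extraction template. If it has changed, calculate the similarity between the designated area of the extraction template and all areas of the target web page, take the area with the highest similarity as the candidate area, and map the data items within the candidate area: the node of each data item is compared with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity.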
较佳的,抽取模板的封装过程可描述为输入要抽取信息的网址、要抽取数据的标注信息及抽取模板的命名,系统将根据标注内容找到网页上相应的区块, 将该区块的特征值及标签信息封装成抽取模板存储为特定格式的文件。封装抽取模板之后的步骤为web数据自适应抽取过程,可描述为首先从抽取模板中读取所需的模板信息,包括旧版页面特定区域需要抽取的信息以及该区域的特征属性,用已有的爬虫工具将网页解析成DOM树并获取网页的特征属性,根据抽取模板指定的区域路径找到当前网页该路径下的区域,计算其相似度,判断两区域是否相似,若相似则网页结构未改变,提取指定信息;若相似度小于指定阈值,则网页结构改变,进行新旧网页的自适应匹配。其中,新旧网页的自适应匹配过程可分为两个阶段:目标区域匹配和区域内数据项映射。这两个阶段都包含了路径相似度计算、结构相似度计算以及文本相似度计算,从这三个方面综合计算节点间的相似度,提高自适应的准确率。
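In this embodiment, encapsulating the extraction template specifically comprises the following steps:
Step S11: input the target web page, the data to be extracted and the name of the extraction template; the system calls a JS script to extract the information of all nodes in the page and parses it into a DOM tree;
Step S12: find, from the input annotation information, the specified subtree of the DOM tree that contains the data to be extracted;
Step S13: crawl the information of this subtree and save it as a file in a specific format, Template=<Json,DOMTree>, where Json is the structured representation of the data to be extracted from the specific region of the web page and DOMTree is the DOM subtree of that region.
Preferably, the algorithm of the JS script called by the system in step S11 is shown in Fig. 2.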
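In this embodiment, in step S13, Json is expressed as:
Json=<name_1:value_1,name_2:value_2,...,name_n:value_n>;
where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
DOMTree is expressed as:
DOMTree=<Node_1,Node_2,…,Node_n>;
where Node_i is a node of the tree and Node_1 is the root node of the subtree;
a node Node in the DOM tree is expressed as:
Node=<tag,Father,Child,xpath,text,Attri>;
where tag is the tag name of the node, Father is its parent node, Child is its list of child nodes, xpath is its path, text is its text content, and Attri is its feature attributes;
the feature attributes Attri of a node are expressed as:
Attri=<id,class,x,y,w,h>;
where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance between the node and the left edge of the page, y is the distance between the node and the top of the web page, w is the width of the region the node occupies in the web page, and h is the height of that region;
the path xpath of a node Node is expressed as a sequence:
path=</tag_1[x_1]/tag_2[x_2]/…/tag_n[x_n]>;
where tag_i is the tag name at the i-th step of the path and x_i indicates that the node at that step is the x_i-th node among those on the same level of the DOM tree.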
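The template and node tuples defined above map naturally onto small record types. The following is a minimal sketch, not the patent's implementation: the class and field names (`Attri.cls`, `Node.father`, `Template.dom_tree`) and the helper `parse_xpath` are illustrative assumptions, and a step without an explicit index is assumed to mean index 1.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Attri:
    """Feature attributes <id, class, x, y, w, h> of a node."""
    id: str = ""
    cls: str = ""   # "class" is a reserved word in Python
    x: float = 0.0  # distance from the left edge of the page
    y: float = 0.0  # distance from the top of the page
    w: float = 0.0  # width of the occupied region
    h: float = 0.0  # height of the occupied region


@dataclass
class Node:
    """Node = <tag, Father, Child, xpath, text, Attri>."""
    tag: str
    xpath: str
    text: str = ""
    attri: Attri = field(default_factory=Attri)
    father: Optional["Node"] = None
    child: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    """Template = <Json, DOMTree>; dom_tree[0] is the subtree root."""
    json: Dict[str, str]
    dom_tree: List[Node]


_STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def parse_xpath(xpath: str) -> List[Tuple[str, int]]:
    """Split '/tag_1[x_1]/.../tag_n[x_n]' into (tag, index) pairs."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported path step: {part!r}")
        # Assumption: a missing index means the first node on that level.
        steps.append((match.group(1), int(match.group(2) or 1)))
    return steps
```

Parsing the path into `(tag, index)` pairs keeps the tag sequence and the level index separate, which is what the path similarity of step S21 compares.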
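Preferably, once the extraction template has been obtained, the required data can be extracted from a web page by inputting the template name and the URL of the target web page. The adaptive web data extraction process can be divided into three steps:
1. Read the JSON string and all subtree node information from the extraction template and parse them into a DOM tree; call the JS script to extract the information of all nodes in the target page and parse it into a DOM tree.
2. Locate, by the path of the root node of the DOM tree generated from the extraction template, the subtree under that path in the target page, and determine whether the structure of the two subtrees has changed. If the similarity is greater than the specified threshold, the web page structure has not changed and the data in the target web page is found by the data paths in the extraction template; if the similarity is less than the specified threshold, the adaptive matching stage begins.
3. The adaptive stage first calculates the similarity between the region specified by the extraction template and all regions of the target web page. The similarity calculation covers path similarity, structural similarity and text similarity, and the total similarity is their weighted average. The region with the highest similarity is taken as the candidate region, and the data items within the region are then mapped: the node corresponding to each data item is compared with every node of the target web page whose text content is not empty, again by path, structural and text similarity combined as a weighted average, and each data item is mapped to the node with the highest similarity.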
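In this embodiment, determining from the extraction template whether the structure of the target web page has changed specifically comprises:
reading the JSON string and all subtree node information from the extraction template and parsing them into a DOM tree, then calling the JS script to extract the information of all nodes in the target page and parsing it into a DOM tree;
locating, by the path of the root node of the DOM tree generated from the extraction template, the subtree under that path in the target page, and determining whether the structure of the two subtrees has changed: if the similarity of the two subtrees is greater than the specified threshold, the structure of the target web page has not changed; otherwise the structure of the target web page is considered to have changed.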
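The change check reduces to a lookup followed by a threshold comparison. The sketch below assumes the target page has already been indexed as a mapping from xpath to subtree; the function name `structure_changed`, the `page_subtrees` mapping, the pluggable `similarity` callable and the default threshold 0.8 are all hypothetical choices, since the text leaves the threshold unspecified.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def structure_changed(
    template_subtree: T,
    root_xpath: str,
    page_subtrees: Dict[str, T],
    similarity: Callable[[T, T], float],
    threshold: float = 0.8,  # assumed value, not given in the text
) -> bool:
    """Return True when the page structure is considered changed."""
    current = page_subtrees.get(root_xpath)
    if current is None:
        # Nothing exists at the stored path any more: treat as changed.
        return True
    # Unchanged only when the similarity is strictly greater than the threshold.
    return not similarity(template_subtree, current) > threshold
```

When the function returns False the stored data paths can be used directly; when it returns True the adaptive matching stage takes over.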
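In this embodiment, for target region matching, a web page can be structurally divided into several regions, that is, the DOM tree of the page is divided into several subtrees. The extraction template stores the feature attributes of the specified region, crawled from the web page in advance, together with the data structure to be extracted. Region similarity comparison computes the similarity between all the feature values and attributes of the specified region stored in the extraction template and those of every region of the input web page; the region with the highest similarity is regarded as the specified region after the web page has been updated.
Specifically, calculating the similarity between the region specified by the extraction template and all regions of the target web page, and taking the region with the highest similarity as the candidate region, comprises the following steps: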
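Step S21: determine the path similarity between the specified region and each region of the target web page. Observation of a large number of web pages before and after updates shows that even when the page structure changes, most sub-blocks of a page only move near their original positions, so the path similarity of two regions can serve as one indicator of region similarity. The DOM tree paths path of the root nodes of the two regions are taken as the two variables of the formula. A web page DOM tree path is usually regarded as the sequence of tags from the root node to a leaf node. The traditional DOM tree path matching model computes the similarity of path sequences by path matching alone: it considers only sequence matching and ignores where the tree path occurs in the DOM tree, which does not reflect reality, so the computed similarity cannot truly and effectively reflect the actual similarity. This embodiment therefore proposes an improved path similarity calculation. For two tree paths:
xpath_i=</tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
xpath_tar=</tagName_1[x_1]/tagName_2[x_2]/.../tagName_n[x_n]>,
the DOM tree path similarity between them is defined as:
sim(xpath_i,xpath_tar)=st(xpath_i,xpath_tar)*ω_1+sp(xpath_i,xpath_tar)*(1-ω_1);
where
Figure PCTCN2020101247-appb-000001
denotes the tag sequence similarity of the tree paths, path_i(tagName_i)∩path_tar(tagName_j) denotes the length of the longest common tag sequence of the two paths starting from the root node, and len(path_i) denotes the length of the tag sequence of path path_i;
Figure PCTCN2020101247-appb-000002
denotes the position similarity of the two tree paths; the term in it counts the nodes in the longest common tag sequence starting from the root node that have the same level index.
The path similarity consists mainly of the two parts st(path_i,path_tar) and sp(path_i,path_tar), which capture the tag sequence and the position information of path similarity respectively. ω_1 is the weight between them, in the range 0 to 1; changing ω_1 adjusts the relative importance of the two parts in the path similarity.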
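The exact expressions for st and sp are given only as figures, so the following is a sketch under stated assumptions rather than the patent's formula: st is taken to be the length of the common tag prefix divided by the length of the longer path, and sp is taken to be the share of nodes in that common prefix whose level index also matches. The function name `path_similarity` and the default weight 0.5 are likewise assumptions.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (tag name, index among nodes on the same level)


def path_similarity(path_i: List[Step], path_tar: List[Step], w1: float = 0.5) -> float:
    """sim = st * w1 + sp * (1 - w1) for two parsed DOM paths."""
    common = 0      # length of the common tag sequence starting at the root
    same_index = 0  # nodes in that sequence with an identical level index
    for (tag_a, idx_a), (tag_b, idx_b) in zip(path_i, path_tar):
        if tag_a != tag_b:
            break
        common += 1
        same_index += idx_a == idx_b
    longest = max(len(path_i), len(path_tar))
    # Assumed normalisations; the source gives these formulas only as images.
    st = common / longest if longest else 0.0
    sp = same_index / common if common else 0.0
    return st * w1 + sp * (1 - w1)
```

With these assumptions, a block that keeps its tag chain but shifts from `div[2]` to `div[3]` loses some position score while keeping its tag-sequence score, which matches the observation that updated blocks tend to stay near their original position.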
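Step S22: determine the structural similarity between the specified region and each region of the target web page. Inter-region structural similarity considers both the virtual structure and the real structure, that is, the structure of the DOM tree and the visual structure of the web page, and consists of two parts: tree structure similarity, and the coordinates and size of the region in the web page. Tree structure similarity covers whether the parent nodes are the same, a comparison of the total number of nodes in the trees, and a comparison of the DOM tree heights; the coordinates and size of a region in the web page cover the heights and widths of the two regions, their distances from the top of the page and their distances from the left side of the page. For two regions, the inter-region structural similarity is defined as:
sim(treestru_i,treestru_tar)=st(T_i,T_tar)*ω+sp(T_i,T_tar)*(1-ω);
where
Figure PCTCN2020101247-appb-000003
denotes the similarity of the web page DOM tree structures, equal(root_i,root_tar) indicates whether the root nodes of the two regions are the same, T_i(node) denotes the total number of nodes contained in T_i, and H(T_i) denotes the tree height of T_i, i.e. the number of node levels of that DOM tree. ω_i (i=1,2,3) are the weights among these terms, each in the range 0 to 1.
And
Figure PCTCN2020101247-appb-000004
denotes the similarity of the size and coordinates that the two regions occupy in the whole page, height(T_i) denotes the height of the region, width(T_i) denotes the width of the region, top(T_i) denotes the distance of the region represented by T_i from the top of the page, and left(T_i) denotes the distance of the region represented by T_i from the left side of the page. ω_i (i=1,2,3,4) are the weights among these terms, each in the range 0 to 1.
The inter-region structural similarity consists mainly of the two parts st(T_i,T_tar) and sp(T_i,T_tar), which capture the DOM tree structure information and the graphical interface layout information of structural similarity respectively. ω is the weight between them, in the range 0 to 1; changing ω adjusts the relative importance of the two parts in the structural similarity.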
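The two component formulas are again available only as figures. The sketch below assumes each numeric comparison is a min/max ratio and that the root-node check compares tag names; the `Region` record, the helper `ratio` and all default weights are hypothetical, chosen only so that the weights in each group sum to 1.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Region:
    root_tag: str      # tag of the region's root node (assumed comparison key)
    node_count: int    # T(node): total number of nodes in the subtree
    tree_height: int   # H(T): number of node levels
    height: float
    width: float
    top: float
    left: float


def ratio(a: float, b: float) -> float:
    """min/max ratio in [0, 1]; two zeros count as identical."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def region_structure_similarity(
    r_i: Region,
    r_tar: Region,
    w: float = 0.5,
    tree_w: Sequence[float] = (0.4, 0.3, 0.3),
    layout_w: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """sim = st * w + sp * (1 - w), with assumed ratio-based terms."""
    st = (
        tree_w[0] * float(r_i.root_tag == r_tar.root_tag)
        + tree_w[1] * ratio(r_i.node_count, r_tar.node_count)
        + tree_w[2] * ratio(r_i.tree_height, r_tar.tree_height)
    )
    sp = (
        layout_w[0] * ratio(r_i.height, r_tar.height)
        + layout_w[1] * ratio(r_i.width, r_tar.width)
        + layout_w[2] * ratio(r_i.top, r_tar.top)
        + layout_w[3] * ratio(r_i.left, r_tar.left)
    )
    return st * w + sp * (1 - w)
```

A ratio-based comparison keeps every term in the range 0 to 1, so the weighted sums stay comparable with the path and text similarities.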
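Step S23: determine the text similarity between the specified region and each region of the target web page. Text similarity is another measure of inter-region similarity. This embodiment uses the synonym thesaurus Tongyici Cilin (同义词词林) to compute the similarity between words. The thesaurus classifies words semantically and organizes them into a five-level tree structure, with each synonym unit given an eight-character code. The structure includes synonymy, hypernymy and hyponymy relations. At the fifth level the words are grouped, and one character is appended to the end of the code to mark whether the words of the group are synonyms ("="), related ("#"), or whether the group contains only one word ("@"). Using this encoding rule, this embodiment uses the following algorithm to compute the similarity of Chinese texts. The text within a region can be regarded as a sentence, and a sentence consists of several words, so computing text similarity is essentially computing sentence similarity. This embodiment can therefore obtain the text similarity from the word-to-text similarity sim(w,text_tar) using the following formulas:
sim(w,text)=max(sim(w,word_1),...,sim(w,word_k)),
Figure PCTCN2020101247-appb-000005
where w is a word and text is all the text within a region, containing k words; sim(w,word_i) is the similarity of two words. text_i and text_tar are all the text within the two regions, defined as text_i={w_i,1,w_i,2,...,w_i,m} and text_tar={w_tar,1,w_tar,2,...,w_tar,n}, where m and n are the numbers of words into which text_i and text_tar are split. The text similarity contains two measures: the similarity of the text content and a comparison of the lengths of all the text in the two regions. ω is the weight between them, in the range 0 to 1; changing ω adjusts the relative importance of the two parts in the text similarity.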
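The word-to-text maximum is stated explicitly, but the combined formula is only a figure. The sketch below assumes the content term is the average, over the words of text_i, of their best match in text_tar, and that the length term is the ratio of the shorter word count to the longer. The thesaurus lookup is replaced by a pluggable `word_sim` callable whose default, `exact_match`, is only a stand-in and not the thesaurus-based measure.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Stand-in word similarity; a thesaurus-based measure would go here."""
    return 1.0 if a == b else 0.0


def word_to_text(word: str, text: Sequence[str], word_sim: WordSim = exact_match) -> float:
    """sim(w, text) = max over the words of text of sim(w, word_k)."""
    return max((word_sim(word, other) for other in text), default=0.0)


def text_similarity(
    text_i: Sequence[str],
    text_tar: Sequence[str],
    w: float = 0.5,
    word_sim: WordSim = exact_match,
) -> float:
    """Weighted mix of content similarity and length comparison (assumed forms)."""
    if not text_i or not text_tar:
        return 0.0  # assumption: an empty text matches nothing
    content = sum(word_to_text(x, text_tar, word_sim) for x in text_i) / len(text_i)
    length = min(len(text_i), len(text_tar)) / max(len(text_i), len(text_tar))
    return content * w + length * (1 - w)
```

Both inputs are expected to be already segmented into words; word segmentation of Chinese text is outside this sketch.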
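Step S24: for each region of the target web page, combine the inter-region path similarity, structural similarity and text similarity using preset weights to obtain the total similarity between that region and the specified region, and select the region with the highest total similarity as the candidate region. The total similarity is computed as:
sim(tree_i,tree_tar)=sim(xpath_i,xpath_tar)*ω_1
+sim(treestru_i,treestru_tar)*ω_2
+sim(text_i,text_tar)*ω_3
Through the above calculation, this embodiment obtains the region most similar to the target region and treats it as the suspected target region. If its similarity is greater than a certain threshold, the next step, matching the data items within the region, is carried out; if the similarity is less than the threshold, the target region could not be found in the updated web page. First, the nodes within the region whose text content is not empty are taken to form the set of nodes to be matched, defined as:
Items=<node_1,node_2,...,node_k>;
The set contains k nodes node_i. Fig. 3 shows the data item matching algorithm of this embodiment.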
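The region selection of step S24, a weighted sum followed by a threshold test on the best score, can be sketched as follows. The function names, the default weights (0.4, 0.3, 0.3) and the threshold 0.6 are assumptions; the text only states that the weights are preset and that a threshold is applied.

```python
from typing import Dict, Optional, Tuple

Scores = Tuple[float, float, float]  # (path, structure, text) similarities


def total_similarity(path_sim: float, stru_sim: float, text_sim: float,
                     weights: Scores = (0.4, 0.3, 0.3)) -> float:
    """sim(tree_i, tree_tar) as the weighted sum of the three similarities."""
    w1, w2, w3 = weights
    return path_sim * w1 + stru_sim * w2 + text_sim * w3


def select_candidate(scores: Dict[str, Scores], threshold: float = 0.6,
                     weights: Scores = (0.4, 0.3, 0.3)) -> Optional[str]:
    """Return the key of the best-scoring region, or None if it fails the threshold."""
    best_key, best = None, float("-inf")
    for key, parts in scores.items():
        total = total_similarity(*parts, weights=weights)
        if total > best:
            best_key, best = key, total
    if best_key is None or not best > threshold:
        return None  # target region not found in the updated page
    return best_key
```

Returning None for a below-threshold best match corresponds to the case in which the target region cannot be found in the updated page.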
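In this embodiment, mapping the data items within the candidate region, in which the node corresponding to each data item is compared for similarity with every node of the target web page whose text content is not empty and each data item is mapped to the node with the highest similarity, specifically comprises the following steps:
Step S21: calculate the path similarity between the data items of the specified region and those of the candidate region. The path similarity between data items is calculated with the formula constructed earlier; the difference is that the parameter path here is the path within the region rather than the path in the whole web page, that is:
path=xpath-xpath_root
where xpath_root is the path of the root node of the region. The data item path similarity is then calculated as:
Figure PCTCN2020101247-appb-000006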
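The subtraction path=xpath-xpath_root can be read as removing the region root's path as a prefix. A minimal sketch under that reading follows; the function name `relative_path` and the decision to raise an error for a node outside the region are assumptions.

```python
def relative_path(xpath: str, root_xpath: str) -> str:
    """Return the part of xpath below the region root (path = xpath - xpath_root)."""
    root = root_xpath.rstrip("/")
    if xpath == root:
        return ""  # the node is the region root itself
    # Require a full step boundary so that div[2] does not match div[20].
    if not xpath.startswith(root + "/"):
        raise ValueError(f"{xpath!r} is not inside region {root_xpath!r}")
    return xpath[len(root):]
```

Comparing relative paths means a data item keeps a high path similarity when the whole region moves within the page but its internal layout stays the same.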
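Step S22: calculate the structural similarity between the data items of the specified region and those of the candidate region. The structural similarity between data items mainly considers the tag attributes in the page and the relative position within the region. The tag attributes cover whether the tag names are the same, whether the tag ids are the same and, in the CSS style, whether the font family, the font size and the color are the same; the relative position within the region covers a comparison of the distances from the top of the page and of the distances from the left side of the page. The structural similarity of two data items is defined as:
Figure PCTCN2020101247-appb-000007
where
Figure PCTCN2020101247-appb-000008
denotes the similarity of the tag attributes of the data items: equal(tagName_i,tagName_tar) indicates whether the tag names are the same (1 if so, otherwise 0); equal(id_i,id_tar) indicates whether the tag ids are the same (1 if so, otherwise 0); equal(font-family_i,font-family_tar), equal(font-size_i,font-size_tar) and equal(font-color_i,font-color_tar) indicate whether the font family, the font size and the color are the same (1 if so, otherwise 0). ω_i are the weights among these terms, each in the range 0 to 1.
Figure PCTCN2020101247-appb-000009
denotes the relative position of the data items within the region, based on the distance of a data item from the top of the page and its distance from the left side of the page; the two are equally important, so each is given a weight of 0.5.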
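The three formulas of this step are available only as figures, so the sketch below rests on several assumptions: the overall form is taken to be st*ω+sp*(1-ω) by analogy with the region-level definition, the five equality indicators are weighted equally at 0.2 each, and the two position terms are min/max ratios weighted 0.5 each as the text states. The `Item` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Item:
    tag: str
    node_id: str
    font_family: str
    font_size: str
    font_color: str
    top: float   # distance from the top, as described in the text
    left: float  # distance from the left side


def ratio(a: float, b: float) -> float:
    """min/max ratio in [0, 1]; two zeros count as identical."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi


def item_structure_similarity(
    a: Item,
    b: Item,
    w: float = 0.5,
    attr_w: Sequence[float] = (0.2, 0.2, 0.2, 0.2, 0.2),
) -> float:
    """Assumed form: st * w + sp * (1 - w)."""
    flags = (
        a.tag == b.tag,
        a.node_id == b.node_id,
        a.font_family == b.font_family,
        a.font_size == b.font_size,
        a.font_color == b.font_color,
    )
    # equal(...) is 1 when the two values are the same and 0 otherwise.
    st = sum(wt * float(flag) for wt, flag in zip(attr_w, flags))
    # Top and left are equally important, so each has weight 0.5.
    sp = 0.5 * ratio(a.top, b.top) + 0.5 * ratio(a.left, b.left)
    return st * w + sp * (1 - w)
```

Because ids are often regenerated when a page is redesigned, a lower weight for the id indicator may be a reasonable tuning choice, although the text does not say so.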
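Step S23: calculate the text similarity between the data items of the specified region and those of the candidate region. The text similarity of data items is likewise computed with the formula defined above; the difference is that the text content text contains only the text of a single data item rather than all the text within the whole region.
sim(w,nodetext)=max(sim(w,word_1),...,sim(w,word_s)),
Figure PCTCN2020101247-appb-000010
where nodetext is the text contained in a single node, consisting of s words, and sim(w,word_i) is the similarity of two words. nodetext_i and nodetext_tar are the texts within the two nodes, defined as nodetext_i={w_i,1,w_i,2,...,w_i,p} and nodetext_tar={w_tar,1,w_tar,2,...,w_tar,q}, where p and q are the numbers of words into which nodetext_i and nodetext_tar are split. The text similarity contains two measures: the similarity of the text content and a comparison of the lengths of the texts of the two nodes. ω is the weight between them, in the range 0 to 1; changing ω adjusts the relative importance of the two parts in the text similarity.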
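Step S24: for each data item in the specified region, combine the path similarity, structural similarity and text similarity from steps S21 to S23 using preset weights to obtain the total similarity between that data item and each data item in the candidate region, and select the candidate-region data item with the highest total similarity as the one corresponding to that data item of the specified region. The similarity between every data item in the region and the data items of the specific region specified by the configuration file is calculated; combining the three measures obtained above with given weights yields the total similarity:
sim(node_i,node_tar)=sim(path_i,path_tar)*ω_1+sim(nodestru_i,nodestru_tar)*ω_2+sim(nodetext_i,nodetext_tar)*ω_3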
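The mapping itself is an argmax per template data item over the candidate nodes. The sketch below is generic over the item type and takes the three component similarities as callables; the function name `map_items`, the tuple-of-callables interface and the default weights are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Sim = Callable[[T, T], float]


def map_items(
    template_items: Dict[str, T],
    candidates: Sequence[T],
    sims: Tuple[Sim, Sim, Sim],  # (path, structure, text) similarity functions
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),
) -> Dict[str, Optional[T]]:
    """Map each template data item to its most similar candidate node."""

    def total(a: T, b: T) -> float:
        # sim(node_i, node_tar) as the weighted sum of the three measures.
        return sum(w * f(a, b) for w, f in zip(weights, sims))

    return {
        name: max(candidates, key=lambda c: total(item, c), default=None)
        for name, item in template_items.items()
    }
```

Each template item is matched independently here, so two template items could map to the same node; the text does not say whether the matching algorithm of Fig. 3 enforces a one-to-one assignment.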
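In particular, to better illustrate the effect of this embodiment, Fig. 4 shows an example of a web page before and after an update. The structure of the page has changed considerably: the position and size of the target region have both changed, and the data content to be extracted has also changed somewhat. The method of this embodiment can nevertheless still locate the target region and output the one-to-one correspondence of the data items. A traditional web data extraction algorithm cannot locate the required target block after the page structure changes, let alone find the correspondence between the data items of the old and new versions of the page, which hinders large-scale data extraction. This embodiment aims to monitor changes to the web page in real time and adaptively adjust the extraction template to follow page updates.
The feasibility of the method is discussed using this example, in which the doctor's information is to be extracted from the web page. First, the URL of the website, the annotated JSON data {'推荐热度(综合):':'3.5','感谢信:':'1','礼物:':'0','科室:':'西南医科大学附属医院眼科','擅长:':'角膜病、角膜屈光手术、葡萄膜疾病、屈光不正的矫正及防治,眼部激光检查及治疗','简介:':'郑洋,女,副主任医师,副教授,医学硕士,中华医学会会员,从事临床医疗、教学和科研工作近10余年。专业方...'} and the template name "doctor" are input into the system, which runs and produces the corresponding extraction template. When the information of this region is to be extracted, the template name and the URL of the web page are input. The system parses the web page into a DOM tree that stores the information of all nodes of the page, then locates the region under the region path specified by the extraction template in the current web page and calculates its similarity to determine whether the two regions are similar. If they are, the web page structure has not changed and the specified information is extracted; if the similarity is less than the specified threshold, the web page structure has changed, and adaptive matching between the old and new web pages is performed and the extraction template is updated. The information extracted from the web page of Fig. 4 before and after the update is shown in Fig. 5, where (a) is the data extracted before the update and (b) is the data extracted after the update. The figure shows that the method of this embodiment can still extract the data effectively when the structure of the web page has changed considerably.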
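In summary, when creating an extraction template the method proposed in this embodiment not only defines the corresponding extraction rules but also defines adaptive matching rules based on the text features, HTML tag features, visual features and DOM tree structure features of the page data. A web page is matched against the corresponding extraction template, and after a successful match the data is extracted according to the extraction rules; if the page has changed and the xpath expression no longer works, the data is searched for again according to the adaptive matching rules and the xpath is updated. Experimental results show that the method has high accuracy and effectively reduces manual intervention in the extraction process.
The above is only a preferred embodiment of the present invention and does not limit the invention in any other form. Any person skilled in the art may use the technical content disclosed above to make changes or modifications resulting in equivalent embodiments. However, any simple modification, equivalent change or adaptation made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.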

Claims (6)

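  1. A method for adaptive extraction of structured web page data, characterized by comprising the following steps:
    encapsulating an extraction template and determining from the extraction template whether the structure of a target web page has changed; if it has not changed, finding the data in the target web page by the data paths in the extraction template; if the structure of the target web page has changed, calculating the similarity between the region specified by the extraction template and all regions of the target web page, taking the region with the highest similarity as the candidate region, and mapping the data items within the candidate region, wherein the node corresponding to each data item is compared for similarity with every node of the target web page whose text content is not empty, and each data item is mapped to the node with the highest similarity.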
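  2. The method for adaptive extraction of structured web page data according to claim 1, characterized in that encapsulating the extraction template specifically comprises the following steps:
    Step S11: inputting the target web page, the data to be extracted and the name of the extraction template, the system calling a JS script to extract the information of all nodes in the page and parsing it into a DOM tree;
    Step S12: finding, from the input annotation information, the specified subtree of the DOM tree that contains the data to be extracted;
    Step S13: crawling the information of this subtree and saving it as a file in a specific format, Template=<Json,DOMTree>, where Json is the structured representation of the data to be extracted from the specific region of the web page and DOMTree is the DOM subtree of that region.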
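  3. The method for adaptive extraction of structured web page data according to claim 2, characterized in that, in step S13, Json is expressed as:
    Json=<name_1:value_1,name_2:value_2,...,name_n:value_n>;
    where name_i is the name of a data item to be extracted and value_i is the data value corresponding to that name;
    DOMTree is expressed as:
    DOMTree=<Node_1,Node_2,…,Node_n>;
    where Node_i is a node of the tree and Node_1 is the root node of the subtree;
    a node Node in the DOM tree is expressed as:
    Node=<tag,Father,Child,xpath,text,Attri>;
    where tag is the tag name of the node, Father is its parent node, Child is its list of child nodes, xpath is its path, text is its text content, and Attri is its feature attributes;
    the feature attributes Attri of a node are expressed as:
    Attri=<id,class,x,y,w,h>;
    where id is the page id of the node's tag, class is the class name of the node's tag, x is the distance between the node and the left edge of the page, y is the distance between the node and the top of the web page, w is the width of the region the node occupies in the web page, and h is the height of that region;
    the path xpath of a node Node is expressed as a sequence:
    path=</tag_1[x_1]/tag_2[x_2]/…/tag_n[x_n]>;
    where tag_i is the tag name at the i-th step of the path and x_i indicates that the node at that step is the x_i-th node among those on the same level of the DOM tree.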
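  4. The method for adaptive extraction of structured web page data according to claim 1, characterized in that determining from the extraction template whether the structure of the target web page has changed specifically comprises:
    reading the JSON string and all subtree node information from the extraction template and parsing them into a DOM tree, then calling a JS script to extract the information of all nodes in the target page and parsing it into a DOM tree;
    locating, by the path of the root node of the DOM tree generated from the extraction template, the subtree under that path in the target page, and determining whether the structure of the two subtrees has changed: if the similarity of the two subtrees is greater than a specified threshold, the structure of the target web page has not changed; otherwise the structure of the target web page is considered to have changed.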
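  5. The method for adaptive extraction of structured web page data according to claim 1, characterized in that calculating the similarity between the region specified by the extraction template and all regions of the target web page, and taking the region with the highest similarity as the candidate region, specifically comprises the following steps:
    Step S21: determining the path similarity between the specified region and each region of the target web page;
    Step S22: determining the structural similarity between the specified region and each region of the target web page;
    Step S23: determining the text similarity between the specified region and each region of the target web page;
    Step S24: for each region of the target web page, combining the inter-region path similarity, structural similarity and text similarity using preset weights to obtain the total similarity between that region and the specified region, and selecting the region with the highest total similarity as the candidate region.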
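  6. The method for adaptive extraction of structured web page data according to claim 1, characterized in that mapping the data items within the candidate region, wherein the node corresponding to each data item is compared for similarity with every node of the target web page whose text content is not empty and each data item is mapped to the node with the highest similarity, specifically comprises the following steps:
    Step S21: calculating the path similarity between the data items of the specified region and those of the candidate region;
    Step S22: calculating the structural similarity between the data items of the specified region and those of the candidate region;
    Step S23: calculating the text similarity between the data items of the specified region and those of the candidate region;
    Step S24: for each data item in the specified region, combining the path similarity, structural similarity and text similarity from steps S21 to S23 using preset weights to obtain the total similarity between that data item and each data item in the candidate region, and selecting the candidate-region data item with the highest total similarity as the one corresponding to that data item of the specified region.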
PCT/CN2020/101247 2019-11-29 2020-07-10 Method for adaptive extraction of structured web page data WO2021103557A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911196582.4 2019-11-29
CN201911196582.4A CN110968761B (zh) 2019-11-29 2019-11-29 Method for adaptive extraction of structured web page data

Publications (1)

Publication Number Publication Date
WO2021103557A1 true WO2021103557A1 (zh) 2021-06-03

Family

ID=70032195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/101247 WO2021103557A1 (zh) 2019-11-29 2020-07-10 Method for adaptive extraction of structured web page data

Country Status (2)

Country Link
CN (1) CN110968761B (zh)
WO (1) WO2021103557A1 (zh)

Also Published As

Publication number Publication date
CN110968761A (zh) 2020-04-07
CN110968761B (zh) 2022-07-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20894462

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20894462

Country of ref document: EP

Kind code of ref document: A1