CN101984434B - Webpage data extracting method based on extensible language query - Google Patents


Publication number
CN101984434B
CN101984434B CN201010545520A CN201010545520A CN101984434B CN 101984434 B CN101984434 B CN 101984434B CN 201010545520 A CN201010545520 A CN 201010545520A CN 201010545520 A CN201010545520 A CN 201010545520A CN 101984434 B CN101984434 B CN 101984434B
Authority
CN
China
Prior art keywords
node
attribute
data
path
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010545520A
Other languages
Chinese (zh)
Other versions
CN101984434A (en)
Inventor
聂铁铮
于戈
王波涛
岳德君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A web page data extraction method based on extensible language (XML) queries, belonging to the field of computer database technology, comprises the following steps: determining the schema structure corresponding to the data content to be extracted from a Web page; locating the data region, data units and attribute texts in the Web page; semantically annotating the attribute texts; generating the data unit node paths; computing the path expressions for extracting attribute values; generating the XML query statement for data extraction; and extracting data with the XML query statement. The invention generates precise XML query statements and ensures their correctness; it is highly general and integrates seamlessly with existing systems; and it accommodates more complex query result output.

Figure 201010545520

Description

Web page data extraction method based on Extensible Markup Language (XML) queries
Technical field
The invention belongs to the field of computer database technology, and in particular relates to a web page data extraction method based on XML queries.
Background technology
With the continuous development of the Web, the amount of data on the Web is growing rapidly, and applications demand ever more Web data. Although the Web contains a large amount of structured and semi-structured data, this data is mainly presented to users through browsers as HyperText Markup Language (HTML) and is difficult to use directly in applications such as data mining and data integration. Extracting structured and semi-structured data from large numbers of Web pages efficiently and accurately is therefore increasingly important. Typical Web data extraction methods fall into three classes: methods based on the HTML tag tree or Document Object Model (DOM) tree; methods based on page structure; and methods based on visual information. Methods based on the HTML tag tree or DOM tree mainly include XWRAP, RoadRunner, Lixto, MDR and MDRII; methods based on page structure mainly include NoDoSE, DEByE and SG-WRAP; methods based on visual information are represented mainly by ViDRE.
Extracting data records from a page based on the HTML tag tree or DOM tree is a fairly common approach: before extraction, the Web page is first converted into a DOM tree, and data is then extracted from it using the structural features of the DOM tree and automatic or semi-automatic tag-based extraction rules. Methods based on page structure first specify the structure of the part of the page that contains data, and then search the page for similar parts according to this structure as the extraction result. Such methods work well on structurally simple pages; however, if the DOM tree of the page is structurally complex and the data region contains too many noise nodes, the results are poor, and nested data structures cannot be recognized.
Techniques that extract data from a page based on visual information mainly exploit the positional conventions of Web page design, that is, where users habitually look for content, and extract data from the corresponding positions. ViDRE, proposed by Microsoft Research Asia, is an extraction method based on visual features that to some extent simulates how the human eye recognizes a page in order to identify object information. However, when a page has no obvious visual features, the efficiency of vision-based extraction drops sharply; moreover, vision-based methods are suited to extracting data from a single page, and their efficiency is very low for large numbers of pages that share the same structure but contain different data.
The above methods are suitable only for pages containing simple data structures. If the data in a page is hierarchical, the extraction result is difficult to represent or attributes are lost, so these methods cannot readily handle page content with complex data structures. Furthermore, these methods generate the extracted result data directly after initialization, so an attribute recognition error is difficult to correct in time. Finally, these methods run largely in isolation and are difficult to combine with existing database systems, and therefore lack unified management of web page data.
Summary of the invention
To remedy the deficiencies of the above methods, the present invention provides a web page data extraction method based on XML queries.
The technical solution of the invention is realized as follows. The web page data extraction method based on XML queries comprises the following steps:
Step 1: Determine the schema structure corresponding to the data content to be extracted from the Web page.
The schema structure is one of two kinds: a relational table structure or a hierarchical structure. The data schema $S$ of a table structure consists of a data entity name $E$ and a set of attributes $A=\{A_1,\dots,A_n\}$, where $A_i$ ($1 \le i \le n$) denotes one attribute in the set and $n$ is the number of attributes. Each attribute consists of an attribute name and a data type and is written $A_i=\langle N, Type\rangle$, where $N$ is the attribute name and $Type$ is the attribute data type; the data type is one of integer, floating point (float) and character string (string). A hierarchical structure is a complex data structure composed of the basic types; its corresponding data schema is written $S_i'$ and contains the attributes $\{A'_{i1},\dots,A'_{ix}\}$, where $x$ is the number of attributes in schema $S_i'$.
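The two schema kinds of Step 1 can be pictured with a small data model. The following Python sketch is only an illustration: the class names `Attribute` and `Schema`, the method `is_hierarchical`, and the sample attribute types are assumptions introduced here, not part of the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Attribute:
    """One schema attribute A_i = <N, Type>."""
    name: str
    # "integer", "float" or "string"; a nested Schema models the hierarchical case.
    type: Union[str, "Schema"]


@dataclass
class Schema:
    """Data schema S: an entity name E plus a list of attributes."""
    entity: str
    attributes: List[Attribute] = field(default_factory=list)

    def is_hierarchical(self) -> bool:
        # Hierarchical when at least one attribute is itself described by a schema.
        return any(isinstance(a.type, Schema) for a in self.attributes)


# Hypothetical example values; the attribute types are assumed for illustration.
flat = Schema("book", [Attribute("title", "string"), Attribute("price", "float")])
nested = Schema("book", [
    Attribute("title", "string"),
    Attribute("authors", Schema("author", [Attribute("name", "string")])),
])
```

Modelling a nested schema as an attribute type is one possible reading of "a complex data structure composed of the basic types"; other encodings would serve equally well.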
Step 2: Locate the data region, data units and attribute texts in the Web page.
The source code of the Web page, written in HTML, is first formatted as an XML document.
The data region Da is the region enclosed by the smallest boundary in the Web page that contains all data units. It is located as the smallest subtree that contains all data units in the DOM structure corresponding to the Web page.
A data unit Du represents one data entity of the schema structure that the Web data extraction is to obtain. It is usually described by the attributes in the schema and recurs in the page according to some regular pattern. It is located as follows: in the DOM tree of the Web page, find the nodes that hold the content of each attribute of a data entity in the page; the smallest subtree containing these nodes is a data unit.
An attribute text At is the text content in the Web page that contains the value of an attribute of the data schema. Attribute values usually reside in the text nodes of element nodes in the DOM tree of the Web page. An attribute text is located by finding, in the DOM tree structure corresponding to the Web page, the node that contains the text of the attribute value.
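Both the data region and a data unit are defined as a "smallest subtree containing a set of nodes", which amounts to finding the lowest common ancestor of those nodes. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the function name `smallest_subtree_root` and the sample page with its `id` values are hypothetical and serve only to illustrate the idea.

```python
import xml.etree.ElementTree as ET


def smallest_subtree_root(root, nodes):
    """Return the root of the smallest subtree of `root` containing all `nodes`."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}

    def chain(node):
        # Ancestor chain from the document root down to the node itself.
        out = [node]
        while node in parent:
            node = parent[node]
            out.append(node)
        return out[::-1]

    common = None
    for level in zip(*(chain(n) for n in nodes)):
        if all(x is level[0] for x in level):
            common = level[0]      # still on the shared part of every chain
        else:
            break                  # chains diverge below `common`
    return common


# Hypothetical page: one advertisement block and a list holding two data units.
PAGE = ET.fromstring(
    "<body><div id='ad'/><div id='list'>"
    "<div id='u1'><h2>T1</h2><h4>A1</h4></div>"
    "<div id='u2'><h2>T2</h2><h4>A2</h4></div>"
    "</div></body>"
)
u1 = PAGE.find("div[@id='list']/div[@id='u1']")
u2 = PAGE.find("div[@id='list']/div[@id='u2']")
data_unit = smallest_subtree_root(PAGE, [u1.find("h2"), u1.find("h4")])
data_region = smallest_subtree_root(PAGE, [u1, u2])
```

Here the attribute nodes of the first record resolve to the `u1` element (a data unit), and the two data units resolve to the `list` element (the data region).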
Step 3: Semantically annotate the attribute texts located in Step 2.
The method is: for each attribute text contained in each data unit, specify the one or more data schema attributes whose values it contains.
Step 4: Generate the data unit node path, comprising the following steps:
Step 4-1: The set of data units obtained in Step 2 is written $U=\{U_1,U_2,\dots,U_y\}$, where $U_i$ ($i=1,\dots,y$) denotes one data unit.
Step 4-2: For each identified data unit $U_i$, determine its corresponding element node in the XML document of the page, denoted $N_i$; then, according to the structure of the XML document, generate for $N_i$ the path from the root node to this node, denoted $P_i$.
Step 4-3: Compute the path expression of each data unit as follows:
Take the path of a data unit node. In the path value $P_i$, each step of the path expression, that is, each node traversed from the document root node to the element node corresponding to the data unit, is located with a positional predicate. Take the tag of each node in the path expression; the paths of all data units have the same tag sequence. The tag sequence starting from the root node is denoted $T$ and contains $m$ tags $(T_1, T_2, \dots, T_m)$, where $T_1$ is the tag of the root node, and so on for the remaining tags. The positions of the nodes among their siblings with the same tag are written $(p_{i1}, \dots, p_{im})$, where $p_{i1}$ is the position of the root node, and so on. The path value is then:
Path value $P_i$ = /tag 1[position i1]/tag 2[position i2]/.../tag m[position im],
that is, $P_i = /T_1[p_{i1}]/T_2[p_{i2}]/\dots/T_m[p_{im}]$
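The positional path of Step 4-3 can be derived mechanically from a parsed document. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper names `positional_path` and `format_path` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def positional_path(root, target):
    """Return [(tag, position), ...] from the document root down to `target`."""
    parent = {child: p for p in root.iter() for child in p}
    steps = []
    node = target
    while True:
        p = parent.get(node)
        if p is None:
            steps.append((node.tag, 1))          # the root is always position 1
            break
        same_tag = [c for c in p if c.tag == node.tag]
        # 1-based position among siblings that carry the same tag.
        pos = next(i for i, c in enumerate(same_tag, 1) if c is node)
        steps.append((node.tag, pos))
        node = p
    steps.reverse()
    return steps


def format_path(steps):
    """Render the steps as /T1[p1]/T2[p2]/.../Tm[pm]."""
    return "".join(f"/{tag}[{pos}]" for tag, pos in steps)


# Hypothetical document: the target is the second div inside the second div.
DOC = ET.fromstring(
    "<html><head/><body><div/><div><div/><div id='x'/></div></body></html>"
)
target = DOC.find("body/div[2]/div[2]")
path = format_path(positional_path(DOC, target))
```

For this sample the result is `/html[1]/body[1]/div[2]/div[2]`, which has the same shape as the path values $P_i$ defined above.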
Step 4-4: For the set of data unit paths, compute the longest common path LCP starting from the root node:
The longest common path is the path formed by the nodes shared by the paths of all data unit nodes. It is computed as follows: matching starts at the first tag position, counted from the root node. If all data unit node paths have the same position value under the current tag, that is, $p_{1i}=p_{2i}=\dots=p_{yi}$, the current tag and position value are appended in order to the longest common path, that is, LCP += $/T_i[p_i]$. If the position values of the data unit node paths under the current tag differ, matching stops and the current longest common path is taken as the final longest common path.
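The matching loop of Step 4-4 can be sketched in a few lines of Python. The function names are illustrative assumptions; the three input paths are the data unit paths P1 to P3 listed in the embodiment of this document.

```python
import re


def parse_path(path):
    """Split '/a[1]/b[2]' into [('a', 1), ('b', 2)]."""
    return [(tag, int(pos))
            for tag, pos in re.findall(r"/([^/\[\]]+)\[(\d+)\]", path)]


def longest_common_path(paths):
    """Keep leading steps while every path agrees on tag and position."""
    lcp = []
    for steps in zip(*paths):
        if all(step == steps[0] for step in steps):
            lcp.append(steps[0])
        else:
            break          # first disagreement ends the common prefix
    return lcp


def format_path(steps):
    return "".join(f"/{tag}[{pos}]" for tag, pos in steps)


PATHS = [
    "/html[1]/body[1]/div[6]/div[3]/div[1]/div[4]/div[2]",
    "/html[1]/body[1]/div[6]/div[3]/div[1]/div[5]/div[2]",
    "/html[1]/body[1]/div[6]/div[3]/div[1]/div[6]/div[2]",
]
lcp = format_path(longest_common_path([parse_path(p) for p in PATHS]))
```

Matching stops at the sixth step, where the positions 4, 5 and 6 differ, so the final `div[2]` step is not part of the common path even though all three paths agree on it.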
Step 4-5: Simplify the longest common path LCP computed in Step 4-4.
Consider the node corresponding to one step of the longest common path, denoted $n_i$, with tag $T_i$. If none of its sibling nodes has the same tag and also has, along the same successor path "/tag i+1/.../tag m", a descendant that is a non-data-unit node, then the position value of this node may be omitted from the longest common path expression.
Step 4-6: Compute the local path by generating predicates. The local path is the path formed by the nodes that are private to each data unit node. Its predicate expressions locate nodes precisely, so that irrelevant nodes are filtered out while all data unit nodes are located:
Predicates are generated as follows. Let the tag of the node at the current step be $T_i$. Check whether the siblings of the node set at the current step include a node that has the same tag and also has, along the same successor path "/tag i+1/.../tag m", a descendant that is a non-data-unit node. If not, the predicate is omitted. If so, check whether the current node has an XML attribute that distinguishes it from the non-data-unit nodes meeting the above condition. If such an XML attribute exists, it is used as the predicate expression; if not, the range of position values in the predicate is computed. The non-data-unit nodes that meet the above condition are called noise nodes.
The range of position values in the predicate is computed as follows:
If noise nodes appear only before the set of data unit nodes, the position range in the predicate for this tag is: from the smallest position value among the nodes corresponding to tag $T_i$ of all data unit nodes, to the last node with this tag.
If noise nodes appear only after the set of data unit nodes, the position range in the predicate for this tag is: from the first node with this tag, to the largest position value among the nodes corresponding to tag $T_i$ of all data unit nodes.
If the data nodes are split regularly by noise nodes, compute the interval $p_{inte}$ at which the data unit nodes are split by noise nodes and the length $p_{cont}$ of each consecutive run of data unit nodes, and compute the smallest and largest position values among the nodes corresponding to tag $T_i$ of all data unit nodes, denoted pmin and pmax. A node that satisfies the following position conditions is considered to lie on a data unit path: (1) the node position value minus pmin, taken modulo $p_{inte}$, is less than $p_{cont}$; (2) the node position value minus pmax is less than the maximum noise node position value plus 1.
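The position-range rules can be illustrated with a short Python sketch. Several points are assumptions and not statements of the patent: the function names are invented, $p_{inte}$ is read as the period of the repeating pattern (one run of data unit nodes plus the noise nodes that follow it), condition (2) is not modelled and is replaced by a simple bound at pmax, and the combined range for noise on both sides is an extrapolation.

```python
def range_predicate(pmin, pmax, noise_before, noise_after):
    """Build an XPath positional predicate for a step with noise siblings."""
    if noise_before and not noise_after:
        return f"[position() >= {pmin}]"      # from pmin to the last node
    if noise_after and not noise_before:
        return f"[position() <= {pmax}]"      # from the first node to pmax
    if not noise_before and not noise_after:
        return ""                             # no noise: predicate omitted
    # Assumption: noise on both sides is bounded from both ends.
    return f"[position() >= {pmin} and position() <= {pmax}]"


def on_data_unit_path(pos, pmin, pmax, p_inte, p_cont):
    """Condition (1) for regularly interleaved noise nodes.

    p_inte is assumed to be the period of the pattern and p_cont the number
    of consecutive data unit nodes in each period.
    """
    if not pmin <= pos <= pmax:               # assumed bound, not condition (2)
        return False
    return (pos - pmin) % p_inte < p_cont


# Hypothetical layout: data, data, noise, data, data, noise, ...
selected = [p for p in range(1, 8) if on_data_unit_path(p, 1, 5, 3, 2)]
```

With pmin = 1, pmax = 5, a period of 3 and runs of 2, positions 1, 2, 4 and 5 are kept and the noise positions 3 and 6 are rejected.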
Step 4-7: Merge the longest common path and the local path.
The longest common path and the local path are concatenated to obtain the path Pu that locates the data units in the XML document of the Web page.
Step 5: Compute the path expressions for extracting attribute values, comprising the following steps:
Step 5-1: Generate the path that locates the attribute node.
Suppose that, in the sample data, the path of the node holding the value of schema attribute $A_i$, relative to the data unit node, is:
/tag $A_{i1}$[position $A_{i1}$]/tag $A_{i2}$[position $A_{i2}$]/.../tag $A_{ik}$[position $A_{ik}$]
that is, $/TA_{i1}[pA_{i1}]/TA_{i2}[pA_{i2}]/\dots/TA_{ik}[pA_{ik}]$, where $TA_{ij}$ denotes tag $A_{ij}$ and $pA_{ij}$ denotes position $A_{ij}$, $j=1,\dots,k$; tag $A_{ik}$ is the tag of the node containing the attribute value, and position $A_{ik}$ is the position of this node among its siblings with the same tag. The path locating the attribute node can then be simplified with the method of Step 4-5.
Step 5-2: Determine the attribute value extraction rule.
An attribute value extraction rule applies in the following two cases: (1) the values of several attributes are contained in the same node text; (2) the node text contains text that is not part of an attribute value.
Assume that the non-attribute-value text in the node text is fixed text, and that the values of different attributes within the same node text are also separated by fixed text. It then suffices to compute the fixed strings that delimit the attributes in the node text in order to separate out the attribute value text, or the values of the different attributes. The method is:
First take several sample Web pages and extract from them the node texts that contain the same attribute. If all characters in the node text belong to the attribute value, the text is extracted directly; otherwise the common substrings are extracted from the node texts and used to delimit the attribute. The rule for extracting the attribute value is:
If the value of attribute $A_i$ in the node text is preceded by fixed text Text1, first take from the node text string Str the substring after Text1, denoted Str-after. Then check what follows the value of attribute $A_i$: if fixed text Text2 follows it, take from Str-after the substring before Text2, denoted Str-before.
Step 6: Generate the XML query statement for data extraction.
The XML query statement that extracts the structured data is constructed mainly from the data unit node paths and attribute node paths obtained in Step 4 and Step 5. When the XQuery query language is used, the statement is built mainly from the FLWOR expression of XQuery, whose clauses serve the following functions:
FOR clause: locates the set of data unit nodes;
LET clause: adds predicate variables;
WHERE clause: filters the data unit nodes with predicates based on attribute paths;
ORDER clause: specifies the rule for sorting the results;
RETURN clause: returns the data in the format required by the user.
According to the syntax of the XML query language XQuery, hierarchical data content in a Web page can be extracted by nesting FLWOR clauses inside the RETURN clause. Several ways of constructing the XML query statement for different requirements follow:
Step 6-1: When the extraction result is a hierarchical structure, the XML query statement to be generated is constructed as follows:
(1) The outermost layer of the statement uses a fixed XML element tag as the root node, with the XML query expression in between; for the XQuery language this is a FLWOR expression. The following form is used: <root node tag> XML query expression </root node tag>
(2) In the XML query expression, the path expression of the data unit is used to locate the data unit node variable. For the XQuery language, a FOR statement locates the data unit node variable, and LET and WHERE statements may be used at the same time to add predicates for locating the data unit nodes.
(3) In the XML query expression, for the output of the query result, the attribute names in the data schema, or text with the same meaning, are used as the element tags in the XML document; the attribute node paths and attribute value extraction rules generated in Step 5 are used to locate the attribute value text of the corresponding attribute under the data unit node variable. The concrete form is: <attribute tag>{ expression composed of the attribute node path and the attribute value extraction rule }</attribute tag>
The overall structure of the XML query statement is:
<root node tag>
{
FOR data unit node variable in data unit node path
[LET statement]
[WHERE statement]
RETURN <data entity name tag>
<attribute 1 tag>{ expression composed of the attribute 1 node path and attribute value extraction rule }</attribute 1 tag>
...
<attribute n tag>{ expression composed of the attribute n node path and attribute value extraction rule }</attribute n tag>
</data entity name tag>
}
</root node tag>
Step 6-2: When the extraction result is a relational table structure, the XML query statement to be generated is constructed as follows:
(1) In the XML query expression, the path expression of the data unit is used to locate the data unit node variable. For the XQuery language, a FOR statement locates the data unit node variable, and LET and WHERE statements are used at the same time to add predicates for locating the data unit nodes.
(2) In the XML query expression, for the output of the query result, the expressions composed of attribute node paths and attribute value extraction rules are arranged in the order required by the output, and the expressions of different attribute values are separated by a special symbol. The concrete form is: { extraction expression for the value of attribute 1 } separator { extraction expression for the value of attribute 2 } separator ... separator { extraction expression for the value of attribute n }
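The two statement shapes of Step 6 can be assembled as plain strings. The Python sketch below is an illustration under stated assumptions: the function names, the tags `books` and `book`, and the variable `$u` are invented; the data unit path is the one derived in the embodiment, written with the standard XPath attribute syntax `@class`; `substring-after` is used as one way to express the Str-after rule; and `string-join` is a substitution chosen here to realize the separator form, since the patent only describes the layout. In a real engine the path would normally be prefixed with a document accessor such as `doc(...)`.

```python
def build_hierarchical_query(root_tag, entity_tag, unit_path, attributes,
                             let=None, where=None, var="$u"):
    """Assemble the Step 6-1 template: <root>{ for ... return <entity>...}</root>."""
    lines = [f"<{root_tag}>", "{", f"for {var} in {unit_path}"]
    if let:
        lines.append(f"let {let}")            # optional LET clause
    if where:
        lines.append(f"where {where}")        # optional WHERE clause
    lines.append(f"return <{entity_tag}>")
    for tag, expr in attributes:
        lines.append(f"  <{tag}>{{ {expr} }}</{tag}>")
    lines += [f"</{entity_tag}>", "}", f"</{root_tag}>"]
    return "\n".join(lines)


def build_relational_query(unit_path, expressions, separator="|", var="$u"):
    """Assemble the Step 6-2 shape: one separator-joined row per data unit."""
    joined = ", ".join(expressions)
    return (f"for {var} in {unit_path}\n"
            f'return string-join(({joined}), "{separator}")')


UNIT_PATH = '/html/body/div[6]/div[3]/div[1]/div[@class="list_r_list"]/div[2]'
ATTRS = [
    ("title", "$u/h2/a/text()"),
    ("publication_time", 'substring-after($u/h4[3], "Publication time: ")'),
]
hier = build_hierarchical_query("books", "book", UNIT_PATH, ATTRS)
flat = build_relational_query(UNIT_PATH, [expr for _, expr in ATTRS])
```

Keeping query generation separate from execution mirrors the method's design: the generated statement can be inspected and corrected before any data is extracted.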
Step 7: Extract the data with the XML query statement.
Using an XML query execution engine, the XML query statement is run on the XML document obtained by converting the target web page, so that the specified data content is extracted from the web page formatted as an XML document.
Advantages of the invention: the Web data extraction method based on XML queries has wide applicability. (1) The invention generates precise XML query statements: based on the path expression generation method, data units and attribute values are located with precise XPath expressions, which ensures the correctness of the XML query statements. (2) The invention is highly general: the generated XML query statements that extract data from a data source can run on a database or on an execution engine that conforms to the XML query specification, and so integrate seamlessly with existing systems. (3) The invention accommodates more complex query result output: by adjusting the structure of the generated clauses, it supports extracting hierarchically structured data content from Web pages and is not limited to simple relational structures.
Description of drawings
Fig. 1 is a schematic diagram of a Web data page of an online bookstore, to which the web page data extraction method based on XML queries is applied;
Fig. 2 is a flowchart of the web page data extraction method based on XML queries;
Fig. 3 is a schematic diagram of the positions of the data units in the DOM tree of the page document.
Embodiment
The invention is described in further detail below with reference to the drawings and an embodiment:
Fig. 1 shows a Web data page of an online bookstore. The method of the invention follows the flow shown in Fig. 2, with the following steps:
Step 1: Determine the data schema S corresponding to the data content to be extracted from the Web page. The data entity name E is "book"; the attribute names in the attribute set and the data types of the attributes are shown in Table 1:
Table 1: attribute names and attribute data types of the data entity "book"
[Table 1 is reproduced as images in the original publication: Figure GDA0000149816270000061, Figure GDA0000149816270000071]
Step 2: Locate the data region, data units and attribute texts in the sample page of Fig. 1. As Fig. 1 shows, the data units are data unit 1, data unit 2 and data unit 3.
The HTML page must first be formatted as an XML document that conforms to the XML specification:
<div class="list_book_right">
<h2><img/><a name="link_prd_name" href="" target="_blank">Algorithms and Data Structures: Selected Postgraduate Entrance Exam Questions with Analysis (2nd Edition)</a></h2>
<h3>Customer rating:</h3>
<h4 class="list_r_list_h4">Author: <a href="">Chen Shoukong</a>, <a href="">Hu Xiaokun</a>, <a href="">Li Ling</a> (compilers)</h4>
<h4>Publishing house: <a href="">China Machine Press</a></h4>
<h4>Publication time: July 2007</h4>
<h5>This book collects more than 1600 exam questions from over 300 sets of postgraduate entrance examination papers on "Algorithms and Data Structures" set since 1992 by more than 60 key domestic universities and Academy of Sciences institutes, and provides reference answers and analysis. This book can serve as study material for computer science and related majors at institutions of higher education<font class="dot">...</font></h5>
<div class="clear"></div>
<h6><span>$42.00</span><span>$35.70</span>Discount: 85折 Saved: $6.30</h6>
<span class="list_r_list_button"><a href=""><img src=""/></a></span>
<span class="list_r_list_button"><a href=""><img src=""/></a></span>
</div>
Step 3: Annotate the attribute texts of the data units in the sample page. The texts in the 3 data units of Fig. 1 are annotated as follows:
Data unit 1:
Title: Algorithms and Data Structures: Selected Postgraduate Entrance Exam Questions with Analysis (2nd Edition)
Author: Chen Shoukong, Hu Xiaokun, Li Ling
Publishing house: China Machine Press
Publication time: July 2007
Book introduction: This book collects more than 1600 exam questions from over 300 sets of postgraduate entrance examination papers on "Algorithms and Data Structures" set since 1992 by more than 60 key domestic universities and Academy of Sciences institutes, and provides reference answers and analysis. This book can serve as study material for computer science and related majors at institutions of higher education
Original price: $42.00
Current price: $35.70
Discount: 85
Amount saved: $6.30
Data unit 2:
Title: Data Mining: Concepts and Techniques (2nd Edition of the original book)
Author: Jiawei Han, Kamber, Fan Ming, Meng Xiaofeng
Publishing house: China Machine Press
Publication time: March 2007
Book introduction: This book comprehensively covers the important knowledge and technical innovations in the field of data mining. Building on the already comprehensive content of the 1st edition, the 2nd edition presents the latest research results in the field, such as mining stream, time-series and sequence data and mining spatio-temporal, multimedia, text and Web data. This book can be used as
Original price: $55.00
Current price: $42.30
Discount: 77
Amount saved: $12.70
Data unit 3:
Title: The Art of Oracle 9i & 10g Programming: An In-Depth Look at Database Architecture
Author: Kyte, Su Jinguo
Publishing house: Posts & Telecom Press
Publication time: October 2006
Book introduction: This book is an authoritative work on the Oracle 9i & 10g database architecture. It covers all the most important Oracle architectural features, including files, memory structures and processes; locks and latches; transactions, concurrency and multi-versioning; tables and indexes; data types; and partitioning and parallelism, and
Original price: $99.00
Current price: $74.30
Discount: 75
Amount saved: $24.70
Step 4: Generate the paths of the data unit nodes.
Step 4-1: The data units are U = {U1, U2, U3}, where U1, U2 and U3 denote data unit 1, data unit 2 and data unit 3. The title of data unit 1 is "Algorithms and Data Structures: Selected Postgraduate Entrance Exam Questions with Analysis (2nd Edition)"; the title of data unit 2 is "Data Mining: Concepts and Techniques (2nd Edition of the original book)"; the title of data unit 3 is "The Art of Oracle 9i & 10g Programming: An In-Depth Look at Database Architecture".
Step 4-2: The data units are marked in Fig. 1, and their positions in the DOM tree of the page document are shown as solid dots in Fig. 3. In Fig. 3, the root node corresponds to the element with tag html in the XML document, which corresponds to the outermost dashed line of the Web page in Fig. 1 and includes all visible and non-visible content. Node 1 is the XML element node with tag head and corresponds to the header of the Web page in Fig. 1; the page meta-information it contains is not visible. Node 2 is the XML element node with tag body and corresponds to the outermost solid line of the Web page in Fig. 1. Nodes 2.1 to 2.7 are the child nodes of node 2 that are XML element nodes with tag div; node 2.7 corresponds to the advertisement slot at the bottom of the Web page in Fig. 1. A triangle under a node in the figure represents the subtree under that node. Nodes 2.6.1 to 2.6.3 are the child nodes of node 2.6 that are XML element nodes with tag div; node 2.6.3 corresponds to the data region of the Web page indicated by the solid-line area in Fig. 1. Node 2.6.3.1 is the first XML element node with tag div among the child nodes of node 2.6.3 and corresponds to data unit 1 of the Web page indicated by the solid-line area in Fig. 1. The notation div[1] means that the node is the first XML element node with tag div among the child nodes of its parent. The solid nodes in Fig. 3 are the nodes corresponding to the data units; the path of such a node is the data unit path, and the text nodes in its subtree contain the attribute value data of the data unit.
Step 4-3: Compute the path expressions.
The path values of the data unit nodes in the XML document are:
P1: "/html[1]/body[1]/div[6]/div[3]/div[1]/div[4]/div[2]"
P2: "/html[1]/body[1]/div[6]/div[3]/div[1]/div[5]/div[2]"
P3: "/html[1]/body[1]/div[6]/div[3]/div[1]/div[6]/div[2]"
Here html, body and div are XML element tags, and div[i] denotes the i-th node with tag div among its sibling nodes with the same tag.
Step 4-4: Compute the longest common path LCP starting from the root node.
The longest common path LCP is:
LCP: "/html[1]/body[1]/div[6]/div[3]/div[1]"
Step 4-5: Simplify the longest common path LCP computed in Step 4-4.
The longest common path expression after simplification is:
LCP: "/html/body/div[6]/div[3]/div[1]"
Step 4-6: Compute the local path.
The expression of the local path is:
"/div[@class="list_r_list"]/div[2]"
Step 4-7: Merge the longest common path and the local path.
The path expression of the data unit nodes obtained after merging is:
"/html/body/div[6]/div[3]/div[1]/div[@class="list_r_list"]/div[2]"
Here div[@class="list_r_list"] denotes an XML element node with tag div that has the XML attribute class with value list_r_list; class is the style attribute of a tag node in an HTML document.
Step 5: generate the path expressions for extracting attribute values.
For data unit 1, the path expressions for extracting attribute values, generated from its internal structure, are as follows:
1. Attribute "title"
Path expression: "/h2/a/", where h2 denotes the XML element node with tag h2 and a denotes the XML element node with tag a.
Attribute value extraction rule: the entire content of this attribute node is attribute value information, so the text() function of the XQuery language can be used to extract the value.
2. Attribute "author"
Path expression: "/h4[1]/a", where h4[1] denotes the first XML element node with tag h4 and a denotes the XML element node with tag a.
Attribute value extraction rule: the entire content of this attribute node is attribute value information.
3. Attribute "publisher"
Path expression: "/h4[2]/a", where h4[2] denotes the second XML element node with tag h4 and a denotes the XML element node with tag a.
Attribute value extraction rule: the entire content of this attribute node is attribute value information.
4. Attribute "publication time"
Path expression: "/h4[3]", where h4[3] denotes the third XML element node with tag h4.
Attribute value extraction rule: only part of the content of this attribute node is attribute value information; the rule is to remove the common non-value string "publication time:".
5. Attribute "book synopsis"
Path expression: "/h5", where h5 denotes the XML element node with tag h5.
Attribute value extraction rule: the entire content of this attribute node is attribute value information.
6. Attribute "original price"
Path expression: "/h6/span[1]", where h6 denotes the XML element node with tag h6 and span[1] denotes the first XML element node with tag span.
Attribute value extraction rule: only part of the content of this attribute node is attribute value information; the rule is to remove the common non-value string "$".
7. Attribute "current price"
Path expression: "/h6/span[2]", where h6 denotes the XML element node with tag h6 and span[2] denotes the second XML element node with tag span.
Attribute value extraction rule: only part of the content of this attribute node is attribute value information; the rule is to remove the common non-value string "$".
8. Attribute "discount"
Path expression: "/h6/", where h6 denotes the XML element node with tag h6.
Attribute value extraction rule: only part of the content of this attribute node is attribute value information; the rule is to remove the common non-value string "discount:" that precedes the value in the node text and the common non-value string "folding" that follows it.
9. Attribute "amount saved"
Path expression: "/h6/", where h6 denotes the XML element node with tag h6.
Attribute value extraction rule: only part of the content of this attribute node is attribute value information; the rule is to remove the common non-value string "saved: $" that precedes the value in the node text.
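The extraction rules above amount to trimming a fixed leading string, a fixed trailing string, or both. The Python sketch below mirrors the behaviour of the XPath functions substring-after() and substring-before(); the helper names and the sample node texts are assumptions made for illustration, not strings taken from the example page.

```python
def substring_after(text, marker):
    """Like XPath substring-after: text following the first marker, '' if absent."""
    if marker == "":
        return text
    _, found, tail = text.partition(marker)
    return tail if found else ""

def substring_before(text, marker):
    """Like XPath substring-before: text preceding the first marker, '' if absent."""
    if marker == "":
        return ""
    head, found, _ = text.partition(marker)
    return head if found else ""

def extract_value(text, prefix=None, suffix=None):
    """Apply an extraction rule: drop the fixed prefix, then the fixed suffix."""
    value = text
    if prefix:
        value = substring_after(value, prefix)
    if suffix:
        value = substring_before(value, suffix)
    return value.strip()

# Hypothetical node texts, shaped like the rules for attributes 4, 6, and 8.
published = extract_value("publication time: July 2007", prefix="publication time:")
price = extract_value("$42.00", prefix="$")
discount = extract_value("discount: 85 folding", prefix="discount:", suffix="folding")
```

A rule with neither prefix nor suffix returns the node text unchanged (apart from whitespace trimming), which corresponds to the attributes whose entire node content is the value.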
Step 6: generate the XML query statement for data extraction.
Taking XML-format extraction results as the example and using XQuery as the XML query language, the XQuery statement generated for this example is as follows:
[Figures GDA0000149816270000111 and GDA0000149816270000121: the generated XQuery statement, shown as images in the original publication]
Here FOR, IN, and RETURN are XQuery keywords; text() is the function that returns the text inside a node; substring-before() returns the part of a string that precedes a given substring, and substring-after() returns the part that follows a given substring.
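Because the generated statement survives only as images, the following Python sketch shows how a FLWOR query of this general shape could be assembled from the step 4 path and the step 5 rules. It is an assumed reconstruction, not the statement in the figures: the function name, the element tag names, the document URI `page.xml`, and the variable `$u` are all hypothetical, and the local path is written with standard XPath `@class` syntax.

```python
def build_xquery(root_tag, unit_tag, unit_path, attributes, doc_uri="page.xml"):
    """Assemble an XQuery string: a FOR clause binds each data unit node,
    and the RETURN clause wraps one extraction expression per attribute."""
    lines = [
        f"<{root_tag}>{{",
        f'  for $u in doc("{doc_uri}"){unit_path}',
        f"  return <{unit_tag}>",
    ]
    for tag, rel_path, prefix, suffix in attributes:
        expr = f"$u{rel_path}/text()"
        if prefix:   # drop the fixed text before the value
            expr = f'substring-after({expr}, "{prefix}")'
        if suffix:   # drop the fixed text after the value
            expr = f'substring-before({expr}, "{suffix}")'
        lines.append(f"    <{tag}>{{{expr}}}</{tag}>")
    lines += [f"  </{unit_tag}>", f"}}</{root_tag}>"]
    return "\n".join(lines)

query = build_xquery(
    root_tag="book_list",
    unit_tag="book",
    unit_path='/html/body/div[6]/div[3]/div[1]/div[@class="list_r_list"]/div[2]',
    attributes=[
        ("title", "/h2/a", None, None),
        ("original_price", "/h6/span[1]", "$", None),
        ("discount", "/h6", "discount:", "folding"),
    ],
)
print(query)
```

The sketch passes text() results straight to the string functions, which only works when each path selects a single text node; a production generator would need to handle nodes with several text children.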
Step 7: execute the XML query statement above to extract the data.
After the XML query statement above is executed against the example page, the XML data obtained is:
<book_list>
  <book>
    <title>Algorithms and Data Structures: Analysis of Selected Postgraduate Entrance Examination Questions (2nd Edition)</title>
    <author>Chen Shoukong, Hu Xiaokun, Li Ling</author>
    <publisher>China Machine Press</publisher>
    <publication_time>July 2007</publication_time>
    <synopsis>This book collects more than 1,600 questions from over 300 postgraduate entrance examination papers on "Algorithms and Data Structures" set since 1992 by more than 60 key domestic universities and research institutes of the Academy of Sciences, and provides reference answers and analysis. This book can serve as study material for computer science and related majors at institutions of higher learning</synopsis>
    <original_price>42.00</original_price>
    <current_price>35.70</current_price>
    <discount>85</discount>
    <amount_saved>6.30</amount_saved>
  </book>
  <book>
    <title>Data Mining: Concepts and Techniques (2nd Edition of the Original)</title>
    <author>Han Jiawei, Kamber, Fan Ming, Meng Xiaofeng</author>
    <publisher>China Machine Press</publisher>
    <publication_time>March 2007</publication_time>
    <synopsis>This book comprehensively covers the important knowledge and technical innovations in the field of data mining. Building on the fairly comprehensive content of the 1st edition, the 2nd edition presents the latest research results in the field, such as mining stream, time-series, and sequence data, and mining spatiotemporal, multimedia, text, and Web data. This book can serve as</synopsis>
    <original_price>55.00</original_price>
    <current_price>42.30</current_price>
    <discount>77</discount>
    <amount_saved>12.70</amount_saved>
  </book>
  <book>
    <title>The Art of Oracle 9i &amp; 10g Programming: Database Architecture in Depth</title>
    <author>Kyte, Su Jinguo</author>
    <publisher>Posts and Telecom Press</publisher>
    <publication_time>October 2006</publication_time>
    <synopsis>This book is an authoritative work on Oracle 9i and 10g database architecture, covering all of the most important Oracle architectural features, including files, memory structures, and processes; locks and latches; transactions, concurrency, and multi-versioning; tables and indexes; datatypes; and partitioning and parallelism, and</synopsis>
    <original_price>99.00</original_price>
    <current_price>74.30</current_price>
    <discount>75</discount>
    <amount_saved>24.70</amount_saved>
  </book>
</book_list>
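The extraction in step 7 can be imitated on a small scale with Python's standard `xml.etree.ElementTree`, whose limited XPath support covers attribute and position predicates. The page fragment below is invented for illustration (it is not the example page from the patent), and the record keys are assumed names; the sketch only shows how a data unit path plus per-attribute relative paths and a prefix rule yield one record per data unit.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page fragment with two data units and one noise block.
PAGE = """<html><body>
  <div class="list_r_list">
    <div>cover image</div>
    <div>
      <h2><a>Book A</a></h2>
      <h4><a>Author A</a></h4>
      <h4><a>Press A</a></h4>
      <h4>publication time: 2007-07</h4>
    </div>
  </div>
  <div class="banner">
    <div>ad</div>
    <div><h2><a>Not a book</a></h2></div>
  </div>
  <div class="list_r_list">
    <div>cover image</div>
    <div>
      <h2><a>Book B</a></h2>
      <h4><a>Author B</a></h4>
      <h4><a>Press B</a></h4>
      <h4>publication time: 2007-03</h4>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)
records = []
# The class predicate filters out the noise block; div[2] picks the data unit.
for unit in root.findall("./body/div[@class='list_r_list']/div[2]"):
    raw_time = unit.findtext("h4[3]")
    records.append({
        "title": unit.findtext("h2/a"),
        "author": unit.findtext("h4[1]/a"),
        "publisher": unit.findtext("h4[2]/a"),
        # Extraction rule: drop the fixed leading string.
        "publication_time": raw_time.partition("publication time:")[2].strip(),
    })
```

The `banner` block has the same inner structure as a data unit, so it plays the role of a noise node that the attribute predicate excludes.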

Claims (1)

1. A method for extracting web page data based on Extensible Markup Language (XML) queries, characterized in that it comprises the following steps:
Step 1: determine the schema structure corresponding to the data content to be extracted from the Web page;
Step 2: locate the data region, data units, and attribute text in the Web page;
Step 3: semantically annotate the attribute text from step 2;
Step 4: generate the data unit node path;
Step 5: compute the path expressions for extracting attribute values;
Step 6: generate the XML query statement for data extraction;
Step 7: extract the data using the XML query statement;

wherein the schema structure of step 1 is one of two kinds: a relational table structure or a hierarchical structure. The data schema S of a table structure consists of a data entity name E and an attribute set A = {A_1, ..., A_n}, where A_i denotes one attribute of the set, 1 <= i <= n, and n is the number of attributes. Each attribute consists of an attribute name and a data type and is written A_i = <N, Type>, where N is the attribute name and Type is the attribute data type; Type is one of the integer type integer, the floating-point type float, and the string type string. The hierarchical structure is a complex data structure composed of basic types; its data schema is denoted S_i′ and contains the attributes {A_i1, ..., A_ix}, where x is the number of attributes in schema S_i′;

the data region Da of step 2 is the region enclosed by the smallest boundary that contains all data units in the Web page; it is located as the smallest subtree of the page's Document Object Model (DOM) structure that contains all data units;
the data unit Du is a data entity corresponding to one schema structure to be obtained by Web data extraction; it is described by the attributes of the schema and recurs in the page in a regular pattern; it is located by finding, in the DOM tree of the Web page, the nodes that hold the content of each attribute of the data entity, the smallest subtree containing these nodes being the data unit;
the attribute text At is the text content in the Web page that contains the value of a schema attribute; the attribute value lies in a text node of an element node in the DOM tree of the Web page; it is located by finding, in the DOM tree structure of the Web page, the node that contains the attribute value text;

generating the data unit node path in step 4 comprises the following steps:
Step 4-1: denote the set of data units obtained in step 2 as U = {U_1, U_2, ..., U_y}, where U_i is one data unit, i = 1, ..., y;
Step 4-2: for each data unit U_i, determine its corresponding element node in the page's XML document, denoted N_i; then, from the structure of the XML document, generate for element node N_i the path value from the root node to that node, denoted P_i;
Step 4-3: compute the path expression of the data units as follows:
take the path of a data unit node. In the path value P_i, every step of the path expression, that is, every node passed on the way from the document root node to the element node corresponding to the data unit, is located with a positional predicate. Take the node tag of each step of the path expression; the paths of all data units have the same tag sequence, so the tag sequence starting from the root node is denoted T and consists of m tags (T_1, T_2, ..., T_m), where T_1 is the tag of the root node and the remaining tags follow in order. The positions of each node's tag among its same-tag sibling nodes form the sequence (p_i1, ..., p_im), where p_i1 is the position of the root node tag and the remaining positions follow in order. The path value is then:
P_i = /tag_1[position_i1]/tag_2[position_i2]/.../tag_m[position_im],
that is, P_i = /T_1[p_i1]/T_2[p_i2]/.../T_m[p_im];
Step 4-4: for the set of data unit paths, compute the longest common path LCP starting from the root node:
the longest common path is the path formed by the nodes shared by the paths of all data unit nodes. It is computed as follows: match the data unit node paths starting from the first tag position after the root; if all data unit node paths have the same position value under the current tag, that is, p_1i = p_2i = ... = p_yi, append the current tag and position value to the longest common path, that is, LCP += /T_i[p_i]; if the position values under the current tag differ between data unit node paths, stop matching and take the current longest common path as the final value;
Step 4-5: simplify the longest common path LCP computed in step 4-4:
for the node corresponding to a step of the longest common path, denoted n_i with tag T_i, if none of its sibling nodes is a non-data-unit node that has the same tag and has descendant nodes with the same successor path "/tag_i+1/.../tag_m", then the position value of this node may be omitted from the expression of the longest common path;
Step 4-6: compute the local path by generating predicates, the local path being the path formed by the nodes that are private to each data unit node:
predicates are generated as follows: let the tag of the node at the current step be T_i, and check whether the sibling nodes of the node set at the current step include non-data-unit nodes that have the same tag and have descendant nodes with the same successor path "/tag_i+1/.../tag_m". If there are none, omit the predicate. If there are, check whether the current node has an XML attribute that distinguishes it from the non-data-unit nodes meeting the above condition; if such an XML attribute exists, use it as the predicate expression, and if not, go on to compute the range of position values for the predicate. The non-data-unit nodes meeting the condition are called noise nodes;
the range of position values in the predicate is computed as follows:
if noise nodes appear only before the set of data unit nodes, the position range in the predicate that selects the data unit nodes for this tag runs from the smallest position value among the node positions corresponding to tag T_i over all data unit nodes to the last node with that tag;
if noise nodes appear only after the set of data unit nodes, the position range in the predicate that selects the data unit nodes for this tag runs from the first node to the largest position value among the node positions corresponding to tag T_i over all data unit nodes;
if the data nodes are regularly separated by noise nodes, compute the interval p_inte at which the data unit nodes are separated by noise nodes and the length p_cont of each consecutive run of data unit nodes, and compute the smallest and largest position values among the node positions corresponding to tag T_i over all data unit nodes, denoted pmin and pmax; a node is then considered to lie on a data unit path if it meets the following position conditions: (1) the node's position value minus pmin, taken modulo p_inte, leaves a remainder smaller than p_cont; (2) the node's position value is smaller than the largest noise node position value minus pmax plus 1;
Step 4-7: merge the longest common path and the local path:
merging the longest common path and the local path gives the path Pu that locates the data units in the XML document of the Web page;

computing the path expressions for extracting attribute values in step 5 comprises the following steps:
Step 5-1: generate the path that locates the attribute node:
suppose that, in the sample data, the path of the node holding the value of schema attribute A_i, relative to the data unit node, is:
/tag_Ai1[position_Ai1]/tag_Ai2[position_Ai2]/.../tag_Aik[position_Aik],
that is, /TA_i1[pA_i1]/TA_i2[pA_i2]/.../TA_ik[pA_ik], where TA_ij denotes tag A_ij and pA_ij denotes position A_ij, j = 1, ..., k; tag A_ik is the tag of the node containing the attribute value, and position A_ik is the position of that node among its same-tag sibling nodes. The method of step 4-5 can then be used to simplify the path that locates the attribute node;
Step 5-2: determine the attribute value extraction rules:
attribute value extraction rules apply in two situations: 1) the values of several attributes are contained in the same node text; 2) the node text contains text that is not attribute value content;
assuming that the non-value text in the node text is fixed text, and that the values of different attributes within the same node text are also separated by fixed text, it suffices to compute the fixed strings that delimit the attributes in the node text in order to separate the attribute value text from the rest, or the values of different attributes from each other. The method is:
first take several sample Web pages and extract from them the node texts containing the same attribute. If every character of the node text is attribute value content, extract it directly; otherwise extract the common substrings and split the attributes. The rules for extracting an attribute value from the node text are as follows:
if the value of attribute A_i is preceded in the node text by a fixed text Text1, first take from the node text string Str the substring after Text1, denoted Str-after; then, if the value of attribute A_i is followed by a fixed text Text2, take from Str-after the substring before Text2, denoted Str-before;

generating the XML query statement for data extraction in step 6 comprises the following steps:
Step 6-1: when the data extraction result is a hierarchical structure, the structure of the XML query statement to be generated is built as follows:
(1) the outermost layer of the statement uses a fixed XML element tag as the root node, enclosing the XML query expression, which for the XQuery language is a FLWOR expression; that is, the statement has the form: <root node tag>XML query expression</root node tag>;
(2) in the XML query expression, the path expression of the data units is used to bind the data unit node variable; for the XQuery language a FOR clause binds the data unit node variable, and LET and WHERE clauses may be used to add predicates for locating the data unit nodes;
(3) in the output part of the XML query expression, the attribute names of the data schema, or text with the same semantics, are used as the element tags of the XML document, and the attribute node paths and attribute value extraction rules generated in step 5 are used to locate the value text of each attribute under the data unit node variable, in the form: <attribute tag>{expression composed of the attribute node path and the attribute value extraction rule}</attribute tag>;
Step 6-2: when the data extraction result is a relational table structure, the structure of the XML query statement to be generated is built as follows:
(1) in the XML query expression, the path expression of the data units is used to bind the data unit node variable; for the XQuery language a FOR clause binds the data unit node variable, and LET and WHERE clauses are used to add predicates for locating the data unit nodes;
(2) in the output part of the XML query expression, the expressions composed of attribute node paths and attribute value extraction rules are arranged in the order required by the output, with the expressions of different attribute values separated by a special symbol, in the form: {value extraction expression of attribute 1} delimiter {value extraction expression of attribute 2} delimiter ... delimiter {value extraction expression of attribute n}.
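The position-range rules of step 4-6 can be sketched in Python as follows. This is an illustrative reading of the claim under stated assumptions rather than a definitive implementation: the function names and the XPath `position()` predicate strings are assumptions, the interval p_inte is interpreted as the period of the repeating pattern, and for the interleaved case only condition (1) is implemented.

```python
def position_predicate(data_positions, noise_positions):
    """Return an XPath position predicate separating data unit nodes
    from noise nodes that carry the same tag."""
    pmin, pmax = min(data_positions), max(data_positions)
    if not noise_positions:
        return ""                            # no noise nodes: omit the predicate
    if max(noise_positions) < pmin:
        return f"[position() >= {pmin}]"     # noise only before the data units
    if min(noise_positions) > pmax:
        return f"[position() <= {pmax}]"     # noise only after the data units
    raise ValueError("interleaved noise nodes: use on_data_path instead")

def on_data_path(position, pmin, p_inte, p_cont):
    """Condition (1) for regularly interleaved noise nodes:
    (position - pmin) mod p_inte must be smaller than p_cont."""
    return (position - pmin) % p_inte < p_cont

before = position_predicate([3, 4, 5], [1, 2])
after = position_predicate([1, 2, 3], [4, 5])
# Hypothetical layout: data units at 2,3,5,6,8,9 with noise nodes at 4 and 7,
# so runs of two data units repeat with period three.
kept = [p for p in range(2, 10) if on_data_path(p, pmin=2, p_inte=3, p_cont=2)]
```

In the interleaved example, positions 4 and 7 fail condition (1) and are rejected, while the six data unit positions are kept.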

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010545520A CN101984434B (en) 2010-11-16 2010-11-16 Webpage data extracting method based on extensible language query

Publications (2)

Publication Number Publication Date
CN101984434A CN101984434A (en) 2011-03-09
CN101984434B true CN101984434B (en) 2012-09-05

Family

ID=43641603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010545520A Expired - Fee Related CN101984434B (en) 2010-11-16 2010-11-16 Webpage data extracting method based on extensible language query

Country Status (1)

Country Link
CN (1) CN101984434B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456053B (en) * 2010-11-02 2013-08-14 江苏大学 Method for mapping XML document to database
CN102902723A (en) * 2012-09-06 2013-01-30 北京北森测评技术有限公司 Method and device for analyzing network data
CN103778104B (en) * 2012-10-22 2017-05-03 富士通株式会社 Information processing device, information processing method and electronic device
CN103186674A (en) * 2013-04-02 2013-07-03 浪潮电子信息产业股份有限公司 Web data quick inquiry method based on extensive makeup language (XML)
WO2016090625A1 (en) * 2014-12-12 2016-06-16 Hewlett-Packard Development Company, L.P. Scalable web data extraction
CN105808520B (en) * 2014-12-30 2018-12-14 联想(北京)有限公司 Electronic equipment and its sentence processing method
CN106980619B (en) * 2016-01-18 2021-03-26 北京国双科技有限公司 Data query method and device
CN106294722B (en) * 2016-08-09 2019-11-22 上海资誉网络科技有限公司 A kind of web page contents extraction method and device
CN107957909B (en) * 2016-10-17 2022-01-07 腾讯科技(深圳)有限公司 Information processing method, terminal equipment and server
CN106649628B (en) * 2016-12-06 2020-08-25 北京大学 Interaction enhancement method and system for web page visualization area
CN108614842B (en) * 2016-12-13 2021-03-30 北京国双科技有限公司 Method and device for querying data
CN106951451B (en) * 2017-02-22 2019-11-12 麒麟合盛网络技术股份有限公司 A kind of webpage content extracting method, device and calculate equipment
CN108334560B (en) * 2018-01-03 2022-04-15 腾讯科技(深圳)有限公司 Information acquisition method and related equipment
CN110309364B (en) * 2018-03-02 2023-03-28 腾讯科技(深圳)有限公司 Information extraction method and device
CN109582886B (en) * 2018-11-02 2022-05-10 北京字节跳动网络技术有限公司 Page content extraction method, template generation method and device, medium and equipment
CN112528082B (en) * 2020-12-08 2022-05-03 集美大学 An XML document pipeline XPath query method, terminal device and storage medium
CN112836063B (en) * 2021-01-27 2023-06-06 四川新网银行股份有限公司 Method for realizing feature tracing
CN114817639B (en) * 2022-05-18 2024-05-10 山东大学 Webpage diagram convolution document ordering method and system based on contrast learning
CN115658993B (en) * 2022-09-27 2023-06-06 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290624A (en) * 2008-06-11 2008-10-22 华东师范大学 A method for automatic extraction of news webpage metadata
CN101582074A (en) * 2009-01-21 2009-11-18 东北大学 Method for extracting data of DeepWeb response webpage

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
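Zhou Jin et al. Automatic extraction of web page information based on XML. Journal of Computer Applications. 2004, Vol. 24, pp. 225-227. *
Sun Gaoshang et al. A method for identifying pagination tags in Deep Web result pages. Journal of Chinese Computer Systems. 2010, Vol. 31 (No. 4), pp. 635-640. *
Li Jianbo et al. An XML-based Web information extraction method. Journal of Intelligence. 2006, (No. 8), pp. 49-51. *
Shen Derong et al. Research on some key technologies supporting Deep Web database networks. Computer Science. 2007, Vol. 34 (No. 8), pp. 123-125. *
Deng Li. Schema and data extraction for topic-oriented XML web pages. China Master's Theses Full-text Database, Information Science and Technology. 2004, pp. 1-47. *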

Also Published As

Publication number Publication date
CN101984434A (en) 2011-03-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120905

Termination date: 20141116

EXPY Termination of patent right or utility model