CN102135976A - Hypertext markup language page structured data extraction method and device - Google Patents

Hypertext markup language page structured data extraction method and device

Publication number
CN102135976A
Authority
CN
China
Prior art keywords
node
markup language
hypertext markup
page
language page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102976369A
Other languages
Chinese (zh)
Other versions
CN102135976B (en)
Inventor
胡汉强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN 201010297636
Publication of CN102135976A
Application granted
Publication of CN102135976B
Current legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to mobile searching, and discloses a hypertext markup language page structured data extraction method and a hypertext markup language page structured data extraction device. The hypertext markup language page structured data extraction method comprises the following steps of: transmitting a searching request to a search engine or a deep web; acquiring a search result hypertext markup language page obtained according to the searching request by the search engine or the deep web; and extracting structured data from the search result hypertext markup language page according to template hypertext markup language pages stored in advance by the search engine or the deep web and matching relationships between template document object model trees corresponding to the template hypertext markup language pages and search result page document object model trees, wherein the template hypertext markup language page comprises manual analytic annotations. By the method and the device, relatively more accurate search results can be obtained, and when an interface of a member search engine or the deep web is changed, corresponding modification is not required to be performed on an extraction wrapper of the member search engine or the deep web.

Description

Hypertext Markup Language page structured data extraction method and device
Technical Field
The present invention relates to mobile search, and in particular to a method and a device for extracting structured data from Hypertext Markup Language (HTML) pages.
Background
With the rapid development of mobile communication and search engine technology, mobile search, which combines these two major fields of the information industry, has become a new highlight and growth point for mobile value-added services. One important technical feature of mobile search is precise search, that is, providing a personalized search service so that users find exactly what they are looking for.
The mobile search framework is a platform based on meta-search. It integrates the capabilities of many specialized vertical search engines to offer users a new, comprehensive search capability. The problem to be solved is how to efficiently and accurately extract structured data automatically from the search result HTML pages of member search engines or the deep web (SE), so that the structured data from each vertical search engine or deep web site can be integrated in a unified format and presented to the search client. Here, the deep web refers to Internet databases that are hidden behind search/query interfaces and cannot be reached by the crawlers of general-purpose search engines.
One existing method of extracting structured data from HTML pages uses extraction rules (Extraction-Rule). This method builds a result extraction wrapper for each member search engine or deep web site from the extraction rules for each result element of that source's HTML pages. To extract an element, an extraction rule that locates the element in the HTML page is derived from a combination of the element's tag, attributes, and attribute values.
Although extraction rules can extract structured data from HTML pages, they have no unified representation, so the rules must be written into the extraction wrapper of each member search engine or deep web site. Consequently, whenever the interface of a member search engine or deep web site changes, its extraction wrapper must be changed accordingly. In addition, extraction rules rely only on an element's tag, attributes, and attribute values, which often cannot uniquely locate the element to be extracted. The extracted results are therefore not accurate enough, which degrades the user experience.
Summary of the Invention
Embodiments of the present invention provide a method and a device for extracting structured data from HTML pages. They give search results higher accuracy and do not require the extraction wrapper of a member search engine or deep web site to be changed when that source's interface changes.
An embodiment of the present invention provides a method for extracting structured data from HTML pages, comprising:
sending a search request to a search engine or deep web site;
obtaining the search result HTML page that the search engine or deep web site produces in response to the search request; and
extracting structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the template page Document Object Model (DOM) tree corresponding to the template HTML page and the search result page DOM tree, wherein the template HTML page contains manual parsing annotations.
An embodiment of the present invention also provides a device for extracting structured data from HTML pages, comprising:
a sending unit, configured to send a search request to a search engine or deep web site;
an obtaining unit, configured to obtain the search result HTML page that the search engine or deep web site produces in response to the search request; and
an extraction unit, configured to extract structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the template page DOM tree and the DOM tree of the search result page obtained by the obtaining unit, wherein the template HTML page contains manual parsing annotations.
As can be seen from the above technical solution, the search server in the embodiments of the present invention can use a template page DOM tree (Template DOM Tree) carrying manual parsing annotations to extract structured data automatically from search result HTML pages. The search server can therefore build a single generic wrapper for all search engines or deep web sites and build an annotated Template DOM Tree for each source. This is enough to extract structured data automatically from all of them, with higher extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to complete automatic extraction from the new interface. The code of the generic wrapper does not need to be modified, which greatly improves the maintainability of the system.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the HTML page structured data extraction method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the manual parsing annotations on nodes of the template page DOM tree in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the template page DOM tree and the search result page DOM tree during table node matching in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the template page DOM tree and the search result page DOM tree during row node matching in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the template page DOM tree and the search result page DOM tree during element node matching in an embodiment of the present invention;
Fig. 6 is a structural diagram of the HTML page structured data extraction device provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of the HTML page structured data extraction device provided by another embodiment of the present invention;
Fig. 8 is a structural diagram of the HTML page structured data extraction device provided by yet another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The HTML page structured data extraction method provided by the embodiments of the present invention is introduced first. Fig. 1 shows the flow of the method provided by one embodiment, which describes the processing performed by the search server and comprises the following steps:
101. Send a search request to a search engine or deep web site.
In the embodiments of the present invention, the search engine may be a member search engine in a meta-search framework or a search engine in another search framework. The embodiments are described mainly in terms of the meta-search framework; search servers in other search frameworks process requests in a similar way, which is not described again.
The search server sends the search request to at least one search engine or deep web site connected to it, and the search engine or deep web site performs the search according to that search request.
In one embodiment of the present invention, the search server may actively send the search request to member search engines or deep web sites. In another embodiment, the search server may forward a search request to member search engines or deep web sites after receiving it from a search client.
102. Obtain the search result HTML page that the search engine or deep web site produces in response to the search request.
The search server may receive the search result HTML page sent in real time by the search engine or deep web site, or may fetch it from the search engine or deep web site in real time. The search result HTML page is produced by the search engine or deep web site by searching according to the search request.
103. Extract structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the template Document Object Model (DOM) tree corresponding to the template HTML page and the search result page DOM tree, where the template HTML page contains manual parsing annotations.
The search server may store template (Template) HTML pages of search engines or deep web sites in advance, with the structured elements of these pages manually annotated; these template HTML pages therefore contain manual parsing annotations. A template HTML page is a sample search result HTML page obtained from a search engine or deep web site.
In one embodiment of the present invention, the manual parsing annotations in the template HTML page are of three kinds. The first annotates the table node: the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted. The second annotates the row nodes: a row node is a direct child of the table node, and the subtree rooted at a row node contains the data of one row of the tabular structured data to be extracted. The third annotates the element nodes: an element node is a node in the subtree rooted at a row node and corresponds to an element to be extracted.
The annotation of the table node may include: a mark identifying the node as the table node; a first quantity, giving the number of leading direct children of the table node that are skipped when row nodes are annotated; and a second quantity, giving the number of trailing direct children of the table node that are skipped when row nodes are annotated. The annotation of a row node may include: a mark identifying the node as a row node; the number of parts into which the data row corresponding to the row node is divided; and the sequence number of the part that this row node holds. The annotation of an element node may include: a mark identifying the node as an element node; the column name, in the tabular structured data, of the data to be extracted; the number of columns in the table structure; and the sequence number of the column in which this element node lies.
The search server may extract structured data directly from the search result HTML page according to the annotated template HTML page. Alternatively, it may first annotate the search result HTML page automatically according to the annotated template HTML page and then extract structured data from the automatically annotated search result HTML page.
In either case, the table node, row nodes, and element nodes of the search result HTML page must be located in the search result page DOM tree according to the annotated template HTML page. Specifically, in one embodiment of the present invention the search server proceeds as follows. It uses the annotation of the table node in the template HTML page, together with the matching relationship between the template page DOM tree and the search result page DOM tree, to find the table node of the search result HTML page. It then uses the table node found, the annotations of the row nodes in the template HTML page, and the same matching relationship to find the row nodes of the search result HTML page. Finally, it uses the row nodes found, the annotations of the element nodes in the template HTML page, and the same matching relationship to find the element nodes of the search result HTML page.
If the search result HTML page does not need to be annotated automatically, then after finding the element nodes of the search result HTML page the search server can extract structured data directly from them. If the search result HTML page does need to be annotated automatically, the search server annotates the table node after finding it, annotates the row nodes after finding them, and annotates the element nodes after finding them. It then locates the element nodes in the annotated search result HTML page and extracts structured data from them.
In one embodiment of the present invention, the table node of the search result HTML page may be found as follows. Determine the path in the template page DOM tree from the table node up through its ancestors to the root node. Then, starting from the root node of the template page DOM tree and following this path in the reverse direction, find for each node on the path the matching node in the search result page DOM tree, until the matching table node is found. One way of matching each node on the path is as follows: a node in the search result page DOM tree is the matching node of a node in the template page DOM tree only when their parent nodes match and both nodes are the child at the same position under their respective parents.
The row nodes of the search result HTML page may be found as follows: among the direct children of the table node in the search result HTML page, all direct children other than the leading first quantity of children and the trailing second quantity of children are taken as row nodes.
The element nodes of the search result HTML page may be found as follows. Determine the path in the template page DOM tree from the element node up through its ancestors to the row node. Then, starting from the row node of the template page DOM tree and following this path in the reverse direction, find for each node on the path the matching node in the search result page DOM tree, until the matching element node is found. The same matching rule may be used: a node in the search result page DOM tree is the matching node of the corresponding node in the template page DOM tree only when their parent nodes match and both nodes are the child at the same position under their respective parents.
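The position-based matching described above can be sketched as follows. This is a minimal illustration, assuming well-formed markup that Python's standard `xml.etree.ElementTree` can parse; the function names `index_path` and `follow_index_path` and the sample markup are hypothetical and not taken from the patent.

```python
import xml.etree.ElementTree as ET

def index_path(root, target):
    """Return the child positions leading from root down to target, or None."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = index_path(child, target)
        if sub is not None:
            return [i] + sub
    return None

def follow_index_path(root, path):
    """Walk the same child positions in another tree; None if the shape differs."""
    node = root
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node

# Hypothetical template page with an annotated table node.
template = ET.fromstring(
    '<html><body><div/><table Annotation_tag="Target_table"><tr/></table></body></html>'
)
# Hypothetical search result page with the same layout but more rows.
result = ET.fromstring(
    "<html><body><div/><table><tr/><tr/></table></body></html>"
)

template_table = next(
    n for n in template.iter() if n.get("Annotation_tag") == "Target_table"
)
path = index_path(template, template_table)   # root -> table, as child positions
matched = follow_index_path(result, path)     # same positions in the result tree
```

The same two helpers apply to element nodes if the row node is used as the starting point instead of the root node.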
In another embodiment of the present invention, the annotations of nodes in the template HTML page may also include a location path for a node on the path from the table node to the root node. In that case, the search server finds the counterpart of that node as follows: according to the node's location path in the template HTML page, it selects from the search result page DOM tree the node with the same location path as the matching node in the search result HTML page.
In another embodiment of the present invention, the annotations of nodes in the template HTML page may also include a location path for a node on the path from an element node to its row node. In that case, the search server finds the counterpart of that node in the same way: according to the node's location path in the template HTML page, it selects from the search result page DOM tree the node with the same location path as the matching node in the search result HTML page.
The structured data may specifically be tabular structured data, that is, data organized as a table.
In one embodiment of the present invention, if the search server forwards the search request to member search engines or deep web sites after receiving it from a search client, then after extracting structured data from the search result HTML pages the search server may aggregate the extracted structured data and return it to the search client in a unified format.
In another embodiment of the present invention, the search server actively sends search requests to search engines or deep web sites. After extracting structured data from the search result HTML pages, the search server may perform statistical analysis on it, for example analyzing the price trend of a particular housing development. If the search server later receives a search request from a search client, it can send the already extracted structured data directly to the client, which speeds up the client's retrieval of search results.
As described above, in this embodiment the search server can use a Template DOM Tree carrying manual parsing annotations to extract structured data automatically from search result HTML pages. The search server can therefore build a single generic wrapper for all search engines or deep web sites and build an annotated Template DOM Tree for each source. This is enough to extract structured data automatically from all of them, with higher extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to complete automatic extraction from the new interface. The code of the generic wrapper does not need to be modified, which greatly improves the maintainability of the system.
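The idea of one generic wrapper driven by per-source template data can be sketched as follows. This is a simplified illustration, not the patented implementation: the dictionary layout of `TEMPLATE_A`, the function names `descend` and `generic_wrapper`, and the sample page are all assumptions, and the template is reduced to child-position paths plus the two skip counts.

```python
import xml.etree.ElementTree as ET

def descend(node, path):
    """Follow a list of child positions; return None if the tree is too small."""
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node

def generic_wrapper(result_root, template):
    """Extract one record per row using only data held in the template."""
    table = descend(result_root, template["table_path"])
    if table is None:
        return []
    children = list(table)
    begin = template["begin_skip"]
    end = len(children) - template["end_skip"]
    rows = children[begin:end] if end > begin else []
    records = []
    for row in rows:
        record = {}
        for name, path in template["fields"].items():
            node = descend(row, path)
            record[name] = "".join(node.itertext()).strip() if node is not None else None
        records.append(record)
    return records

# Hypothetical template data for one source: changing the source's layout
# only requires changing this dictionary, not generic_wrapper itself.
TEMPLATE_A = {
    "table_path": [0, 0],          # html -> body -> table
    "begin_skip": 1,               # skip the header row
    "end_skip": 0,
    "fields": {"Title": [0], "url": [1]},
}

page = ET.fromstring(
    "<html><body><table>"
    "<tr><th>Title</th><th>Link</th></tr>"
    "<tr><td>First</td><td>http://example.com/1</td></tr>"
    "<tr><td>Second</td><td>http://example.com/2</td></tr>"
    "</table></body></html>"
)
records = generic_wrapper(page, TEMPLATE_A)
```

Under these assumptions, supporting a second source means adding a second template dictionary while the wrapper code stays unchanged.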
A concrete example is given below to describe the HTML page structured data extraction method provided by the embodiments of the present invention. This embodiment also describes the processing performed by the search server, which here is the search server in a meta-search framework.
First, the template HTML page is manually annotated.
A template (Template) HTML page of a member search engine or deep web site is obtained, and the elements of the structured data in the template page DOM tree (Template DOM Tree) are manually annotated. The annotations may be added as attributes (Attributes) of the nodes (Element) of the Template DOM Tree. In one embodiment of the present invention, the manual parsing annotations may be of the following three kinds.
The first kind is the annotation of the table node (Target_table). The subtree rooted at this node is the smallest subtree of the Template DOM Tree that contains all the structured data to be extracted; that is, all the structured data to be extracted is looked up within the subtree of this node.
In one embodiment of the present invention, the table node may be annotated as follows:
Annotation_tag="Target_Table" indicates that structured data is extracted from the subtree of this node.
Annotation_begin_skip_children_count=n (n is the first quantity, an integer greater than or equal to 0) indicates that the first n direct children of the Target_Table node are skipped when Target_Row nodes are annotated. That is, Target_Row annotation starts from the (n+1)-th direct child of the Target_Table node, because in some cases the first few direct children of the Target_Table node do not contain structured data to be extracted.
Annotation_end_skip_children_count=m (m is the second quantity, an integer greater than or equal to 0) indicates that the last m direct children of the Target_Table node are skipped when Target_Row nodes are annotated. That is, the last m direct children of the Target_Table node are not annotated as Target_Row nodes, because in some cases the last few direct children of the Target_Table node do not contain structured data to be extracted.
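The effect of the two skip counts can be sketched with a small helper. This is an illustrative assumption rather than code from the patent; `select_row_nodes` is a hypothetical name, and plain strings stand in for the direct children of the Target_Table node.

```python
def select_row_nodes(table_children, begin_skip, end_skip):
    """Drop the first begin_skip and last end_skip children; the rest are rows."""
    end = len(table_children) - end_skip
    if end <= begin_skip:
        return []  # nothing left once both ends are skipped
    return table_children[begin_skip:end]

# Stand-ins for the direct children of a table node.
children = ["header", "row1", "row2", "row3", "footer"]
rows = select_row_nodes(children, begin_skip=1, end_skip=1)
```

The explicit guard avoids the negative-index slicing that would otherwise occur when the skip counts exceed the number of children.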
The second kind is the annotation of row nodes (Target_row), which marks the row structure of the table to be extracted. A Target_Row may consist of multiple parts (Parts), and repeated rows need to be annotated only once.
In one embodiment of the present invention, a row node may be annotated as follows:
Annotation_tag="Target_Row" indicates that this node is a direct child of the Target_Table node and that its subtree contains the data of one row of the tabular structured data to be extracted.
Annotation_row_total_part_num=n (n is an integer, n>=1) indicates that one data row is divided into n parts, which are spread over the subtrees of n consecutive Target_Row nodes.
Annotation_row_part_index=m (1<=m<=n) indicates that the current Target_Row node holds the m-th part.
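How a data row split into n parts might be reassembled can be sketched as follows. This is a minimal illustration under the assumption that the row nodes arrive in document order and that every logical row uses exactly n consecutive Target_Row nodes; `group_row_parts` is a hypothetical name, and dropping an incomplete trailing group is a choice made here, not something the patent specifies.

```python
def group_row_parts(row_nodes, total_parts):
    """Chunk consecutive row nodes into logical rows of total_parts parts each."""
    if total_parts < 1:
        raise ValueError("total_parts must be at least 1")
    # Ignore a trailing group that does not contain all parts (assumption).
    full = len(row_nodes) - len(row_nodes) % total_parts
    return [row_nodes[i:i + total_parts] for i in range(0, full, total_parts)]

# Strings stand in for Target_Row nodes; "a1" is part 1 of logical row a.
logical_rows = group_row_parts(["a1", "a2", "b1", "b2"], total_parts=2)
```

Within each group, the position of a node corresponds to its Annotation_row_part_index minus one.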
The third kind is the annotation of element nodes (Target_Field), which marks the elements to be extracted in each Part of each row.
In one embodiment of the present invention, an element node may be annotated as follows:
Annotation_tag="Field" indicates that this node is a node in the subtree of a Target_Row node, namely the node corresponding to a specific piece of structured data to be extracted.
Annotation_field_name gives the column name corresponding to the structured data to be extracted, such as "Title", "Author", "Abstract", or "url".
Annotation_total_num_of_field=n (n is an integer greater than or equal to 1) indicates the total number of columns in the table structure.
Annotation_field_index=m (1<=m<=n) indicates that the element represented by the current node is in the m-th column.
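The way Field annotations on a template row could drive extraction from a result row can be sketched as follows. The attribute names Annotation_tag and Annotation_field_name come from the text above; the function names `field_paths` and `extract_record`, the use of child positions as paths, and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def field_paths(template_row):
    """Map each annotated field name to its child-position path below the row."""
    paths = {}

    def walk(node, path):
        if node.get("Annotation_tag") == "Field":
            paths[node.get("Annotation_field_name")] = path
        for i, child in enumerate(node):
            walk(child, path + [i])

    walk(template_row, [])
    return paths

def extract_record(result_row, paths):
    """Read the text found at each field path in a result row."""
    record = {}
    for name, path in paths.items():
        node = result_row
        for i in path:
            children = list(node)
            if i >= len(children):
                node = None
                break
            node = children[i]
        record[name] = "".join(node.itertext()).strip() if node is not None else None
    return record

# Hypothetical annotated template row and an unannotated result row.
template_row = ET.fromstring(
    '<tr Annotation_tag="Target_Row">'
    '<td><a Annotation_tag="Field" Annotation_field_name="Title">t</a></td>'
    '<td Annotation_tag="Field" Annotation_field_name="Author">a</td>'
    "</tr>"
)
result_row = ET.fromstring("<tr><td><a>Some title</a></td><td>Some author</td></tr>")

paths = field_paths(template_row)
record = extract_record(result_row, paths)
```

A field whose path does not exist in the result row is reported as None rather than raising an error, which is a design choice of this sketch.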
In another embodiment of the present invention, the template page DOM tree may also carry optional node (Optional_Node) annotations. These mark optional sibling nodes that precede a path node, either on the path from the Target_table node up to the root node or on the path from a Target_Field node up to its Target_Row node. An optional node may be, for example, optional advertising content or an optional image node within a table row.
In one embodiment of the present invention, an optional node may be annotated as follows:
Annotation_tag="Optional_Node".
In another embodiment of the present invention, the template page DOM tree may also carry location path (Annotation_LocationPath) annotations. These give the location path used to match a path node, either on the path from the Target_table node up to the root node or on the path from a Target_Field node up to its Target_Row node. The location path may be written as a LocationPath expression as defined in the W3C standard XML Path Language (XPath) Version 1.0.
For example, in one embodiment of the present invention the location path annotation of a path node at some level may be Annotation_LocationPath=child::TABLE[attribute::class="value1"], which selects the child node whose element tag (Element Tag) is "TABLE" and which has an attribute "class" with the value "value1". In other words, the path node at that level can be matched by the node's element tag, attribute, or attribute value.
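Matching a path node by such an expression can be sketched as follows. This is a deliberately narrow illustration: it handles only a single `child::` step with at most one `attribute::name="value"` predicate, which is the form used in the example above, and is not a general XPath 1.0 implementation. The function name `match_children`, the case-insensitive tag comparison, and the sample markup are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# One child step with an optional attribute-equality predicate.
_STEP = re.compile(r'^child::(\w+)(?:\[attribute::(\w+)="([^"]*)"\])?$')

def match_children(parent, location_path):
    """Return the direct children of parent selected by a one-step location path."""
    m = _STEP.match(location_path.strip())
    if m is None:
        raise ValueError("unsupported location path: " + location_path)
    tag, attr, value = m.groups()
    hits = []
    for child in parent:
        # HTML tag names are compared case-insensitively (assumption).
        if child.tag.lower() != tag.lower():
            continue
        if attr is not None and child.get(attr) != value:
            continue
        hits.append(child)
    return hits

# Hypothetical path level with an advertisement and two tables.
parent = ET.fromstring(
    '<body><div class="ad"/><table class="other"/>'
    '<table class="value1"><tr/></table></body>'
)
hits = match_children(parent, 'child::TABLE[attribute::class="value1"]')
```

Because the match relies on tag and attribute rather than position, it still finds the table when extra siblings such as the advertisement appear before it.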
Fig. 2 illustrates the manual parsing annotations on nodes of the template page DOM tree in one embodiment of the present invention. As shown in Fig. 2, the annotations comprise the annotation of the table node (3), of the row nodes (4) and (5), of the element nodes (6), (7), (8), and (9), of the optional node (2), and the location path annotations of nodes (1) and (9). The annotations shown in Fig. 2 are as follows:
Node (1), location path: Annotation_LocationPath=child::TABLE[attribute::class="value1"]
Optional node (2): Annotation_tag="Optional_Node"
Table node (3): Annotation_tag="Target_table"; Annotation_begin_skip_children_count=1; Annotation_end_skip_children_count=1
Row node (4): Annotation_tag="Target_row"; Annotation_row_total_part_num=2; Annotation_row_part_index=1
Row node (5): Annotation_tag="Target_row"; Annotation_row_total_part_num=2; Annotation_row_part_index=2
Element node (6): Annotation_tag="Field"; Annotation_field_name="Title"; Annotation_total_num_of_fields=4; Annotation_field_index=1
Element node (7): Annotation_tag="Field"; Annotation_field_name="Author"; Annotation_total_num_of_fields=4; Annotation_field_index=2
Element node (8): Annotation_tag="Field"; Annotation_field_name="Abstract"; Annotation_total_num_of_fields=4; Annotation_field_index=3
Element node (9): Annotation_tag="Field"; Annotation_field_name="url"; Annotation_total_num_of_fields=4; Annotation_field_index=4; Annotation_LocationPath=child::TD[attribute::class="value2"]
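The annotation attributes listed for Fig. 2 can be read into typed records along the following lines. This is a sketch under stated assumptions: the class names and `parse_annotation` are hypothetical, the attribute spellings follow the Fig. 2 listing (including Annotation_total_num_of_fields), and the tag value is compared case-insensitively because the text uses both Target_table and Target_Table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableAnnotation:
    begin_skip: int
    end_skip: int

@dataclass(frozen=True)
class RowAnnotation:
    total_parts: int
    part_index: int

@dataclass(frozen=True)
class FieldAnnotation:
    name: str
    total_fields: int
    field_index: int

def parse_annotation(attrs):
    """Turn a node's attribute dictionary into a typed annotation, or None."""
    tag = attrs.get("Annotation_tag", "").lower()
    if tag == "target_table":
        return TableAnnotation(
            int(attrs.get("Annotation_begin_skip_children_count", 0)),
            int(attrs.get("Annotation_end_skip_children_count", 0)),
        )
    if tag == "target_row":
        return RowAnnotation(
            int(attrs.get("Annotation_row_total_part_num", 1)),
            int(attrs.get("Annotation_row_part_index", 1)),
        )
    if tag == "field":
        return FieldAnnotation(
            attrs["Annotation_field_name"],
            int(attrs["Annotation_total_num_of_fields"]),
            int(attrs["Annotation_field_index"]),
        )
    return None  # unannotated node or a kind not handled in this sketch

# Attribute values as listed for nodes (3), (4), and (6) of Fig. 2.
node3 = {"Annotation_tag": "Target_table",
         "Annotation_begin_skip_children_count": "1",
         "Annotation_end_skip_children_count": "1"}
node4 = {"Annotation_tag": "Target_row",
         "Annotation_row_total_part_num": "2",
         "Annotation_row_part_index": "1"}
node6 = {"Annotation_tag": "Field",
         "Annotation_field_name": "Title",
         "Annotation_total_num_of_fields": "4",
         "Annotation_field_index": "1"}
parsed = [parse_annotation(a) for a in (node3, node4, node6)]
```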
Next, after receiving a search request from a search client, the search server forwards it to member search engines or deep web sites and obtains the search result HTML pages they produce by searching according to the request.
Then the search server uses the template page DOM tree to annotate the search result page DOM tree automatically.
One) Match the Target_table node.
In one embodiment of the invention, the Target_table node is matched as follows.
1. Determine the path from the Target_table node in the Template DOM Tree upward through its parent nodes to the root node;
2. Starting from the root node of the Template DOM Tree and following the above path in the reverse direction, find for each child node on the path the matching (corresponding) node in the Search Result Page DOM Tree.
In one embodiment of the invention, a matching node may be found as follows.
1) If the node at some level of the above path in the Template DOM Tree carries an annotation of the form Annotation_LocationPath=LocationPath expression, the node at the corresponding level is matched using that LocationPath expression. For example, a node in Fig. 2 has the attribute Annotation_LocationPath=child::Div[attribute::class="value1"]; this selects, at the corresponding path level of the Search Result Page DOM Tree, the child node whose element tag is "Div" and which has an attribute "class" with value "value1" as the corresponding (matching) node. That is, the node at this path level is matched by its element tag, attribute, or attribute value.
If the node at some level of the above path in the Template DOM Tree carries no Annotation_LocationPath annotation, the corresponding node is the node whose parent is the matching parent node and which, like the template node, is the i-th child of that parent. If, at this level of the path in the Template DOM Tree, the preceding siblings of the path node include an optional node, check whether the corresponding path level of the Search Result Page DOM Tree contains that optional node; specifically, one can compare whether the two DOM trees have the same number of sibling nodes at that level. If the corresponding level of the Search Result Page DOM Tree does not contain the optional node, the index of the corresponding node in the Search Result Page DOM Tree moves forward by one, i.e. the matching node is the (i-1)-th child of the corresponding parent node.
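The two matching rules above (by LocationPath expression, otherwise by child position with an optional-node adjustment) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the Node class, the function name match_child and the regular expression are assumptions, only the single-step form child::Tag[attribute::name="value"] is handled, indices are 0-based rather than the 1-based "i-th child" of the text, and at most one missing optional node is assumed.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used for illustration only."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


# Matches expressions of the form child::Tag[attribute::name="value"].
_LOCATION_PATH = re.compile(r'child::(\w+)\[attribute::(\w+)="([^"]*)"\]')


def match_child(template_parent, template_child, result_parent):
    """Return the child of result_parent that corresponds to template_child, or None."""
    path = template_child.annotations.get("Annotation_LocationPath")
    if path is not None:
        # Rule 1: match by element tag, attribute name and attribute value.
        m = _LOCATION_PATH.fullmatch(path)
        if m is None:
            return None
        tag, attr, value = m.groups()
        for child in result_parent.children:
            if child.tag.lower() == tag.lower() and child.attrs.get(attr) == value:
                return child
        return None

    # Rule 2: match by position among the children of the matched parent.
    siblings = template_parent.children
    index = next(i for i, c in enumerate(siblings) if c is template_child)
    optional_before = any(
        c.annotations.get("Annotation_tag") == "Optional_Node" for c in siblings[:index]
    )
    # As the text suggests, a smaller sibling count signals that the optional node is absent.
    if optional_before and len(result_parent.children) < len(siblings):
        index -= 1
    if 0 <= index < len(result_parent.children):
        return result_parent.children[index]
    return None
```

Comparing sibling counts is the check named in the text; a stricter implementation might also compare tags, which the specification does not require.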
Fig. 3 illustrates the Template DOM Tree and the Search Result Page DOM Tree during table node matching in one embodiment of the invention. As shown in Fig. 3, the table node matching procedure comprises:
1. Find the path from the table node in the Template DOM Tree upward through its parent nodes to the root node: (5)->(4)->(3)->(2)->(1).
2. Starting from the root node of the Template DOM Tree and following the above path in the reverse direction, (1)->(2)->(3)->(4)->(5), find the corresponding (matching) nodes (1')->(2')->(3')->(4')->(5') in the Search Result Page DOM Tree.
The matching nodes are found as follows:
(a) Starting from the root, match node (1) with node (1').
(b) Node (2) and node (2') have the mutually matching parent nodes (1) and (1'), and are both the 2nd child of their respective parents.
(c) Because node (3) carries the annotation Annotation_LocationPath=child::Div[attribute::class="value1"], the node at the corresponding level is matched using the LocationPath expression child::Div[attribute::class="value1"] (the node's tag is "Div", it has a "class" attribute, and the attribute value is "value1"), which yields the matching node (3').
(d) Because node (4) is preceded by an optional node (annotated Annotation_tag="Optional_Node"), the matching node (4') is found as follows. First, node (4) and node (4') have the mutually matching parent nodes (3) and (3'). Then check whether the corresponding path level of the Search Result Page DOM Tree (the level of node (4')) contains the optional node, for example by comparing whether the two DOM trees have the same number of sibling nodes at that level. If that level contains the optional node, nodes (4') and (4) are both the j-th child of their parents (3') and (3); if it does not, node (4') is the (j-1)-th child of parent node (3').
(e) Node (5) and node (5') have the mutually matching parent nodes (4) and (4'), and are both the k-th child of their respective parents.
3. Having found the matching node (5') in the Search Result Page DOM Tree for the table node (5) of the Template DOM Tree, automatically annotate node (5') with annotation_tag="Target_Table", and copy to it the corresponding Annotation_begin_skip_children_count=n and Annotation_end_skip_children_count=m annotations (n and m are each a positive integer or 0).
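The table node procedure (find the template path, replay it from the root of the result tree, then copy the annotations) can be sketched as follows. This is an illustrative sketch under stated assumptions: Node, path_to, same_index and match_target_table are hypothetical names, and the default per-level matcher uses child position only; a matcher that also handles LocationPath expressions and optional nodes could be passed in through the match_child parameter.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used for illustration only."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def path_to(root, predicate):
    """Return the nodes from root down to the first node satisfying predicate, or None."""
    if predicate(root):
        return [root]
    for child in root.children:
        sub = path_to(child, predicate)
        if sub is not None:
            return [root] + sub
    return None


def same_index(template_parent, template_child, result_parent):
    """Simplest per-level rule: the match is the child at the same position."""
    i = next(i for i, c in enumerate(template_parent.children) if c is template_child)
    return result_parent.children[i] if i < len(result_parent.children) else None


def match_target_table(template_root, result_root, match_child=same_index):
    """Locate and annotate the node of the result tree matching the template's Target_table."""
    path = path_to(
        template_root, lambda n: n.annotations.get("Annotation_tag") == "Target_table"
    )
    if path is None:
        return None
    current = result_root  # step (a): the two roots match each other
    for parent, child in zip(path, path[1:]):
        current = match_child(parent, child, current)
        if current is None:
            return None
    # Step 3: copy the tag and the begin/end skip counts to the matched node.
    current.annotations.update(path[-1].annotations)
    return current
```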
Two) Match the Target_Row nodes.
1. For the direct child nodes one level below the Target_Table node, first skip the first n direct child nodes and the last m direct child nodes, according to Annotation_begin_skip_children_count=n and Annotation_end_skip_children_count=m of the Target_Table node.
2. Annotate all remaining direct child nodes as Target_Row nodes and, according to annotation_row_total_part_num=n of the Target_Row nodes in the Template DOM Tree (here n is the number of parts per row, not the skip count of step 1), annotate the parts of each row consecutively: for n consecutive Target_Row nodes, copy the annotation_row_total_part_num=n and annotation_row_part_index=m annotations that apply to each node. Once the parts of one row have been annotated, continue with the n consecutive parts of the next row (the next n consecutive Target_Row nodes), until the parts of all Target_Row nodes have been annotated.
Fig. 4 illustrates the Template DOM Tree and the Search Result Page DOM Tree during row node matching in one embodiment of the invention. As shown in Fig. 4, the row node matching procedure comprises:
1. For the direct child nodes (2'), (3'), (4'), (3''), (4'') and (5') one level below the table node (1') of the Search Result Page DOM Tree, skip the first direct child node (2') and the last direct child node (5'), according to Annotation_begin_skip_children_count=1 and Annotation_end_skip_children_count=1 of table node (1').
2. Annotate all remaining direct child nodes (3'), (4'), (3'') and (4'') as row nodes and, according to the annotation annotation_row_total_part_num=2 (one row comprises 2 parts) of the row nodes (2) and (3) in the Template DOM Tree, annotate the parts of each row of the Search Result Page DOM Tree consecutively: for the 2 consecutive row nodes (3') and (4'), copy annotation_row_total_part_num=2 and annotation_row_part_index=1 or 2 respectively. Once the parts of one row have been annotated, continue with the 2 consecutive parts of the next row (the next 2 consecutive row nodes (3'') and (4'')), until the parts of all row nodes have been annotated.
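The row step amounts to slicing off the skipped children and grouping the remainder into rows of a fixed number of parts. The following is a minimal sketch, assuming a hypothetical Node class and a function mark_rows whose name and signature are illustrative; the skip counts are read from the already annotated table node and the part count is taken from the template's row annotation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used for illustration only."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def mark_rows(table, total_part_num):
    """Annotate the row parts under an annotated table node and return them grouped by row."""
    begin = table.annotations.get("Annotation_begin_skip_children_count", 0)
    end = table.annotations.get("Annotation_end_skip_children_count", 0)
    # Step 1: skip the first `begin` and the last `end` direct children.
    candidates = table.children[begin:len(table.children) - end]
    rows = []
    # Step 2: every run of total_part_num consecutive children forms one logical row.
    for start in range(0, len(candidates), total_part_num):
        group = candidates[start:start + total_part_num]
        for part_index, node in enumerate(group, start=1):
            node.annotations.update({
                "Annotation_tag": "Target_row",
                "Annotation_row_total_part_num": total_part_num,
                "Annotation_row_part_index": part_index,
            })
        rows.append(group)
    return rows
```

With the six children of Fig. 4, skip counts of 1 and 1, and two parts per row, this groups (3'), (4') into the first row and (3''), (4'') into the second.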
Three) Match the Target_Field nodes.
1. Find the path from a Target_Field node in the Template DOM Tree upward through its parent nodes to the Target_Row node;
2. Starting from the Target_Row node of the Template DOM Tree and following the above path, find for each child node on the path the matching node in the Search Result Page DOM Tree.
In one embodiment of the invention, a matching node may be found as follows.
1) If the node at some level of the above path in the Template DOM Tree carries an annotation of the form Annotation_LocationPath=LocationPath expression, the node at the corresponding level is matched using that LocationPath expression. For example, if a node has the attribute Annotation_LocationPath=child::Td[attribute::class="value2"], this selects, at the corresponding path level of the Search Result Page DOM Tree, the child node whose element tag is "Td" and whose attribute "class" has the value "value2" as the corresponding (matching) node. That is, the node at this path level is matched by its element tag, attribute, or attribute value.
If the node at some level of the above path in the Template DOM Tree carries no Annotation_LocationPath annotation, the corresponding node is the node whose parent is the matching parent node and which, like the template node, is the i-th child of that parent. If, at this level of the path in the Template DOM Tree, the preceding siblings of the path node include an optional node, check whether the corresponding path level of the Search Result Page DOM Tree contains that optional node; specifically, one can compare whether the two DOM trees have the same number of sibling nodes at that level. If the corresponding level of the Search Result Page DOM Tree does not contain the optional node, the index of the corresponding node in the Search Result Page DOM Tree moves forward by one, i.e. the matching node is the (i-1)-th child of the corresponding parent node.
Fig. 4 illustrates the Template DOM Tree and the Search Result Page DOM Tree during element node matching in one embodiment of the invention. As shown in Fig. 4, the element node matching procedure comprises:
A. First match the element nodes in the subtrees below the row part 1 nodes (1) and (1'):
1. Find the path from an element node (4) in the Template DOM Tree upward through its parent nodes to the row node (1): (4)->(3)->(2)->(1).
2. Starting from the row node (1) of the Template DOM Tree and following the above path in the reverse direction, (1)->(2)->(3)->(4), find for each child node the matching node in the Search Result Page DOM Tree: (1')->(2')->(3')->(4').
The matching nodes are found as follows:
(a) Match row node (1) with row node (1').
(b) Node (2) and node (2') have the mutually matching parent nodes (1) and (1'), and are both the i-th child of their respective parents.
(c) Because node (3) is preceded by an optional node (annotated Annotation_tag="Optional_Node"), the matching node (3') is found as follows. First, node (3) and node (3') have the mutually matching parent nodes (2) and (2'). Then check whether the corresponding path level of the Search Result Page DOM Tree (the level of node (3')) contains the optional node, for example by comparing whether the two DOM trees have the same number of sibling nodes at that level. If that level contains the optional node, nodes (3') and (3) are both the j-th child of their parents (2') and (2); if it does not, node (3') is the (j-1)-th child of parent node (2').
(d) Node (4) and node (4') have the mutually matching parent nodes (3) and (3'), and are both the k-th child of their respective parents.
3. Having found the matching node (4') in the Search Result Page DOM Tree for the element node (4) of the Template DOM Tree, automatically annotate node (4') with annotation_tag="Target_Field", and copy to it the corresponding Annotation_field_name="Title", Annotation_total_num_of_field=4 and Annotation_field_index=1 annotations; alternatively, once the matching Target_Field node of the Search Result Page has been found, extract the structured data directly from that node.
4. For another element node (5) in the Template DOM Tree, repeat steps 1., 2. and 3. until its matching node (5') in the Search Result Page DOM Tree is found.
B. Then match the Target_Field nodes in the subtrees below the Target_Row Part2 nodes (6) and (6'),
until the matching nodes (9') and (10') in the Search Result Page DOM Tree have been found for nodes (9) and (10) of the Template DOM Tree.
The matching method is essentially the same as steps 1., 2., 3. and 4. in A and is not repeated here. Node (10') is matched with node (10) as follows: because node (10) carries the annotation Annotation_LocationPath=child::Td[attribute::class="value2"], the node at the corresponding level is matched using the LocationPath expression child::Td[attribute::class="value2"] (the node's tag is "Td", it has a "class" attribute, and the attribute value is "value2"), which yields the matching node (10').
2) After the matching node in the Search Result Page DOM Tree has been found for a Target_Field node of the Template DOM Tree, automatically annotate the matching node with annotation_tag="Target_Field", and copy to it, from the corresponding Target_Field node of the Template DOM Tree, the annotations Annotation_field_name (the column name of the structured data to be extracted), Annotation_total_num_of_field=n and Annotation_field_index=m. Further, if for some row a Target_Field node of the Template DOM Tree has no matching node in the Search Result Page DOM Tree, all Target_Field data of that row are discarded.
3) Repeat steps 1) and 2) to annotate every Field of one part of a row.
4) Repeat steps 1), 2) and 3) to annotate all Fields of all parts of a row.
5) Repeat steps 1), 2), 3) and 4) to annotate the Fields of all rows.
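The per-row field matching, including the rule that a single unmatched field discards the whole row, can be sketched as follows. This is a simplified illustration: the names Node, follow and extract_row are assumptions, each template path is reduced to a list of 0-based child indices below its row part (the positional rule only, without LocationPath or optional-node handling), and the data is read directly from the matched node, as the text allows.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used for illustration only."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def follow(node, index_path):
    """Walk down from node along 0-based child indices; return None if the path breaks."""
    for i in index_path:
        if i >= len(node.children):
            return None
        node = node.children[i]
    return node


def extract_row(row_parts, field_specs):
    """Build one record from the parts of a row.

    field_specs is a list of (part_index, index_path, field_name) tuples,
    with part_index counted from 1 as in Annotation_row_part_index.
    """
    record = {}
    for part_index, index_path, name in field_specs:
        if part_index > len(row_parts):
            return None
        node = follow(row_parts[part_index - 1], index_path)
        if node is None:
            # One field without a match discards all field data of this row.
            return None
        record[name] = node.text
    return record
```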
Four) Automatically extract the structured data from the Search Result Page to which automatic parsing annotations have been added, which may comprise the following steps.
1. Find the node annotated annotation_tag="Target_table".
2. Find the nodes annotated annotation_tag="Target_row", locating the parts of every row.
3. Find the nodes annotated annotation_tag="Target_Field", and extract the structured data from the elements of these nodes.
4. Repeat step 3 for each Target_row until the structured data of all rows have been extracted.
This completes the extraction of structured data from the search result HTML page.
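The four extraction steps can be sketched as a walk over the annotated tree. This is a minimal sketch under assumptions: Node, iter_nodes and extract_records are hypothetical names, each node carries its text directly, and a record is considered complete when the part whose index equals Annotation_row_total_part_num has been read.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used for illustration only."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def iter_nodes(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_records(root):
    """Collect one dict per logical row from an automatically annotated result tree."""
    records = []
    for table in iter_nodes(root):
        if table.annotations.get("Annotation_tag") != "Target_table":
            continue                                   # step 1: find the table node
        current = {}
        for row in table.children:
            if row.annotations.get("Annotation_tag") != "Target_row":
                continue                               # step 2: find the row parts
            for node in iter_nodes(row):
                if node.annotations.get("Annotation_tag") == "Target_Field":
                    # step 3: read the data of each field node
                    current[node.annotations["Annotation_field_name"]] = node.text
            total = row.annotations["Annotation_row_total_part_num"]
            if row.annotations["Annotation_row_part_index"] == total:
                records.append(current)                # last part closes the row
                current = {}
    return records
```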
In another embodiment of the invention, after the matching Target_Field node of the Search Result Page has been found, the matching node need not be automatically annotated; instead, the structured data can be extracted directly from it.
As can be seen from the above, in this embodiment the search server can use a manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all deep web search engines or deep web sites and, by building a manually annotated Template DOM Tree for each deep web site, extract the structured data of all deep web sites automatically, with high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface Template needs to be re-annotated for automatic extraction to work on the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the described order of actions, because according to the invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
The HTML page structured data extraction device provided by embodiments of the invention is introduced next; this device can serve as a search server.
Fig. 6 depicts the structure of the HTML page structured data extraction device provided by one embodiment of the invention, comprising:
Sending unit 601, configured to send a search request to a search engine or deep web site.
Acquiring unit 602, configured to obtain the search result HTML page that the search engine or deep web site obtains according to the search request sent by sending unit 601.
Extraction unit 603, configured to extract structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree of the page obtained by acquiring unit 602; the template HTML page contains manual parsing annotations.
As shown in Fig. 6, the HTML page structured data extraction device provided by another embodiment of the invention may further comprise a receiving unit 604, configured to receive a search request from a search client. In this case, sending unit 601 forwards the search request to the search engine or deep web site only after receiving unit 604 has received it from the search client; furthermore, after extraction unit 603 has extracted structured data from the search result HTML page, sending unit 601 aggregates the extracted structured data and returns it to the search client in a unified format.
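The cooperation of units 601 to 604 can be pictured as a small pipeline. The following is an illustrative sketch, not the device itself: the class ExtractionDevice, its constructor parameters and the "source" key of the unified result format are assumptions, and network access is replaced by injected callables so that the example stays self-contained.

```python
from typing import Callable, Dict, List


class ExtractionDevice:
    """Hypothetical stand-in for the device of Fig. 6; all names are illustrative."""

    def __init__(
        self,
        backends: Dict[str, Callable[[str], str]],
        extract: Callable[[str, str], List[dict]],
    ):
        # Maps a source name to a function that takes a query and returns result HTML.
        self.backends = backends
        # Takes (source name, result HTML) and returns the extracted records.
        self.extract = extract

    def handle(self, query: str) -> List[dict]:
        """Play the roles of units 604, 601, 602 and 603 for one client request."""
        merged = []
        for source, search in self.backends.items():
            html = search(query)                        # sending unit 601 and acquiring unit 602
            for record in self.extract(source, html):   # extraction unit 603
                # Assumed unified format: every record is tagged with its source.
                merged.append({"source": source, **record})
        return merged
```

Passing the extractor in as a callable mirrors the point made in the specification: the general-purpose code stays fixed, while the per-source templates it consults can be re-annotated when an interface changes.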
As shown in Fig. 6, the HTML page structured data extraction device provided by another embodiment of the invention may further comprise a receiving unit 604, which receives the search request from the search client after extraction unit 603 has extracted structured data from the search result HTML page. In this case, after receiving unit 604 has received the search request from the search client, sending unit 601 aggregates the structured data extracted by extraction unit 603 and returns it to the search client in a unified format.
As can be seen from the above, in this embodiment the search server can use a manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all deep web search engines or deep web sites and, by building a manually annotated Template DOM Tree for each deep web site, extract the structured data of all deep web sites automatically, with high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface Template needs to be re-annotated for automatic extraction to work on the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
Fig. 7 depicts the structure of the HTML page structured data extraction device provided by another embodiment of the invention, comprising:
Receiving unit 701, configured to receive a search request from a search client.
Acquiring unit 702, configured to obtain the search result HTML page that the search engine or deep web site obtains according to the search request sent by sending unit 704.
Extraction unit 703, configured to extract structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree of the page obtained by acquiring unit 702. The template HTML page contains manual parsing annotations, which comprise: an annotation of the table node in the template HTML page, where the subtree rooted at the table node is the smallest subtree of the Template DOM Tree that contains all the structured data to be extracted; an annotation of the row nodes in the template HTML page, where a row node is a direct child of the table node and the subtree rooted at a row node contains the data of one row of the structured tabular data to be extracted; and an annotation of the element nodes in the template HTML page, where an element node is a node in the subtree rooted at a row node and is the node corresponding to an element to be extracted.
Sending unit 704, configured to forward the search request received by receiving unit 701 to the search engine or deep web site, and to aggregate the structured data extracted by extraction unit 703 and return it to the search client in a unified format.
As shown in Fig. 7, extraction unit 703 may specifically comprise:
First lookup unit 7031, configured to locate the table node in the search result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of the table node in the template HTML page comprises: an annotation marking the table node in the template HTML page as a table node; an annotation indicating that the number of leading direct child nodes of the table node that are skipped when annotating row nodes in the template HTML page is a first quantity; and an annotation indicating that the number of trailing direct child nodes of the table node that are skipped when annotating row nodes in the template HTML page is a second quantity.
Second lookup unit 7032, configured to locate the row nodes in the search result HTML page, using the table node found by first lookup unit 7031 in the search result HTML page, the annotation of the row nodes in the template HTML page, and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of a row node in the template HTML page comprises: an annotation marking the row node as a row node; an annotation indicating the number of parts into which the data row corresponding to the row node is divided; and an annotation indicating the sequence number of the part to which the row node belongs.
Third lookup unit 7033, configured to locate the element nodes in the search result HTML page, using the row nodes found by second lookup unit 7032 in the search result HTML page, the annotation of the element nodes in the template HTML page, and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of an element node in the template HTML page comprises: an annotation marking the element node as an element node; an annotation giving the column name, in the structured tabular data, of the structured data to be extracted; an annotation giving the number of columns in the structured table; and an annotation giving the sequence number of the column to which the element node belongs.
Data extraction unit 7034, configured to extract structured data directly from the element nodes found by third lookup unit 7033 in the search result HTML page.
As can be seen from the above, in this embodiment the search server can use a manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all deep web search engines or deep web sites and, by building a manually annotated Template DOM Tree for each deep web site, extract the structured data of all deep web sites automatically, with high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface Template needs to be re-annotated for automatic extraction to work on the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
Fig. 8 depicts the structure of the HTML page structured data extraction device provided by another embodiment of the invention, comprising:
Receiving unit 801, configured to receive a search request from a search client.
Acquiring unit 802, configured to obtain the search result HTML page that the search engine or deep web site obtains according to the search request sent by sending unit 804.
Extraction unit 803, configured to extract structured data from the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree of the page obtained by acquiring unit 802. The template HTML page contains manual parsing annotations, which comprise: an annotation of the table node in the template HTML page, where the subtree rooted at the table node is the smallest subtree of the Template DOM Tree that contains all the structured data to be extracted; an annotation of the row nodes in the template HTML page, where a row node is a direct child of the table node and the subtree rooted at a row node contains the data of one row of the structured tabular data to be extracted; and an annotation of the element nodes in the template HTML page, where an element node is a node in the subtree rooted at a row node and is the node corresponding to an element to be extracted.
Sending unit 804, configured to forward the search request received by receiving unit 801 to the search engine or deep web site, and to aggregate the structured data extracted by extraction unit 803 and return it to the search client in a unified format.
As shown in Fig. 8, extraction unit 803 may specifically comprise:
Annotation unit 8031, configured to automatically annotate the search result HTML page according to a pre-stored template HTML page of the search engine or deep web site and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree of the page obtained by acquiring unit 802.
Data extraction unit 8032, configured to extract structured data from the search result HTML page that annotation unit 8031 has automatically annotated.
As shown in Fig. 8, annotation unit 8031 may specifically comprise:
First lookup unit 80311, configured to locate the table node in the search result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of the table node in the template HTML page comprises: an annotation marking the table node in the template HTML page as a table node; an annotation indicating that the number of leading direct child nodes of the table node that are skipped when annotating row nodes in the template HTML page is a first quantity; and an annotation indicating that the number of trailing direct child nodes of the table node that are skipped when annotating row nodes in the template HTML page is a second quantity.
First annotation unit 80312, configured to annotate the table node found by first lookup unit 80311 in the search result HTML page.
Second lookup unit 80313, configured to locate the row nodes in the search result HTML page, using the table node found by first lookup unit 80311 in the search result HTML page, the annotation of the row nodes in the template HTML page, and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of a row node in the template HTML page comprises: an annotation marking the row node as a row node; an annotation indicating the number of parts into which the data row corresponding to the row node is divided; and an annotation indicating the sequence number of the part to which the row node belongs.
Second annotation unit 80314, configured to annotate the row nodes found by second lookup unit 80313 in the search result HTML page.
Third lookup unit 80315, configured to locate the element nodes in the search result HTML page, using the row nodes found by second lookup unit 80313 in the search result HTML page, the annotation of the element nodes in the template HTML page, and the matching relationship between the Template DOM Tree and the Search Result Page DOM Tree. The annotation of an element node in the template HTML page comprises: an annotation marking the element node as an element node; an annotation giving the column name, in the structured tabular data, of the structured data to be extracted; an annotation giving the number of columns in the structured table; and an annotation giving the sequence number of the column to which the element node belongs.
Third annotation unit 80316, configured to annotate the element nodes found by third lookup unit 80315 in the search result HTML page.
As can be seen from the above, in this embodiment the search server can use a manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all deep web search engines or deep web sites and, by building a manually annotated Template DOM Tree for each deep web site, extract the structured data of all deep web sites automatically, with high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface Template needs to be re-annotated for automatic extraction to work on the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
The search system provided by embodiments of the invention is introduced next. This search system comprises the HTML page structured data extraction device provided by embodiments of the invention, and the device is connected to at least one search engine or deep web site.
The information exchange between the modules of the above device and system, and their execution processes, are based on the same concept as the method embodiments of the invention; for details, refer to the description of the method embodiments, which is not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM: Read-Only Memory), a random access memory (RAM: Random Access Memory), or the like.
Specific examples have been used herein to explain the principles and implementations of the invention; the description of the above embodiments is only intended to help in understanding the method of the invention and its underlying idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and to the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (17)

1. A hypertext markup language (HTML) page structured data extraction method, characterized by comprising:
Sending a search request to a search engine or deep web;
Obtaining the search-result HTML page that the search engine or deep web produces according to the search request;
Extracting structured data from the search-result HTML page according to a pre-stored template HTML page of the search engine or deep web, and according to the matching relationship between the template page document object model (DOM) tree corresponding to the template HTML page and the search-result page DOM tree, wherein the template HTML page comprises manual parsing annotations.
2. The HTML page structured data extraction method according to claim 1, characterized in that the manual parsing annotations comprised in the template HTML page comprise:
An annotation of a table node in the template HTML page, wherein the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted;
And an annotation of a row node in the template HTML page, wherein the row node is a direct child node of the table node, and the subtree rooted at the row node contains the data of one row of the structured table data to be extracted;
And an annotation of an element node in the template HTML page, wherein the element node is a node in the subtree rooted at the row node and is the node corresponding to an element to be extracted.
3. The HTML page structured data extraction method according to claim 2, characterized in that:
The annotation of the table node in the template HTML page comprises: marking the table node in the template HTML page as a table node; marking a first quantity, which is the number of leading direct child nodes of the table node that are skipped when the row nodes in the template HTML page are annotated; and marking a second quantity, which is the number of trailing direct child nodes of the table node that are skipped when the row nodes in the template HTML page are annotated;
The annotation of the row node in the template HTML page comprises: marking the row node in the template HTML page as a row node; marking the number of parts into which the data row corresponding to the row node is divided in the template HTML page; and marking the sequence number of the part in which the row node lies in the template HTML page;
The annotation of the element node in the template HTML page comprises: marking the element node in the template HTML page as an element node; marking the column name, in the structured table data, of the structured data to be extracted; marking the number of columns in the structured table; and marking the sequence number of the column in which the element node lies.
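The three kinds of annotation listed in this claim can be pictured as small records. The following is only a sketch of one possible in-memory representation; the class and field names are hypothetical, and the claim does not prescribe how the annotations are actually stored in the template page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableAnnotation:
    skip_head: int  # "first quantity": leading direct children that are not rows
    skip_tail: int  # "second quantity": trailing direct children that are not rows


@dataclass(frozen=True)
class RowAnnotation:
    part_count: int  # number of parts one data row is divided into
    part_index: int  # sequence number of the part this row node holds


@dataclass(frozen=True)
class ElementAnnotation:
    column_name: str   # column name in the structured table data
    column_count: int  # number of columns in the structured table
    column_index: int  # sequence number of the column this element fills


# Example: a result list with one header and one footer child, rows that are
# not split into parts, and a "price" element in the second of two columns.
table = TableAnnotation(skip_head=1, skip_tail=1)
row = RowAnnotation(part_count=1, part_index=1)
price = ElementAnnotation(column_name="price", column_count=2, column_index=2)
```

Whether sequence numbers start at 0 or 1 is not stated in the claim; the example above assumes 1.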
4. The HTML page structured data extraction method according to claim 3, characterized in that the extracting of structured data from the search-result HTML page comprises:
Finding the table node on the search-result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the template page DOM tree and the search-result page DOM tree;
Finding the row nodes on the search-result HTML page, using the table node found on the search-result HTML page, the annotation of the row node in the template HTML page, and the matching relationship between the template page DOM tree and the search-result page DOM tree;
Finding the element nodes on the search-result HTML page, using the row nodes found on the search-result HTML page, the annotation of the element node in the template HTML page, and the matching relationship between the template page DOM tree and the search-result page DOM tree;
Extracting the structured data directly from the element nodes found on the search-result HTML page.
5. The HTML page structured data extraction method according to claim 4, characterized in that the finding of the table node on the search-result HTML page comprises:
Determining the path that leads from the table node in the template page DOM tree upward through successive parent nodes to the root node; and, starting from the root node of the template page DOM tree, finding for each node along this path in the reverse direction the matching node in the search-result page DOM tree, until the matching table node in the search-result page DOM tree is found;
The finding of the row nodes on the search-result HTML page comprises:
Taking, as the row nodes on the search-result HTML page, the direct child nodes of the table node on the search-result HTML page other than the first quantity of direct child nodes at the beginning and the second quantity of direct child nodes at the end;
The finding of the element nodes on the search-result HTML page comprises:
Determining the path that leads from the element node in the template page DOM tree upward through successive parent nodes to the row node; and, starting from the row node of the template page DOM tree, finding for each node along this path in the reverse direction the matching node in the search-result page DOM tree, until the matching element node in the search-result page DOM tree is found.
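The three lookups in this claim can be sketched with Python's standard `xml.etree.ElementTree`, assuming well-formed markup and assuming that a node "matches" when it sits at the same child position under its matched parent. The helper names `path_from`, `follow` and `rows_of`, and the two tiny pages, are illustrative assumptions rather than part of the patent.

```python
import xml.etree.ElementTree as ET


def path_from(ancestor, node):
    """Return the child indices leading from ancestor down to node, or None."""
    if ancestor is node:
        return []
    for i, child in enumerate(list(ancestor)):
        sub = path_from(child, node)
        if sub is not None:
            return [i] + sub
    return None


def follow(start, path):
    """Walk the same child indices on another tree; None if a step is missing."""
    node = start
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node


def rows_of(table, skip_head, skip_tail):
    """Direct children of the table node minus the skipped head and tail."""
    children = list(table)
    return children[skip_head:len(children) - skip_tail]


TEMPLATE = ("<html><body><h1>t</h1><div><p>head</p>"
            "<div><a>name</a><span>price</span></div><p>foot</p></div></body></html>")
RESULT = ("<html><body><h1>r</h1><div><p>2 hits</p>"
          "<div><a>Pen</a><span>2</span></div>"
          "<div><a>Ink</a><span>5</span></div><p>next</p></div></body></html>")

t_root, r_root = ET.fromstring(TEMPLATE), ET.fromstring(RESULT)
t_table = t_root[0][1]                       # annotated table node
t_row = t_table[1]                           # annotated row node
columns = {"name": t_row[0], "price": t_row[1]}  # annotated element nodes

# Table node: root-to-table path replayed on the result tree.
r_table = follow(r_root, path_from(t_root, t_table))
# Row nodes: skip one leading and one trailing child; elements: row-relative paths.
records = [
    {col: follow(row, path_from(t_row, node)).text for col, node in columns.items()}
    for row in rows_of(r_table, skip_head=1, skip_tail=1)
]
```

The result page holds two rows although the template holds one, which is why row nodes are selected by skipping head and tail children instead of by a fixed path.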
6. The HTML page structured data extraction method according to claim 5, characterized in that the finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree comprises: finding, in the search-result page DOM tree, the node that has a parent node matching that of the table node in the template HTML page and that is likewise the child node at the same position under the matched parent node, and taking this node as the matching node in the search-result page DOM tree.
7. The HTML page structured data extraction method according to claim 5, characterized in that the annotation of nodes in the template HTML page may further comprise the position path of a given node on the path from the table node in the template HTML page to the root node;
The finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree, until the matching table node in the search-result page DOM tree is found, comprises: according to the position path of the given node in the template HTML page, finding in the search-result page DOM tree the node having the same position path, and taking it as the node on the search-result HTML page that matches the given node.
8. The HTML page structured data extraction method according to claim 5, characterized in that the annotation of nodes in the template HTML page may further comprise the position path of a given node on the path from the element node in the template HTML page to the row node;
The finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree, until the matching element node in the search-result page DOM tree is found, comprises:
According to the position path of the given node in the template HTML page, finding in the search-result page DOM tree the node having the same position path, and taking it as the node on the search-result HTML page that matches the given node.
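A position path of the kind mentioned in claims 7 and 8 could, for instance, be stored as a string of child indices. The sketch below assumes such a `/i/j/...` encoding, which the claims do not specify; `position_path` and `resolve` are hypothetical helpers built on the standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def position_path(ancestor, node):
    """Encode the route from ancestor down to node as '/i/j/...' child indices."""
    parents = {child: parent for parent in ancestor.iter() for child in parent}
    steps = []
    while node is not ancestor:
        parent = parents[node]
        steps.append(str(list(parent).index(node)))
        node = parent
    return "/" + "/".join(reversed(steps))


def resolve(start, path):
    """Find the node with the same position path below start, or None."""
    node = start
    for step in filter(None, path.split("/")):
        try:
            node = list(node)[int(step)]
        except IndexError:
            return None
    return node


template = ET.fromstring(
    "<html><body><h1/><div><p/><div><a/><span/></div><p/></div></body></html>")
table = template[0][1]
row = table[1]
table_path = position_path(template, table)  # relative to the root node (claim 7)
price_path = position_path(row, row[1])      # relative to the row node (claim 8)

result = ET.fromstring(
    "<html><body><h1/><div><p/><div><a>Pen</a><span>2</span></div><p/></div></body></html>")
r_row = list(resolve(result, table_path))[1]
price = resolve(r_row, price_path).text
```

Storing the path in the annotation means the template tree need not be walked again at extraction time; only the stored string is replayed on the result tree.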
9. The HTML page structured data extraction method according to claim 1, characterized in that the extracting of structured data from the search-result HTML page comprises:
Automatically parsing and annotating the search-result HTML page;
Extracting the structured data from the search-result HTML page that has been automatically parsed and annotated.
10. The HTML page structured data extraction method according to claim 9, characterized in that the manual parsing annotations comprised in the template HTML page comprise:
An annotation of a table node in the template HTML page, wherein the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted;
And an annotation of a row node in the template HTML page, wherein the row node is a direct child node of the table node, and the subtree rooted at the row node contains the data of one row of the structured table data to be extracted;
And an annotation of an element node in the template HTML page, wherein the element node is a node in the subtree rooted at the row node and is the node corresponding to an element to be extracted.
11. The HTML page structured data extraction method according to claim 10, characterized in that:
The annotation of the table node in the template HTML page comprises: marking the table node in the template HTML page as a table node; marking a first quantity, which is the number of leading direct child nodes of the table node that are skipped when the row nodes in the template HTML page are annotated; and marking a second quantity, which is the number of trailing direct child nodes of the table node that are skipped when the row nodes in the template HTML page are annotated;
The annotation of the row node in the template HTML page comprises: marking the row node in the template HTML page as a row node; marking the number of parts into which the data row corresponding to the row node is divided in the template HTML page; and marking the sequence number of the part in which the row node lies in the template HTML page;
The annotation of the element node in the template HTML page comprises: marking the element node in the template HTML page as an element node; marking the column name, in the structured table data, of the structured data to be extracted; marking the number of columns in the structured table; and marking the sequence number of the column in which the element node lies.
12. The HTML page structured data extraction method according to claim 11, characterized in that the automatic parsing and annotating of the search-result HTML page comprises:
Finding the table node on the search-result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the template page DOM tree and the search-result page DOM tree, and annotating the table node found on the search-result HTML page;
Finding the row nodes on the search-result HTML page, using the table node found on the search-result HTML page, the annotation of the row node in the template HTML page, and the matching relationship between the template page DOM tree and the search-result page DOM tree, and annotating the row nodes found on the search-result HTML page;
Finding the element nodes on the search-result HTML page, using the row nodes found on the search-result HTML page, the annotation of the element node in the template HTML page, and the matching relationship between the template page DOM tree and the search-result page DOM tree, and annotating the element nodes found on the search-result HTML page.
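The two-phase variant of claims 9 and 12 (first annotate the result page automatically, then extract from the annotated page) might look as follows. This is a sketch under assumptions: the attribute names `data-role` and `data-column`, the index-list paths and the helpers `descend`, `annotate` and `extract` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def descend(start, path):
    """Follow a list of child indices below start."""
    node = start
    for i in path:
        node = list(node)[i]
    return node


def annotate(result_root, table_path, skip_head, skip_tail, columns):
    """Mark table, row and element nodes on the result tree.

    columns maps a column name to its index path below a row node.
    """
    table = descend(result_root, table_path)
    table.set("data-role", "table")
    children = list(table)
    for row in children[skip_head:len(children) - skip_tail]:
        row.set("data-role", "row")
        for name, path in columns.items():
            cell = descend(row, path)
            cell.set("data-role", "element")
            cell.set("data-column", name)


def extract(result_root):
    """Read records back from the annotated nodes only."""
    records = []
    for row in result_root.iter():
        if row.get("data-role") == "row":
            records.append({c.get("data-column"): (c.text or "").strip()
                            for c in row.iter()
                            if c.get("data-role") == "element"})
    return records


result = ET.fromstring(
    "<html><body><h1>r</h1><div><p>2 hits</p>"
    "<div><a>Pen</a><span>2</span></div>"
    "<div><a>Ink</a><span>5</span></div><p>next</p></div></body></html>")
annotate(result, [0, 1], 1, 1, {"name": [0], "price": [1]})
records = extract(result)
```

Separating annotation from extraction leaves an annotated copy of the result page that can be inspected or reused, at the cost of a second pass over the tree.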
13. The HTML page structured data extraction method according to claim 12, characterized in that the finding of the table node on the search-result HTML page comprises:
Determining the path that leads from the table node in the template page DOM tree upward through successive parent nodes to the root node; and, starting from the root node of the template page DOM tree, finding for each node along this path in the reverse direction the matching node in the search-result page DOM tree, until the matching table node in the search-result page DOM tree is found;
The finding of the row nodes on the search-result HTML page comprises:
Taking, as the row nodes on the search-result HTML page, the direct child nodes of the table node on the search-result HTML page other than the first quantity of direct child nodes at the beginning and the second quantity of direct child nodes at the end;
The finding of the element nodes on the search-result HTML page comprises:
Determining the path that leads from the element node in the template page DOM tree upward through successive parent nodes to the row node; and, starting from the row node of the template page DOM tree, finding for each node along this path in the reverse direction the matching node in the search-result page DOM tree, until the matching element node in the search-result page DOM tree is found.
14. The HTML page structured data extraction method according to claim 13, characterized in that the finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree comprises: finding, in the search-result page DOM tree, the node that has a parent node matching that of the table node in the template HTML page and that is likewise the child node at the same position under the matched parent node, and taking this node as the matching node in the search-result page DOM tree.
15. The HTML page structured data extraction method according to claim 13, characterized in that the annotation of nodes in the template HTML page may further comprise the position path of a given node on the path from the table node in the template HTML page to the root node;
The finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree, until the matching table node in the search-result page DOM tree is found, comprises: according to the position path of the given node in the template HTML page, finding in the search-result page DOM tree the node having the same position path, and taking it as the node on the search-result HTML page that matches the given node.
16. The HTML page structured data extraction method according to claim 13, characterized in that the annotation of nodes in the template HTML page may further comprise the position path of a given node on the path from the element node in the template HTML page to the row node;
The finding, for each node along this path in the reverse direction, of the matching node in the search-result page DOM tree, until the matching element node in the search-result page DOM tree is found, comprises:
According to the position path of the given node in the template HTML page, finding in the search-result page DOM tree the node having the same position path, and taking it as the node on the search-result HTML page that matches the given node.
17. A hypertext markup language (HTML) page structured data extraction device, characterized by comprising:
A sending unit, configured to send a search request to a search engine or deep web;
An obtaining unit, configured to obtain the search-result HTML page that the search engine or deep web produces according to the search request;
An extraction unit, configured to extract structured data from the search-result HTML page according to a pre-stored template HTML page of the search engine or deep web, and according to the matching relationship between the template page DOM tree and the DOM tree of the search-result page obtained by the obtaining unit, wherein the template HTML page comprises manual parsing annotations.
CN 201010297636 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device Active CN102135976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010297636 CN102135976B (en) 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010297636 CN102135976B (en) 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device

Publications (2)

Publication Number Publication Date
CN102135976A true CN102135976A (en) 2011-07-27
CN102135976B CN102135976B (en) 2013-12-18

Family

ID=44295764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010297636 Active CN102135976B (en) 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device

Country Status (1)

Country Link
CN (1) CN102135976B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211336A (en) * 2006-12-29 2008-07-02 鸿富锦精密工业(深圳)有限公司 Visualized system and method for generating inquiry file
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN104598462A (en) * 2013-10-30 2015-05-06 深圳市国信互联科技有限公司 Method and device for extracting structural data
CN104598462B (en) * 2013-10-30 2018-08-07 深圳市国信互联科技有限公司 Extract the method and device of structural data
CN104112002A (en) * 2014-07-14 2014-10-22 福建星网锐捷网络有限公司 Form adaption method, device and system
CN104112002B (en) * 2014-07-14 2017-08-25 福建星网锐捷网络有限公司 A kind of methods, devices and systems of list adaptation
CN105138561A (en) * 2015-07-23 2015-12-09 中国测绘科学研究院 Deep web space data acquisition method and apparatus
CN105138561B (en) * 2015-07-23 2018-11-27 中国测绘科学研究院 A kind of darknet space data acquisition method and device
CN109086450B (en) * 2018-08-24 2021-08-27 电子科技大学 Web deep network query interface detection method
CN109086450A (en) * 2018-08-24 2018-12-25 电子科技大学 A kind of Web depth net query interface detection method
CN109558571A (en) * 2018-10-18 2019-04-02 深圳壹账通智能科技有限公司 File size recognition methods, device, computer equipment and storage medium
CN109784382A (en) * 2018-12-27 2019-05-21 广州华多网络科技有限公司 Markup information processing method, device and server
CN110555178A (en) * 2019-08-28 2019-12-10 贝壳技术有限公司 Data proxy method and device
CN110555178B (en) * 2019-08-28 2020-07-21 贝壳找房(北京)科技有限公司 Data proxy method and device
CN111026658A (en) * 2019-12-03 2020-04-17 北京小米移动软件有限公司 Debugging method, device and medium for fast application
CN111026658B (en) * 2019-12-03 2023-10-20 北京小米移动软件有限公司 Quick application debugging method, device and medium
CN112182310A (en) * 2020-11-04 2021-01-05 上海德拓信息技术股份有限公司 Method for realizing built-in real-time search universal tree-shaped component
CN112182310B (en) * 2020-11-04 2023-11-17 上海德拓信息技术股份有限公司 Method for realizing built-in real-time search general tree-shaped component

Also Published As

Publication number Publication date
CN102135976B (en) 2013-12-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant