CN102135976B - Hypertext markup language page structured data extraction method and device - Google Patents

Publication number
CN102135976B
Authority
CN
China
Prior art keywords
node
markup language
hypertext markup
page
language page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010297636
Other languages
Chinese (zh)
Other versions
CN102135976A (en)
Inventor
胡汉强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN 201010297636 priority Critical patent/CN102135976B/en
Publication of CN102135976A publication Critical patent/CN102135976A/en
Application granted granted Critical
Publication of CN102135976B publication Critical patent/CN102135976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to mobile searching, and discloses a hypertext markup language page structured data extraction method and a hypertext markup language page structured data extraction device. The hypertext markup language page structured data extraction method comprises the following steps of: transmitting a searching request to a search engine or a deep web; acquiring a search result hypertext markup language page obtained according to the searching request by the search engine or the deep web; and extracting structured data from the search result hypertext markup language page according to template hypertext markup language pages stored in advance by the search engine or the deep web and matching relationships between template document object model trees corresponding to the template hypertext markup language pages and search result page document object model trees, wherein the template hypertext markup language page comprises manual analytic annotations. By the method and the device, relatively more accurate search results can be obtained, and when an interface of a member search engine or the deep web is changed, corresponding modification is not required to be performed on an extraction wrapper of the member search engine or the deep web.

Description

Hypertext markup language page structured data extraction method and device
Technical field
The present invention relates to mobile search, and in particular to a method and device for extracting structured data from Hypertext Markup Language (HTML) pages.
Background
With the rapid development of mobile communication technology and search engine technology, mobile search, which combines two of the most active fields in today's information industry (search engines and mobile communication), has become a new highlight and growth point for mobile value-added services. A key technical feature of mobile search is precise search, that is, providing personalized search services so that users get exactly what they search for.
The mobile search framework is a platform based on meta-search. It integrates the capabilities of many specialized vertical search engines to provide users with a new, comprehensive search capability. A problem that remains to be solved is how to extract structured data automatically, efficiently, and accurately from the search result HTML pages of member search engines or deep web sources (SE), so that the structured data of each vertical search engine or deep web source can be integrated in a unified format and presented to the search client. Here, the deep web refers to Internet databases that are hidden behind search/query interfaces and cannot be crawled by the crawlers of general-purpose search engines.
One existing method for extracting structured data from HTML pages uses extraction rules (Extraction-Rule). Based on the extraction rule for each result element in the HTML pages of each member search engine or deep web source, this method builds a result extraction wrapper for each member search engine or deep web source. To extract an element from an HTML page, an extraction rule that locates the element is derived by analyzing the combination of the element's tag, attributes, and attribute values.
Although extraction rules can extract structured data from HTML pages, they have no unified representation, so the rules must be written into the extraction wrapper of each member search engine or deep web source. Consequently, when the interface of a member search engine or deep web source changes, its extraction wrapper must be changed accordingly. In addition, because an extraction rule uses only an element's tag, attributes, and attribute values, it often cannot uniquely locate the element to be extracted, which makes the search results insufficiently accurate and degrades the user experience.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting structured data from Hypertext Markup Language pages, which can make search results more accurate and, when the interface of a member search engine or deep web source changes, do not require a corresponding change to the extraction wrapper of that member search engine or deep web source.
An embodiment of the present invention provides a method for extracting structured data from Hypertext Markup Language pages, comprising:
sending a search request to a search engine or deep web source;
obtaining the search result Hypertext Markup Language page that the search engine or deep web source obtains according to the search request;
extracting structured data from the search result Hypertext Markup Language page according to a pre-saved template Hypertext Markup Language page of the search engine or deep web source and the matching relationship between the template page document object model tree corresponding to the template Hypertext Markup Language page and the search result page document object model tree, wherein the template Hypertext Markup Language page contains manual parsing annotations.
An embodiment of the present invention also provides a device for extracting structured data from Hypertext Markup Language pages, comprising:
a sending unit, configured to send a search request to a search engine or deep web source;
an acquiring unit, configured to obtain the search result Hypertext Markup Language page that the search engine or deep web source obtains according to the search request;
an extraction unit, configured to extract structured data from the search result Hypertext Markup Language page according to a pre-saved template Hypertext Markup Language page of the search engine or deep web source and the matching relationship between the template page document object model tree and the search result page document object model tree obtained by the acquiring unit, wherein the template Hypertext Markup Language page contains manual parsing annotations.
As can be seen from the technical solutions above, the search server in the embodiments of the present invention can use a template page DOM tree (Template DOM Tree) carrying manual parsing annotations to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all search engines or deep web sources and build a manually annotated Template DOM Tree for each deep web source, which is enough to extract structured data from all deep web sources automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web source changes, only the interface Template needs to be re-annotated to enable automatic extraction from the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the HTML page structured data extraction method provided by one embodiment of the present invention;
Fig. 2 is a schematic diagram of the manual parsing annotations of nodes in the template page DOM tree in one embodiment of the present invention;
Fig. 3 is a schematic diagram of table node matching between the template page DOM tree and the search result page DOM tree in one embodiment of the present invention;
Fig. 4 is a schematic diagram of row node matching between the template page DOM tree and the search result page DOM tree in one embodiment of the present invention;
Fig. 5 is a schematic diagram of element node matching between the template page DOM tree and the search result page DOM tree in one embodiment of the present invention;
Fig. 6 is a structural diagram of the HTML page structured data extraction device provided by one embodiment of the present invention;
Fig. 7 is a structural diagram of the HTML page structured data extraction device provided by another embodiment of the present invention;
Fig. 8 is a structural diagram of the HTML page structured data extraction device provided by another embodiment of the present invention.
Detailed description
The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The HTML page structured data extraction method provided by the embodiments of the present invention is introduced first. Fig. 1 shows the flow of the method provided by one embodiment, described from the perspective of the search server's processing. This embodiment comprises:
101. Send a search request to a search engine or deep web source.
In embodiments of the present invention, the search engine may be a member search engine in a meta-search framework or a search engine in another search framework. The embodiments are described mainly in terms of the meta-search framework; search servers in other search frameworks process requests in a similar way, so those cases are not described again.
The search server sends the search request to at least one search engine or deep web source connected to it, and the search engine or deep web source can then search according to this search request.
In one embodiment of the present invention, the search server may actively send the search request to the member search engine or deep web source. In another embodiment, the search server may first receive a search request from a search client and then forward the received request to the member search engine or deep web source.
102. Obtain the search result HTML page that the search engine or deep web source obtains according to the search request.
The search server may receive the search result HTML page sent in real time by the search engine or deep web source, or may fetch the search result HTML page from the search engine or deep web source in real time. The search result HTML page is obtained by the search engine or deep web source by searching according to the search request.
103. Extract structured data from the search result HTML page according to a pre-saved template HTML page of the search engine or deep web source and the matching relationship between the template Document Object Model (DOM) tree corresponding to the template HTML page and the search result page DOM tree, where the template HTML page contains manual parsing annotations.
The search server may save template (Template) HTML pages of the search engine or deep web source in advance, with the structured elements of these pages manually annotated, so the template HTML pages contain manual parsing annotations. A template HTML page is a sample search result HTML page obtained from the search engine or deep web source.
In one embodiment of the present invention, the manual parsing annotations in the template HTML page are of three kinds. The first is an annotation of the table node: the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted. The second is an annotation of the row node: a row node is a direct child of the table node, and the subtree rooted at a row node contains the data of one row of the structured table data to be extracted. The third is an annotation of the element node: an element node is a node in the subtree rooted at a row node and corresponds to an element to be extracted.
The annotation of the table node may comprise: marking the node as the table node; a first quantity, which is the number of leading direct children of the table node to skip when annotating row nodes; and a second quantity, which is the number of trailing direct children of the table node to skip when annotating row nodes. The annotation of a row node comprises: marking the node as a row node; the number of parts (part) into which the row of data corresponding to the row node is divided; and the sequence number of the part in which this row node lies. The annotation of an element node comprises: marking the node as an element node; the column name, within the structured table data to be extracted, of the data the node holds; the number of columns in the structured table; and the sequence number of the column in which the element node lies.
According to the template HTML page containing manual parsing annotations, the search server may extract structured data directly from the search result HTML page. Alternatively, it may first annotate the search result HTML page automatically according to the template HTML page and then extract structured data from the automatically annotated search result HTML page.
Whether or not the search result HTML page is automatically annotated, the table node, row nodes, and element nodes of the search result HTML page must be located in the search result page DOM tree according to the annotated template HTML page. Specifically, in one embodiment of the present invention, the search server may locate them as follows. First, using the annotation of the table node in the template HTML page and the matching relationship between the template page DOM tree and the search result page DOM tree, it locates the table node of the search result HTML page. Next, using the located table node, the annotations of the row nodes in the template HTML page, and the same matching relationship, it locates the row nodes of the search result HTML page. Finally, using the located row nodes, the annotations of the element nodes in the template HTML page, and the same matching relationship, it locates the element nodes of the search result HTML page.
If the search result HTML page does not need to be annotated automatically, then after locating the element nodes in the search result HTML page, the search server extracts structured data directly from the located element nodes. If the page does need to be annotated automatically, the search server annotates the table node after locating it, annotates the row nodes after locating them, and annotates the element nodes after locating them. It then looks up the element nodes in the search result HTML page in which the table node, row nodes, and element nodes have been annotated, and extracts structured data from the element nodes found.
In one embodiment of the present invention, the table node of the search result HTML page may be located as follows. Determine the path in the template page DOM tree from the table node upward through its parent nodes to the root node. Then, starting from the root node of the template page DOM tree and walking this path in the reverse direction, find for each node on the path the matching node in the search result page DOM tree, until the matching table node is found. One way to match each node on the path is as follows: a node in the search result page DOM tree matches the corresponding node in the template page DOM tree only if the parents of the two nodes match and the two nodes are children at the same position under their parents.
The row nodes of the search result HTML page may be located as follows: among the direct children of the table node in the search result HTML page, all direct children other than the first quantity of leading children and the second quantity of trailing children are taken as the row nodes.
The element nodes of the search result HTML page may be located as follows. Determine the path in the template page DOM tree from the element node upward through its parent nodes to the row node. Then, starting from the row node of the template page DOM tree and walking this path in the reverse direction, find for each node on the path the matching node in the search result page DOM tree, until the matching element node is found. The same matching rule can be used: a node in the search result page DOM tree matches the corresponding node in the template page DOM tree only if the parents of the two nodes match and the two nodes are children at the same position under their parents.
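The positional matching rule described here can be illustrated with a minimal Python sketch. It assumes well-formed markup that `xml.etree.ElementTree` can parse, and the helper names `index_path` and `follow_path` are hypothetical, not taken from the patent. The path from the template root to the annotated table node is recorded as a list of child indexes and then replayed on the search result tree.

```python
import xml.etree.ElementTree as ET

def index_path(root, target):
    """Return the child indexes leading from root down to target, or None."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = index_path(child, target)
        if sub is not None:
            return [i] + sub
    return None

def follow_path(root, path):
    """Replay a child-index path on another tree; None if a level is missing."""
    node = root
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node

# Hypothetical template page with a manually annotated table node.
template = ET.fromstring(
    '<html><body><div/><table Annotation_tag="Target_Table">'
    '<tr><td>T</td></tr></table></body></html>')
# Hypothetical search result page with the same layout but different rows.
result = ET.fromstring(
    '<html><body><div/><table><tr><td>A</td></tr>'
    '<tr><td>B</td></tr></table></body></html>')

tmpl_table = next(n for n in template.iter()
                  if n.get("Annotation_tag") == "Target_Table")
path = index_path(template, tmpl_table)
match = follow_path(result, path)
```

The same two helpers could be reused for element nodes by starting from a row node instead of the root.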
In another embodiment of the present invention, the annotations in the template HTML page may also include a location path for a node on the path from the table node to the root node. In this case, the search server may locate that node in the search result page DOM tree as follows: according to the location path of the node in the template HTML page, it finds the node in the search result page DOM tree that has the same location path and takes it as the matching node in the search result HTML page.
In another embodiment of the present invention, the annotations in the template HTML page may also include a location path for a node on the path from an element node to the row node. In this case, the search server may locate that node in the search result page DOM tree in the same way: according to the location path of the node in the template HTML page, it finds the node in the search result page DOM tree that has the same location path and takes it as the matching node in the search result HTML page.
The structured data may specifically be structured table data, that is, data in tabular form.
In one embodiment of the present invention, if the search server forwards the search request to the member search engine or deep web source after receiving it from a search client, then after extracting structured data from the search result HTML pages the search server may aggregate the extracted structured data and return it to the search client in a unified format.
In another embodiment of the present invention, the search server actively sends search requests to the search engine or deep web source. After extracting structured data from the search result HTML pages, the search server may perform statistical analysis on the extracted data, for example analyzing the real-estate trend of a particular property development. If the search server later receives a search request from a search client, it can send the already extracted structured data directly to the search client, which speeds up delivery of search results to the client.
As can be seen from the above, in this embodiment the search server can use a Template DOM Tree carrying manual parsing annotations to extract structured data automatically from search result HTML pages. The search server can therefore build a single, unified, general-purpose Wrapper for all search engines or deep web sources and build a manually annotated Template DOM Tree for each deep web source, which is enough to extract structured data from all deep web sources automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web source changes, only the interface Template needs to be re-annotated to enable automatic extraction from the new interface; the code of the general-purpose Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
The HTML page structured data extraction method provided by the embodiments of the present invention is described below with a specific example. This example also describes the processing flow of the search server, which here is a search server in a meta-search framework.
First, the template HTML page is manually annotated on the search server.
A template (Template) HTML page of the member search engine or deep web source is obtained, and the structured data elements of the template page DOM tree (Template DOM Tree) are manually annotated. The annotations may be added as attributes (Attributes) of the nodes (Element) of the Template DOM Tree. In one embodiment of the present invention, the manual parsing annotations may be of the following three kinds.
The first kind is the annotation of the table node (Target_table). The subtree rooted at this node is the smallest subtree of the Template DOM Tree that contains all the structured data to be extracted; all structured data to be extracted is looked up within this subtree.
In one embodiment of the present invention, the annotation of the table node may be as follows:
Annotation_tag="Target_Table": structured data is to be extracted from the subtree of this node.
Annotation_begin_skip_children_count=n (n is the first quantity, a positive integer or 0): when Target_Row nodes are annotated, the first n direct children of the Target_Table node are skipped, and annotation of Target_Row nodes starts from the (n+1)-th direct child of the Target_Table node, because in some cases the first few direct children of the Target_Table node do not contain structured data to be extracted.
Annotation_end_skip_children_count=m (m is the second quantity, a positive integer or 0): when Target_Row nodes are annotated, the last m direct children of the Target_Table node are skipped, that is, they are not annotated as Target_Row nodes, because in some cases the last few direct children of the Target_Table node do not contain structured data to be extracted.
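The effect of the two skip counts can be shown with a short Python sketch. The function name `select_row_nodes` and the sample child labels are illustrative assumptions; only the skip semantics come from the text above.

```python
def select_row_nodes(table_children, begin_skip, end_skip):
    """Drop the first begin_skip and the last end_skip direct children.

    The remaining children are the candidates for Target_Row nodes.
    """
    end = len(table_children) - end_skip
    return table_children[begin_skip:end]

# Hypothetical direct children of a table node: a header row, two data
# rows, and a pager row at the end.
children = ["header", "row-1", "row-2", "pager"]
rows = select_row_nodes(children, begin_skip=1, end_skip=1)
everything = select_row_nodes(children, begin_skip=0, end_skip=0)
```

Computing the end index explicitly avoids the pitfall of slicing with `-0`, which would return an empty list when no trailing children are skipped.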
The second kind is the annotation of the row node (Target_row), which marks the row structure of the table to be extracted. A Target_Row may have multiple parts (Parts), and repeated rows need to be annotated only once.
In one embodiment of the present invention, the annotation of the row node may be as follows:
Annotation_tag="Target_Row": this node is a direct child of the Target_Table node, and its subtree contains the data of one row of the structured table data to be extracted.
Annotation_row_total_part_num=n (n is an integer, n>=1): one row of data is divided into n parts, and these n parts are spread across the subtrees of n consecutive Target_Row nodes.
Annotation_row_part_index=m (1<=m<=n): the sequence number m of the part in which the current Target_Row node lies.
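A minimal Python sketch of how consecutive row nodes might be regrouped into logical rows when one row of data spans n parts. The function name `group_row_parts` is hypothetical, and discarding an incomplete trailing group is an assumption of this sketch, not something the text specifies.

```python
def group_row_parts(row_nodes, total_parts):
    """Group consecutive row nodes into logical rows of total_parts parts each.

    Assumption: a trailing group with fewer than total_parts nodes is dropped.
    """
    usable = len(row_nodes) - len(row_nodes) % total_parts
    return [row_nodes[i:i + total_parts] for i in range(0, usable, total_parts)]

# Hypothetical row nodes where each logical row is split over two nodes,
# as with Annotation_row_total_part_num=2.
logical_rows = group_row_parts(["a-part1", "a-part2", "b-part1", "b-part2"], 2)
with_leftover = group_row_parts(["a-part1", "a-part2", "b-part1"], 2)
```

Within each group, the position of a node corresponds to its part sequence number minus one.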
The third kind is the annotation of the element node (Target_Field), which marks the specific elements to be extracted in each Part of each row.
In one embodiment of the present invention, the annotation of the element node may be as follows:
Annotation_tag="Field": this node is a node in the subtree of a Target_Row node and corresponds to a specific piece of structured data to be extracted.
Annotation_field_name=the column name corresponding to the specific structured data to be extracted, for example "Title", "Author", "Abstract", or "url".
Annotation_total_num_of_field=n (n is an integer greater than or equal to 1): the total number of columns in the structured table.
Annotation_field_index=m (1<=m<=n): the sequence number m of the column to which the element represented by the current node belongs.
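To make the role of the Field annotations concrete, the following Python sketch collects the annotated element nodes of a template row and reads the nodes at the same positions in a result row. It assumes well-formed markup parseable by `xml.etree.ElementTree`; the helper names `field_paths`, `follow`, and `extract_record` and the sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET

def field_paths(template_row):
    """Map each annotated field name to its child-index path below the row node."""
    paths = {}
    def walk(node, path):
        if node.get("Annotation_tag") == "Field":
            paths[node.get("Annotation_field_name")] = path
        for i, child in enumerate(node):
            walk(child, path + [i])
    walk(template_row, [])
    return paths

def follow(node, path):
    """Follow a child-index path; return None if the structure differs."""
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node

def extract_record(result_row, paths):
    """Build one record (column name -> text) from a result row."""
    record = {}
    for name, path in paths.items():
        node = follow(result_row, path)
        record[name] = "".join(node.itertext()).strip() if node is not None else None
    return record

template_row = ET.fromstring(
    '<tr Annotation_tag="Target_Row">'
    '<td><a Annotation_tag="Field" Annotation_field_name="Title">t</a></td>'
    '<td Annotation_tag="Field" Annotation_field_name="Author">a</td></tr>')
result_row = ET.fromstring('<tr><td><a>Deep Web Mining</a></td><td>Hu</td></tr>')

paths = field_paths(template_row)
record = extract_record(result_row, paths)
```

Records produced this way share the same column names regardless of which source they came from, which is what allows results to be returned in a unified format.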
In another embodiment of the present invention, can also carry out to the sample page dom tree mark of optional node (Optional_Node), for mark the Target_table node upward on the path of root node or the Target_Field node upward to the path of Target_Row node, the optional brotgher of node in node layer front, path, optional node can be specifically optional ad content or list structure optional image node in capable etc.
In one embodiment of the present of invention, the parsing of optional node node mark can be as follows:
Annotation_tag=“Optional_Node”。
In another embodiment of the present invention, can also carry out to the sample page dom tree mark of location paths (Annotation_LocationPath), mean that the Target_table node is upward on the path of root node or on the path of Target_Field node upward to the Target_Row node, in the coupling location paths of path node layer, location paths specifically can mean by the expression formula of the LocationPath of definition in standard x ML Path Language (XPath) Version 1.0 of W3C.
For example, in one embodiment of the present of invention, the parsing mark of certain layer of path node location paths can be: Annotation_LocationPath=child::TABLE[attribute::class=" value1 "], mean that selecting element tags (Element Tag) is " TABLE ", comprise attribute (attribute) for " class ", the child node that property value is " value1 ", this certain layer path node can be mated with the element tags of node or attribute or property value.
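As an illustration, the single-step location path in this example can be evaluated with a small Python sketch. It deliberately handles only the form `child::TAG[attribute::name="value"]` through a regular expression rather than implementing XPath 1.0; the function name `match_children` is hypothetical, and comparing tag names case-insensitively is an assumption made because HTML tag names are not case-sensitive.

```python
import re
import xml.etree.ElementTree as ET

# Accepts only one step: child::TAG, optionally with [attribute::name="value"].
_STEP = re.compile(r'^child::(\w+)(?:\[attribute::(\w+)="([^"]*)"\])?$')

def match_children(parent, location_path):
    """Return the direct children of parent selected by a one-step location path."""
    m = _STEP.match(location_path.strip())
    if m is None:
        raise ValueError("unsupported location path: " + location_path)
    tag, attr, value = m.groups()
    return [child for child in parent
            if child.tag.lower() == tag.lower()
            and (attr is None or child.get(attr) == value)]

# Hypothetical result-page fragment with an advertisement table first.
parent = ET.fromstring(
    '<body><table class="ad"/><table class="value1"/><div class="value1"/></body>')
hits = match_children(parent, 'child::TABLE[attribute::class="value1"]')
all_tables = match_children(parent, 'child::TABLE')
```

Matching by tag and attribute value rather than by position is what lets such a path node be found even when unrelated siblings are inserted before it.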
Fig. 2 illustrates the manual parsing annotations of nodes in the template page DOM tree in one embodiment of the present invention. As shown in Fig. 2, the annotations comprise the annotation of table node (3), the annotations of row nodes (4) and (5), the annotations of element nodes (6), (7), (8), and (9), the annotation of optional node (2), and the location path annotations of nodes (1) and (9). The annotations shown in Fig. 2 are as follows:
Node (1), location path: Annotation_LocationPath=child::TABLE[attribute::class="value1"]
Optional node (2): Annotation_tag="Optional_Node"
Table node (3): Annotation_tag="Target_table", Annotation_begin_skip_children_count=1, Annotation_end_skip_children_count=1
Row node (4): Annotation_tag="Target_row", Annotation_row_total_part_num=2, Annotation_row_part_index=1
Row node (5): Annotation_tag="Target_row", Annotation_row_total_part_num=2, Annotation_row_part_index=2
Element node (6): Annotation_tag="Field", Annotation_field_name="Title", Annotation_total_num_of_fields=4, Annotation_field_index=1
Element node (7): Annotation_tag="Field", Annotation_field_name="Author", Annotation_total_num_of_fields=4, Annotation_field_index=2
Element node (8): Annotation_tag="Field", Annotation_field_name="Abstract", Annotation_total_num_of_fields=4, Annotation_field_index=3
Element node (9): Annotation_tag="Field", Annotation_field_name="url", Annotation_total_num_of_fields=4, Annotation_field_index=4, Annotation_LocationPath=child::TD[attribute::class="value2"]
Then, after receiving a search request from a search client, the search server forwards it to the member search engine or deep web source and obtains the search result HTML page that the member search engine or deep web source produces by searching according to the request.
Next, the search server uses the template page DOM tree to annotate the search result page DOM tree automatically.
I) Match the Target_table node.
In one embodiment of the present invention, the Target_table node is matched as follows.
1. Determine the path in the Template DOM Tree from the Target_table node upward through its parent (parent) nodes to the root node.
2. Starting from the root node of the Template DOM Tree and following the path above in the reverse direction, find for each child node along the path the matching node (corresponding node) in the Search Result Page DOM Tree.
In one embodiment of the invention, a matching node can be found as follows.
1) If the node at some level of the above path in the Template DOM Tree carries an Annotation_LocationPath=LocationPath expression, the node at that level is matched by the LocationPath expression. For example, a node in Fig. 2 carries the annotation Annotation_LocationPath=child::Div[attribute::class="value1"]. This means: at the corresponding path level of the Search Result Page DOM Tree, select as the corresponding (matching) node the child node whose element tag is "Div" and which has an attribute "class" with the value "value1". Nodes at this path level are thus matched by element tag, attribute, or attribute value.
If the node at some level of the above path in the Template DOM Tree carries no Annotation_LocationPath=LocationPath expression, the corresponding node is the one whose parent is the matched parent node and that, like the template node, is the i-th child of that parent. If, in the Template DOM Tree, the preceding siblings of the node at this path level include an optional node, check whether the corresponding path level of the Search Result Page DOM Tree contains that optional node, for instance by comparing the number of sibling nodes at the corresponding path level of the two DOM trees. If the corresponding path level of the Search Result Page DOM Tree does not contain the optional node, the index of the corresponding node in the Search Result Page DOM Tree moves forward by one, so the matching node is the (i-1)-th child of the corresponding parent node.
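The per-level matching rule and the walk along the path can be sketched as follows. This is a minimal Python illustration, not the patent's implementation: the `Node` class, `match_child` and `match_path` are hypothetical names, the LocationPath annotation is assumed to be stored already parsed as a (tag, attribute, value) tuple, and the presence of the optional node is tested with the sibling-count comparison suggested in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def match_child(t_parent: Node, t_child: Node, r_parent: Node) -> Optional[Node]:
    """Find the child of r_parent (result tree) that corresponds to t_child."""
    loc = t_child.annotations.get("Annotation_LocationPath")
    if loc is not None:
        # Assumption: the annotation is pre-parsed as (tag, attribute, value).
        tag, attr, value = loc
        for cand in r_parent.children:
            if cand.tag.lower() == tag.lower() and cand.attrs.get(attr) == value:
                return cand
        return None

    # No location path: match by child position (0-based index here).
    i = next(k for k, c in enumerate(t_parent.children) if c is t_child)
    optional_before = any(
        c.annotations.get("Annotation_tag") == "Optional_Node"
        for c in t_parent.children[:i]
    )
    # If an optional sibling precedes the node and the result level has fewer
    # siblings, treat the optional node as absent and shift the index by one.
    if optional_before and len(r_parent.children) < len(t_parent.children):
        i -= 1
    return r_parent.children[i] if 0 <= i < len(r_parent.children) else None


def match_path(t_path: list, r_root: Node) -> Optional[Node]:
    """t_path lists template nodes from the root down to the target node."""
    r_node = r_root  # the two roots are taken to match each other
    for t_parent, t_child in zip(t_path, t_path[1:]):
        r_node = match_child(t_parent, t_child, r_node)
        if r_node is None:
            return None
    return r_node
```

The sketch shifts the index by exactly one, as in the text; templates with several optional siblings at one level would need a more careful rule.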
Fig. 3 illustrates the template page DOM tree and the search result page DOM tree during table node matching in one embodiment of the invention. As shown in Fig. 3, the table node matching flow comprises:
1. In the Template DOM Tree, find the path from the table node upward through its parent nodes to the root node: (5)->(4)->(3)->(2)->(1).
2. Starting from the root node of the Template DOM Tree, follow the path in the reverse direction, (1)->(2)->(3)->(4)->(5), and for each node on it find the corresponding (matching) node in the Search Result Page DOM Tree: (1')->(2')->(3')->(4')->(5').
The matching nodes are found as follows:
(a) Starting at the root, node (1) matches node (1').
(b) Node (2) and node (2') have the mutually matched parent nodes (1) and (1'), and both are the 2nd child of their parent.
(c) Because node (3) carries the annotation Annotation_LocationPath=child::Div[attribute::class="value1"], the node at this level is matched by the LocationPath expression (child::Div[attribute::class="value1"] means that the node's tag is "Div" and that it has a "class" attribute whose value is "value1"), which yields the matching node (3').
(d) Because an optional node (annotated Annotation_tag="Optional_Node") precedes node (4), the matching node (4') is found as follows. First, node (4) and node (4') have the mutually matched parent nodes (3) and (3'). Then check whether the corresponding path level of the Search Result Page DOM Tree (the level of node (4')) contains the optional node, for instance by comparing the number of sibling nodes at the corresponding path level of the two DOM trees. If it does, node (4') and node (4) are both the j-th child of their parents (3') and (3); if it does not, node (4') is the (j-1)-th child of parent node (3').
(e) Node (5) and node (5') have the mutually matched parent nodes (4) and (4'), and both are the k-th child of their parent.
3. Continue until the matching node (5') in the Search Result Page DOM Tree is found for the table node (5) of the Template DOM Tree. Automatically annotate this matching node (5') with annotation_tag="Target_Table", and copy to it the corresponding Annotation_begin_skip_children_count=n (n is a positive integer or 0) and Annotation_end_skip_children_count=m (m is a positive integer or 0) annotations.
Two) Match the Target_Row nodes.
1. Among the direct child nodes one level below the Target_Table node, first skip the first n and the last m direct child nodes, according to the Annotation_begin_skip_children_count=n and Annotation_end_skip_children_count=m annotations of the Target_Table node.
2. Annotate all remaining direct child nodes as Target_Row nodes. According to the annotation_row_total_part_num=n of the Target_Row node in the Template DOM Tree (here n is the number of parts per row), annotate the parts of one row consecutively, that is, annotate n consecutive Target_Row nodes and copy to each node its corresponding annotation_row_total_part_num=n and annotation_row_part_index=m annotations. Once the parts of one row are annotated, continue with the n consecutive parts of the next row (the next n consecutive Target_Row nodes), until the parts of all Target_Row nodes are annotated.
Fig. 4 illustrates the template page DOM tree and the search result page DOM tree during row node matching in one embodiment of the invention. As shown in Fig. 4, the row node matching flow comprises:
1. For the direct child nodes (2'), (3'), (4'), (3''), (4''), (5') one level below the table node (1') of the Search Result Page DOM Tree, skip the first direct child node (2') and the last direct child node (5'), according to Annotation_begin_skip_children_count=1 and Annotation_end_skip_children_count=1 of the table node (1').
2. Annotate all remaining direct child nodes (3'), (4'), (3''), (4'') as row nodes. According to the annotation annotation_row_total_part_num=2 (one row comprises 2 parts) of the row nodes (2) and (3) in the Template DOM Tree, annotate the parts of each row of the Search Result Page DOM Tree consecutively, that is, annotate the 2 consecutive row nodes (3') and (4') and copy to each node its corresponding annotation_row_total_part_num=2 and annotation_row_part_index=1 or 2 annotations. Once the parts of one row are annotated, continue with the 2 consecutive parts of the next row (the next 2 consecutive row nodes (3'') and (4'')), until the parts of all row nodes are annotated.
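The row annotation step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the patent's code: `label_rows` is a hypothetical helper, child nodes are represented by plain strings, and the row and part indices are 1-based as in the annotations. The example input mirrors the Fig. 4 walk-through (skip one child at each end, two parts per row).

```python
def label_rows(children, begin_skip, end_skip, total_parts):
    """Return (row_index, part_index, child) for every row-part child.

    begin_skip / end_skip play the role of
    Annotation_begin_skip_children_count / Annotation_end_skip_children_count;
    total_parts plays the role of annotation_row_total_part_num.
    """
    end = len(children) - end_skip          # avoids the children[a:-0] pitfall
    candidates = children[begin_skip:end]
    labeled = []
    for pos, child in enumerate(candidates):
        row_index = pos // total_parts + 1   # which data row this part belongs to
        part_index = pos % total_parts + 1   # annotation_row_part_index
        labeled.append((row_index, part_index, child))
    return labeled


# Direct children of table node (1') in the Fig. 4 example.
children = ["2'", "3'", "4'", "3''", "4''", "5'"]
rows = label_rows(children, begin_skip=1, end_skip=1, total_parts=2)
for row in rows:
    print(row)
```

Nodes (3') and (4') become parts 1 and 2 of the first row, and (3'') and (4'') become parts 1 and 2 of the second row.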
Three) Match the Target_Field nodes.
1. In the Template DOM Tree, find the path from a given Target_Field node upward through its parent nodes to the Target_Row node;
2. Starting from the Target_Row node of the Template DOM Tree, follow that path downward and, for each child node on it, find the matching node in the Search Result Page DOM Tree.
In one embodiment of the invention, a matching node can be found as follows.
1) If the node at some level of the above path in the Template DOM Tree carries an Annotation_LocationPath=LocationPath expression, the node at that level is matched by the LocationPath expression. For example, a node carries the annotation Annotation_LocationPath=child::Td[attribute::class="value2"]. This means: at the corresponding path level of the Search Result Page DOM Tree, select as the corresponding (matching) node the child node whose element tag is "Td" and which has an attribute "class" with the value "value2". Nodes at this path level are thus matched by element tag, attribute, or attribute value.
If the node at some level of the above path in the Template DOM Tree carries no Annotation_LocationPath=LocationPath expression, the corresponding node is the one whose parent is the matched parent node and that, like the template node, is the i-th child of that parent. If, in the Template DOM Tree, the preceding siblings of the node at this path level include an optional node, check whether the corresponding path level of the Search Result Page DOM Tree contains that optional node, for instance by comparing the number of sibling nodes at the corresponding path level of the two DOM trees. If the corresponding path level of the Search Result Page DOM Tree does not contain the optional node, the index of the corresponding node in the Search Result Page DOM Tree moves forward by one, so the matching node is the (i-1)-th child of the corresponding parent node.
Fig. 4 illustrates the template page DOM tree and the search result page DOM tree during field node matching in one embodiment of the invention. As shown in Fig. 4, the field node matching flow comprises:
A. First, match the field nodes in the subtrees below the row part 1 nodes (1) and (1'):
1. In the Template DOM Tree, find the path from a field node (4) upward through its parent nodes to the row node (1): (4)->(3)->(2)->(1).
2. Starting from the row node (1) of the Template DOM Tree, follow the path in the reverse direction, (1)->(2)->(3)->(4), and for each child node on it find the matching node in the Search Result Page DOM Tree: (1')->(2')->(3')->(4').
The matching nodes are found as follows:
(a) Row node (1) matches row node (1').
(b) Node (2) and node (2') have the mutually matched parent nodes (1) and (1'), and both are the i-th child of their parent.
(c) Because an optional node (annotated Annotation_tag="Optional_Node") precedes node (3), the matching node (3') is found as follows. First, node (3) and node (3') have the mutually matched parent nodes (2) and (2'). Then check whether the corresponding path level of the Search Result Page DOM Tree (the level of node (3')) contains the optional node, for instance by comparing the number of sibling nodes at the corresponding path level of the two DOM trees. If it does, node (3') and node (3) are both the j-th child of their parents (2') and (2); if it does not, node (3') is the (j-1)-th child of parent node (2').
(d) Node (4) and node (4') have the mutually matched parent nodes (3) and (3'), and both are the k-th child of their parent.
3. Continue until the matching node (4') in the Search Result Page DOM Tree is found for the field node (4) of the Template DOM Tree. Automatically annotate this matching node (4') with annotation_tag="Target_Field", and copy to it the corresponding Annotation_field_name='Title', Annotation_total_num_of_field=4 and Annotation_field_index=1 annotations. Alternatively, once the matching Target_Field node of the Search Result Page has been found, extract the structured data directly from that node.
4. Repeat steps 1, 2 and 3 for another field node (5) of the Template DOM Tree until its matching node (5') in the Search Result Page DOM Tree is found.
B. Next, match the Target_Field nodes in the subtrees below the Target_Row Part2 nodes (6) and (6'),
until the matching nodes (9') and (10') in the Search Result Page DOM Tree are found for nodes (9) and (10) of the Template DOM Tree.
The matching method is essentially the same as steps 1, 2, 3 and 4 in A and is not repeated here. Node (10') is matched with node (10) as follows: because node (10) carries the annotation Annotation_LocationPath=child::Td[attribute::class="value2"], the node at this level is matched by the LocationPath expression (child::Td[attribute::class="value2"] means that the node's tag is "Td" and that it has a "class" attribute whose value is "value2"), which yields the matching node (10').
2) After the matching node in the Search Result Page DOM Tree has been found for a Target_Field node of the Template DOM Tree, automatically annotate the matching node with annotation_tag="Target_Field", and copy to it the annotations of the corresponding Target_Field node in the Template DOM Tree: Annotation_field_name (set to the column name of the structured data to be extracted), Annotation_total_num_of_field=n and Annotation_field_index=m. Further, if for some row a Target_Field node of the Template DOM Tree has no matching node in the Search Result Page DOM Tree, all Target_Field data of that row are discarded.
3) Repeat steps 1) and 2) to annotate every Field in one part of a row.
4) Repeat steps 1), 2) and 3) to annotate all Fields in all parts of a row.
5) Repeat steps 1), 2), 3) and 4) to annotate all Fields of all rows.
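The rule that a row is discarded when any of its fields has no match can be sketched as below. The helper `collect_row` and the dictionary representation of matched fields are assumptions made for this Python illustration; the patent does not prescribe a data structure.

```python
def collect_row(field_names, matched):
    """Assemble one record, or return None if any template field is unmatched.

    field_names: column names from the template (Annotation_field_name values).
    matched: maps a column name to the text taken from its matching node;
             an absent key or a None value means no matching node was found.
    """
    row = {}
    for name in field_names:
        value = matched.get(name)
        if value is None:
            return None  # one missing field discards the whole row
        row[name] = value
    return row


complete = collect_row(["Title", "url"], {"Title": "T", "url": "u"})
incomplete = collect_row(["Title", "url"], {"Title": "T"})
print(complete, incomplete)
```

Discarding the whole row keeps partially matched rows, such as advertisements or separators inside the result list, out of the extracted data.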
Four) Automatically extract structured data from the Search Result Page to which automatic parsing annotations have been added. This can comprise the following steps.
1. Find the node whose parsing annotation is annotation_tag="Target_table".
2. Find the nodes whose parsing annotation is annotation_tag="Target_row", locating the parts of all rows.
3. Find the nodes whose parsing annotation is annotation_tag="Target_Field", and extract the structured data from the elements of these nodes.
4. Repeat step 3 for each Target_row until the structured data of all rows has been extracted.
At this point, the extraction of structured data from the search result HTML page is complete.
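The extraction pass over an annotated search result tree can be sketched as follows. This is a minimal Python illustration under stated assumptions: nodes are plain dictionaries with optional "annotations", "children" and "text" keys, the function names are hypothetical, and a new record is started whenever a row part with Annotation_row_part_index equal to 1 is encountered.

```python
def iter_descendants(node):
    """Yield every descendant of node in document order."""
    for child in node.get("children", []):
        yield child
        yield from iter_descendants(child)


def extract_records(table):
    """Collect one dict per data row from an annotated Target_table node."""
    records, current = [], None
    for part in table.get("children", []):
        ann = part.get("annotations", {})
        if ann.get("Annotation_tag") != "Target_row":
            continue  # skipped leading/trailing children carry no row annotation
        if ann.get("Annotation_row_part_index") == 1:
            current = {}
            records.append(current)
        if current is None:
            continue  # a later part without a first part is ignored
        for node in iter_descendants(part):
            f = node.get("annotations", {})
            if f.get("Annotation_tag") == "Target_Field":
                current[f["Annotation_field_name"]] = node.get("text", "")
    return records


def _field(name, text):
    return {"annotations": {"Annotation_tag": "Target_Field",
                            "Annotation_field_name": name}, "text": text}


sample = {
    "annotations": {"Annotation_tag": "Target_table"},
    "children": [
        {"text": "header"},
        {"annotations": {"Annotation_tag": "Target_row",
                         "Annotation_row_part_index": 1},
         "children": [{"children": [_field("Title", "T1")]}]},
        {"annotations": {"Annotation_tag": "Target_row",
                         "Annotation_row_part_index": 2},
         "children": [_field("url", "u1")]},
        {"text": "footer"},
    ],
}
records = extract_records(sample)
print(records)
```

The two row parts of the sample are merged into a single record because they belong to the same data row.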
In another embodiment of the invention, once the matching Target_Field node of the Search Result Page has been found, the matching node need not be automatically annotated; the structured data can instead be extracted directly from it.
As can be seen from the above, in this embodiment the search server can use the manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build one unified, general Wrapper for all search engines or deep web sites and a manually annotated Template DOM Tree for each deep web site; this is sufficient to extract the structured data of all deep web sites automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to extract automatically from the new interface. The code of the general Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the described order of actions, because according to the invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
The following describes the HTML page structured data extraction device provided by embodiments of the invention; this device can serve as a search server.
Fig. 6 shows the structure of the HTML page structured data extraction device provided by one embodiment of the invention, comprising:
A sending unit 601, configured to send a search request to a search engine or deep web site.
An acquiring unit 602, configured to obtain the search result HTML page that the search engine or deep web site produces for the search request sent by the sending unit 601.
An extraction unit 603, configured to extract structured data from the search result HTML page according to the pre-saved template HTML page of the search engine or deep web site and the matching relationship between the template page DOM tree and the search result page DOM tree obtained by the acquiring unit 602; the template HTML page comprises manual parsing annotations.
As shown in Fig. 6, the HTML page structured data extraction device provided by another embodiment of the invention may further comprise a receiving unit 604, configured to receive a search request from a search client. In this case, the sending unit 601 forwards the search request to the search engine or deep web site only after the receiving unit 604 has received it from the search client. Further, after the extraction unit 603 has extracted structured data from the search result HTML pages, the sending unit 601 aggregates the extracted structured data and returns it to the search client in a unified format.
As shown in Fig. 6, the HTML page structured data extraction device provided by yet another embodiment of the invention may further comprise a receiving unit 604, which receives the search request from the search client after the extraction unit 603 has extracted structured data from the search result HTML pages. In this case, after the receiving unit 604 has received the search request from the search client, the sending unit 601 further aggregates the structured data extracted by the extraction unit 603 and returns it to the search client in a unified format.
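The cooperation of the receiving, sending, acquiring and extraction units can be pictured with the following Python sketch. It is only an illustration: `handle_search`, the `fetchers` and `extractors` mappings, and the added "source" key are assumptions made for this example, and real fetching or template matching is replaced by injected callables.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def handle_search(
    query: str,
    fetchers: Dict[str, Callable[[str], str]],
    extractors: Dict[str, Callable[[str], List[Record]]],
) -> List[Record]:
    """Forward a query to each source, extract records, merge them.

    fetchers: per-source callable that sends the query and returns the
              search result HTML page (sending and acquiring units).
    extractors: per-source callable that applies that source's annotated
                template to the page (extraction unit).
    """
    results: List[Record] = []
    for source, fetch in fetchers.items():
        html = fetch(query)
        for record in extractors[source](html):
            # A unified format: every record carries its source plus fields.
            results.append({"source": source, **record})
    return results


# Stub source used only to exercise the flow.
demo = handle_search(
    "dom tree",
    {"engineA": lambda q: f"<html>{q}</html>"},
    {"engineA": lambda html: [{"Title": html}]},
)
print(demo)
```

Because each source only contributes a fetch callable and an extractor built from its template, adding or updating a source does not change `handle_search` itself, which mirrors the general Wrapper idea in the text.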
As can be seen from the above, in this embodiment the search server can use the manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build one unified, general Wrapper for all search engines or deep web sites and a manually annotated Template DOM Tree for each deep web site; this is sufficient to extract the structured data of all deep web sites automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to extract automatically from the new interface. The code of the general Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
Fig. 7 shows the structure of the HTML page structured data extraction device provided by another embodiment of the invention, comprising:
A receiving unit 701, configured to receive a search request from a search client.
An acquiring unit 702, configured to obtain the search result HTML page that the search engine or deep web site produces for the search request sent by the sending unit 704.
An extraction unit 703, configured to extract structured data from the search result HTML page according to the pre-saved template HTML page of the search engine or deep web site and the matching relationship between the template page DOM tree and the search result page DOM tree obtained by the acquiring unit 702. The template HTML page comprises manual parsing annotations, which comprise: an annotation of the table node in the template HTML page, where the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted; an annotation of the row nodes in the template HTML page, where a row node is a direct child node of the table node and the subtree rooted at a row node contains the data of one row of the tabular structured data to be extracted; and an annotation of the field nodes in the template HTML page, where a field node is a node in the subtree rooted at a row node and corresponds to an element to be extracted.
A sending unit 704, configured to forward the search request received by the receiving unit 701 to the search engine or deep web site, and to aggregate the structured data extracted by the extraction unit 703 and return it to the search client in a unified format.
As shown in Fig. 7, the extraction unit 703 can specifically comprise:
A first searching unit 7031, configured to find the table node in the search result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of the table node in the template HTML page comprises: an annotation marking the node as the table node; an annotation indicating that the number of leading direct child nodes of the table node to be skipped when annotating row nodes in the template HTML page is a first quantity; and an annotation indicating that the number of trailing direct child nodes of the table node to be skipped when annotating row nodes in the template HTML page is a second quantity.
A second searching unit 7032, configured to find the row nodes in the search result HTML page, using the table node in the search result HTML page found by the first searching unit 7031, the annotation of the row nodes in the template HTML page, and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of a row node in the template HTML page comprises: an annotation marking the node as a row node; an annotation indicating the number of parts into which the row of data corresponding to the row node is divided; and an annotation indicating the sequence number of the part that the row node occupies in the template HTML page.
A third searching unit 7033, configured to find the field nodes in the search result HTML page, using the row nodes in the search result HTML page found by the second searching unit 7032, the annotation of the field nodes in the template HTML page, and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of a field node in the template HTML page comprises: an annotation marking the node as a field node; an annotation giving the column name, within the tabular structured data, of the structured data to be extracted; an annotation giving the number of columns in the tabular structure; and an annotation giving the sequence number of the column to which the field node belongs.
A data extraction unit 7034, configured to extract structured data directly from the field nodes in the search result HTML page found by the third searching unit 7033.
As can be seen from the above, in this embodiment the search server can use the manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build one unified, general Wrapper for all search engines or deep web sites and a manually annotated Template DOM Tree for each deep web site; this is sufficient to extract the structured data of all deep web sites automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to extract automatically from the new interface. The code of the general Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
Fig. 8 shows the structure of the HTML page structured data extraction device provided by another embodiment of the invention, comprising:
A receiving unit 801, configured to receive a search request from a search client.
An acquiring unit 802, configured to obtain the search result HTML page that the search engine or deep web site produces for the search request sent by the sending unit 804.
An extraction unit 803, configured to extract structured data from the search result HTML page according to the pre-saved template HTML page of the search engine or deep web site and the matching relationship between the template page DOM tree and the search result page DOM tree obtained by the acquiring unit 802. The template HTML page comprises manual parsing annotations, which comprise: an annotation of the table node in the template HTML page, where the subtree rooted at the table node is the smallest subtree of the template page DOM tree that contains all the structured data to be extracted; an annotation of the row nodes in the template HTML page, where a row node is a direct child node of the table node and the subtree rooted at a row node contains the data of one row of the tabular structured data to be extracted; and an annotation of the field nodes in the template HTML page, where a field node is a node in the subtree rooted at a row node and corresponds to an element to be extracted.
A sending unit 804, configured to forward the search request received by the receiving unit 801 to the search engine or deep web site, and to aggregate the structured data extracted by the extraction unit 803 and return it to the search client in a unified format.
As shown in Fig. 8, the extraction unit 803 can specifically comprise:
An annotation unit 8031, configured to automatically annotate the search result HTML page according to the pre-saved template HTML page of the search engine or deep web site and the matching relationship between the template page DOM tree and the search result page DOM tree obtained by the acquiring unit 802.
A data extraction unit 8032, configured to extract structured data from the search result HTML page that the annotation unit 8031 has automatically annotated.
As shown in Fig. 8, the annotation unit 8031 can specifically comprise:
A first searching unit 80311, configured to find the table node in the search result HTML page, using the annotation of the table node in the template HTML page and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of the table node in the template HTML page comprises: an annotation marking the node as the table node; an annotation indicating that the number of leading direct child nodes of the table node to be skipped when annotating row nodes in the template HTML page is a first quantity; and an annotation indicating that the number of trailing direct child nodes of the table node to be skipped when annotating row nodes in the template HTML page is a second quantity.
A first annotation unit 80312, configured to annotate the table node in the search result HTML page found by the first searching unit 80311.
A second searching unit 80313, configured to find the row nodes in the search result HTML page, using the table node in the search result HTML page found by the first searching unit 80311, the annotation of the row nodes in the template HTML page, and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of a row node in the template HTML page comprises: an annotation marking the node as a row node; an annotation indicating the number of parts into which the row of data corresponding to the row node is divided; and an annotation indicating the sequence number of the part that the row node occupies in the template HTML page.
A second annotation unit 80314, configured to annotate the row nodes in the search result HTML page found by the second searching unit 80313.
A third searching unit 80315, configured to find the field nodes in the search result HTML page, using the row nodes in the search result HTML page found by the second searching unit 80313, the annotation of the field nodes in the template HTML page, and the matching relationship between the template page DOM tree and the search result page DOM tree. The annotation of a field node in the template HTML page comprises: an annotation marking the node as a field node; an annotation giving the column name, within the tabular structured data, of the structured data to be extracted; an annotation giving the number of columns in the tabular structure; and an annotation giving the sequence number of the column to which the field node belongs.
A third annotation unit 80316, configured to annotate the field nodes in the search result HTML page found by the third searching unit 80315.
As can be seen from the above, in this embodiment the search server can use the manually annotated Template DOM Tree to extract structured data automatically from search result HTML pages. The search server can therefore build one unified, general Wrapper for all search engines or deep web sites and a manually annotated Template DOM Tree for each deep web site; this is sufficient to extract the structured data of all deep web sites automatically, with relatively high extraction accuracy and efficiency. Moreover, when the interface of a search engine or deep web site changes, only the interface template needs to be re-annotated to extract automatically from the new interface. The code of the general Wrapper need not be modified, which greatly improves the maintenance efficiency of the system.
The following describes the search system provided by embodiments of the invention. The search system comprises the HTML page structured data extraction device provided by embodiments of the invention, and this device is connected to at least one search engine or deep web site.
Because the information exchange between, and the execution of, the modules of the above devices and system are based on the same concept as the method embodiments of the invention, refer to the description of the method embodiments for details, which are not repeated here.
Those of ordinary skill in the art will understand that all or part of the flows of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Specific examples are used herein to explain the principles and embodiments of the invention; the description of the above embodiments is intended only to help understand the method of the invention and its ideas. Meanwhile, those of ordinary skill in the art may, following the ideas of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (15)

1. A hypertext markup language page structured data extraction method, characterized by comprising:
sending a search request to a search engine or deep web site;
obtaining the search result Hypertext Markup Language page that the search engine or deep web site produces for the search request;
extracting structured data from the search result Hypertext Markup Language page according to the pre-saved template Hypertext Markup Language page of the search engine or deep web site and the matching relationship between the template page document object model tree corresponding to the template Hypertext Markup Language page and the search result page document object model tree, wherein the template Hypertext Markup Language page comprises manual parsing annotations, the manual parsing annotations being annotations made on the structured elements of the template Hypertext Markup Language page;
the manual parsing annotations comprised in the template Hypertext Markup Language page comprise:
an annotation of the table node in the template Hypertext Markup Language page, wherein the subtree rooted at the table node is the smallest subtree of the template page document object model tree that contains all the structured data to be extracted;
an annotation of the row nodes in the template Hypertext Markup Language page, wherein a row node is a direct child node of the table node, and the subtree rooted at the row node contains the data of one row of the tabular structured data to be extracted;
and an annotation of the field nodes in the template Hypertext Markup Language page, wherein a field node is a node in the subtree rooted at a row node, and the field node is the node corresponding to an element to be extracted.
2. The hypertext markup language page structured data extraction method according to claim 1, characterized in that:
the annotation of the table node in the sample Hypertext Markup Language page comprises: annotating the table node in the sample Hypertext Markup Language page as a table node; annotating, as a first quantity, the number of leading direct child nodes of the table node that are skipped when the row nodes in the sample Hypertext Markup Language page are annotated; and annotating, as a second quantity, the number of trailing direct child nodes of the table node that are skipped when the row nodes in the sample Hypertext Markup Language page are annotated;
the annotation of the row node in the sample Hypertext Markup Language page comprises: annotating the row node in the sample Hypertext Markup Language page as a row node; annotating the number of parts into which the data row corresponding to the row node is divided; and annotating the sequence number of the part in which the row node is located in the sample Hypertext Markup Language page;
the annotation of the element node in the sample Hypertext Markup Language page comprises: annotating the element node in the sample Hypertext Markup Language page as an element node; annotating the column name, in the structured list data, of the structured data to be extracted; annotating the number of columns in the structured list; and annotating the sequence number of the column in which the element node is located.
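For illustration only, the three kinds of manual annotation listed in claim 2 can be modelled as small records. The following Python sketch is not part of the claimed method: the class names, the field names, and the choice of 1-based sequence numbers are all assumptions made for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableAnnotation:
    """Annotation on the table node of the sample page."""
    skip_leading: int   # "first quantity": leading direct children that are not rows
    skip_trailing: int  # "second quantity": trailing direct children that are not rows


@dataclass(frozen=True)
class RowAnnotation:
    """Annotation on a row node of the sample page."""
    part_count: int  # number of parts one data row is divided into
    part_index: int  # sequence number of the part this row node holds (assumed 1-based)


@dataclass(frozen=True)
class ElementAnnotation:
    """Annotation on an element node of the sample page."""
    column_name: str   # column name in the structured list
    column_count: int  # number of columns in the structured list
    column_index: int  # sequence number of this element's column (assumed 1-based)


def is_consistent(row: RowAnnotation, element: ElementAnnotation) -> bool:
    """Check that both sequence numbers fall inside their declared counts."""
    return (1 <= row.part_index <= row.part_count
            and 1 <= element.column_index <= element.column_count)
```

A consistency check of this kind is one way an annotation tool might reject a sequence number that exceeds the declared number of parts or columns; the claims themselves do not require it.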
3. The hypertext markup language page structured data extraction method according to claim 2, characterized in that the extracting of structured data from the search result Hypertext Markup Language page comprises:
using the annotation of the table node in the sample Hypertext Markup Language page and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the table node on the search result Hypertext Markup Language page;
using the table node found on the search result Hypertext Markup Language page, the annotation of the row node in the sample Hypertext Markup Language page, and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the row nodes on the search result Hypertext Markup Language page;
using the row nodes found on the search result Hypertext Markup Language page, the annotation of the element node in the sample Hypertext Markup Language page, and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the element nodes on the search result Hypertext Markup Language page;
extracting the structured data directly from the element nodes found on the search result Hypertext Markup Language page.
4. The hypertext markup language page structured data extraction method according to claim 3, characterized in that the finding of the table node on the search result Hypertext Markup Language page comprises:
determining the path from the table node in the sample page document object model tree upward through its parent nodes to the root node, and, starting from the root node of the sample page document object model tree, finding for each node along this path in the reverse direction the matched node in the search result page document object model tree, until the matched table node is found in the search result page document object model tree;
the finding of the row nodes on the search result Hypertext Markup Language page comprises:
among the direct child nodes of the table node on the search result Hypertext Markup Language page, taking the direct child nodes other than the first quantity of leading direct child nodes and the second quantity of trailing direct child nodes as the row nodes on the search result Hypertext Markup Language page;
the finding of the element nodes on the search result Hypertext Markup Language page comprises:
determining the path from the element node in the sample page document object model tree upward through its parent nodes to the row node, and, starting from the row node of the sample page document object model tree, finding for each node along this path in the reverse direction the matched node in the search result page document object model tree, until the matched element node is found in the search result page document object model tree.
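The three lookups in claim 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `Node` is a hypothetical stand-in for a DOM node, a path is recorded as a list of (child position, tag) steps taken from the sample tree, and a node is treated as "matched" when it sits at the same position under the matched parent and has the same tag. The tag check is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical, not a real DOM API)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


# One step of a top-down path recorded on the sample tree: (child position, tag).
Step = Tuple[int, str]


def descend(start: Node, steps: List[Step]) -> Optional[Node]:
    """Follow the sample path downward, matching each node by position and tag."""
    node = start
    for index, tag in steps:
        if index >= len(node.children) or node.children[index].tag != tag:
            return None  # the result page has no matching node at this step
        node = node.children[index]
    return node


def find_rows(table: Node, skip_leading: int, skip_trailing: int) -> List[Node]:
    """Row nodes are the direct children left after skipping both ends."""
    end = max(len(table.children) - skip_trailing, 0)
    return table.children[skip_leading:end]


def extract(root: Node, table_steps: List[Step], skip_leading: int,
            skip_trailing: int,
            element_steps: Dict[str, List[Step]]) -> List[Dict[str, str]]:
    """Table lookup, then row lookup, then element lookup inside each row."""
    table = descend(root, table_steps)
    if table is None:
        return []
    records = []
    for row in find_rows(table, skip_leading, skip_trailing):
        record = {}
        for column, steps in element_steps.items():
            element = descend(row, steps)
            if element is not None:
                record[column] = element.text
        records.append(record)
    return records


def demo_page() -> Node:
    """A made-up result page: a list with one header item and one footer item."""
    def row(title: str, price: str) -> Node:
        return Node("li", [Node("a", text=title), Node("span", text=price)])
    results = Node("ul", [Node("li", text="header"), row("A", "1"),
                          row("B", "2"), Node("li", text="footer")])
    return Node("html", [Node("body", [Node("div"), results])])
```

With the made-up page above, `extract(demo_page(), [(0, "body"), (1, "ul")], 1, 1, {"title": [(0, "a")], "price": [(1, "span")]})` skips the header and footer items and yields one record per remaining row.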
5. The hypertext markup language page structured data extraction method according to claim 4, characterized in that the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree comprises: searching the search result page document object model tree for a node that has a parent node matching that of the table node in the sample Hypertext Markup Language page and that is a child node at the same position under this matched parent node, and taking that node as the matched node in the search result page document object model tree.
6. The hypertext markup language page structured data extraction method according to claim 4, characterized in that the annotation of nodes in the sample Hypertext Markup Language page may further comprise the position path of a certain node on the path from the table node to the root node in the sample Hypertext Markup Language page;
the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree, until the matched table node is found in the search result page document object model tree, comprises: according to the position path of the certain node in the sample Hypertext Markup Language page, searching the search result page document object model tree for the node with the same position path, and taking it as the matched node of the certain node on the search result Hypertext Markup Language page.
7. The hypertext markup language page structured data extraction method according to claim 4, characterized in that the annotation of nodes in the sample Hypertext Markup Language page may further comprise the position path of a certain node on the path from the element node to the row node in the sample Hypertext Markup Language page;
the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree, until the matched element node is found in the search result page document object model tree, comprises:
according to the position path of the certain node in the sample Hypertext Markup Language page, searching the search result page document object model tree for the node with the same position path, and taking it as the matched node of the certain node on the search result Hypertext Markup Language page.
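The position-path lookup of claims 6 and 7 can be pictured with a short sketch. The claims do not fix how a position path is encoded; the version below assumes, purely for illustration, a string of child indexes such as `/1/1` counted from the root, and uses a hypothetical `Node` class in place of a real DOM.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


def position_path(root: Node, target: Node,
                  prefix: Tuple[int, ...] = ()) -> Optional[str]:
    """Return the child-index path from root to target, or None if absent."""
    if root is target:
        return ("/" + "/".join(str(i) for i in prefix)) if prefix else "/"
    for index, child in enumerate(root.children):
        found = position_path(child, target, prefix + (index,))
        if found is not None:
            return found
    return None


def node_at(root: Node, path: str) -> Optional[Node]:
    """Resolve a position path on another tree; None if the path does not exist."""
    node = root
    for part in path.strip("/").split("/"):
        if part == "":
            continue  # the path "/" designates the root itself
        index = int(part)
        if index >= len(node.children):
            return None
        node = node.children[index]
    return node
```

The path is computed once on the annotated sample tree with `position_path` and then resolved on each search result tree with `node_at`, so the matched node is reached directly instead of step by step from the root.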
8. The hypertext markup language page structured data extraction method according to claim 1, characterized in that the extracting of structured data from the search result Hypertext Markup Language page comprises:
performing automatic parsing annotation on the search result Hypertext Markup Language page;
extracting the structured data from the search result Hypertext Markup Language page on which the automatic parsing annotation has been performed.
9. The hypertext markup language page structured data extraction method according to claim 8, characterized in that:
the annotation of the table node in the sample Hypertext Markup Language page comprises: annotating the table node in the sample Hypertext Markup Language page as a table node; annotating, as a first quantity, the number of leading direct child nodes of the table node that are skipped when the row nodes in the sample Hypertext Markup Language page are annotated; and annotating, as a second quantity, the number of trailing direct child nodes of the table node that are skipped when the row nodes in the sample Hypertext Markup Language page are annotated;
the annotation of the row node in the sample Hypertext Markup Language page comprises: annotating the row node in the sample Hypertext Markup Language page as a row node; annotating the number of parts into which the data row corresponding to the row node is divided; and annotating the sequence number of the part in which the row node is located in the sample Hypertext Markup Language page;
the annotation of the element node in the sample Hypertext Markup Language page comprises: annotating the element node in the sample Hypertext Markup Language page as an element node; annotating the column name, in the structured list data, of the structured data to be extracted; annotating the number of columns in the structured list; and annotating the sequence number of the column in which the element node is located.
10. The hypertext markup language page structured data extraction method according to claim 9, characterized in that the performing of automatic parsing annotation on the search result Hypertext Markup Language page comprises:
using the annotation of the table node in the sample Hypertext Markup Language page and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the table node on the search result Hypertext Markup Language page, and annotating the table node found on the search result Hypertext Markup Language page;
using the table node found on the search result Hypertext Markup Language page, the annotation of the row node in the sample Hypertext Markup Language page, and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the row nodes on the search result Hypertext Markup Language page, and annotating the row nodes found on the search result Hypertext Markup Language page;
using the row nodes found on the search result Hypertext Markup Language page, the annotation of the element node in the sample Hypertext Markup Language page, and the matching relationship between the sample page document object model tree and the search result page document object model tree, finding the element nodes on the search result Hypertext Markup Language page, and annotating the element nodes found on the search result Hypertext Markup Language page.
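Claims 8 and 10 separate the work into two passes: first annotate the search result page automatically, then read the data back from the annotations. The sketch below illustrates that split under simplifying assumptions: the `Node` class, the `role` and `column` fields, and the use of plain child-index paths are hypothetical choices, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Minimal stand-in for a DOM node that can carry an annotation."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""
    role: Optional[str] = None    # "table", "row" or "element" once annotated
    column: Optional[str] = None  # column name, set on element nodes only


def descend(start: Node, positions: List[int]) -> Optional[Node]:
    """Follow child positions recorded on the sample tree."""
    node = start
    for index in positions:
        if index >= len(node.children):
            return None
        node = node.children[index]
    return node


def annotate(root: Node, table_path: List[int], skip_leading: int,
             skip_trailing: int, element_paths: Dict[str, List[int]]) -> bool:
    """Pass 1: mark the table, row and element nodes on the result tree."""
    table = descend(root, table_path)
    if table is None:
        return False
    table.role = "table"
    end = max(len(table.children) - skip_trailing, 0)
    for row in table.children[skip_leading:end]:
        row.role = "row"
        for column, path in element_paths.items():
            element = descend(row, path)
            if element is not None:
                element.role, element.column = "element", column
    return True


def extract_annotated(root: Node) -> List[Dict[str, str]]:
    """Pass 2: build one record per annotated row from its annotated elements."""
    records: List[Dict[str, str]] = []

    def visit(node: Node, current: Optional[Dict[str, str]]) -> None:
        if node.role == "row":
            current = {}
            records.append(current)
        elif node.role == "element" and current is not None:
            current[node.column] = node.text
        for child in node.children:
            visit(child, current)

    visit(root, None)
    return records
```

Keeping the annotations on the result tree means the second pass needs no knowledge of the sample page, which is one plausible reason for separating the two steps.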
11. The hypertext markup language page structured data extraction method according to claim 10, characterized in that the finding of the table node on the search result Hypertext Markup Language page comprises:
determining the path from the table node in the sample page document object model tree upward through its parent nodes to the root node, and, starting from the root node of the sample page document object model tree, finding for each node along this path in the reverse direction the matched node in the search result page document object model tree, until the matched table node is found in the search result page document object model tree;
the finding of the row nodes on the search result Hypertext Markup Language page comprises:
among the direct child nodes of the table node on the search result Hypertext Markup Language page, taking the direct child nodes other than the first quantity of leading direct child nodes and the second quantity of trailing direct child nodes as the row nodes on the search result Hypertext Markup Language page;
the finding of the element nodes on the search result Hypertext Markup Language page comprises:
determining the path from the element node in the sample page document object model tree upward through its parent nodes to the row node, and, starting from the row node of the sample page document object model tree, finding for each node along this path in the reverse direction the matched node in the search result page document object model tree, until the matched element node is found in the search result page document object model tree.
12. The hypertext markup language page structured data extraction method according to claim 11, characterized in that the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree comprises: searching the search result page document object model tree for a node that has a parent node matching that of the table node in the sample Hypertext Markup Language page and that is a child node at the same position under this matched parent node, and taking that node as the matched node in the search result page document object model tree.
13. The hypertext markup language page structured data extraction method according to claim 11, characterized in that the annotation of nodes in the sample Hypertext Markup Language page may further comprise the position path of a certain node on the path from the table node to the root node in the sample Hypertext Markup Language page;
the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree, until the matched table node is found in the search result page document object model tree, comprises: according to the position path of the certain node in the sample Hypertext Markup Language page, searching the search result page document object model tree for the node with the same position path, and taking it as the matched node of the certain node on the search result Hypertext Markup Language page.
14. The hypertext markup language page structured data extraction method according to claim 11, characterized in that the annotation of nodes in the sample Hypertext Markup Language page may further comprise the position path of a certain node on the path from the element node to the row node in the sample Hypertext Markup Language page;
the finding, for each node along this path in the reverse direction, of the matched node in the search result page document object model tree, until the matched element node is found in the search result page document object model tree, comprises:
according to the position path of the certain node in the sample Hypertext Markup Language page, searching the search result page document object model tree for the node with the same position path, and taking it as the matched node of the certain node on the search result Hypertext Markup Language page.
15. A Hypertext Markup Language page structured data extraction device, characterized in that it comprises:
a transmitting unit, configured to send a search request to a search engine or a deep web;
an acquiring unit, configured to obtain the search result Hypertext Markup Language page that the search engine or deep web obtains according to the search request;
an extraction unit, configured to extract structured data from the search result Hypertext Markup Language page according to a pre-saved sample Hypertext Markup Language page of the search engine or deep web, and according to a matching relationship between the sample page document object model tree and the search result page document object model tree obtained by the acquiring unit, wherein the sample Hypertext Markup Language page comprises manual parsing annotations, and the manual parsing annotations are annotations made on the structural elements of the sample Hypertext Markup Language page;
the manual parsing annotations comprised in the sample Hypertext Markup Language page comprise:
an annotation of a table node in the sample Hypertext Markup Language page, wherein the subtree rooted at the table node is the minimum subtree of the sample page document object model tree that contains all the structured data to be extracted;
an annotation of a row node in the sample Hypertext Markup Language page, wherein the row node is a direct child node of the table node, and the subtree rooted at the row node contains the data of one row of the structured list data to be extracted;
and an annotation of an element node in the sample Hypertext Markup Language page, wherein the element node is a node in the subtree rooted at the row node, and the element node is the node corresponding to an element to be extracted.
CN 201010297636 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device Active CN102135976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010297636 CN102135976B (en) 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device

Publications (2)

Publication Number Publication Date
CN102135976A CN102135976A (en) 2011-07-27
CN102135976B CN102135976B (en) 2013-12-18

Family

ID=44295764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010297636 Active CN102135976B (en) 2010-09-27 2010-09-27 Hypertext markup language page structured data extraction method and device

Country Status (1)

Country Link
CN (1) CN102135976B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN104598462B (en) * 2013-10-30 2018-08-07 深圳市国信互联科技有限公司 Extract the method and device of structural data
CN104112002B (en) * 2014-07-14 2017-08-25 福建星网锐捷网络有限公司 A kind of methods, devices and systems of list adaptation
CN105138561B (en) * 2015-07-23 2018-11-27 中国测绘科学研究院 A kind of darknet space data acquisition method and device
CN109086450B (en) * 2018-08-24 2021-08-27 电子科技大学 Web deep network query interface detection method
CN109558571A (en) * 2018-10-18 2019-04-02 深圳壹账通智能科技有限公司 File size recognition methods, device, computer equipment and storage medium
CN109784382A (en) * 2018-12-27 2019-05-21 广州华多网络科技有限公司 Markup information processing method, device and server
CN110555178B (en) * 2019-08-28 2020-07-21 贝壳找房(北京)科技有限公司 Data proxy method and device
CN111026658B (en) * 2019-12-03 2023-10-20 北京小米移动软件有限公司 Quick application debugging method, device and medium
CN112182310B (en) * 2020-11-04 2023-11-17 上海德拓信息技术股份有限公司 Method for realizing built-in real-time search general tree-shaped component

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211336A (en) * 2006-12-29 2008-07-02 鸿富锦精密工业(深圳)有限公司 Visualized system and method for generating inquiry file
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages

Also Published As

Publication number Publication date
CN102135976A (en) 2011-07-27

Similar Documents

Publication Publication Date Title
CN102135976B (en) Hypertext markup language page structured data extraction method and device
US20100083095A1 (en) Method for Extracting Data from Web Pages
CN107423391B (en) Information extraction method of webpage structured data
US11580177B2 (en) Identifying information using referenced text
US11423042B2 (en) Extracting information from unstructured documents using natural language processing and conversion of unstructured documents into structured documents
CN103544176A (en) Method and device for generating page structure template corresponding to multiple pages
CN102063488A (en) Code searching method based on semantics
US20100223214A1 (en) Automatic extraction using machine learning based robust structural extractors
CN102831121A (en) Method and system for extracting webpage information
CN103136360A (en) Internet behavior markup engine and behavior markup method corresponding to same
CN103678509B (en) Generate the method and device of web page template
CN106294885A (en) A kind of data collection towards isomery webpage and mask method
KR20200082179A (en) Data transformation method for spatial data's semantic annotation
CN103838862A (en) Video searching method, device and terminal
Oelen et al. Creating a scholarly knowledge graph from survey article tables
US11392753B2 (en) Navigating unstructured documents using structured documents including information extracted from unstructured documents
Greenberg Metadata and digital information
Grasso et al. Effective web scraping with oxpath
CN107015907A (en) A kind of system and method for automatic accurate positioning webpage element
Jou Schema extraction for deep web query interfaces using heuristics rules
CN116523041A (en) Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment
CN103761312B (en) Information extraction system and method for multi-recording webpage
CN104063506A (en) Method and device for identifying repeated web pages
Chang et al. Supporting unified interface to wrapper generator in Integrated Information Retrieval
CN101089841A (en) Precision search method and system based on knowlege code

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant