CN108563729B - Bid winning information extraction method for bidding website based on DOM tree - Google Patents

Info

Publication number
CN108563729B
Authority
CN
China
Prior art keywords
node
bid
winning
dom tree
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810301630.0A
Other languages
Chinese (zh)
Other versions
CN108563729A (en)
Inventor
陈羽中
林剑
郭昆
张伟智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for extracting bid-winning information from a bidding website based on a DOM tree, comprising the following steps. First, the bid-winning information list page of the bidding website is crawled to obtain the title of each bid-winning item shown in the list page and the link to its detail page, and the HTML code of the detail page is fetched through the link; a title together with the HTML code of its detail page forms one item of bid-winning item data, and N such items form a data set. For each item in the data set, a DOM tree is created from the title in the list page and the corresponding HTML code, so that traversing the data set yields N DOM trees. A wrapper is then generated from the N DOM trees. Finally, the wrapper is used to extract the text content of a bid-winning item detail page, that is, the bid-winning item information. The method can improve the accuracy of bid-winning information extraction and reduce the total execution time of the task.

Description

Bid winning information extraction method for bidding website based on DOM tree
Technical Field
The invention relates to the technical field of wrappers (Wrapper), and in particular to a method for extracting bid-winning information from a bidding website based on a DOM tree.
Background
Information extraction from web pages is mainly performed by a wrapper. A wrapper is a software program consisting of a set of predefined information extraction rules and the code that applies them. In response to a user's query against a specific information source, it locates the relevant data in the pages of that source, extracts it, converts it into a specified format, and returns it to the user. A wrapper typically targets one class of pages in a particular information source; extracting data from several different sources requires a library of wrappers.
At present, almost all web pages contain some amount of template content, such as navigation bars, organization logos and contact information, and advertisement banners, which recurs across all pages of the same organization. Methods for extracting the main text of web pages have therefore attracted growing research interest, and several algorithms have been proposed. One is a precise extraction algorithm for news web page text based on a two-level decision: a global decision on the region in which the main text is located, and a local decision on whether each paragraph within that region belongs to the main text. Another is a DOM-based algorithm for automatically extracting the topic information of web pages, which proposes an STU-DOM tree model carrying semantic information to address the semi-structured nature of HTML and the lack of semantic description in the DOM specification. A third is a tag-window-based method for extracting web page text, which locates the main text by comparing candidate text with the page title. A fourth is an automatic main-text extraction method for short-text web pages, which divides the page into short texts according to word count and then decides whether each short text is main text by finding the nodes and positions with the maximum text density.
Most existing main-text extraction algorithms classify content using text density and tag density as features. They do not exploit the fact that, across similar pages, the tags surrounding the main text are the same while the content differs, and they handle short main text poorly.
Disclosure of Invention
The invention aims to provide a method for extracting bid-winning information from a bidding website based on a DOM tree.
To achieve this purpose, the technical scheme of the invention is a method for extracting bid-winning information from a bidding website based on a DOM tree, comprising the following steps:
Step A: crawl the bid-winning information list page of the bidding website to obtain the title of each bid-winning item shown in the list page and the link to its detail page, and fetch the HTML code of the detail page through the link. The title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N such items form the data set used to generate the wrapper, where N is a natural number not less than 1.
Step B: traverse the data set of N items obtained in step A and, for each item, create a DOM tree from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, yielding N DOM trees.
Step C: generate a Wrapper from the N DOM trees created in step B.
Step D: use the wrapper generated in step C to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information.
In an embodiment of the present invention, creating a DOM tree in step B for each item in the data set, from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, specifically includes the following steps:
Step B1: parse the HTML code of the bid-winning item detail page and create a DOM tree, defining eight attributes for each node, Node = {label, attrib, text, parent, children, prebro, nextbro, flag}, which represent the tag type, tag attributes, text, parent node, child nodes, previous sibling node, next sibling node, and a flag marking whether the node is bid-winning information. The tag attributes attrib comprise the id, class and href of the tag, and flag is initialized to true.
Step B2: remove from the DOM tree the nodes of decorative tags, including <head></head>, <script></script> and <style></style>. Since <head> holds the page's header information, <script> holds scripts and <style> holds styles, none of them can contain bid-winning information, so they are deleted.
Step B3: using the bid-winning item title obtained from the list page, locate the node P containing the title in the DOM tree by fuzzy matching.
Step B4: replace all sibling nodes that precede P at P's level with a custom node, and repeat this for the parent of P, recursively, until the root node is reached.
Step B5: starting from the root node, where the <html> tag is located, perform a breadth-first traversal and post-process the nodes of the DOM tree according to their tags and tag attributes.
Further, in step B3, the node P containing the title is located in the DOM tree as follows: perform a depth-first traversal of the DOM tree and take the first node that satisfies the following condition:
[Formula image not reproduced in this text: the matching condition, expressed in terms of LCS(S, T) and S.length]
where S denotes the title in the list page, T denotes the string being compared, LCS(S, T) is the length of the longest common subsequence of the two strings, and S.length is the length of the title string.
Further, in step B4, the custom node that replaces the sibling nodes preceding P at P's level is defined as <div class="# equals"></div>. Its tag attributes include a class attribute and its text is not empty, so that the custom node is retained rather than deleted when the DOM tree is post-processed in step B5; its flag attribute is set to false, indicating that the custom node is not bid-winning item information.
Further, in step B5, the breadth-first traversal from the root node, where the <html> tag is located, and the post-processing of the nodes according to their tags and tag attributes specifically include the following steps:
Step B51: add all child nodes of the root of the DOM tree to a queue Q, in order.
Step B52: if Q is not empty, pop the head node q from Q; if Q is empty, end the process.
Step B53: if q is an <a> tag whose tag attributes contain an href with a non-empty value, delete q from the DOM tree and jump to step B56; otherwise go to step B54.
Step B54: if q is a <p> tag with no id attribute, go to step B55; otherwise jump to step B56.
Step B55: if q is a leaf node, delete it and jump to step B52; otherwise replace q with all of its child nodes and jump to step B52.
Step B56: if the tag attributes of q include a class attribute or an id attribute, jump to step B55; otherwise add all child nodes of q to the queue and jump to step B52.
In an embodiment of the present invention, generating the Wrapper in step C from the N DOM trees created in step B specifically includes the following steps:
Step C1: randomly select one of the N DOM trees and add its root node to the set S.
Step C2: for the current node, collect the length of the text it contains together with the lengths of the text contained in the nodes with the same tag type and tag attributes in the other N-1 DOM trees, and count how many nodes share each text length. If the largest of these counts is less than N/2, the node is considered to contain bid-winning information and the method proceeds to step C3; otherwise the node is considered not to be bid-winning information, its flag is set to false, and the method jumps to step C4.
Step C3: perform step C2 on each child node of the node in turn.
Step C4: search the set S for an ancestor node R of the node. If R exists, replace R in S with the other child nodes of R whose flag is true together with the sibling nodes of the node whose flag is true; if no such ancestor exists, delete the node from the result set.
Step C5: generate the Wrapper from the nodes in the set S; the Wrapper is a set of XPath expressions built from tag types and tag attributes.
Further, generating the Wrapper in step C5 specifically includes the following steps:
Step C51: traverse all nodes in the set S and perform step C52 on each.
Step C52: if the node's tag attributes include an id attribute whose value is not empty, generate the XPath expression //node.tag[@id=node.id]; otherwise generate the XPath expression //node.tag[@class=node.class]. The first expression selects, among all descendants of the current node, the nodes whose tag type is node.tag and whose id attribute is node.id; the second selects, among all descendants of the current node, the nodes whose tag type is node.tag and whose class attribute is node.class.
Step C53: add the generated XPath expression to the Wrapper's set of XPath expressions. The Wrapper is thus the set of XPath expressions generated from the tag types and tag attributes of the nodes.
In an embodiment of the present invention, in step D, a standard DOM tree is created for the bidding web page, nodes in the DOM tree are selected using the XPath expressions in the Wrapper, and the text of all descendant nodes rooted at each selected node is extracted.
Compared with the prior art, the method processes a data set of bid-winning items from a bidding website to generate a wrapper that extracts bid-winning data automatically, which improves the accuracy of bid-winning information extraction while reducing the total execution time of the task.
Drawings
Fig. 1 is a flowchart of the method of the present invention for extracting bid-winning information from a bidding website based on a DOM tree.
Fig. 2 is an exemplary diagram of step B55.
Fig. 3 is an exemplary diagram of step C4.
Detailed Description
The invention is further explained below with reference to the figures and the specific embodiments.
FIG. 1 is a flowchart of the method of the present invention for extracting bid-winning information from a bidding website based on a DOM tree. First, the bid-winning information list page of the bidding website is crawled to obtain the title of each bid-winning item shown in the list page and the link to its detail page, and the HTML code of the detail page is fetched through the link; the title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and several such items form the data set used to generate the wrapper. The data set of N items is then traversed, and a DOM tree is created for each item from the title in the list page and the HTML code of the corresponding detail page, yielding N DOM trees. A wrapper is generated from the N DOM trees. Finally, the generated wrapper is used to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information. As shown in FIG. 1, the method comprises the following steps:
Step A: crawl the bid-winning information list page of the bidding website to obtain the title of each bid-winning item shown in the list page and the link to its detail page, and fetch the HTML code of the detail page through the link. The title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N such items form the data set used to generate the wrapper. N is a natural number not less than 1.
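As a rough illustration of step A, the sketch below pulls (title, detail-page link) pairs out of a list page using only the Python standard library. The list-page markup, the base URL and the names `ListPageParser` and `dataset_index` are assumptions made for this example; the patent does not specify how the list page is structured, and fetching the detail-page HTML over the network is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListPageParser(HTMLParser):
    """Collects (title, detail-page URL) pairs from the anchors of a list page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []      # finished (title, url) pairs
        self._href = None    # absolute URL of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.items.append((title, self._href))
            self._href = None


# Hypothetical list-page markup and base URL, for illustration only.
LIST_PAGE = """
<ul>
  <li><a href="/detail/1.html">Road repair project bid-winning notice</a></li>
  <li><a href="/detail/2.html">School canteen equipment bid-winning notice</a></li>
</ul>
"""

parser = ListPageParser("http://example.com/list.html")
parser.feed(LIST_PAGE)
parser.close()
dataset_index = parser.items   # one (title, link) pair per bid-winning item
```

Each pair would then be completed with the HTML code downloaded from its link to form one item of bid-winning item data.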
Step B: traverse the data set of N items obtained in step A and, for each item, create a DOM tree from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page. After the data set has been traversed, N DOM trees have been generated.
Specifically, creating the DOM tree for each item in step B includes the following steps:
Step B1: parse the HTML code of the bid-winning item detail page and create a DOM tree, defining eight attributes for each node, Node = {label, attrib, text, parent, children, prebro, nextbro, flag}, which represent the tag type, tag attributes, text, parent node, child nodes, previous sibling node, next sibling node, and a flag marking whether the node is bid-winning information. The tag attributes attrib comprise the id, class and href of the tag, and flag is initialized to true;
Step B2: remove from the DOM tree the nodes of decorative tags, including <head></head>, <script></script> and <style></style>. Since <head> holds the page's header information, <script> holds scripts and <style> holds styles, none of them can contain bid-winning information, so they are deleted;
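Steps B1 and B2 can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser`, not the patent's implementation: the `TreeBuilder` class, the `VOID_TAGS` set, the artificial `#document` root and the choice to compute `prebro` and `nextbro` from the parent's child list are all assumptions of the example.

```python
from html.parser import HTMLParser

DECORATIVE_TAGS = {"head", "script", "style"}              # removed in step B2
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}   # have no end tag


class Node:
    """One DOM node carrying the eight attributes listed in step B1."""

    def __init__(self, label, attrib=None, parent=None):
        self.label = label            # tag type
        self.attrib = attrib or {}    # id, class and href only
        self.text = ""
        self.parent = parent
        self.children = []
        self.flag = True              # initialised to true in step B1

    def _sibling(self, offset):
        if self.parent is None:
            return None
        sibs = self.parent.children
        i = sibs.index(self) + offset
        return sibs[i] if 0 <= i < len(sibs) else None

    @property
    def prebro(self):                 # previous sibling, or None
        return self._sibling(-1)

    @property
    def nextbro(self):                # next sibling, or None
        return self._sibling(1)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # artificial root above <html>
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        kept = {k: v for k, v in attrs if k in ("id", "class", "href") and v}
        node = Node(tag, kept, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self._current
        while node is not self.root and node.label != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.text += data.strip()


def remove_decorative(node):
    """Step B2: drop <head>, <script> and <style> subtrees."""
    node.children = [c for c in node.children if c.label not in DECORATIVE_TAGS]
    for child in node.children:
        remove_decorative(child)


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    remove_decorative(builder.root)
    return builder.root


# Hypothetical detail page, for illustration only.
SAMPLE = ("<html><head><title>t</title></head><body><script>var a = 1;</script>"
          "<div id='content'><p>Winner</p><p>Company X</p></div></body></html>")
root = build_dom(SAMPLE)
```

After `build_dom`, the `<html>` node has a single `<body>` child, and the `<script>` element under `<body>` is gone.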
Step B3: using the bid-winning item title obtained from the list page, locate the node P containing the title in the DOM tree by fuzzy matching;
Preferably, the node P is located by fuzzy matching as follows: perform a depth-first traversal of the DOM tree and take the first node that satisfies the following condition:
[Formula image not reproduced in this text: the matching condition, expressed in terms of LCS(S, T) and S.length]
where S denotes the title in the list page, T denotes the string being compared, LCS(S, T) is the length of the longest common subsequence of the two strings, and S.length is the length of the title string.
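The fuzzy match of step B3 might look like the sketch below. The exact condition is available only as an image in the source, so the form used here, LCS(S, T) / S.length compared against a threshold, and the default threshold of 0.8 are assumptions of this example. The tree is represented as nested dictionaries purely to keep the example self-contained.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (dynamic programming)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        curr = [0]
        for j, tj in enumerate(t, start=1):
            if ch == tj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def find_title_node(tree, title, threshold=0.8):
    """Depth-first search for the first node whose text matches the title.

    `tree` is a nested dict {"text": str, "children": [...]}.
    `threshold` is a hypothetical value, not taken from the patent.
    """
    if title and lcs_length(title, tree["text"]) / len(title) >= threshold:
        return tree
    for child in tree["children"]:
        found = find_title_node(child, title, threshold)
        if found is not None:
            return found
    return None


# Hypothetical page content, for illustration only.
page = {"text": "", "children": [
    {"text": "Home page", "children": []},
    {"text": "Announcement: Road repair project (2018)", "children": []},
]}
P = find_title_node(page, "Road repair project")
```

Because the title occurs in full inside the second node's text, its longest common subsequence with that text has the title's full length and the node is returned as P.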
Step B4: replace all sibling nodes that precede P at P's level with a custom node, and repeat this for the parent of P, recursively, until the root node is reached;
Preferably, in step B4 the custom node that replaces the sibling nodes preceding P at P's level is defined as <div class="# equals"></div>. Its tag attributes include a class attribute and its text is not empty, so that the custom node is retained rather than deleted when the DOM tree is post-processed in step B5; its flag attribute is set to false, indicating that the custom node is not bid-winning item information.
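A possible reading of step B4 is sketched below. Several details are assumptions of the example rather than statements of the patent: each preceding sibling is replaced one-for-one by its own placeholder, the placeholder's class value and text are stand-ins, the walk up to the root is written as a loop instead of a recursion, and the small `Node` class is redefined here only so the block runs on its own.

```python
class Node:
    def __init__(self, label, attrib=None, text="", children=()):
        self.label = label
        self.attrib = dict(attrib or {})
        self.text = text
        self.flag = True
        self.parent = None
        self.children = []
        for child in children:
            child.parent = self
            self.children.append(child)


def make_placeholder():
    """Custom node of step B4; the class value and text are stand-ins."""
    node = Node("div", {"class": "#"}, text="#")
    node.flag = False          # marks it as not bid-winning information
    return node


def mask_preceding(p):
    """Replace every sibling before p, then do the same for each ancestor of p."""
    node = p
    while node.parent is not None:
        siblings = node.parent.children
        index = siblings.index(node)
        for i in range(index):
            placeholder = make_placeholder()
            placeholder.parent = node.parent
            siblings[i] = placeholder
        node = node.parent


# Hypothetical detail-page tree, for illustration only.
nav = Node("div", {"id": "nav"}, "Home")
crumbs = Node("p", text="You are here")
title = Node("h1", text="Road repair project")
body_text = Node("p", text="Winner: Company X")
content = Node("div", {"id": "content"}, children=[crumbs, title, body_text])
root = Node("body", children=[nav, content])

mask_preceding(title)
```

Everything that appears before the title in document order (here the navigation block and the breadcrumb paragraph) ends up as a placeholder with `flag` set to false, while the title node and the nodes after it are untouched.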
Step B5: starting from the root node, where the <html> tag is located, perform a breadth-first traversal and post-process the nodes of the DOM tree according to their tags and tag attributes;
Specifically, the post-processing in step B5 includes the following steps:
Step B51: add all child nodes of the root of the DOM tree to a queue Q, in order;
Step B52: if Q is not empty, pop the head node q from Q; if Q is empty, end the process;
Step B53: if q is an <a> tag whose tag attributes contain an href with a non-empty value, delete q from the DOM tree and jump to step B56; otherwise go to step B54;
Step B54: if q is a <p> tag with no id attribute, go to step B55; otherwise jump to step B56;
Step B55: if q is a leaf node, delete it and jump to step B52; otherwise replace q with all of its child nodes and jump to step B52;
In a specific embodiment of the invention, an example of step B55 is shown in FIG. 2, which uses node C as the example.
Step B56: if the tag attributes of q include a class attribute or an id attribute, jump to step B55; otherwise add all child nodes of q to the queue and jump to step B52;
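The operation performed in step B55 (delete a leaf, or lift a node's children into its place) can be sketched on its own as below. FIG. 2 is not reproduced in this text, so the tree used here, with node C holding children E and F, is a hypothetical stand-in, and the nested-dictionary representation and the name `delete_or_unwrap` are assumptions of the example.

```python
def delete_or_unwrap(parent, q):
    """Step B55: delete q if it is a leaf, otherwise splice q's children into its place."""
    siblings = parent["children"]
    # Locate q by identity so that equal-looking nodes are not confused.
    index = next(i for i, child in enumerate(siblings) if child is q)
    if not q["children"]:
        del siblings[index]                              # leaf: simply remove it
    else:
        siblings[index:index + 1] = q["children"]        # non-leaf: children take its place


def leaf(label):
    return {"label": label, "children": []}


# Hypothetical tree: A has children B, C, D; C has children E, F.
B, D, E, F = leaf("B"), leaf("D"), leaf("E"), leaf("F")
C = {"label": "C", "children": [E, F]}
A = {"label": "A", "children": [B, C, D]}

delete_or_unwrap(A, C)
labels_after_unwrap = [child["label"] for child in A["children"]]

delete_or_unwrap(A, B)
labels_after_delete = [child["label"] for child in A["children"]]
```

Unwrapping C keeps E and F at the position C used to occupy, so the order of the remaining text in the page is preserved.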
Step C: generate a Wrapper from the N DOM trees created in step B;
Specifically, generating the Wrapper in step C includes the following steps:
Step C1: randomly select one of the N DOM trees and add its root node to the set S;
Step C2: for the current node, collect the length of the text it contains together with the lengths of the text contained in the nodes with the same tag type and tag attributes in the other N-1 DOM trees, and count how many nodes share each text length. If the largest of these counts is less than N/2, the node is considered to contain bid-winning information and the method proceeds to step C3; otherwise the node is considered not to be bid-winning information, its flag is set to false, and the method jumps to step C4;
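The test in step C2 can be illustrated as follows. The sketch assumes the matching nodes have already been found in the N trees and that their text lengths are passed in as a non-empty list; the function name `has_winning_info` and the sample lengths are hypothetical.

```python
from collections import Counter


def has_winning_info(text_lengths, n):
    """Step C2 test on the text lengths of one node and its counterparts.

    text_lengths: non-empty list with the text length of the node in each tree
                  where a node with the same tag type and attributes exists.
    n:            total number of DOM trees (N in the text).
    """
    # Number of nodes sharing the most frequent text length.
    largest_count = Counter(text_lengths).most_common(1)[0][1]
    return largest_count < n / 2


# Hypothetical measurements over N = 6 detail pages.
content_lengths = [120, 98, 143, 110, 98, 131]   # text varies from page to page
template_lengths = [12, 12, 12, 12, 12, 12]      # same footer text on every page

content_is_info = has_winning_info(content_lengths, 6)
template_is_info = has_winning_info(template_lengths, 6)
```

The intuition is that template text repeats with the same length on most pages, whereas bid-winning content differs from page to page, so no single length accounts for half of the trees.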
Step C3: perform step C2 on each child node of the node in turn;
Step C4: search the set S for an ancestor node R of the node. If R exists, replace R in S with the other child nodes of R whose flag is true together with the sibling nodes of the node whose flag is true; if no such ancestor exists, delete the node from the result set;
An example of step C4 is shown in FIG. 3. Suppose execution has reached node E and the set S is {A}. Node B has already completed step C2 with a flag value of true, and nodes D and F have not yet been processed by step C2, so their flag still has its initial value of true. The set S therefore changes from {A} to {B, F, D}.
Step C5: generate the Wrapper from the nodes in the set S; the Wrapper is a set of XPath expressions built from tag types and tag attributes.
Specifically, generating the Wrapper in step C5 includes the following steps:
Step C51: traverse all nodes in the set S and perform step C52 on each;
Step C52: if the node's tag attributes include an id attribute whose value is not empty, generate the XPath expression //node.tag[@id=node.id]; otherwise generate the XPath expression //node.tag[@class=node.class];
The first expression selects, among all descendants of the current node, the nodes whose tag type is node.tag and whose id attribute is node.id; the second selects, among all descendants of the current node, the nodes whose tag type is node.tag and whose class attribute is node.class.
Step C53: add the generated XPath expression to the Wrapper's set of XPath expressions. The Wrapper is thus the set of XPath expressions generated from the tag types and tag attributes of the nodes.
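Steps C51 to C53 amount to a small string-building routine, sketched below. Quoting the attribute value with single quotes, skipping duplicate expressions and representing each node as a (tag, attributes) pair are assumptions of this example; values that themselves contain a single quote are not handled.

```python
def xpath_for(tag, attrib):
    """Step C52: prefer the id attribute, fall back to the class attribute."""
    node_id = attrib.get("id")
    if node_id:
        return f"//{tag}[@id='{node_id}']"
    node_class = attrib.get("class", "")
    return f"//{tag}[@class='{node_class}']"


def build_wrapper(nodes):
    """Steps C51 and C53: one expression per node in S, without duplicates."""
    wrapper = []
    for tag, attrib in nodes:
        expression = xpath_for(tag, attrib)
        if expression not in wrapper:
            wrapper.append(expression)
    return wrapper


# Hypothetical contents of the set S, for illustration only.
S = [
    ("div", {"id": "content", "class": "main"}),
    ("td", {"class": "detail"}),
    ("td", {"class": "detail"}),
]
wrapper = build_wrapper(S)
```

An id is preferred because it normally identifies a single element on a page, while a class may be shared by several elements.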
Step D: use the wrapper generated in step C to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information.
The text content of the detail page is extracted as follows: create a standard DOM tree for the bidding web page, select nodes in the DOM tree using the XPath expressions in the Wrapper, and extract the text of all descendant nodes rooted at each selected node.
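Step D can be illustrated with Python's standard `xml.etree.ElementTree`, which supports the small XPath subset needed here. This is only a sketch under stated assumptions: the sample page is well-formed markup chosen so that it parses as XML (real bidding pages would generally need a tolerant HTML parser), and the page content and the name `extract_text` are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_text(page, wrapper):
    """Apply every XPath expression in the wrapper and collect descendant text."""
    root = ET.fromstring(page)
    results = []
    for expression in wrapper:
        # ElementTree accepts only relative paths on an element,
        # so "//tag[...]" is applied as ".//tag[...]".
        for node in root.findall("." + expression):
            pieces = [piece.strip() for piece in node.itertext() if piece.strip()]
            if pieces:
                results.append(" ".join(pieces))
    return results


# Hypothetical, well-formed detail page, for illustration only.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><h1>Road repair project</h1><p>Winner: <b>Company X</b></p></div>
</body></html>"""

extracted = extract_text(PAGE, ["//div[@id='content']"])
```

`itertext()` walks the selected node and all of its descendants in document order, which matches the requirement to extract the text of every descendant rooted at the selected node.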
The above are preferred embodiments of the present invention. Any change made according to the technical scheme of the present invention whose functional effect does not exceed the scope of that technical scheme falls within the protection scope of the present invention.

Claims (6)

1. A method for extracting bid-winning information from a bidding website based on a DOM tree, characterized in that the method comprises the following steps:
Step A: crawling the bid-winning information list page of the bidding website to obtain the title of each bid-winning item shown in the list page and the link to its detail page, and fetching the HTML code of the detail page through the link, wherein the title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N such items form the data set used to generate a wrapper; N is a natural number not less than 1;
Step B: traversing the data set of N items obtained in step A and, for each item, creating a DOM tree from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, yielding N DOM trees;
Step C: generating a Wrapper from the N DOM trees created in step B;
wherein generating the Wrapper in step C from the N DOM trees created in step B specifically includes the following steps:
Step C1: randomly selecting one of the N DOM trees and adding its root node to the set S;
Step C2: for the current node, collecting the length of the text it contains together with the lengths of the text contained in the nodes with the same tag type and tag attributes in the other N-1 DOM trees, and counting how many nodes share each text length; if the largest of these counts is less than N/2, considering the node to contain bid-winning information and proceeding to step C3; otherwise considering the node not to be bid-winning information, setting its flag to false and jumping to step C4;
Step C3: performing step C2 on each child node of the node in turn;
Step C4: searching the set S for an ancestor node R of the node; if R exists, replacing R in S with the other child nodes of R whose flag is true together with the sibling nodes of the node whose flag is true; if no such ancestor exists, deleting the node from the result set;
Step C5: generating the Wrapper from the nodes in the set S, the Wrapper being a set of XPath expressions built from tag types and tag attributes;
Step D: using the wrapper generated in step C to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information;
wherein in step D a standard DOM tree is created for the bidding web page, nodes in the DOM tree are selected using the XPath expressions in the Wrapper, and the text of all descendant nodes rooted at each selected node is extracted.
2. The method for extracting bid-winning information from a bidding website based on a DOM tree according to claim 1, wherein in step B, creating a DOM tree for each item of bid-winning item data in the data set from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page specifically includes the following steps:
Step B1: parsing the HTML code of the bid-winning item detail page and creating a DOM tree, defining eight attributes for each node, Node = {label, attrib, text, parent, children, prebro, nextbro, flag}, which represent the tag type, tag attributes, text, parent node, child nodes, previous sibling node, next sibling node, and a flag marking whether the node is bid-winning information, wherein the tag attributes comprise the id, class and href of the tag, and flag is initialized to true;
Step B2: removing from the DOM tree the nodes of decorative tags, including <head></head>, <script></script> and <style></style>; since <head> holds the page's header information, <script> holds scripts and <style> holds styles, none of them can contain bid-winning information, so they are deleted;
Step B3: using the bid-winning item title obtained from the list page, locating the node P containing the title in the DOM tree by fuzzy matching;
Step B4: replacing all sibling nodes that precede P at P's level with a custom node, and repeating this for the parent of P, recursively, until the root node is reached;
Step B5: starting from the root node, where the <html> tag is located, performing a breadth-first traversal and post-processing the nodes of the DOM tree according to their tags and tag attributes.
3. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in step B3, the node P containing the bid-winning item title obtained from the list page is searched for in the DOM tree by a fuzzy matching method, specifically as follows:
performing a depth-first traversal of the DOM tree and finding the first matching node by the following test:
[Formula image not reproduced in this text; per the definitions that follow, the test relates LCS(S, T) to S.length]
where S denotes the title in the list page, T denotes the string to be compared, LCS(S, T) computes the length of the longest common subsequence of the two strings, and S.length denotes the string length of the title in the list page.
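The fuzzy match of claim 3 can be sketched as follows. The formula itself is an image that is missing from this text, so the sketch assumes the test compares the ratio LCS(S, T) / S.length against a threshold; both that form and the threshold parameter are assumptions, and no threshold value is implied.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (row-by-row DP)."""
    prev = [0] * (len(t) + 1)
    for sc in s:
        curr = [0]
        for j, tc in enumerate(t, start=1):
            if sc == tc:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_title_match(s, t, threshold):
    """Assumed test: share of the list-page title s that is recovered in t.

    threshold is a hypothetical parameter; the patent's value is not
    recoverable from this text.
    """
    if not s:
        return False
    return lcs_length(s, t) / len(s) >= threshold
```

Normalizing by the title length rather than by the candidate length means a long node text that contains the whole title still scores 1.0, which suits titles that are embedded in longer headings.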
4. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in step B4, a self-defined node replaces the sibling nodes located before node P at the level of node P, wherein the self-defined node is defined as a <div class="# equals"></div> tag whose tag attributes include a class attribute and whose text is not null; its purpose is that, when the DOM tree is post-processed in step B5, the self-defined node is retained rather than deleted; and its flag attribute is set to false, indicating that the self-defined node is not bid-winning item information.
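The replacement in step B4 can be sketched on an xml.etree.ElementTree tree. Several points are assumptions: a single placeholder node stands in for the whole run of preceding siblings at each level, the class value "placeholder" is hypothetical (the value in the translated claim is garbled), and the flag bookkeeping is left out.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER_CLASS = "placeholder"  # hypothetical class value


def replace_preceding_siblings(root, p):
    """Walk from node p up to the root, replacing earlier siblings each level."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = p
    while node in parent_of:
        parent = parent_of[node]
        siblings = list(parent)
        preceding = siblings[: siblings.index(node)]
        if preceding:
            for sibling in preceding:
                parent.remove(sibling)
            # Assumption: one placeholder replaces the whole preceding run.
            parent.insert(0, ET.Element("div", {"class": PLACEHOLDER_CLASS}))
        node = parent
    return root
```

After the call, everything that appeared before the title node in document order has been collapsed into placeholders, so only content at or after the title remains as a candidate for bid-winning information.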
5. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in step B5, a breadth-first traversal is performed starting from the root node where the <html> tag is located, and the nodes in the DOM tree are post-processed according to their tags and tag attributes, which specifically includes the following steps:
step B51: sequentially adding all child nodes of the root node of the DOM tree to a queue Q;
step B52: if the queue Q is not empty, popping the head node q from the queue Q; if the queue Q is empty, ending the process;
step B53: if node q is an <a> tag whose tag attributes contain href and the value of href is not null, deleting node q from the DOM tree and then jumping to step B56; otherwise, executing step B54;
step B54: if node q is a <p> tag and its tag attributes have no id attribute, executing step B55; otherwise, jumping to step B56;
step B55: if node q is a leaf node, directly deleting node q and then jumping to step B52; otherwise, replacing node q with all child nodes of q and jumping to step B52;
step B56: if the tag attributes of node q include a class attribute or an id attribute, jumping to step B55; otherwise, adding all child nodes of q to the queue Q and jumping to step B52.
6. The method for extracting bid-winning information from a bidding website based on a DOM tree as claimed in claim 1, wherein: in step C5, a Wrapper is generated from the nodes in the set S, the generated Wrapper being a set of XPath path expressions generated from the tag type and the tag attributes, which specifically includes the following steps:
step C51: traversing all nodes in the set, and performing step C52 on each of them;
step C52: if the tag attributes of the node include an id attribute and the value of the id attribute is not null, generating the XPath path expression //node.tag[@id=node.id]; otherwise, generating the XPath path expression //node.tag[@class=node.class];
the XPath path expression //node.tag[@id=node.id] denotes the nodes, among all descendant nodes rooted at the current node, whose tag type is node.tag and whose id attribute is node.id; the XPath path expression //node.tag[@class=node.class] denotes the nodes, among all descendant nodes rooted at the current node, whose tag type is node.tag and whose class attribute is node.class;
step C53: adding the generated XPath path expression to the XPath path expression set of the Wrapper; the Wrapper represents a set of XPath path expressions generated from the tag types and tag attributes of nodes.
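Steps C51 to C53 can be sketched as a short generator of XPath strings. The function names xpath_for and build_wrapper are hypothetical, nodes are represented as (tag, attributes) pairs for brevity, and attribute values are wrapped in single quotes because XPath requires string literals (the claim writes them unquoted). The claim does not say what happens when a node has neither id nor class, so this sketch simply assumes class is present in that branch.

```python
def xpath_for(tag, attrib):
    """Step C52: prefer a non-empty id, otherwise fall back to class."""
    node_id = attrib.get("id")
    if node_id:
        return f"//{tag}[@id='{node_id}']"
    # Assumption: class exists here; a missing class raises KeyError because
    # that case is not covered by the claim.
    return f"//{tag}[@class='{attrib['class']}']"


def build_wrapper(nodes):
    """Steps C51 and C53: collect one expression per node, without duplicates.

    nodes is an iterable of (tag, attrib) pairs; order of first appearance
    is preserved.
    """
    return list(dict.fromkeys(xpath_for(tag, attrib) for tag, attrib in nodes))
```

Preferring id over class reflects that an id is normally unique within a page, so the resulting expression selects a single subtree, while a class expression may select several.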
CN201810301630.0A 2018-04-04 2018-04-04 Bid winning information extraction method for bidding website based on DOM tree Active CN108563729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810301630.0A CN108563729B (en) 2018-04-04 2018-04-04 Bid winning information extraction method for bidding website based on DOM tree

Publications (2)

Publication Number Publication Date
CN108563729A CN108563729A (en) 2018-09-21
CN108563729B CN108563729B (en) 2022-04-01

Family

ID=63534214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810301630.0A Active CN108563729B (en) 2018-04-04 2018-04-04 Bid winning information extraction method for bidding website based on DOM tree

Country Status (1)

Country Link
CN (1) CN108563729B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657180B (en) * 2018-12-11 2021-11-26 中科国力(镇江)智能技术有限公司 Intelligent automatic fuzzy extraction system for webpage content
CN109726341A (en) * 2018-12-28 2019-05-07 四川新网银行股份有限公司 A kind of automatic abstracting method of webpage information based on Web page classifying and cluster
CN110059085B (en) * 2019-03-18 2021-02-26 浙江工业大学 Web 2.0-oriented JSON data analysis and modeling method
CN110502680A (en) * 2019-08-27 2019-11-26 重庆大司空信息科技有限公司 A kind of abstracting method and device of acceptance of the bid bulletin relevant field
CN111708967B (en) * 2020-06-11 2023-05-16 浙江浙大网新国际软件技术服务有限公司 Fingerprint identification method based on sitemap
CN111966930B (en) * 2020-08-17 2021-05-04 山东亿云信息技术有限公司 Webpage list analyzing method and system based on XPath sequence
CN113409111A (en) * 2021-06-15 2021-09-17 广州比地数据科技有限公司 Bidding information processing method, system and readable storage medium
CN113779235B (en) * 2021-09-13 2024-02-02 北京市律典通科技有限公司 Word document outline recognition processing method and device
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115017430A (en) * 2022-06-27 2022-09-06 京东科技控股股份有限公司 List page determination method and device, electronic equipment and storage medium
CN116362223B (en) * 2023-03-07 2023-12-15 北京粉笔蓝天科技有限公司 Automatic identification method and device for web page article titles and texts

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102467501A (en) * 2010-10-29 2012-05-23 北大方正集团有限公司 Method and system for extracting news record metadata from news list page
CN102831121A (en) * 2011-06-15 2012-12-19 阿里巴巴集团控股有限公司 Method and system for extracting webpage information
CN102890681A (en) * 2011-07-20 2013-01-23 阿里巴巴集团控股有限公司 Method and system for generating webpage structure template
CN103870506A (en) * 2012-12-17 2014-06-18 中国科学院计算技术研究所 Webpage information extraction method and system
CN104462540A (en) * 2014-12-24 2015-03-25 中国科学院声学研究所 Webpage information extraction method
CN105912633A (en) * 2016-04-11 2016-08-31 上海大学 Sparse sample-oriented focus type Web information extraction system and method
CN106250456A (en) * 2016-07-28 2016-12-21 浪潮软件集团有限公司 Bid winning announcement extraction method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9003552B2 (en) * 2010-12-30 2015-04-07 Ensighten, Inc. Online privacy management

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
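Design and Implementation of a Web Information Extraction System (Web信息抽取系统的设计与实现);Pi Shan (皮珊);China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库);20140515;pp. I139-161 *
Autonomous Extraction of List Information from Web Pages (Web页面列表信息的自主抽取);Hou Kun (侯锟);《科技广场》;20070331;pp. 117-118 *
Wrapper Generation for Automatic Data Extraction from Large Web Sites;Nitin Jindal;《SpringerLink》;20051231;pp. 34-53 *
Web Information Extraction from List Pages of Flat Data Records (平坦数据记录列表页的web信息抽取);Li Gui (李贵);Computer Science (计算机科学);20100731;pp. 203-205, continued on p. 252 *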

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant