CN108563729B

CN108563729B - Bid winning information extraction method for bidding website based on DOM tree

Info

Publication number: CN108563729B
Application number: CN201810301630.0A
Authority: CN
Inventors: 陈羽中; 林剑; 郭昆; 张伟智
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-04-04
Filing date: 2018-04-04
Publication date: 2022-04-01
Anticipated expiration: 2038-04-04
Also published as: CN108563729A

Abstract

The invention provides a bid winning information extraction method for a bidding website based on a DOM tree. First, the bid-winning information list page of the bidding website is collected to obtain the title of each bid-winning item displayed in the list page and the link to its detail page, and the HTML code of the detail page is obtained through the link; a title and the corresponding detail-page HTML code form one item of bid-winning item data, and N such items form a data set. For each item in the data set, a DOM tree is created from the title of the bid-winning item in the list page and the corresponding HTML code, so that traversing the data set generates N DOM trees. A wrapper is then generated from the N DOM trees. Finally, the wrapper is used to extract the text content of a bid-winning item detail page, that is, the bid-winning item information. The method can improve the accuracy of bid-winning information extraction and reduce the total execution time of the task.

Description

Bid winning information extraction method for bidding website based on DOM tree

Technical Field

The invention relates to the technical field of wrappers (Wrapper), in particular to a bid winning information extraction method for a bidding website based on a DOM tree.

Background

Information extraction from web pages is mainly performed by a wrapper. A wrapper is a software program consisting of a set of predefined information extraction rules and the program code that applies them. In response to a user's query against a specific information source, the wrapper locates the relevant data in the pages of that source, extracts it, converts it into a specified format, and returns it to the user. A wrapper typically targets one class of pages in a particular information source, so extracting data from several different sources requires a library of wrappers.

At present, almost all web pages contain some amount of template content, such as navigation bars, organization logos, contact information and advertisement banners, and this content recurs across all the pages of the same organization. Research on extracting the body text of web pages has attracted a growing number of experts and scholars, and several algorithms have been proposed. One is a precise extraction algorithm for news web page body text based on a two-layer decision: a global decision on the region in which the body text lies, and a local decision on whether each text segment within that region belongs to the body text. Another is a DOM-based algorithm for automatically extracting web page topic information, which proposes an STU-DOM tree model carrying semantic information to address the semi-structured nature of HTML and the lack of semantic description in the DOM specification. A third is a body text extraction method based on a tag window, which locates the body text by comparing candidate text against the text of the title. A fourth is an automatic body text extraction method for short-text web pages, which divides the web page into short texts by word count and then decides whether each short text is body text by finding the nodes and positions with the maximum text density.

At present, most web page body text extraction algorithms classify content using text density and tag density as features. They do not take into account that, across similar web pages, the body text sits under the same tags while its content differs, and they do not handle short body text well.

Disclosure of Invention

The invention aims to provide a bid winning information extraction method for a bidding website based on a DOM tree.

To achieve this object, the technical scheme of the invention is a bid winning information extraction method for a bidding website based on a DOM tree, comprising the following steps:

Step A: Collect the bid-winning information list page of the bidding website to obtain the title of each bid-winning item displayed in the list page and the link to its detail page, and obtain the HTML code of the detail page through the link. The title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N items of bid-winning item data form the data set used to generate the wrapper, where N is a natural number not less than 1.

Step B: Traverse the data set of N items of bid-winning item data obtained in step A and, for each item, create a DOM tree from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, generating N DOM trees.

Step C: Generate a Wrapper from the N DOM trees created in step B.

Step D: Use the wrapper generated in step C to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information.

In an embodiment of the invention, creating a DOM tree in step B for each item of bid-winning item data, from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, comprises the following steps:

Step B1: Parse the HTML code of the bid-winning item detail page and create a DOM tree, defining eight attributes for each node, so that a node is represented as Node = {label, attrib, text, parent, children, prebro, nextbro, flag}. These denote, respectively, the tag type, the tag attributes, the text, the parent node, the child nodes, the previous sibling node, the next sibling node, and a flag indicating whether the node is bid-winning information. The tag attributes attrib comprise the id, class and href of the tag, and flag is initialized to true.

Step B2: Remove from the DOM tree the nodes of decorative tags, including <head></head>, <script></script>, <style></style> and the like. Because <head></head> holds the header information of the page, <script></script> holds scripts and <style></style> holds styles, none of them can contain bid-winning information, so they are deleted.

Step B3: Using the bid-winning item title obtained from the list page, locate by fuzzy matching the node P in the DOM tree that contains the title.

Step B4: Replace all sibling nodes that precede P at P's level with a custom node, and repeat this step recursively on the parent node of P until the root node is reached.

Step B5: Starting from the root node holding the <html> tag, perform a breadth-first traversal and post-process the nodes of the DOM tree according to their tags and tag attributes.

Further, in step B3, the node P containing the title is located in the DOM tree by fuzzy matching against the bid-winning item title obtained from the list page, as follows: perform a depth-first traversal of the DOM tree and take the first node that satisfies the following condition:

where S denotes the title in the list page, T denotes the string being compared, LCS(S, T) is the length of the longest common subsequence of the two strings, and S.length is the string length of the title in the list page.
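The matching in step B3 can be illustrated with a minimal Python sketch. The quantities LCS(S, T) and S.length come from the text above; the exact comparison is not reproduced in this text, so the ratio test and the threshold value 0.8 used here are assumptions, and `lcs_length`, `title_match` and `find_title_node` are hypothetical helper names. Nodes are modeled as simple (text, children) tuples for brevity.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (dynamic programming)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tj in enumerate(t, start=1):
            if ch == tj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def title_match(title, candidate, threshold=0.8):
    """Assumed criterion: the LCS covers at least `threshold` of the title."""
    if not title:
        return False
    return lcs_length(title, candidate) / len(title) >= threshold


def find_title_node(node, title, threshold=0.8):
    """Depth-first (pre-order) search; returns the first matching node or None."""
    text, children = node
    if title_match(title, text, threshold):
        return node
    for child in children:
        hit = find_title_node(child, title, threshold)
        if hit is not None:
            return hit
    return None


# Illustrative tree: a navigation text and a heading that contains the title.
tree = ("", [("Contact us", []), ("", [("Road repair award notice", [])])])
match = find_title_node(tree, "Road repair award")
```

Using a subsequence rather than an exact substring tolerates small differences between the list-page title and the detail-page heading, such as truncation or inserted punctuation.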

Further, in step B4, a custom node replaces the sibling nodes that precede P at P's level. The custom node is defined as <div class="# equals"></div>; its tag attributes include a class attribute and its text is not null, so that it is retained rather than deleted when the DOM tree is post-processed in step B5, and its flag attribute is set to false to indicate that it is not bid-winning item information.

Further, in step B5, performing a breadth-first traversal from the root node holding the <html> tag and post-processing the nodes of the DOM tree according to their tags and tag attributes comprises the following steps:

Step B51: Add all child nodes of the root node of the DOM tree to a queue Q in order.

Step B52: If the queue Q is not empty, pop the head node q from Q; if Q is empty, end the process.

Step B53: If node q is an <a> tag whose tag attributes contain href with a non-null value, delete node q from the DOM tree and jump to step B56; otherwise execute step B54.

Step B54: If node q is a <p> tag and its tag attributes have no id attribute, execute step B55; otherwise jump to step B56.

Step B55: If node q is a leaf node, delete it directly and jump to step B52; otherwise replace node q with all of its child nodes and jump to step B52.

Step B56: If the tag attributes of node q include a class attribute or an id attribute, jump to step B55; otherwise add all child nodes of q to the queue and jump to step B52.

In an embodiment of the invention, generating the Wrapper in step C from the N DOM trees created in step B comprises the following steps:

Step C1: Randomly select one of the N DOM trees and add its root node to the set S.

Step C2: Count the length of the text contained in the node together with the text lengths of the nodes in the other N-1 DOM trees that have the same tag type and tag attributes, that is, count how many nodes correspond to each distinct text length. If the largest of these counts is less than N/2, the node is considered to contain bid-winning information and the process jumps to step C3; otherwise the node is considered not to be bid-winning information, its flag value is set to false, and the process jumps to step C4.

Step C3: Execute step C2 in turn for each child node of the node.

Step C4: Search the set S for an ancestor node R of the node. If R exists, replace R in S with the other child nodes of R whose flag value is true together with the sibling nodes of the node whose flag value is true; if no ancestor node exists, delete the node from the result set.

Step C5: Generate the Wrapper from the nodes in the set S; the generated Wrapper is a set of XPath path expressions built from tag types and tag attributes.

Further, in step C5 the Wrapper is generated from the nodes in the set S as a set of XPath path expressions built from tag types and tag attributes, comprising the following steps:

Step C51: Traverse all nodes in the set S and perform step C52 on each.

Step C52: If the tag attributes of the node include an id attribute whose value is not null, generate the XPath path expression //node.tag[@id=node.id]; otherwise generate the XPath path expression //node.tag[@class=node.class]. The expression //node.tag[@id=node.id] selects, among all descendant nodes rooted at the current node, the nodes whose tag type is node.tag and whose id attribute is node.id; the expression //node.tag[@class=node.class] selects, among all descendant nodes rooted at the current node, the nodes whose tag type is node.tag and whose class attribute is node.class.

Step C53: Add the generated XPath path expression to the Wrapper's set of XPath path expressions. The Wrapper is thus the set of XPath path expressions generated from the tag types and tag attributes of the nodes.

In an embodiment of the invention, in step D a standard DOM tree is created for the bidding web page, nodes in the DOM tree are selected with the XPath path expressions in the Wrapper, and the text of all descendant nodes rooted at each selected node is extracted.

Compared with the prior art, the invention has the following advantage: by processing a data set of bid-winning items from a bidding website, it generates a wrapper that can automatically extract bid-winning data, which improves the accuracy of bid-winning information extraction while reducing the total execution time of the task.

Drawings

Fig. 1 is a flowchart of the bid winning information extraction method for a bidding website based on a DOM tree according to the present invention.

Fig. 2 is an exemplary diagram of step B55.

Fig. 3 is an exemplary diagram of step C4.

Detailed Description

The invention is further explained below with reference to the figures and the specific embodiments.

Fig. 1 is a flowchart of the bid winning information extraction method for a bidding website based on a DOM tree according to the present invention. First, the bid-winning information list page of the bidding website is collected to obtain the title of each bid-winning item displayed in the list page and the link to its detail page, and the HTML code of the detail page is obtained through the link; the title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N such items form the data set used to generate the wrapper. The data set is then traversed, and for each item a DOM tree is created from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page, so that N DOM trees have been generated once the traversal is complete. A wrapper is generated from the N DOM trees. Finally, the generated wrapper is used to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information. As shown in Fig. 1, the method comprises the following steps:

Step A: Collect the bid-winning information list page of the bidding website to obtain the title of each bid-winning item displayed in the list page and the link to its detail page, and obtain the HTML code of the detail page through the link. The title of a bid-winning item in the list page and the HTML code of the corresponding detail page form one item of bid-winning item data, and N items of bid-winning item data form the data set used to generate the wrapper. N is a natural number not less than 1.
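As an illustration of step A, the following Python sketch builds the data set from a list page using only the standard library. It is a sketch under stated assumptions rather than the patented implementation: the names `ListPageParser` and `build_dataset`, the rule that every anchor with a non-empty href and text is a bid-winning item, and the sample URLs are all hypothetical, and `fetch` stands in for whatever routine downloads a detail page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListPageParser(HTMLParser):
    """Collects (title, detail-page URL) pairs from the anchors of a list page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []      # collected (title, url) pairs
        self._href = None    # resolved href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.items.append((title, self._href))
            self._href = None


def build_dataset(list_html, base_url, fetch):
    """Return one {'title', 'html'} record per bid-winning item.

    `fetch` is a caller-supplied function mapping a URL to its HTML text.
    """
    parser = ListPageParser(base_url)
    parser.feed(list_html)
    return [{"title": title, "html": fetch(url)} for title, url in parser.items]


# Usage with an in-memory stand-in for the network (hypothetical data).
LIST_HTML = '<ul><li><a href="/d/1.html">Road repair project award notice</a></li></ul>'
PAGES = {"http://example.com/d/1.html": "<html><body><h1>Road repair</h1></body></html>"}
dataset = build_dataset(LIST_HTML, "http://example.com/list", PAGES.__getitem__)
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to test offline.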

Step B: Traverse the data set of N items of bid-winning item data obtained in step A and, for each item, create a DOM tree from the title of the bid-winning item in the list page and the HTML code of the corresponding detail page. After the data set has been traversed, N DOM trees have been generated.

Specifically, step B comprises the following steps:

Step B1: Parse the HTML code of the bid-winning item detail page and create a DOM tree, defining eight attributes for each node, so that a node is represented as Node = {label, attrib, text, parent, children, prebro, nextbro, flag}. These denote, respectively, the tag type, the tag attributes, the text, the parent node, the child nodes, the previous sibling node, the next sibling node, and a flag indicating whether the node is bid-winning information. The tag attributes attrib comprise the id, class and href of the tag, and flag is initialized to true.
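The node structure of step B1 can be sketched in Python as follows. The eight field names follow the Node definition above; everything else is an assumption made for illustration: the `TreeBuilder` class, the `VOID_TAGS` set, the synthetic `document` root placed above the `<html>` node, and the choice to concatenate stripped text fragments.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass(eq=False, repr=False)  # avoid recursive repr/eq through parent links
class Node:
    label: str                                             # tag type
    attrib: Dict[str, str] = field(default_factory=dict)   # id, class, href
    text: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    prebro: Optional["Node"] = None                        # previous sibling
    nextbro: Optional["Node"] = None                       # next sibling
    flag: bool = True                                      # initialized to true


class TreeBuilder(HTMLParser):
    """Builds a Node tree under a synthetic 'document' root."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1]
        kept = {k: v or "" for k, v in attrs if k in ("id", "class", "href")}
        node = Node(tag, kept, parent=parent)
        if parent.children:
            node.prebro = parent.children[-1]
            parent.children[-1].nextbro = node
        parent.children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].label == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse('<html><body><div id="main"><p>Winner: ACME</p><p>Price</p></div></body></html>')
div = root.children[0].children[0].children[0]  # document -> html -> body -> div
```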

Step B2: Remove from the DOM tree the nodes of decorative tags, including <head></head>, <script></script>, <style></style> and the like. Because <head></head> holds the header information of the page, <script></script> holds scripts and <style></style> holds styles, none of them can contain bid-winning information, so they are deleted.
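A minimal Python sketch of the cleaning in step B2 follows. The `Node` class here is a reduced stand-in that keeps only the label, parent and children of the structure described in step B1, and the `REMOVABLE` set lists only the three tags named above; the text says "and the like", so a real list would probably be longer. The function name `clean` is hypothetical.

```python
class Node:
    """Reduced stand-in for a DOM node: label, parent and children only."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


# Only the tags named in step B2; extend as needed.
REMOVABLE = {"head", "script", "style"}


def clean(node):
    """Remove, in place, every descendant whose tag is in REMOVABLE."""
    node.children = [c for c in node.children if c.label not in REMOVABLE]
    for child in node.children:
        clean(child)
    return node


def labels(node):
    """Pre-order list of tag names, used here to inspect the result."""
    return [node.label] + [name for c in node.children for name in labels(c)]


tree = Node("html", [
    Node("head", [Node("title")]),
    Node("body", [Node("script"), Node("div", [Node("style"), Node("p")])]),
])
clean(tree)
```

Removing a node also removes its whole subtree, which is why the `title` under `head` disappears without being listed in `REMOVABLE`.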

Step B3: Using the bid-winning item title obtained from the list page, locate by fuzzy matching the node P in the DOM tree that contains the title.

Preferably, the node P is located as follows: perform a depth-first traversal of the DOM tree and take the first node that satisfies the following condition:

Step B4: Replace all sibling nodes that precede P at P's level with a custom node, and repeat this step recursively on the parent node of P until the root node is reached.

Preferably, in step B4 the custom node that replaces the sibling nodes preceding P at P's level is defined as <div class="# equals"></div>. Its tag attributes include a class attribute and its text is not null, so that it is retained rather than deleted when the DOM tree is post-processed in step B5, and its flag attribute is set to false to indicate that it is not bid-winning item information.
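Step B4 can be sketched in Python as below. Several points are assumptions: that all preceding siblings at one level collapse into a single placeholder (the text could also be read as one placeholder per sibling), the placeholder's class value and text `"#"`, and the names `make_placeholder` and `mask_preceding`. The `Node` class is a reduced stand-in for the structure of step B1.

```python
class Node:
    """Reduced stand-in for a DOM node."""

    def __init__(self, label, attrib=None, text="", children=()):
        self.label = label
        self.attrib = attrib or {}
        self.text = text
        self.flag = True
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def make_placeholder():
    """Custom node: has a class attribute, non-empty text, and flag False."""
    node = Node("div", {"class": "#"}, text="#")  # values are illustrative
    node.flag = False
    return node


def mask_preceding(p):
    """From p up to the root, collapse the siblings that come before the path."""
    node = p
    while node.parent is not None:
        parent = node.parent
        idx = next(i for i, c in enumerate(parent.children) if c is node)
        if idx > 0:
            placeholder = make_placeholder()
            placeholder.parent = parent
            parent.children[:idx] = [placeholder]
        node = parent


title = Node("h1", text="Award notice")
content = Node("div", children=[Node("span"), title, Node("p", text="Winner")])
body = Node("body", children=[Node("div", {"class": "nav"}),
                              Node("div", {"class": "banner"}),
                              content])
root = Node("html", children=[body])
mask_preceding(title)
```

Everything that appears before the title in document order is typically site chrome, so collapsing it shrinks the tree that later steps must compare while leaving the title and what follows it untouched.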

Step B5: Starting from the root node holding the <html> tag, perform a breadth-first traversal and post-process the nodes of the DOM tree according to their tags and tag attributes.

Specifically, step B5 comprises the following steps:

Step B51: Add all child nodes of the root node of the DOM tree to a queue Q in order.

Step B52: If the queue Q is not empty, pop the head node q from Q; if Q is empty, end the process.

Step B53: If node q is an <a> tag whose tag attributes contain href with a non-null value, delete node q from the DOM tree and jump to step B56; otherwise execute step B54.

Step B54: If node q is a <p> tag and its tag attributes have no id attribute, execute step B55; otherwise jump to step B56.

Step B55: If node q is a leaf node, delete it directly and jump to step B52; otherwise replace node q with all of its child nodes and jump to step B52.

Fig. 2 shows an example of step B55 in a specific embodiment of the invention, taking node C as the example.
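The operation of step B55 can be illustrated with a short Python sketch. The tree used here (A with children B, C, D, and C with children E, F) is an assumed shape chosen for illustration and is not taken from Fig. 2; `remove_or_unwrap` is a hypothetical name, and `Node` is a reduced stand-in for the structure of step B1.

```python
class Node:
    """Reduced stand-in for a DOM node: label, parent and children only."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def remove_or_unwrap(q):
    """Step B55: delete q if it is a leaf, else put its children in its place."""
    parent = q.parent
    idx = next(i for i, c in enumerate(parent.children) if c is q)
    promoted = list(q.children)
    for child in promoted:
        child.parent = parent
    # An empty `promoted` list simply deletes q; otherwise the children of q
    # take q's position and keep their relative order.
    parent.children[idx:idx + 1] = promoted
    q.parent = None
    return promoted


b, d = Node("B"), Node("D")
c = Node("C", [Node("E"), Node("F")])
a = Node("A", [b, c, d])

remove_or_unwrap(c)                      # non-leaf: E and F replace C
after_unwrap = [n.label for n in a.children]
remove_or_unwrap(b)                      # leaf: B is deleted
after_delete = [n.label for n in a.children]
```

Handling both cases with one slice assignment keeps the order of the remaining siblings intact, which matters because later steps rely on sibling order.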

Step B56: If the tag attributes of node q include a class attribute or an id attribute, jump to step B55; otherwise add all child nodes of q to the queue and jump to step B52.

Step C: Generate a wrapper from the N DOM trees created in step B.

Specifically, step C comprises the following steps:

Step C1: Randomly select one of the N DOM trees and add its root node to the set S.

Step C2: Count the length of the text contained in the node together with the text lengths of the nodes in the other N-1 DOM trees that have the same tag type and tag attributes, that is, count how many nodes correspond to each distinct text length. If the largest of these counts is less than N/2, the node is considered to contain bid-winning information and the process jumps to step C3; otherwise the node is considered not to be bid-winning information, its flag value is set to false, and the process jumps to step C4.
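The test in step C2 reduces to a small statistic, sketched here in Python. The function name `carries_bid_info` and the sample lengths are hypothetical; the sketch assumes the caller has already gathered the text lengths of the node and of its counterparts (same tag type and tag attributes) in the other trees, so the list is non-empty.

```python
from collections import Counter


def carries_bid_info(text_lengths, n_trees):
    """True when no single text length is shared by at least half of the trees.

    text_lengths: lengths of the text under the node and under the nodes with
    the same tag type and tag attributes in the other trees (non-empty).
    n_trees: N, the number of DOM trees in the data set.
    """
    most_common_count = Counter(text_lengths).most_common(1)[0][1]
    return most_common_count < n_trees / 2


# Hypothetical example with N = 4 pages.
body_lengths = [120, 87, 143, 95]   # differs on every page
nav_lengths = [12, 12, 12, 12]      # identical template text on every page
```

The intuition is that template regions repeat the same text, and hence the same length, on most pages, whereas the region holding the award details changes from page to page.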

Step C3: Execute step C2 in turn for each child node of the node.

Step C4: Search the set S for an ancestor node R of the node. If R exists, replace R in S with the other child nodes of R whose flag value is true together with the sibling nodes of the node whose flag value is true; if no ancestor node exists, delete the node from the result set.

Fig. 3 shows an example of step C4. Suppose execution has reached node E and the set S is {A}. Node B has completed step C2 with a flag value of true, while nodes D and F have not yet been processed by step C2 and keep the initial value true, so the set S changes from {A} to {B, F, D}.

Step C5: Generate the Wrapper from the nodes in the set S; the generated Wrapper is a set of XPath path expressions built from tag types and tag attributes.

Specifically, step C5 comprises the following steps:

Step C51: Traverse all nodes in the set S and perform step C52 on each.

Step C52: If the tag attributes of the node include an id attribute whose value is not null, generate the XPath path expression //node.tag[@id=node.id]; otherwise generate the XPath path expression //node.tag[@class=node.class].

The expression //node.tag[@id=node.id] selects, among all descendant nodes rooted at the current node, the nodes whose tag type is node.tag and whose id attribute is node.id; the expression //node.tag[@class=node.class] selects, among all descendant nodes rooted at the current node, the nodes whose tag type is node.tag and whose class attribute is node.class.

Step C53: Add the generated XPath path expression to the Wrapper's set of XPath path expressions. The Wrapper is thus the set of XPath path expressions generated from the tag types and tag attributes of the nodes.
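Steps C51 to C53 can be sketched in Python as follows. Nodes are represented as (tag, attrib) pairs for brevity; the names `to_xpath` and `build_wrapper`, the quoting of attribute values, and the sample nodes are assumptions made for illustration.

```python
def to_xpath(tag, attrib):
    """Step C52: prefer the id attribute, fall back to the class attribute."""
    node_id = attrib.get("id")
    if node_id:
        return f'//{tag}[@id="{node_id}"]'
    node_class = attrib.get("class", "")
    return f'//{tag}[@class="{node_class}"]'


def build_wrapper(nodes):
    """Steps C51 and C53: one expression per node, duplicates removed.

    A list is returned instead of a set so that the order stays predictable.
    """
    return list(dict.fromkeys(to_xpath(tag, attrib) for tag, attrib in nodes))


# Hypothetical contents of the set S.
selected = [
    ("div", {"id": "content", "class": "main"}),
    ("td", {"class": "detail"}),
    ("td", {"class": "detail"}),
]
wrapper = build_wrapper(selected)
```

Preferring id over class reflects that an id is normally unique within a page, so the resulting expression is less likely to select unrelated nodes.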

Step D: Use the wrapper generated in step C to extract the text content of a bid-winning item detail page of the bidding website, that is, the bid-winning item information.

The text content of the bid-winning item detail page is extracted as follows: create a standard DOM tree for the bidding web page, select nodes in the DOM tree with the XPath path expressions in the Wrapper, and extract the text of all descendant nodes rooted at each selected node.
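Step D can be illustrated with the Python standard library. This sketch assumes the page is well-formed XHTML so that `xml.etree.ElementTree` can parse it; real bidding pages are often not well-formed and would need an HTML-tolerant parser instead. The `extract` function, the sample page and the sample wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract(page_xhtml, wrapper):
    """Apply each XPath expression and collect the text under every match."""
    root = ET.fromstring(page_xhtml)
    results = []
    for expr in wrapper:
        # ElementTree needs a relative path on an element, so "//x" becomes ".//x".
        for node in root.findall("." + expr):
            parts = [t.strip() for t in node.itertext() if t.strip()]
            results.append(" ".join(parts))
    return results


PAGE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div id="content"><h1>Award notice</h1><p>Winner: <b>ACME Ltd</b></p></div>
</body></html>"""

WRAPPER = ['//div[@id="content"]']
texts = extract(PAGE, WRAPPER)
```

Because `itertext()` walks the matched node and all of its descendants, nested markup such as the bold company name is included in the extracted text, while the navigation block outside the match is ignored.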

The above are preferred embodiments of the invention. Any change made according to the technical scheme of the invention whose functional effect does not go beyond the scope of that technical scheme falls within the protection scope of the invention.

Claims

1. A bid winning information extraction method for a bidding website based on a DOM tree is characterized in that: the method comprises the following steps:

step A: acquiring a title of each bid-winning item displayed in a list page and a link of a bid-winning item detail page through the collection of a bid-winning information list page of a bidding website, and acquiring an HTML code of the bid-winning item detail page through the link, wherein the title of the bid-winning item in the list page and the HTML code of the corresponding bid-winning item detail page form one item of bid-winning item data, and N items of bid-winning item data form a data set used for generating a wrapper; N is a natural number not less than 1;

step B: for each item of bid-winning item data in the data set, creating a DOM tree by using the title of the bid-winning item in the list page and the HTML code of the corresponding bid-winning item detail page, and traversing the data set containing the N items of bid-winning item data obtained in the step A to generate N DOM trees;

step C: generating a Wrapper by using the N DOM trees created in the step B;

in the step C, the method for generating the Wrapper by using the N DOM trees created in the step B specifically includes the following steps:

step C5: generating a Wrapper by using the nodes in the set S, wherein the generated Wrapper is an Xpath path expression set generated by using the label type and the label attribute;

step D: extracting the text content in the bid-winning item detail page of the bidding website, namely the bid-winning item information, by using the wrapper generated in the step C;

in the step D, creating a standard DOM tree for the bidding webpage, selecting nodes in the DOM tree by using an XPath path expression in the Wrapper, and extracting texts of all descendant nodes taking the nodes as roots.

2. The method for extracting bid-winning information from a bidding website based on a DOM tree as claimed in claim 1, wherein: in step B, for each item of bid-winning item data in the data set, a DOM tree is created using the title of the bid-winning item in the list page and the HTML code of the detail page of the corresponding bid-winning item, which specifically includes the following steps:

step B1: parsing the HTML code of the bid-winning item detail page, creating a DOM tree, and defining eight attributes for each node in the DOM tree, namely, a node is represented as Node = {label, attrib, text, parent, children, prebro, nextbro, flag}, which respectively represent a tag type, tag attributes, a text, a parent node, child nodes, a previous sibling node, a next sibling node and a flag indicating whether the node is bid-winning information, wherein the tag attributes comprise id, class and href of the tag, and the flag attribute is initialized to true;

step B2: removing the nodes where decorative tags are located in the DOM tree, wherein the decorative tags comprise <head></head>, <script></script> and <style></style>; because <head></head> is the header information of the page, <script></script> is script and <style></style> is style, they cannot include bid-winning information and are therefore deleted;

step B5: and starting from the root node where the < html > tag is located, performing breadth-first traversal, and performing post-processing on the nodes in the DOM tree according to the tags and tag attributes of the nodes.

3. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in the step B3, by obtaining the title of the bid-winning item from the list page, and by using a fuzzy matching method, the node P where the title is located is searched in the DOM tree, which is specifically as follows:

performing depth-first traversal on the DOM tree and finding out a first coincident node through the following judgment:

4. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in step B4, a custom node is used to replace the sibling nodes located before the P node in the layer where the P node is located, where the custom node is defined as a <div class="# equals"></div> tag, the tag attributes include a class attribute and the text is not null, so that the custom node is retained rather than deleted when the DOM tree is post-processed in step B5, and the flag attribute is set to false, which indicates that the custom node is not bid-winning item information.

5. The DOM tree based bid winning information extraction method of the bidding website of claim 2, wherein: in step B5, performing breadth-first traversal starting from the root node where the < html > tag is located, and performing post-processing on the node in the DOM tree according to the tag and the tag attribute of the node, specifically including the following steps:

step B56: and if the label attribute of the node label q has the class attribute or the id attribute, jumping to the step B55, otherwise, adding all child nodes into the queue and jumping to the step B52.

6. The method for extracting bid-winning information from a bidding website based on a DOM tree as claimed in claim 1, wherein: in the step C5, a Wrapper is generated by using the nodes in the set S, and the generated Wrapper is an Xpath path expression set generated by using the label type and the label attribute, which specifically includes the following steps:

the XPath path expression //node.tag[@id=node.id] selects, among all descendant nodes taking the current node as the root, the nodes whose tag type is node.tag and whose id attribute is node.id, and the XPath path expression //node.tag[@class=node.class] selects, among all descendant nodes taking the current node as the root, the nodes whose tag type is node.tag and whose class attribute is node.class;

step C53: adding the generated Xpath path into an Xpath path expression set of the Wrapper; wrapper represents a set of Xpath path expressions generated using the label type and label attributes of a node.