CN111444407A

CN111444407A - Automatic extraction method and system for page list information of web crawler

Info

Publication number: CN111444407A
Application number: CN202010222132.4A
Authority: CN
Inventors: 姜建武; 李景文; 陆妍玲
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2020-07-24
Anticipated expiration: 2040-03-26
Also published as: CN111444407B

Abstract

The invention relates to a method and a system for automatically extracting page list information of a web crawler. The method comprises the following steps: acquiring a hypertext markup language document of a page to be extracted; determining a set of hypertext markup language objects according to elements in the hypertext markup language document; traversing the hypertext markup language object set to determine a document object model; determining a webpage structure of the page to be extracted according to the document object model; the webpage structure comprises list items and list item attributes; determining an extraction mode according to the webpage structure of the page to be extracted; and extracting the page list information to be extracted by utilizing the extraction mode. The method and the system for automatically extracting the web crawler page list information provided by the invention realize the automatic capture of the web crawler page list information.

Description

Automatic extraction method and system for page list information of web crawler

Technical Field

The invention relates to the field of web crawlers, in particular to a method and a system for automatically extracting page list information of a web crawler.

Background

With the rapid development of information technology, massive amounts of data have become available. Extracting useful information from this mass of data quickly, efficiently and accurately places higher demands on information acquisition technology. Against this background, web crawler technology has developed rapidly and is widely applied because it makes information acquisition convenient, supports a variety of acquisition modes, and semi-automates information acquisition. However, in conventional web crawler technology a crawler script must be written according to the characteristics of each web page. The internet is full of information publishing platforms, systems and websites, and the way and format in which each page displays information vary widely, which makes web crawlers costly to develop. After a web page is redesigned, the crawler program must be updated accordingly, and capture stability also suffers. Moreover, page capture can only be achieved with manual intervention. Therefore, the prior art cannot capture web crawler page list information automatically.

Disclosure of Invention

The invention aims to provide a method and a system for automatically extracting web crawler page list information, which are used for automatically capturing the web crawler page list information.

In order to achieve the purpose, the invention provides the following scheme:

a method for automatically extracting page list information of a web crawler comprises the following steps:

acquiring a hypertext markup language document of a page to be extracted;

determining a set of hypertext markup language objects according to elements in the hypertext markup language document; the elements include tags, attributes, and text of a hypertext markup language document;

traversing the hypertext markup language object set to determine a document object model;

determining a webpage structure of the page to be extracted according to the document object model; the webpage structure comprises list items and list item attributes;

determining an extraction mode according to the webpage structure of the page to be extracted;

and extracting the page list information to be extracted by utilizing the extraction mode.

Optionally, before the acquiring of a hypertext markup language document of a page to be extracted, the method further includes:

judging whether the page to be extracted has been redesigned, to obtain a first judgment result;

if the first judgment result shows that the page to be extracted has been redesigned, executing the step of acquiring the hypertext markup language document of the page to be extracted;

and if the first judgment result shows that the page to be extracted has not been redesigned, extracting the page list information directly according to the extraction mode that corresponds to the unchanged page.

Optionally, the traversing the hypertext markup language object set to determine a document object model specifically includes:

traversing the hypertext markup language object set to determine the relationships among all objects; the relationships among all objects comprise the sibling relationships, containment relationships, parent relationships and child relationships among the objects, and the height and depth of the hierarchical relationship;

and determining the document object model according to the relation among all the objects.

Optionally, the determining, according to the document object model, a web page structure of the page to be extracted specifically includes:

determining the number of the tags of each attribute and the number of the tags of all the attributes in the page to be extracted according to the document object model;

determining the proportion of the occurrence times of each label of each attribute in the page to be extracted to obtain a single proportion set;

determining the proportion of the label attribute combination in the page to be extracted to obtain a combination proportion set;

arranging the single proportion set and the combined proportion set in descending order of tag attribute frequency to obtain the list items; the attributes of a list item are the tag attributes for which the occurrence proportions of several single tag attributes equal the occurrence proportion of their combination in the combined proportion set;

determining all document object chains in the page to be extracted according to the document object model;

intercepting each document object chain by taking the tag attribute combination corresponding to the list item as the starting point;

calculating the frequency of occurrence of each intercepted document object chain;

and performing descending order arrangement on the frequency of the occurrence of each intercepted document object chain, and determining the attribute of the list item.

Optionally, the intercepting each document object chain starting with the tag attribute combination corresponding to the list item further includes:

judging whether the document object chain contains a tag attribute combination corresponding to the list item to obtain a second judgment result;

if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, the document object chain is reserved, and each document object chain is intercepted by taking the tag attribute combination corresponding to the list item as an initial point;

and if the second judgment result shows that the document object chain does not contain the tag attribute combination corresponding to the list item, rejecting the document object chain.

An automatic extraction system for web crawler page list information, comprising:

the hypertext markup language document acquisition module is used for acquiring a hypertext markup language document of a page to be extracted;

a hypertext markup language object set determining module, configured to determine a hypertext markup language object set according to elements in the hypertext markup language document; the elements include tags, attributes, and text of a hypertext markup language document;

the document object model determining module is used for traversing the hypertext markup language object set and determining a document object model;

the webpage structure determining module is used for determining the webpage structure of the page to be extracted according to the document object model; the webpage structure comprises list items and list item attributes;

the extraction mode determining module is used for determining an extraction mode according to the webpage structure of the page to be extracted;

and the page list information extraction module is used for extracting the page list information to be extracted by utilizing the extraction mode.

Optionally, the system further includes:

a first judgment module, used for judging whether the page to be extracted has been redesigned, to obtain a first judgment result;

an execution module, used for executing the step of acquiring the hypertext markup language document of the page to be extracted if the first judgment result shows that the page to be extracted has been redesigned;

and an extraction module, used for extracting the page list information directly according to the extraction mode that corresponds to the unchanged page if the first judgment result shows that the page to be extracted has not been redesigned.

Optionally, the document object model determining module specifically includes:

the relation determining unit among all the objects is used for traversing the hypertext markup language object set and determining the relationships among all objects; the relationships among all objects comprise the sibling relationships, containment relationships, parent relationships and child relationships among the objects, and the height and depth of the hierarchical relationship;

and the document object model determining unit is used for determining the document object model according to the relation among all the objects.

Optionally, the web page structure determining module specifically includes:

the label attribute determining unit is used for determining the number of labels of each attribute and the number of labels of all attributes in the page to be extracted according to the document object model;

the single proportion set determining unit is used for determining the proportion of the occurrence times of each label of each attribute in the page to be extracted to obtain a single proportion set;

a combined proportion set determining unit, configured to determine a proportion of a tag attribute combination in the page to be extracted, so as to obtain a combined proportion set;

the list item determining unit is used for arranging the single proportion set and the combined proportion set in descending order of tag attribute frequency to obtain the list items; the attributes of a list item are the tag attributes for which the occurrence proportions of several single tag attributes equal the occurrence proportion of their combination in the combined proportion set;

the document object chain determining unit is used for determining all document object chains in the page to be extracted according to the document object model;

a document object chain intercepting unit, configured to intercept each document object chain starting with a tag attribute combination corresponding to the list item;

the frequency calculating unit is used for calculating the frequency of the occurrence of each intercepted document object chain;

and the list item attribute determining unit is used for performing descending order arrangement on the frequency of the document object chain after each interception to determine the list item attribute.

Optionally, the web page structure determining module further includes:

the first judging unit is used for judging whether the document object chain contains a tag attribute combination corresponding to the list item to obtain a second judging result;

a document object chain retaining unit, configured to retain the document object chain if the second determination result indicates that the document object chain contains the tag attribute combination corresponding to the list item, and intercept each of the document object chains starting with the tag attribute combination corresponding to the list item;

and the document object chain removing unit is used for removing the document object chain if the second judgment result shows that the document object chain does not contain the tag attribute combination corresponding to the list item.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

According to the method and the system for automatically extracting web crawler page list information provided by the invention, the hypertext markup language document of the page to be extracted is acquired, the hypertext markup language object set is determined, the document object model is determined from the hypertext markup language object set, and the webpage structure of the page to be extracted is then determined from the document object model. The webpage structure is thus identified automatically, which addresses the variability of how each page displays and formats information, avoids manual intervention, realizes automatic capture of web crawler page list information, and keeps the web crawler stable.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a method for automatically extracting page list information of a web crawler according to the present invention;

FIG. 2 is a schematic structural diagram of an automatic web crawler page list information extraction system according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic flow chart of a method for automatically extracting page list information of a web crawler according to the present invention, and as shown in fig. 1, the method for automatically extracting page list information of a web crawler according to the present invention includes:

S101, obtaining a hypertext markup language document of a page to be extracted.

Before S101, the method further includes:

judging whether the page to be extracted has been redesigned, to obtain a first judgment result.

If the first judgment result shows that the page to be extracted has been redesigned, the step of acquiring the hypertext markup language document of the page to be extracted is executed.

S102, determining a set of hypertext markup language objects according to elements in the hypertext markup language document, wherein the elements comprise tags, attributes and text of the hypertext markup language document. Each element in the hypertext markup language document is regarded as one object, and the determined set of hypertext markup language objects is $HTML = \{Obj_1, Obj_2, Obj_3, \ldots, Obj_n\}$.
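Step S102 can be illustrated with a minimal Python sketch. It is only one possible reading of the step, not the patented implementation: the class name `ObjectCollector`, the function `html_object_set` and the `SAMPLE` page are hypothetical, and the standard-library `html.parser` module is assumed as the parser. Each tag, attribute and text node is recorded as one object.

```python
from html.parser import HTMLParser


class ObjectCollector(HTMLParser):
    """Collects every tag, attribute and text node as one (kind, value) object."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        self.objects.append(("tag", tag))
        for name, value in attrs:
            self.objects.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.objects.append(("text", text))


def html_object_set(document):
    """Returns the objects of an HTML document in document order."""
    collector = ObjectCollector()
    collector.feed(document)
    collector.close()
    return collector.objects


# Hypothetical sample page, not taken from the patent.
SAMPLE = (
    '<html><head></head><body><div class="content">'
    '<p class="main list"><a>Title</a><span>2020-03-26</span></p>'
    "</div></body></html>"
)
objs = html_object_set(SAMPLE)
```

The objects are kept in a list rather than a Python `set` so that document order and repeated occurrences survive for the later counting steps.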

S103, traversing the hypertext markup language object set and determining a document object model.

The hypertext markup language object set is traversed to determine the relationships among all objects; these relationships comprise the sibling relationships, containment relationships, parent relationships and child relationships among the objects, and the height and depth of the hierarchical relationship. The height of an object is the length of the longest path from that object down to a leaf, and its depth is the length of the path from that object to the root.

And determining the document object model Dom according to the relation among all the objects.
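The relationships recorded in S103 can be sketched as a small tree structure. This is an illustrative sketch only: `Node`, `TreeBuilder` and `build_tree` are hypothetical names, and the code assumes well-formed markup in which every element is explicitly closed (void elements such as an unclosed `<br>` are not handled).

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent    # parent relationship
        self.children = []      # child and sibling relationships

    def depth(self):
        """Length of the path from this node up to the root."""
        d, node = 0, self
        while node.parent is not None:
            d += 1
            node = node.parent
        return d

    def height(self):
        """Length of the longest path from this node down to a leaf."""
        if not self.children:
            return 0
        return 1 + max(child.height() for child in self.children)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []         # currently open elements

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        node = Node(tag, parent)
        if parent is None:
            self.root = node
        else:
            parent.children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()


def build_tree(document):
    builder = TreeBuilder()
    builder.feed(document)
    builder.close()
    return builder.root


# Hypothetical sample page, not taken from the patent.
root = build_tree(
    "<html><head></head><body><div><p><a>t</a><span>d</span></p></div></body></html>"
)
p_node = root.children[1].children[0].children[0]  # html > body > div > p
```

Here `p_node.children` gives the siblings `a` and `span`, `p_node.parent` gives the containing `div`, and `depth()` and `height()` give the two hierarchical measures described above.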

S104, determining the webpage structure of the page to be extracted according to the document object model, wherein the webpage structure comprises list items and list item attributes. Based on the document object model, the number of all elements and attributes, and the association relationships among them, can be calculated quickly. Once these statistics and association relationships are obtained, they are analyzed to extract the list items, list item attributes and irrelevant information from the document structure, and the irrelevant information is removed.

The number of tags of each attribute and the number of tags of all attributes in the page to be extracted are determined according to the document object model. That is, the number of occurrences $m_i$ of each tag or attribute $x_i$ within the page is computed, and the total number of all tags and attributes in the page is computed as $w = \sum_i m_i$.

The proportion of occurrences of each tag and each attribute in the page to be extracted is determined to obtain a single proportion set. That is, the proportion $p_i = m_i / w$ of the number of occurrences of a single tag or attribute in the whole page is calculated, forming the single proportion set $p = \{p_1, p_2, p_3, p_4, \ldots, p_w\}$.
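The counting behind the single proportion set can be sketched as follows. The sketch makes two assumptions that the text does not state explicitly: "attributes" are read as the individual class names of an element, and the page `PAGE` is a hypothetical reconstruction with 25 tags and 11 class attributes. The names `Counts` and `single_proportions` are likewise illustrative.

```python
from collections import Counter
from fractions import Fraction
from html.parser import HTMLParser


class Counts(HTMLParser):
    """Counts occurrences m_i of every tag and every class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[("tag", tag)] += 1
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    self.counts[("class", cls)] += 1


def single_proportions(document):
    """Returns ({x_i: p_i}, w) with p_i = m_i / w."""
    parser = Counts()
    parser.feed(document)
    parser.close()
    w = sum(parser.counts.values())
    return {key: Fraction(m, w) for key, m in parser.counts.items()}, w


# Hypothetical page: five list items, each a <p> with an <a> and two <span>s.
ITEM = '<p class="main list"><a>Title</a><span>Date</span><span>Source</span></p>'
PAGE = (
    '<html><head></head><body><h3>News</h3><div class="content">'
    + ITEM * 5
    + "</div></body></html>"
)
p, w = single_proportions(PAGE)
```

`Fraction` keeps the proportions exact, so the later equality comparison between single and combined proportions does not depend on floating-point rounding.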

The proportion of each tag attribute combination in the page to be extracted is determined to obtain a combined proportion set $k$. Sometimes an element in a page carries more than one attribute, and different structural content is expressed through combinations of different elements: $k = \{p_{(1,2,\ldots)}, p_{(1,3,\ldots)}, p_{(2,3,\ldots)}, \ldots, p_{(i,j,\ldots)}\}$, where $i, j$ denote the different tag attributes being combined and range over all combinations present within the page.

$n_{(i,j,\ldots)}$ is the number of occurrences of the tag attribute combination $(i,j,\ldots)$ in the page, so that $p_{(i,j,\ldots)} = n_{(i,j,\ldots)} / w$.

The single proportion set and the combined proportion set are arranged in descending order of tag attribute frequency to obtain the list items; the attributes of a list item are the tag attributes for which the occurrence proportions of several single tag attributes equal the occurrence proportion of their combination in the combined proportion set. That is, $p_{(i,j,\ldots)} = p_i = p_j = \ldots$.
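The list item decision can be sketched in the same style. As before, this is a hedged illustration rather than the patented algorithm: a "combination" is assumed to be the set of class names carried by one element, `ComboCounter`, `find_list_items` and `PAGE` are hypothetical, and elements with a single class are not considered as candidates.

```python
from collections import Counter
from fractions import Fraction
from html.parser import HTMLParser


class ComboCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.single = Counter()  # m_i for each tag or class name
        self.combo = Counter()   # n_(i,j,...) for each (tag, class combination)

    def handle_starttag(self, tag, attrs):
        self.single[("tag", tag)] += 1
        classes = tuple(sorted(set((dict(attrs).get("class") or "").split())))
        for cls in classes:
            self.single[("class", cls)] += 1
        if len(classes) > 1:
            self.combo[(tag, classes)] += 1


def find_list_items(document):
    """Returns combinations whose proportion equals that of each member class."""
    counter = ComboCounter()
    counter.feed(document)
    counter.close()
    w = sum(counter.single.values())
    items = []
    # most_common() yields the combinations in descending order of frequency.
    for (tag, classes), n in counter.combo.most_common():
        p_combo = Fraction(n, w)
        if all(Fraction(counter.single[("class", c)], w) == p_combo for c in classes):
            items.append((tag, classes, p_combo))
    return items


# Hypothetical page: five list items inside one container.
ITEM = '<p class="main list"><a>Title</a><span>Date</span><span>Source</span></p>'
PAGE = (
    '<html><head></head><body><h3>News</h3><div class="content">'
    + ITEM * 5
    + "</div></body></html>"
)
list_items = find_list_items(PAGE)
```

On this page the only combination is `p` with classes `list` and `main`, and its proportion matches the proportion of each class on its own, so it is reported as the list item.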

And determining all document object chains, namely DOM chains, in the page to be extracted according to the document object model. Where the hierarchical DOM chain is denoted by M and the side-by-side DOM chain is denoted by N.

And intercepting each document object chain by taking the tag attribute combination corresponding to the list item as the start.

And calculating the occurrence frequency of the document object chain after each interception.
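The chain interception and frequency count can be sketched as below. The chain representation is an assumption made for illustration: a hierarchical chain is the root-to-element path joined by `→`, followed by the element's direct children joined by `|` as the side-by-side part. `Node`, `TreeBuilder`, `chain_frequencies`, the label format `tag.class1.class2` and `PAGE` are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, classes, parent):
        self.tag, self.classes, self.parent = tag, classes, parent
        self.children = []

    @property
    def label(self):
        # e.g. "p.list.main" for <p class="main list">
        return ".".join((self.tag,) + self.classes)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes, self.stack = [], []

    def handle_starttag(self, tag, attrs):
        classes = tuple(sorted((dict(attrs).get("class") or "").split()))
        parent = self.stack[-1] if self.stack else None
        node = Node(tag, classes, parent)
        if parent is not None:
            parent.children.append(node)
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()


def chain_frequencies(document, start_label):
    """Cuts every chain at start_label and counts the resulting chains."""
    builder = TreeBuilder()
    builder.feed(document)
    builder.close()
    freq = Counter()
    for node in builder.nodes:
        if not node.children:
            continue
        path, cur = [], node
        while cur is not None:          # walk up to the root
            path.append(cur.label)
            cur = cur.parent
        path.reverse()
        if start_label not in path:     # chain lacks the list item: discard
            continue
        cut = path[path.index(start_label):]
        cut.append("|".join(child.tag for child in node.children))
        freq["→".join(cut)] += 1
    return freq.most_common()           # descending order of frequency


# Hypothetical page: five list items inside one container.
ITEM = '<p class="main list"><a>Title</a><span>Date</span><span>Source</span></p>'
PAGE = (
    '<html><head></head><body><h3>News</h3><div class="content">'
    + ITEM * 5
    + "</div></body></html>"
)
chains = chain_frequencies(PAGE, "p.list.main")
```

Chains that never pass through the list item, such as `html→head|body`, are dropped, which mirrors the rejection step described for the second judgment result.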

S105, determining an extraction mode according to the webpage structure of the page to be extracted;

and S106, extracting the page list information to be extracted by utilizing the extraction mode.

Before each document object chain is intercepted starting from the tag attribute combination corresponding to the list item, the method further comprises:

judging whether the document object chain contains the tag attribute combination corresponding to the list item, to obtain a second judgment result.

If the second judgment result shows that the document object chain contains the tag attribute combination corresponding to the list item, the document object chain is retained, and each document object chain is intercepted with the tag attribute combination corresponding to the list item as the starting point.

In a specific embodiment, HTML = {html, head, body, h3, div, p, a, span}. The HTML object set is traversed, and the interrelations (sibling, containment, parent, child, etc.) among all objects are recorded together with the height and depth of the hierarchical relationship; the height of an object is the length of the longest path from it down to a leaf, and its depth is the length of the path from it to the root.

1) The number of all tags in the page is calculated to be 25 (the tags inside the head do not participate in data extraction, so they are removed as useless and excluded from the calculation), and the number of attributes is calculated to be 11;

2) the total number of all tags and attributes in the page is calculated to be 25 + 11 = 36;

3) the proportion p of the number of occurrences of each single tag and attribute in the whole page is calculated, as shown in Table 1:

TABLE 1

Tag/Attribute	p	Tag/Attribute	p
html	1/36	a	5/36
head	1/36	span	10/36
body	1/36	main	5/36
div	1/36	list	5/36
p	5/36	content	1/36

In the above table, html, head, body, div, p, a, and span are tags, and main, list, and content are attributes.

4) The proportion of each tag attribute combination in the page is calculated. Sometimes an element in a page carries more than one attribute, and different structural content is expressed through combinations of different elements. The set of combined tag attributes is denoted $k$. In this example the p tag has two class attributes, main and list:

$k = \{p_{(main,list)}\}$

5) The sets $p$ and $k$ are arranged in descending order of tag attribute frequency, and the tag attributes for which the occurrence proportions of several single elements equal the occurrence proportion of their combination in the combined set are identified. The comparison gives:

$p_{(main,list)} = p_{main} = p_{list}$

Therefore the p tags with the attributes main and list are the list items, and the tags html, head, body and div and the attribute content are irrelevant tags and irrelevant attributes.

② decision algorithm for list item attribute

Through analysis, it is found that in this example, the list item and the list item attribute are in a containment relationship, and are in a hierarchical form M, and a parallel form N exists between the list item attributes, then the process of determining the list item attribute is as follows:

1) all DOM chains within the page are listed, with the hierarchical DOM chain represented by M and the side-by-side DOM chain represented by N. In this example, only one layer contains the relationship, i.e., p contains a, span.

M＝{P→a|span|span}

2) The DOM chains are intercepted starting from the list item's tag attribute (combination); DOM chains that do not contain the list item's tag attributes are discarded directly. The frequency of occurrence of each intercepted DOM chain is calculated, and the results form a set R. In this example there is only one chain, {P→a|span|span}, which occurs 5 times, so R contains this single chain with a frequency of 5.

3) The DOM chains are arranged in descending order of frequency, the result being denoted R', and the candidates with the higher frequencies are taken as the list item attributes. Based on the page regularities obtained above, span is then judged to be the list item attribute.

Result: the list item is the p tag with the class attributes main and list; the list item content is the a tag, and the list item attribute is the span tag.
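The text does not define the format of an extraction mode, so the following sketch assumes one for illustration: a dictionary `MODE` holding the result above (list item tag and classes, content tag, attribute tag). `ListExtractor`, `extract` and `PAGE` are hypothetical names, and nested list items are not handled.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Applies an assumed extraction mode to pull one row per list item."""

    def __init__(self, mode):
        super().__init__()
        self.mode = mode
        self.rows = []
        self.row = None     # row currently being filled, if inside a list item
        self.field = None   # "content" or "attributes" while inside such a tag

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == self.mode["item_tag"] and set(self.mode["item_classes"]) <= classes:
            self.row = {"content": "", "attributes": []}
        elif self.row is not None and tag == self.mode["content_tag"]:
            self.field = "content"
        elif self.row is not None and tag == self.mode["attribute_tag"]:
            self.field = "attributes"
            self.row["attributes"].append("")

    def handle_data(self, data):
        if self.row is None or self.field is None:
            return
        if self.field == "content":
            self.row["content"] += data.strip()
        else:
            self.row["attributes"][-1] += data.strip()

    def handle_endtag(self, tag):
        if self.row is None:
            return
        if tag in (self.mode["content_tag"], self.mode["attribute_tag"]):
            self.field = None
        elif tag == self.mode["item_tag"]:
            self.rows.append(self.row)
            self.row = None


def extract(document, mode):
    extractor = ListExtractor(mode)
    extractor.feed(document)
    extractor.close()
    return extractor.rows


# Assumed extraction mode, mirroring the result of the worked example.
MODE = {"item_tag": "p", "item_classes": ("main", "list"),
        "content_tag": "a", "attribute_tag": "span"}
# Hypothetical page fragment, not taken from the patent.
PAGE = (
    '<div class="content">'
    '<p class="main list"><a>First</a><span>2020-03-26</span><span>News</span></p>'
    '<p class="main list"><a>Second</a><span>2020-03-27</span><span>Notice</span></p>'
    "</div>"
)
rows = extract(PAGE, MODE)
```

Keeping the mode as plain data means it could be stored and reused for a page that has not been redesigned, which is the shortcut the first judgment result allows.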

Fig. 2 is a schematic structural diagram of an automatic extraction system for web crawler page list information provided by the present invention, and as shown in fig. 2, the automatic extraction system for web crawler page list information provided by the present invention includes: a hypertext markup language document acquisition module 201, a hypertext markup language object set determination module 202, a document object model determination module 203, a web page structure determination module 204, an extraction mode determination module 205, and a page list information extraction module 206.

The hypertext markup language document acquisition module 201 is configured to acquire a hypertext markup language document of a page to be extracted.

The hypertext markup language object set determining module 202 is configured to determine a hypertext markup language object set according to elements in the hypertext markup language document; the elements include tags, attributes, and text of a hypertext markup language document.

The document object model determining module 203 is configured to traverse the set of hypertext markup language objects to determine a document object model.

The web page structure determining module 204 is configured to determine a web page structure of the page to be extracted according to the document object model; the web page structure includes list items and list item attributes.

The extraction mode determining module 205 is configured to determine an extraction mode according to a web page structure of the page to be extracted.

The page list information extraction module 206 is configured to extract the page list information to be extracted by using the extraction mode.

The system for automatically extracting the page list information of the web crawler provided by the invention further comprises: a first judgment module, an execution module, and an extraction module for the case in which the page has not been redesigned.

The first judgment module is used for judging whether the page to be extracted has been redesigned, to obtain a first judgment result.

The execution module is used for executing the step of acquiring the hypertext markup language document of the page to be extracted if the first judgment result shows that the page to be extracted has been redesigned.

The extraction module for the case in which the page has not been redesigned is used for extracting the page list information directly according to the extraction mode that corresponds to the unchanged page if the first judgment result shows that the page to be extracted has not been redesigned.

The document object model determining module 203 specifically includes: a relation determining unit between all the objects and a document object model determining unit.

The relation determining unit among all the objects is used for traversing the hypertext markup language object set and determining the relation among all the objects; the relationships among all the objects comprise the peer relationships among the objects, the inclusion relationships, the parent relationships, the child relationships and the height and the depth of the hierarchical relationship.

The document object model determining unit is used for determining the document object model according to the relation among all the objects.

The web page structure determining module 204 specifically includes: the system comprises a tag attribute determining unit, a single proportion set determining unit, a combined proportion set determining unit, a list item determining unit, a document object chain intercepting unit, a frequency calculating unit and a list item attribute determining unit.

The label attribute determining unit is used for determining the number of labels of each attribute and the number of labels of all attributes in the page to be extracted according to the document object model.

The single proportion set determining unit is used for determining the proportion of the occurrence times of each label of each attribute in the page to be extracted to obtain a single proportion set.

And the combined proportion set determining unit is used for determining the proportion of the label attribute combination in the page to be extracted to obtain a combined proportion set.

The list item determining unit is used for performing descending order arrangement on the single proportion set and the combined proportion set according to the frequency of the tag attributes to obtain list items; the attribute of the list item is a label attribute with the same appearance proportion of a plurality of single label attributes as the appearance proportion of the combined label attribute in the combined proportion set.

And the document object chain determining unit is used for determining all document object chains in the page to be extracted according to the document object model.

And the document object chain intercepting unit is used for intercepting each document object chain by taking the tag attribute combination corresponding to the list item as the starting point.

The frequency calculating unit is used for calculating the frequency of the document object chain after each interception.

The web page structure determining module 204 further comprises: the device comprises a first judging unit, a document object chain retaining unit and a document object chain removing unit.

The first judging unit is used for judging whether the document object chain contains a tag attribute combination corresponding to the list item to obtain a second judging result.

And the document object chain retaining unit is used for retaining the document object chain if the second judgment result indicates that the document object chain contains the tag attribute combination corresponding to the list item, and intercepting each document object chain by taking the tag attribute combination corresponding to the list item as a start.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for automatically extracting page list information of a web crawler is characterized by comprising the following steps:

acquiring a hypertext markup language document of a page to be extracted;

2. The method according to claim 1, wherein the obtaining of the hypertext markup language document of the page to be extracted further comprises:

3. The method according to claim 1, wherein the traversing the set of html objects to determine a document object model specifically includes:

4. The method according to claim 1, wherein the determining a web page structure of the page to be extracted according to the document object model specifically includes:

5. The method according to claim 4, wherein the intercepting each document object chain starting with the tag attribute combination corresponding to the list item further comprises:

6. An automatic extraction system for web crawler page list information, comprising:

7. The system according to claim 6, further comprising:

8. The system according to claim 6, wherein the document object model determining module specifically comprises:

9. The system according to claim 6, wherein the web page structure determining module specifically comprises:

10. The system according to claim 9, wherein the web page structure determining module further comprises: