CN112784135A - Webpage information identification system

Info

Publication number: CN112784135A
Application number: CN202110217963.7A
Authority: CN
Inventors: 张冶青
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-05-11

Abstract

The invention belongs to the technical field of computers and provides a webpage information identification system comprising a path identification module, a vectorization processing module, a clustering processing module, a verification module, a webpage information identification module and a webpage information output module. The webpage information identification system interfaces with a target webpage, identifies and obtains the label paths in the target webpage based on its content, and then performs vectorization processing and clustering processing on the label paths, thereby sorting the webpage content automatically; finally, the webpage information output module marks the optimal list nodes in the optimal list node sets on the target webpage. Because the optimal list nodes in an optimal list node set have a clear hierarchy, the marked webpage content is divided by level, which facilitates the extraction of list item content.

Description

Webpage information identification system

Technical Field

The invention relates to the technical field of computers, in particular to a webpage information identification system.

Background

With the rapid development of network technology, the World Wide Web has become the information carrier with the largest transmission volume and the highest transmission efficiency, and how to effectively acquire the required data from the World Wide Web and make use of this massive amount of information has become a hot research topic in the fields of network technology and communication technology.

A web crawler is a commonly used tool for acquiring web page data. It is a program or script that automatically captures web page information according to certain rules: it reads the content of a web page, finds the other link addresses in that page, follows those addresses to the next web pages, and repeats the process until all web pages of the website have been captured. If the whole Internet is regarded as one website, a web crawler can use this principle to capture all the web pages on the Internet.

However, web crawlers generally do not process the web page data they acquire; they rely on the user to define the search range and to determine which data is finally acquired. Web crawlers are therefore only suitable for users who are able to write scripts.

Disclosure of Invention

The main object of the invention is to provide a webpage information identification system that solves the problem that prior-art tools for acquiring webpage data have a narrow range of application.

In order to achieve the above object, an embodiment of the present invention provides a web page information identification system, including:

the path identification module is used for identifying and acquiring K label paths based on the content of the target webpage; each label path comprises a first node to an Nth node, wherein the first node is a child node, the Nth node is a descendant node, N represents the level of the node, and K and N are positive integers;

the path identification module is used for marking the K label paths with preset node attributes to obtain K identification documents;

the vectorization processing module is used for converting the K identification documents into K high-dimensional vector sets through vectorization processing;

the clustering processing module is used for clustering the K high-dimensional vector sets to obtain K first candidate node sets;

the checking module is used for checking the K first candidate node sets to obtain M second candidate node sets, wherein M is a positive integer less than or equal to K;

the webpage information identification module is used for acquiring M optimal list node sets by using an optimal selection algorithm based on each second candidate node set;

and the webpage information output module is used for marking the optimal list nodes in the M optimal list node sets on the target webpage.

Optionally, the path identification module includes a node tree establishment unit, a root node traversal unit, a child node traversal unit, and a label path output unit;

the node tree establishing unit is used for identifying and acquiring the webpage nodes in the target webpage and establishing a node tree according to the association relationships among the webpage nodes;

the root node traversal unit is used for acquiring the root node of the node tree and traversing all child nodes belonging to the root node;

the child node traversal unit is used for acquiring the descendant nodes belonging to the child nodes and traversing the node paths from the child nodes to the descendant nodes;

the label path output unit is configured to output a node path having a largest total number of node levels as the label path.

Optionally, the path identifying module further includes a node filtering unit;

the node filtering unit is used for filtering preset types of webpage nodes in the target webpage before the root node traversing unit traverses all child nodes belonging to the root node.

Optionally, the preset type of webpage node includes at least one of a div tag, a span tag, a ul tag, and an li tag.

Optionally, the check module includes a node check unit;

the node checking unit is used for traversing K first candidate node sets;

if any two first candidate nodes in the k-th first candidate node set do not share the same parent node, the k-th first candidate node set is deleted;

wherein k is a positive integer less than or equal to K.

Optionally, the webpage information output module includes an identifier assigning unit;

the identification allocation unit is used for allocating the same identification mode to the optimal list nodes of the same level.

Optionally, the identification mode includes framing the webpage content corresponding to the optimal list node on the target webpage;

the frame colors used by different identification modes are different, and the frame colors used by the same identification mode are the same.

Optionally, the preset node attribute includes at least one of a tag name, an id attribute, and a class attribute.

The embodiment of the invention provides a webpage information identification system comprising a path identification module, a vectorization processing module, a clustering processing module, a verification module, a webpage information identification module and a webpage information output module. The webpage information identification system interfaces with a target webpage, identifies and obtains the label paths in the target webpage based on its content, and then performs vectorization processing and clustering processing on the label paths, thereby sorting the webpage content automatically. When the webpage information output module marks the optimal list nodes in an optimal list node set on the target webpage, nodes in the same label path are placed in the same optimal list node set, and the optimal list nodes in that set are clearly layered, so the webpage content is divided by level, which facilitates the further extraction of list item content.

Drawings

Fig. 1 is a schematic structural diagram of a web page information identification system according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a node tree according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Suffixes such as "module", "component", or "unit" used to denote elements are used herein only for convenience in describing the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.

As shown in fig. 1, an embodiment of the present invention provides a web page information identification system 100, which includes, but is not limited to, a path identification module 10, a path identification module 20, a vectorization processing module 30, a clustering processing module 40, a verification module 50, a web page information identification module 60, and a web page information output module 70.

In a specific application, the web page information identification system 100 may be used as follows: the system interfaces with a target web page on which web page information identification is to be performed, crawls to other web pages along all URLs in the target web page, and collects all crawled web pages as the web page content of the target web page, which is then processed by the functional modules above.
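The collection step described above can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions: `LinkExtractor`, `collect_pages`, and the in-memory `SITE` dictionary are hypothetical names that do not appear in the patent, and `fetch` stands in for whatever page retrieval the system actually uses.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_pages(start_url, fetch):
    """Follow links breadth-first from start_url; return {url: html}.

    `fetch` is an assumed callable that returns the HTML of a URL,
    or None when the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # visit every page at most once
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site held in memory instead of on a network.
SITE = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/">home</a>',
    "/b": "<p>leaf</p>",
}
pages = collect_pages("/", SITE.get)
print(list(pages))
```

The `seen` set keeps the walk from looping when pages link back to each other, as "/a" does here.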

In the application process of the web page information identification system 100, the functions of the above functional modules are as follows:

A path identification module 10, configured to identify and acquire K label paths based on the content of a target web page; each label path comprises a first node to an Nth node, wherein the first node is a child node, the Nth node is a descendant node, N represents the level of the node, and K and N are positive integers.

In a specific application, the target web page includes a plurality of web page tags, which serve as the nodes in the embodiment of the present invention. Any two web page nodes may have an inclusion relationship, so the web page nodes in each label path acquired by the path identification module 10 are related by inclusion. For example, the 1st label path includes the web page nodes A, B, C, and D, whose relationship is represented as A > B > C > D: the lower-level web page node of A is B, the lower-level web page node of B is C, and the lower-level web page node of C is D. Thus A is the first node and D is the fourth node; A is a child node, and D is a descendant node of A.
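As a small illustration of the A > B > C > D example, a label path can be modeled as an ordered tuple whose positions give the node levels. The `LabelPath` class below is a hypothetical data structure, not one defined by the patent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LabelPath:
    """An ordered chain of nodes, each one containing the next."""

    nodes: Tuple[str, ...]

    def level_of(self, node: str) -> int:
        # Levels are 1-based: the first node of the path has level 1.
        return self.nodes.index(node) + 1

    def __str__(self) -> str:
        return " > ".join(self.nodes)


path = LabelPath(("A", "B", "C", "D"))
print(path)                # A > B > C > D
print(path.level_of("D"))  # 4
```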

In one embodiment, the path identifying module 10 includes a node tree establishing unit, a root node traversal unit, a child node traversal unit, and a label path output unit, so as to achieve the acquisition of K label paths in a target web page, and the function units of the path identifying module are as follows:

the node tree establishing unit is used for identifying and acquiring the web page nodes in the target web page and establishing a node tree according to the association relationships among the web page nodes;

the root node traversing unit is used for acquiring the root node of the node tree and traversing all child nodes belonging to the root node;

the child node traversing unit is used for acquiring the descendant nodes belonging to the child nodes and traversing the node paths from the child nodes to the descendant nodes;

and a label path output unit for outputting the node path having the largest total number of node levels as a label path.

In one embodiment, the path identification module further comprises a node filtering unit; and the node filtering unit filters preset types of webpage nodes in the target webpage before the root node traversing unit traverses all child nodes belonging to the root node.

Illustratively, the preset type of web page node includes at least one of a div tag, a span tag, a ul tag, and an li tag.

Fig. 2 shows an exemplary node tree established from the web page tags of a target web page. In fig. 2, the web page tags of the target web page include list, item, title, info, author, and date, and the established node tree includes one web page node list, a plurality of web page nodes item, and the web page nodes title, info, author, and date. All web page nodes item belong to the web page node list; the lower-level nodes of a given web page node item are the web page nodes title and info; and the lower-level nodes of the web page node info are the web page nodes author and date. The web page node list is the root node of the node tree, and its preset type is the ul tag, so it is shown in the figure as a ul class. The web page node item is a child node of the node tree and also a first node, and its preset type is the li tag, so it is shown as an li class. The web page nodes title and info are descendant nodes and also second nodes, and their preset type is the div tag, so they are shown as a div class. The web page nodes author and date are also descendant nodes, but they are third nodes, and their preset type is the span tag, so they are shown as a span class. In summary, the path identification module 10 and its functional units provided in the embodiment of the present invention can obtain 5 label paths, which are represented as:

{li.item};

{li.item>a.title};

{li.item>div.info};

{li.item>div.info>span.author};

{li.item>div.info>span.date}.
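The following sketch shows one way such label paths could be enumerated. It is an assumption-laden illustration rather than the patented implementation: the sample markup, the `Node` and `TreeBuilder` classes, and the `label_paths` function are all hypothetical, and the parser ignores complications such as void elements or unbalanced tags.

```python
from html.parser import HTMLParser

# Hypothetical markup shaped like the node tree of fig. 2.
HTML = """
<ul class="list">
  <li class="item"><a class="title">First</a>
    <div class="info"><span class="author">Ann</span><span class="date">2021-02-26</span></div></li>
  <li class="item"><a class="title">Second</a>
    <div class="info"><span class="author">Bo</span><span class="date">2021-02-27</span></div></li>
</ul>
"""


class Node:
    def __init__(self, tag, cls, parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def label(self):
        # "li.item" style label: tag name plus class attribute.
        return f"{self.tag}.{self.cls}" if self.cls else self.tag


class TreeBuilder(HTMLParser):
    """Builds a node tree; assumes well-formed markup without void tags."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        if self.root is None:
            self.root = node
        self.current = node

    def handle_endtag(self, tag):
        if self.current is not None:
            self.current = self.current.parent


def label_paths(root):
    """Distinct paths from each child of the root down to each descendant."""
    paths = []

    def walk(node, prefix):
        path = prefix + (node.label,)
        if path not in paths:  # sibling items yield the same paths only once
            paths.append(path)
        for child in node.children:
            walk(child, path)

    for child in root.children:
        walk(child, ())
    return paths


builder = TreeBuilder()
builder.feed(HTML)
builder.close()
for p in label_paths(builder.root):
    print("{" + ">".join(p) + "}")
```

Under these assumptions the two item nodes collapse into one set of five distinct paths, in the same order as the list above.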

The path identification module 20 is configured to mark the K label paths with preset node attributes to obtain K identification documents.

And the vectorization processing module 30 is configured to convert the K identification documents into K high-dimensional vector sets through vectorization processing.

And the clustering processing module 40 is configured to perform clustering processing on the K high-dimensional vector sets to obtain K first candidate node sets.

The path identification module 20, the vectorization processing module 30, and the clustering processing module 40 together process the label paths. The path identification module 20 marks the label paths so that they can be distinguished from one another, and the vectorization processing module 30 converts the resulting identification documents into vectors so that the clustering processing module 40 can cluster the web page nodes in the label paths. In each first candidate node set, the web page nodes of one label path are thereby clustered, that is, web page nodes of the same type are grouped together. For example, in the first candidate node set obtained from the label path {li.item>div.info}, li.item is one type of web page node and div.info is another type of web page node, and the number of nodes of each type is not limited.

For example, 5 first candidate node sets may be obtained from the 5 label paths above.
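The marking, vectorization, and clustering steps can be illustrated with a deliberately simple stand-in. The text does not specify the document format, the vectorization method, or the clustering algorithm, so everything below is an assumption: documents are token lists built from the tag name and class attribute at each level, vectors are bag-of-words counts, and "clustering" merely groups nodes with identical vectors. The node names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input: each web page node with the label path leading to it.
nodes = [
    ("item1", ["li.item"]),
    ("item2", ["li.item"]),
    ("title1", ["li.item", "a.title"]),
    ("title2", ["li.item", "a.title"]),
    ("info1", ["li.item", "div.info"]),
    ("info2", ["li.item", "div.info"]),
]


def to_document(path):
    """Turn a label path into tokens from tag name and class, per level."""
    tokens = []
    for level, label in enumerate(path, start=1):
        tag, _, cls = label.partition(".")
        tokens.append(f"{level}:tag={tag}")
        if cls:
            tokens.append(f"{level}:class={cls}")
    return tokens


documents = {name: to_document(path) for name, path in nodes}

# One vector dimension per distinct token seen in any document.
vocabulary = sorted({tok for doc in documents.values() for tok in doc})


def vectorize(doc):
    counts = Counter(doc)
    return tuple(counts[tok] for tok in vocabulary)


# Stand-in clustering: nodes whose vectors are identical share a cluster.
clusters = defaultdict(list)
for name, doc in documents.items():
    clusters[vectorize(doc)].append(name)

candidate_sets = list(clusters.values())
print(candidate_sets)
```

A real system would more likely use a distance-based clustering method so that near-identical paths also group together; exact matching is used here only to keep the sketch short.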

A checking module 50, configured to check K first candidate node sets to obtain M second candidate node sets, where M is a positive integer less than or equal to K.

In a specific application, when the node tree is built from the target web page, discrete nodes may exist. These discrete web page nodes do not carry important web page data, so the first candidate node sets need to be checked in order to remove them.

In one embodiment, the checking module 50 includes a node checking unit. The node checking unit is used for traversing the K first candidate node sets; if any two first candidate nodes in the k-th first candidate node set do not share the same parent node, the k-th first candidate node set is deleted, where k is a positive integer less than or equal to K.

Taking a first candidate node set obtained based on the label path { li. As can be seen, after the first candidate node set of each category is verified, the number of first candidate node sets may be reduced; in this embodiment, for example, 5 second candidate node sets are obtained based on the above 5 label paths.

In a first candidate node set, a discrete node does not share a parent node with the other nodes. The node checking unit therefore removes the discrete nodes, which prevents discrete web page nodes from affecting the selection of the optimal list node set in the web page information identification module 60.
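One literal reading of the check can be sketched as follows: a candidate set is kept only when all of its nodes share a single parent. The function name `verify`, the `parent_of` mapping, and the node names are hypothetical, and the patent may intend a looser notion of a shared parent than this sketch applies.

```python
def verify(candidate_sets, parent_of):
    """Keep only the candidate sets whose nodes all share one parent.

    `parent_of` is an assumed mapping from node name to parent name;
    a node missing from the mapping is treated as having no parent.
    """
    kept = []
    for node_set in candidate_sets:
        parents = {parent_of.get(node) for node in node_set}
        if len(parents) == 1 and None not in parents:
            kept.append(node_set)
    return kept


# Hypothetical data: "stray" sits under a different parent than "item3".
parent_of = {"item1": "list", "item2": "list", "item3": "list", "stray": "footer"}
first_candidates = [["item1", "item2"], ["item3", "stray"]]

second_candidates = verify(first_candidates, parent_of)
print(second_candidates)  # [['item1', 'item2']]
```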

And the web page information identification module 60 is configured to obtain M optimal list node sets by using an optimal selection algorithm based on each second candidate node set.

A web page information output module 70, configured to mark an optimal list node in the M optimal list node sets on the target web page.

A second candidate node set may contain several clustered web page nodes within the same partition result; the web page information identification module 60 therefore uses the optimal selection algorithm to select a unique web page node within each partition result.

It should be noted that, when an optimal selection algorithm is used for selection in each second candidate node set, a plurality of list node sets are obtained first, and then an optimal list node set is selected according to the average maximum text length and the average number of text labels in each list node set.
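A minimal sketch of this selection criterion is given below. The text names the two statistics but not how they are combined, so the sketch assumes a simple ordering: a higher average maximum text length wins, with the average number of text labels as a tie-breaker. The names `score`, `select_optimal`, and the sample data are hypothetical.

```python
from statistics import mean


def score(list_node_set):
    """Return (average maximum text length, average number of text labels).

    Each list item is modeled as the list of text strings it contains.
    An empty list node set is not handled and would raise StatisticsError.
    """
    avg_max_len = mean(
        max((len(text) for text in item), default=0) for item in list_node_set
    )
    avg_labels = mean(len(item) for item in list_node_set)
    return (avg_max_len, avg_labels)


def select_optimal(list_node_sets):
    """Pick the name of the best-scoring list node set (assumed ordering)."""
    return max(list_node_sets, key=lambda name: score(list_node_sets[name]))


# Hypothetical list node sets: a navigation bar and an article list.
list_node_sets = {
    "nav": [["Home"], ["About"]],
    "articles": [
        ["A long article headline", "Alice", "2021-02-26"],
        ["Another headline here", "Bob", "2021-02-27"],
    ],
}
print(select_optimal(list_node_sets))  # articles
```

The intuition behind the assumed ordering is that content lists tend to hold longer and more numerous text fragments than navigation or footer lists.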

For example, according to the node tree and the label paths shown in fig. 2, if the obtained optimal list node is a list node based on the label path {li.item>div.info>span.author}, then the web page content corresponding to the web page node item, the web page content corresponding to the web page node info, and the web page content corresponding to the web page node author are marked on the target web page.

Each label path comprises a first node to an Nth node, where the first node is a child node, the Nth node is a descendant node, and N represents the level of the node. The first candidate nodes in a first candidate node set, the second candidate nodes in a second candidate node set, and the optimal list nodes in an optimal list node set obtained from a label path therefore also carry this hierarchy.

In one embodiment, the web page information output module 70 includes an identification assignment unit; the identification allocation unit is used for allocating the same identification mode to the optimal list nodes of the same level.

Exemplarily, the identification mode includes framing the web page content corresponding to the optimal list node on the target web page; the frame colors used by different identification modes are different, and the frame colors used by the same identification mode are the same.
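The color assignment can be sketched as a mapping from node level to frame color. The function name `assign_frame_colors`, the palette, and the use of a CSS `outline` declaration as the frame are assumptions made for illustration; the patent does not state how the frame is drawn.

```python
def assign_frame_colors(optimal_nodes, palette=("red", "green", "blue", "orange")):
    """Map each node to a CSS outline whose color depends only on its level.

    `optimal_nodes` is a list of (node name, level) pairs. Nodes of the same
    level get the same color, and different levels get different colors.
    """
    levels = sorted({level for _, level in optimal_nodes})
    if len(levels) > len(palette):
        raise ValueError("palette has fewer colors than there are levels")
    color_of = dict(zip(levels, palette))
    return {
        name: f"outline: 2px solid {color_of[level]};"
        for name, level in optimal_nodes
    }


# Levels follow the label path {li.item>div.info>span.author} of fig. 2.
styles = assign_frame_colors([("item", 1), ("info", 2), ("author", 3)])
for name, style in styles.items():
    print(name, style)
```

Raising an error when the palette is too small keeps two different levels from silently sharing one color.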

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the foregoing embodiments illustrate the present invention in detail, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A web page information identification system, comprising:

2. The web page information identification system of claim 1, wherein the path identification module includes a node tree establishment unit, a root node traversal unit, a child node traversal unit, and a label path output unit;

3. The web page information identification system of claim 2, wherein the path identification module further comprises a node filtering unit;

4. The web page information identification system of claim 3, wherein the preset type of web page node includes at least one of a div tag, a span tag, a ul tag, and an li tag.

5. The web page information identification system of claim 1, wherein the check module includes a node check unit;

the node checking unit is used for traversing K first candidate node sets;

wherein k is a positive integer less than or equal to K.

6. The web page information identification system of claim 1, wherein the web page information output module includes an identification assignment unit;

7. The web page information identification system of claim 6, wherein the identification mode comprises framing the web page content corresponding to the optimal list node on the target web page;

8. The web page information identification system of claim 1, wherein the preset node attribute includes at least one of a tag name, an id attribute, and a class attribute.