CN112784135A - Webpage information identification system - Google Patents


Publication number
CN112784135A
Authority
CN
China
Prior art keywords
node
webpage
web page
nodes
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110217963.7A
Other languages
Chinese (zh)
Inventor
张冶青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202110217963.7A
Publication of CN112784135A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention belongs to the technical field of computers and provides a webpage information identification system comprising a path identification module, a vectorization processing module, a clustering processing module, a verification module, a webpage information identification module and a webpage information output module. The system interfaces with a target webpage, identifies and acquires the label paths in the target webpage based on its content, and then performs vectorization processing and clustering processing on the label paths so that the webpage content is sorted automatically. Finally, the webpage information output module marks the optimal list nodes of each optimal list node set on the target webpage. Because the optimal list nodes in an optimal list node set have a clear hierarchy, the marked webpage content is divided by level, which facilitates the extraction of list item content.

Description

Webpage information identification system
Technical Field
The invention relates to the technical field of computers, in particular to a webpage information identification system.
Background
With the rapid development of network technology, the world wide web has become the information carrier with the largest transmission volume and the highest transmission efficiency. How to effectively acquire the required data from the world wide web and make use of this massive amount of information has become a hot research topic in the fields of network technology and communication technology.
A web crawler is a commonly used tool for acquiring web page data. It is a program or script that automatically captures web page information according to certain rules: it reads the content of a web page, finds the link addresses in that page, follows those addresses to the next web pages, and repeats the process until all web pages of the website have been captured. If the whole internet is regarded as one website, a web crawler (also called a web spider) can in principle capture all web pages on the internet in this way.
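By way of a non-limiting illustration, the crawling principle described above might be sketched in Python as follows. The names LinkExtractor, crawl and SITE are hypothetical and introduced only for this sketch; the in-memory SITE dictionary stands in for real network requests, which are deliberately left out.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl: read a page, find its links, follow them, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page unknown to the fetcher: skip it
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "website" (assumption: stands in for HTTP fetching).
SITE = {
    "/index": '<a href="/news">News</a> <a href="/about">About</a>',
    "/news": '<a href="/index">Home</a> <a href="/news/1">Item 1</a>',
    "/about": "<p>No links here.</p>",
    "/news/1": '<a href="/news">Back</a>',
}

visited = crawl("/index", SITE.get)
```

The seen set is what makes the loop terminate on sites whose pages link back to each other.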
However, web crawlers generally do not process the web page data they acquire. They rely on users to define the search range and to decide which data is finally acquired, so they are only suitable for people with script editing ability.
Disclosure of Invention
The invention mainly aims to provide a webpage information identification system to solve the problem that a webpage information data acquisition tool in the prior art is narrow in application range.
In order to achieve the above object, an embodiment of the present invention provides a web page information identification system, including:
the path identification module is used for identifying and acquiring K label paths based on the content of the target webpage; each label path comprises a first node to an Nth node, wherein the first node is a child node, each subsequent node up to the Nth node is a child sub-node of the node before it, N represents the hierarchy level of the node, and K and N are positive integers;
the path identification module is used for identifying K label paths through preset node attributes to obtain K identification documents;
the vectorization processing module is used for converting the K identification documents into K high-dimensional vector sets through vectorization processing;
the clustering processing module is used for clustering the K high-dimensional vector sets to obtain K first candidate node sets;
the checking module is used for checking the K first candidate node sets to obtain M second candidate node sets, wherein M is a positive integer less than or equal to K;
the webpage information identification module is used for acquiring M optimal list node sets by using an optimal selection algorithm based on each second candidate node set;
and the webpage information output module is used for marking the optimal list nodes in the M optimal list node sets on the target webpage.
Optionally, the path identification module includes a node tree establishment unit, a root node traversal unit, a child node traversal unit, and a label path output unit;
the node tree establishing unit identifies and acquires webpage nodes in a target webpage and establishes a node tree according to the association relationships between the webpage nodes;
the root node traversal unit is used for acquiring a root node of the node tree and traversing all child nodes belonging to the root node;
the child node traversal unit is used for acquiring the child sub-nodes belonging to each child node and traversing the node paths from the child nodes to the child sub-nodes;
the label path output unit is configured to output a node path having a largest total number of node levels as the label path.
Optionally, the path identifying module further includes a node filtering unit;
the node filtering unit is used for filtering preset types of webpage nodes in the target webpage before the root node traversing unit traverses all child nodes belonging to the root node.
Optionally, the preset type of webpage node includes at least one of a div tag, a span tag, a ul tag, and a li tag.
Optionally, the check module includes a node check unit;
the node checking unit is used for traversing K first candidate node sets;
if any two first candidate nodes in the kth first candidate node set do not share the same parent node, deleting the kth first candidate node set;
wherein k is a positive integer less than or equal to K.
Optionally, the webpage information output module includes an identifier assigning unit;
the identification allocation unit is used for allocating the same identification mode to the optimal list nodes of the same level.
Optionally, the identification mode includes framing the web page content corresponding to the optimal list node on the target web page;
and the frame selection colors used by different identification modes are different, and the frame selection colors used by the same identification mode are the same.
Optionally, the preset node attribute includes at least one of a tag name, an id attribute, and a class attribute.
The embodiment of the invention provides a webpage information identification system comprising a path identification module, a vectorization processing module, a clustering processing module, a verification module, a webpage information identification module and a webpage information output module. The system interfaces with the target webpage, identifies and acquires the label paths in the target webpage based on its content, and then performs vectorization processing and clustering processing on the label paths so that the webpage content is sorted automatically. When the webpage information output module marks the optimal list nodes of an optimal list node set on the target webpage, nodes in the same label path are placed in the same optimal list node set, and the optimal list nodes in that set have a clear hierarchy. The webpage content is therefore divided by level, which facilitates further extraction of list item content.
Drawings
Fig. 1 is a schematic structural diagram of a web page information identification system according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a node tree according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Suffixes such as "module", "part", or "unit" used to denote elements are used herein only for convenience of describing the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
As shown in fig. 1, an embodiment of the present invention provides a web page information identification system 100, which includes, but is not limited to, a path identification module 10, a path identification module 20, a vectorization processing module 30, a cluster processing module 40, a verification module 50, a web page information identification module 60, and a web page information output module 70.
In a specific application, one application of the web page information identification system 100 may be: the web page information recognition system 100 interfaces with a target web page that needs to be subjected to web page information recognition, crawls to other web pages along all URLs in the target web page, and collects all crawled web pages as web page contents of the target web page, thereby performing processing using the above function modules.
In the application process of the web page information identification system 100, the functions of the above functional modules are as follows:
A path identification module 10, configured to identify and acquire K label paths based on the content of a target web page; each label path comprises a first node to an Nth node, wherein the first node is a child node, each subsequent node up to the Nth node is a child sub-node of the node before it, N represents the hierarchy level of the node, and K and N are positive integers.
In a specific application, the target webpage includes a plurality of webpage tags, which serve as the nodes in the embodiment of the present invention. Any two webpage nodes may have an inclusion relationship, so the webpage nodes in each label path acquired by the path identification module 10 are related by inclusion. For example, the 1st label path includes webpage nodes A, B, C and D, whose relationship is represented as A > B > C > D: the lower-level webpage node of A is B, that of B is C, and that of C is D. Thus A is the first node and D is the fourth node; A is a child node, and D is a child sub-node (descendant) of A.
In one embodiment, the path identifying module 10 includes a node tree establishing unit, a root node traversal unit, a child node traversal unit, and a label path output unit, so as to achieve the acquisition of K label paths in a target web page, and the function units of the path identifying module are as follows:
the node tree establishing unit is used for identifying and acquiring webpage nodes in the target webpage and establishing a node tree according to the association relationships between the webpage nodes;
the root node traversing unit is used for acquiring a root node of the node tree and traversing all child nodes belonging to the root node;
the child node traversing unit is used for acquiring the child sub-nodes belonging to each child node and traversing the node paths from the child nodes to the child sub-nodes;
and a label path output unit for outputting the node path having the largest total number of node levels as a label path.
In one embodiment, the path identification module further comprises a node filtering unit; and the node filtering unit filters preset types of webpage nodes in the target webpage before the root node traversing unit traverses all child nodes belonging to the root node.
Illustratively, the preset type of web page node includes at least one of a div tag, a span tag, a ul tag, and a li tag.
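As a non-limiting sketch of the node filtering unit, the following Python fragment keeps only webpage nodes whose tag is one of the preset types. Two points are assumptions made for this sketch and are not stated in the description: that "filtering" means retaining the preset types and discarding all others, and that the kept descendants of a discarded node are re-attached to the nearest kept ancestor. The names Node, filter_nodes and tags are hypothetical.

```python
from dataclasses import dataclass, field

PRESET_TAGS = {"div", "span", "ul", "li"}  # the tag types named in the text


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def filter_nodes(node, keep=PRESET_TAGS):
    """Return the subtrees that remain after dropping non-preset tags.

    Assumption: a dropped node is replaced by its (filtered) children, so
    kept descendants move up to the nearest kept ancestor.
    """
    kept_children = []
    for child in node.children:
        kept_children.extend(filter_nodes(child, keep))
    if node.tag in keep:
        return [Node(node.tag, kept_children)]
    return kept_children


def tags(node):
    """Nested (tag, [children]) view used only to inspect the result."""
    return (node.tag, [tags(child) for child in node.children])


page = Node("body", [
    Node("script"),
    Node("ul", [
        Node("li", [Node("b", [Node("span")]), Node("div")]),
    ]),
])
result = [tags(n) for n in filter_nodes(page)]
```

In this example the body, script and b nodes are dropped, and the span nested inside b becomes a direct lower-level node of li.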
Fig. 2 shows an exemplary node tree established from the webpage tags of a target webpage. In Fig. 2, the webpage tags of the target webpage include list, item, title, info, author and date, and the established node tree includes one webpage node list, a plurality of webpage nodes item, and webpage nodes title, info, author and date. The webpage nodes item all belong to the webpage node list; the lower-level nodes of a webpage node item are the webpage nodes title and info; and the lower-level nodes of the webpage node info are the webpage nodes author and date. The webpage node list is the root node of the node tree; its preset type is the ul tag, so it is represented in the figure with the ul class. The webpage node item is a child node of the node tree and also a first node; its preset type is the li tag, so it is represented with the li class. The webpage nodes title and info are child sub-nodes and also second nodes; their preset type is the div tag, so they are represented with the div class. The webpage nodes author and date are also child sub-nodes, but are third nodes; their preset type is the span tag, so they are represented with the span class. In summary, the path identification module 10 and its sub-functional units provided in the embodiment of the present invention can obtain 5 label paths, which are represented as:
{li.item};
{li.item>a.title};
{li.item>div.info};
{li.item>div.info>span.author};
{li.item>div.info>span.date}.
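The derivation of these five label paths can be illustrated with the following Python sketch, which builds a node tree from a small HTML fragment and enumerates the distinct paths from each child of the root down to every descendant. The HTML fragment, the class names Node and TreeBuilder, and the function label_paths are hypothetical and were written to match the listed paths (including a.title); the sketch assumes well-formed HTML without void elements, and it lists every path prefix rather than only the deepest path.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, cls, parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def label(self):
        return f"{self.tag}.{self.cls}" if self.cls else self.tag


class TreeBuilder(HTMLParser):
    """Builds a node tree from start/end tags (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", "")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        self.current = Node(tag, cls, self.current)

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def label_paths(root):
    """Distinct paths from each child of `root` down to every descendant."""
    paths = []

    def walk(node, prefix):
        path = prefix + [node.label]
        text = ">".join(path)
        if text not in paths:  # repeated list items yield the same path
            paths.append(text)
        for child in node.children:
            walk(child, path)

    for child in root.children:
        walk(child, [])
    return paths


HTML = """
<ul class="list">
  <li class="item"><a class="title">T1</a>
    <div class="info"><span class="author">A1</span><span class="date">D1</span></div></li>
  <li class="item"><a class="title">T2</a>
    <div class="info"><span class="author">A2</span><span class="date">D2</span></div></li>
</ul>
"""

builder = TreeBuilder()
builder.feed(HTML)
list_node = builder.root.children[0]  # ul.list, the root of the node tree
paths = label_paths(list_node)
```

Because both li.item nodes have the same structure, the second item contributes no new path and K remains 5.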
The path identification module 20 is configured to identify the K label paths through preset node attributes to obtain K identification documents.
The vectorization processing module 30 is configured to convert the K identification documents into K high-dimensional vector sets through vectorization processing.
The clustering processing module 40 is configured to perform clustering processing on the K high-dimensional vector sets to obtain K first candidate node sets.
The path identification module 20, the vectorization processing module 30 and the clustering processing module 40 together process the label paths. The path identification module 20 labels the label paths so that they can be distinguished, and the vectorization processing module 30 converts the resulting identification documents into vectors so that the clustering processing module 40 can cluster the webpage nodes in the label paths. In each first candidate node set, the webpage nodes of one label path are therefore clustered, that is, webpage nodes of the same type are grouped together. For example, in the first candidate node set obtained from the label path {li.item > div.info}, li.item is one type of webpage node and div.info is another type, and the number of nodes of each type is not limited.
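The description does not specify the format of an identification document or the vectorization method. Purely as an assumed illustration, the Python sketch below turns each node of a label path into tokens built from the preset node attributes (tag name, id attribute, class attribute) and then converts each node into a count vector over a shared vocabulary. The token format and the names node_tokens, identification_document and vectorize are hypothetical.

```python
from collections import Counter


def node_tokens(tag, node_id="", classes=""):
    """Tokens for one node built from the preset attributes: tag, id, class."""
    tokens = [f"tag={tag}"]
    if node_id:
        tokens.append(f"id={node_id}")
    tokens.extend(f"class={c}" for c in classes.split())
    return tokens


def identification_document(path):
    """One token list per node of a label path (assumed document format)."""
    return [node_tokens(*node) for node in path]


def vectorize(documents):
    """Map every node of every document to a count vector over one vocabulary."""
    vocab = sorted({tok for doc in documents for node in doc for tok in node})
    index = {tok: i for i, tok in enumerate(vocab)}
    vector_sets = []
    for doc in documents:
        vectors = []
        for node in doc:
            vec = [0] * len(vocab)
            for tok, count in Counter(node).items():
                vec[index[tok]] = count
            vectors.append(vec)
        vector_sets.append(vectors)
    return vocab, vector_sets


# Two of the example label paths, written as (tag, id, class) triples.
paths = [
    [("li", "", "item")],
    [("li", "", "item"), ("div", "", "info")],
]
docs = [identification_document(p) for p in paths]
vocab, vector_sets = vectorize(docs)
```

With K label paths as input, the sketch yields K vector sets, one vector per node in each path; on real pages the vocabulary, and hence the vector dimension, grows with the number of distinct tags, ids and classes.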
For example, based on the above 5 label paths, 5 first candidate node sets may be obtained.
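The clustering algorithm itself is not named in the description. As one possible stand-in, the Python sketch below groups node vectors with a greedy single-pass clustering based on cosine similarity; the threshold value, the function names cosine and cluster, and the example vectors are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.9):
    """Greedy single-pass clustering.

    Each vector joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical node vectors for the label path {li.item > div.info}.
vectors = [
    [0, 1, 0, 1],  # li.item
    [1, 0, 1, 0],  # div.info
    [0, 1, 0, 1],  # li.item
    [1, 0, 1, 0],  # div.info
    [0, 1, 0, 1],  # li.item
]
groups = cluster(vectors)
```

Here the li.item vectors end up in one group and the div.info vectors in another, which corresponds to grouping webpage nodes of the same type together.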
A checking module 50, configured to check K first candidate node sets to obtain M second candidate node sets, where M is a positive integer less than or equal to K.
In a specific application, when the node tree is built from the target webpage, there may be discrete (isolated) webpage nodes. Such discrete webpage nodes do not belong to important webpage data, so the first candidate node sets need to be checked to remove them.
In one embodiment, the check module includes a node check unit. The node check unit is used for traversing the K first candidate node sets; if any two first candidate nodes in the kth first candidate node set do not share the same parent node, the kth first candidate node set is deleted, where k is a positive integer less than or equal to K.
Taking a first candidate node set obtained based on the label path { li. As can be seen, after the first candidate node sets of each category are verified, the number of first candidate node sets may be reduced; in this embodiment, for example, 5 second candidate node sets are obtained from the above 5 label paths.
In a first candidate node set, a discrete node does not share a parent node with the other nodes. The discrete node is therefore removed by the node check unit, which prevents discrete webpage nodes from affecting the selection of the optimal list node set in the webpage information identification module 60.
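One reading of this check, offered only as an assumption, is that a first candidate node set is kept when all of its nodes share a single parent node and is deleted otherwise. The Python sketch below implements that reading; the node names, the parent_of mapping and the functions shares_single_parent and check are hypothetical.

```python
def shares_single_parent(candidate_set, parent_of):
    """True when every node in the set has the same parent node."""
    parents = {parent_of[node] for node in candidate_set}
    return len(parents) == 1


def check(first_candidate_sets, parent_of):
    """Keep only the sets whose nodes all share one parent (assumed rule)."""
    return [s for s in first_candidate_sets if shares_single_parent(s, parent_of)]


# Hypothetical parent relation: three list items under one list node,
# plus two stray nodes that live in different parts of the page.
parent_of = {
    "item1": "list",
    "item2": "list",
    "item3": "list",
    "stray1": "header",
    "stray2": "footer",
}
first_sets = [["item1", "item2", "item3"], ["stray1", "stray2"]]
second_sets = check(first_sets, parent_of)
```

Under this reading the result contains M = 1 second candidate node set out of K = 2 first candidate node sets, consistent with M being less than or equal to K.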
And the web page information identification module 60 is configured to obtain M optimal list node sets by using an optimal selection algorithm based on each second candidate node set.
A web page information output module 70, configured to mark an optimal list node in the M optimal list node sets on the target web page.
The second candidate node set may include a plurality of clustered web page nodes in the same partition result, and therefore, the web page information identification module 60 selects a unique web page node in the same partition result through an optimal selection algorithm.
It should be noted that, when an optimal selection algorithm is used for selection in each second candidate node set, a plurality of list node sets are obtained first, and then an optimal list node set is selected according to the average maximum text length and the average number of text labels in each list node set.
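The description names the two quantities used for selection but not how they are combined. In the Python sketch below, each list node set is modelled as one list of text strings per list item, and the two averages are multiplied to form a score; the multiplication, the data and the names features and select_optimal are assumptions made for illustration.

```python
from statistics import mean


def features(list_node_set):
    """Return (average maximum text length, average number of text labels).

    `list_node_set` holds one list of text strings per list item.
    """
    avg_max_len = mean(
        max((len(text) for text in item), default=0) for item in list_node_set
    )
    avg_labels = mean(len(item) for item in list_node_set)
    return avg_max_len, avg_labels


def select_optimal(list_node_sets):
    """Pick the name of the set with the highest combined score."""

    def score(name):
        avg_max_len, avg_labels = features(list_node_sets[name])
        return avg_max_len * avg_labels  # assumed way of combining the two

    return max(list_node_sets, key=score)


# Hypothetical candidates: a navigation bar and an article list.
candidates = {
    "nav": [["Home"], ["News"], ["About"]],
    "articles": [
        ["Quarterly results published", "Alice", "2021-02-26"],
        ["New office opens", "Bob", "2021-02-25"],
    ],
}
best = select_optimal(candidates)
```

The article list wins here because its items carry longer text and more text labels than the navigation entries.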
For example, assuming that the obtained optimal list node is a list node based on the tag path of { li.item > div.info > span.author } according to the node tree and the tag path shown in fig. 2, the web page content corresponding to the web page node item, the web page content corresponding to the web page node info, and the web page content corresponding to the web page node author will be marked on the target web page.
Each label path comprises a first node to an Nth node, where the first node is a child node, each subsequent node is a child sub-node of the node before it, and N represents the hierarchy level of the node. The first candidate nodes in a first candidate node set, the second candidate nodes in a second candidate node set, and the optimal list nodes in an optimal list node set obtained from a label path therefore also carry this hierarchy.
In one embodiment, the web page information output module 70 includes an identification assignment unit; the identification allocation unit is used for allocating the same identification mode to the optimal list nodes of the same level.
Exemplarily, the identification mode includes framing the web page content corresponding to the optimal list node on the target web page; different identification modes use different frame colors, and the same identification mode uses the same frame color.
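The identification allocation can be illustrated with the short Python sketch below, which gives every optimal list node of the same level the same frame color and gives different levels different colors. The palette, the node names and the function assign_frame_colours are hypothetical; how the frame is actually drawn on the target web page is outside the scope of the sketch.

```python
PALETTE = ["red", "green", "blue", "orange"]  # hypothetical frame colours


def assign_frame_colours(node_levels, palette=PALETTE):
    """Map each node to a colour so that equal levels share one colour.

    `node_levels` maps a node name to its hierarchy level (1, 2, 3, ...).
    """
    levels = sorted(set(node_levels.values()))
    if len(levels) > len(palette):
        # Reusing a colour would make two levels indistinguishable.
        raise ValueError("not enough distinct colours for the number of levels")
    colour_of_level = dict(zip(levels, palette))
    return {node: colour_of_level[level] for node, level in node_levels.items()}


marks = assign_frame_colours(
    {"li.item": 1, "div.info": 2, "span.author": 3, "span.date": 3}
)
```

In this example span.author and span.date are both third nodes and therefore receive the same frame color.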
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the foregoing embodiments illustrate the present invention in detail, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A web page information identification system, comprising:
the path identification module is used for identifying and acquiring K label paths based on the content of the target webpage; each label path comprises a first node to an Nth node, wherein the first node is a child node, each subsequent node up to the Nth node is a child sub-node of the node before it, N represents the hierarchy level of the node, and K and N are positive integers;
the path identification module is used for identifying K label paths through preset node attributes to obtain K identification documents;
the vectorization processing module is used for converting the K identification documents into K high-dimensional vector sets through vectorization processing;
the clustering processing module is used for clustering the K high-dimensional vector sets to obtain K first candidate node sets;
the checking module is used for checking the K first candidate node sets to obtain M second candidate node sets, wherein M is a positive integer less than or equal to K;
the webpage information identification module is used for acquiring M optimal list node sets by using an optimal selection algorithm based on each second candidate node set;
and the webpage information output module is used for marking the optimal list nodes in the M optimal list node sets on the target webpage.
2. The web page information identification system of claim 1, wherein the path identification module includes a node tree establishment unit, a root node traversal unit, a child node traversal unit, and a label path output unit;
the node tree establishing unit identifies and acquires webpage nodes in a target webpage and establishes a node tree according to the association relationships between the webpage nodes;
the root node traversal unit is used for acquiring a root node of the node tree and traversing all child nodes belonging to the root node;
the child node traversal unit is used for acquiring the child sub-nodes belonging to each child node and traversing the node paths from the child nodes to the child sub-nodes;
the label path output unit is configured to output a node path having a largest total number of node levels as the label path.
3. The web page information identifying system of claim 2, wherein the path identifying module further comprises a node filtering unit;
the node filtering unit is used for filtering preset types of webpage nodes in the target webpage before the root node traversing unit traverses all child nodes belonging to the root node.
4. A web page information recognition system as claimed in claim 3, wherein the preset type of web page node includes at least one of a div tag, a span tag, a ul tag, and a li tag.
5. The web page information identification system of claim 1, wherein the check module includes a node check unit;
the node checking unit is used for traversing K first candidate node sets;
if any two first candidate nodes in the kth first candidate node set do not share the same parent node, deleting the kth first candidate node set;
wherein k is a positive integer less than or equal to K.
6. A web page information recognition system according to claim 1, wherein the web page information output module includes an identification assignment unit;
the identification allocation unit is used for allocating the same identification mode to the optimal list nodes of the same level.
7. The system for identifying webpage information of claim 6, wherein the identification means comprises framing out the webpage content corresponding to the optimal list node on the target webpage;
and the frame selection colors used by different identification modes are different, and the frame selection colors used by the same identification mode are the same.
8. The web page information identification system of claim 1, wherein the preset node attribute includes at least one of a tag name, an id attribute, and a class attribute.
CN202110217963.7A 2021-02-26 2021-02-26 Webpage information identification system Pending CN112784135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217963.7A CN112784135A (en) 2021-02-26 2021-02-26 Webpage information identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217963.7A CN112784135A (en) 2021-02-26 2021-02-26 Webpage information identification system

Publications (1)

Publication Number Publication Date
CN112784135A true CN112784135A (en) 2021-05-11

Family

ID=75761951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217963.7A Pending CN112784135A (en) 2021-02-26 2021-02-26 Webpage information identification system

Country Status (1)

Country Link
CN (1) CN112784135A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102467501A (en) * 2010-10-29 2012-05-23 北大方正集团有限公司 Method and system for extracting news record metadata from news list page
CN103116591A (en) * 2011-11-17 2013-05-22 北大方正集团有限公司 Forum post content extraction method and extraction device
CN104965901A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Method and apparatus for grabbing content of target page
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN108694192A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN108733813A (en) * 2018-05-21 2018-11-02 山东管理学院 Information extracting method, system towards BBS forum Web pages contents and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304457A (en) * 2023-02-27 2023-06-23 山东乾舜广告传媒有限公司 Marking method for webpage multiple information attribute
CN116304457B (en) * 2023-02-27 2024-03-29 山东乾舜广告传媒有限公司 Marking method for webpage multiple information attribute

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination