CN113343140B

CN113343140B - Method for automatically extracting webpage text content based on neo4j graph database

Info

Publication number: CN113343140B
Application number: CN202010138403.8A
Authority: CN
Inventors: 刘亮; 李萧洋; 郑荣锋; 李孟铭
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2022-12-13
Anticipated expiration: 2040-03-03
Also published as: CN113343140A

Abstract

The invention discloses a method for automatically extracting webpage text content based on a neo4j graph database. The method comprises the following steps: step S101, acquiring the HTML source code of webpages from open sources by simulating browser requests, to serve as a training set; step S102, extracting the HTML tags and converting the HTML source code into a tree structure; step S103, traversing all nodes in the tree and extracting triples that represent the relationships between nodes; step S104, converting the relationship triples into a graph by using a neo4j graph database; step S105, removing redundant nodes from the graph through node compression and branch compression; step S106, extracting multi-dimensional features and training a text node classification model through machine learning; and step S107, extracting the text node of a webpage by using the classification model, and restoring the complete webpage text content from the child nodes of the text node in order. The invention provides a simple and easy-to-use method for extracting the text content of a webpage accurately and efficiently.

Description

Method for automatically extracting webpage text content based on neo4j graph database

Technical Field

The invention relates to the field of computer applications and webpage content extraction, and in particular to a method for automatically extracting webpage text content based on a neo4j graph database.

Background

With the development of network technology, the internet has become an information-sharing platform that crosses the boundaries of space and time, and the text content of webpages is an important source from which people quickly obtain information. Webpage text content extraction is applied ever more widely: ordinary users obtain the information they want directly from webpage text through search engines, and other work built on webpage processing, such as text mining, artificial intelligence, and search engines, presupposes efficient and accurate acquisition of webpage text content.

Because most existing websites use templates or styles that are unrelated to the text content in order to improve readability, the text content of a webpage is often mixed with webpage noise such as advertisement links and navigation templates. Extracting the text content from this noise quickly and accurately reduces the burden on the general public when acquiring information and helps other applications that provide services based on webpage content to work more efficiently. Therefore, how to automatically extract the text content from webpages with complex designs and apply it in specific fields is a problem that those skilled in the art urgently need to solve.

Existing methods for extracting webpage text content can be classified into three categories according to the information they rely on: algorithms based on text information, algorithms based on visual information, and algorithms based on the Document Object Model (DOM).

The main idea of the algorithms based on text information is as follows: if a webpage is divided into several areas, the text density of the body area is much higher than that of the other areas; furthermore, if the entire webpage is converted to plain text, the lines containing body content generally lie close together and contain a large number of punctuation marks. Although this approach is simple to implement, text located near the body content, such as copyright statements, may also be identified as body content, so the approach has certain limitations in practical applications.

The main idea of the algorithms based on visual information is as follows: when browsing a webpage, a user treats a semantic block as a single object and often relies on visual information, such as font size, font color, background, and tables or lists, to distinguish semantic blocks. The visual information is combined with the DOM, the webpage is divided into several blocks, and the proportion of text nodes among the leaf nodes of each block is calculated to judge whether the block is a body block. However, this approach must obtain the visual factors of the page, so the amount of computation is large; furthermore, if the visual factors of the page are controlled by separate files such as CSS, extraction efficiency may be low.

The main idea of the algorithms based on the DOM is as follows: a DOM tree is generated by extracting the tags in the HTML source code, the webpage is divided into several blocks, and the relationships among the nodes in the tree and the content they store are analyzed to extract the text content. This approach is mostly suitable for websites with a good coding style and consistent layout; because HTML focuses on how to display information rather than on how to divide a webpage into blocks, the approach generalizes poorly.

Disclosure of Invention

In view of the shortcomings of the existing schemes, the present application aims to provide a method for automatically extracting the text content of a webpage that is simple to operate, efficient, and accurate.

To solve the above technical problem, the present application provides a method for automatically extracting webpage text content based on a neo4j graph database, which comprises the following steps:

Step S101: HTML source code is acquired from open sources, and an HTML processing technique removes CSS styles and other material unrelated to the article text, producing an HTML text file that contains only HTML tags. Tags such as <div> and <table> are then extracted from the HTML text, and the source HTML is converted into a tree structure according to the hierarchical relationship of the tags.

Step S102: All nodes in the tree are traversed, and triples representing the relationships among the tags are extracted according to the connections between nodes and the order of the child nodes. The relationship triple has the structure (src, r, dst), where "src" and "dst" each represent a node in the tree structure and "r" represents the relationship between the two nodes. "src" contains the tag and unique identifier of the parent node; "dst" contains the tag and unique identifier of the child node, the content stored by the child node, and the position of the child node within the set of child nodes to which it belongs.

Step S103: The relationship triples are converted into a graph structure by using a neo4j graph database, and some redundant end nodes are removed.

Step S104: In the graph structure, the empty nodes directly connected to end nodes are divided into two types according to the number of end nodes connected to them, and node compression and branch compression are applied to the two types respectively to obtain the compressed graph structure.

Step S105: The node quantity feature and the average text length feature are extracted from the compressed graph.

Step S106: The features are combined into a feature vector, a text node classification model is trained by using an MLP (multilayer perceptron) model, and the nodes of a webpage are classified with the model to extract the text node.

Step S107: Starting from the extracted text node, the content of its child nodes is restored in order to obtain the complete text content of the webpage.

Further, in step S102, the relationship triples are generated as follows:

Step S201: For the preprocessed and converted tree structure, all nodes are traversed in a loop, the left-to-right order of the child nodes under each node is recorded, and for every node under the <html> tag a record of the form {parent node, 'connection relation', [child node, xth child node]} is generated;

Step S202: The records are converted into relationship triples and stored in the neo4j graph database. The triple has the structure (src, r, dst), where "src" contains the tag and unique identifier of the parent node in the tree, and "dst" contains the tag and unique identifier of the child node, the content stored by the child node, and the position of the child node within the set of child nodes to which it belongs.

Still further, in step S103, the graph structure is generated as follows:

Step S301: The relationship triples are stored in the neo4j graph database, with "src" and "dst" as the vertices of the graph and "r" as the edges representing the connections between nodes, which generates the graph structure corresponding to the HTML text;

Step S302: In the generated graph structure, every node containing body text is an end node of the graph. End nodes that contain only empty text are removed in a loop, together with the nodes without the context attribute that become end nodes as a result (namely the parent nodes of the removed end nodes).

Further, in step S104, node compression and branch compression proceed as follows:

Step S401: For an empty node connected to a single end node in the graph structure, the end node is connected directly to the parent node of the empty node and the empty node is deleted; this is called node compression.

Step S402: For an empty node connected to two or more end nodes in the graph structure, the empty node is connected directly to its grandparent node and its empty parent node is deleted; this is called branch compression.

Further, in step S105, the process of extracting the multidimensional feature includes:

Step S501: The number of non-empty nodes connected to a text node is far greater than the number connected to a non-text node, so a node with more connected non-empty nodes is more likely to be the text node;

Step S502: Because a webpage has only one text node, and the average text length of the nodes containing recommended content is far shorter than that of a normal text node, a node with a longer average text length is more likely to be the text node.

The invention provides a method for extracting webpage text content, which comprises: preprocessing the HTML source code of a webpage to remove style templates unrelated to the text content and generate an HTML text file; extracting the tags in the HTML text and generating a tree structure of the source HTML by an HTML processing technique; traversing all nodes in the tree to extract the relationships among the nodes and the order of the child nodes, and generating relationship triples; converting the relationship triples into a graph by using a neo4j graph database, and applying node compression and branch compression to some nodes of the graph to remove redundant empty nodes; extracting features from the graph by natural language processing techniques, training an MLP classification model, and separating the text node from the large number of other nodes; and restoring the content of all child nodes of the text node in order to extract the complete text content of the webpage.

Drawings

Fig. 1 is a flowchart of a method for extracting text content of a web page according to an embodiment of the present application.

Fig. 2 is a storage format diagram of a relationship triplet generated in the present application.

Fig. 3 is a schematic diagram of node compression described in the present application.

Fig. 4 is a schematic diagram of the branch compression described in the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

Referring to fig. 1, fig. 1 is a flowchart of a method for extracting text content of a web page according to an embodiment of the present application.

The method comprises the following specific steps.

Step S101: The HTML source code of a webpage is acquired from open sources by simulating browser requests, the source code is preprocessed to obtain the HTML tags, and the webpage source code is converted into a tree structure.

Template styles such as CSS that are unrelated to the text content are removed by an HTML file processing technique so that only the plain HTML text remains; tags such as <div> and <table> are then extracted from the HTML text with Beautiful Soup, and a tree structure is generated according to the hierarchical relationship among the tags.
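The tree-building step can be illustrated with a minimal sketch. The description names Beautiful Soup; the sketch below instead uses Python's standard-library html.parser so that it runs without third-party packages, which is a substitution and not the tool the embodiment uses. The names Node, TreeBuilder, build_tree, SKIPPED, and VOID are hypothetical, as is the choice of which tags to discard.

```python
from html.parser import HTMLParser

# Tags whose content is styling or scripting rather than article text (assumed list).
SKIPPED = {"style", "script"}
# Void elements never get a closing tag, so the builder must not descend into them.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # child nodes in document (left-to-right) order
        self.text = ""       # text stored directly in this node


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID:
            return
        # Climb to the matching open tag; ignore end tags that match nothing.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text += data.strip()


def build_tree(html):
    """Parse HTML into a Node tree, dropping style and script content."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling build_tree on a page returns a synthetic "document" root whose first child is normally the <html> node; each node keeps its children in source order, which is the ordering that the later triple-extraction step records.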

Step S102: The nodes in the tree are traversed, and triples are extracted according to the connections between nodes and the order of the child nodes.

First, from the generated tree structure, all first-level nodes under the original <html> tag, that is, the child nodes directly connected to the <html> tag, are obtained. The relationship between each first-level node and the <html> tag is recorded as subtag, and the order of the first-level nodes is recorded.

Then, all child nodes under the first-level nodes are traversed in a loop; the relationship between each node and its child nodes is recorded as subtag, and the order among the child nodes is recorded.

Finally, the result is converted into relationship triples and stored in the neo4j graph database in the format shown in fig. 2. In fig. 2, "src" and "dst" each represent a node in the tree structure, and "r" represents the connection relationship between the two nodes. "src" contains the tag and unique identifier of the parent node; "dst" contains the tag and unique identifier of the child node, the content stored by the child node, and the position of the child node within its set of sibling nodes. The unique_id field serves as the identifier, and child_sequence records the order of the child node.
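As an illustration of the triple format, the following sketch walks a small tree and emits (src, r, dst) triples. The tree representation (nested (tag, text, children) tuples), the function name extract_triples, and the dictionary keys are assumptions made for the example; the key "context" is borrowed from the description's mention of a context attribute, and the exact field layout of fig. 2 is not reproduced here.

```python
from itertools import count


def extract_triples(tree):
    """Walk a (tag, text, children) tree and emit (src, r, dst) triples.

    Every node receives a unique_id; every child also records its
    1-based position among its siblings as child_sequence.
    """
    ids = count(1)
    triples = []

    def visit(node, node_id):
        tag, _text, children = node
        for position, child in enumerate(children, start=1):
            child_tag, child_text, _grandchildren = child
            child_id = next(ids)
            src = {"tag": tag, "unique_id": node_id}
            dst = {
                "tag": child_tag,
                "unique_id": child_id,
                "context": child_text,        # content stored by the child
                "child_sequence": position,   # left-to-right order
            }
            triples.append((src, "subtag", dst))
            visit(child, child_id)

    visit(tree, next(ids))
    return triples


page = ("html", "", [("body", "", [("p", "Hello", []), ("p", "World", [])])])
for src, relation, dst in extract_triples(page):
    print(src["tag"], relation, dst["tag"], dst["child_sequence"])
```

The recursion follows the nesting depth of the page, which is adequate for a sketch; an explicit stack would be the safer choice for pathologically deep documents.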

Step S103: The relationship triples are converted into a graph through the neo4j graph database, and the graph is simplified.

The src and dst of each relationship triple are taken as vertices of the graph and r as an edge, which generates the graph structure corresponding to the HTML text.
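One way to picture the mapping from a triple to graph elements is as a parameterized Cypher statement. The sketch below only builds the statement and its parameters; it does not connect to a database. The node label Tag, the relationship type SUBTAG, the function name triple_to_statement, and the dictionary keys are all assumptions for illustration, not names taken from the patent.

```python
# Cypher that upserts both vertices and the edge between them.
# Relationship types cannot be passed as parameters in Cypher,
# so the (assumed) type SUBTAG is written into the statement.
CYPHER = """
MERGE (s:Tag {unique_id: $src_id})
SET s.tag = $src_tag
MERGE (d:Tag {unique_id: $dst_id})
SET d.tag = $dst_tag, d.context = $context, d.child_sequence = $child_sequence
MERGE (s)-[:SUBTAG]->(d)
""".strip()


def triple_to_statement(triple):
    """Return (query, parameters) that would store one (src, r, dst) triple."""
    src, _relation, dst = triple
    params = {
        "src_id": src["unique_id"],
        "src_tag": src["tag"],
        "dst_id": dst["unique_id"],
        "dst_tag": dst["tag"],
        "context": dst["context"],
        "child_sequence": dst["child_sequence"],
    }
    return CYPHER, params


example = (
    {"tag": "div", "unique_id": 7},
    "subtag",
    {"tag": "p", "unique_id": 8, "context": "Hello", "child_sequence": 1},
)
query, params = triple_to_statement(example)
print(query)
print(params)
```

Using MERGE rather than CREATE keeps a vertex unique when it appears as dst in one triple and as src in another; the resulting pair would then be handed to whatever neo4j client the implementation uses.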

According to the structural characteristics of HTML, in the generated graph structure every node containing text is necessarily an end node of the graph. On this basis, end nodes whose text is empty are removed in a loop, together with the nodes without the context attribute that become end nodes as a result (i.e., the parent nodes of the removed end nodes). The main purpose of this step is to reduce redundancy and the complexity of subsequent processing.
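The looped removal can be sketched on an in-memory copy of the graph. The representation (a mapping from node id to ordered child ids, plus a mapping from node id to stored text) and the function name prune_empty_end_nodes are assumptions for illustration; the embodiment performs this step inside the graph database.

```python
def prune_empty_end_nodes(children, context):
    """Repeatedly drop end nodes that store no text.

    children: node id -> ordered list of child ids
    context:  node id -> stored text (missing or blank means empty)
    Returns a new mapping; the inputs are left unchanged.
    """
    kept = {node: list(kids) for node, kids in children.items()}
    while True:
        # End nodes (no children) with empty text in the current graph.
        empty_ends = {
            kid
            for kids in kept.values()
            for kid in kids
            if not kept.get(kid) and not context.get(kid, "").strip()
        }
        if not empty_ends:
            return kept
        for node in list(kept):
            if node in empty_ends:
                del kept[node]
            else:
                kept[node] = [k for k in kept[node] if k not in empty_ends]


children = {1: [2, 3], 2: [4], 3: [5], 4: [], 5: []}
context = {4: "", 5: "Body text"}
print(prune_empty_end_nodes(children, context))
```

In the example, node 4 is removed first; node 2 then becomes an empty end node and is removed on the next pass, while the branch leading to the text in node 5 survives.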

Step S104: Node compression and branch compression are applied to the graph to remove further redundant nodes.

Nodes that stand in a parallel structural relationship in the HTML structure are attached to the same node, and nodes on the same subject are attached to the same node or to nearby nodes. All nodes containing body text are therefore associated with one and the same node, either directly or within a certain distance. Based on this idea, the compression steps eliminate the scattering of data caused by the HTML tag structure.

(1) The specific steps of node compression are shown in fig. 3.

For an empty node connected to a single end node in the graph structure, the end node is connected directly to the parent node of the empty node and the empty node is deleted; this is called node compression. Referring to fig. 3, the parent node and grandparent node of empty node B, which is connected to a single end node, are both empty, so node B is a redundant empty node. Node B is deleted and the end node is connected directly to the parent node of node B.

(2) The specific steps of branch compression are shown in fig. 4.

For an empty node connected to two or more end nodes in the graph structure, the empty node is connected directly to its grandparent node and its empty parent node is deleted; this is called branch compression. Referring to fig. 4, the parent node and grandparent node of empty node C, which is connected to several end nodes, are both empty, so the parent node of node C is a redundant empty node. The parent node of node C is deleted and node C is connected directly to its grandparent node.
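The two compression rules can be sketched as single-node operations on an in-memory graph. The representation (children, parent, and context mappings) and the function names are assumptions for illustration. The sketch also adds one simplifying assumption that the text does not state: branch compression is applied only when the empty node is the sole child of its parent, so that deleting the parent cannot orphan other nodes.

```python
def is_end(children, node):
    return not children.get(node)


def is_empty(context, node):
    return not context.get(node, "").strip()


def node_compress(children, parent, context, b):
    """Delete empty node b that is connected to exactly one end node."""
    kids = children.get(b, [])
    if not (is_empty(context, b) and len(kids) == 1
            and is_end(children, kids[0]) and b in parent):
        return False
    end_node, p = kids[0], parent[b]
    siblings = children[p]
    siblings[siblings.index(b)] = end_node   # end node takes b's position
    parent[end_node] = p
    del children[b], parent[b]
    return True


def branch_compress(children, parent, context, c):
    """Delete the empty parent of empty node c, which has >= 2 end nodes."""
    end_kids = [k for k in children.get(c, []) if is_end(children, k)]
    p = parent.get(c)
    g = parent.get(p)   # None when c has no parent or no grandparent
    if not (is_empty(context, c) and len(end_kids) >= 2 and g is not None
            and is_empty(context, p) and children[p] == [c]):
        return False
    siblings = children[g]
    siblings[siblings.index(p)] = c          # c takes its parent's position
    parent[c] = g
    del children[p], parent[p]
    return True


# Chain 1 -> 2 -> 3 -> {4, 5}: node 3 holds two end nodes, node 2 is redundant.
children = {1: [2], 2: [3], 3: [4, 5], 4: [], 5: []}
parent = {2: 1, 3: 2, 4: 3, 5: 3}
context = {4: "first paragraph", 5: "second paragraph"}
branch_compress(children, parent, context, 3)
print(children)
```

Both functions return False and leave the graph untouched when their preconditions do not hold, so a caller can apply them to every empty node without checking the node type first. How often the rules are re-applied after a change is left open here, since the text does not specify it.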

Step S105: Multi-dimensional features are extracted from the nodes of the graph structure by natural language processing techniques. The specific steps are as follows.

(1) The number of non-empty nodes connected to each node is calculated as the first feature. In the generated graph, the number of non-empty nodes connected to a text node should be much larger than the number connected to a non-text node.

(2) The average text length of each node is calculated as the second feature. A webpage has one and only one text node, but many websites also carry recommended content or rated content related to the subject; although this content is not part of the webpage's text content, it also produces a graph structure. The nodes containing recommended content share one characteristic: their average text length is much shorter than that of a normal text node.
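The two features can be computed per node as in the sketch below. It assumes that "connected" nodes means a node's child nodes, that text length is measured in characters, and that the average is taken over non-empty children only; none of these choices is fixed by the description. The function name node_features and the sample graph are hypothetical.

```python
def node_features(children, context):
    """Return node id -> (non_empty_child_count, average_child_text_length)."""
    features = {}
    for node, kids in children.items():
        texts = [context.get(k, "").strip() for k in kids]
        non_empty = [t for t in texts if t]
        count = len(non_empty)
        avg_len = sum(len(t) for t in non_empty) / count if count else 0.0
        features[node] = (count, avg_len)
    return features


children = {
    "body": ["article", "nav"],
    "article": ["p1", "p2"],
    "nav": ["a1", "a2", "a3"],
}
context = {
    "p1": "x" * 40, "p2": "y" * 60,          # long paragraphs
    "a1": "Home", "a2": "News", "a3": "About",  # short link labels
}
for node, (count, avg_len) in node_features(children, context).items():
    print(node, count, round(avg_len, 1))
```

In this toy graph the navigation node has more non-empty children than the article node but a far shorter average text length, which shows why the two features are combined into one vector instead of being used separately.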

Step S106: The HTML source code of webpages is acquired from open sources by simulating browser requests to collect training samples, the multi-dimensional features are extracted, and an MLP classification model is trained by machine learning.

First, the HTML source code of webpages is acquired from open sources by simulating browser requests and a certain number of webpage samples are collected; the multi-dimensional features of the samples are extracted by the method of step S105, and an MLP classification model is trained on the extracted features. The trained MLP classification model divides all nodes of the graph into text nodes and non-text nodes according to the features of text nodes.
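To make the training step concrete, the following is a minimal pure-Python sketch of a multilayer perceptron with one hidden layer, trained on two-dimensional feature vectors. The description does not state the network architecture, activation functions, learning rate, or feature scaling; all of those, the function name train_mlp, and the toy samples are assumptions, and a practical implementation would more likely use an existing machine learning library.

```python
import math
import random


def train_mlp(samples, hidden=4, epochs=2000, lr=0.5, seed=0):
    """Fit a one-hidden-layer MLP by stochastic gradient descent.

    samples: list of ((count_feature, length_feature), label) with label
             in {0, 1}; both features are assumed pre-scaled to about [0, 1].
    Returns predict(x), the estimated probability that x is a text node.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # tanh hidden layer followed by a sigmoid output unit
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        z = sum(w * v for w, v in zip(w2, h)) + b2
        return h, 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        for x, y in samples:
            h, p = forward(x)
            dz = p - y  # gradient of cross-entropy loss w.r.t. the output logit
            for j in range(hidden):
                da = dz * w2[j] * (1.0 - h[j] ** 2)  # backpropagate through tanh
                w2[j] -= lr * dz * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * da * x[i]
                b1[j] -= lr * da
            b2 -= lr * dz

    return lambda x: forward(x)[1]


# Toy, hand-made samples: (scaled node count, scaled average text length) -> label
data = [((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.2, 0.1), 0), ((0.3, 0.05), 0)]
predict = train_mlp(data)
print(predict((0.85, 0.8)), predict((0.25, 0.1)))
```

Because a webpage is taken to have exactly one text node, one plausible use of the returned probabilities is to pick the highest-scoring node of each page; that selection rule is a design choice of this sketch and not a stated part of the method.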

Step S107: For the extracted text node, the content of its child nodes is restored in the order recorded by child_sequence in the database, yielding the complete text content.
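The restoration step amounts to sorting the child records of the text node by child_sequence and concatenating their stored content. In the sketch below the record layout, the parent_id key, the newline separator, and the function name restore_text are assumptions for illustration; in the embodiment the records would come from a query against the graph database.

```python
def restore_text(records, text_node_id):
    """Join the stored text of a node's children in child_sequence order."""
    kids = [r for r in records if r["parent_id"] == text_node_id]
    kids.sort(key=lambda r: r["child_sequence"])
    return "\n".join(r["context"] for r in kids if r["context"].strip())


records = [
    {"parent_id": 10, "unique_id": 13, "child_sequence": 2, "context": "Second paragraph."},
    {"parent_id": 10, "unique_id": 12, "child_sequence": 1, "context": "First paragraph."},
    {"parent_id": 99, "unique_id": 50, "child_sequence": 1, "context": "Sidebar link"},
]
print(restore_text(records, 10))
```

Sorting by the recorded sequence matters because a graph store does not promise to return child nodes in document order.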

The above examples are merely illustrative of the technical solutions of the present invention and not restrictive, and a person skilled in the art may modify the technical solutions of the present invention or substitute them with equivalents without departing from the spirit and scope of the present invention, which should be determined by the claims.

Claims

1. A method for automatically extracting webpage text content based on a neo4j graph database, characterized by comprising the following steps:

step S101, acquiring HTML source code of a webpage from an open source channel by using a simulated browser request technology, preprocessing the HTML source code to acquire HTML tags, and converting the webpage source code into a tree structure;

step S102, traversing all nodes in the tree structure, and extracting a triple representing the relationship between the nodes according to the connection relationship between the nodes and the sequence relationship between the sub-nodes;

step S103, converting the relation triple structure into a graph structure by using a neo4j graph database;

step S104, in the graph structure, dividing the empty nodes directly connected with the end nodes into two types according to the number of the end nodes connected with the empty nodes, and respectively performing node compression and branch compression;

step S105, extracting node quantity characteristics and average text length characteristics from the compressed graph to generate a characteristic vector;

step S106, using the characteristic vector to perform machine learning, training a text node classification model, and classifying nodes in the webpage by using the classification model so as to automatically extract text nodes in the webpage;

and S107, sequentially restoring the contents in the child nodes according to the sequence of the child nodes of the extracted text nodes, and extracting the complete text contents of the webpage.

2. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 1, wherein said step S102 further comprises the steps of:

step S201, for the preprocessed and converted tree structure, circularly traversing all nodes in the tree structure, for all child nodes under each node, sequentially recording the sequence from left to right, and for all nodes under the <html> tag, generating records in the form of { parent node, 'connection relation', [ child node, xth child node ] };

step S202, converting the records into a relation triple structure and storing the relation triple structure into a neo4j graph database, wherein the structure of the relation triple structure is as follows: (src, r, dst), wherein "src" and "dst" respectively represent nodes in the tree structure, and "r" represents a connection relationship between two nodes; the label and the unique identifier of the parent node in the tree are included in the "src", and the label and the unique identifier of the child node in the tree, the specific content stored by the child node, and the sequence of the child node in the child node set where the child node is located are included in the "dst".

3. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 2, wherein said step S103 further comprises the steps of:

step S301, storing the relation triple into a neo4j graph database, wherein the "src" and the "dst" are used as vertexes of a graph, and the "r" is used as an edge of the graph, so as to generate a graph structure corresponding to an HTML text;

step S302, in the generated graph structure, all nodes containing the text of the body exist in the end node of the graph, and the end node only containing the empty text and the nodes without the context attribute generated by the end node are circularly removed.

4. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 1, wherein said step S104 further comprises the steps of:

step S401, for an empty node connected to a single end node in the graph structure, directly connecting the single end node to the parent node of the empty node and deleting the empty node, which is called node compression;

step S402, for an empty node connected to two or more end nodes in the graph structure, directly connecting the empty node to its grandparent node and deleting the empty parent node of the empty node, which is called branch compression.

5. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 1, wherein said step S105 further comprises the steps of:

step S501, calculating the number of nodes which are not empty and connected with each node as a first feature;

step S502, calculating the average text length of the connected nodes of each node as a second characteristic;

step S503, combining the first characteristic and the second characteristic to generate a characteristic vector, and training by using a machine learning model to obtain a text node classification model;

step S504, for a webpage whose text content is to be extracted, extracting the first feature and the second feature, and dividing the nodes in the graph into text nodes and non-text nodes through the text node classification model, wherein only one text node can be extracted from each webpage.

6. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 1, wherein the step S106 further comprises the steps of:

step S601, extracting the multi-dimensional features in the sample by using the method in the step S105, and training a machine learning model by using the extracted features;

step S602, for the webpage of which the text content is to be extracted, dividing all nodes in the webpage into text nodes and non-text nodes according to the characteristics of the text nodes by using a trained classification model.

7. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 2, wherein said step S107 further comprises the steps of:

and recovering the contents in the child nodes according to the child node sequence recorded in the step S202 for all the child nodes connected with the text node, thereby extracting the text contents of the webpage.