CN110851679A - Method and system for extracting webpage text based on text node characteristics

Info

Publication number: CN110851679A
Application number: CN201910947241.XA
Authority: CN
Inventors: 杨永全; 翟世平; 魏志强
Original assignee: Ocean University of China; Qingdao National Laboratory for Marine Science and Technology Development Center
Current assignee: Ocean University of China; Qingdao National Laboratory for Marine Science and Technology Development Center
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2020-02-28

Abstract

The invention discloses a method and a system for extracting webpage text based on text node characteristics, and belongs to the field of internet technology. The method comprises the following steps: acquiring the HTML source code of a webpage to be extracted; filtering the HTML source code, extracting the element key nodes in the HTML DOM tree of the source code, and building a list of the element key nodes; obtaining, for each element key node, the probability that its node value is a text node attribute value, and sorting the nodes by that probability; and extracting the text elements of the element key nodes in the sorted order to obtain a candidate webpage text, which is then determined to be the webpage text. The extraction process takes into account the important role that the attribute nodes of HTML DOM tree elements play in marking the text node: the key attribute values (id and class) of the webpage nodes are compared with the attribute value characteristics of text nodes, so that the text node is located accurately and the text is extracted accurately with the help of an HTML parser.

Description

Method and system for extracting webpage text based on text node characteristics

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method and a system for extracting a webpage text based on text node features.

Background

Large-scale processing of WEB information has created demand for intelligent WEB information retrieval, automatic document summarization, public opinion analysis, and similar applications. All of these involve collecting and analyzing vast numbers of WEB pages on the internet. Such technology generally uses web crawlers to fetch the original web pages, and the fetched pages usually contain, in addition to the text the user is interested in, various kinds of noise data such as advertisement links, tag information, navigation links, and comments. This noise greatly reduces the efficiency of network retrieval and also slows down human reading. Accurately and efficiently extracting article text from semi-structured, highly heterogeneous HTML source files is therefore of great significance for internet-based data mining, information retrieval, and related fields.

With the rapid development of the internet, the amount of data carried by the WEB grows day by day, and problems such as redundant information, varied forms, and difficult processing become more and more prominent; WEB information extraction arose in response. Because a WEB page contains a large amount of information irrelevant to its subject, users are hindered in quickly locating and obtaining the text content. Extracting the text of a page is therefore very important: it saves users a great deal of time and effort, and the extracted text can also be used for data mining and other purposes. WEB information extraction mainly targets unstructured or semi-structured WEB pages, and most mainstream approaches are based on the HTML structure. In existing research, however, work that focuses on HTML elements ignores the influence that the semantic information of an element's attributes has on the content the element contains. As a result, the text node cannot be found correctly, the text content is difficult to extract, and extraction efficiency is low.

Webpage text extraction technology:

In the field of webpage text extraction at present, an HTML page can be parsed into a DOM tree: every tag, piece of text, and so on in the page becomes a node in the tree, so that data extraction becomes an operation on a tree. Because of this structural advantage, information extraction based on the HTML structure has gradually become the mainstream of research, and methods that extract webpage text based on statistical learning and text features work well.

Such a method, which is used for both single-text and multi-text web pages, builds the webpage into a tag tree, obtains the paths from the root node to the leaf nodes (only leaf nodes that contain text) through statistical learning, automatically learns the text features on those paths, and finds the paths that share the same text features in order to locate the text region and its subtree trunk. It then finds similar subtree trunks in the text region according to the learned text features, and finally prunes the content of the text region to obtain the main information of the page. Although this method can extract the text effectively, the paths must be labeled in advance, the learning process is long, and the method is not suitable for blog web pages.

String similarity measurement techniques:

String similarity measurement finds the common substrings of two strings and uses the length of those common substrings in a corresponding formula to measure how similar the two strings are. String similarity is widely applied in many fields, for example plagiarism detection systems, automatic scoring systems, code plagiarism prevention systems, data cleaning, web search, and DNA sequence matching. Many string similarity algorithms exist, such as the edit distance algorithm, the longest common substring algorithm, the Heckel algorithm, the greedy string matching algorithm, and the RKR-GST algorithm. Because they work on different principles, they produce different similarity values for the same strings and are therefore suited to different application fields.
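As an illustration of one of the measures listed above, the following is a minimal sketch of the edit distance (Levenshtein) algorithm together with one common way of turning the distance into a similarity in the range 0 to 1. The function names and the normalization by the longer string's length are assumptions made for this example; the source does not specify a formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

For attribute values such as `article-body` and `article_body`, which differ in a single character, this normalization gives a similarity close to 1.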

Disclosure of Invention

To solve the above problems, the invention provides a method for extracting webpage text based on text node characteristics, which comprises the following steps:

acquiring the HTML source code of a webpage to be extracted;

filtering the HTML source code, extracting the element key nodes in the HTML DOM tree of the HTML source code, and building a list of the element key nodes;

comparing the node value of each element key node in the list with the node values in the text node library in sequence, obtaining the probability that the node value of each element key node is a text node attribute value, and sorting the element key nodes by that probability;

and extracting the text elements of the element key nodes in the sorted order of probability to obtain a candidate webpage text, judging whether the candidate webpage text exceeds a preset threshold, and determining that the candidate is the webpage text when it does not exceed the preset threshold.

Optionally, the method of the present invention further includes:

once the candidate webpage text is determined to be the webpage text, no text elements are extracted from the remaining element key nodes.

Optionally, the text element is extracted according to the webpage text node attribute identifier and an HTML parser.

Optionally, the comparison of the node value of each element key node in the list with the node values in the text node library in sequence uses the RKR-GST algorithm or a string edit distance algorithm.
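The filtering and list-building steps above can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the set of filtered tags (`NOISE_TAGS`), the class name `KeyNodeCollector`, and the representation of an element key node as a `(tag, attribute name, attribute value)` triple are assumptions, since the source does not say which content is filtered or how the list is stored.

```python
from html.parser import HTMLParser

# Assumed filter set; the source does not list which tags are filtered out.
NOISE_TAGS = {"script", "style", "noscript"}


class KeyNodeCollector(HTMLParser):
    """Collect (tag, attribute name, attribute value) for id/class attributes."""

    def __init__(self):
        super().__init__()
        self.key_nodes = []
        self._skip_depth = 0  # > 0 while inside a filtered element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        for name, value in attrs:
            if name in ("id", "class") and value:
                self.key_nodes.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1


def collect_key_nodes(html: str) -> list:
    """Return the element key nodes of the page in document order."""
    parser = KeyNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.key_nodes
```

Each collected triple can then be scored against the text node library; keeping the attribute name alongside the value lets a later step address the element again when extracting its text.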

The invention also provides a system for extracting webpage text based on text node characteristics, which comprises:

the information acquisition module is used for acquiring the HTML source code of the webpage to be extracted;

the filtering module is used for filtering the HTML source code, extracting the element key nodes in the HTML DOM tree of the HTML source code, and building a list of the element key nodes;

the sequencing module is used for comparing the node value of each element key node in the list with the node values in the text node library in sequence, obtaining the probability that the node value of each element key node is a text node attribute value, and sorting the element key nodes by that probability;

the extraction module is used for extracting the text elements of the element key nodes in the sorted order of probability to obtain a candidate webpage text, judging whether the candidate webpage text exceeds a preset threshold, and determining that the candidate is the webpage text when it does not exceed the preset threshold.

The extraction process of the invention takes into account the important role that the attribute nodes of HTML DOM tree elements play in marking the text node: the key attribute values (id and class) of the webpage nodes are compared with the attribute value characteristics of text nodes, so that the text node is located accurately and the text is extracted accurately with the help of an HTML parser.

Drawings

FIG. 1 is a flow chart of a method for extracting a webpage text based on text node features according to the present invention;

FIG. 2 is a diagram of a system for extracting a web page text based on text node features according to the present invention.

Detailed Description

The method combines the text node characteristics and the HTML parser technology to realize the efficient and accurate extraction of the webpage text.

The invention provides a method for extracting webpage text based on text node characteristics, as shown in figure 1, comprising the following steps:

acquiring an HTML source code of a webpage to be extracted;

filtering the HTML source code, extracting the element key nodes in the HTML DOM tree of the HTML source code, and building a list of the element key nodes;

comparing the node value of each element key node in the list with the node values in the text node library in sequence, using the RKR-GST algorithm or a string edit distance algorithm, obtaining the probability that the node value of each element key node is a text node attribute value, and sorting the element key nodes by that probability;

The text node feature comparison used in the invention is the RKR-GST algorithm. GST (Greedy String Tiling) is a token-based similarity measure. Its basic idea is: if two strings are completely equal, a maximum value is returned, which equals the length of the string; if the two strings do not match at all, the minimum value 0 is returned. The longer string is generally called the text string and the shorter one the pattern string. KR (Karp-Rabin) is a randomized string matching algorithm that can quickly find the position of the first occurrence of the pattern string in the text string. Its basic idea is: first, a hash value is computed for the fixed-length pattern string using a hash function; second, a hash value is computed with the same hash function for each text substring of the same fixed length; third, if the hash value of the pattern string equals that of the text substring, the two are considered to match, and otherwise they do not match.
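The three Karp-Rabin steps described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the base `b = 256`, and the prime `q = 1_000_003` are assumed values. Because two different substrings can share a hash value, the sketch adds a direct character comparison whenever the hashes agree; that verification is an addition made here for correctness and is not described in the source.

```python
def karp_rabin_find(text: str, pattern: str, b: int = 256, q: int = 1_000_003) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    k, n = len(pattern), len(text)
    if k == 0:
        return 0
    if k > n:
        return -1
    high = pow(b, k - 1, q)  # weight of the leading character, b^(k-1) mod q
    p_hash = t_hash = 0
    for i in range(k):  # hash of the pattern and of the first text window
        p_hash = (p_hash * b + ord(pattern[i])) % q
        t_hash = (t_hash * b + ord(text[i])) % q
    for start in range(n - k + 1):
        # Equal hashes mark a candidate; compare characters to rule out collisions.
        if p_hash == t_hash and text[start:start + k] == pattern:
            return start
        if start + k < n:  # slide the window one character to the right
            t_hash = ((t_hash - ord(text[start]) * high) * b
                      + ord(text[start + k])) % q
    return -1
```

Only the windows whose hash equals the pattern's hash are compared character by character, which is what makes the search fast on average.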

The RKR-GST (Running Karp-Rabin Greedy String Tiling) algorithm combines the advantages of the GST algorithm and the KR algorithm. Its basic idea is that the elements of the pattern string need not be compared one by one against the text string: characters are compared only when the hash value of a pattern substring equals the hash value of a text substring, which makes the pattern matching efficient.

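The algorithm needs to calculate hash values. Define a base $b$ and a prime number $q$. The hash value of a string $c_1 c_2 \ldots c_k$ of length $k$ is:

$$\mathrm{Hash}(c_1 c_2 \ldots c_k) = \left(\mathrm{ask}(c_1)\, b^{k-1} + \mathrm{ask}(c_2)\, b^{k-2} + \cdots + \mathrm{ask}(c_{k-1})\, b + \mathrm{ask}(c_k)\right) \bmod q$$

The hash value of the string $c_2 c_3 \ldots c_{k+1}$ is:

$$\mathrm{Hash}(c_2 c_3 \ldots c_{k+1}) = \left(\left(\mathrm{Hash}(c_1 c_2 \ldots c_k) - \mathrm{ask}(c_1)\, b^{k-1}\right) b + \mathrm{ask}(c_{k+1})\right) \bmod q$$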
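The two hash formulas can be checked against each other in a few lines. In this sketch `ask(c)` is assumed to denote the numeric character code of `c` (Python's `ord`), and the values of `b` and `q` as well as the function names are illustrative choices that the source does not fix.

```python
def poly_hash(s: str, b: int = 256, q: int = 1_000_003) -> int:
    """Direct formula: (ask(c1)*b^(k-1) + ... + ask(ck)) mod q, via Horner's rule."""
    h = 0
    for ch in s:
        h = (h * b + ord(ch)) % q
    return h


def roll(prev_hash: int, outgoing: str, incoming: str, k: int,
         b: int = 256, q: int = 1_000_003) -> int:
    """Recurrence: drop the leading character, shift by b, append the new one."""
    return ((prev_hash - ord(outgoing) * pow(b, k - 1, q)) * b + ord(incoming)) % q
```

Sliding a window of length `k` over a string with `roll` gives the same value as recomputing `poly_hash` on each window, but with constant work per step instead of work proportional to `k`.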

First, RKR-GST creates the token-string structures for the two texts being compared;

second, it creates a hash table for the substrings of a specified length of the corresponding token string;

and third, it loops to find the substrings whose hash values are the same.
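The three steps above can be illustrated with a simplified greedy string tiling sketch. It is not the full RKR-GST algorithm: instead of running Karp-Rabin hashes it indexes the text's substrings of length `min_match` in a Python dictionary (which hashes them internally), and the function names, the `min_match` parameter, and the choice of characters as tokens are assumptions made for this example.

```python
from collections import defaultdict


def greedy_string_tiling(pattern: str, text: str, min_match: int = 2) -> list:
    """Return tiles (pattern_start, text_start, length) covering common substrings."""
    # Hash table: every substring of length min_match -> its start positions in text.
    index = defaultdict(list)
    for t in range(len(text) - min_match + 1):
        index[text[t:t + min_match]].append(t)

    marked_p = [False] * len(pattern)
    marked_t = [False] * len(text)
    tiles = []
    while True:
        best, matches = min_match, []
        for p in range(len(pattern) - min_match + 1):
            for t in index.get(pattern[p:p + min_match], ()):
                j = 0  # extend the match over unmarked, equal characters
                while (p + j < len(pattern) and t + j < len(text)
                       and pattern[p + j] == text[t + j]
                       and not marked_p[p + j] and not marked_t[t + j]):
                    j += 1
                if j > best:
                    best, matches = j, [(p, t, j)]
                elif j == best:
                    matches.append((p, t, j))
        if not matches:
            return tiles
        for p, t, j in matches:  # greedily tile the longest matches first
            if not any(marked_p[p:p + j]) and not any(marked_t[t:t + j]):
                for off in range(j):
                    marked_p[p + off] = marked_t[t + off] = True
                tiles.append((p, t, j))


def tiled_length(pattern: str, text: str, min_match: int = 2) -> int:
    """Total number of characters covered by tiles (0 when nothing matches)."""
    return sum(length for _, _, length in greedy_string_tiling(pattern, text, min_match))
```

For two identical strings the covered length equals the string length, and for strings with no common substring of at least `min_match` characters it is 0, which corresponds to the maximum and minimum values described for GST.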

Analysis of the RKR-GST algorithm.

Because a hash table is introduced, fewer substrings have to be compared when the similarity of two strings is judged, which improves efficiency. The average time complexity lies between O(n) and O(n²); in the worst case the time complexity is O(n³).

The text elements of the element key nodes are extracted in the sorted order of the probability of being a text node attribute value, using the webpage text node attribute identifier and an HTML parser, to obtain a candidate webpage text. Whether the candidate webpage text exceeds a preset threshold is then judged, and when it does not exceed the preset threshold, the candidate is determined to be the webpage text. Once the webpage text has been determined, no text elements are extracted from the remaining element key nodes.
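The ranking and extraction loop described above might look like the following sketch. Several parts are assumptions: `difflib.SequenceMatcher` stands in for the RKR-GST or edit distance comparison, the names `score`, `ElementText`, and `extract_body` are hypothetical, and the quantity compared with the threshold is passed in as a `measure` function because the source does not say what is measured. The sketch also assumes well-formed markup in which every non-void element has a closing tag.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def score(value: str, library: list) -> float:
    """Best similarity between an attribute value and the library (stand-in measure)."""
    return max(SequenceMatcher(None, value.lower(), ref.lower()).ratio()
               for ref in library)


class ElementText(HTMLParser):
    """Collect the text inside the first element carrying the given attribute."""

    def __init__(self, attr_name, attr_value):
        super().__init__()
        self.target = (attr_name, attr_value)
        self.depth = 0  # nesting depth inside the target element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_body(html, key_nodes, library, measure, threshold):
    """Try key nodes in descending score order; stop at the first accepted text."""
    ranked = sorted(key_nodes, key=lambda n: score(n[2], library), reverse=True)
    for _tag, name, value in ranked:
        parser = ElementText(name, value)
        parser.feed(html)
        parser.close()
        text = " ".join(" ".join(parser.chunks).split())
        if text and measure(text) <= threshold:
            return text  # remaining key nodes are not examined
    return None
```

Returning from inside the loop implements the early stop: once a candidate is accepted, the remaining element key nodes are never parsed.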

The invention also provides a system 200 for extracting webpage text based on text node characteristics, which comprises:

the information acquisition module 201 is used for acquiring HTML source codes of the webpage to be extracted;

the filtering module 202 is used for filtering the HTML source code, extracting element key nodes in an HTML DOM tree of the HTML source code and constructing a list aiming at the element key nodes;

the sorting module 203 compares the node value of each element key node in the list with the node values in the text node library in sequence, using the RKR-GST algorithm or a string edit distance algorithm, obtains the probability that the node value of each element key node is a text node attribute value, and sorts the element key nodes by that probability;

the extraction module 204 is used for extracting the text elements of the key element nodes according to the probability sequence of the text node attribute values, extracting the text elements according to the webpage text node attribute identifiers and the HTML parser, acquiring the webpage text to be judged, judging whether the webpage text to be judged exceeds a preset threshold value or not, and determining that the webpage text to be judged is the webpage text when the webpage text to be judged does not exceed the preset threshold value.

Claims

1. A method for extracting webpage text based on text node features comprises the following steps:

acquiring an HTML source code of a webpage to be extracted;

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein the extracting of the text element is performed according to a web page text node attribute identifier and an HTML parser.

4. The method as claimed in claim 1, wherein the comparing of the node value of each element key node in the list with the node value in the text node library in sequence uses the RKR-GST algorithm or a string edit distance algorithm.

5. A system for extracting a text of a web page based on text node features, the system comprising:

the sequencing module compares the node value of each element key node in the list with the node values in the node library in sequence, obtains the probability that the node value of each element key node is the text node attribute value and sequences the node values;

and the extraction module is used for extracting the text elements of the key element nodes according to the sequence of the probability of the attribute values of the text nodes, acquiring the text of the webpage to be judged, judging whether the text of the webpage to be judged exceeds a preset threshold value or not, and determining that the text of the webpage to be judged is the text of the webpage when the text of the webpage to be judged does not exceed the preset threshold value.

6. The system of claim 5, wherein the extraction of the text element is performed according to the attribute identifier of the web page text node and an HTML parser.

7. The system of claim 5, wherein the comparing of the node value of each element key node in the list with the node value in the text node library in sequence uses the RKR-GST algorithm or a string edit distance algorithm.