CN109325204B

CN109325204B - Automatic extraction method of webpage content

Info

Publication number: CN109325204B
Application number: CN201811067868.8A
Authority: CN
Inventors: 王世阳; 李阳
Original assignee: Wuhan Biorun Biotechnology LLC
Current assignee: Wuhan Biorun Biotechnology LLC
Priority date: 2018-09-13
Filing date: 2018-09-13
Publication date: 2022-01-07
Anticipated expiration: 2038-09-13
Also published as: CN109325204A

Abstract

The invention belongs to the technical field of webpage content extraction and relates in particular to a method for automatically extracting webpage content, which is especially suited to extracting the content of journal article abstract pages. The method comprises the following steps: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; and S4, labeling the candidate visual blocks. The method replaces the traditional visual algorithm with a Fast Fourier Transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the efficiency of the algorithm.

Description

Automatic extraction method of webpage content

Technical Field

The invention belongs to the technical field of webpage content extraction and relates in particular to an automatic webpage content extraction method, which is especially suited to extracting the content of journal article abstract pages.

Background

With the development of information technology, the internet has become increasingly important for acquiring information, and it is an effective way for researchers to obtain newly published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide abstract pages for journal articles on their main sites. Extracting information such as authors, publication time, and abstracts from these pages is key to building an integrated database, and it is also a difficult problem.

Web content extraction is an active topic in the field of Information Extraction. Existing methods fall broadly into three categories.

First, template-based methods extract content according to XPath and CSS expressions of webpage elements. They are highly accurate, but creating templates is labor-intensive, large numbers of templates are hard to maintain, and the methods are not robust to changes in webpage structure.

Second, DOM-tree-based methods parse the webpage into a DOM tree, perform tree alignment or partial alignment between the target page and labeled pages using supervised or semi-supervised learning, label the target page, and then extract its content. These methods are inefficient (the time complexity of the Shing-Ling algorithm is proportional to the depth of the tree) and require multiple pages generated from the same template as input.

Third, methods based on visual information, such as the VIPS page segmentation algorithm proposed by Microsoft Research Asia, divide a page into visual blocks according to cues such as background color, text density, and font, learn an importance score for each visual block with a Support Vector Machine (SVM) or a neural network model, and then extract the text content of the page. These methods have high time and space complexity, depend on manually crafted rules, and are not robust to new webpage templates.

Disclosure of Invention

In view of the above technical problems, an object of the present invention is to provide a method for automatically extracting webpage content that uses a Fast Fourier Transform (FFT) and log-Gabor filtering in place of the conventional visual algorithm, thereby reducing time and space complexity and improving the efficiency of the algorithm.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A method for automatically extracting webpage content, characterized by comprising the following steps:

S1, re-rendering the HTML

First, a DOM tree and a render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and the render tree: each img tag is re-rendered as an arbitrary geometric figure, and each line of text in the p, div, and a tags is re-rendered as an arbitrary geometric figure;

S2, segmenting the DOM tree

First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; this node is segmented transversely, and the child node whose direction is longitudinal is then selected;

second, the node whose direction is longitudinal is segmented longitudinally one or more times, and the child node with the largest visual block area is then selected;

finally, the node with the largest visual block area is segmented transversely to obtain a plurality of candidate visual blocks;

S3, pre-labeling the candidate visual blocks

Each candidate visual block is assigned pre-labeled labels by a heuristic algorithm and/or a keyword frequency algorithm; all of the pre-labeled labels together form a pre-labeled label set;

S4, labeling the candidate visual blocks

Each candidate visual block is labeled by a probabilistic graphical model to obtain its label; the labels are then matched one by one against the pre-labeled label set, and the labels that fall within the pre-labeled label set are selected.
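The final screening in S4 can be illustrated with a minimal sketch. It assumes that each candidate visual block carries its own pre-labeled label set and that a rejected label is represented as None; the text does not specify either detail, and the function name screen_labels is hypothetical.

```python
def screen_labels(model_labels, prelabel_sets):
    """Keep a block's model label only if it falls in that block's pre-labeled set.

    model_labels:  one label per candidate visual block (from the graphical model).
    prelabel_sets: one set of pre-labeled labels per candidate visual block.
    Rejected labels become None (an assumption made for this sketch).
    """
    kept = []
    for label, allowed in zip(model_labels, prelabel_sets):
        kept.append(label if label in allowed else None)
    return kept


if __name__ == "__main__":
    labels = ["title", "author", "abstract"]
    prelabels = [{"title", "journal"}, {"abstract"}, {"abstract"}]
    print(screen_labels(labels, prelabels))
```

In this sketch the second block is rejected because the model label "author" is not among its pre-labeled labels.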

Preferably, the DOM tree and the render tree only contain img, p, div, and a tags.

Preferably, the geometric figure is a set of intersecting longitudinal and transverse line segments.

Preferably, the geometric figure is a circle or an ellipse.

Preferably, the geometric figure is a regular polygon.

Preferably, the node segmentation method comprises: first obtaining a frequency-domain representation of the visual block by fast Fourier transform, then separating the horizontal and vertical components of that representation with a set of orthogonal log-Gabor filters, and finally determining the direction of the visual block by comparing its horizontal and vertical components.

The beneficial effects of the invention are as follows. The method uses a Fast Fourier Transform (FFT) and log-Gabor filtering in place of the traditional visual algorithm, which reduces time and space complexity and improves the efficiency of the algorithm. In addition, the method uses a probabilistic graphical model to describe the local dependencies among candidate visual blocks, so that it adapts to different sites and page layouts and is reasonably robust to layout changes. Log-Gabor filtering is used to determine the directionality of page elements, and combining it with a conditional random field improves extraction accuracy, providing an alternative approach to automatic webpage content extraction. When the geometric figure is a set of criss-crossing line segments, the operation is fastest: the simpler the geometric figure, the simpler the calculation and the higher the speed.

Drawings

FIG. 1 is a schematic flow diagram of the present invention.

FIG. 2 is a first schematic diagram of an embodiment of the present invention.

FIG. 3 is a second schematic diagram of an embodiment of the present invention.

FIG. 4 is a third schematic diagram of an embodiment of the present invention.

FIG. 5 is a fourth schematic diagram of an embodiment of the present invention.

Detailed Description

For better understanding of the present invention, the technical solutions of the present invention are further described below with reference to the following examples and accompanying drawings (as shown in fig. 1, 2, 3, 4, and 5).

As shown in fig. 1, an automatic extraction method of web page content includes:

S1, re-rendering the HTML

First, a DOM tree and a render tree of the HTML document are built; both contain only img, p, div, and a tags. Each visual block is then re-rendered according to the DOM tree and the render tree. A visual block is the rectangular area of non-zero area that a page element occupies in the page after processing by the browser rendering engine; a page element is a piece of HTML code enclosed by a pair of HTML tags, such as <p> or <div>, and each visual block corresponds to a node in the DOM tree. Each img tag is re-rendered as an arbitrary geometric figure (for example a set of criss-crossing line segments, a polygon, a circle, an ellipse or another regular figure, or any irregular figure), and each line of text in the p, div, and a tags is likewise re-rendered as an arbitrary geometric figure;

as shown in FIG. 2 (each cross in the figure corresponds to one tag), re-rendering as a set of intersecting horizontal and vertical line segments (such as a cross) is taken as an example below:

the img tag is re-rendered as a set of intersecting horizontal and vertical line segments;

for example, the visual block of an img tag corresponds to a rectangular area in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1(x1, y1), R2(x1, y2), R3(x2, y2), and R4(x2, y1). P(x1, (y1+y2)/2), Q((x1+x2)/2, y2), R(x2, (y1+y2)/2), and S((x1+x2)/2, y1) are the midpoints of segments R1R2, R2R3, R3R4, and R4R1, respectively. The pair of segments PR and QS, which perpendicularly bisect each other (hereinafter a "cross"), can then serve as the re-rendering result of the img tag.
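The cross construction described above can be sketched in a few lines. The function name cross_for_rect is hypothetical; the midpoint formulas are taken directly from the text.

```python
def cross_for_rect(x1, y1, x2, y2):
    """Return the two segments (PR, QS) of the cross for a rectangular visual block.

    (x1, y1) is the upper-left corner R1 and (x2, y2) is the opposite corner R3.
    """
    p = (x1, (y1 + y2) / 2)   # midpoint of R1R2 (left edge)
    q = ((x1 + x2) / 2, y2)   # midpoint of R2R3 (bottom edge)
    r = (x2, (y1 + y2) / 2)   # midpoint of R3R4 (right edge)
    s = ((x1 + x2) / 2, y1)   # midpoint of R4R1 (top edge)
    return (p, r), (q, s)


if __name__ == "__main__":
    horizontal, vertical = cross_for_rect(0, 0, 4, 2)
    print("PR:", horizontal)
    print("QS:", vertical)
```

Because PR joins the midpoints of the left and right edges and QS joins the midpoints of the top and bottom edges, the two segments meet at the center of the rectangle.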

For the p, div, and a tags, each line of text is re-rendered as a set of criss-crossing line segments;

for example, the visual block of a p tag corresponds to a rectangular area in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1(x1, y1), R2(x1, y2), R3(x2, y2), and R4(x2, y1). The width of the rectangle is W pixels, the text contained in the p tag is C bytes long, and the font size is F pixels. The number of lines N of text in the p tag's visual block can then be estimated as

$$N = \left\lceil \frac{C \cdot F}{W} \right\rceil$$

(where $\lceil \cdot \rceil$ is the round-up symbol). Let P1, P2, …, PN be the points that divide segment R1R2 into N+1 equal parts, let R1, R2, …, RN (division points, not the corner points above) be the points that divide segment R3R4 into N+1 equal parts, and let Q and S be the midpoints of segments R2R3 and R1R4. The group of segments P1R1, P2R2, …, PNRN, QS can then serve as the re-rendering result of the p tag.
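A minimal sketch of the text-block re-rendering follows. It assumes the line-count estimate N = ceil(C * F / W), which is inferred from the quantities W, C, and F defined above rather than stated explicitly in the surviving text. It also assumes that the i-th division points on the left and right edges are paired at the same height, so each PiRi is horizontal. The function names and the lower bound of one line are hypothetical.

```python
import math


def estimate_line_count(text_bytes, font_px, width_px):
    """Assumed estimate: N = ceil(C * F / W), with at least one line."""
    return max(1, math.ceil(text_bytes * font_px / width_px))


def text_block_segments(x1, y1, x2, y2, text_bytes, font_px):
    """Return N horizontal segments plus the vertical segment QS for a text block."""
    n = estimate_line_count(text_bytes, font_px, x2 - x1)
    step = (y2 - y1) / (n + 1)  # N points split each vertical edge into N+1 parts
    rows = [((x1, y1 + k * step), (x2, y1 + k * step)) for k in range(1, n + 1)]
    mid_x = (x1 + x2) / 2
    spine = ((mid_x, y2), (mid_x, y1))  # Q (bottom midpoint) to S (top midpoint)
    return rows + [spine]


if __name__ == "__main__":
    for segment in text_block_segments(0, 0, 100, 40, text_bytes=30, font_px=10):
        print(segment)
```

With a 100-pixel-wide block, 30 bytes of text, and a 10-pixel font, the assumed estimate gives three lines, so the block is drawn as three horizontal segments crossed by one vertical segment.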

S2, segmenting the DOM tree (transverse-longitudinal-transverse segmentation)

As shown in FIG. 3, first, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; this node is segmented transversely (for example into the three longitudinal blocks VB1, VB2, and VB3 in FIG. 3), and the child node whose direction is longitudinal is then selected (VB3 in FIG. 3);

as shown in FIG. 4, second, the node whose direction is longitudinal is segmented longitudinally one or more times (for example into the three transverse blocks VB1, VB2, and VB3 in FIG. 4), and the child node with the largest visual block area is then selected (VB2 in FIG. 4);

when the DOM tree contains nested nodes, longitudinal decomposition must be performed several times to obtain a clean result;

as shown in FIG. 5, finally, the node with the largest visual block area is segmented transversely (see the vertically stacked boxes in FIG. 5, which from top to bottom represent the journal, DOI, title, authors, publication time, abstract, and keywords) to obtain a plurality of candidate visual blocks;

the node segmentation method (covering both transverse and longitudinal segmentation) is as follows: a frequency-domain representation of the visual block is obtained by Fast Fourier Transform (FFT); a set of orthogonal log-Gabor filters then separates the horizontal and vertical components of that representation; and the direction of the visual block is determined by comparing the two components (if the horizontal component is smaller than the vertical component, the direction of the visual block is transverse; if the vertical component is smaller than the horizontal component, the direction is longitudinal);
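The direction test can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the patented implementation: the visual block is assumed to be available as a 2-D grayscale array of the re-rendered line segments, the two filters are centered on the horizontal and vertical frequency axes, and the filter parameters f0, sigma_ratio, and sigma_theta are hypothetical values chosen for the example.

```python
import numpy as np


def log_gabor_pair(shape, f0=0.1, sigma_ratio=0.55, sigma_theta=np.pi / 8):
    """Build two orthogonal log-Gabor filters in the frequency domain.

    The first passes horizontal frequencies, the second vertical frequencies.
    All parameter values are assumptions made for this sketch.
    """
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]  # vertical frequency (cycles/pixel)
    fx = np.fft.fftfreq(cols)[None, :]  # horizontal frequency (cycles/pixel)
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                  # placeholder to avoid log(0) at DC
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    radial[0, 0] = 0.0                  # a log-Gabor filter has no DC response
    theta = np.arctan2(fy, fx)
    filters = []
    for theta0 in (0.0, np.pi / 2):
        # Angular distance to the filter axis, treating theta and theta + pi alike.
        d = np.abs(np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0)))
        d = np.minimum(d, np.pi - d)
        filters.append(radial * np.exp(-(d ** 2) / (2 * sigma_theta ** 2)))
    return filters


def directional_energy(image):
    """Return (horizontal, vertical) energy of the filtered power spectrum."""
    power = np.abs(np.fft.fft2(image)) ** 2
    h_filter, v_filter = log_gabor_pair(image.shape)
    return float((power * h_filter).sum()), float((power * v_filter).sum())


def block_direction(image):
    """Apply the comparison rule from the text to the two components."""
    horizontal, vertical = directional_energy(image)
    return "transverse" if horizontal < vertical else "longitudinal"


if __name__ == "__main__":
    block = np.zeros((64, 64))
    block[::8, :] = 1.0                 # stacked horizontal line segments
    print(block_direction(block), block_direction(block.T))
```

Stacked horizontal line segments vary only along the vertical axis, so their energy concentrates in the vertical-frequency filter and the rule classifies the block as transverse; the transposed block is classified as longitudinal.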

(1) Traversing the DOM tree from top to bottom

The DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. For a transverse decomposition, the N (N > 1) child nodes of this node are processed as described in S2, and the node whose arrangement direction is longitudinal is selected. For a longitudinal decomposition, the child node with the largest visual block area is selected.
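The traversal and selection rules can be sketched as below. The Node class, its fields, and the function names are hypothetical; the direction of each node is assumed to have been computed beforehand (for example by the FFT-based test) and stored as a string.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    area: float = 0.0
    direction: str = ""  # "transverse" or "longitudinal", assumed precomputed
    children: list = field(default_factory=list)


def first_branching_node(root):
    """Breadth-first search for the first node with more than one child."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) > 1:
            return node
        queue.extend(node.children)
    return None


def select_child(node, decomposition):
    """Pick the child to descend into after a decomposition step."""
    if decomposition == "transverse":
        # After a transverse split, keep the child whose direction is longitudinal.
        return next((c for c in node.children if c.direction == "longitudinal"), None)
    # After a longitudinal split, keep the child with the largest visual block area.
    return max(node.children, key=lambda c: c.area, default=None)


if __name__ == "__main__":
    columns = [
        Node("VB1", area=10, direction="transverse"),
        Node("VB2", area=12, direction="transverse"),
        Node("VB3", area=80, direction="longitudinal"),
    ]
    root = Node("html", children=[Node("body", children=[Node("wrap", children=columns)])])
    branching = first_branching_node(root)
    print(branching.name, select_child(branching, "transverse").name)
```

The example tree mirrors the FIG. 3 description: the search skips the single-child html and body nodes, stops at the wrapper with three children, and selects VB3.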

S3, pre-labeling candidate visual blocks

For the heuristic algorithm, reference may be made to "Extracting multiple news based on visual effects".

The keyword frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, word frequencies are counted over the text segments of a collected set of data blocks, and the words that occur more than N times are selected as keywords; their occurrence counts are recorded as the reference keyword frequencies. Next, word frequencies are counted over the text segment of the candidate visual block, and the words in that text segment are intersected with the keywords. For each word in the intersection, its frequency in the candidate text segment is multiplied by its reference keyword frequency, and the products are summed to give the keyword score of the candidate visual block. If the score is greater than a threshold s, the corresponding label is assigned (for example title, author, or abstract on a journal abstract page; news title, news author, or release time on a news page).
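The keyword score can be sketched as follows. The sketch assumes whitespace tokenization, lowercase matching, and raw occurrence counts as "frequency"; none of these details is fixed by the text, and the function names are hypothetical.

```python
from collections import Counter


def reference_keywords(segments, min_count):
    """Count words over labeled text segments; keep those seen more than min_count times."""
    counts = Counter(word for seg in segments for word in seg.lower().split())
    return {word: count for word, count in counts.items() if count > min_count}


def keyword_score(text, reference):
    """Sum of (count in candidate text) * (reference count) over shared words."""
    counts = Counter(text.lower().split())
    shared = counts.keys() & reference.keys()
    return sum(counts[word] * reference[word] for word in shared)


if __name__ == "__main__":
    collected = [
        "abstract background methods results",
        "abstract methods conclusions",
        "abstract results",
    ]
    reference = reference_keywords(collected, min_count=1)
    score = keyword_score("Abstract methods methods unknown", reference)
    print(reference, score)
```

A candidate block would receive the label associated with the collected segments when this score exceeds the threshold s; one reference dictionary per label is a natural arrangement, though the text does not spell it out.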

S4, labeling the candidate visual block

The probabilistic graphical models include CRF, MLN, and the like; see "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". For the features used to build the probabilistic graphical model, see "Template-Independent News Extraction Based on Visual Consistency".

The candidate visual blocks are labeled by an undirected probabilistic graphical model to obtain the key information of the page (as shown in FIG. 5). Key information is the part of the page that readers care about most, such as the title, authors, and abstract on a journal abstract page, or the news title, news author, and release time on a news page.

Take CRF as an example. First, 200 pages are collected and manually labeled with eight labels: journal, DOI, title, author, publication time, abstract, keywords, and invalid. The CRF model is trained with a quasi-Newton method. Next, a feature vector is computed for each candidate visual block. If only four features are considered, namely the width-to-height ratio, the ratio of character count to area, the abscissa x of the upper-left corner, and the ordinate y of the upper-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point, each candidate visual block has two kinds of labels: a set of pre-labeled labels and a CRF label.
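The feature vector and the Viterbi decoding step can be sketched as below. The sketch does not train a CRF: the emission and transition scores passed to viterbi are hypothetical numbers standing in for the scores a trained linear-chain CRF would derive from the feature vectors. The function names are also hypothetical.

```python
import numpy as np


def feature_vector(x, y, width, height, char_count):
    """Four features from the text: (ratio, density, x, y)."""
    ratio = width / height                   # width-to-height ratio
    density = char_count / (width * height)  # characters per unit area
    return (ratio, density, x, y)


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for a linear chain.

    emissions:   array (T, K), score of label k for block t.
    transitions: array (K, K), score of moving from label i to label j.
    """
    emissions = np.asarray(emissions, dtype=float)
    transitions = np.asarray(transitions, dtype=float)
    steps, _ = emissions.shape
    score = emissions[0].copy()
    back = np.zeros(emissions.shape, dtype=int)
    for t in range(1, steps):
        candidate = score[:, None] + transitions  # candidate[i, j]
        back[t] = candidate.argmax(axis=0)        # best previous label for each j
        score = candidate.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    print(feature_vector(x=10, y=20, width=200, height=50, char_count=100))
    # Hypothetical scores for three blocks and two labels.
    emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
    print(viterbi(emissions, np.zeros((2, 2))))
    print(viterbi(emissions, [[0.0, -5.0], [-5.0, 0.0]]))
```

The second call shows why decoding is done over the whole sequence: once switching labels is penalized, the best path keeps a single label for all three blocks even though the first block alone favors the other label.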

The above description is only an application example of the present invention and does not limit its scope of protection; equivalent changes made in accordance with the claims of the present invention therefore remain within the protection scope of the present invention.

Claims

1. A method for automatically extracting webpage content is characterized by comprising the following steps:

s1, re-rendering the HTML

s2, segmenting the DOM tree

the node segmentation method comprises the following steps: first, obtaining a frequency-domain representation of the visual block by fast Fourier transform; then separating the horizontal and vertical components of that representation with a set of orthogonal log-Gabor filters; and finally determining the direction of the visual block by comparing its horizontal and vertical components;

s3, pre-labeling candidate visual blocks

s4, labeling the candidate visual block

2. The method for automatically extracting webpage content according to claim 1, wherein the DOM tree and the render tree only contain img, p, div, a tags.

3. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a set of vertical and horizontal intersecting line segments.

4. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a circle or an ellipse.

5. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a regular polygon.