CN109325204A

CN109325204A - Web page content extraction method

Info

Publication number: CN109325204A
Application number: CN201811067868.8A
Authority: CN
Inventors: 王世阳; 李阳
Original assignee: WUHAN BIORUN BIO-TECH Co Ltd
Current assignee: WUHAN BIORUN BIO-TECH Co Ltd
Priority date: 2018-09-13
Filing date: 2018-09-13
Publication date: 2019-02-12
Anticipated expiration: 2038-09-13
Also published as: CN109325204B

Abstract

The invention belongs to the technical field of web page content extraction, and in particular relates to a web page content extraction method that is especially suitable for extracting the content of journal article abstract pages. The method comprises: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; S4, labeling the candidate visual blocks. The method uses the fast Fourier transform (FFT) and log-Gabor filtering in place of conventional visual algorithms, which reduces time and space complexity and improves the time and space efficiency of the algorithm.

Description

Web page content extraction method

Technical field

The invention belongs to the technical field of web page content extraction, and in particular relates to a web page content extraction method that is especially suitable for extracting the content of journal article abstract pages.

Background art

With the development of information technology, the internet is becoming increasingly important for acquiring information. The internet is also an effective way for researchers to obtain the most recently published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide abstract pages for journal articles on their main websites. Extracting information such as the authors, publication date, and abstract from these abstract pages is both a key point and a difficulty in building an integrated database.

Web page content extraction has long been a hot topic in the field of information extraction (Information Extraction). Existing methods fall roughly into three categories. The first is template-based methods, which extract content according to the XPath or CSS expressions of web page elements. They are highly accurate, but creating templates consumes a great deal of manual labor, large numbers of templates are difficult to maintain, and the methods are not robust to changes in page structure. The second is DOM-tree-based methods, which parse a web page into a DOM tree and, by supervised or semi-supervised learning, match the tree structure of the target page against that of labeled pages, either fully (alignment) or partially (partial alignment), so as to label the target page and then extract its content. Such methods are inefficient (the time complexity of such algorithms is proportional to the depth of the tree) and require multiple pages generated from the same template as input. The third is methods based on visual information, for example the VIPS page segmentation algorithm proposed by Microsoft Research Asia. Such methods divide the page into several visual blocks (visual block) according to cues (cue) such as background color, text density, and font, learn an importance score for each visual block with a support vector machine (SVM) or a neural network model, and then extract the main text content of the page. These methods have high time and space complexity, depend on manually formulated rules, and are not robust across web page templates.

Summary of the invention

In view of the above technical problems, the object of the present invention is to provide a web page content extraction method that uses the fast Fourier transform (FFT) and log-Gabor filtering in place of conventional visual algorithms, which reduces time and space complexity and improves the time and space efficiency of the algorithm.

To achieve the above object, the technical solution adopted by the present invention is:

A web page content extraction method, characterized by comprising:

S1, re-rendering the HTML

The DOM tree and render tree of the HTML document are first built, and each visual block is then re-rendered according to the DOM tree and render tree: an img tag is re-rendered as an arbitrary geometric figure, and each line of a p, div, or a tag is likewise re-rendered as an arbitrary geometric figure;

S2, segmenting the DOM tree

First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; this node is segmented horizontally, and then the child node of this node whose direction is longitudinal is selected;

Second, the node whose direction is longitudinal is segmented longitudinally one or more times, and then the child node with the largest visual block area is selected;

Finally, the node with the largest visual block area is segmented horizontally again, yielding several candidate visual blocks;

S3, pre-labeling the candidate visual blocks

Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm, and all the pre-labels together form a pre-label set;

S4, labeling the candidate visual blocks

Each candidate visual block is labeled by a probabilistic graphical model to obtain a corresponding label; all the labels are matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.

Preferably, the DOM tree and render tree contain only img, p, div, and a tags.

Preferably, the geometric figure is a set of crisscrossing line segments.

Preferably, the geometric figure is a circle or an ellipse.

Preferably, the geometric figure is a regular polygon.

Preferably, the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of the block's frequency-domain representation; finally, the horizontal and vertical components of the visual block are compared to determine the direction of the visual block.

The beneficial effects of the invention are as follows. The method uses the fast Fourier transform (FFT) and log-Gabor filtering in place of conventional visual algorithms, which reduces time and space complexity and improves the time and space efficiency of the algorithm. In addition, the method uses a probabilistic graphical model to describe the local dependencies between candidate visual blocks so as to adapt to different websites and page layouts, which gives it a degree of robustness to changes in page layout. Judging the directionality of page elements with log-Gabor filtering, combined with a conditional random field, improves the extraction accuracy of the model and offers another approach to automatic web page content extraction. The geometric figure is a set of crisscrossing line segments: the simpler the geometric figure, the simpler the computation and the faster the operation, so a set of crisscrossing line segments gives correspondingly faster computation.

Brief description of the drawings

Fig. 1 is a flow diagram of the invention.

Fig. 2 is a first schematic diagram of an embodiment of the invention.

Fig. 3 is a second schematic diagram of an embodiment of the invention.

Fig. 4 is a third schematic diagram of an embodiment of the invention.

Fig. 5 is a fourth schematic diagram of an embodiment of the invention.

Specific embodiments

For a better understanding of the present invention, its technical solution is further described below with reference to the embodiments and the drawings (as shown in Figs. 1, 2, 3, 4, and 5).

As shown in Fig. 1, a web page content extraction method comprises:

S1, re-rendering the HTML

The DOM tree and render tree (render tree) of the HTML document are first built; the DOM tree and render tree contain only img, p, div, and a tags. Each visual block is then re-rendered according to the DOM tree and render tree. (A page element, after processing by the browser rendering engine, is represented as a rectangular region of non-zero area in the page, called a visual block. A page element is a piece of HTML code enclosed by a pair of HTML tags, such as <p> or <div>. Here a visual block corresponds to a node in the DOM tree.) An img tag is re-rendered as an arbitrary geometric figure (for example a set of crisscrossing line segments, a regular geometric figure such as a polygon, circle, or ellipse, or any irregular geometric figure), and each line (of text) of a p, div, or a tag is likewise re-rendered as an arbitrary geometric figure;

As shown in Fig. 2 (each cross in the figure corresponds to one tag), the following takes re-rendering as a set of crisscrossing line segments (such as a cross) as an example:

img tag: the img tag is re-rendered as a set of crisscrossing line segments;

For example, the visual block of an img tag corresponds to a rectangular region in the page. Starting from the top-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2), and R4 (x2, y1). P (x1, (y1+y2)/2), Q ((x1+x2)/2, y2), R (x2, (y1+y2)/2), and S ((x1+x2)/2, y1) are the midpoints of segments R1R2, R2R3, R3R4, and R4R1, respectively. The pair of mutually perpendicular, mutually bisecting segments PR and QS (hereinafter the "cross") can then be taken as the result of re-rendering the img tag.
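The cross construction for an img tag can be sketched as follows. This is a minimal illustration that assumes page coordinates with y increasing downward; the names `Rect` and `cross_for_rect` are hypothetical and do not come from the patent.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


class Rect(NamedTuple):
    """Visual block rectangle; (x1, y1) is the top-left corner R1."""
    x1: float
    y1: float
    x2: float
    y2: float


def cross_for_rect(rect: Rect) -> Tuple[Segment, Segment]:
    """Return the segments PR and QS joining the midpoints of opposite edges."""
    xm = (rect.x1 + rect.x2) / 2
    ym = (rect.y1 + rect.y2) / 2
    p = (rect.x1, ym)  # midpoint of R1R2 (left edge)
    q = (xm, rect.y2)  # midpoint of R2R3 (bottom edge)
    r = (rect.x2, ym)  # midpoint of R3R4 (right edge)
    s = (xm, rect.y1)  # midpoint of R4R1 (top edge)
    return (p, r), (q, s)
```

For a 10 by 4 rectangle anchored at the origin, `cross_for_rect(Rect(0, 0, 10, 4))` gives one segment along y = 2 and one along x = 5.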

p, div, and a tags: each line of text of such a tag is re-rendered as a set of crisscrossing line segments;

For example, the visual block of a p tag corresponds to a rectangular region in the page. Starting from the top-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2), and R4 (x2, y1). The width (width) of the rectangle is W pixels. The text contained in the p tag is C bytes long, and the font size (font size) is F pixels. By estimation, the number of text lines in the visual block of the p tag is then N lines (the estimate is rounded up). Take P1, P2, ..., Pn as the points dividing segment R1R2 into N+1 equal parts; R1, R2, ..., Rn as the points dividing segment R3R4 into N+1 equal parts; and Q and S as the midpoints of segments R2R3 and R1R4, respectively. The group of segments P1R1, P2R2, ..., PnRn, QS can then be taken as the result of re-rendering the p tag.
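The segment group for a text tag can be sketched in the same way. The sketch below takes the line count N as an input rather than deriving it from W, C, and F, and it pairs division points on the left and right edges that share the same y value; both choices, and the function name `line_segments_for_text`, are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def line_segments_for_text(x1: float, y1: float, x2: float, y2: float,
                           n_lines: int) -> List[Segment]:
    """Return N horizontal segments (one per text line) plus the vertical QS."""
    if n_lines < 1:
        raise ValueError("n_lines must be at least 1")
    # Dividing the left and right edges into N+1 equal parts gives N interior points.
    step = (y2 - y1) / (n_lines + 1)
    segments: List[Segment] = []
    for i in range(1, n_lines + 1):
        y = y1 + i * step
        segments.append(((x1, y), (x2, y)))  # Pi on the left edge, Ri on the right
    xm = (x1 + x2) / 2
    segments.append(((xm, y2), (xm, y1)))  # Q (bottom midpoint) to S (top midpoint)
    return segments
```

With three text lines in a 100 by 40 block, the horizontal segments fall at y = 10, 20, and 30, and the vertical segment runs along x = 50.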

S2, segmenting the DOM tree (horizontal, then longitudinal, then horizontal segmentation)

As shown in Fig. 3, first, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; this node is segmented horizontally (for example, into the three longitudinal blocks VB1, VB2, and VB3 in Fig. 3), and then the child node of this node whose direction is longitudinal is selected (i.e., VB3 in Fig. 3);

As shown in Fig. 4, second, the node whose direction is longitudinal is segmented longitudinally one or more times (for example, into the three horizontal blocks VB1, VB2, and VB3 in Fig. 4), and then the child node with the largest visual block area is selected (i.e., VB2 in Fig. 4);

When nested nodes occur in the DOM tree, several longitudinal decompositions are needed to obtain a clean result;

As shown in Fig. 5, finally, the node with the largest visual block area is segmented horizontally again (for example, into the multiple longitudinally divided boxes in Fig. 5, which from top to bottom represent the journal, DOI, title, authors, publication time, abstract, and keywords), yielding several candidate visual blocks;

The node is segmented (both horizontally and longitudinally) as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform (FFT); a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of the block's frequency-domain representation; finally, the horizontal and vertical components of the visual block are compared to determine its direction (if the horizontal component is smaller than the vertical component, the direction of the visual block is horizontal; if the vertical component is smaller than the horizontal component, the direction of the visual block is longitudinal);
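The direction test can be sketched with NumPy as below. This is an illustration under stated assumptions, not the patented implementation: the block is assumed to be available as a 2-D grayscale array of its re-rendered line segments, the "horizontal component" is taken to be the spectral energy passed by a log-Gabor filter oriented along the horizontal frequency axis, and the filter parameters `f0`, `sigma_on_f`, and `sigma_theta` are arbitrary illustrative values. The function names are hypothetical.

```python
import numpy as np


def log_gabor(shape, theta0, f0=0.1, sigma_on_f=0.55, sigma_theta=np.pi / 8):
    """Frequency-domain log-Gabor filter oriented at angle theta0 (radians)."""
    rows, cols = shape
    fy, fx = np.meshgrid(np.fft.fftfreq(rows), np.fft.fftfreq(cols), indexing="ij")
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0  # placeholder to avoid log(0) at the DC term
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_on_f) ** 2))
    radial[0, 0] = 0.0  # a log-Gabor filter has no DC response
    theta = np.arctan2(fy, fx)
    # Angular distance to theta0, folded so that theta and theta + pi coincide.
    d = np.abs(np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0)))
    d = np.minimum(d, np.pi - d)
    angular = np.exp(-(d ** 2) / (2 * sigma_theta ** 2))
    return radial * angular


def block_direction(image):
    """Compare horizontal and vertical spectral energy of a rendered block."""
    power = np.abs(np.fft.fft2(image)) ** 2
    horizontal = float(np.sum(power * log_gabor(image.shape, 0.0) ** 2))
    vertical = float(np.sum(power * log_gabor(image.shape, np.pi / 2) ** 2))
    if horizontal < vertical:
        return "horizontal"
    if vertical < horizontal:
        return "longitudinal"
    return None  # equal energies: the text does not specify this case
```

Under these assumptions, a block rendered as a stack of horizontal lines varies only along y, so its energy lies on the vertical frequency axis and the block is classified as horizontal; the transposed image is classified as longitudinal.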

(1) Traversing the DOM tree from the top down

The DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. If horizontal decomposition is being performed, the N (N > 1) child nodes of this node are processed as described in S2, and the node whose direction is longitudinal is selected. If longitudinal decomposition is being performed, the child node with the largest visual block area is chosen.
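The traversal and selection rule can be sketched as follows. The `Node` structure, its `direction` and `area` fields, and the two function names are hypothetical; the sketch assumes each child's direction and visual block area have already been computed.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    area: float = 0.0                # visual block area in square pixels
    direction: Optional[str] = None  # "horizontal" or "longitudinal"
    children: List["Node"] = field(default_factory=list)


def first_branching_node(root: Node) -> Optional[Node]:
    """Breadth-first search for the first node with more than one child."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) > 1:
            return node
        queue.extend(node.children)
    return None


def pick_child(node: Node, mode: str) -> Optional[Node]:
    """Horizontal decomposition picks the longitudinal child;
    longitudinal decomposition picks the child with the largest area."""
    if mode == "horizontal":
        for child in node.children:
            if child.direction == "longitudinal":
                return child
        return None
    return max(node.children, key=lambda c: c.area)
```

A full pipeline would call `first_branching_node` and `pick_child` alternately for the horizontal, longitudinal, and final horizontal passes.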

S3, pre-labeling the candidate visual blocks

For the heuristic algorithm (heuristic), reference may be made to Extracting multiple news attributes based on visual features.

The keyword frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, word frequency statistics are computed over the text fragments of a collected set of data blocks, and the words whose frequency of occurrence is greater than N are selected as keywords; the frequencies with which these keywords occur are recorded as the reference keyword frequencies. Then, word frequency statistics are computed over the text fragment of a candidate visual block, and the words occurring in that text fragment are intersected with the keywords. For each word in the intersection, its frequency in the candidate text fragment is multiplied by its reference keyword frequency, and the products are summed to give the keyword score of the candidate visual block. If the score is greater than s, the corresponding label is assigned (for example title, author, or abstract on a journal abstract page, or the news title, news author, or publication time on a news page).
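The keyword scoring step can be sketched as below. The sketch assumes that "frequency" means a raw occurrence count, that words are separated by whitespace, and that one reference keyword table is built per label; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def reference_keywords(fragments: Iterable[str], min_count: int) -> Dict[str, int]:
    """Keep the words whose count over the collected fragments exceeds min_count (N)."""
    counts = Counter(w for frag in fragments for w in frag.lower().split())
    return {w: c for w, c in counts.items() if c > min_count}


def keyword_score(text: str, reference: Dict[str, int]) -> int:
    """Sum, over the shared words, of candidate count times reference count."""
    counts = Counter(text.lower().split())
    return sum(counts[w] * reference[w] for w in counts.keys() & reference.keys())


def pre_label(text: str, reference: Dict[str, int], label: str,
              threshold: float) -> Optional[str]:
    """Assign the label only when the score is greater than the threshold (s)."""
    return label if keyword_score(text, reference) > threshold else None
```

For tokenized languages such as Chinese, the whitespace split would need to be replaced by a word segmenter.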

S4, labeling the candidate visual blocks

The probabilistic graphical model includes CRF, MLN, and the like; reference may be made to Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. For the features selected when building the probabilistic graphical model, reference may be made to Template-Independent News Extraction Based on Visual Consistency.

The candidate visual blocks are labeled by an undirected probabilistic graphical model to obtain the key information of the page (see Fig. 5). Key information is the part of the page that readers care about most, such as the title, authors, and abstract on a journal abstract page, or the news title, news author, and publication time on a news page.

Take CRF as an example. First, 200 pages are collected and manually labeled with eight labels: journal, DOI, title, author, publication time, abstract, keyword, and invalid. The CRF model is trained with a quasi-Newton method. Then the feature vector of each candidate visual block is computed. If only four features are considered, namely the width-to-height ratio ratio, the character-count-to-area ratio density, the abscissa x of the top-left corner, and the ordinate y of the top-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point each candidate visual block has two kinds of labels: a set of pre-labels and one label.
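The four-feature vector and the final check against the pre-label set can be sketched as follows. CRF training and Viterbi decoding are left out; the `Block` structure and function names are hypothetical, and treating the final matching step as "keep the model's label only if it is among the block's pre-labels" is an interpretation of the text.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Block(NamedTuple):
    x: float       # abscissa of the top-left corner
    y: float       # ordinate of the top-left corner
    width: float
    height: float
    text: str


def feature_vector(block: Block) -> Tuple[float, float, float, float]:
    """Return (ratio, density, x, y) for one candidate visual block."""
    ratio = block.width / block.height  # width-to-height ratio
    density = len(block.text) / (block.width * block.height)  # characters per unit area
    return (ratio, density, block.x, block.y)


def accept_label(label: str, pre_labels: Set[str]) -> Optional[str]:
    """Keep the model's label only when it falls within the pre-label set."""
    return label if label in pre_labels else None
```

Because a visual block is defined as a region of non-zero area, the two divisions are safe for valid blocks.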

The above is only an application embodiment of the present invention and certainly cannot be used to limit the scope of its rights; equivalent changes made within the scope of the claims of the present invention therefore still fall within its scope of protection.

Claims

1. A web page content extraction method, characterized by comprising:

S1, re-rendering the HTML

The DOM tree and render tree of the HTML document are first built, and each visual block is then re-rendered according to the DOM tree and render tree: an img tag is re-rendered as an arbitrary geometric figure, and each line of a p, div, or a tag is likewise re-rendered as an arbitrary geometric figure;

S2, segmenting the DOM tree

First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; this node is segmented horizontally, and then the child node of this node whose direction is longitudinal is selected;

Second, the node whose direction is longitudinal is segmented longitudinally one or more times, and then the child node with the largest visual block area is selected;

S3, pre-labeling the candidate visual blocks

Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm, and all the pre-labels together form a pre-label set;

S4, labeling the candidate visual blocks

Each candidate visual block is labeled by a probabilistic graphical model to obtain a corresponding label; all the labels are matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.

2. The web page content extraction method according to claim 1, characterized in that the DOM tree and render tree contain only img, p, div, and a tags.

3. The web page content extraction method according to claim 1, characterized in that the geometric figure is a set of crisscrossing line segments.

4. The web page content extraction method according to claim 1, characterized in that the geometric figure is a circle or an ellipse.

5. The web page content extraction method according to claim 1, characterized in that the geometric figure is a regular polygon.

6. The web page content extraction method according to claim 1, characterized in that the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of the block's frequency-domain representation; finally, the horizontal and vertical components of the visual block are compared to determine the direction of the visual block.