CN109325204A - Web page contents extraction method - Google Patents
- Publication number
- CN109325204A (application CN201811067868.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- web page
- label
- vision
- vision block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention belongs to the technical field of web page content extraction, and in particular relates to an automatic web page content extraction method that is especially suited to extracting the content of journal article abstract pages. The method comprises: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; S4, labeling the candidate visual blocks. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the algorithm's time and space efficiency.
Description
Technical field
The invention belongs to the technical field of web page content extraction, and in particular relates to an automatic web page content extraction method that is especially suited to extracting the content of journal article abstract pages.
Background
With the development of information technology, the internet is becoming increasingly important for acquiring information. It is also an effective way for researchers to obtain the most recently published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide abstract pages for journal articles on their main websites. Extracting information such as authors, publication date, and abstract from these pages is both the key step and the main difficulty in building an integrated database.
Web page content extraction has long been a hot topic in the field of information extraction. Existing methods fall roughly into three classes. The first is template-based methods, which extract content using XPath or CSS expressions of page elements. They are highly accurate, but creating templates is labor-intensive, large numbers of templates are hard to maintain, and robustness to changes in page structure is poor. The second is DOM-tree-based methods, which parse web pages into DOM trees and use supervised or semi-supervised learning to match the tree structure of the target page against a labeled page, either fully (alignment) or partially (partial alignment), in order to label the target page and then extract its content. These methods are inefficient (the algorithm's time complexity is proportional to the depth of the tree) and require multiple pages generated from the same template as input. The third is methods based on visual information, such as the VIPS page segmentation algorithm proposed by Microsoft Research Asia. These methods divide the page into visual blocks according to cues such as background color, text density, and font, learn an importance score for each visual block with a support vector machine (SVM) or neural network model, and then extract the main text of the page. They have high time and space complexity, depend on hand-crafted rules, and are not robust across page templates.
Summary of the invention
In view of the above technical problems, the object of the present invention is to provide an automatic web page content extraction method. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the algorithm's time and space efficiency.
To achieve the above object, the technical solution adopted by the present invention is as follows:
An automatic web page content extraction method, characterized by comprising:
S1, re-rendering the HTML
First, the DOM tree and render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and render tree: each img tag is re-rendered as an arbitrary geometric figure, and every line of each p, div, and a tag is likewise re-rendered as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. This node is segmented horizontally, and then the child node whose direction is vertical is selected from among its child nodes;
Second, the node whose direction is vertical is segmented vertically one or more times, and then the child node with the largest visual block area is selected from among its child nodes;
Finally, the node with the largest visual block area is segmented horizontally again, yielding several candidate visual blocks;
S3, pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm, and all pre-labels together form a pre-label set;
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its corresponding label; all labels are then matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
Preferably, the DOM tree and render tree contain only img, p, div, and a tags.
Preferably, the geometric figure is a set of intersecting horizontal and vertical line segments.
Preferably, the geometric figure is a circle or an ellipse.
Preferably, the geometric figure is a regular polygon.
Preferably, the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the horizontal and vertical components are compared to determine the direction of the visual block.
The beneficial effects of the invention are as follows. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the algorithm's time and space efficiency. In addition, the method uses a probabilistic graphical model to describe the local dependencies between candidate visual blocks, so that it adapts to different websites and page layouts and is reasonably robust to layout changes. Judging the directionality of page elements with log-Gabor filtering, combined with a conditional vector field, improves the model's extraction accuracy and offers another approach to automatic web page content extraction. When the geometric figure is a set of intersecting horizontal and vertical line segments, the operation is faster, because the simpler the geometric figure, the simpler the computation and the higher the speed.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Fig. 2 is a first schematic diagram of an embodiment of the invention.
Fig. 3 is a second schematic diagram of an embodiment of the invention.
Fig. 4 is a third schematic diagram of an embodiment of the invention.
Fig. 5 is a fourth schematic diagram of an embodiment of the invention.
Detailed description of the embodiments
For a better understanding of the present invention, its technical solution is further described below with reference to the embodiments and the drawings (Figs. 1 to 5).
As shown in Fig. 1, an automatic web page content extraction method comprises:
S1, re-rendering the HTML
First, the DOM tree and render tree of the HTML document are built; both contain only img, p, div, and a tags. Each visual block is then re-rendered according to the DOM tree and render tree. (A page element, after being processed by the browser's rendering engine, appears as a rectangular region of non-zero area in the page; this region is called a visual block. A page element is a piece of HTML code enclosed by a pair of HTML tags, such as <p> or <div>. Here each visual block corresponds to a node in the DOM tree.) Each img tag is re-rendered as an arbitrary geometric figure (for example a set of intersecting horizontal and vertical line segments, a regular figure such as a polygon, circle, or ellipse, or any irregular figure), and every line of text of each p, div, and a tag is likewise re-rendered as an arbitrary geometric figure;
As shown in Fig. 2 (each cross in the figure corresponds to one tag), the following takes re-rendering as a set of intersecting horizontal and vertical line segments (such as a cross) as an example:
img tags: each img tag is re-rendered as a set of intersecting horizontal and vertical line segments;
For example, the visual block of an img tag corresponds to a rectangular region in the page. Starting from the top-left corner and going counterclockwise, the four corners of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2), and R4 (x2, y1). P (x1, (y1+y2)/2), Q ((x1+x2)/2, y2), R (x2, (y1+y2)/2), and S ((x1+x2)/2, y1) are the midpoints of segments R1R2, R2R3, R3R4, and R4R1 respectively. The pair of mutually perpendicular, mutually bisecting segments PR and QS (hereinafter the "cross") can then be used as the result of re-rendering the img tag.
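The cross construction for an img tag can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `cross_for_rect` and the representation of a segment as a pair of (x, y) tuples are assumptions made for the example.

```python
def cross_for_rect(x1, y1, x2, y2):
    """Return the two segments PR and QS of the "cross" for a rectangle.

    (x1, y1) is the top-left corner and (x2, y2) the opposite corner,
    matching the corner naming R1..R4 used in the text.
    """
    p = (x1, (y1 + y2) / 2)   # midpoint of R1R2 (left edge)
    q = ((x1 + x2) / 2, y2)   # midpoint of R2R3 (bottom edge)
    r = (x2, (y1 + y2) / 2)   # midpoint of R3R4 (right edge)
    s = ((x1 + x2) / 2, y1)   # midpoint of R4R1 (top edge)
    return [(p, r), (q, s)]   # horizontal segment PR, vertical segment QS


segments = cross_for_rect(0, 0, 4, 2)
```

For a 4 by 2 rectangle at the origin, PR runs along y = 1 and QS along x = 2, so the two segments bisect each other at the rectangle's center.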
p, div, and a tags: each line of text in such a tag is re-rendered as a set of intersecting horizontal and vertical line segments;
For example, the visual block of a p tag corresponds to a rectangular region in the page. Starting from the top-left corner and going counterclockwise, the four corners of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2), and R4 (x2, y1). The width of the rectangle is W pixels. The text contained in the p tag is C bytes long, and the font size is F pixels. The number of text lines N in the p tag's visual block can then be estimated from C, F, and W, with the estimate rounded up to a whole number of lines. Let P1, P2, ..., Pn be the points that divide segment R1R2 into N+1 equal parts, and R1, R2, ..., Rn the points that divide segment R3R4 into N+1 equal parts; let Q and S be the midpoints of segments R2R3 and R1R4 respectively. The group of segments P1R1, P2R2, ..., PnRn, QS can then be used as the result of re-rendering the p tag.
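The text-line construction for a p tag can be sketched as follows. This is a hedged illustration only. In particular, the estimate N = ceil(C * F / W) is an assumption chosen for the example (it treats each byte as roughly one font-size-wide glyph); the exact formula is not recoverable from the text above. The function names are likewise hypothetical.

```python
import math


def estimate_line_count(text_bytes, font_px, width_px):
    """Assumed estimate of the number of text lines: ceil(C * F / W), at least 1."""
    return max(1, math.ceil(text_bytes * font_px / width_px))


def text_line_segments(x1, y1, x2, y2, n_lines):
    """Return N horizontal segments PiRi plus the vertical segment QS.

    The horizontal segments sit at the points dividing the left and right
    edges of the rectangle into n_lines + 1 equal parts.
    """
    step = (y2 - y1) / (n_lines + 1)
    rows = [((x1, y1 + k * step), (x2, y1 + k * step))
            for k in range(1, n_lines + 1)]
    mid_x = (x1 + x2) / 2
    return rows + [((mid_x, y2), (mid_x, y1))]  # QS joins the bottom and top midpoints


n = estimate_line_count(text_bytes=100, font_px=10, width_px=300)
segments = text_line_segments(0, 0, 300, 40, n)
```

With these example values the estimate is four lines, so the block is rendered as four horizontal segments and one vertical segment.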
S2, segmenting the DOM tree (horizontal, vertical, then horizontal segmentation)
As shown in Fig. 3, first, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. This node is segmented horizontally (e.g. into the three vertical blocks VB1, VB2, and VB3 in Fig. 3), and then the child node whose direction is vertical is selected from among its child nodes (VB3 in Fig. 3);
As shown in Fig. 4, second, the node whose direction is vertical is segmented vertically one or more times (e.g. into the three horizontal blocks VB1, VB2, and VB3 in Fig. 4), and then the child node with the largest visual block area is selected from among its child nodes (VB2 in Fig. 4);
When nested nodes occur in the DOM tree, several vertical decompositions are needed to obtain a clean result;
As shown in Fig. 5, finally, the node with the largest visual block area is segmented horizontally again (see the vertically divided boxes in Fig. 5, which from top to bottom represent journal, DOI, title, author, publication date, abstract, and keywords), yielding several candidate visual blocks;
The node segmentation method (for both horizontal and vertical segmentation) is as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform (FFT); a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the two components are compared to determine the direction of the visual block (if the horizontal component is smaller than the vertical component, the direction of the visual block is horizontal; if the vertical component is smaller than the horizontal component, the direction is vertical);
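The direction test can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the patented implementation: the block is assumed to be rasterized into a square grayscale grid whose side is a power of two; the filter parameters (`f0`, `sigma_ratio`, `sigma_theta`) are hypothetical values; and "horizontal component" is read here as the spectral energy passed by the filter oriented along the horizontal frequency axis.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fft2(img):
    """2-D FFT: transform every row, then every column. Result is indexed [v][u]."""
    rows = [fft(row) for row in img]
    cols = [fft(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def log_gabor(u, v, theta0, f0=0.25, sigma_ratio=0.55, sigma_theta=math.pi / 6):
    """Log-Gabor transfer function at frequency (u, v) for orientation theta0."""
    f = math.hypot(u, v)
    if f == 0:
        return 0.0  # a log-Gabor filter has no DC response
    radial = math.exp(-(math.log(f / f0) ** 2) / (2 * math.log(sigma_ratio) ** 2))
    d = abs(math.atan2(v, u) - theta0) % math.pi
    d = min(d, math.pi - d)  # orientations are equivalent modulo 180 degrees
    return radial * math.exp(-(d ** 2) / (2 * sigma_theta ** 2))


def block_direction(img):
    """Compare the energies passed by two orthogonal log-Gabor filters."""
    n = len(img)
    spectrum = fft2(img)
    e_h = e_v = 0.0
    for r in range(n):
        v = (r if r < n // 2 else r - n) / n  # vertical frequency, cycles per pixel
        for c in range(n):
            u = (c if c < n // 2 else c - n) / n  # horizontal frequency
            power = abs(spectrum[r][c]) ** 2
            e_h += power * log_gabor(u, v, 0.0)
            e_v += power * log_gabor(u, v, math.pi / 2)
    # Rule from the text: a smaller horizontal component means a horizontal block.
    return "horizontal" if e_h < e_v else "vertical"


horizontal_stripes = [[1.0 if (r // 2) % 2 == 0 else 0.0] * 16 for r in range(16)]
vertical_stripes = [list(col) for col in zip(*horizontal_stripes)]
```

A grid of horizontal stripes, which resembles the re-rendered text lines, varies only along the vertical axis, so almost all of its energy passes the vertically oriented filter and the block is classified as horizontal. The transposed grid gives the opposite result.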
(1) Traversing the DOM tree from top to bottom
The DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. If a horizontal decomposition is being performed, the N (N > 1) child nodes of that node are processed as described in S2, and the node whose direction is vertical is selected. If a vertical decomposition is being performed, the child node with the largest visual block area is chosen.
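The traversal and the largest-area selection can be sketched as follows. The `Node` class and both function names are hypothetical, introduced only for this example; a real implementation would take the block geometry from the render tree.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    width: float = 0.0
    height: float = 0.0
    children: list = field(default_factory=list)


def first_branching_node(root):
    """Breadth-first search for the first node with more than one child."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) > 1:
            return node
        queue.extend(node.children)
    return None  # the tree is a single chain with no branching node


def largest_area_child(node):
    """Child whose visual block (width * height) has the largest area."""
    return max(node.children, key=lambda child: child.width * child.height)


sidebar = Node("div", 100, 20)
main = Node("div", 100, 300)
container = Node("div", 100, 320, [sidebar, main])
root = Node("div", 100, 320, [Node("div", 100, 320, [container])])
```

In this small tree the first two levels have a single child each, so the search returns `container`, and `largest_area_child(container)` picks `main`.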
S3, pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm, and all pre-labels together form a pre-label set;
For the heuristic algorithm, see "Extracting multiple news attributes based on visual features".
The keyword frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, word frequencies are counted over the text fragments in a collected set of data blocks, and the words whose frequency of occurrence is greater than N are selected as keywords; the frequencies of these keywords are recorded as the reference keyword frequencies. Then, word frequencies are counted over the text fragment in the candidate visual block, and the set of words occurring in that fragment is intersected with the keyword set. For each word in the intersection, its frequency in the candidate text fragment is multiplied by its reference keyword frequency, and the products are summed to give the keyword score of the candidate visual block. If the score is greater than a threshold s, the corresponding label is assigned (such as title, author, or abstract in a journal abstract page, or news headline, news author, or publication time in a news page).
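The keyword frequency score can be sketched as follows. This is a minimal illustration under assumptions: frequencies are taken as raw counts, text is tokenized by lowercasing and splitting on whitespace, and the function names are hypothetical.

```python
from collections import Counter


def reference_frequencies(fragments, min_count):
    """Keywords are the words occurring more than min_count times in the collected fragments."""
    counts = Counter(word for frag in fragments for word in frag.lower().split())
    return {word: c for word, c in counts.items() if c > min_count}


def keyword_score(block_text, reference):
    """Sum over shared words of (count in the block) * (reference frequency)."""
    counts = Counter(block_text.lower().split())
    shared = counts.keys() & reference.keys()
    return sum(counts[word] * reference[word] for word in shared)


fragments = ["abstract of the paper", "abstract and keywords", "the abstract"]
reference = reference_frequencies(fragments, min_count=1)
score = keyword_score("Abstract the the method", reference)
```

Here "abstract" (3 occurrences) and "the" (2 occurrences) become keywords, and the candidate text scores 1 * 3 + 2 * 2 = 7; the block would receive the label if this exceeds the chosen threshold s.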
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its corresponding label; all labels are then matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
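The matching step can be sketched as follows. This reflects one possible reading, in which a single pre-label set is shared by the whole page and a model label is kept only if it appears in that set; the function name and the (block id, label) pair representation are assumptions made for the example.

```python
def select_labels(model_labels, pre_label_set):
    """Keep the (block_id, label) pairs whose label falls within the pre-label set."""
    return [(block_id, label)
            for block_id, label in model_labels
            if label in pre_label_set]


model_labels = [("b1", "title"), ("b2", "invalid"), ("b3", "abstract")]
kept = select_labels(model_labels, {"title", "author", "abstract"})
```

In this example the block labeled "invalid" is dropped because that label was never produced during pre-labeling.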
The probabilistic graphical model may be a CRF, an MLN, etc.; see "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". For the features chosen when building the probabilistic graphical model, see "Template-Independent News Extraction Based on Visual Consistency".
The candidate visual blocks are labeled by an undirected probabilistic graphical model to obtain the key information of the page (see Fig. 5). Key information is the part of the page that readers care about most, such as the title, author, and abstract in a journal abstract page, or the news headline, news author, and publication time in a news page.
Take CRF as an example. First, 200 pages are collected and manually annotated with eight labels: journal, DOI, title, author, publication date, abstract, keywords, and invalid. The CRF model is trained with a quasi-Newton method. Then the feature vector of each candidate visual block is computed. If only four features are considered, namely the width-to-height ratio (ratio), the character-count-to-area ratio (density), the abscissa x of the top-left corner, and the ordinate y of the top-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point each candidate visual block has two kinds of labels: a set of pre-labels and one label.
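The feature vector and the Viterbi decoding step can be sketched as follows. This is an illustration only: a trained CRF would supply the per-block and label-to-label scores, whereas here `emissions` and `transitions` are hypothetical log-space scores written by hand, and the function names are assumptions.

```python
def block_features(x, y, width, height, n_chars):
    """Feature vector (ratio, density, x, y) of a candidate visual block."""
    return (width / height, n_chars / (width * height), x, y)


def viterbi(emissions, transitions, labels):
    """Most likely label sequence for a linear chain.

    emissions[t][label] is the score of `label` at position t;
    transitions[(prev, cur)] is the score of moving from prev to cur.
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):  # follow back-pointers from the last block
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["title", "author"]
emissions = [{"title": 2.0, "author": 0.0}, {"title": 1.0, "author": 0.9}]
transitions = {("title", "title"): -5.0, ("title", "author"): 0.0,
               ("author", "title"): 0.0, ("author", "author"): -5.0}
decoded = viterbi(emissions, transitions, labels)
```

Taken block by block, the second block would be labeled "title" (1.0 against 0.9), but the penalty on two consecutive titles makes the sequence title, author the best overall. This is the kind of local dependency between neighboring blocks that the chain model is meant to capture.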
The above describes only application examples of the present invention and certainly cannot be used to limit the scope of its claims; equivalent changes made within the scope of the claims of the present invention therefore still fall within its scope of protection.
Claims (6)
1. An automatic web page content extraction method, characterized by comprising:
S1, re-rendering the HTML
First, the DOM tree and render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and render tree: each img tag is re-rendered as an arbitrary geometric figure, and every line of each p, div, and a tag is likewise re-rendered as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. This node is segmented horizontally, and then the child node whose direction is vertical is selected from among its child nodes;
Second, the node whose direction is vertical is segmented vertically one or more times, and then the child node with the largest visual block area is selected from among its child nodes;
Finally, the node with the largest visual block area is segmented horizontally again, yielding several candidate visual blocks;
S3, pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm, and all pre-labels together form a pre-label set;
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its corresponding label; all labels are then matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
2. The automatic web page content extraction method according to claim 1, characterized in that the DOM tree and render tree contain only img, p, div, and a tags.
3. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a set of intersecting horizontal and vertical line segments.
4. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a circle or an ellipse.
5. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a regular polygon.
6. The automatic web page content extraction method according to claim 1, characterized in that the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the horizontal and vertical components are compared to determine the direction of the visual block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811067868.8A CN109325204B (en) | 2018-09-13 | 2018-09-13 | Automatic extraction method of webpage content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109325204A true CN109325204A (en) | 2019-02-12 |
CN109325204B CN109325204B (en) | 2022-01-07 |
Family
ID=65266010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811067868.8A Active CN109325204B (en) | 2018-09-13 | 2018-09-13 | Automatic extraction method of webpage content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325204B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||