CN109325204A - Web page contents extraction method - Google Patents


Info

Publication number
CN109325204A
Authority
CN
China
Prior art keywords
node
web page
label
vision
vision block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811067868.8A
Other languages
Chinese (zh)
Other versions
CN109325204B (en)
Inventor
王世阳
李阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN BIORUN BIO-TECH Co Ltd
Original Assignee
WUHAN BIORUN BIO-TECH Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN BIORUN BIO-TECH Co Ltd
Priority to CN201811067868.8A
Publication of CN109325204A
Application granted
Publication of CN109325204B
Legal status: Active
Anticipated expiration

Abstract

The invention belongs to the technical field of web page content extraction, and in particular relates to an automatic web page content extraction method that is especially suitable for extracting the content of journal article abstract pages. The method comprises: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; S4, labeling the candidate visual blocks. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the time and space efficiency of the algorithm.

Description

Automatic web page content extraction method
Technical field
The invention belongs to the technical field of web page content extraction, and in particular relates to an automatic web page content extraction method that is especially suitable for extracting the content of journal article abstract pages.
Background art
With the development of information technology, the internet is becoming increasingly important for acquiring information. It is also an effective way for researchers to obtain the latest published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide journal article abstract pages on their main websites. Extracting information such as authors, publication time and abstract from these pages is both the key point and the main difficulty in building an integrated database.
Web page content extraction has long been a hot topic in the field of information extraction (Information Extraction). Existing methods fall roughly into three classes. The first class is template-based: content is extracted according to the XPath or CSS expressions of web page elements. These methods are highly accurate, but creating templates consumes a great deal of manpower, large numbers of templates are hard to maintain, and the templates are not robust to changes in page structure. The second class is based on the DOM tree: the web page is parsed into a DOM tree, and the target page is matched against an annotated page by tree alignment or partial tree alignment using supervised or semi-supervised learning, so that the target page is labeled and its content extracted. These methods are inefficient (the time complexity of the algorithm is proportional to the depth of the tree) and require multiple pages generated from the same template as input. The third class is based on visual information, for example the VIPS page segmentation algorithm proposed by Microsoft Research Asia. These methods divide the page into several visual blocks according to cues such as background color, text density and font, learn an importance index for each visual block with a support vector machine (SVM) or a neural network model, and then extract the main text of the page. Their time and space complexity is relatively high, they depend on manually formulated rules, and they are not robust to page templates.
Summary of the invention
In view of the above technical problems, the object of the present invention is to provide an automatic web page content extraction method. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the time and space efficiency of the algorithm.
To achieve the above object, the technical solution adopted by the present invention is as follows:
An automatic web page content extraction method, characterized by comprising:
S1. Re-rendering the HTML
First, the DOM tree and the render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and the render tree: each img tag is re-rendered as an arbitrary geometric figure, and each line of each p, div and a tag is likewise re-rendered as an arbitrary geometric figure;
S2. Segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; horizontal segmentation is performed on that node, and the child node whose direction is longitudinal is then selected from its child nodes;
Second, longitudinal segmentation is performed one or more times on the node whose direction is longitudinal, and the child node with the largest visual block area is then selected from its child nodes;
Finally, horizontal segmentation is performed again on the node with the largest visual block area, yielding several candidate visual blocks;
S3. Pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm; all the pre-labels form a pre-label set;
S4. Labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain a corresponding label; all labels are matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
Preferably, the DOM tree and the render tree contain only img, p, div and a tags.
Preferably, the geometric figure is a group of crisscrossing line segments.
Preferably, the geometric figure is a circle or an ellipse.
Preferably, the geometric figure is a regular polygon.
Preferably, the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by the fast Fourier transform; the horizontal and vertical components of that frequency-domain representation are then separated with a pair of orthogonal log-Gabor filters; finally, the horizontal and vertical components are compared to determine the direction of the visual block.
The beneficial effects of the invention are as follows. The method replaces conventional visual algorithms with the fast Fourier transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the time and space efficiency of the algorithm. In addition, the method uses a probabilistic graphical model to describe the local dependencies between candidate visual blocks so as to adapt to different websites and to changes in page layout, and is therefore robust to a certain degree against layout changes. Judging the directionality of page elements with log-Gabor filtering, combined with conditional random fields, improves the extraction accuracy of the model and offers another approach to automatic web page content extraction. The simpler the geometric figure, the simpler the computation and the higher the speed, so when the geometric figure is a group of crisscrossing line segments the computation is correspondingly faster.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Fig. 2 is the first schematic diagram of the embodiment of the invention.
Fig. 3 is the second schematic diagram of the embodiment of the invention.
Fig. 4 is the third schematic diagram of the embodiment of the invention.
Fig. 5 is the fourth schematic diagram of the embodiment of the invention.
Detailed description of the embodiments
For a better understanding of the invention, the technical solution of the invention is further described below with reference to the embodiment and the drawings (Figs. 1 to 5).
As shown in Fig. 1, an automatic web page content extraction method comprises:
S1. Re-rendering the HTML
First, the DOM tree and the render tree of the HTML document are built; the DOM tree and the render tree contain only img, p, div and a tags. Each visual block is then re-rendered according to the DOM tree and the render tree. (A page element, after being processed by the browser rendering engine, is represented as a rectangular region of non-zero area in the page, which is called a visual block. A page element is a piece of HTML code enclosed by a pair of HTML tags, such as <p> or <div>. A visual block here corresponds to a node in the DOM tree.) Each img tag is re-rendered as an arbitrary geometric figure (for example a group of crisscrossing line segments, a regular figure such as a polygon, circle or ellipse, or any irregular geometric figure), and each line of text of each p, div and a tag is likewise re-rendered as an arbitrary geometric figure;
As shown in Fig. 2 (each cross in the figure corresponds to one tag), re-rendering as a group of crisscrossing line segments (such as a cross) is taken as the example below:
img tag: the img tag is re-rendered as a group of crisscrossing line segments;
For example, the visual block of an img tag corresponds to a rectangular region in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2) and R4 (x2, y1). P (x1, (y1+y2)/2), Q ((x1+x2)/2, y2), R (x2, (y1+y2)/2) and S ((x1+x2)/2, y1) are the midpoints of line segments R1R2, R2R3, R3R4 and R4R1 respectively. The pair of mutually perpendicular, mutually bisecting line segments PR and QS (hereinafter referred to as a "cross") can then be used as the re-rendering result of the img tag.
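The cross construction for an img tag can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Segment` and `cross_for_rect` are hypothetical and do not appear in the patent, and the rectangle is assumed to be given by its two corner coordinates (x1, y1) and (x2, y2) as in the text.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """A line segment from (x0, y0) to (x1, y1). Hypothetical helper type."""
    x0: float
    y0: float
    x1: float
    y1: float


def cross_for_rect(x1, y1, x2, y2):
    """Return the segments PR and QS that bisect the rectangle's sides.

    P and R are the midpoints of the left and right sides,
    Q and S are the midpoints of the sides at y2 and y1.
    """
    xm = (x1 + x2) / 2
    ym = (y1 + y2) / 2
    pr = Segment(x1, ym, x2, ym)  # P -> R, the horizontal bar of the cross
    qs = Segment(xm, y2, xm, y1)  # Q -> S, the vertical bar of the cross
    return pr, qs
```

For a 4 by 2 rectangle anchored at the origin, `cross_for_rect(0, 0, 4, 2)` yields a horizontal bar at y = 1 and a vertical bar at x = 2.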
p, div and a tags: each line of text of such a tag is re-rendered as a group of crisscrossing line segments;
For example, the visual block of a p tag corresponds to a rectangular region in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1 (x1, y1), R2 (x1, y2), R3 (x2, y2) and R4 (x2, y1). The width of the rectangle is W pixels, the text contained in the p tag is C bytes long, and the font size is F pixels. The number of text lines in the visual block of the p tag can then be estimated as N = ⌈…⌉ (⌈ ⌉ denotes rounding up; the expression in terms of C, F and W is not reproduced here). Let P1, P2, ..., Pn be the points that divide line segment R1R2 into N+1 equal parts, let R1, R2, ..., Rn be the points that divide line segment R3R4 into N+1 equal parts, and let Q and S be the midpoints of line segments R2R3 and R1R4 respectively. The group of line segments P1R1, P2R2, ..., PnRn, QS can then be used as the re-rendering result of the p tag.
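The segment group for a text block can be sketched as follows. Because the line-count estimate is not reproduced in this text, the sketch takes the number of lines as an input rather than computing it; the function name `text_block_segments` and the tuple representation of a segment are assumptions made for illustration.

```python
def text_block_segments(x1, y1, x2, y2, n_lines):
    """Re-render a text block as n_lines horizontal rules plus one vertical spine.

    The horizontal rules join the points that divide the left and right
    sides of the rectangle into n_lines + 1 equal parts; the spine joins
    the midpoints of the two remaining sides (Q and S in the description).
    """
    step = (y2 - y1) / (n_lines + 1)
    rows = [
        ((x1, y1 + k * step), (x2, y1 + k * step))  # Pk -> Rk
        for k in range(1, n_lines + 1)
    ]
    xm = (x1 + x2) / 2
    spine = ((xm, y2), (xm, y1))  # Q -> S
    return rows + [spine]
```

With two lines in a 10 by 30 rectangle, the rules fall at y = 10 and y = 20 and the spine runs down x = 5.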
S2. Segmenting the DOM tree (horizontal, longitudinal, then horizontal segmentation)
As shown in Fig. 3, first, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; horizontal segmentation is performed on that node (for example VB1, VB2 and VB3 in Fig. 3, three blocks arranged along the longitudinal direction), and the child node whose direction is longitudinal is then selected from its child nodes (VB3 in Fig. 3);
As shown in Fig. 4, second, longitudinal segmentation is performed one or more times on the node whose direction is longitudinal (for example VB1, VB2 and VB3 in Fig. 4, three blocks arranged along the horizontal direction), and the child node with the largest visual block area is then selected from its child nodes (VB2 in Fig. 4);
When nested nodes occur in the DOM tree, several longitudinal segmentations are needed to obtain a clean result;
As shown in Fig. 5, finally, horizontal segmentation is performed again on the node with the largest visual block area (the boxes in Fig. 5, which from top to bottom represent journal, DOI, title, author, publication time, abstract and keywords), yielding several candidate visual blocks;
The segmentation of a node (both horizontal and longitudinal) proceeds as follows: the frequency-domain representation of the visual block is first obtained by the fast Fourier transform (FFT); the horizontal and vertical components of that frequency-domain representation are then separated with a pair of orthogonal log-Gabor filters; finally, the horizontal and vertical components are compared to determine the direction of the visual block (if the horizontal component is smaller than the vertical component, the direction of the visual block is horizontal; if the vertical component is smaller than the horizontal component, the direction of the visual block is longitudinal);
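The direction test can be illustrated with NumPy. The sketch below is not the patented implementation: the filter parameters (`f0`, `sigma_ratio`, `sigma_theta`), the function names, and the reading of "horizontal component" as the spectral energy near the horizontal frequency axis are all assumptions. The angular term is folded so that each filter is symmetric, which is sufficient for comparing energies.

```python
import numpy as np


def log_gabor_pair(shape, f0=0.1, sigma_ratio=0.55, sigma_theta=np.pi / 8):
    """Build two frequency-domain log-Gabor filters oriented at 0 and 90 degrees."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]  # vertical spatial frequencies
    fx = np.fft.fftfreq(cols)[None, :]  # horizontal spatial frequencies
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0  # placeholder to avoid log(0); DC is zeroed below
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    radial[0, 0] = 0.0
    theta = np.arctan2(fy, fx)
    filters = []
    for theta0 in (0.0, np.pi / 2):
        # Angular distance to theta0, folded so theta and theta + pi match.
        d = np.abs(np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0)))
        d = np.minimum(d, np.pi - d)
        filters.append(radial * np.exp(-(d ** 2) / (2 * sigma_theta ** 2)))
    return filters


def block_direction(img):
    """Classify a re-rendered block as 'horizontal' or 'longitudinal'."""
    spectrum = np.fft.fft2(img - img.mean())
    h_filter, v_filter = log_gabor_pair(img.shape)
    h_energy = np.sum(np.abs(spectrum * h_filter) ** 2)
    v_energy = np.sum(np.abs(spectrum * v_filter) ** 2)
    # Rule from the description: a smaller horizontal component
    # means the block direction is horizontal.
    return "horizontal" if h_energy < v_energy else "longitudinal"
```

A block made of stacked horizontal rules varies only along y, so its energy sits on the vertical frequency axis and the block is classified as horizontal; transposing the image flips the result.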
(1) Traversing the DOM tree from top to bottom
The DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. If horizontal segmentation is being performed, the N (N > 1) child nodes of that node are processed as described in S2, and the node whose direction is longitudinal is selected. If longitudinal segmentation is being performed, the child node with the largest visual block area is selected.
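The breadth-first search for the first branching node can be sketched as follows. The `Node` class is a hypothetical stand-in for a DOM node and is not part of the patent; a real implementation would walk whatever DOM representation the rendering engine provides.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical)."""
    tag: str
    children: list = field(default_factory=list)


def first_branching_node(root):
    """Return the first node, in breadth-first order, with more than one child."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) > 1:
            return node
        queue.extend(node.children)
    return None  # the tree is a single chain with no branching node
```

For a chain html > body > div where the div holds a p and an img, the search skips html and body and returns the div.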
S3. Pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm; all the pre-labels form a pre-label set;
For the heuristic algorithm (heuristic), see Extracting multiple news attributes based on visual features.
The keyword frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, word frequency statistics are computed on the text fragments of a collected set of data blocks, and the words whose frequency of occurrence is greater than N are selected as keywords; the frequencies with which these keywords occur are recorded as the reference keyword frequencies. Then, word frequency statistics are computed on the text fragment of the candidate visual block, and the set of words occurring in that text fragment is intersected with the set of keywords. For each word in the intersection, its frequency in the candidate text fragment is multiplied by its reference keyword frequency, and the products are summed to give the keyword score of the candidate visual block. If the score is greater than s, the corresponding label is assigned (for example title, author or abstract in a journal abstract page, or headline, news author or publication time in a news page).
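The keyword score can be sketched in Python as below. The sketch makes several assumptions that the description leaves open: text is tokenized by lowercasing and splitting on whitespace (Chinese text would need a word segmenter), raw counts are used as frequencies, and the function names and the `min_count` parameter (standing in for the threshold N) are hypothetical.

```python
from collections import Counter


def reference_frequencies(samples, min_count):
    """Count words over the collected text fragments and keep the frequent ones."""
    counts = Counter(w for text in samples for w in text.lower().split())
    return {w: c for w, c in counts.items() if c > min_count}


def keyword_score(block_text, reference):
    """Sum, over the shared words, block frequency times reference frequency."""
    block_counts = Counter(block_text.lower().split())
    shared = block_counts.keys() & reference.keys()  # the intersection step
    return sum(block_counts[w] * reference[w] for w in shared)
```

A block would then receive the pre-label associated with a keyword set when its score exceeds the threshold s; one reference dictionary per label is a natural arrangement, though the description does not spell this out.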
S4. Labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain a corresponding label; all labels are matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
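The matching step can be sketched as a simple membership filter. This reading, in which a model label is kept only if it also appears in the single pre-label set built in S3, is an interpretation of the description; the function name `filter_labels` is hypothetical.

```python
def filter_labels(predicted, pre_labels):
    """Keep (block index, label) pairs whose label is in the pre-label set."""
    return [(i, label) for i, label in enumerate(predicted) if label in pre_labels]
```

For instance, if the model outputs title, author and invalid for three blocks and the pre-label set contains title, author and abstract, only the first two blocks keep their labels.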
The probabilistic graphical model may be a CRF, an MLN, etc.; see Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. For the features selected when building the probabilistic graphical model, see Template-Independent News Extraction Based on Visual Consistency.
The candidate visual blocks are labeled by an undirected probabilistic graphical model to obtain the key information of the page (see Fig. 5). Key information is the part of the page that readers care about most, such as the title, author and abstract in a journal abstract page, or the headline, news author and publication time in a news page.
Take CRF as an example. First, 200 pages are collected and manually annotated with the eight labels journal, DOI, title, author, publication time, abstract, keywords and invalid, and a CRF model is trained with a quasi-Newton method. Then, the feature vector of each candidate visual block is computed. If only four features are considered, namely the width-to-height ratio (ratio), the ratio of character count to area (density), and the abscissa x and ordinate y of the upper-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point each candidate visual block has two kinds of labels: a group of pre-labels and one label.
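The feature vector and the Viterbi decoding step can be sketched with NumPy. This is a generic illustration, not the trained model from the description: `block_features` and `viterbi` are hypothetical names, and the emission and transition score matrices are assumed to be supplied by an already trained linear-chain CRF.

```python
import numpy as np


def block_features(x, y, width, height, n_chars):
    """Return (ratio, density, x, y) for one candidate visual block."""
    return (width / height, n_chars / (width * height), x, y)


def viterbi(emission, transition):
    """Decode the best label sequence for a linear chain.

    emission:   (T, K) array, score of label k at position t
    transition: (K, K) array, score of moving from label i to label j
    Returns a list of T label indices.
    """
    T, K = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition  # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emission[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))  # follow back-pointers
    return path[::-1]
```

The blocks are decoded in page order, so the transition scores let the model exploit regularities such as the title usually preceding the author list.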
The above describes only application embodiments of the invention and of course cannot be used to limit the scope of the claims of the invention; equivalent changes made according to the scope of the patent application of the invention therefore still fall within the scope of protection of the invention.

Claims (6)

1. An automatic web page content extraction method, characterized by comprising:
S1. Re-rendering the HTML
First, the DOM tree and the render tree of the HTML document are built; each visual block is then re-rendered according to the DOM tree and the render tree, each img tag being re-rendered as an arbitrary geometric figure and each line of each p, div and a tag likewise being re-rendered as an arbitrary geometric figure;
S2. Segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; horizontal segmentation is performed on that node, and the child node whose direction is longitudinal is then selected from its child nodes;
Second, longitudinal segmentation is performed one or more times on the node whose direction is longitudinal, and the child node with the largest visual block area is then selected from its child nodes;
Finally, horizontal segmentation is performed again on the node with the largest visual block area, yielding several candidate visual blocks;
S3. Pre-labeling the candidate visual blocks
Each candidate visual block is given a corresponding pre-label by a heuristic algorithm and/or a keyword frequency algorithm; all the pre-labels form a pre-label set;
S4. Labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain a corresponding label; all labels are matched one by one against the pre-label set, and the labels that fall within the pre-label set are selected.
2. The automatic web page content extraction method according to claim 1, characterized in that the DOM tree and the render tree contain only img, p, div and a tags.
3. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a group of crisscrossing line segments.
4. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a circle or an ellipse.
5. The automatic web page content extraction method according to claim 1, characterized in that the geometric figure is a regular polygon.
6. The automatic web page content extraction method according to claim 1, characterized in that the node is segmented as follows: the frequency-domain representation of the visual block is first obtained by the fast Fourier transform; the horizontal and vertical components of that frequency-domain representation are then separated with a pair of orthogonal log-Gabor filters; finally, the horizontal and vertical components are compared to determine the direction of the visual block.
CN201811067868.8A 2018-09-13 2018-09-13 Automatic extraction method of webpage content Active CN109325204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811067868.8A CN109325204B (en) 2018-09-13 2018-09-13 Automatic extraction method of webpage content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811067868.8A CN109325204B (en) 2018-09-13 2018-09-13 Automatic extraction method of webpage content

Publications (2)

Publication Number Publication Date
CN109325204A true CN109325204A (en) 2019-02-12
CN109325204B CN109325204B (en) 2022-01-07

Family

ID=65266010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811067868.8A Active CN109325204B (en) 2018-09-13 2018-09-13 Automatic extraction method of webpage content

Country Status (1)

Country Link
CN (1) CN109325204B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120246137A1 (en) * 2011-03-22 2012-09-27 Satish Sallakonda Visual profiles
CN102156737A (en) * 2011-04-12 2011-08-17 华中师范大学 Method for extracting subject content of Chinese webpage
CN102253979A (en) * 2011-06-23 2011-11-23 天津海量信息技术有限公司 Vision-based web page extracting method
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 A kind of text message extracting method based on website construction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张乃洲等: "一种基于节点密度分割和标签传播的Web页面挖掘方法", 《计算机学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021102387A1 (en) * 2019-11-22 2021-05-27 Tenweb, Inc Generating higher-level semantics data for development of visual content
US11579849B2 (en) 2019-11-22 2023-02-14 Tenweb, Inc. Generating higher-level semantics data for development of visual content
CN110968761A (en) * 2019-11-29 2020-04-07 福州大学 Self-adaptive extraction method for webpage structured data
CN110968761B (en) * 2019-11-29 2022-07-08 福州大学 Webpage structured data self-adaptive extraction method
CN112347332A (en) * 2020-11-17 2021-02-09 南开大学 XPath-based crawler target positioning method

Also Published As

Publication number Publication date
CN109325204B (en) 2022-01-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant