CN109325204B - Automatic extraction method of webpage content - Google Patents
Automatic extraction method of webpage content
- Publication number
- CN109325204B
- Authority
- CN
- China
- Prior art keywords
- node
- visual block
- rendering
- visual
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of webpage content extraction and particularly relates to a method for automatically extracting webpage content that is especially suitable for extracting the content of journal article abstract pages. The method comprises the following steps: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; and S4, labeling the candidate visual blocks. The method replaces the traditional visual algorithm with a Fast Fourier Transform (FFT) and log-Gabor filtering, thereby reducing time and space complexity and improving the time and space efficiency of the algorithm.
Description
Technical Field
The invention belongs to the technical field of webpage content extraction and particularly relates to an automatic webpage content extraction method that is especially suitable for extracting the content of journal article abstract pages.
Background
With the development of information technology, the Internet has become increasingly important for acquiring information, and it is an effective way for researchers to obtain the most recently published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide abstract pages for journal articles on their official websites. Extracting information such as authors, publication dates and abstracts from these abstract pages is key to building an integrated database, and it is also a difficult problem.
Webpage content extraction is an active topic in the field of information extraction. Existing methods can be broadly divided into three categories.
The first category is template-based. These methods extract content according to the XPath and CSS expressions of webpage elements. They are highly accurate, but creating templates consumes a large amount of manual effort, a large number of templates is difficult to maintain, and robustness to changes in webpage structure is poor.
The second category is DOM-tree-based. These methods parse the webpage into a DOM tree, perform tree structure matching (alignment) or partial matching (partial alignment) between the target webpage and labeled pages through supervised or semi-supervised learning, label the target webpage, and then extract its content. They are inefficient (the time complexity of the Shing-Ling algorithm is proportional to the depth of the tree) and require several pages generated from the same template as input.
The third category is based on visual information, such as the VIPS page segmentation algorithm proposed by Microsoft Research Asia. These methods divide a page into a number of visual blocks according to cues such as background color, text density and font, obtain an importance index for each visual block by training a Support Vector Machine (SVM) or a neural network model, and then extract the text content of the page. They have high time and space complexity, depend on manually established rules, and are not robust to new webpage templates.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide an automatic extraction method for webpage content that replaces the conventional visual algorithm with a Fast Fourier Transform (FFT) and log-Gabor filtering, thereby reducing time and space complexity and improving the time and space efficiency of the algorithm.
To achieve this object, the technical solution adopted by the invention is as follows:
A method for automatically extracting webpage content, characterized by comprising the following steps:
S1, re-rendering the HTML
First, a DOM tree and a render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and the render tree: an img tag is re-rendered as an arbitrary geometric figure, and each line of text in p, div and a tags is re-rendered as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; transverse segmentation is performed on that node, and the child node whose direction is longitudinal is selected;
second, longitudinal segmentation is performed one or more times on the node whose direction is longitudinal, and the child node with the largest visual block area is selected;
finally, transverse segmentation is performed on the node with the largest visual block area to obtain a plurality of candidate visual blocks;
S3, pre-labeling the candidate visual blocks
A pre-label is assigned to each candidate visual block by a heuristic algorithm and/or a keyword frequency algorithm, and all the pre-labels together form a pre-label set;
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its label; every label is then matched against the pre-label set, and the labels that fall within the pre-label set are retained.
Preferably, the DOM tree and the render tree only contain img, p, div, and a tags.
Preferably, the geometric figure is a set of intersecting longitudinal and transverse line segments.
Preferably, the geometric figure is a circle or an ellipse.
Preferably, the geometric figure is a regular polygon.
Preferably, the node segmentation method is as follows: a frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the horizontal and vertical components are compared to determine the direction of the visual block.
The beneficial effects of the invention are as follows. The method replaces the traditional visual algorithm with a Fast Fourier Transform (FFT) and log-Gabor filtering, thereby reducing time and space complexity and improving the time and space efficiency of the algorithm. In addition, the method uses a probabilistic graphical model to describe the local dependencies among candidate visual blocks, so that it adapts to different sites and page layouts and is reasonably robust to layout changes. Using log-Gabor filtering to determine the direction of page elements, combined with a conditional random field, improves extraction accuracy and offers an alternative approach to automatic webpage content extraction. The simpler the geometric figure, the simpler the calculation and the faster the operation; a set of criss-crossing line segments is therefore the fastest choice of geometric figure.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a first schematic diagram of an embodiment of the present invention.
FIG. 3 is a second schematic diagram of an embodiment of the present invention.
FIG. 4 is a third schematic diagram of an embodiment of the present invention.
FIG. 5 is a fourth schematic diagram of an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, its technical solution is further described below with reference to the embodiments and the accompanying drawings (FIGS. 1 to 5).
As shown in fig. 1, an automatic extraction method of web page content includes:
S1, re-rendering the HTML
First, a DOM tree and a render tree of the HTML document are built, both containing only img, p, div and a tags. Each visual block is then re-rendered according to the DOM tree and the render tree. A visual block is the rectangular area of non-zero size that a page element occupies in the page after being processed by the browser rendering engine; a page element is a section of HTML code enclosed by a pair of HTML tags such as <p> or <div>; and each visual block corresponds to a node in the DOM tree. An img tag is re-rendered as an arbitrary geometric figure (for example a set of criss-crossing line segments, a polygon, a circle, an ellipse or another regular figure, or any irregular figure), and each line of text in p, div and a tags is likewise re-rendered as an arbitrary geometric figure;
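The restriction of the DOM tree to img, p, div and a tags can be illustrated with a minimal sketch. It uses Python's standard `html.parser` module; the class name `PrunedTree` and the dictionary node layout are hypothetical choices made for illustration, the sketch assumes well-formed markup, and it does not build a render tree (which would need the geometry computed by a browser engine).

```python
from html.parser import HTMLParser

KEPT_TAGS = {"img", "p", "div", "a"}  # the only tags retained in the tree


class PrunedTree(HTMLParser):
    """Builds a tree that keeps only img, p, div and a elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag not in KEPT_TAGS:
            return  # skipped tags are transparent: their kept descendants attach to the nearest kept ancestor
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag != "img":  # img is a void element and never has children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost kept element.
        if tag in KEPT_TAGS and tag != "img" and len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()


parser = PrunedTree()
parser.feed('<html><body><div><span><p>Hi</p></span><img src="a.png"><a href="#">x</a></div></body></html>')
div = parser.root["children"][0]
print([child["tag"] for child in div["children"]])
```

Here the span wrapper is dropped, so the p element becomes a direct child of the div in the pruned tree.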
As shown in FIG. 2 (each cross in the figure corresponds to one tag), re-rendering into a set of intersecting vertical and horizontal line segments (such as a cross) is taken as an example below:
An img tag is re-rendered as a set of intersecting vertical and horizontal line segments.
For example, the visual block of an img tag corresponds to a rectangular area in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1(x1, y1), R2(x1, y2), R3(x2, y2) and R4(x2, y1). P(x1, (y1+y2)/2), Q((x1+x2)/2, y2), R(x2, (y1+y2)/2) and S((x1+x2)/2, y1) are the midpoints of segments R1R2, R2R3, R3R4 and R4R1, respectively. The pair of segments PR and QS, which perpendicularly bisect each other (hereinafter a "cross"), can then serve as the re-rendering result of the img tag.
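The midpoint construction above can be written out directly. The following is a minimal sketch; the function name `cross_segments` is hypothetical, and the coordinates follow the corner naming used in the text (x1, y1 for the upper-left corner, x2, y2 for the lower-right corner).

```python
def cross_segments(x1, y1, x2, y2):
    """Return the two segments PR and QS of the cross for a rectangle.

    P, Q, R, S are the midpoints of the left, bottom, right and top edges.
    """
    x_mid = (x1 + x2) / 2
    y_mid = (y1 + y2) / 2
    p = (x1, y_mid)   # midpoint of R1R2 (left edge)
    q = (x_mid, y2)   # midpoint of R2R3 (bottom edge)
    r = (x2, y_mid)   # midpoint of R3R4 (right edge)
    s = (x_mid, y1)   # midpoint of R4R1 (top edge)
    return (p, r), (q, s)


horizontal, vertical = cross_segments(0, 0, 4, 2)
print(horizontal, vertical)
```

Both segments pass through the centre of the rectangle, so they bisect each other at right angles as the text requires.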
For p, div and a tags, each line of text in the tag is re-rendered as a set of criss-crossing line segments.
For example, the visual block of a p tag corresponds to a rectangular area in the page. Starting from the upper-left corner and proceeding counterclockwise, the four corner points of the rectangle are R1(x1, y1), R2(x1, y2), R3(x2, y2) and R4(x2, y1). The width of the rectangle is W pixels, the text contained in the p tag is C bytes long, and the font size is F pixels. The number of text lines N in the p-tag visual block can then be estimated, presumably as N = ⌈C × F / W⌉, where ⌈ ⌉ denotes rounding up. Let P1, P2, …, PN be the points that divide segment R1R2 into N+1 equal parts, let R1, R2, …, RN (here denoting division points rather than the corner points) be the points that divide segment R3R4 into N+1 equal parts, and let Q and S be the midpoints of segments R2R3 and R1R4. The set of segments P1R1, P2R2, …, PNRN and QS can then serve as the re-rendering result of the p tag.
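As a rough illustration of the text-line construction, the sketch below draws one horizontal segment per estimated line plus the vertical segment QS. The line-count formula N = ceil(C × F / W) is an assumption, since the original formula did not survive in the text; the function name `text_line_segments` and its parameter names are likewise hypothetical.

```python
import math


def text_line_segments(x1, y1, x2, y2, n_bytes, font_px):
    """Return N horizontal segments plus the vertical segment QS for a text block."""
    width = x2 - x1
    # Assumed estimate of the number of text lines (not confirmed by the source).
    n_lines = max(1, math.ceil(n_bytes * font_px / width))
    step = (y2 - y1) / (n_lines + 1)  # divide the left and right edges into N+1 equal parts
    segments = [((x1, y1 + i * step), (x2, y1 + i * step)) for i in range(1, n_lines + 1)]
    x_mid = (x1 + x2) / 2
    segments.append(((x_mid, y2), (x_mid, y1)))  # QS joins the midpoints of the bottom and top edges
    return segments


for segment in text_line_segments(0, 0, 100, 60, n_bytes=50, font_px=10):
    print(segment)
```

With a 100-pixel-wide block, 50 bytes of text and a 10-pixel font, the assumed estimate gives five lines, so the result has five horizontal segments and one vertical segment.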
S2, segmenting the DOM tree (transverse, then longitudinal, then transverse segmentation)
As shown in FIG. 3, the DOM tree is first traversed from the root node in breadth-first order until a node with more than one child node is found; that node is segmented transversely (for example into the three longitudinal blocks VB1, VB2 and VB3 in FIG. 3), and the child node whose direction is longitudinal is selected (VB3 in FIG. 3);
as shown in FIG. 4, longitudinal segmentation is then performed one or more times on the node whose direction is longitudinal (for example into the three transverse blocks VB1, VB2 and VB3 in FIG. 4), and the child node with the largest visual block area is selected (VB2 in FIG. 4);
when the DOM tree contains nested nodes, longitudinal segmentation has to be repeated several times to obtain a clean result;
as shown in FIG. 5, transverse segmentation is finally performed on the node with the largest visual block area to obtain a plurality of candidate visual blocks (the vertically stacked boxes in FIG. 5, which from top to bottom represent the journal, DOI, title, authors, publication date, abstract and keywords);
the node segmentation method (for both transverse and longitudinal segmentation) is as follows: a frequency-domain representation of the visual block is obtained by Fast Fourier Transform (FFT); a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the two components are compared to determine the direction of the visual block (if the horizontal component is smaller than the vertical component, the direction of the visual block is transverse; if the vertical component is smaller than the horizontal component, the direction is longitudinal);
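The direction test can be sketched in pure Python under several assumptions. The sketch uses a recursive radix-2 FFT (so image sides must be powers of two) and a textbook log-Gabor transfer function; the parameters `f0`, `sigma_ratio` and `sigma_theta` are illustrative values, and treating the "horizontal component" as the energy passed by the filter oriented along the horizontal frequency axis is an interpretation, not something the text states.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fft2(img):
    """2-D FFT: transform every row, then every column; result is indexed [v][u]."""
    rows = [fft(list(r)) for r in img]
    cols = [fft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def log_gabor_energy(spectrum, theta0, f0=0.25, sigma_ratio=0.55, sigma_theta=math.pi / 8):
    """Spectral energy passed by a log-Gabor filter centred on frequency f0 and angle theta0."""
    n_rows, n_cols = len(spectrum), len(spectrum[0])
    total = 0.0
    for v in range(n_rows):
        fv = (v if v < n_rows / 2 else v - n_rows) / n_rows      # signed vertical frequency
        for u in range(n_cols):
            fu = (u if u < n_cols / 2 else u - n_cols) / n_cols  # signed horizontal frequency
            radius = math.hypot(fu, fv)
            if radius == 0.0:
                continue  # the log-Gabor filter has no DC response
            radial = math.exp(-(math.log(radius / f0) ** 2) / (2 * math.log(sigma_ratio) ** 2))
            d = abs(math.atan2(fv, fu) - theta0) % math.pi
            d = min(d, math.pi - d)  # angular distance modulo pi (real images have symmetric spectra)
            angular = math.exp(-(d ** 2) / (2 * sigma_theta ** 2))
            total += (abs(spectrum[v][u]) * radial * angular) ** 2
    return total


def block_direction(img):
    """Apply the comparison rule from the text to the two filter energies."""
    spectrum = fft2(img)
    horizontal = log_gabor_energy(spectrum, 0.0)          # assumed "horizontal component"
    vertical = log_gabor_energy(spectrum, math.pi / 2)    # assumed "vertical component"
    return "transverse" if horizontal < vertical else "longitudinal"


stripes = [[1.0 if y % 4 < 2 else 0.0 for _ in range(16)] for y in range(16)]
print(block_direction(stripes))
```

The example image consists of horizontal stripes, similar to text lines re-rendered as horizontal segments, so almost all of its energy lies on the vertical frequency axis and the rule classifies the block as transverse.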
(1) Traversing the DOM tree from top to bottom
The DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found. For transverse segmentation, the N (N > 1) child nodes of that node are processed as described in S2, and the node whose arrangement direction is longitudinal is selected. For longitudinal segmentation, the child node with the largest visual block area is selected.
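The traversal and selection rules can be sketched as follows. The dictionary-based node layout with `children`, `direction` and `area` keys is a hypothetical representation chosen for illustration, and the function names are not taken from the source; the `direction` value is assumed to have been computed beforehand by the segmentation method.

```python
from collections import deque


def first_branching_node(root):
    """Breadth-first search for the first node that has more than one child."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node["children"]) > 1:
            return node
        queue.extend(node["children"])
    return None  # no node in the tree has more than one child


def select_child(children, segmentation):
    """Pick the child to descend into after a transverse or longitudinal segmentation."""
    if segmentation == "transverse":
        # After a transverse segmentation, continue with the child whose direction is longitudinal.
        return next((c for c in children if c["direction"] == "longitudinal"), None)
    # After a longitudinal segmentation, continue with the child of largest visual block area.
    return max(children, key=lambda c: c["area"])


leaf = lambda direction, area: {"direction": direction, "area": area, "children": []}
body = {"children": [leaf("transverse", 10), leaf("longitudinal", 50), leaf("transverse", 80)]}
tree = {"children": [body]}
node = first_branching_node(tree)
print(select_child(node["children"], "transverse")["area"])
print(select_child(node["children"], "longitudinal")["area"])
```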
S3, pre-labeling the candidate visual blocks
A pre-label is assigned to each candidate visual block by a heuristic algorithm and/or a keyword frequency algorithm, and all the pre-labels together form a pre-label set;
for the heuristic algorithm itself, see "Extracting multiple news based on visual effects".
The keyword frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, word frequencies are counted over the text segments of a set of collected data blocks, and the words that occur more than N times are selected as keywords; the number of occurrences of each keyword is recorded as its reference keyword frequency. Next, word frequencies are counted over the text segment of a candidate visual block, and the words that appear in it are intersected with the keywords. For each word in the intersection, its frequency in the candidate text segment is multiplied by its reference keyword frequency, and the products are summed to give the keyword score of the candidate visual block. If the score is greater than a threshold s, the corresponding label is assigned (for example title, author or abstract on a journal abstract page, or news title, news author or publication time on a news page).
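The scoring steps of the keyword frequency algorithm can be sketched with `collections.Counter`. Splitting text on whitespace, lower-casing, and using raw occurrence counts as "frequency" are simplifying assumptions, and the function names and the parameter `min_count` (standing in for N) are hypothetical.

```python
from collections import Counter


def reference_keywords(texts, min_count):
    """Count words over collected text segments and keep those occurring more than min_count times."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: count for word, count in counts.items() if count > min_count}


def keyword_score(text, reference):
    """Sum, over words shared with the keyword set, of candidate count times reference count."""
    counts = Counter(text.lower().split())
    shared = counts.keys() & reference.keys()
    return sum(counts[word] * reference[word] for word in shared)


collected = ["abstract of the paper", "abstract and keywords", "the abstract"]
reference = reference_keywords(collected, min_count=1)
score = keyword_score("Abstract the abstract method", reference)
print(reference, score)
```

In this toy example the keywords are "abstract" (3 occurrences) and "the" (2 occurrences), so the candidate text scores 2 × 3 + 1 × 2 = 8; a label would be assigned if that value exceeded the chosen threshold s.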
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its label; every label is then matched against the pre-label set, and the labels that fall within the pre-label set are retained.
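The screening step can be illustrated with a short sketch. It assumes that each candidate visual block carries its own set of pre-labels and one model-assigned label, and it marks a rejected label with `None`; both choices, as well as the function name `screen_labels`, are assumptions made for illustration.

```python
def screen_labels(assigned_labels, pre_label_sets):
    """Keep a block's assigned label only if it also appears among that block's pre-labels."""
    return [
        label if label in pre_labels else None
        for label, pre_labels in zip(assigned_labels, pre_label_sets)
    ]


assigned = ["title", "author", "abstract"]
pre_sets = [{"title"}, {"title", "abstract"}, {"abstract", "keywords"}]
print(screen_labels(assigned, pre_sets))
```

Here the second block's label is rejected because the model's choice does not appear among the pre-labels produced in S3.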
The probabilistic graphical models include conditional random fields (CRF), Markov logic networks (MLN) and the like; see "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". For the features selected when building the probabilistic graphical model, see "Template-Independent News Extraction Based on Visual Consistency".
The candidate visual blocks are labeled by an undirected probabilistic graphical model to obtain the key information of the page (as shown in FIG. 5). The key information is the part of the page that readers care about most, such as the title, authors and abstract on a journal abstract page, or the news title, news author and publication time on a news page.
Take the CRF as an example. First, 200 pages are collected and manually labeled with eight labels: journal, DOI, title, author, publication date, abstract, keywords and invalid. The CRF model is trained with a quasi-Newton method. Then a feature vector is computed for each candidate visual block. If only four features are considered, namely the width-to-height ratio, the ratio of character count to area, the abscissa x of the upper-left corner and the ordinate y of the upper-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point each candidate visual block has two kinds of labels: a set of pre-labels and an assigned label.
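The feature vector and the decoding step can be sketched as follows. This is not a CRF implementation: training is omitted, and the `emission` and `transition` score tables passed to `viterbi` are hypothetical stand-ins for the scores a trained linear-chain model would produce from the feature vectors.

```python
def feature_vector(x1, y1, x2, y2, n_chars):
    """Return (ratio, density, x, y) for a visual block with upper-left corner (x1, y1)."""
    width, height = x2 - x1, y2 - y1
    return (width / height, n_chars / (width * height), x1, y1)


def viterbi(emission, transition):
    """Best label sequence for a linear chain.

    emission[t][k] is the score of label k at block t;
    transition[j][k] is the score of moving from label j to label k.
    """
    n_labels = len(transition)
    score = list(emission[0])
    backpointers = []
    for t in range(1, len(emission)):
        previous, score, pointers = score, [], []
        for k in range(n_labels):
            best_j = max(range(n_labels), key=lambda j: previous[j] + transition[j][k])
            score.append(previous[best_j] + transition[best_j][k] + emission[t][k])
            pointers.append(best_j)
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(feature_vector(0, 0, 200, 50, n_chars=100))
emission = [[2, 0], [0, 1], [0, 3]]    # hypothetical per-block scores for two labels
transition = [[0, -5], [-5, 0]]        # hypothetical penalty for switching labels
print(viterbi(emission, transition))
```

With a strong penalty for switching labels, the decoder prefers a single label for all three blocks even though the first block on its own favours the other label, which illustrates how the chain captures local dependencies between neighbouring blocks.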
The above description is only an application example of the present invention and does not limit its scope; equivalent changes made in accordance with the claims of the present invention therefore remain within its scope of protection.
Claims (5)
1. A method for automatically extracting webpage content, characterized by comprising the following steps:
S1, re-rendering the HTML
First, a DOM tree and a render tree of the HTML document are built. Each visual block is then re-rendered according to the DOM tree and the render tree: an img tag is re-rendered as an arbitrary geometric figure, and each line of text in p, div and a tags is re-rendered as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, the DOM tree is traversed from the root node in breadth-first order until a node with more than one child node is found; transverse segmentation is performed on that node, and the child node whose direction is longitudinal is selected;
second, longitudinal segmentation is performed one or more times on the node whose direction is longitudinal, and the child node with the largest visual block area is selected;
finally, transverse segmentation is performed on the node with the largest visual block area to obtain a plurality of candidate visual blocks;
the node segmentation method is as follows: a frequency-domain representation of the visual block is first obtained by fast Fourier transform; a set of orthogonal log-Gabor filters is then used to separate the horizontal and vertical components of that representation; finally, the horizontal and vertical components are compared to determine the direction of the visual block;
S3, pre-labeling the candidate visual blocks
A pre-label is assigned to each candidate visual block by a heuristic algorithm and/or a keyword frequency algorithm, and all the pre-labels together form a pre-label set;
S4, labeling the candidate visual blocks
Each candidate visual block is labeled by a probabilistic graphical model to obtain its label; every label is then matched against the pre-label set, and the labels that fall within the pre-label set are retained.
2. The method for automatically extracting webpage content according to claim 1, wherein the DOM tree and the render tree contain only img, p, div and a tags.
3. The method for automatically extracting webpage content according to claim 1, wherein the geometric figure is a set of intersecting vertical and horizontal line segments.
4. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a circle or an ellipse.
5. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a regular polygon.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811067868.8A CN109325204B (en) | 2018-09-13 | 2018-09-13 | Automatic extraction method of webpage content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811067868.8A CN109325204B (en) | 2018-09-13 | 2018-09-13 | Automatic extraction method of webpage content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109325204A (en) | 2019-02-12 |
CN109325204B (en) | 2022-01-07 |
Family
ID=65266010
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811067868.8A (Active; CN109325204B) | 2018-09-13 | 2018-09-13 | Automatic extraction method of webpage content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325204B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11579849B2 (en) | 2019-11-22 | 2023-02-14 | Tenweb, Inc. | Generating higher-level semantics data for development of visual content |
CN110968761B (en) * | 2019-11-29 | 2022-07-08 | 福州大学 | Webpage structured data self-adaptive extraction method |
CN112347332A (en) * | 2020-11-17 | 2021-02-09 | 南开大学 | XPath-based crawler target positioning method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156737A (en) * | 2011-04-12 | 2011-08-17 | 华中师范大学 | Method for extracting subject content of Chinese webpage |
CN102253979A (en) * | 2011-06-23 | 2011-11-23 | 天津海量信息技术有限公司 | Vision-based web page extracting method |
CN106557565A (en) * | 2016-11-22 | 2017-04-05 | 福州大学 | A kind of text message extracting method based on website construction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120246137A1 (en) * | 2011-03-22 | 2012-09-27 | Satish Sallakonda | Visual profiles |
Non-Patent Citations (1)
Title |
---|
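A Web page mining method based on node density segmentation and label propagation; Zhang Naizhou et al.; Chinese Journal of Computers; 2015-02-27; Vol. 38, No. 2; pp. 349-364 * |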
Also Published As
Publication number | Publication date |
---|---|
CN109325204A (en) | 2019-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
Sun et al. | Dom based content extraction via text density | |
Weninger et al. | CETR: content extraction via tag ratios | |
US8255793B2 (en) | Automatic visual segmentation of webpages | |
CN109325204B (en) | Automatic extraction method of webpage content | |
US20130031461A1 (en) | Detecting repeat patterns on a web page | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
US8205153B2 (en) | Information extraction combining spatial and textual layout cues | |
CN111143547B (en) | Big data display method based on knowledge graph | |
Chen et al. | Information extraction from resume documents in pdf format | |
Shi et al. | AutoRM: An effective approach for automatic Web data record mining | |
CN112084451B (en) | Webpage LOGO extraction system and method based on visual blocking | |
CN104217038A (en) | Knowledge network building method for financial news | |
CN108959204B (en) | Internet financial project information extraction method and system | |
CN102915361A (en) | Webpage text extracting method based on character distribution characteristic | |
CN107463571A (en) | Web color method | |
CN106372232B (en) | Information mining method and device based on artificial intelligence | |
CN105528421A (en) | Search dimension excavation method of query terms in mass data | |
Nguyen et al. | Web document analysis based on visual segmentation and page rendering | |
CN105550279A (en) | Vision-based list page identification method | |
Eldirdiery et al. | Detecting and removing noisy data on web document using text density approach | |
CN106547851B (en) | Webpage content extraction method based on fuzzy sequence mode mining | |
Pu et al. | A vision-based approach for deep web form extraction | |
CN112347353A (en) | Webpage denoising method | |
Liu et al. | Structured data extraction: wrapper generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||