CN109325204B - Automatic extraction method of webpage content - Google Patents

Automatic extraction method of webpage content

Info

Publication number
CN109325204B
Authority
CN
China
Prior art keywords
node
visual block
rendering
visual
label
Prior art date
Legal status
Active
Application number
CN201811067868.8A
Other languages
Chinese (zh)
Other versions
CN109325204A (en)
Inventor
王世阳 (Wang Shiyang)
李阳 (Li Yang)
Current Assignee
Wuhan Biorun Biotechnology LLC
Original Assignee
Wuhan Biorun Biotechnology LLC
Priority date
Filing date
Publication date
Application filed by Wuhan Biorun Biotechnology LLC filed Critical Wuhan Biorun Biotechnology LLC
Priority to CN201811067868.8A
Publication of CN109325204A
Application granted
Publication of CN109325204B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of webpage content extraction and relates to a method for automatically extracting webpage content, particularly suited to extracting the abstract pages of journal literature. The method comprises the following steps: S1, re-rendering the HTML; S2, segmenting the DOM tree; S3, pre-labeling the candidate visual blocks; and S4, labeling the candidate visual blocks. By replacing the traditional visual algorithm with the Fast Fourier Transform (FFT) and log-Gabor filtering, the method reduces time and space complexity and improves the time and space efficiency of the algorithm.

Description

Automatic extraction method of webpage content
Technical Field
The invention belongs to the technical field of webpage content extraction, and in particular relates to an automatic webpage content extraction method that is especially suitable for extracting the abstract pages of journal literature.
Background
With the development of information technology, the internet plays an increasingly important role in information acquisition and is also an effective channel for researchers to obtain newly published literature. Academic journal publishers (Elsevier, Wiley, Taylor & Francis, etc.) provide journal-literature abstract pages on their main sites. Extracting information such as authors, publication dates and abstracts from these abstract pages is a key step in building an integrated database, and also a difficult problem.
Webpage content extraction is a hot topic in the field of Information Extraction. Existing methods can be broadly divided into three categories. The first is template-based methods, which extract content according to XPath and CSS expressions of webpage elements; they are highly accurate, but creating templates consumes a great deal of manpower, large template collections are hard to maintain, and they are not robust to changes in webpage structure. The second is DOM-tree-based methods, which parse the webpage into a DOM tree and, through supervised or semi-supervised learning, perform full or partial tree alignment between the target page and labeled pages, label the target page, and then extract its content; these methods are inefficient (the time complexity of the shingling algorithm is proportional to the depth of the tree) and require several pages generated from the same template as input. The third is methods based on visual information, such as the VIPS page-segmentation algorithm proposed by Microsoft Research Asia, which divides a page into visual blocks according to cues such as background color, character density and font, learns importance scores for the blocks with a Support Vector Machine (SVM) or a neural network, and then extracts the text content of the page; these methods have high time and space complexity, depend on hand-crafted rules, and are not robust to new webpage templates.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide an automatic webpage content extraction method that replaces the conventional visual algorithm with the Fast Fourier Transform (FFT) and log-Gabor filtering, thereby reducing time and space complexity and improving the time and space efficiency of the algorithm.
To achieve this purpose, the invention adopts the following technical scheme:
a method for automatically extracting webpage content is characterized by comprising the following steps:
S1, re-rendering the HTML
First, build the DOM tree and the render tree of the HTML document, re-render each visual block according to the DOM tree and the render tree, re-render each img tag as an arbitrary geometric figure, and re-render each text line of the p, div and a tags as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, traverse the DOM tree from the root node in breadth-first order until a node with more than one child node is found; horizontally segment this node, and then, among its child nodes, select the node whose direction is vertical;
Next, vertically segment the vertically-arranged node one or more times, and then, among its child nodes, select the node with the largest visual-block area;
Finally, horizontally segment the node with the largest visual-block area to obtain a plurality of candidate visual blocks;
S3, pre-labeling the candidate visual blocks
Assign a pre-label to each candidate visual block using a heuristic algorithm and/or a keyword-frequency algorithm; all pre-labels together form the pre-label set;
S4, labeling the candidate visual blocks
Label each candidate visual block with a probabilistic graphical model to obtain a corresponding label; then match all labels against the pre-label set one by one and keep the labels that fall within the pre-label set.
Preferably, the DOM tree and the render tree only contain img, p, div, and a tags.
Preferably, the geometric figure is a set of intersecting horizontal and vertical line segments.
Preferably, the geometric figure is a circle or an ellipse.
Preferably, the geometric figure is a regular polygon.
Preferably, the node segmentation method is as follows: first obtain the frequency-domain representation of the visual block via the fast Fourier transform, then separate the horizontal and vertical components of that representation with a set of orthogonal log-Gabor filters, and finally determine the direction of the visual block by comparing its horizontal and vertical components.
The beneficial effects of the invention are as follows. The method replaces the traditional visual algorithm with the Fast Fourier Transform (FFT) and log-Gabor filtering, which reduces time and space complexity and improves the time and space efficiency of the algorithm. In addition, the method uses a probabilistic graphical model to describe the local dependencies among candidate visual blocks, so that it adapts to different sites and page-layout changes and is robust to such changes. Log-Gabor filtering is used to judge the directionality of page elements, and combining it with a conditional random field improves extraction accuracy, providing another route to automatic webpage content extraction. The geometric figure may be a set of intersecting horizontal and vertical line segments; the simpler the geometric figure, the simpler the computation and the faster the processing, and this set of intersecting segments is correspondingly fast to process.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a first schematic diagram of an embodiment of the present invention.
FIG. 3 is a second schematic diagram of an embodiment of the present invention.
FIG. 4 is a third schematic diagram of an embodiment of the present invention.
FIG. 5 is a fourth schematic diagram of an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, its technical solutions are further described below with reference to the following examples and the accompanying drawings (Figs. 1 to 5).
As shown in fig. 1, an automatic extraction method of web page content includes:
s1, re-rendering the HTML
First, build the DOM tree and the render tree of the HTML document; the DOM tree and the render tree contain only the img, p, div and a tags. Then, according to the DOM tree and the render tree, re-render each visual block. (A page element is a piece of HTML code enclosed by a pair of HTML tags, such as <p> or <div>; after being processed by the browser rendering engine, a page element appears as a rectangular region of non-zero area in the page, i.e. a visual block, and visual blocks correspond to nodes in the DOM tree.) Each img tag is re-rendered as an arbitrary geometric figure (for example a set of intersecting horizontal and vertical line segments, a regular figure such as a polygon, circle or ellipse, or any irregular figure), and each text line of the p, div and a tags is re-rendered as an arbitrary geometric figure;
As shown in Fig. 2 (each cross in the figure corresponds to one tag), the following takes re-rendering into a set of intersecting horizontal and vertical line segments (i.e. crosses) as an example:
The img tag is re-rendered as a set of intersecting horizontal and vertical line segments.
for example, the visual block of the img tag corresponds to a rectangular area in the page. Coordinates of four corner points of the rectangular area are respectively arranged from the upper left corner point in a counterclockwise direction as R1(x1, y1), R2(x1, y2), R3(x2, y2) and R4(x2, y 1). P (x1, (y1+ y2)/2), Q ((x1+ x2)/2, y2), R (x2, (y1+ y2)/2), S ((x1+ x2)/2, y1) are the midpoints of segments R1R2, R2R3, R3R4, and R4R1, respectively. Then a set of segments PR, QS (hereinafter referred to as "crosses") that bisect each other perpendicularly may be re-rendered as a result of the img tag.
For the p, div and a tags, each text line of the tag is re-rendered as a set of intersecting line segments.
for example, a visual block of a p-tag corresponds to a rectangular area in a page. The coordinates of four corner points of the rectangle are respectively arranged from the upper left corner point in the counterclockwise direction as R1(x1, y1), R2(x1, y2), R3(x2, y2) and R4(x2, y 1). The width (width) of the rectangle is W pixels. The text contained in the p-tag is C bytes in length and the font size (font size) is F pixels. Then, the number of lines N of the text in the p-tag visual block can be obtained by estimation
Figure BDA0001798760420000031
Line (A)
Figure BDA0001798760420000032
Is a rounded up symbol). Taking P1 and P2 … Pn as the N +1 equally dividing points of the segment R1R 2; r1 and R2 … Rn are the N +1 equally dividing points of the line segment R3R 4; q, S are the midpoints of the segments R2R3 and R1R 4. Then, the segment groups P1R1, P2R2, …, PnRn, QS may be the result of the P-tag re-rendering.
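To make the re-rendering step concrete, the following Python sketch implements the two geometric substitutions described above. It is a minimal sketch under stated assumptions: the function names and the line-count estimate N = ⌈C·F/W⌉ are illustrative, not taken verbatim from the patented implementation.

from math import ceil

def rerender_img(x1, y1, x2, y2):
    # Return the two segments PR and QS that perpendicularly bisect the
    # img rectangle, i.e. the "cross" used as the re-rendering result.
    pr = ((x1, (y1 + y2) / 2), (x2, (y1 + y2) / 2))    # horizontal midline PR
    qs = (((x1 + x2) / 2, y1), ((x1 + x2) / 2, y2))    # vertical midline QS
    return [pr, qs]

def rerender_text(x1, y1, x2, y2, char_len_c, font_px_f):
    # Re-render a p/div/a block: one horizontal segment per estimated text
    # line, plus the vertical midline QS.
    width_w = abs(x2 - x1)
    n_lines = max(1, ceil(char_len_c * font_px_f / width_w))  # assumed estimate N = ceil(C*F/W)
    segments = []
    for i in range(1, n_lines + 1):
        y = y1 + (y2 - y1) * i / (n_lines + 1)         # i-th of the N+1 equal divisions
        segments.append(((x1, y), (x2, y)))            # segment PiRi'
    segments.append((((x1 + x2) / 2, y1), ((x1 + x2) / 2, y2)))  # segment QS
    return segments

# Example: a 600 x 200 px paragraph block holding 300 characters at 16 px font.
print(rerender_text(0, 0, 600, 200, 300, 16))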
S2, segmenting the DOM tree (horizontal-vertical-horizontal segmentation)
As shown in Fig. 3, first traverse the DOM tree from the root node in breadth-first order until a node with more than one child node is found; segment this node horizontally (e.g. into the three vertical blocks VB1, VB2 and VB3 in Fig. 3), and then, among its child nodes, select the node whose direction is vertical (VB3 in Fig. 3);
As shown in Fig. 4, next segment the vertically-arranged node one or more times vertically (e.g. into the three horizontal blocks VB1, VB2 and VB3 in Fig. 4), and then, among its child nodes, select the node with the largest visual-block area (VB2 in Fig. 4);
When the DOM tree contains nested nodes, several vertical decompositions are needed to obtain a clean result;
As shown in Fig. 5, finally segment the node with the largest visual-block area horizontally (yielding the vertically stacked boxes in Fig. 5, which from top to bottom represent journal, DOI, title, authors, publication time, abstract and keywords) to obtain a plurality of candidate visual blocks;
the node segmentation (including transverse segmentation and longitudinal segmentation) method comprises the following steps: obtaining frequency domain representation of the visual block through Fast Fourier Transform (FFT), separating horizontal and vertical components of the frequency domain representation of the visual block by adopting a group of orthogonal logarithmic cover-Bob filtering, and determining the direction of the visual block by comparing the horizontal and vertical components of the visual block (if the horizontal component is smaller than the vertical component, the direction of the visual block is transverse, and if the vertical component is smaller than the horizontal component, the direction of the visual block is longitudinal);
(1) Traversing the DOM tree from top to bottom
Traverse the DOM tree from the root node in breadth-first order until a node with more than one child node is found. For a horizontal decomposition, process the N (N > 1) child nodes of this node as described in S2 and select the child whose arrangement direction is vertical; for a vertical decomposition, select the child with the largest visual-block area.
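The traversal-and-selection logic of this step can be sketched in Python as follows. The DomNode class and the function names are hypothetical stand-ins for the real DOM/render-tree structures, and direction_fn is assumed to be a routine such as the FFT/log-Gabor test sketched above.

from collections import deque

class DomNode:
    # Minimal stand-in for a DOM/render-tree node.
    def __init__(self, tag, bbox, children=None):
        self.tag = tag
        self.bbox = bbox                               # (x1, y1, x2, y2) of the visual block
        self.children = children or []

    @property
    def area(self):
        x1, y1, x2, y2 = self.bbox
        return abs(x2 - x1) * abs(y2 - y1)

def first_branching_node(root):
    # Breadth-first traversal until a node with more than one child is found.
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) > 1:
            return node
        queue.extend(node.children)
    return None

def select_child(node, split, direction_fn):
    # After a horizontal split keep the child whose arrangement direction is
    # vertical; after a vertical split keep the child with the largest area.
    if split == "horizontal":
        return next((c for c in node.children if direction_fn(c) == "vertical"), None)
    return max(node.children, key=lambda c: c.area)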
S3, pre-labeling candidate visual blocks
Assign a pre-label to each candidate visual block using a heuristic algorithm and/or a keyword-frequency algorithm; all pre-labels together form the pre-label set;
the heuristic algorithm (itself) may refer to Extracting multiple news based on visual effects.
The keyword-frequency algorithm is similar to the TF-IDF algorithm widely used in search engines. First, perform word-frequency statistics on the text fragments of a collected set of data blocks and select the words occurring more than N times as keywords, recording their frequencies as the reference keyword frequencies. Then perform word-frequency statistics on the text fragment of each candidate visual block, intersect the words appearing in it with the keywords, multiply the frequency of each shared word in the candidate text by its reference keyword frequency, and sum the products to obtain the keyword score of the candidate visual block. If the score is greater than s, the corresponding pre-label is assigned (e.g. title, author, abstract, etc. on a journal abstract page; news title, news author, publication time, etc. on a news page).
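A hedged Python sketch of this keyword-frequency scoring follows. The tokenisation, the parameter names min_count (the N above) and threshold (the s above), and the function names are illustrative assumptions rather than details fixed by the patent.

import re
from collections import Counter

def build_reference_keywords(corpus_texts, min_count=5):
    # Words occurring more than min_count times in the collected data blocks
    # become keywords; their counts are the reference keyword frequencies.
    counts = Counter(w for text in corpus_texts
                     for w in re.findall(r"[A-Za-z]+", text.lower()))
    return {w: c for w, c in counts.items() if c > min_count}

def keyword_score(candidate_text, reference_keywords):
    # Sum over the shared words of (frequency in the candidate block) times
    # (reference keyword frequency).
    cand = Counter(re.findall(r"[A-Za-z]+", candidate_text.lower()))
    shared = cand.keys() & reference_keywords.keys()
    return sum(cand[w] * reference_keywords[w] for w in shared)

def assign_pre_label(candidate_text, reference_keywords, threshold, label):
    # Assign the pre-label only when the score exceeds the threshold s.
    score = keyword_score(candidate_text, reference_keywords)
    return label if score > threshold else None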
S4, labeling the candidate visual block
Label each candidate visual block with a probabilistic graphical model to obtain a corresponding label; then match all labels against the pre-label set one by one and keep the labels that fall within the pre-label set.
Probabilistic graphical models that can be used include CRF, MLN, etc.; see "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". For the features selected when building the probabilistic graphical model, see "Template-Independent News Extraction Based on Visual Consistency".
Labeling the candidate visual blocks with a probabilistic undirected graphical model yields the key information of the page (as shown in Fig. 5). The key information is the part of the page that readers care about most, such as the title, authors and abstract on a journal abstract page, or the news headline, news author and publication time on a news page.
Take CRF as an example. First, 200 pages are collected and manually annotated with eight labels: journal, DOI, title, author, publication time, abstract, keywords and invalid. The CRF model is trained with a quasi-Newton method. Then the feature vector of each candidate visual block is computed. If only four features are considered, namely the width-to-height ratio, the character-count-to-area ratio, the abscissa x of the upper-left corner and the ordinate y of the upper-left corner, the feature vector of a candidate visual block is (ratio, density, x, y). The computed feature vectors are fed into the CRF model in the order in which the candidate visual blocks appear, and prediction (inference) is performed with the Viterbi algorithm. At this point every candidate visual block has two kinds of labels: a pre-label and a CRF label.
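As one possible concrete realisation of this step (the patent does not name a library), the third-party sklearn-crfsuite package can train a linear-chain CRF with L-BFGS, a quasi-Newton method, and decode with Viterbi. The feature names below mirror the four-feature example above, and the block data layout is an assumption.

import sklearn_crfsuite

def block_features(block):
    # Feature dict for one candidate visual block: aspect ratio, character
    # density, and the upper-left corner coordinates (ratio, density, x, y).
    x1, y1, x2, y2 = block["bbox"]
    w, h = x2 - x1, y2 - y1
    return {
        "ratio": w / h if h else 0.0,
        "density": block["char_count"] / (w * h) if w * h else 0.0,
        "x": float(x1),
        "y": float(y1),
    }

def page_to_sequence(blocks):
    # One page is one sequence: its candidate visual blocks in reading order.
    return [block_features(b) for b in blocks]

# X_train: a list of such sequences built from ~200 hand-labelled pages;
# y_train: the parallel label sequences over {journal, DOI, title, author,
# publication time, abstract, keywords, invalid}.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=200)
# crf.fit(X_train, y_train)                                     # quasi-Newton training
# labels = crf.predict([page_to_sequence(candidate_blocks)])    # Viterbi decoding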
The above is only one application example of the present invention, which is of course not limited to this example; equivalent changes made within the scope of the claims of the present invention remain within its protection scope.

Claims (5)

1. A method for automatically extracting webpage content is characterized by comprising the following steps:
S1, re-rendering the HTML
First, build the DOM tree and the render tree of the HTML document, re-render each visual block according to the DOM tree and the render tree, re-render each img tag as an arbitrary geometric figure, and re-render each text line of the p, div and a tags as an arbitrary geometric figure;
S2, segmenting the DOM tree
First, traverse the DOM tree from the root node in breadth-first order until a node with more than one child node is found; horizontally segment this node, and then, among its child nodes, select the node whose direction is vertical;
Next, vertically segment the vertically-arranged node one or more times, and then, among its child nodes, select the node with the largest visual-block area;
Finally, horizontally segment the node with the largest visual-block area to obtain a plurality of candidate visual blocks;
The node segmentation method is as follows: first obtain the frequency-domain representation of the visual block via the fast Fourier transform, then separate the horizontal and vertical components of that representation with a set of orthogonal log-Gabor filters, and finally determine the direction of the visual block by comparing its horizontal and vertical components;
S3, pre-labeling the candidate visual blocks
Assign a pre-label to each candidate visual block using a heuristic algorithm and/or a keyword-frequency algorithm; all pre-labels together form the pre-label set;
S4, labeling the candidate visual blocks
Label each candidate visual block with a probabilistic graphical model to obtain a corresponding label; then match all labels against the pre-label set one by one and keep the labels that fall within the pre-label set.
2. The method for automatically extracting webpage content according to claim 1, wherein the DOM tree and the render tree only contain img, p, div, a tags.
3. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a set of vertical and horizontal intersecting line segments.
4. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a circle or an ellipse.
5. The method for automatically extracting web page content according to claim 1, wherein the geometric figure is a regular polygon.
CN201811067868.8A 2018-09-13 2018-09-13 Automatic extraction method of webpage content Active CN109325204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811067868.8A CN109325204B (en) 2018-09-13 2018-09-13 Automatic extraction method of webpage content

Publications (2)

Publication Number Publication Date
CN109325204A CN109325204A (en) 2019-02-12
CN109325204B true CN109325204B (en) 2022-01-07

Family

ID=65266010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811067868.8A Active CN109325204B (en) 2018-09-13 2018-09-13 Automatic extraction method of webpage content

Country Status (1)

Country Link
CN (1) CN109325204B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11579849B2 (en) 2019-11-22 2023-02-14 Tenweb, Inc. Generating higher-level semantics data for development of visual content
CN110968761B (en) * 2019-11-29 2022-07-08 福州大学 Webpage structured data self-adaptive extraction method
CN112347332A (en) * 2020-11-17 2021-02-09 南开大学 XPath-based crawler target positioning method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156737A (en) * 2011-04-12 2011-08-17 华中师范大学 Method for extracting subject content of Chinese webpage
CN102253979A (en) * 2011-06-23 2011-11-23 天津海量信息技术有限公司 Vision-based web page extracting method
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 A kind of text message extracting method based on website construction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120246137A1 (en) * 2011-03-22 2012-09-27 Satish Sallakonda Visual profiles

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156737A (en) * 2011-04-12 2011-08-17 华中师范大学 Method for extracting subject content of Chinese webpage
CN102253979A (en) * 2011-06-23 2011-11-23 天津海量信息技术有限公司 Vision-based web page extracting method
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 A kind of text message extracting method based on website construction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Web page mining method based on node density segmentation and label propagation; Zhang Naizhou et al.; Chinese Journal of Computers; 2015-02-27; Vol. 38, No. 2; pp. 349-364 *

Also Published As

Publication number Publication date
CN109325204A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
Sun et al. Dom based content extraction via text density
Weninger et al. CETR: content extraction via tag ratios
US8255793B2 (en) Automatic visual segmentation of webpages
CN109325204B (en) Automatic extraction method of webpage content
US20130031461A1 (en) Detecting repeat patterns on a web page
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
US8205153B2 (en) Information extraction combining spatial and textual layout cues
CN111143547B (en) Big data display method based on knowledge graph
Chen et al. Information extraction from resume documents in pdf format
Shi et al. AutoRM: An effective approach for automatic Web data record mining
CN112084451B (en) Webpage LOGO extraction system and method based on visual blocking
CN104217038A (en) Knowledge network building method for financial news
CN108959204B (en) Internet financial project information extraction method and system
CN102915361A (en) Webpage text extracting method based on character distribution characteristic
CN107463571A (en) Web color method
CN106372232B (en) Information mining method and device based on artificial intelligence
CN105528421A (en) Search dimension excavation method of query terms in mass data
Nguyen et al. Web document analysis based on visual segmentation and page rendering
CN105550279A (en) Vision-based list page identification method
Eldirdiery et al. Detecting and removing noisy data on web document using text density approach
CN106547851B (en) Webpage content extraction method based on fuzzy sequence mode mining
Pu et al. A vision-based approach for deep web form extraction
CN112347353A (en) Webpage denoising method
Liu et al. Structured data extraction: wrapper generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant