CN111428444B

CN111428444B - Automatic extraction method for webpage information

Info

Publication number: CN111428444B
Application number: CN202010228475.1A
Authority: CN
Inventors: 吕聚旺
Original assignee: Xinhua Zhiyun Technology Co ltd
Current assignee: Xinhua Zhiyun Technology Co ltd
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2023-10-20
Anticipated expiration: 2040-03-27
Also published as: CN111428444A

Abstract

The invention discloses an automatic webpage information extraction method comprising the following steps: preprocessing the webpage information; constructing a block DOM tree; locating the text region; and extracting the webpage text. Constructing the block DOM tree comprises: performing fault-tolerance compensation and DOM parsing on the webpage source code; constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements; counting the basic theme elements of each DOM block in combination with their display characteristics; and performing a weighted calculation on the basic theme elements of each DOM block. The text region is then located according to the theme weights obtained from the weighted calculation. The method balances the efficiency and accuracy of webpage information extraction: it takes the layout characteristics and some of the HTML visual characteristics of the webpage into account without significantly reducing the efficiency of traditional webpage extraction methods, and thereby effectively improves extraction accuracy.

Description

Automatic extraction method for webpage information

Technical Field

The invention relates to an automatic webpage information extraction method.

Background

With the rapid development of the Internet and its technologies, the Web has become the largest database in human history. In addition to the content that expresses its theme, however, a Web page contains a large number of navigation links, advertisement links, copyright notices and other content that is unrelated or only marginally related to the theme. Such data is commonly referred to as the noise data of the page, and its presence poses a significant challenge for applications built on Web page data. Currently, mainstream webpage topic information extraction techniques fall into two schools: a text school centered on text density and a visual school centered on visual display characteristics. The former relies mainly on the text density characteristics of web pages; it is fast and meets most application requirements for traditional news pages. The latter uses browser rendering technology to restore the visual display characteristics of the webpage and extracts the webpage theme information from those visual characteristics.

Text-density-based methods cannot handle newer websites, whose display modes and display elements are increasingly rich. Visual-feature-based methods depend heavily on browser rendering technology, place higher demands on hardware, are slow and relatively unstable, and have a high technical threshold, which makes them unsuitable for large-scale application.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides an automatic webpage information extraction method that balances extraction efficiency and accuracy: it takes the layout characteristics and some of the HTML visual characteristics of the webpage into account without significantly reducing the efficiency of traditional webpage extraction methods, and effectively improves extraction accuracy.

In order to achieve the above object, the present invention adopts the following technical scheme:

An automatic webpage information extraction method comprises the following steps: preprocessing the webpage information, constructing a block DOM tree, locating the text region, and extracting the webpage text;

wherein constructing the block DOM tree comprises the following steps: performing fault-tolerance compensation and DOM parsing on the webpage source code, constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements, counting the basic theme elements of each DOM block in combination with their display characteristics, and performing a weighted calculation on the basic theme elements of each DOM block;

and wherein the text region is located according to the theme weights obtained from the weighted calculation.

Further, locating the text region comprises the steps of: recursively shrinking from top to bottom according to the theme weights of the DOM blocks to locate the candidate theme blocks, merging the candidate DOM blocks to obtain the text block, and cutting and denoising the text block according to the theme weights.

Further, locating the text region comprises the step of: filtering the copyright block.

Further, the copyright statement blocks are filtered by traversing the DOM blocks in reverse order in combination with the copyright statement feature library.

Further, extracting the webpage text comprises the following steps: determining the text-related pictures, determining the text-related videos, determining the text-related data tables, and constructing the text by combining the text of the text block with the determined text-related pictures, videos and data tables.

Further, the sibling blocks preceding the text block and the text block itself are traversed, and the picture and video links that are not on the blacklist are extracted as the text-related pictures and text-related videos, respectively.

Further, the text block is traversed to extract data tables as the text-related data tables.

Further, the automatic extraction method of the webpage information further comprises the following steps: extracting text-related basic metadata;

extracting the text-related basic metadata comprises: title extraction, source extraction, release time extraction, and author extraction.

Further, the sibling blocks preceding the text block and the short text nodes within the text block are traversed, the longest common substring of each text node's characters and the web page title text is calculated, and the text node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold;

the sibling blocks preceding the text block are traversed, character strings conforming to the source prefix and suffix features are extracted according to the source feature library, and the strings are added to the source candidate set;

the sibling blocks preceding the text block are traversed, character strings conforming to the release time prefix and suffix features are extracted according to the release time feature library, and the strings are added to the release time candidate set;

and the sibling blocks preceding the text block are traversed, character strings conforming to the author prefix and suffix features are extracted according to the author feature library, and the strings are added to the author candidate set.

Further, preprocessing the webpage information includes:

performing Unicode transcoding on the HTML webpage source code and encoding and decoding special characters.

The advantage of the invention is that it balances the efficiency and accuracy of webpage information extraction: it takes the layout characteristics and some of the HTML visual characteristics of the webpage into account without significantly reducing the efficiency of traditional webpage extraction methods, and effectively improves extraction accuracy.

In addition to extracting webpage information automatically by program, the method makes full use of the blacklists, rule bases and knowledge bases that have already been accumulated, which significantly improves the accuracy of automatic extraction; the application range and accuracy of the method can be improved further by continuously updating the rule bases and knowledge bases.

The webpage DOM structure is combined with the layout characteristics of the webpage, and text, pictures, videos and tables are fused in one calculation to construct a block DOM that carries a comprehensive theme weight and some visual characteristics, which improves the accuracy of text extraction and widens the application range of the extraction algorithm. Besides the webpage text, the existing blacklists, knowledge bases and rule bases can be used to extract key fields such as text-related pictures, videos and tables, the title, the release time, the source and the author more accurately.

Drawings

Fig. 1 is a flowchart of a method for automatically extracting web page information.

Detailed Description

The invention is described in detail below with reference to the drawings and the specific embodiments.

As shown in Fig. 1, a method for automatically extracting web page information includes the following steps: 1. preprocessing the webpage information; 2. constructing a block DOM tree; 3. locating the text region; 4. extracting the text of the webpage; and 5. extracting the text-related basic metadata.

1. Preprocessing web page information

Preprocessing the webpage information comprises the following steps: performing Unicode transcoding on the HTML webpage source code, and encoding and decoding special characters.
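The preprocessing step can be sketched as follows. This is a minimal illustration only: the function names `to_unicode` and `decode_entities`, the fallback to a lossy UTF-8 decode, and the choice to decode character references on extracted text rather than on the whole source are assumptions, not details given in the description.

```python
import html


def to_unicode(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Transcode raw page bytes into a Unicode string."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback: a lossy UTF-8 decode instead of failing outright.
        return raw.decode("utf-8", errors="replace")


def decode_entities(text: str) -> str:
    """Resolve special-character references such as &amp; or &#20013;."""
    return html.unescape(text)


page = to_unicode("<p>新华&nbsp;News &amp; Data</p>".encode("gbk"), "gbk")
```

Decoding references on the full source before parsing would turn escaped markup such as `&lt;` into real tags, which is why this sketch keeps the two functions separate.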

2. Building a block DOM tree

Building a block DOM tree comprises the steps of:

2.1 performing fault-tolerance compensation and DOM parsing on the webpage source code;

2.2 constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements;

2.3 counting the basic theme elements of each DOM block in combination with their display characteristics;

2.4 performing a weighted calculation on the basic theme elements of each DOM block.
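Steps 2.1 and 2.2 can be illustrated with a small sketch built on Python's standard `html.parser`. The set `BLOCK_TAGS`, the dictionary node layout and the recovery rule for unclosed tags are all assumptions made for illustration; the description does not enumerate the block layout elements or specify the fault-tolerance strategy.

```python
from html.parser import HTMLParser

# Assumed set of block-level layout tags; the source does not list them.
BLOCK_TAGS = {"div", "p", "section", "article", "table", "ul", "ol", "li",
              "h1", "h2", "h3", "blockquote"}


class BlockDomBuilder(HTMLParser):
    """Collapse an HTML document into a tree of block-level nodes."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            node = {"tag": tag, "text": "", "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS:
            return
        # Tolerate unclosed descendants: pop back to the nearest open
        # block with this tag name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Inline content (text inside <b>, <a>, ...) is folded into
        # the enclosing block.
        self.stack[-1]["text"] += data


def build_block_dom(html_source: str) -> dict:
    builder = BlockDomBuilder()
    builder.feed(html_source)
    builder.close()
    return builder.root


# The second <p> is never closed; the closing </div> recovers from it.
root = build_block_dom("<div><p>Hello <b>world</b></p><p>Second</div><p>Tail</p>")
```

A real implementation would likely use a full HTML5-compliant parser for fault tolerance; this sketch does not, for example, auto-close a `<p>` when another `<p>` opens.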

The theme weight contributed by each kind of element is the product of its count and its per-element weight. The per-element weight is based mainly on the visual display information of the element node: elements that have a paragraph-segmenting, block-forming, centering or display-enhancing effect receive a higher weight.

Text information and weights (positive weights): the number of plain-text words and its weight, and the number of valid (long) texts and its weight.

Hyperlink information and weights (negative weights): the number of hyperlinks and its weight, the number of linked words, and the average word-to-link ratio (off-site links carry a higher negative weight).

Picture information and weights: the number of junk pictures (pictures that hit the blacklist and small pictures; negative weight), the number of unlinked pictures and its weight, and the number of linked large pictures and its weight.

Table information and weights: the number of data table cells.

Video information and weights: the number of junk videos (videos that hit the blacklist), and the number of normal videos and its weight.
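The count-times-weight calculation above can be sketched as follows. Every numeric weight and every key name in `ELEMENT_WEIGHTS` is a hypothetical placeholder; the description gives only the sign of each contribution, not its magnitude.

```python
from typing import Mapping

# Hypothetical per-element weights: positive for theme content,
# negative for noise. The source gives no numeric values.
ELEMENT_WEIGHTS = {
    "text_words": 1.0,
    "long_texts": 20.0,
    "links": -5.0,
    "linked_words": -1.0,
    "junk_pictures": -10.0,
    "unlinked_pictures": 15.0,
    "linked_large_pictures": 5.0,
    "table_cells": 2.0,
    "junk_videos": -10.0,
    "videos": 30.0,
}


def theme_weight(counts: Mapping[str, int],
                 weights: Mapping[str, float] = ELEMENT_WEIGHTS) -> float:
    """Sum of count * per-element weight over all element kinds in a block."""
    return sum(weights[name] * n for name, n in counts.items())


article = {"text_words": 800, "long_texts": 5, "links": 2,
           "linked_words": 10, "unlinked_pictures": 1}
navigation = {"text_words": 40, "links": 10, "linked_words": 40}

article_weight = theme_weight(article)        # 800 + 100 - 10 - 10 + 15
navigation_weight = theme_weight(navigation)  # 40 - 50 - 40
```

With these placeholder values a text-heavy block scores well above zero while a link-heavy navigation block scores below zero, which is the separation the later locating step relies on.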

3. Locating text regions

Filtering the copyright block: the DOM blocks are traversed in reverse order in combination with the copyright statement feature library, and the copyright statement blocks are filtered out.
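A minimal sketch of the reverse-order copyright filter is shown below. The patterns in `COPYRIGHT_PATTERNS` stand in for the copyright statement feature library, blocks are reduced to their text for brevity, and stopping at the first non-matching block is an assumption motivated by copyright notices usually sitting at the end of a page.

```python
import re
from typing import List

# Hypothetical stand-in for the copyright statement feature library.
COPYRIGHT_PATTERNS = [
    re.compile(r"copyright|©|all rights reserved", re.IGNORECASE),
    re.compile(r"版权所有|未经授权"),
]


def is_copyright(text: str) -> bool:
    return any(p.search(text) for p in COPYRIGHT_PATTERNS)


def filter_copyright_blocks(blocks: List[str]) -> List[str]:
    """Drop trailing copyright blocks, scanning from the last block backwards."""
    kept = list(blocks)
    for i in range(len(kept) - 1, -1, -1):
        if is_copyright(kept[i]):
            del kept[i]
        else:
            break  # assumed: stop at the first block that is not a notice
    return kept


blocks = [
    "Headline",
    "Body paragraph.",
    "Copyright © 2020 Example News. All rights reserved.",
    "版权所有 未经授权不得转载",
]
cleaned = filter_copyright_blocks(blocks)
```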

Recursively shrinking from top to bottom according to the topic weights of the DOM blocks to locate the candidate theme blocks: the DOM block with the largest topic weight is found and recorded as max_block, and the DOM block with the second-largest topic weight is recorded as second_block; if the ratio of the weight of max_block to the weight of its parent node exceeds a certain threshold, max_block is taken as the new root node and shrinking continues; otherwise shrinking stops.

Merging the candidate DOM blocks to obtain the text block: if the weight of second_block is greater than a certain threshold, or the ratio of the weight of second_block to that of max_block is greater than a certain threshold, it is checked whether second_block and max_block have a common parent node or grandparent node; if so, the common parent node or grandparent node is taken as the text block content_block, and the multi_block flag is set to TRUE.

Cutting and denoising the text block according to the topic weight: if multi_block is TRUE, content_block is cut and the blocks whose topic weight is smaller than the average are filtered out; if multi_block is FALSE, the blocks whose topic weight is less than zero are filtered out.
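The shrink, merge and cut steps can be sketched on a toy block tree as follows. This is one possible reading of the description: the `Block` class, the thresholds `shrink_ratio`, `merge_abs` and `merge_ratio`, and the restriction of the common-ancestor check to a shared parent are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Block:
    name: str
    own_weight: float = 0.0
    children: List["Block"] = field(default_factory=list)
    parent: Optional["Block"] = field(default=None, repr=False)

    def __post_init__(self):
        for child in self.children:
            child.parent = self

    @property
    def weight(self) -> float:
        return self.own_weight + sum(c.weight for c in self.children)


def locate_text_block(root: Block, shrink_ratio: float = 0.6,
                      merge_abs: float = 300.0, merge_ratio: float = 0.5
                      ) -> Tuple[Block, bool, List[Block]]:
    current, max_block, second_block = root, root, None
    # Top-down shrinking: descend while one child dominates its parent.
    while current.children:
        ranked = sorted(current.children, key=lambda b: b.weight, reverse=True)
        max_block = ranked[0]
        second_block = ranked[1] if len(ranked) > 1 else None
        if current.weight > 0 and max_block.weight / current.weight > shrink_ratio:
            current = max_block
        else:
            break
    # Merging: a strong runner-up sharing a parent widens the text block.
    multi_block = (
        second_block is not None
        and second_block.parent is max_block.parent
        and (second_block.weight > merge_abs
             or (max_block.weight > 0
                 and second_block.weight / max_block.weight > merge_ratio))
    )
    content_block = max_block.parent if multi_block else max_block
    # Cutting: below-average children if merged, negative children otherwise.
    kids = content_block.children
    if multi_block:
        mean = sum(k.weight for k in kids) / len(kids)
        kept = [k for k in kids if k.weight >= mean]
    else:
        kept = [k for k in kids if k.weight >= 0]
    return content_block, multi_block, kept


article = Block("article", children=[Block("p1", 500), Block("p2", 420), Block("ad", -20)])
main = Block("main", children=[article, Block("sidebar", -30)])
page = Block("page", children=[Block("nav", -50), main, Block("footer", -20)])
content_block, multi_block, kept = locate_text_block(page)
```

In the toy page, shrinking descends from `page` through `main` to `article`, stops because neither paragraph dominates, merges the two paragraphs under `article`, and cuts the below-average `ad` block.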

4. Extracting the text of the webpage

Extracting the text of the webpage comprises the following steps: determining the text-related pictures, determining the text-related videos, determining the text-related data tables, and constructing the text.

The sibling blocks preceding the text block and the text block itself are traversed, and the picture and video links that are not on the blacklist are extracted as the text-related pictures and text-related videos, respectively.
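The blacklist-filtered extraction of picture and video links can be sketched with the standard `html.parser`. The blacklist contents (`BLACKLIST_HOSTS`, `BLACKLIST_KEYWORDS`), the example URLs and the decision to treat `<video>` and `<source>` tags as video links are hypothetical; the description does not say what the blacklist contains.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

# Hypothetical blacklist entries.
BLACKLIST_HOSTS = {"ads.example.com"}
BLACKLIST_KEYWORDS = ("logo", "qrcode", "icon")


def is_blacklisted(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    return host in BLACKLIST_HOSTS or any(k in url.lower() for k in BLACKLIST_KEYWORDS)


class MediaCollector(HTMLParser):
    """Collect non-blacklisted picture and video links in document order."""

    def __init__(self):
        super().__init__()
        self.pictures: List[str] = []
        self.videos: List[str] = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src or is_blacklisted(src):
            return
        if tag == "img":
            self.pictures.append(src)
        elif tag in ("video", "source"):
            self.videos.append(src)


def extract_media(fragments: Iterable[str]) -> Tuple[List[str], List[str]]:
    """fragments: HTML of the preceding sibling blocks, then the text block."""
    collector = MediaCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()
    return collector.pictures, collector.videos


pictures, videos = extract_media([
    '<div><img src="/static/logo.png"></div>',
    '<div><img src="https://img.example.com/flood.jpg">'
    '<img src="https://ads.example.com/banner.jpg">'
    '<video src="https://v.example.com/clip.mp4"></video></div>',
])
```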

The text block is traversed to extract data tables as the text-related data tables.

Constructing the text: the text is constructed by combining the text of the text block with the determined text-related pictures, videos and data tables. Specifically, the determined pictures, videos and data tables are combined with the text information of the text block in their order of appearance in the HTML, the basic HTML display characteristics are retained, and a rich text in which pictures, tables and videos are interleaved with the text is constructed.

5. Extracting text-related basic metadata

5.1 extracting the title

The sibling blocks preceding the text block and the short text nodes within the text block are traversed in order, the longest common substring of each text node's characters and the web page title text is calculated, and the text node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold. If the title candidate set contains more than one node, the best text node is selected by jointly considering the visual enhancement effect of the node, the length of the common substring, and the ratio of the common substring length to the text node length; if the title candidate set is empty, the web page title is returned as the main title of the web page.
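The longest-common-substring test for title candidates can be sketched as below. The threshold value of 0.8 and the tie-breaking rule (prefer the candidate with the longest common substring, ignoring visual enhancement) are simplifying assumptions.

```python
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def title_candidates(nodes: List[str], page_title: str,
                     threshold: float = 0.8) -> List[str]:
    """Keep text nodes that are mostly contained in the web page title."""
    candidates = []
    for text in nodes:
        text = text.strip()
        if not text:
            continue
        lcs = longest_common_substring(text, page_title)
        if len(lcs) / len(text) > threshold:
            candidates.append(text)
    return candidates


def pick_title(nodes: List[str], page_title: str, threshold: float = 0.8) -> str:
    candidates = title_candidates(nodes, page_title, threshold)
    if not candidates:
        return page_title  # fall back to the web page title
    # Simplification: rank only by common-substring length.
    return max(candidates,
               key=lambda t: len(longest_common_substring(t, page_title)))


page_title = "Flood warning issued for river towns - Example News"
nodes = ["Example News", "Flood warning issued for river towns",
         "By Jane Doe", "2020-03-27 10:00"]
title = pick_title(nodes, page_title)
```

Both the site name and the headline pass the ratio test here, and the headline wins because its common substring is longer.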

5.2 extracting the sources

The sibling blocks preceding the text block are traversed in order, character strings conforming to the source prefix and suffix features are extracted according to the source feature library, and the strings are added to the source candidate set; if the candidate set is empty, character strings conforming to the source prefix and suffix features are extracted from the beginning and the end of the text, respectively, according to the source feature library and added to the source candidate set. If the candidate set contains more than one entry, the entry that matches the content of the media source library is preferred as the article source.

5.3 extracting the release time

The sibling blocks preceding the text block are traversed in order, character strings conforming to the release time prefix and suffix features are extracted according to the release time feature library, and the strings are added to the release time candidate set; if the candidate set contains more than one entry, the entry whose value is valid and matches the content of the release time format library is preferred as the release time.
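Prefix-driven extraction of the release time can be sketched as follows; the same pattern applies to sources and authors with different feature libraries. `TIME_PREFIXES` and `TIME_FORMATS` are hypothetical stand-ins for the release time feature library and format library, and only prefix features are handled here.

```python
import re
from datetime import datetime
from typing import List, Optional

# Hypothetical stand-ins for the feature library and the format library.
TIME_PREFIXES = ["发布时间", "Published", "Posted on"]
TIME_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d", "%Y/%m/%d"]

_PREFIX = "|".join(re.escape(p) for p in TIME_PREFIXES)
_PATTERN = re.compile(r"(?:" + _PREFIX + r")\s*[:：]?\s*([0-9][0-9/\-: ]{7,18})")


def release_time_candidates(sibling_texts: List[str]) -> List[str]:
    """Collect strings that follow a known release time prefix."""
    candidates = []
    for text in sibling_texts:
        for match in _PATTERN.finditer(text):
            candidates.append(match.group(1).strip())
    return candidates


def parse_time(value: str) -> Optional[datetime]:
    """Return a datetime if the value matches a known format and is valid."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def pick_release_time(sibling_texts: List[str]) -> Optional[datetime]:
    for candidate in release_time_candidates(sibling_texts):
        parsed = parse_time(candidate)
        if parsed is not None:
            return parsed
    return None


siblings = ["Posted on 2020/13/45",
            "Source: Example News  Published: 2020-03-27 10:05"]
release_time = pick_release_time(siblings)
```

The first candidate has the right shape but an impossible month, so it fails validation and the second candidate is chosen.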

5.4 extracting authors

The sibling blocks preceding the text block are traversed in order, character strings conforming to the author prefix and suffix features are extracted according to the author feature library, and the strings are added to the author candidate set; if the author candidate set is empty, character strings conforming to the author prefix and suffix features are extracted from the beginning and the end of the text, respectively, according to the author feature library and added to the author candidate set. If the candidate set contains more than one entry, the entry that matches the content of the author library is preferred as the article author.

The webpage DOM structure is combined with the block layout elements of the webpage to construct a block DOM structure that carries text features and some visual features; the various basic elements of the text, such as characters, pictures, videos and tables, are fused in one calculation, and the theme contribution value of each DOM block is calculated quantitatively. The core block of the webpage theme is located by a top-down block contraction algorithm, the theme candidate blocks of the webpage are selected by a bottom-up block expansion algorithm, and noise cutting is finally performed on the candidate theme blocks to complete the final theme block positioning. On the basis of the determined theme block, the text information containing text, pictures, videos and tables is extracted in combination with the blacklist, the rule base and the knowledge base. With the theme block as the center, the title, release time, source and author of the text are extracted in combination with the rule base, the knowledge base, the context position and the display characteristics.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims

1. An automatic webpage information extraction method, characterized by comprising the following steps:

preprocessing the webpage information;

constructing a block DOM tree;

positioning a text region; and

extracting the text of the webpage;

wherein constructing the block DOM tree comprises the following steps:

performing fault tolerance compensation and DOM analysis on the webpage source code;

constructing a block DOM structure by combining the HTML block layout elements on the basis of the DOM;

counting the number of basic theme elements of the DOM block by combining the display characteristics; and

weighting calculation is carried out on basic theme elements of the DOM block;

when the text region is positioned, positioning the text region according to the theme weight obtained by weighting calculation;

extracting the text of the webpage comprises the following steps:

determining text related pictures;

determining text related videos;

determining a text-related data table; and

constructing the text by combining the text of the text block with the determined text-related pictures, videos and data tables.

2. The method for automatically extracting web page information according to claim 1, wherein,

locating the text region comprises the steps of:

recursively shrinking and positioning candidate topic blocks from top to bottom according to the topic weights of the DOM blocks;

merging the candidate DOM blocks to obtain a text block; and

cutting and denoising the text block according to the theme weight.

3. The method for automatically extracting web page information according to claim 2, wherein,

locating the text region comprises the steps of:

filtering the copyright block.

4. The method for automatically extracting web page information according to claim 3, wherein,

the DOM blocks are traversed in reverse order in combination with the copyright statement feature library to filter the copyright statement blocks.

5. The method for automatically extracting web page information according to claim 1, wherein,

6. The method for automatically extracting web page information according to claim 2, wherein,

the automatic webpage information extraction method further comprises the following steps: extracting text-related basic metadata;

7. The method for automatically extracting web page information as recited in claim 6, wherein,

the sibling blocks preceding the text block and the short text nodes within the text block are traversed, the longest common substring of each text node's characters and the web page title text is calculated, and the text node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold;

8. The method for automatically extracting web page information according to claim 2, wherein,

preprocessing the webpage data comprises the following steps: