CN111428444B - Automatic extraction method for webpage information - Google Patents

Publication number
CN111428444B
Authority
CN
China
Prior art keywords
text
block
dom
webpage
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010228475.1A
Other languages
Chinese (zh)
Other versions
CN111428444A (en)
Inventor
吕聚旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd
Priority to CN202010228475.1A
Publication of CN111428444A
Application granted
Publication of CN111428444B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an automatic webpage information extraction method comprising the following steps: preprocessing the webpage information; constructing a block DOM tree; locating the text region; and extracting the webpage text. Constructing the block DOM tree comprises: performing fault-tolerance compensation and DOM parsing on the webpage source code; constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements; counting the basic topic elements of each DOM block in combination with their display characteristics; and performing a weighted calculation over the basic topic elements of each DOM block. The text region is located according to the topic weight obtained by the weighted calculation. The method balances the efficiency and accuracy of webpage information extraction: without significantly reducing the efficiency of traditional webpage extraction methods, it takes into account the layout characteristics of the webpage and some of the visual characteristics of the HTML, and thereby effectively improves extraction accuracy.

Description

Automatic extraction method for webpage information
Technical Field
The invention relates to an automatic webpage information extraction method.
Background
With the rapid development of the Internet and its technologies, the Web has become the largest database in human history. Besides the content that expresses its topic, however, a Web page contains a large number of navigation links, advertisement links, copyright notices and other content that is unrelated or largely unrelated to the topic. Such data is commonly called the noise data of the page, and its presence poses a significant challenge for applications built on Web page data. Current mainstream techniques for extracting webpage topic information fall into two schools: a text school centred on text density and a visual school centred on visual display characteristics. The former relies mainly on the text density characteristics of the page; it is fast and meets most application requirements for traditional news pages. The latter uses browser rendering technology to restore the visual display characteristics of the page and extracts the topic information from those visual characteristics.
Methods based on text density cannot handle newer websites whose display modes and display elements are increasingly rich. Methods based on visual characteristics depend heavily on browser rendering technology, place higher demands on hardware, are slow and relatively unstable, and have a high technical threshold, which makes them unsuitable for large-scale application.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention provides an automatic webpage information extraction method that balances extraction efficiency and accuracy: without significantly reducing the efficiency of traditional webpage extraction methods, it takes into account the layout characteristics of the webpage and some of the visual characteristics of the HTML, and effectively improves the accuracy of webpage information extraction.
In order to achieve the above object, the present invention adopts the following technical scheme:
An automatic webpage information extraction method comprises the following steps: preprocessing the webpage information, constructing a block DOM tree, locating the text region, and extracting the webpage text;
wherein constructing the block DOM tree comprises the following steps: performing fault-tolerance compensation and DOM parsing on the webpage source code, constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements, counting the basic topic elements of each DOM block in combination with their display characteristics, and performing a weighted calculation over the basic topic elements of each DOM block;
and the text region is located according to the topic weight obtained by the weighted calculation.
Further, locating the text region comprises the following steps: recursively shrinking from top to bottom to locate the candidate topic blocks according to the topic weights of the DOM blocks, merging the candidate DOM blocks to obtain the text block, and cutting and denoising the text block according to the topic weights.
Further, locating the text region comprises the step of filtering the copyright block.
Further, the DOM blocks are traversed in reverse order and, with reference to a copyright statement feature library, the copyright statement blocks are filtered out.
Further, extracting the webpage text comprises the following steps: determining the text-related pictures, determining the text-related videos, determining the text-related data tables, and, on that basis, constructing the text in combination with the text of the text block.
Further, the sibling blocks preceding the text block and the text block itself are traversed, and the picture and video links that are not on the blacklist are extracted as the text-related pictures and text-related videos respectively.
Further, the text block is traversed to extract its data tables as the text-related data tables.
Further, the automatic extraction method of the webpage information further comprises the following steps: extracting text-related basic metadata;
extracting the text-related basic metadata comprises: title extraction, source extraction, distribution time extraction, and author extraction.
Further, the sibling blocks preceding the text block and the short text nodes within the text block are traversed, the longest common substring of each text node's characters and the webpage title text is computed, and the node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold;
the sibling blocks preceding the text block are traversed, and character strings matching the source prefix and suffix features in the source feature library are extracted and added to the source candidate set;
the sibling blocks preceding the text block are traversed, and character strings matching the release-time prefix and suffix features in the release-time feature library are extracted and added to the release-time candidate set;
and the sibling blocks preceding the text block are traversed, and character strings matching the author prefix and suffix features in the author feature library are extracted and added to the author candidate set.
Further, preprocessing the webpage data includes:
performing Unicode transcoding on the HTML webpage source code and encoding and decoding special characters.
The invention has the following advantages. It balances the efficiency and accuracy of webpage information extraction: without significantly reducing the efficiency of traditional webpage extraction methods, it takes into account the layout characteristics of the webpage and some of the visual characteristics of the HTML, and effectively improves extraction accuracy.
On top of program-driven automatic extraction of webpage information, the method makes full use of the blacklists, rule bases and knowledge bases that have already been accumulated, which markedly improves the accuracy of automatic extraction; continuously updating the rule base and the knowledge base further extends the applicability and accuracy of the method.
By combining the webpage DOM structure with the layout characteristics of the webpage and fusing text, pictures, videos and tables in a single calculation, the method constructs a block DOM that carries a comprehensive topic weight and partial visual characteristics, which improves the accuracy of text extraction and broadens the applicability of the extraction algorithm. Beyond the webpage text itself, the existing blacklists, knowledge bases and rule bases allow key fields such as text pictures, videos, tables, title, release time, source and author to be extracted more accurately.
Drawings
Fig. 1 is a flowchart of a method for automatically extracting web page information.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments.
As shown in Fig. 1, an automatic webpage information extraction method comprises the following steps: 1. preprocessing the webpage information; 2. constructing a block DOM tree; 3. locating the text region; 4. extracting the webpage text; 5. extracting the text-related basic metadata.
The text region is located according to the topic weight obtained by the weighted calculation.
1. Preprocessing web page information
Preprocessing the webpage information comprises performing Unicode transcoding on the HTML webpage source code and encoding and decoding special characters.
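The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `to_unicode` and `decode_entities`, the list of fallback encodings, and the choice to decode entities on extracted text fragments (rather than on the whole markup, where turning `&lt;` into `<` could corrupt tags) are all assumptions.

```python
import html
from typing import Optional


def to_unicode(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode raw page bytes to a Unicode string.

    The fallback order (declared encoding, UTF-8, GB18030) is an assumption
    chosen for Chinese news pages; the patent does not list encodings.
    """
    for encoding in (declared_encoding, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate
    # Last resort: never fail, replace undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")


def decode_entities(fragment: str) -> str:
    """Turn character references such as &amp; or &#20013; into characters."""
    return html.unescape(fragment)
```

For example, `to_unicode(page_bytes, "gbk")` would try the declared encoding first and fall back silently if the declaration is wrong, which is a common situation on older pages.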
2. Building a block DOM tree
Building a block DOM tree comprises the steps of:
2.1 performing fault-tolerance compensation and DOM parsing on the webpage source code;
2.2 constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements;
2.3 counting the basic topic elements of each DOM block in combination with their display characteristics;
2.4 performing a weighted calculation over the basic topic elements of each DOM block.
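Steps 2.1 to 2.3 can be sketched with Python's standard `html.parser`, which tolerates malformed markup. The set of block-layout tags, the `Block` class and the per-block counters below are assumptions made for illustration; the patent does not enumerate the block elements or the data structure it uses.

```python
from html.parser import HTMLParser

# Assumed set of block-layout tags; the patent does not enumerate them.
BLOCK_TAGS = {"body", "div", "section", "article", "table", "ul", "ol", "p", "td"}


class Block:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []    # text fragments owned directly by this block
        self.links = 0    # <a> elements directly inside this block
        self.images = []  # src values of <img> directly inside this block


class BlockTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Block("root")
        self.stack = [self.root]  # currently open block elements only

    def handle_starttag(self, tag, attrs):
        current = self.stack[-1]
        if tag in BLOCK_TAGS:
            block = Block(tag, current)
            current.children.append(block)
            self.stack.append(block)
        elif tag == "a":
            current.links += 1
        elif tag == "img":
            current.images.append(dict(attrs).get("src") or "")

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS:
            return
        # Fault tolerance: close everything up to the nearest matching open
        # block, and silently ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text.append(data.strip())


def build_block_tree(source):
    builder = BlockTreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Inline elements such as `<b>` or `<span>` do not open a block, so their text is folded into the nearest enclosing block, which is one plausible reading of "block DOM".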
The weighted value of an element type is the product of its count and its weight. The weight is determined mainly by the visual display information of the element node: elements that produce paragraphing, blocking, centering or display-emphasis effects receive higher weights.
Text statistics and weights (positive weights): the number of plain-text words and its weight, and the number of valid (long) texts and its weight.
Hyperlink statistics and weights (negative weights): the number of hyperlinks and its weight, the number of linked words, and the average ratio of linked words; off-site links carry a higher negative weight.
Picture statistics and weights: the number of junk pictures (pictures that hit the blacklist and small pictures, which carry negative weight), the number of unlinked pictures and its weight, and the number of linked large pictures and its weight.
Table statistics and weights: the number of data table cells.
Video statistics and weights: the number of junk videos (videos that hit the blacklist), and the number of normal videos and its weight.
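The weighted calculation described above amounts to a sum of count times weight over the element types of a block. The sketch below assumes that reading; every numeric weight and every key name in `WEIGHTS` is hypothetical, since the patent gives no values, and the display-emphasis adjustment is left out for brevity.

```python
# Hypothetical per-element weights; the patent gives no numeric values.
# Positive entries pull a block towards "main text", negative ones away.
WEIGHTS = {
    "text_words": 1.0,
    "long_text_nodes": 20.0,
    "links": -5.0,
    "link_words": -1.0,
    "external_links": -10.0,
    "junk_images": -15.0,
    "unlinked_images": 10.0,
    "linked_large_images": 5.0,
    "table_cells": 2.0,
    "junk_videos": -15.0,
    "normal_videos": 30.0,
}


def topic_weight(counts, weights=WEIGHTS):
    """Sum of count * weight over the basic topic elements of one block."""
    return sum(weights[name] * count for name, count in counts.items())


article = {"text_words": 800, "long_text_nodes": 6, "links": 2,
           "link_words": 10, "unlinked_images": 1}
navigation = {"text_words": 60, "links": 15, "link_words": 60}
print(topic_weight(article), topic_weight(navigation))
```

With these illustrative numbers a text-heavy block ends up with a large positive weight and a link-heavy navigation block with a negative one, which is the separation the later locating steps rely on.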
3. Locating text regions
Filtering the copyright block: the DOM blocks are traversed in reverse order and, with reference to the copyright statement feature library, the copyright statement blocks are filtered out.
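A possible shape for the reverse-order copyright filter is sketched below. The patterns in `COPYRIGHT_PATTERNS` stand in for the copyright statement feature library and are invented for illustration; the `max_misses` stopping rule is likewise an assumption, added so that a body paragraph that merely mentions copyright near the top of the page is not removed.

```python
import re

# Hypothetical stand-in for the copyright statement feature library.
COPYRIGHT_PATTERNS = [
    re.compile(r"copyright|all rights reserved|©", re.IGNORECASE),
    re.compile(r"版权所有|ICP备"),
]


def filter_copyright_blocks(block_texts, max_misses=2):
    """Scan blocks from the end of the page and drop copyright notices.

    block_texts: the text of each DOM block, in document order.
    Scanning stops after max_misses non-matching blocks (an assumption).
    """
    kept = list(block_texts)
    misses = 0
    for i in range(len(kept) - 1, -1, -1):
        if any(p.search(kept[i]) for p in COPYRIGHT_PATTERNS):
            del kept[i]
        else:
            misses += 1
            if misses >= max_misses:
                break
    return kept
```

Traversing from the end reflects the observation that copyright notices usually sit in the page footer, so most pages are handled after inspecting only a few blocks.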
Recursively shrinking from top to bottom to locate the candidate topic blocks according to the topic weights of the DOM blocks: the DOM block with the largest topic weight is found and recorded as max_block, and the DOM block with the second-largest topic weight is recorded as second_block; if the ratio of the weight of max_block to the weight of its parent node exceeds a certain threshold, shrinking continues with max_block as the root node; otherwise shrinking stops.
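The shrinking step can be sketched as a loop that descends into the heaviest child for as long as that child holds most of its parent's weight. The `Block` dataclass, the iterative formulation and the threshold value 0.6 are assumptions; the patent only speaks of "a certain threshold".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    name: str
    weight: float  # topic weight of the whole subtree
    children: List["Block"] = field(default_factory=list)


def shrink(root, ratio_threshold=0.6):
    """Return (max_block, second_block) at the level where shrinking stops."""
    max_block, second_block = root, None
    current = root
    while current.children:
        ranked = sorted(current.children, key=lambda b: b.weight, reverse=True)
        max_block = ranked[0]
        second_block = ranked[1] if len(ranked) > 1 else None
        # Stop when the heaviest child no longer dominates its parent.
        if current.weight <= 0 or max_block.weight / current.weight <= ratio_threshold:
            break
        current = max_block  # shrink: continue with max_block as the new root
    return max_block, second_block
```

On a page whose `main` container holds 90% of the weight but whose article and comment blocks split that weight between them, the loop descends into `main` and then stops, returning the article as `max_block` and the comments as `second_block`.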
Merging the candidate DOM blocks to obtain the text block: if the weight of second_block is greater than a certain threshold, or the ratio of the weight of second_block to that of max_block is greater than a certain threshold, check whether second_block and max_block share a common parent or grandparent node; if so, take that common parent or grandparent node as the text block content_block and set the multi_block flag to TRUE.
Cutting and denoising the text block according to the topic weights: if multi_block is TRUE, the content of content_block is cut and blocks whose topic weight is below the average are filtered out; if multi_block is FALSE, blocks whose topic weight is less than zero are filtered out.
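The merge and cut rules can be sketched together as below. The `Block` class with parent pointers, both threshold values, and the decision to apply the cut to the direct children of `content_block` are assumptions made for illustration.

```python
class Block:
    def __init__(self, name, weight, children=()):
        self.name, self.weight, self.parent = name, weight, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def ancestors(block, depth=2):
    """Parent and grandparent of a block (fewer if it is near the root)."""
    found = []
    node = block.parent
    while node is not None and len(found) < depth:
        found.append(node)
        node = node.parent
    return found


def merge_and_cut(max_block, second_block, abs_threshold=30.0, ratio_threshold=0.5):
    """Return (content_block, multi_block, kept_children)."""
    content_block, multi_block = max_block, False
    if second_block is not None and (
        second_block.weight > abs_threshold
        or (max_block.weight > 0
            and second_block.weight / max_block.weight > ratio_threshold)
    ):
        # Nearest parent or grandparent shared by both candidates, if any.
        shared = [a for a in ancestors(max_block) if a in ancestors(second_block)]
        if shared:
            content_block, multi_block = shared[0], True
    children = content_block.children
    if multi_block:
        mean = sum(c.weight for c in children) / len(children)
        kept = [c for c in children if c.weight >= mean]
    else:
        kept = [c for c in children if c.weight >= 0]
    return content_block, multi_block, kept
```

The stricter mean-based cut in the merged case makes sense because a shared ancestor usually drags in extra siblings (ads, share bars) that a single dominant block would not contain.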
4. Extracting the text of the webpage
Extracting the webpage text comprises the following steps: determining the text-related pictures, determining the text-related videos, determining the text-related data tables, and constructing the text.
The sibling blocks preceding the text block and the text block itself are traversed, and the picture and video links that are not on the blacklist are extracted as the text-related pictures and text-related videos respectively.
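Blacklist filtering of picture and video links might look like the sketch below. The blacklist contents, the host `ads.example.com`, and the dictionary layout of a block are all hypothetical; the patent only states that links on the blacklist are excluded.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries; a real deployment would load these from
# the accumulated blacklist mentioned in the description.
BLACKLIST_HOSTS = {"ads.example.com"}
BLACKLIST_FRAGMENTS = ("/icons/", "logo", "qrcode")


def is_blacklisted(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path.lower()
    return host in BLACKLIST_HOSTS or any(f in path for f in BLACKLIST_FRAGMENTS)


def related_media(blocks):
    """Collect non-blacklisted picture and video links in document order.

    blocks: the sibling blocks preceding the text block followed by the text
    block itself; each is a dict with optional 'images' and 'videos' lists.
    """
    images, videos = [], []
    for block in blocks:
        images += [u for u in block.get("images", []) if not is_blacklisted(u)]
        videos += [u for u in block.get("videos", []) if not is_blacklisted(u)]
    return images, videos
```

Including the preceding siblings lets a lead picture or video that sits above the article container still be attached to the article.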
The text block is traversed to extract its data tables as the text-related data tables.
Constructing the text: on the basis of the determined text-related pictures, videos and data tables, the text is constructed in combination with the text of the text block. Specifically, the text information of the text block is combined with the determined pictures, videos and data tables in their order of appearance in the HTML, the basic HTML display characteristics are retained, and a rich text in which pictures, tables and videos are interleaved is constructed.
5. Extracting text-related basic metadata
5.1 extraction of titles
The sibling blocks preceding the text block and the short text nodes within the text block are traversed in order; the longest common substring of each text node's characters and the webpage title text is computed, and the node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold. If the title candidate set contains more than one node, the best text node is selected by jointly considering the node's visual emphasis, the length of the common substring, and the ratio of that length to the length of the text node; if the title candidate set is empty, the webpage title is returned as the main title.
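The title rule can be sketched with `difflib.SequenceMatcher` from the standard library, whose `find_longest_match` returns the longest contiguous matching run. The threshold 0.8 and the tie-breaking by common-substring length alone are assumptions; the visual-emphasis criterion mentioned above is omitted because it needs styling information this sketch does not model.

```python
from difflib import SequenceMatcher


def longest_common_substring_len(a, b):
    """Length of the longest contiguous run shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pick_title(page_title, short_texts, ratio_threshold=0.8):
    """Choose the on-page text node that best matches the <title> text."""
    candidates = []
    for text in short_texts:
        text = text.strip()
        if not text:
            continue
        lcs = longest_common_substring_len(text, page_title)
        if lcs / len(text) > ratio_threshold:
            candidates.append((lcs, text))
    if not candidates:
        return page_title  # empty candidate set: fall back to the page title
    # Simplification: prefer the candidate with the longest common substring.
    return max(candidates)[1]
```

Dividing by the length of the text node rather than the page title is what lets the rule strip site-name suffixes such as " - Example News" from the `<title>` text.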
5.2 extracting the sources
The sibling blocks preceding the text block are traversed in order, and character strings matching the source prefix and suffix features in the source feature library are extracted and added to the source candidate set; if the candidate set is empty, character strings matching the source prefix and suffix features are extracted from the beginning and the end of the text respectively, according to the source feature library, and added to the source candidate set. If the candidate set contains more than one entry, the entry that matches the media source library is preferred as the article source.
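Source extraction can be sketched as a prefix match with a fallback to the head and tail of the body text. The regular expression, the 100-character window and the contents of `MEDIA_LIBRARY` are hypothetical stand-ins for the source feature library and the media source library; only prefix features are modelled here, not suffix features.

```python
import re

# Hypothetical prefix feature: "Source:" or "来源：" followed by a name that
# ends at a common delimiter.
SOURCE_PREFIX = re.compile(r"(?:Source|来源)\s*[:：]\s*(?P<name>[^|\n,，;；)）]+)")

# Hypothetical media source library used to rank multiple candidates.
MEDIA_LIBRARY = {"Xinhua News Agency", "新华社"}


def find_sources(text):
    return [m.group("name").strip() for m in SOURCE_PREFIX.finditer(text)]


def extract_source(preceding_texts, body_text=""):
    candidates = []
    for text in preceding_texts:
        candidates += find_sources(text)
    if not candidates:
        # Fallback: inspect only the beginning and the end of the body text.
        candidates += find_sources(body_text[:100] + "\n" + body_text[-100:])
    if not candidates:
        return None
    known = [c for c in candidates if c in MEDIA_LIBRARY]
    return (known or candidates)[0]
```

Author extraction follows the same pattern with an author feature library in place of the source one, so it is not sketched separately.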
5.3 extracting the release time
The sibling blocks preceding the text block are traversed in order, and character strings matching the release-time prefix and suffix features in the release-time feature library are extracted and added to the release-time candidate set; if the candidate set contains more than one entry, the entry whose value is valid and matches the release-time format library is preferred as the release time.
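Release-time extraction can be sketched as a pattern match followed by validation against a list of accepted formats. `TIME_PATTERN` and `FORMATS` are hypothetical stand-ins for the release-time feature library and format library, and the prefix check (for example "Published:") is left out to keep the sketch short.

```python
import re
from datetime import datetime

# Hypothetical date shapes: 2020-03-27, 2020/3/27, 2020年3月27日, with an
# optional time of day.
TIME_PATTERN = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)
# Hypothetical format library used to validate candidates.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d"]


def normalise(raw):
    """Parse a matched string, or return None if it is not a valid time."""
    cleaned = re.sub(r"[/年月]", "-", raw).replace("日", "")
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def extract_release_time(preceding_texts):
    """Return the first candidate that parses as a valid date, else None."""
    for text in preceding_texts:
        for match in TIME_PATTERN.finditer(text):
            parsed = normalise(match.group())
            if parsed is not None:
                return parsed
    return None
```

Validating with `strptime` rejects strings that merely look like dates, such as a month of 13, which corresponds to preferring candidates whose value is valid.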
5.4 extracting authors
The sibling blocks preceding the text block are traversed in order, and character strings matching the author prefix and suffix features in the author feature library are extracted and added to the author candidate set; if the author candidate set is empty, character strings matching the author prefix and suffix features are extracted from the beginning and the end of the text respectively, according to the author feature library, and added to the author candidate set. If the candidate set contains more than one entry, the entry that matches the author library is preferred as the article author.
In summary, the webpage DOM structure is combined with the block layout elements of the webpage to construct a block DOM structure that carries text features and partial visual features; the basic text elements such as characters, pictures, videos and tables are fused in a single calculation, and the topic contribution value of each DOM block is computed quantitatively. A top-down block shrinking algorithm locates the core block of the webpage topic, a bottom-up block expansion algorithm selects the topic candidate blocks of the webpage, and noise cutting on the candidate topic blocks completes the final topic block location. On the basis of the determined topic block, the text information, including text, pictures, videos and tables, is extracted with the help of the blacklist, the rule base and the knowledge base. Centred on the topic block, the text title, release time, source and author are extracted by combining the rule base, the knowledge base, the context position and the display characteristics.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be appreciated by persons skilled in the art that the above embodiments are not intended to limit the invention in any way, and that all technical solutions obtained by means of equivalent substitutions or equivalent transformations fall within the scope of the invention.

Claims (8)

1. An automatic webpage information extraction method, characterized by comprising the following steps:
preprocessing the webpage information;
constructing a block DOM tree;
positioning a text region; and
extracting the text of the webpage;
wherein constructing the block DOM tree comprises the following steps:
performing fault-tolerance compensation and DOM parsing on the webpage source code;
constructing a block DOM structure on the basis of the DOM in combination with the HTML block layout elements;
counting the basic topic elements of each DOM block in combination with their display characteristics; and
performing a weighted calculation over the basic topic elements of each DOM block;
when the text region is located, locating the text region according to the topic weight obtained by the weighted calculation;
extracting the webpage text comprises the following steps:
determining text related pictures;
determining text related videos;
determining a text-related data table; and
constructing the text in combination with the text of the text block on the basis of the determined text-related pictures, videos and data tables;
wherein the sibling blocks preceding the text block and the text block itself are traversed, and the picture and video links that are not on the blacklist are extracted as the text-related pictures and text-related videos respectively.
2. The method for automatically extracting web page information according to claim 1, wherein,
locating the text region comprises the steps of:
recursively shrinking from top to bottom to locate the candidate topic blocks according to the topic weights of the DOM blocks;
merging the candidate DOM blocks to obtain the text block; and
cutting and denoising the text block according to the topic weights.
3. The method for automatically extracting web page information according to claim 2, wherein,
locating the text region comprises the steps of:
filtering the copyright block.
4. The method for automatically extracting web page information according to claim 3, wherein,
the DOM blocks are traversed in reverse order and, with reference to the copyright statement feature library, the copyright statement blocks are filtered out.
5. The method for automatically extracting web page information according to claim 1, wherein,
the text block is traversed to extract its data tables as the text-related data tables.
6. The method for automatically extracting web page information according to claim 2, wherein,
the automatic webpage information extraction method further comprises the following steps: extracting text-related basic metadata;
extracting the text-related basic metadata comprises: title extraction, source extraction, distribution time extraction, and author extraction.
7. The method for automatically extracting web page information as recited in claim 6, wherein,
the sibling blocks preceding the text block and the short text nodes within the text block are traversed, the longest common substring of each text node's characters and the webpage title text is computed, and the node is added to the title candidate set when the ratio of the length of the longest common substring to the length of the text node's characters exceeds a certain threshold;
the sibling blocks preceding the text block are traversed, and character strings matching the source prefix and suffix features in the source feature library are extracted and added to the source candidate set;
the sibling blocks preceding the text block are traversed, and character strings matching the release-time prefix and suffix features in the release-time feature library are extracted and added to the release-time candidate set;
and the sibling blocks preceding the text block are traversed, and character strings matching the author prefix and suffix features in the author feature library are extracted and added to the author candidate set.
8. The method for automatically extracting web page information according to claim 2, wherein,
preprocessing the webpage data comprises the following steps:
performing Unicode transcoding on the HTML webpage source code and encoding and decoding special characters.
CN202010228475.1A 2020-03-27 2020-03-27 Automatic extraction method for webpage information Active CN111428444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010228475.1A CN111428444B (en) 2020-03-27 2020-03-27 Automatic extraction method for webpage information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010228475.1A CN111428444B (en) 2020-03-27 2020-03-27 Automatic extraction method for webpage information

Publications (2)

Publication Number Publication Date
CN111428444A CN111428444A (en) 2020-07-17
CN111428444B CN111428444B (en) 2023-10-20

Family

ID=71549019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010228475.1A Active CN111428444B (en) 2020-03-27 2020-03-27 Automatic extraction method for webpage information

Country Status (1)

Country Link
CN (1) CN111428444B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434797B (en) * 2021-06-29 2024-05-31 中电信数智科技有限公司 Webpage information extraction method and device
CN114037828A (en) * 2021-11-26 2022-02-11 北京沃东天骏信息技术有限公司 Component identification method and device, electronic equipment and storage medium
CN115658993B (en) * 2022-09-27 2023-06-06 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage
CN116362223B (en) * 2023-03-07 2023-12-15 北京粉笔蓝天科技有限公司 Automatic identification method and device for web page article titles and texts

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014000572A1 (en) * 2012-06-25 2014-01-03 北京奇虎科技有限公司 System and method for identifying floors of webpage main text
CN108268433A (en) * 2018-02-26 2018-07-10 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN109086361A (en) * 2018-07-20 2018-12-25 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting

Also Published As

Publication number Publication date
CN111428444A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428444B (en) Automatic extraction method for webpage information
Sun et al. Dom based content extraction via text density
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN107229668B (en) Text extraction method based on keyword matching
CN101079031A (en) Web page subject extraction system and method
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
CN101464898B (en) Method for extracting feature word of text
CN110119444B (en) Drawing type and generating type combined document abstract generating model
US20050267915A1 (en) Method and apparatus for recognizing specific type of information files
US20090177959A1 (en) Automatic visual segmentation of webpages
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
Chen et al. Template detection for large scale search engines
CN104598577B (en) A kind of extracting method of Web page text
CN103927397B (en) Recognition method for Web page link blocks based on block tree
CN101251855A (en) Equipment, system and method for cleaning internet web page
CN110489745B (en) Paper text similarity detection method based on citation network
Wu et al. News filtering and summarization on the web
CN101149739A (en) Internet faced sensing string digging method and system
JP2005063432A (en) Multimedia object retrieval apparatus and multimedia object retrieval method
Fauzi et al. Webpage segmentation for extracting images and their surrounding contextual information
CN107451120B (en) Content conflict detection method and system for open text information
CN112256861A (en) Rumor detection method based on search engine return result and electronic device
CN107239520B (en) General forum text extraction method
Fan et al. Article clipper: a system for web article extraction
CN100336061C (en) Multimedia object searching device and methoed

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant