CN107247742A

CN107247742A - A text information extraction method based on web page features

Info

Publication number: CN107247742A
Application number: CN201710346591.1A
Authority: CN
Inventors: 李晓林; 刘志杰; 谢婷婷; 严柯; 张懿
Original assignee: Wuhan Institute of Technology
Current assignee: Wuhan Institute of Technology
Priority date: 2017-05-17
Filing date: 2017-05-17
Publication date: 2017-10-13

Abstract

The invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features. The method first preprocesses the page source code, according to features such as page layout, into a set of line numbers and text. It then extracts the body of the page using a line-length threshold and a line-distance threshold, and finally refines the extraction result according to punctuation marks. The method works well on different types of pages and has a degree of generality.

Description

A text information extraction method based on web page features

Technical field

The invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features.

Background art

The rapid development of Internet technology has made web pages one of the main sources from which people obtain information. As new things keep emerging, the number of web pages is growing at an astonishing rate, and countless web pages contain rich information resources. To let users quickly obtain the information they need, Jim Cowie and Yorick Wilks proposed the concept of information extraction in 1996. Since then, many researchers have proposed different information extraction methods for different extraction needs, as follows:

Wrapper-based methods extract the body text of a web page mainly by exploiting the page's template and structural features. A unified template is designed according to the layout characteristics, rules, and so on of the page, and the resulting template is analyzed to obtain the body text. These methods require extraction rules to be written by hand. They can locate body text accurately on pages that share a similar template structure, but their generality is poor: they apply only to specific pages and cannot handle web pages of various types. In addition, hand-written rules are error-prone and hard to maintain.

Tag-based methods rely on specific tags in the HTML language (for example <table></table>, <p></p>, and so on). They are generally suitable when the body text sits inside specific tags. They depend heavily on these characteristic tags and place high demands on how page content is laid out, so they cannot be applied to pages with other layout types.

The basic idea of document-tree-based methods is to parse the HTML page into a DOM tree structure, determine the body text nodes by computing statistics for each node such as link length, text length, and the ratio of links to text, extract the remaining body text according to path similarity, and finally merge the results into the web page body text. The preprocessing required by these methods is fairly complex, and their efficiency is low.

The vision-based page segmentation algorithm VIPS (Vision-based Page Segmentation) splits a page into semantic blocks according to visual features such as font size, background color, and the spacing between logical blocks, thereby segmenting the page. It assigns weights to the horizontal and vertical separators between page blocks and extracts information from the blocks through configured web page information extraction rules. VIPS mainly partitions the page into blocks; extracting the web page information still requires information extraction rules, which limits the generality of the method and increases algorithmic complexity.

Summary of the invention

The technical problem to be solved by the invention is to provide a text information extraction method based on web page features that has good generality and high accuracy.

The technical solution adopted by the invention is a text information extraction method based on web page features, comprising the following steps:

(1) Preprocess the web page;

(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text;

(3) Set a line-length threshold L;

(4) Traverse the original text obtained in step (2). Take a line whose line length is greater than or equal to the threshold L as the starting line of body text, take the next line whose line length is 0 as the ending line, and treat the portion between the starting line and the ending line as a text group;

(5) Continue traversing the remainder of the original text to obtain all text groups in the original text;

(6) Set a line-distance threshold D;

(7) Check the line distance between all text groups. If a line distance greater than the threshold D is detected, delete all text groups below that gap and determine the remaining text groups to be the body of the web page. If no line distance greater than the threshold D is detected, determine all text groups to be the body of the web page.
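Steps (2), (4) and (5) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names number_lines and find_text_groups are hypothetical, line length is assumed to be the character count after stripping surrounding whitespace, the empty ending line is assumed not to belong to the group, and a group still open at the end of the text is assumed to close on the last line.

```python
from typing import List, Tuple

Row = Tuple[int, int, str]   # (line number, character length, text)
Group = Tuple[int, int]      # (first line number, last line number)


def number_lines(text: str) -> List[Row]:
    """Step (2): label every line with its number and character length.

    Assumption: length is measured after stripping surrounding whitespace.
    """
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        rows.append((number, len(stripped), stripped))
    return rows


def find_text_groups(rows: List[Row], length_threshold: int) -> List[Group]:
    """Steps (4)-(5): scan top to bottom and collect text groups.

    A group opens at a line whose length is >= length_threshold (L) and
    closes at the next line whose length is 0. The empty line itself is
    not included in the group (an assumption of this sketch).
    """
    groups: List[Group] = []
    start = None
    for number, length, _ in rows:
        if start is None:
            if length >= length_threshold:
                start = number          # starting line of a text group
        elif length == 0:
            groups.append((start, number - 1))
            start = None                # keep scanning for the next group
    if start is not None:               # text ended without an empty line
        groups.append((start, rows[-1][0]))
    return groups
```

Note that once a group has opened, shorter lines that follow it stay in the group until an empty line is reached, which matches the wording of step (4).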

Compared with the prior art, the above method has the following advantages: the starting line and ending line are selected by line length, and the line distance is used to judge whether a group belongs to the body, so the extracted body is more accurate and the method is also more general.

Preferably, the following step is further included after step (7):

(8) Scan the body obtained in step (7) from top to bottom until a full stop is detected, then take the part before the full stop as the actual body. Detecting the full stop makes it possible to delete comments attached directly after the body, as well as other content that does not belong to the body, such as quotations, so that the extracted body is more accurate.
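A minimal sketch of the punctuation check in step (8) is given below. The text does not pin down exactly which full stop marks the cut, so this sketch assumes the intent is to keep everything up to and including the last full stop and to drop what follows it. The function name trim_after_last_full_stop and the set of full-stop characters are assumptions made for illustration.

```python
def trim_after_last_full_stop(body: str, stops=("。", ".")) -> str:
    """Step (8), one possible reading: cut trailing non-body content.

    Assumption: anything after the last full stop (Chinese or ASCII)
    is treated as trailing material such as comments or share links.
    If no full stop is found, the body is returned unchanged.
    """
    cut = max(body.rfind(stop) for stop in stops)
    if cut == -1:
        return body
    return body[:cut + 1]
```

For example, a body that ends with a line of sharing links without any full stop would lose that line, while the sentences before it are kept.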

Preferably, the preprocessing in step (1) comprises the following steps:

a. Obtain the web page title;

b. Filter out the HTML tags of the web page;

c. Delete HTML character entities.

In this way, many misleading factors that would affect screening accuracy are removed before screening by line length, so that the extracted body is more accurate.
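The preprocessing steps a to c might look like the following Python sketch, which uses only regular expressions from the standard library. It is an illustration under assumptions rather than the method's actual implementation: the names extract_title and preprocess are hypothetical, the title lookup follows the embodiment (the <title> element first, then <h1>), removing the contents of script and style blocks is an added assumption in the spirit of keeping only text, and line breaks are preserved so that later line numbering still reflects the page layout.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S)
# Assumption: script/style contents are not text and are dropped as well.
BLOCK_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")
ENTITY_RE = re.compile(r"&#?\w+;")   # e.g. &nbsp; &quot; &#160;


def extract_title(source: str) -> str:
    """Step a: take the title from <title>, falling back to <h1>."""
    for pattern in (TITLE_RE, H1_RE):
        match = pattern.search(source)
        if match:
            title = TAG_RE.sub("", match.group(1)).strip()
            if title:
                return title
    return ""


def preprocess(source: str):
    """Steps a-c: return (title, text) with tags and entities removed."""
    title = extract_title(source)
    # Replace each script/style block by the newlines it contained,
    # so the line count of the page is not changed by the removal.
    text = BLOCK_RE.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    text = TAG_RE.sub("", text)        # step b: filter HTML tags
    text = ENTITY_RE.sub("", text)     # step c: delete character entities
    lines = [line.strip() for line in text.splitlines()]
    return title, "\n".join(lines)
```

Deleting entities outright, instead of decoding them, follows the wording of step c; decoding them with html.unescape would be an alternative design if the decoded characters should count toward line length.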

Detailed description

The invention is further described below through a specific embodiment, but the invention is not limited to the following embodiment.

A text information extraction method based on web page features comprises the following steps:

(1) Preprocess the web page;

a. Obtain the web page title. The title is generally located between <title> and </title> within the <head> region. After the page source code is obtained, the content between the <title> and </title> tags is extracted as the page title and saved. If no title can be extracted, it is taken from the <h1> tag in the <body> region;

b. Filter out the HTML tags of the web page, preferably retaining only the text information;

c. Delete HTML character entities, including spaces, tabs, quotation marks, and the like;

(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text; it is mainly the length of the text that is counted;

(3) Set a line-length threshold L; L generally takes a value of 60 to 90;

(4) Traverse the original text obtained in step (2). Take a line whose line length is greater than or equal to the threshold L as the starting line of body text, take the next line whose line length is 0 as the ending line, and treat the portion between the starting line and the ending line as a text group; traversing here mainly means scanning the whole original text line by line from top to bottom;

(5) Continue traversing the remainder of the original text to obtain all text groups in the original text; multiple text groups may simply be multiple paragraphs;

(6) Set a line-distance threshold D; D generally takes a value of 8 to 12;

(7) Check the line distance between all text groups. If a line distance greater than the threshold D is detected, delete all text groups below that gap and determine the remaining text groups to be the body of the web page. If no line distance greater than the threshold D is detected, determine all text groups to be the body of the web page; this mainly covers the case where the distance between two paragraphs is large, from which it can be judged that the later paragraph does not belong to the body;

(8) Scan the body obtained in step (7) from top to bottom until a full stop is detected, then take the part before the full stop as the actual body.
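The line-distance check of steps (6) and (7) can be sketched as below. This is a hedged illustration, not the definitive procedure: the function name filter_by_line_gap is hypothetical, text groups are assumed to be (first line, last line) pairs sorted from top to bottom, and the line distance between two groups is assumed to be the first line number of the later group minus the last line number of the earlier group. The default D = 10 is simply a value picked from the 8 to 12 range given above.

```python
from typing import List, Tuple

Group = Tuple[int, int]   # (first line number, last line number)


def filter_by_line_gap(groups: List[Group], gap_threshold: int = 10) -> List[Group]:
    """Steps (6)-(7): keep text groups up to the first gap larger than D.

    Assumption: the gap between two consecutive groups is the start line
    of the later group minus the end line of the earlier group.
    """
    kept: List[Group] = []
    for group in groups:
        if kept and group[0] - kept[-1][1] > gap_threshold:
            break                 # this group and all groups below are dropped
        kept.append(group)
    return kept
```

With groups at lines 10-14, 16-20, 45-47 and 49-50 and D = 10, the gap between line 20 and line 45 exceeds the threshold, so only the first two groups are kept as the body.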

Claims

1. A text information extraction method based on web page features, characterized in that it comprises the following steps:

(1) preprocessing the web page;

(2) numbering every line of the preprocessed web page and counting the character length of each line, forming an original text;

(3) setting a line-length threshold L;

(4) traversing the original text obtained in step (2), taking a line whose line length is greater than or equal to the threshold L as the starting line of body text, taking the next line whose line length is 0 as the ending line, and treating the portion between the starting line and the ending line as a text group;

(5) continuing to traverse the remainder of the original text to obtain all text groups in the original text;

(6) setting a line-distance threshold D;

(7) checking the line distance between all text groups; if a line distance greater than the threshold D is detected, deleting all text groups below that gap and determining the remaining text groups to be the body of the web page; if no line distance greater than the threshold D is detected, determining all text groups to be the body of the web page.

2. The text information extraction method based on web page features according to claim 1, characterized in that the following step is further included after step (7):

(8) scanning the body obtained in step (7) from top to bottom until a full stop is detected, then taking the part before the full stop as the actual body.

3. The text information extraction method based on web page features according to claim 1, characterized in that the preprocessing in step (1) comprises the following steps:

a. obtaining the web page title;

b. filtering out the HTML tags of the web page;

c. deleting HTML character entities.