CN107247742A - A text information extraction method based on web page features

Info

Publication number
CN107247742A
Authority
CN
China
Prior art keywords
text
line
web page
webpage
spacing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710346591.1A
Other languages
Chinese (zh)
Inventor
李晓林
刘志杰
谢婷婷
严柯
张懿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-05-17
Filing date
2017-05-17
Publication date
2017-10-13
Application filed by Wuhan Institute of Technology
Priority to CN201710346591.1A
Publication of CN107247742A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features. Using features such as page layout, the method preprocesses the page source into a set of line numbers and line texts, then extracts the body part of the page using a line text length threshold and a line gap threshold, and finally refines the extraction result according to punctuation marks. The method performs well on different types of pages and has a degree of generality.

Description

A text information extraction method based on web page features
Technical field
The present invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features.
Background art
The rapid development of Internet technology has made web pages one of the main sources from which people obtain information. As new things keep emerging, the number of web pages is growing at an astonishing rate, and countless web pages contain rich information resources. To let users quickly obtain the information they need, Jim Cowie and Yorick Wilks proposed the concept of information extraction in 1996. In the course of its development, many scholars have proposed different information extraction methods for different extraction needs, as follows:
Wrapper-based methods mainly use the modular and structured features of web pages to extract the body text. According to the layout characteristics and regularities of the pages, such a method designs a unified template and analyses the resulting template to obtain the text in the page. The method requires hand-written extraction rules. It can locate the text accurately on template pages of similar structure, but its generality is poor: it applies only to specific pages and cannot handle the wide variety of web pages. In addition, hand-written rules are error-prone and hard to maintain.
Tag-based methods rely on specific tags in HTML (for example <table></table> and <p></p>). Such methods generally apply only when the body text sits inside specific tags. They depend heavily on feature tags, place high demands on the content layout of the page, and cannot be applied to pages with other layouts.
The basic idea of document-tree-based methods is to parse the HTML page into a DOM tree, determine the body text nodes from per-node statistics such as link length, text length, and the ratio of links to text, extract the remaining text according to path similarity, and finally merge the results into the body text of the page. The preprocessing in this method is relatively complex and its efficiency is low.
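To make the node statistics mentioned for document-tree methods concrete, the following is a minimal sketch of one such statistic, the share of a fragment's text that sits inside links. It is an illustration only, not the method of any cited work; the names `LinkDensityParser` and `link_density` are hypothetical, and the sketch relies on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while the parser is inside an <a> element
        self.link_chars = 0     # characters of text found inside links
        self.total_chars = 0    # characters of all text in the fragment

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return link text length divided by total text length (0.0 if no text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A navigation bar made only of links scores close to 1, while a paragraph of running text scores close to 0, which is why a ratio of this kind can help separate body text nodes from menus.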
The vision-based page segmentation algorithm VIPS (Vision-based Page Segmentation) divides the page into semantic blocks according to visual features such as font size, background color, and the spacing between logical blocks. It assigns weights to the horizontal and vertical separators between page blocks and extracts information according to configured web page information extraction rules. VIPS mainly partitions the page into blocks; extracting the information still requires extraction rules, which limits the generality of the method and increases algorithmic complexity.
Summary of the invention
The technical problem to be solved by the invention is to provide a text information extraction method based on web page features that has good generality and high accuracy.
The technical solution adopted by the invention is a text information extraction method based on web page features, comprising the following steps:
(1) Preprocess the web page;
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text;
(3) Set a line text length threshold L;
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of the body text, and a line whose text length is 0 is taken as the end line; the lines between the start line and the end line form a text group;
(5) Continue traversing the remainder of the original text until all text groups in the original text have been obtained;
(6) Set a line gap threshold D;
(7) Examine the line gaps between all text groups. If a line gap greater than the threshold D is detected, delete all text groups below that gap and judge the remaining text groups to be the body part of the web page; if no line gap greater than D is detected, judge all text groups to be the body part of the web page.
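Steps (2) to (7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, a text group is stored as a pair of line numbers (first line, last non-empty line), and the line gap between two groups is assumed to be the start line of the later group minus the last line of the earlier group, since the text does not define it precisely.

```python
def find_text_groups(lines, min_len):
    """Steps (2)-(5): return (start, end) line-number pairs of text groups.

    A group starts at the first line whose length is >= min_len (threshold L)
    and ends just before the next line of length 0, or at the last line of
    the page if no empty line follows.
    """
    groups = []
    start = None
    for number, text in enumerate(lines):
        length = len(text.strip())
        if start is None:
            if length >= min_len:
                start = number
        elif length == 0:
            groups.append((start, number - 1))
            start = None
    if start is not None:  # the page ended before an empty line was seen
        groups.append((start, len(lines) - 1))
    return groups


def cut_at_large_gap(groups, max_gap):
    """Steps (6)-(7): drop every group below the first gap larger than max_gap."""
    kept = groups[:1]
    for prev, cur in zip(groups, groups[1:]):
        if cur[0] - prev[1] > max_gap:  # assumed definition of the line gap
            break
        kept.append(cur)
    return kept


if __name__ == "__main__":
    page = ["nav", "", "x" * 70, "short continuation", "", "y" * 65, ""]
    page += [""] * 10 + ["z" * 80, ""]
    groups = find_text_groups(page, min_len=60)
    print(cut_at_large_gap(groups, max_gap=8))
```

In the demo page, the third long line sits twelve lines below the second group, so it is cut off as non-body content while the first two groups are kept.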
Compared with the prior art, the above method has the following advantages: the start line and end line are selected by line text length, and the line gap is used to decide whether a group belongs to the body text, so the extracted body part is more accurate and the method is more general.
Preferably, the following step is further included after step (7):
(8) Scan the body part obtained in step (7) from top to bottom until a full stop is detected, and take the part before that full stop as the true body part. Detecting the full stop makes it possible to remove comments appended directly after the body text and other content that does not belong to the body text, such as quotations, so that the extracted body part is more accurate.
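Step (8) leaves open which full stop ends the body. Because its stated purpose is to remove material appended after the body text, the sketch below assumes that everything after the last full stop is discarded. That reading, the function name `trim_after_last_full_stop`, and the set of characters treated as full stops are all assumptions made for illustration.

```python
def trim_after_last_full_stop(body, stops=("。", ".")):
    """Keep the text up to and including the last full stop.

    If no full stop is present, the text is returned unchanged.
    """
    cut = max(body.rfind(stop) for stop in stops)
    if cut == -1:
        return body
    return body[:cut + 1]
```

For example, a body ending in a stray "Comments (3)" label after the final sentence would lose that label, while the sentences themselves are kept intact.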
Preferably, the preprocessing in step (1) comprises the following steps:
A. Obtain the web page title;
B. Filter out the HTML tags of the web page;
C. Delete HTML symbol entities.
This removes, before the line text length screening is carried out, many misleading factors that could affect the accuracy of the screening, so that the extracted body part is more accurate.
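Substeps B and C can be sketched with regular expressions from Python's standard `re` module. This is a simplified illustration, not the patented implementation: the function names are hypothetical, removing `<script>` and `<style>` blocks together with their content is an added assumption, and a regex-based tag filter will mishandle attribute values that contain the `>` character.

```python
import re


def strip_tags(page_source: str) -> str:
    """Substep B: remove HTML tags and keep only the text."""
    # Assumption: script/style blocks and comments carry no body text,
    # so they are dropped together with their content.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", page_source)
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Drop the remaining tags but keep the text between them.
    return re.sub(r"(?s)<[^>]+>", "", text)


def remove_entities(text: str) -> str:
    """Substep C: delete entities such as &nbsp;, &quot; or &#160;."""
    return re.sub(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);", "", text)


def preprocess(page_source: str) -> list:
    """Return the page as a list of lines with tags, entities, and
    surrounding spaces and tabs removed; empty lines are kept."""
    text = remove_entities(strip_tags(page_source))
    return [line.strip() for line in text.splitlines()]
```

Empty lines are deliberately kept in the result, because the later steps use lines of length 0 as group boundaries and count lines to measure gaps.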
Detailed description of the embodiments
The present invention is further described below through specific embodiments, but the invention is not limited to the following specific embodiments.
A text information extraction method based on web page features, comprising the following steps:
(1) Preprocess the web page;
A. Obtain the web page title. The title is generally located between the <title> and </title> tags in the <head> region. After the page source is obtained, the content between <title> and </title> is extracted as the page title and saved. If no title can be extracted, it is taken from the <h1> tag in the <body> region;
B. Filter out the HTML tags of the web page, preferably retaining only the text information;
C. Delete HTML symbol entities, including spaces, tabs, quotation marks, etc.;
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text; the length counted is mainly that of the text;
(3) Set a line text length threshold L; L generally takes a value of 60 to 90;
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of the body text, and a line whose text length is 0 is taken as the end line; the lines between the start line and the end line form a text group. Traversal here mainly means scanning the whole original text line by line from top to bottom;
(5) Continue traversing the remainder of the original text until all text groups in the original text have been obtained; multiple text groups may simply be multiple paragraphs;
(6) Set a line gap threshold D; D generally takes a value of 8 to 12;
(7) Examine the line gaps between all text groups. If a line gap greater than the threshold D is detected, delete all text groups below that gap and judge the remaining text groups to be the body part of the web page; if no line gap greater than D is detected, judge all text groups to be the body part of the web page. This mainly covers the case where the distance between two paragraphs is large, from which it can be concluded that the later paragraph does not belong to the body part;
(8) Scan the body part obtained in step (7) from top to bottom until a full stop is detected, and take the part before that full stop as the true body part.
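The title lookup described in substep A, with its fallback from <title> to <h1>, can be sketched as follows. The function name `extract_title` is hypothetical, and as a simplification the sketch searches the whole page source rather than restricting each search to the <head> or <body> region.

```python
import re


def extract_title(page_source):
    """Return the content of <title>, falling back to the first <h1>.

    Returns None when neither tag yields any text.
    """
    for tag in ("title", "h1"):
        match = re.search(rf"(?is)<{tag}\b[^>]*>(.*?)</{tag}\s*>", page_source)
        if match:
            # Remove any nested markup and surrounding whitespace.
            text = re.sub(r"(?s)<[^>]+>", "", match.group(1)).strip()
            if text:
                return text
    return None
```

An empty <title> element is treated the same as a missing one, so the heading is still used in that case.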

Claims (3)

1. A text information extraction method based on web page features, characterized in that it comprises the following steps:
(1) Preprocess the web page;
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text;
(3) Set a line text length threshold L;
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of the body text, and a line whose text length is 0 is taken as the end line; the lines between the start line and the end line form a text group;
(5) Continue traversing the remainder of the original text until all text groups in the original text have been obtained;
(6) Set a line gap threshold D;
(7) Examine the line gaps between all text groups. If a line gap greater than the threshold D is detected, delete all text groups below that gap and judge the remaining text groups to be the body part of the web page; if no line gap greater than D is detected, judge all text groups to be the body part of the web page.
2. The text information extraction method based on web page features according to claim 1, characterized in that the following step is further included after step (7):
(8) Scan the body part obtained in step (7) from top to bottom until a full stop is detected, and take the part before that full stop as the true body part.
3. The text information extraction method based on web page features according to claim 1, characterized in that the preprocessing in step (1) comprises the following steps:
A. Obtain the web page title;
B. Filter out the HTML tags of the web page;
C. Delete HTML symbol entities.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710346591.1A (CN107247742A (en)) | 2017-05-17 | 2017-05-17 | A text information extraction method based on web page features

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201710346591.1A (CN107247742A (en)) | 2017-05-17 | 2017-05-17 | A text information extraction method based on web page features

Publications (1)

Publication Number | Publication Date
CN107247742A (en) | 2017-10-13

Family

ID=60017092

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201710346591.1A (Pending, CN107247742A (en)) | A text information extraction method based on web page features | 2017-05-17 | 2017-05-17

Country Status (1)

Country | Link
CN (1) | CN107247742A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064845A (en) * 2011-10-20 2013-04-24 北京中搜网络技术股份有限公司 Website information processing device and website information processing method
CN103854063A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Internet open information-based event occurrence risk prediction and early-warning method
CN105512225A (en) * 2015-11-30 2016-04-20 北大方正集团有限公司 Method and device extracting main content from webpage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
向菁菁 等: "一种新闻网页关键信息的提取算法", 《计算机应用》 *
姬鑫 等: "基于分块的新闻网页信息抽取算法", 《计算机应用与软件》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171013