CN107247742A - Text information extraction method based on web page features
- Publication number
- CN107247742A
- Authority
- CN
- China
- Prior art keywords
- text
- line
- web page
- webpage
- spacing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features. The method first preprocesses the page source code, according to features such as page layout, into a set of line numbers and line texts; it then extracts the body of the page using a line-text-length threshold and a line-spacing threshold; finally, it refines the extraction result according to punctuation marks. The method performs well on different types of pages and has a degree of generality.
Description
Technical field
The present invention relates to the technical field of information extraction, and in particular to a text information extraction method based on web page features.
Background art
The rapid development of Internet technology has made web pages one of the main sources from which people obtain information. As new things keep emerging, the number of web pages is growing at an astonishing rate, and these countless pages contain rich information resources. To let users quickly obtain the information they need, Jim Cowie and Yorick Wilks proposed the concept of information extraction in 1996. Since then, many researchers have proposed different information extraction methods for different extraction needs, as follows:
Wrapper-based methods extract the body text of a web page mainly by exploiting page templates and structural features. A unified template is designed according to the layout characteristics and rules of the page, and the resulting template is analyzed to obtain the body text of the page. This approach requires extraction rules to be written by hand. It can locate the body text accurately for pages built from structurally similar templates, but it generalizes poorly: it applies only to specific pages and cannot handle web pages of many different types. In addition, hand-written rules are error-prone and hard to maintain.
Tag-based methods rely on specific tags in the HTML language (for example <table></table> and <p></p>). They are generally suitable only when the body text sits inside specific tags. They depend heavily on those feature tags and place high demands on how the page content is laid out, so they cannot be applied to pages with other layout types.
Document-tree-based methods parse the HTML page into a DOM tree, determine the body-text node from statistics on each node such as link length, text length, and the ratio of link text to total text, extract the remaining body text according to path similarity, and finally merge the results into the page body. The preprocessing in this approach is relatively complex and its efficiency is low.
The vision-based page segmentation algorithm VIPS (Vision-based Page Segmentation) divides a page into semantic blocks using visual cues such as font size, background color, and the spacing between logical blocks. It assigns weights to the horizontal and vertical separators between page blocks and extracts information from the blocks through configured web page information extraction rules. VIPS mainly partitions the page into blocks; extracting the information still requires extraction rules, which limits the generality of the method and increases the algorithmic complexity.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text information extraction method based on web page features that has good generality and high accuracy.
The technical solution adopted by the present invention is a text information extraction method based on web page features, comprising the following steps:
(1) Preprocess the web page.
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text.
(3) Set a line-text-length threshold L.
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of body text, a line whose text length is 0 is taken as the end line, and the part between the start line and the end line forms a text group.
(5) Continue traversing the remainder of the original text to obtain all text groups in it.
(6) Set a line-spacing threshold D.
(7) Check the line spacing between the text groups. If a line spacing greater than the threshold D is found, delete all text groups below that gap and take the remaining text groups as the body of the web page; if no line spacing greater than D is found, take all text groups as the body of the web page.
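The grouping and line-spacing steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation. The names `TextGroup`, `find_text_groups`, and `keep_body_groups` are hypothetical. Line length is assumed to be the character count after stripping whitespace, a group that reaches the end of the text without meeting a zero-length line is assumed to close at the last line, and the line spacing between two groups is assumed to be the start line of the later group minus the last line of the earlier one, since the text does not define it precisely.

```python
from dataclasses import dataclass


@dataclass
class TextGroup:
    start: int        # 1-based number of the first line in the group
    end: int          # 1-based number of the last non-empty line in the group
    lines: list[str]  # stripped text of the lines in the group


def find_text_groups(lines: list[str], length_threshold: int) -> list[TextGroup]:
    """Steps (2)-(5): scan top to bottom and collect text groups."""
    if length_threshold < 1:
        raise ValueError("length_threshold must be at least 1")
    groups = []
    i, n = 0, len(lines)
    while i < n:
        if len(lines[i].strip()) >= length_threshold:
            start = i
            # Extend the group until a line of length 0 (the end line) or EOF.
            while i < n and len(lines[i].strip()) > 0:
                i += 1
            group_lines = [line.strip() for line in lines[start:i]]
            groups.append(TextGroup(start + 1, i, group_lines))
        else:
            i += 1
    return groups


def keep_body_groups(groups: list[TextGroup], gap_threshold: int) -> list[TextGroup]:
    """Steps (6)-(7): drop every group below the first gap larger than D."""
    body = []
    for group in groups:
        # Assumed definition of line spacing: start of this group minus
        # the last line of the previous kept group.
        if body and group.start - body[-1].end > gap_threshold:
            break
        body.append(group)
    return body
```

Calling `keep_body_groups(find_text_groups(lines, L), D)` on the numbered lines of a preprocessed page would return the groups treated as the body under these assumptions.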
Compared with the prior art, the above method has the following advantage: the start line and end line are selected by line-text length, and line spacing is used to decide whether a text group belongs to the body, so the extracted body is more accurate and the method is more general.
Preferably, the following step is performed after step (7):
(8) Scan the body obtained in step (7) from top to bottom until a period is detected, then take the part before the period as the true body. Detecting the period makes it possible to remove content that is attached directly after the body but does not belong to it, such as comments and quotations, which further improves the accuracy of the extracted body.
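The punctuation step can be illustrated with a short Python sketch. The text does not say which period marks the end of the body, so this sketch assumes the last period in the extracted text, which matches the stated goal of dropping trailing comments and quotations; it also assumes the period itself is kept. The function name `trim_after_last_period` and the default set of period characters are hypothetical.

```python
def trim_after_last_period(body: str, periods: str = "。.") -> str:
    """Keep the text up to and including the last period (assumed reading of step 8)."""
    # Position of the last occurrence of any period character, or -1 if none.
    cut = max((body.rfind(p) for p in periods), default=-1)
    if cut == -1:
        return body  # no period found: leave the body unchanged
    return body[: cut + 1]
```

For example, a body that ends with a stray "related comments" line without a period would lose that line, while a body with no period at all is returned unchanged.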
Preferably, the preprocessing in step (1) comprises the following steps:
A. Obtain the web page title.
B. Filter out the HTML tags of the web page.
C. Delete HTML symbol entities.
This removes, before the line-text-length screening is performed, many misleading factors that would otherwise reduce its accuracy, so the extracted body is more accurate.
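Steps B and C of the preprocessing can be sketched with Python's standard `re` module. This is a rough sketch under assumptions rather than the patent's implementation: the function name `preprocess_lines` is hypothetical, dropping the contents of `<script>` and `<style>` elements and of comments is an assumption the text does not state, and removed spans are replaced by the newlines they contained so that the line numbering of the source is preserved for the later line-spacing check. A regex-based filter like this is not a full HTML parser.

```python
import re

# Assumed extras: script/style blocks and comments are dropped with their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
# Named, decimal, and hexadecimal character references such as &nbsp; or &#160;.
_ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def _keep_newlines(match: re.Match) -> str:
    # Preserve the line count of whatever is removed.
    return "\n" * match.group(0).count("\n")


def preprocess_lines(html: str) -> list[str]:
    """Strip tags (step B) and delete entities (step C), one output item per source line."""
    text = _SCRIPT_STYLE.sub(_keep_newlines, html)
    text = _COMMENT.sub(_keep_newlines, text)
    text = _TAG.sub(_keep_newlines, text)
    text = _ENTITY.sub("", text)  # entities are deleted, not decoded
    return [line.strip() for line in text.split("\n")]
```

Because entities are deleted rather than decoded, `Hello&nbsp;world` becomes `Helloworld`; decoding them instead would be a different design choice from the one described in step C.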
Detailed description of the embodiments
The present invention is further described below by way of a specific embodiment, but the invention is not limited to the following embodiment.
A text information extraction method based on web page features comprises the following steps:
(1) Preprocess the web page.
A. Obtain the web page title. The title normally sits between the <title> and </title> tags in the <head> region. After the page source is obtained, the content between <title> and </title> is extracted and saved as the page title. If no title can be extracted, it is taken from the <h1> tag in the <body> region.
B. Filter out the HTML tags of the web page, preferably keeping only the text.
C. Delete HTML symbol entities, including spaces, tabs, quotation marks, and so on.
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text. What is counted is mainly the length of the text on each line.
(3) Set a line-text-length threshold L. L is typically 60 to 90.
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of body text, a line whose text length is 0 is taken as the end line, and the part between the start line and the end line forms a text group. Traversal here means scanning the whole original text line by line from top to bottom.
(5) Continue traversing the remainder of the original text to obtain all text groups in it. Multiple text groups may simply be multiple paragraphs.
(6) Set a line-spacing threshold D. D is typically 8 to 12.
(7) Check the line spacing between the text groups. If a line spacing greater than the threshold D is found, delete all text groups below that gap and take the remaining text groups as the body of the web page; if no line spacing greater than D is found, take all text groups as the body of the web page. This mainly covers the case where the distance between two paragraphs is large, from which it can be concluded that the later paragraph does not belong to the body.
(8) Scan the body obtained in step (7) from top to bottom until a period is detected, then take the part before the period as the true body.
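Step A of the preprocessing in this embodiment, taking the title from `<title>` and falling back to `<h1>`, can be sketched as below. The function name `extract_title` is hypothetical, and the sketch makes simplifying assumptions: it searches the whole page source with regular expressions instead of restricting the search to the `<head>` and `<body>` regions, uses the first `<h1>` it finds, and treats an empty `<title>` as a failed extraction.

```python
import re
from typing import Optional

_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
_H1 = re.compile(r"<h1\b[^>]*>(.*?)</h1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def extract_title(html: str) -> Optional[str]:
    """Return the <title> text, or the first <h1> text if no title is found."""
    for pattern in (_TITLE, _H1):
        match = pattern.search(html)
        if match:
            # Remove any nested markup and surrounding whitespace.
            title = _TAG.sub("", match.group(1)).strip()
            if title:
                return title
    return None  # neither a title nor an h1 heading was found
```

Returning `None` when both lookups fail leaves it to the caller to decide how to handle pages without a usable title, which the text does not specify.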
Claims (3)
1. A text information extraction method based on web page features, characterized in that it comprises the following steps:
(1) Preprocess the web page.
(2) Number every line of the preprocessed web page and count the character length of each line, forming an original text.
(3) Set a line-text-length threshold L.
(4) Traverse the original text obtained in step (2). A line whose text length is greater than or equal to the threshold L is taken as the start line of body text, a line whose text length is 0 is taken as the end line, and the part between the start line and the end line forms a text group.
(5) Continue traversing the remainder of the original text to obtain all text groups in it.
(6) Set a line-spacing threshold D.
(7) Check the line spacing between the text groups. If a line spacing greater than the threshold D is found, delete all text groups below that gap and take the remaining text groups as the body of the web page; if no line spacing greater than D is found, take all text groups as the body of the web page.
2. The text information extraction method based on web page features according to claim 1, characterized in that the following step is performed after step (7):
(8) Scan the body obtained in step (7) from top to bottom until a period is detected, then take the part before the period as the true body.
3. The text information extraction method based on web page features according to claim 1, characterized in that the preprocessing in step (1) comprises the following steps:
A. Obtain the web page title.
B. Filter out the HTML tags of the web page.
C. Delete HTML symbol entities.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710346591.1A CN107247742A (en) | 2017-05-17 | 2017-05-17 | Text information extraction method based on web page features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710346591.1A CN107247742A (en) | 2017-05-17 | 2017-05-17 | Text information extraction method based on web page features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107247742A (en) | 2017-10-13 |
Family
ID=60017092
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710346591.1A Pending CN107247742A (en) | 2017-05-17 | 2017-05-17 | Text information extraction method based on web page features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107247742A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103064845A (en) * | 2011-10-20 | 2013-04-24 | 北京中搜网络技术股份有限公司 | Website information processing device and website information processing method |
CN103854063A (en) * | 2012-11-29 | 2014-06-11 | 中国科学院计算机网络信息中心 | Internet open information-based event occurrence risk prediction and early-warning method |
CN105512225A (en) * | 2015-11-30 | 2016-04-20 | 北大方正集团有限公司 | Method and device extracting main content from webpage |
Non-Patent Citations (2)
Title |
---|
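向菁菁 等 (Xiang Jingjing et al.): "一种新闻网页关键信息的提取算法" (An algorithm for extracting key information from news web pages), 《计算机应用》 * |
姬鑫 等 (Ji Xin et al.): "基于分块的新闻网页信息抽取算法" (A block-based information extraction algorithm for news web pages), 《计算机应用与软件》 * |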
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102831121B (en) | Method and system for extracting webpage information | |
CN102253979B (en) | Vision-based web page extracting method | |
CN105630941B (en) | Web body matter abstracting methods based on statistics and structure of web page | |
Gu et al. | Visual based content understanding towards web adaptation | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN103123618B (en) | Text similarity acquisition methods and device | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN106055667B (en) | It is a kind of based on text-label densities web page core content extracting method | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN106557565A (en) | A kind of text message extracting method based on website construction | |
CN102156737A (en) | Method for extracting subject content of Chinese webpage | |
CN109492177B (en) | web page blocking method based on web page semantic structure | |
CN104598577A (en) | Extraction method for webpage text | |
CN109857912A (en) | A kind of font recognition methods, electronic equipment and storage medium | |
CN106777259A (en) | The method and device of structured message in adaptive decimation HTML Table labels | |
CN110399496A (en) | A kind of knowledge mapping construction method based on CR decision tree | |
CN107463571A (en) | Web color method | |
CN105740355A (en) | Aggregated text density based webpage body text extraction method and apparatus | |
CN106528509A (en) | Webpage information extracting method and apparatus | |
CN101996190A (en) | Method and device for extracting information from webpage | |
CN106897287B (en) | Webpage release time extraction method and device for webpage release time extraction | |
CN108255895A (en) | A kind of web data acquisition methods using context environmental rule | |
CN107247742A (en) | A kind of text message abstracting method based on web page characteristics | |
CN106649767A (en) | Web page information extraction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171013 |