CN108509469A

CN108509469A - A block-based method for extracting text information from web pages

Info

Publication number: CN108509469A
Application number: CN201710349695.8A
Authority: CN
Inventors: 姚国平
Original assignee: Suzhou Pure Green Intelligent Technology Co Ltd
Current assignee: Suzhou Pure Green Intelligent Technology Co Ltd
Priority date: 2017-05-17
Filing date: 2017-05-17
Publication date: 2018-09-07

Abstract

The present invention proposes a block-based method for extracting text information from web pages, comprising the following steps: (1) standardizing the web page; (2) constructing a tag tree; (3) segmenting the web page into blocks; (4) extracting the blocks that contain body text. The invention extracts information by segmenting the web page into blocks and then accepting or rejecting content blocks. Segmentation uses an automatic blocking algorithm that analyzes the tag tree from the bottom up; according to the invention, this is more accurate than the prior art and segments complex structures better. Each content block is also analyzed by block importance and block features in order to extract the information the user needs, with high accuracy and good results.

Description

A block-based method for extracting text information from web pages

Technical field

The present invention relates to the technical field of data acquisition, and in particular to a block-based method for extracting text information from web pages.

Background art

As Internet resources and online information continue to grow, people depend on the network more and more, yet quickly finding the specific resources one needs in the vast ocean of Internet resources has also become inconvenient for users. Information has always been valuable, and society has by now entered the information age, in which every industry is flooded with information. The value of information lies in the circulation of data: only when data circulates and is transmitted in a timely manner can information deliver its full value. In a market economy, data collection has become an important tool and means.

With the rapid development of the Web, the information on it grows ever richer. To make better use of this information, people keep pursuing technologies and systems that can organize and use network information effectively. However, Web documents are not as neat and clean as traditional text: they contain a large amount of noise content, such as scripts added to enhance user interactivity, navigation links added to make browsing easier, and advertisement links added for commercial reasons. This noise not only reduces the efficiency of Web information retrieval but also lowers retrieval accuracy.

Therefore, in view of the above problems, the present invention proposes a new technical solution.

Summary of the invention

The object of the present invention is to provide a block-based method for extracting text information from web pages that effectively removes noise interference and quickly extracts the required content.

The present invention is achieved through the following technical solution:

A block-based method for extracting text information from web pages, comprising the following steps:

(1) Web page standardization: the HTML code is first preprocessed and standardized;

(2) Constructing the tag tree: a tag tree is constructed from the standardized web page, arranging the tags in the page into a tree structure according to their nesting relationships; the visual attributes of each node are retained during construction, and the tag tree is pruned by deleting useless nodes;

(3) Segmenting the web page into blocks: the web page is divided using the content-block tags in the page as container tags;

a. the number of each kind of container tag in the tag tree is counted to determine which kind of container tag the web page uses for layout;

b. the bottom-level container tag nodes are examined: all text nodes under a bottom-level node of the tag tree are merged, the information content of the resulting block is computed, and its visual features are examined;

c. the parent node of each bottom-level node is examined and its information content is computed to determine whether that node can become a block node;

(4) Extracting the blocks that contain body text: after segmentation, content blocks are accepted or rejected according to the differing needs of users, and the content blocks containing body text are taken out.
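Sub-steps a to c of the segmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm itself: the text does not define how the information content is measured or when a parent node qualifies as a block node. Here the information content is assumed to be the character count of the merged text, and a parent container is assumed to become a single block when every block beneath it falls below a hypothetical threshold `min_info`. The names `CONTAINER_TAGS`, `node`, `layout_tag` and `segment` are likewise illustrative.

```python
from collections import Counter

# Assumed candidate container tags; the patent does not enumerate them.
CONTAINER_TAGS = ("div", "table", "td", "ul")


def node(tag, text="", children=()):
    """Tiny stand-in for a tag-tree node."""
    return {"tag": tag, "text": text, "children": list(children)}


def walk(n):
    """Yield a node and all of its descendants."""
    yield n
    for child in n["children"]:
        yield from walk(child)


def layout_tag(root):
    """Sub-step a: take the most frequent container tag as the layout tag."""
    counts = Counter(n["tag"] for n in walk(root) if n["tag"] in CONTAINER_TAGS)
    return counts.most_common(1)[0][0] if counts else None


def merged_text(n):
    """Sub-step b: merge all text found under a node."""
    parts = [n["text"].strip()] + [merged_text(c) for c in n["children"]]
    return " ".join(p for p in parts if p)


def has_container(n, tag):
    """True if any descendant (not n itself) uses the layout tag."""
    return any(d["tag"] == tag for c in n["children"] for d in walk(c))


def segment(n, tag, min_info=20):
    """Sub-steps b and c: build the list of block texts from the bottom up."""
    if not has_container(n, tag):
        # Bottom-level node: its whole subtree is one candidate block.
        text = merged_text(n)
        return [text] if text else []
    blocks = [n["text"].strip()] if n["text"].strip() else []
    for child in n["children"]:
        blocks.extend(segment(child, tag, min_info))
    # Assumed rule for sub-step c: a parent container becomes a block node
    # when none of the blocks below it carries enough information alone.
    if n["tag"] == tag and blocks and all(len(b) < min_info for b in blocks):
        return [" ".join(blocks)]
    return blocks


page = node("body", children=[
    node("div", children=[node("div", "Home"), node("div", "News"),
                          node("div", "About")]),
    node("div", children=[
        node("p", "This paragraph is long enough to stand alone as a block.")]),
    node("table", children=[node("td", "x")]),
])
blocks = segment(page, layout_tag(page))
```

In this toy page the three short navigation entries are merged into their parent container, while the long paragraph stays a block of its own.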

Further, in step (2), the tag tree is constructed using a DOM tag tree construction tool.
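As an illustration of tag tree construction, the sketch below builds a simple tree with Python's standard `html.parser` module and then prunes it. The patent does not name a specific DOM tool or say which nodes count as useless, so `NOISE_TAGS`, the `Node` class and the pruning rule (drop script-like tags and subtrees with no text and no image) are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumed set of tags that carry no body text; not listed in the patent.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])  # keeps style, width, height, ...
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Arrange tags into a tree according to their nesting."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def prune(node):
    """Delete noise tags and subtrees that hold neither text nor images."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.children or child.text.strip() or child.tag == "img":
            kept.append(child)
    node.children = kept


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root


tree = build_tag_tree(
    '<div style="color:red"><p>Hello</p><script>var x=1;</script>'
    '<div></div><img src="a.png"></div>'
)
```

Keeping the raw attributes on each node is one simple way to retain the markup-level visual attributes that later steps examine.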

Further, in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information are treated as attributes of the content block in which they appear.

Further, the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
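A rough sketch of collecting such visual features from markup alone is given below. It assumes the features are read from a node's HTML attributes and inline `style` declarations; the patent does not say how they are obtained, and position in particular generally requires a rendering engine, so it is left out. The function names and the returned keys are hypothetical.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip()
    return declarations


def visual_features(attrs, text):
    """Collect markup-level visual features for one node.

    attrs is the node's attribute dict, text its merged text.
    Values are kept as raw strings; unit handling is out of scope here.
    """
    style = parse_style(attrs.get("style", ""))
    return {
        "width": attrs.get("width") or style.get("width"),
        "height": attrs.get("height") or style.get("height"),
        "font_size": style.get("font-size"),
        "color": style.get("color") or attrs.get("color"),
        "paragraph_length": len(text.strip()),
    }


features = visual_features(
    {"width": "600", "style": "font-size: 14px; color:#333"},
    " Hello world ",
)
```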

Further, in step (4), content blocks are accepted or rejected according to their importance and block features.

Further, the block features include spatial features and content features; the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of pictures.
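One way to combine these features when accepting or rejecting blocks is sketched below. The patent lists the features but gives no formula for importance, so the `importance` score here is purely an assumed heuristic: blocks with much text per link or picture, a large share of the page area, and a horizontally central position score higher. The `Block` fields mirror the listed features; all names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    # Spatial features: position and size, in pixels.
    x: int
    y: int
    width: int
    height: int
    # Content features.
    text_length: int
    link_count: int
    image_count: int


def importance(block, page_width, page_height):
    """Assumed heuristic score; not a formula from the patent."""
    area_share = (block.width * block.height) / (page_width * page_height)
    centre = block.x + block.width / 2
    # 1.0 when horizontally centred, falling towards 0 at the page edges.
    centred = 1 - abs(centre - page_width / 2) / (page_width / 2)
    # Text per link or picture: navigation and ad blocks score low.
    text_density = block.text_length / (1 + block.link_count + block.image_count)
    return text_density * (area_share + centred)


def select_text_blocks(blocks, page_width, page_height, keep=1):
    """Keep the `keep` highest-scoring blocks as body-text blocks."""
    ranked = sorted(
        blocks,
        key=lambda b: importance(b, page_width, page_height),
        reverse=True,
    )
    return ranked[:keep]


candidates = [
    Block("nav", 0, 0, 1000, 80, 60, 12, 0),
    Block("article", 200, 100, 600, 1500, 4000, 5, 2),
    Block("sidebar", 820, 100, 180, 1500, 300, 20, 6),
]
chosen = select_text_blocks(candidates, page_width=1000, page_height=2000)
```

With these example numbers the long, central, link-poor block is ranked first, ahead of the navigation bar and the sidebar.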

The beneficial effects of the invention are as follows: the invention extracts information by segmenting the web page into blocks and then accepting or rejecting content blocks. Segmentation uses an automatic blocking algorithm that analyzes the tag tree from the bottom up; according to the invention, this is more accurate than the prior art and segments complex structures better. Each content block is also analyzed by block importance and block features in order to extract the information the user needs, with high accuracy and good results.

Detailed description of the embodiments

The present invention is further described below with reference to an embodiment.

Embodiment 1

(1) Web page standardization: the HTML code is first preprocessed and standardized;

(4) Extracting the blocks that contain body text

In the present embodiment, the tag tree in step (2) is constructed using a DOM tag tree construction tool.

In the present embodiment, in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information are treated as attributes of the content block in which they appear.

In the present embodiment, the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.

In the present embodiment, in step (4), content blocks are accepted or rejected according to their importance and block features.

In the present embodiment, the block features include spatial features and content features; the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of pictures.

The invention extracts information by segmenting the web page into blocks and then accepting or rejecting content blocks. Segmentation uses an automatic blocking algorithm that analyzes the tag tree from the bottom up; according to the invention, this is more accurate than the prior art and segments complex structures better. Each content block is also analyzed by block importance and block features in order to extract the information the user needs, with high accuracy and good results.

Claims

1. A block-based method for extracting text information from web pages, characterized in that it comprises the following steps:

(1) Web page standardization: the HTML code is first preprocessed and standardized;

(4) Extracting the blocks that contain body text

2. The block-based method for extracting text information from web pages according to claim 1, characterized in that: in step (2), the tag tree is constructed using a DOM tag tree construction tool.

3. The block-based method for extracting text information from web pages according to claim 1, characterized in that: in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information are treated as attributes of the content block in which they appear.

4. The block-based method for extracting text information from web pages according to claim 1, characterized in that: the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.

5. The block-based method for extracting text information from web pages according to claim 1, characterized in that: in step (4), content blocks are accepted or rejected according to their importance and block features.

6. The block-based method for extracting text information from web pages according to claim 5, characterized in that: the block features include spatial features and content features; the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of pictures.