CN108509469A - A block-based method for extracting body text information from web pages - Google Patents

Info

Publication number
CN108509469A
Authority
CN
China
Prior art keywords
block segmentation
webpage
web page
content
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710349695.8A
Other languages
Chinese (zh)
Inventor
姚国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Pure Green Intelligent Technology Co Ltd
Original Assignee
Suzhou Pure Green Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Pure Green Intelligent Technology Co Ltd
Priority to CN201710349695.8A
Publication of CN108509469A

Landscapes

  • Medicines Containing Plant Substances (AREA)

Abstract

The present invention proposes a block-based method for extracting body text information from web pages, comprising the following steps: (1) normalizing the web page; (2) constructing a tag tree; (3) segmenting the web page into blocks; (4) extracting the blocks that contain body text. The invention segments the web page into blocks and extracts information by keeping or discarding content blocks. Segmentation uses an automatic algorithm that analyzes the tag tree from the bottom up; it is more accurate than the prior art and segments complex structures better. Each content block is analyzed by its importance and block features to extract the information the user needs, with high accuracy and good results.

Description

A block-based method for extracting body text information from web pages
Technical Field
The present invention relates to the field of data acquisition technology, and in particular to a block-based method for extracting body text information from web pages.
Background Art
As Internet resources grow richer and network information keeps expanding, people depend on the network more and more, yet it has also become inconvenient for users to quickly find the specific resources they need in the vast sea of Internet resources. Information has always had great value. As times have changed, humanity has entered the information age almost without noticing, and every industry is awash with information. The value of information lies in the circulation of data: only when data can circulate and be transmitted in a timely manner can information realize its value. In a market economy, data collection has become an important tool and means.
With the rapid development of the Web, the information on it grows ever richer. To make better use of that information, people keep pursuing technologies and systems that can organize and use network information effectively. However, Web documents are not as neat and clean as traditional text. They contain a large amount of noise, such as scripts added to enhance user interaction, navigation links added to ease browsing, and advertising links added for commercial reasons. This noise not only reduces the efficiency of Web information retrieval but also lowers retrieval accuracy.
Therefore, in view of the above-mentioned problems, the present invention proposes a kind of new technical solution.
Summary of the Invention
The object of the present invention is to provide a block-based method for extracting body text information from web pages that effectively removes noise and quickly extracts the required content.
The present invention is achieved through the following technical solution:
A block-based method for extracting body text information from web pages, comprising the following steps:
(1) Normalize the web page: the HTML code is first preprocessed and normalized;
(2) Construct the tag tree: a tag tree is built from the normalized web page by arranging the tags in the page into a tree structure according to their nesting relationships; the visual attributes of each node are retained during construction, and the tag tree is pruned by deleting useless nodes;
(3) Segment the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. count the number of each kind of container tag in the tag tree to determine which kind of container tag the page uses for layout;
b. examine the bottom-level container tag nodes: merge all text nodes under each bottom-level node of the tag tree, measure the amount of information in the resulting block, and examine its visual features;
c. examine the parent node of each bottom-level node, calculate the amount of information in that node, and judge whether the node can become a block node;
(4) Extract the blocks containing body text: after segmentation, content blocks are kept or discarded according to the needs of the user, and the content blocks containing body text are extracted.
Further, in step (2), the tag tree is constructed using a DOM tag tree construction tool.
Further, in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information serve as attributes of the content block in which they are located.
Further, the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
Further, in step (4), content blocks are kept or discarded according to their importance and block features.
Further, the block features include spatial features and content features; the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of images.
The beneficial effects of the invention are as follows: the invention segments the web page into blocks and extracts information by keeping or discarding content blocks. Segmentation uses an automatic algorithm that analyzes the tag tree from the bottom up; it is more accurate than the prior art and segments complex structures better. Each content block is analyzed by its importance and block features to extract the information the user needs, with high accuracy and good results.
Detailed Description
The present invention is further described below with reference to an embodiment.
Embodiment 1
A block-based method for extracting body text information from web pages, comprising the following steps:
(1) Normalize the web page: the HTML code is first preprocessed and normalized;
(2) Construct the tag tree: a tag tree is built from the normalized web page by arranging the tags in the page into a tree structure according to their nesting relationships; the visual attributes of each node are retained during construction, and the tag tree is pruned by deleting useless nodes;
(3) Segment the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. count the number of each kind of container tag in the tag tree to determine which kind of container tag the page uses for layout;
b. examine the bottom-level container tag nodes: merge all text nodes under each bottom-level node of the tag tree, measure the amount of information in the resulting block, and examine its visual features;
c. examine the parent node of each bottom-level node, calculate the amount of information in that node, and judge whether the node can become a block node;
(4) Extract the blocks containing body text: after segmentation, content blocks are kept or discarded according to the needs of the user, and the content blocks containing body text are extracted.
In the present embodiment, the tag tree in step (2) is constructed using a DOM tag tree construction tool.
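Steps (1) and (2) can be illustrated with a minimal sketch. The patent does not name a specific DOM construction tool, so this example assumes Python's standard `html.parser` as a stand-in. The `Node` class, the `NOISE_TAGS` set, and the rule that an empty node counts as "useless" are hypothetical choices made for illustration only. Visual attributes are kept simply by storing each tag's attributes (for example `style`) on the node.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: tags whose content is never reader-facing text.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)  # retains style, width, etc.
    text: str = ""
    children: list = field(default_factory=list)


class TagTreeBuilder(HTMLParser):
    """Arranges tags into a tree according to their nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Crude normalization: close back to the nearest matching open tag,
        # which tolerates unclosed inner tags; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            node = self.stack[-1]
            node.text = f"{node.text} {text}".strip()


def prune(node: Node) -> bool:
    """Delete useless nodes in place; return True if this node is kept."""
    if node.tag in NOISE_TAGS:
        return False
    node.children = [child for child in node.children if prune(child)]
    return bool(node.text or node.children or node.tag == "img")


def build_tag_tree(html: str) -> Node:
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root


sample = (
    '<html><body><div id="main">'
    '<p style="font-size:14px">Hello <b>world</b></p>'
    "<script>var x = 1;</script><div></div>"
    "</div></body></html>"
)
tree = build_tag_tree(sample)
```

In this sketch the `script` node and the empty inner `div` are removed, while the `style` attribute of the paragraph survives for later visual-feature checks. A production pipeline would likely use a full HTML5 parser for the normalization step.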
In the present embodiment, in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information serve as attributes of the content block in which they are located.
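The bottom-up segmentation of step (3) can be sketched as follows. The patent does not specify how the amount of information is measured or what rule decides whether a parent node becomes a block, so this example assumes character count as the measure and a hypothetical threshold `MIN_BLOCK_CHARS`: a parent container becomes a single block when every block beneath it is shorter than the threshold. The candidate list `CONTAINER_TAGS` is likewise an assumption, and visual features are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List

CONTAINER_TAGS = ("div", "table", "td")  # assumption: candidate layout tags
MIN_BLOCK_CHARS = 40                     # hypothetical information threshold


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def dominant_container(root: Node) -> str:
    """Step a: count container tags to infer which one the page uses for layout."""
    counts = Counter(n.tag for n in walk(root) if n.tag in CONTAINER_TAGS)
    return counts.most_common(1)[0][0] if counts else "div"


def merged_text(node: Node) -> str:
    """Merge all text found under a node into one string."""
    return " ".join(n.text for n in walk(node) if n.text)


def segment(node: Node, container: str) -> List[str]:
    """Steps b and c: return the text of each block found under the node."""
    child_blocks: List[str] = []
    for child in node.children:
        child_blocks.extend(segment(child, container))
    if node.tag != container:
        return child_blocks
    if not child_blocks:
        # Step b: a bottom-level container, so merge all text beneath it.
        text = merged_text(node)
        return [text] if text else []
    # Step c: the parent becomes the block when its children carry little
    # information on their own; otherwise the children stay separate blocks.
    # Simplification: in the second case, text sitting directly in the
    # parent (outside any child container) is dropped.
    if all(len(block) < MIN_BLOCK_CHARS for block in child_blocks):
        return [merged_text(node)]
    return child_blocks


page = Node("body", children=[
    Node("div", children=[Node("div", text="Home"), Node("div", text="About")]),
    Node("div", children=[
        Node("div", children=[Node("p", text="A" * 60)]),
        Node("div", children=[Node("p", text="B" * 50)]),
    ]),
    Node("table", children=[Node("td", text="x")]),
])
layout_tag = dominant_container(page)
blocks = segment(page, layout_tag)
```

On this hand-built page the two short navigation entries are merged into one block at their parent, while the two long paragraphs remain separate blocks. Text outside the dominant container type is ignored, which is a simplification of this sketch rather than something the patent states.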
In the present embodiment, visual signature includes size, position, the size of font and the color and paragraph of table Length.
In the present embodiment, stepIn, importance and block feature according to content blocks accept or reject content blocks.
In the present embodiment, block feature includes space characteristics and content characteristic, space characteristics include content blocks position and Size, content characteristic include word length, number of links and picture number.
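The selection in step (4) can be sketched as a scoring function over the block features listed above. The patent gives no formula, so everything numeric here is an assumption: the penalties `LINK_PENALTY` and `IMAGE_PENALTY`, the equal weighting of size and position, and the default threshold are hypothetical values. The `Block` fields `center_y` and `area_share` are illustrative stand-ins for the position and size of a block.

```python
from dataclasses import dataclass
from typing import List

LINK_PENALTY = 20   # hypothetical: each link offsets 20 characters of text
IMAGE_PENALTY = 50  # hypothetical: each image offsets 50 characters of text


@dataclass
class Block:
    text: str
    link_count: int
    image_count: int
    center_y: float    # vertical center as a fraction of page height (0 = top)
    area_share: float  # fraction of the page area covered by the block


def importance(block: Block) -> float:
    """Combine content features and spatial features into one score in [0, 1]."""
    text_len = len(block.text.strip())
    if text_len == 0:
        return 0.0
    # Content feature: long text with few links and images looks like body text.
    noise = LINK_PENALTY * block.link_count + IMAGE_PENALTY * block.image_count
    density = text_len / (text_len + noise)
    # Spatial feature: large blocks near the vertical middle of the page.
    centrality = 1.0 - 2.0 * abs(block.center_y - 0.5)
    spatial = 0.5 * block.area_share + 0.5 * centrality
    return density * spatial


def select_text_blocks(blocks: List[Block], threshold: float = 0.3) -> List[Block]:
    """Keep the blocks whose score reaches the user-chosen threshold."""
    return [block for block in blocks if importance(block) >= threshold]


nav = Block("Home About Contact", link_count=3, image_count=0,
            center_y=0.05, area_share=0.05)
article = Block("x" * 600, link_count=2, image_count=1,
                center_y=0.5, area_share=0.6)
advert = Block("Buy now", link_count=1, image_count=1,
               center_y=0.5, area_share=0.1)
kept = select_text_blocks([nav, article, advert])
```

The `threshold` parameter is where the "needs of the user" would enter in this sketch: a lower value keeps more marginal blocks, a higher value keeps only the densest, most central text.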
The present invention segments the web page into blocks and extracts information by keeping or discarding content blocks. Segmentation uses an automatic algorithm that analyzes the tag tree from the bottom up; it is more accurate than the prior art and segments complex structures better. Each content block is analyzed by its importance and block features to extract the information the user needs, with high accuracy and good results.

Claims (6)

1. A block-based method for extracting body text information from web pages, characterized by comprising the following steps:
(1) Normalize the web page: the HTML code is first preprocessed and normalized;
(2) Construct the tag tree: a tag tree is built from the normalized web page by arranging the tags in the page into a tree structure according to their nesting relationships; the visual attributes of each node are retained during construction, and the tag tree is pruned by deleting useless nodes;
(3) Segment the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. count the number of each kind of container tag in the tag tree to determine which kind of container tag the page uses for layout;
b. examine the bottom-level container tag nodes: merge all text nodes under each bottom-level node of the tag tree, measure the amount of information in the resulting block, and examine its visual features;
c. examine the parent node of each bottom-level node, calculate the amount of information in that node, and judge whether the node can become a block node;
(4) Extract the blocks containing body text: after segmentation, content blocks are kept or discarded according to the needs of the user, and the content blocks containing body text are extracted.
2. The block-based method for extracting body text information from web pages according to claim 1, characterized in that: in step (2), the tag tree is constructed using a DOM tag tree construction tool.
3. The block-based method for extracting body text information from web pages according to claim 1, characterized in that: in step (3), the web page is divided using the content-block tags in the page as container tags, and other types of tag information serve as attributes of the content block in which they are located.
4. The block-based method for extracting body text information from web pages according to claim 1, characterized in that: the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
5. The block-based method for extracting body text information from web pages according to claim 1, characterized in that: in step (4), content blocks are kept or discarded according to their importance and block features.
6. The block-based method for extracting body text information from web pages according to claim 5, characterized in that: the block features include spatial features and content features; the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of images.
CN201710349695.8A 2017-05-17 2017-05-17 A block-based method for extracting body text information from web pages Pending CN108509469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710349695.8A CN108509469A (en) 2017-05-17 2017-05-17 A block-based method for extracting body text information from web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710349695.8A CN108509469A (en) 2017-05-17 2017-05-17 A block-based method for extracting body text information from web pages

Publications (1)

Publication Number Publication Date
CN108509469A (en) 2018-09-07

Family

ID=63373328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710349695.8A Pending CN108509469A (en) 2017-05-17 2017-05-17 A block-based method for extracting body text information from web pages

Country Status (1)

Country Link
CN (1) CN108509469A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857956A (en) * 2019-01-25 2019-06-07 四川大学 The automatic abstracting method of news web page key message based on label and blocking characteristic
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857956A (en) * 2019-01-25 2019-06-07 四川大学 The automatic abstracting method of news web page key message based on label and blocking characteristic
CN109857956B (en) * 2019-01-25 2019-12-31 四川大学 News webpage key information automatic extraction method based on label and block characteristics
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN110377796B (en) * 2019-07-25 2021-11-02 中南民族大学 Text extraction method, device and equipment based on DOM tree and storage medium

Similar Documents

Publication Publication Date Title
CN102253979B (en) Vision-based web page extracting method
CN102156737B (en) Method for extracting subject content of Chinese webpage
CN104504086B (en) The clustering method and device of Webpage
CN103488746B (en) Method and device for acquiring business information
EP2633432A1 (en) Extraction of content from a web page
CN109492177B (en) web page blocking method based on web page semantic structure
CN102314520A (en) Webpage text extraction method and device based on statistical backtracking positioning
CN107230123A (en) commodity mapping method, device and equipment
CN106326451B (en) A kind of webpage heat transfer agent block decision method of view-based access control model feature extraction
Prasad et al. Coreex: content extraction from online news articles
CN103440494A (en) Horrible image identification method and system based on visual significance analyses
CN109144513B (en) Method for automatically extracting list page
Madan et al. Synthetically trained icon proposals for parsing and summarizing infographics
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN108509469A (en) A block-based method for extracting body text information from web pages
CN106528068A (en) Webpage content reconstruction method and system
CN102141998B (en) Automatic evaluation method for webpage vision complexity
CN104462394B (en) A kind of system and method for identifying text floor of webpage
CN109934852A (en) A kind of video presentation method based on object properties relational graph
CN108874870A (en) A kind of data pick-up method, equipment and computer can storage mediums
CN103744920A (en) Commodity attribute name-value pair extraction method and system
CN103678432B (en) A kind of web page body extracting method based on web page body feature and intermediary's true value
CN104484451B (en) The extracting method and device of Webpage information
CN100590623C (en) System and method for abstraction of Web data based on vision
CN106897287A (en) Homepage Publishing decimation in time method and the device for Homepage Publishing decimation in time

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180907