CN108509469A - A block-based method for extracting web page text information - Google Patents
A block-based method for extracting web page text information
- Publication number
- CN108509469A (application CN201710349695.8A)
- Authority
- CN
- China
- Prior art keywords
- block partitioning
- webpage
- web page
- content
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Medicines Containing Plant Substances (AREA)
Abstract
The present invention provides a block-based method for extracting web page text information, comprising the following steps: (1) normalizing the web page; (2) constructing a tag tree; (3) partitioning the web page into blocks; (4) extracting the blocks that contain body text. The invention extracts information by partitioning the web page into blocks and then deciding which content blocks to keep. The page is partitioned by an automatic algorithm that analyzes the tag tree bottom-up; according to the invention, this is more accurate than the prior art and partitions complex structures better. Each content block is also analyzed by its importance and its block features to extract the information the user needs, with high accuracy and good results.
Description
Technical field
The present invention relates to the technical field of data acquisition, and in particular to a block-based method for extracting web page text information.
Background
As network resources and network information continue to grow, people depend on the network more and more, yet it has also become inconvenient for users to quickly find the specific resources they need among the vast body of network resources. Information has always been valuable, and society has gradually entered the information age, in which every industry is flooded with information. The value of information lies in the circulation of data: only when data can be circulated and transmitted in time does information realize its full value. Under market economy conditions, data acquisition has become an important tool and means.
With the rapid development of the Web, the information on it is increasingly abundant. To make better use of this information, people keep pursuing technologies and systems that can effectively organize and use network information. However, Web documents are not as neat and clean as traditional text: they contain a large amount of noise, such as scripts added to enhance user interactivity, navigation links added to make browsing easier, and advertisement links added for commercial reasons. This noise not only reduces the efficiency of Web information retrieval but also lowers retrieval accuracy.
Therefore, in view of the above problems, the present invention proposes a new technical solution.
Summary of the invention
The object of the present invention is to provide a block-based method for extracting web page text information that effectively removes noise and quickly extracts the required content.
The present invention is achieved through the following technical solutions:
A block-based method for extracting web page text information, comprising the following steps:
(1) Web page normalization: the HTML code is first preprocessed and normalized;
(2) Tag tree construction: a tag tree is constructed from the normalized web page. The tags in the page are arranged into a tree structure according to their nesting relationships, and the visual attributes of each node are retained during construction. At the same time, the tag tree is pruned to delete useless nodes;
(3) Partitioning the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. the number of each kind of container tag in the tag tree is counted to determine which kind of container tag the page uses for layout;
b. the bottom-level container tag nodes are examined: all text nodes under each bottom-level node of the tag tree are merged, the information content of the resulting block is computed, and its visual features are examined;
c. the parent node of each bottom-level node is examined, and its information content is computed to determine whether that node can become a block node;
(4) Extracting the blocks that contain body text: after partitioning, content blocks are kept or discarded according to the needs of different users, and the content blocks containing text information are extracted.
Further, in step (2), the tag tree is constructed with a DOM tag-tree construction tool.
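The normalization and tag-tree construction steps can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser` module; the source does not name a specific DOM tool, so the `Node` class, the `VOID_TAGS` and `NOISE_TAGS` sets, and the pruning rule (drop script and style nodes, and nodes with neither text nor children) are assumptions rather than the patent's own definitions.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the source does not list them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
NOISE_TAGS = {"script", "style", "noscript"}


class Node:
    """One tag-tree node; attrs keeps visual attributes such as style or width."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this node


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name. Any unclosed
        # descendants are closed implicitly, which is a crude normalization.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text += text + " "


def prune(node):
    """Delete noise nodes and nodes that carry no text and no children."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.text.strip() or child.children or child.tag == "img":
            kept.append(child)
    node.children = kept


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root
```

A production system would more likely rely on a full HTML5 parser for normalization; the hand-rolled end-tag recovery here only covers simple cases of unclosed tags.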
Further, in step (3), the page is divided using the content-block tags in the page as container tags, and the information carried by other types of tags serves as attributes of the content block in which they are located.
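Deciding which container tag a page uses for layout can be sketched as a simple frequency count. The candidate tag set (`div` and `table`) and the function name `dominant_container` are assumptions for illustration; the source only says that the container tags in the tag tree are counted.

```python
from collections import Counter
from html.parser import HTMLParser


class ContainerCounter(HTMLParser):
    """Counts how often each candidate container tag is opened."""

    def __init__(self, candidates):
        super().__init__()
        self.candidates = set(candidates)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.candidates:
            self.counts[tag] += 1


def dominant_container(html, candidates=("div", "table")):
    """Return the most frequent candidate tag, or None if none occurs."""
    counter = ContainerCounter(candidates)
    counter.feed(html)
    counter.close()
    if not counter.counts:
        return None
    return counter.counts.most_common(1)[0][0]
```

For example, a page laid out with nested `div` elements and a single data table would be classified as `div`-based, and partitioning would then treat `div` nodes as block candidates.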
Further, the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
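Some of these visual features can be read directly from a node's markup. The sketch below assumes that size, font size, and color are given as HTML attributes or inline CSS; position generally requires a rendering engine and is left out. The function names and the returned dictionary keys are hypothetical.

```python
def parse_inline_style(style):
    """Split an inline CSS string into a {property: value} dict."""
    props = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip():
            props[name.strip().lower()] = value.strip()
    return props


def visual_features(attrs, text=""):
    """Collect size, font, and paragraph-length features for one node."""
    style = parse_inline_style(attrs.get("style", ""))
    return {
        "width": attrs.get("width") or style.get("width"),
        "height": attrs.get("height") or style.get("height"),
        "font_size": style.get("font-size"),
        "color": style.get("color"),
        "paragraph_length": len(text.strip()),
    }
```

The style parser is deliberately naive: values that themselves contain semicolons, such as some `url(...)` data URIs, would be split incorrectly.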
Further, in step (4), content blocks are kept or discarded according to their importance and block features.
Further, the block features include spatial features and content features: the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of images.
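The three content features named above can be computed from a block's markup in one pass. This is a sketch under the assumption that text length is measured in characters and that links and images correspond to `a` and `img` tags; the `ContentFeatures` class and `content_features` function are illustrative names, not part of the source.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ContentFeatures:
    text_length: int = 0
    link_count: int = 0
    image_count: int = 0


class _FeatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.features = ContentFeatures()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.features.link_count += 1
        elif tag == "img":
            self.features.image_count += 1

    def handle_data(self, data):
        # Whitespace around each text run is ignored.
        self.features.text_length += len(data.strip())


def content_features(block_html):
    """Return text length, link count, and image count for one block."""
    parser = _FeatureParser()
    parser.feed(block_html)
    parser.close()
    return parser.features
```

Spatial features (position and size) are not derivable from markup alone and would have to come from a layout engine or from explicit attributes.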
The beneficial effects of the present invention are as follows: the invention extracts information by partitioning the web page into blocks and deciding which content blocks to keep. The page is partitioned by an automatic algorithm that analyzes the tag tree bottom-up; according to the invention, this is more accurate than the prior art and partitions complex structures better. Each content block is also analyzed by its importance and block features to extract the information the user needs, with high accuracy and good results.
Detailed description
The present invention is further described below with reference to an embodiment.
Embodiment 1
A block-based method for extracting web page text information, comprising the following steps:
(1) Web page normalization: the HTML code is first preprocessed and normalized;
(2) Tag tree construction: a tag tree is constructed from the normalized web page. The tags in the page are arranged into a tree structure according to their nesting relationships, and the visual attributes of each node are retained during construction. At the same time, the tag tree is pruned to delete useless nodes;
(3) Partitioning the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. the number of each kind of container tag in the tag tree is counted to determine which kind of container tag the page uses for layout;
b. the bottom-level container tag nodes are examined: all text nodes under each bottom-level node of the tag tree are merged, the information content of the resulting block is computed, and its visual features are examined;
c. the parent node of each bottom-level node is examined, and its information content is computed to determine whether that node can become a block node;
(4) Extracting the blocks that contain body text: after partitioning, content blocks are kept or discarded according to the needs of different users, and the content blocks containing text information are extracted.
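The bottom-up partitioning in sub-steps b and c can be sketched as a post-order traversal. The source does not define "information content" or the rule for promoting a parent node to a block, so this sketch assumes that information content is total text length and that a container becomes a single block when none of the blocks beneath it reaches a threshold (`min_info`, a hypothetical parameter). Visual features are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def info(node):
    """Assumed information content: total text length under the node."""
    return len(node.text) + sum(info(child) for child in node.children)


def partition(node, container="div", min_info=20):
    """Return the block nodes under `node`, decided bottom-up."""
    child_blocks = []
    for child in node.children:
        child_blocks.extend(partition(child, container, min_info))
    if node.tag != container:
        # Non-container tags never form blocks; pass sub-blocks upward.
        return child_blocks
    # A bottom-level container has no sub-blocks and becomes a block itself.
    # A higher container absorbs its sub-blocks only if all of them are small.
    if all(info(block) < min_info for block in child_blocks):
        return [node]
    return child_blocks
```

With this rule, a navigation bar made of several tiny `div` items collapses into one block, while an article area whose child containers each hold substantial text stays split. Text sitting directly in a container that is not promoted is ignored, which a fuller implementation would need to handle.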
In this embodiment, in step (2), the tag tree is constructed with a DOM tag-tree construction tool.
In this embodiment, in step (3), the page is divided using the content-block tags in the page as container tags, and the information carried by other types of tags serves as attributes of the content block in which they are located.
In this embodiment, the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
In this embodiment, in step (4), content blocks are kept or discarded according to their importance and block features.
In this embodiment, the block features include spatial features and content features: the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of images.
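One way to turn these features into a keep-or-discard decision is a set of threshold rules. The source does not give a formula for importance, so the rules below (minimum text length, maximum link density, minimum share of page width) and all threshold values are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    # Spatial features: position and size in pixels.
    x: int
    y: int
    width: int
    height: int
    # Content features.
    text_length: int
    link_count: int
    image_count: int


def is_body_text(block, page_width, min_text=100,
                 max_links_per_100_chars=2.0, min_width_ratio=0.4):
    """Heuristic test for a body-text block (all thresholds are assumed)."""
    if block.text_length < min_text:
        return False
    link_density = block.link_count * 100 / block.text_length
    if link_density > max_links_per_100_chars:
        return False
    return block.width / page_width >= min_width_ratio


def select_blocks(blocks, page_width):
    """Keep only the blocks that pass the body-text heuristic."""
    return [block for block in blocks if is_body_text(block, page_width)]
```

Different users could be served by swapping in different predicates, for example one that favors image-rich blocks, which matches the idea that blocks are kept or discarded according to user needs.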
The invention extracts information by partitioning the web page into blocks and deciding which content blocks to keep. The page is partitioned by an automatic algorithm that analyzes the tag tree bottom-up; according to the invention, this is more accurate than the prior art and partitions complex structures better. Each content block is also analyzed by its importance and block features to extract the information the user needs, with high accuracy and good results.
Claims (6)
1. A block-based method for extracting web page text information, characterized by comprising the following steps:
(1) Web page normalization: the HTML code is first preprocessed and normalized;
(2) Tag tree construction: a tag tree is constructed from the normalized web page. The tags in the page are arranged into a tree structure according to their nesting relationships, and the visual attributes of each node are retained during construction. At the same time, the tag tree is pruned to delete useless nodes;
(3) Partitioning the web page into blocks: the page is divided using the content-block tags in the page as container tags;
a. the number of each kind of container tag in the tag tree is counted to determine which kind of container tag the page uses for layout;
b. the bottom-level container tag nodes are examined: all text nodes under each bottom-level node of the tag tree are merged, the information content of the resulting block is computed, and its visual features are examined;
c. the parent node of each bottom-level node is examined, and its information content is computed to determine whether that node can become a block node;
(4) Extracting the blocks that contain body text: after partitioning, content blocks are kept or discarded according to the needs of different users, and the content blocks containing text information are extracted.
2. The block-based method for extracting web page text information according to claim 1, characterized in that: in step (2), the tag tree is constructed with a DOM tag-tree construction tool.
3. The block-based method for extracting web page text information according to claim 1, characterized in that: in step (3), the page is divided using the content-block tags in the page as container tags, and the information carried by other types of tags serves as attributes of the content block in which they are located.
4. The block-based method for extracting web page text information according to claim 1, characterized in that: the visual features include the size and position of tables, the size and color of fonts, and the length of paragraphs.
5. The block-based method for extracting web page text information according to claim 1, characterized in that: in step (4), content blocks are kept or discarded according to their importance and block features.
6. The block-based method for extracting web page text information according to claim 5, characterized in that: the block features include spatial features and content features, the spatial features include the position and size of the content block, and the content features include text length, number of links, and number of images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710349695.8A CN108509469A (en) | 2017-05-17 | 2017-05-17 | A block-based method for extracting web page text information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710349695.8A CN108509469A (en) | 2017-05-17 | 2017-05-17 | A block-based method for extracting web page text information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108509469A (en) | 2018-09-07 |
Family
ID=63373328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710349695.8A Pending CN108509469A (en) | 2017-05-17 | 2017-05-17 | A block-based method for extracting web page text information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108509469A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109857956A (en) * | 2019-01-25 | 2019-06-07 | 四川大学 | The automatic abstracting method of news web page key message based on label and blocking characteristic |
CN110377796A (en) * | 2019-07-25 | 2019-10-25 | 中南民族大学 | Text extracting method, device, equipment and storage medium based on dom tree |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
CN101944109A (en) * | 2010-09-06 | 2011-01-12 | 华南理工大学 | System and method for extracting picture abstract based on page partitioning |
CN105677764A (en) * | 2015-12-30 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
- 2017-05-17 CN CN201710349695.8A patent/CN108509469A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109857956A (en) * | 2019-01-25 | 2019-06-07 | 四川大学 | The automatic abstracting method of news web page key message based on label and blocking characteristic |
CN109857956B (en) * | 2019-01-25 | 2019-12-31 | 四川大学 | News webpage key information automatic extraction method based on label and block characteristics |
CN110377796A (en) * | 2019-07-25 | 2019-10-25 | 中南民族大学 | Text extracting method, device, equipment and storage medium based on dom tree |
CN110377796B (en) * | 2019-07-25 | 2021-11-02 | 中南民族大学 | Text extraction method, device and equipment based on DOM tree and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102253979B (en) | Vision-based web page extracting method | |
CN102156737B (en) | Method for extracting subject content of Chinese webpage | |
CN104504086B (en) | The clustering method and device of Webpage | |
CN103488746B (en) | Method and device for acquiring business information | |
EP2633432A1 (en) | Extraction of content from a web page | |
CN109492177B (en) | web page blocking method based on web page semantic structure | |
CN102314520A (en) | Webpage text extraction method and device based on statistical backtracking positioning | |
CN107230123A (en) | commodity mapping method, device and equipment | |
CN106326451B (en) | A kind of webpage heat transfer agent block decision method of view-based access control model feature extraction | |
Prasad et al. | Coreex: content extraction from online news articles | |
CN103440494A (en) | Horrible image identification method and system based on visual significance analyses | |
CN109144513B (en) | Method for automatically extracting list page | |
Madan et al. | Synthetically trained icon proposals for parsing and summarizing infographics | |
CN105740355B (en) | Webpage context extraction method and device based on aggregation text density | |
CN108509469A (en) | A block-based method for extracting web page text information | |
CN106528068A (en) | Webpage content reconstruction method and system | |
CN102141998B (en) | Automatic evaluation method for webpage vision complexity | |
CN104462394B (en) | A kind of system and method for identifying text floor of webpage | |
CN109934852A (en) | A kind of video presentation method based on object properties relational graph | |
CN108874870A (en) | A kind of data pick-up method, equipment and computer can storage mediums | |
CN103744920A (en) | Commodity attribute name-value pair extraction method and system | |
CN103678432B (en) | A kind of web page body extracting method based on web page body feature and intermediary's true value | |
CN104484451B (en) | The extracting method and device of Webpage information | |
CN100590623C (en) | System and method for abstraction of Web data based on vision | |
CN106897287A (en) | Homepage Publishing decimation in time method and the device for Homepage Publishing decimation in time |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180907 |