CN104598577A - Extraction method for webpage text - Google Patents
- Publication number
- CN104598577A
- Authority
- CN
- China
- Prior art keywords
- text
- text block
- web page
- density
- block
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to an extraction method for a webpage text. The extraction method comprises the following steps: 1) extracting a webpage title through a regular expression; 2) preprocessing the webpage; 3) dynamically dividing a text block; 4) rating the text block and selecting the optimal text block; 5) circularly expanding the text block. According to the extraction method for the webpage text, the extraction speed is high, the extraction effects for various pages of the news gateway, personal blog and community forum are all excellent, the accuracy is high and the robustness is excellent.
Description
Technical field
The present invention relates to the field of information processing, and in particular to a general-purpose and efficient method for extracting the body text of web pages.
Technical background
Webpage body text extraction refers to the method by which a computer system identifies structured text information, such as the title and the body text, in Internet web pages that are unstructured and differ in content and layout. For web information retrieval systems, and search engines in particular, body text extraction is a very important basic module. If extraction is inaccurate, for example if large parts of the body are missed or non-body parts are identified as body text, the subsequent matching against query terms cannot be guaranteed to be precise, and user needs are difficult to meet.
In recent years, with the spread of chat and sharing platforms such as WeChat and of personalized reading applications, these applications have needed to extract the body text of third-party web pages so that the content can be adapted to screens of different sizes and the user experience improved. In addition, with the development of technologies such as targeted advertising, the demand for big-data text mining keeps growing, and a prerequisite for such mining is the ability to identify the body text of the web pages behind the massive number of URLs on the Internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities and the like; the data volume is large and the content covers a wide range of subjects.
The above demands show how important webpage body text extraction is, and they also require the extraction method to be fast, accurate and general. Existing extraction methods can be summarized as follows:
(1) Rule-based methods. Extraction rules, such as regular expressions or XPath, are written by hand for specific websites. The advantage is high accuracy; the disadvantage is that only pages from fixed websites or with a fixed format can be parsed, writing the rules is time-consuming and laborious, and once the page layout changes it is difficult to detect the change and keep the rules up to date.
(2) Parsing the HTML DOM (Document Object Model) tree. A DOM tree is built for the HTML page and traversed; non-body information is identified and removed, and the body text is extracted according to rules such as page layout and text density. The biggest drawback is that building the DOM tree is slow, so extraction efficiency is very low. Moreover, as web authoring techniques evolve, the HTML of many websites is becoming more complex and less standards-compliant; this approach is then not only slow but also has some probability of failing to build the DOM tree, causing extraction to fail.
(3) Methods based on implicit annotations and visual information in the HTML (such as text color and font). These methods are not general enough and still need many hand-written rules. They usually work well only on news portals and the like, and have a lower success rate on pages whose styles vary widely, such as personal blogs.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a general-purpose and efficient method for extracting webpage body text. The method processes web pages efficiently, extracts the body text, and is well suited to tasks such as search engines and big-data text mining.
The object of the invention is achieved through the following technical solution:
A method for extracting webpage body text, comprising the following steps:
Step 1: extract the web page title with a regular expression.
Step 2: preprocess the web page, that is, remove irrelevant characters and web page tags to obtain the continuous text fragments in the page. Irrelevant characters include but are not limited to page scripts, cascading style sheets (CSS) and comments. Removing the web page tags comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break.
Step 3: dynamically divide the text into blocks. (1) Collect all blank lines in the obtained text fragments and count how many consecutive blank lines each run contains, which yields an array; (2) sort the array in ascending order and take the number at the one-fifth position as the threshold; (3) use this threshold as the number of consecutive blank lines at which the text fragments are separated, obtaining multiple dynamic text blocks.
Step 4: score the text blocks and select the optimal text block. Specifically, for each text block, gather statistics on the text length, the number of hyperlinks, the number of punctuation marks, the number of stop words, the similarity to the title, and the relative position of the block within the whole text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively. Each text block is scored; the higher the score, the better the block, and the optimal text block is selected accordingly.
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)
Step 5: iteratively expand the text block. Starting from the optimal text block obtained, merge the text block immediately before or after it and recalculate the score. If the score rises, the merged block becomes the new optimal text block; if the score falls, the merge is abandoned. This is repeated until the score can no longer rise, and the text block finally obtained is the body text to be extracted.
In summary, by adopting the above technical solution, the present invention has the following beneficial effects compared with the prior art:
(1) Efficient. The extraction method does not need to build a DOM tree or recurse over a tree structure; a few scans of the HTML suffice to complete preprocessing and obtain the text fragments, so extraction is very fast.
(2) General. The invention does not depend on a specific page layout and does not need to make many assumptions about web page tags, comments and the like, so it is highly general. Experiments show good extraction results on pages from news portals, personal blogs and community forums alike.
(3) Accurate. Because the method scores the text blocks, it can reliably distinguish and filter out non-body content, and the strategy of iteratively merging outward from the optimal text block minimizes the omission of high-quality body text.
(4) Robust. The method makes no assumptions about the correctness of the HTML, so body text can be extracted even from non-standard HTML.
Brief description of the drawings
Fig. 1 is a flow chart of the webpage body text extraction method of the present invention.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the drawing and the embodiments.
Referring to Fig. 1, the webpage body text extraction method of the present invention is divided into two main stages: (1) text block separation; (2) statistical filtering and merging. The purpose of text block separation is to separate body segments from non-body segments; the purpose of statistical filtering and merging is to screen the body segments. Text block separation comprises three steps: extracting the web page title with a regular expression, preprocessing the web page, and dynamically dividing text blocks. Statistical filtering and merging comprises two steps: selecting the optimal text block and iteratively expanding it.
First, the web page title is extracted from the original web page in preparation for the later scoring of text blocks. The title can be extracted with general-purpose regular expressions, for example the following two rules in Python:
_title = re.compile(r'<title>(.*?)</title>', re.I|re.S)
and
_title = re.compile(r'<h1>(.*?)</h1>', re.I|re.S)
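The title-extraction step can be sketched as a small function built on the two rules quoted above. This is only an illustration: the names `TITLE_RE`, `H1_RE` and `extract_title`, the fallback order (`<title>` first, then `<h1>`) and the whitespace clean-up are assumptions of this sketch and are not stated in the patent.

```python
import re

# The two rules from the text, given distinct (hypothetical) names so that
# the second does not overwrite the first.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.I | re.S)
H1_RE = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)


def extract_title(html):
    """Return the first <title> match, else the first <h1> match, else ''."""
    for pattern in (TITLE_RE, H1_RE):
        match = pattern.search(html)
        if match:
            # re.S lets the match span several lines, so collapse whitespace.
            return ' '.join(match.group(1).split())
    return ''
```

Note that, as written, neither pattern matches an opening tag that carries attributes (for example `<h1 class="headline">`); a practical rule set would probably need to allow for that.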
After the title is extracted, the web page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and web page tags and obtain the continuous text fragments in the page. Irrelevant characters include but are not limited to page scripts, cascading style sheets (CSS) and comments; these characters do not appear in the body text and do not help extraction. Because the HTML of different websites varies widely (some contains line breaks and some does not), the original line breaks in the page are removed before the text fragments are extracted. To separate the page text so that body and non-body text can be distinguished, the tags in the page are also removed, and each removed tag is replaced with a new line break. After this step, continuous text fragments are obtained line by line; the fragments differ in size, and the gaps between them also differ.
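A minimal sketch of this preprocessing step follows. The regular expressions and the function name `preprocess` are assumptions made for illustration; the patent only states what is removed, not how. Original line breaks are deleted outright, as the text describes, which suits Chinese text but would join English words split across lines.

```python
import re

# Irrelevant regions, removed together with their content:
# scripts, style sheets and HTML comments (hypothetical patterns).
_NOISE_RE = re.compile(
    r'<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->',
    re.I | re.S,
)
# Any remaining tag.
_TAG_RE = re.compile(r'<[^>]*>')


def preprocess(html):
    """Return the page as a list of lines; blank lines mark removed tags."""
    html = _NOISE_RE.sub('', html)
    # (1) Remove the page's own line breaks.
    html = html.replace('\r', '').replace('\n', '')
    # (2) Replace every remaining tag with a new line break.
    text = _TAG_RE.sub('\n', html)
    return [line.strip() for line in text.split('\n')]
```

Because each tag becomes one line break, fragments separated by many tags end up separated by many blank lines, and that spacing is what the block division step relies on.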
Next, the text is dynamically divided into blocks. The purpose is to aggregate text fragments that lie close together. Because different pages are structured differently, the sparseness of the fragments produced by preprocessing varies greatly. To address this, a dynamic threshold is set according to the sparseness of the text in the page, and the page is divided into multiple text blocks using this threshold. Specifically: first, collect all blank lines in the obtained text fragments and count how many consecutive blank lines each run contains, which yields an array such as [1, 1, 3, 3, 1, 1, 5, 1, 1]; then sort the array in ascending order and take the number at the one-fifth position as the threshold; finally, use this threshold as the number of consecutive blank lines at which the text fragments are separated, obtaining multiple dynamic text blocks.
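The dynamic division can be sketched as follows. The patent does not spell out the exact index used for "one fifth" or whether a gap must equal or exceed the threshold, so this sketch assumes the element at index `len // 5` of the sorted array and splits wherever a run of blank lines is at least as long as the threshold. All function names are hypothetical.

```python
def blank_run_lengths(lines):
    """Length of every maximal run of blank lines, in document order."""
    runs, current = [], 0
    for line in lines:
        if line.strip():
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def dynamic_threshold(runs):
    """Value one fifth of the way into the sorted run lengths (assumed index)."""
    if not runs:
        return 1
    ordered = sorted(runs)
    return ordered[len(ordered) // 5]


def split_blocks(lines, threshold):
    """Group non-blank lines; start a new block after >= threshold blank lines."""
    blocks, current, blanks = [], [], 0
    for line in lines:
        if line.strip():
            if current and blanks >= threshold:
                blocks.append(current)
                current = []
            current.append(line.strip())
            blanks = 0
        else:
            blanks += 1
    if current:
        blocks.append(current)
    return blocks
```

For the example array in the text, sorting gives [1, 1, 1, 1, 1, 1, 3, 3, 5] and index 9 // 5 = 1 selects a threshold of 1 under these assumptions.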
After the text blocks are separated, they are scored and the optimal text block is selected. Specifically, for each text block, gather statistics on the text length, the number of hyperlinks, the number of punctuation marks, the number of stop words (words that often appear in non-body regions, such as "advertisement", "about us" and "friendly links"), the similarity to the title, and the relative position of the block within the whole text (for example, the first block has position 1 and the last has position 0), denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively. Each text block is scored, and the optimal text block is selected accordingly.
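Some of these statistics can be sketched as below. The patent names the features but does not define how each is computed, so everything here is an assumption for illustration: the linear form of `position_rate`, the punctuation set, the character-overlap measure used for `title_match_rate`, and the stop-word list (taken from the translated examples in the text). Hyperlink counting is left out because tags are removed during preprocessing and the text does not say how link counts are retained.

```python
# Hypothetical punctuation set covering common ASCII and full-width marks.
PUNCTUATION = set(',.;:!?，。；：！？、')
# Stop words taken from the examples in the text (translated).
STOPWORDS = ('advertisement', 'about us', 'friendly links')


def position_rate(index, total):
    """1.0 for the first block, 0.0 for the last, linear in between."""
    if total <= 1:
        return 1.0
    return 1.0 - index / (total - 1)


def count_punctuation(text):
    """Number of characters in text that belong to PUNCTUATION."""
    return sum(1 for ch in text if ch in PUNCTUATION)


def count_stopwords(text):
    """Total occurrences of the stop words, ignoring case."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in STOPWORDS)


def title_match_rate(text, title):
    """Fraction of the title's distinct non-space characters found in text."""
    chars = set(title) - {' '}
    if not chars:
        return 0.0
    return sum(1 for ch in chars if ch in text) / len(chars)
```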
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)
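The formula translates directly into code. One practical issue it leaves open is that a block with no hyperlinks or no stop words would cause a division by zero. The sketch below adds a smoothing term `eps` to both denominators; that term, the `BlockFeatures` container and the function name `score` are assumptions of this sketch, and `eps=0.0` reproduces the formula exactly as written.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """Per-block statistics, using the names given in the text."""
    text_density: float
    link_density: float
    punctuation_density: float
    stopword_density: float
    title_match_rate: float
    position_rate: float


def score(f, eps=1.0):
    """Score a block; eps is an assumed guard against zero denominators."""
    return (
        f.text_density
        * f.position_rate
        * (1 + f.title_match_rate)
        * math.sqrt(f.punctuation_density)
        / (f.link_density + eps)
        / math.sqrt(f.stopword_density + eps)
    )
```

The shape of the formula rewards long, punctuation-rich blocks that resemble the title and appear early in the page, and penalizes blocks dominated by links or stop words.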
Starting from the optimal text block obtained, merge the text block immediately before or after it and recalculate the score. If the score rises, the merged block becomes the new optimal text block; if the score falls, the merge is abandoned. This is repeated until the score can no longer rise, and the text block finally obtained is the body text to be extracted. The number of iterations is bounded and cannot exceed the number of text blocks. In this way as much high-quality text as possible is merged and omissions of body text are reduced.
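The expansion loop can be sketched as a greedy procedure. Here `blocks` is a list of block strings and `score_fn` is any callable that scores a string; both, along with the function name and the choice to try the preceding block before the following one in each round, are assumptions of this sketch rather than details given in the patent.

```python
def expand_best_block(blocks, score_fn):
    """Grow the best-scoring block by merging neighbours while the score rises."""
    if not blocks:
        return ''
    best = max(range(len(blocks)), key=lambda i: score_fn(blocks[i]))
    lo = hi = best                      # current merged range is blocks[lo..hi]
    current = blocks[best]
    current_score = score_fn(current)
    for _ in range(len(blocks)):        # bounded by the number of blocks
        improved = False
        if lo > 0:                      # try the block just before the range
            candidate = blocks[lo - 1] + '\n' + current
            s = score_fn(candidate)
            if s > current_score:
                lo, current, current_score, improved = lo - 1, candidate, s, True
        if hi < len(blocks) - 1:        # try the block just after the range
            candidate = current + '\n' + blocks[hi + 1]
            s = score_fn(candidate)
            if s > current_score:
                hi, current, current_score, improved = hi + 1, candidate, s, True
        if not improved:                # neither neighbour helped: stop
            break
    return current
```

Because expansion stops at the first neighbour that lowers the score, a good block lying beyond a poor one is not reached; this follows the stopping rule described in the text.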
The above embodiment merely illustrates the technical idea and features of this patent. Its purpose is to enable those skilled in the art to understand and implement the content of this patent; the scope of this patent is not limited to this embodiment. Any equivalent change or modification made in accordance with the spirit disclosed in this patent still falls within the scope of this patent.
Claims (5)
1. A method for extracting webpage body text, characterized by comprising the following steps:
Step 1: extracting the web page title with a regular expression;
Step 2: preprocessing the web page;
Step 3: dynamically dividing the text into blocks;
Step 4: scoring the text blocks and selecting the optimal text block;
Step 5: iteratively expanding the text block.
2. The method for extracting webpage body text according to claim 1, characterized in that step 2 comprises: removing irrelevant characters and web page tags to obtain the continuous text fragments in the page;
removing the web page tags comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;
the irrelevant characters include but are not limited to page scripts, cascading style sheets (CSS) and comments.
3. The method for extracting webpage body text according to claim 1 or 2, characterized in that step 3 comprises:
(1) collecting all blank lines in the obtained text fragments and counting how many consecutive blank lines each run contains, thereby obtaining an array;
(2) sorting the array in ascending order and taking the number at the one-fifth position as the threshold;
(3) using this threshold as the number of consecutive blank lines at which the text fragments are separated, obtaining multiple dynamic text blocks.
4. The method for extracting webpage body text according to claim 1, characterized in that step 4 comprises: for each text block, gathering statistics on the text length, the number of hyperlinks, the number of punctuation marks, the number of stop words, the similarity to the title, and the relative position of the block within the whole text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively; scoring each text block; and selecting the optimal text block accordingly;
the scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)
5. The method for extracting webpage body text according to claim 1, characterized in that step 5 comprises: starting from the optimal text block obtained, merging the text block immediately before or after it and recalculating the score; if the score rises, the merged block becomes the new optimal text block, and if the score falls, the merge is abandoned; repeating this until the score can no longer rise, the text block finally obtained being the body text to be extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017223.3A CN104598577B (en) | 2015-01-14 | 2015-01-14 | A kind of extracting method of Web page text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017223.3A CN104598577B (en) | 2015-01-14 | 2015-01-14 | A kind of extracting method of Web page text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104598577A true CN104598577A (en) | 2015-05-06 |
CN104598577B CN104598577B (en) | 2017-09-15 |
Family
ID=53124362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510017223.3A Active CN104598577B (en) | 2015-01-14 | 2015-01-14 | A kind of extracting method of Web page text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104598577B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320734A (en) * | 2015-07-14 | 2016-02-10 | 中国互联网络信息中心 | Web page core content extraction method |
CN105740355A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Aggregated text density based webpage body text extraction method and apparatus |
CN106844441A (en) * | 2016-12-15 | 2017-06-13 | 北京容联光辉科技有限公司 | A kind of method and device of Information Sharing |
CN106960057A (en) * | 2017-04-05 | 2017-07-18 | 上海威固信息技术有限公司 | A kind of method that Web page text is extracted based on information density |
CN107273491A (en) * | 2017-06-15 | 2017-10-20 | 华中师范大学 | Webpage splitting method, device and electronic equipment |
CN108763591A (en) * | 2018-06-21 | 2018-11-06 | 湖南星汉数智科技有限公司 | A kind of webpage context extraction method, device, computer installation and computer readable storage medium |
CN108897749A (en) * | 2018-04-19 | 2018-11-27 | 中国科学院计算技术研究所 | Method for abstracting web page information and system based on syntax tree and text block density |
CN109635219A (en) * | 2018-12-05 | 2019-04-16 | 云孚科技(北京)有限公司 | A kind of webpage content extracting method |
CN110020312A (en) * | 2017-12-11 | 2019-07-16 | 北京京东尚科信息技术有限公司 | The method and apparatus for extracting Web page text |
CN110795933A (en) * | 2019-09-30 | 2020-02-14 | 奇安信科技集团股份有限公司 | Method and device for identifying and processing webpage text |
CN113051390A (en) * | 2019-12-26 | 2021-06-29 | 百度在线网络技术(北京)有限公司 | Knowledge base construction method and device, electronic equipment and medium |
CN113537091A (en) * | 2021-07-20 | 2021-10-22 | 东莞市盟大塑化科技有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
CN115408594A (en) * | 2022-11-01 | 2022-11-29 | 长沙火线云网络科技有限公司 | Webpage title extraction method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4606439B2 (en) * | 2007-05-28 | 2011-01-05 | モバイダーズ・インコーポレイテッド | File conversion apparatus and method for converting HTML file into flash image |
CN101937438A (en) * | 2009-06-30 | 2011-01-05 | 富士通株式会社 | Method and device for extracting webpage content |
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
CN102810097A (en) * | 2011-06-02 | 2012-12-05 | 高德软件有限公司 | Method and device for extracting webpage text content |
CN103064827A (en) * | 2013-01-16 | 2013-04-24 | 盘古文化传播有限公司 | Method and device for extracting webpage content |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4606439B2 (en) * | 2007-05-28 | 2011-01-05 | モバイダーズ・インコーポレイテッド | File conversion apparatus and method for converting HTML file into flash image |
CN101937438A (en) * | 2009-06-30 | 2011-01-05 | 富士通株式会社 | Method and device for extracting webpage content |
CN102810097A (en) * | 2011-06-02 | 2012-12-05 | 高德软件有限公司 | Method and device for extracting webpage text content |
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
CN103064827A (en) * | 2013-01-16 | 2013-04-24 | 盘古文化传播有限公司 | Method and device for extracting webpage content |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320734A (en) * | 2015-07-14 | 2016-02-10 | 中国互联网络信息中心 | Web page core content extraction method |
CN105320734B (en) * | 2015-07-14 | 2019-02-22 | 中国互联网络信息中心 | A kind of web page core content extracting method |
CN105740355A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Aggregated text density based webpage body text extraction method and apparatus |
CN105740355B (en) * | 2016-01-26 | 2019-03-26 | 中国人民解放军国防科学技术大学 | Webpage context extraction method and device based on aggregation text density |
CN106844441A (en) * | 2016-12-15 | 2017-06-13 | 北京容联光辉科技有限公司 | A kind of method and device of Information Sharing |
CN106960057A (en) * | 2017-04-05 | 2017-07-18 | 上海威固信息技术有限公司 | A kind of method that Web page text is extracted based on information density |
CN107273491A (en) * | 2017-06-15 | 2017-10-20 | 华中师范大学 | Webpage splitting method, device and electronic equipment |
CN107273491B (en) * | 2017-06-15 | 2020-07-24 | 华中师范大学 | Webpage segmentation method and device and electronic equipment |
CN110020312A (en) * | 2017-12-11 | 2019-07-16 | 北京京东尚科信息技术有限公司 | The method and apparatus for extracting Web page text |
CN108897749A (en) * | 2018-04-19 | 2018-11-27 | 中国科学院计算技术研究所 | Method for abstracting web page information and system based on syntax tree and text block density |
CN108763591A (en) * | 2018-06-21 | 2018-11-06 | 湖南星汉数智科技有限公司 | A kind of webpage context extraction method, device, computer installation and computer readable storage medium |
CN108763591B (en) * | 2018-06-21 | 2021-01-08 | 湖南星汉数智科技有限公司 | Webpage text extraction method and device, computer device and computer readable storage medium |
CN109635219A (en) * | 2018-12-05 | 2019-04-16 | 云孚科技(北京)有限公司 | A kind of webpage content extracting method |
CN110795933A (en) * | 2019-09-30 | 2020-02-14 | 奇安信科技集团股份有限公司 | Method and device for identifying and processing webpage text |
CN110795933B (en) * | 2019-09-30 | 2023-10-31 | 奇安信科技集团股份有限公司 | Webpage text recognition processing method and device |
CN113051390A (en) * | 2019-12-26 | 2021-06-29 | 百度在线网络技术(北京)有限公司 | Knowledge base construction method and device, electronic equipment and medium |
CN113051390B (en) * | 2019-12-26 | 2023-09-26 | 百度在线网络技术(北京)有限公司 | Knowledge base construction method, knowledge base construction device, electronic equipment and medium |
CN113537091A (en) * | 2021-07-20 | 2021-10-22 | 东莞市盟大塑化科技有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
CN113537091B (en) * | 2021-07-20 | 2024-05-03 | 东莞盟大集团有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
CN115408594A (en) * | 2022-11-01 | 2022-11-29 | 长沙火线云网络科技有限公司 | Webpage title extraction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN104598577B (en) | 2017-09-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104598577B (en) | A kind of extracting method of Web page text | |
CN101251855B (en) | Equipment, system and method for cleaning internet web page | |
CN102253979B (en) | Vision-based web page extracting method | |
CN101950284B (en) | Chinese word segmentation method and system | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN106126502B (en) | A kind of emotional semantic classification system and method based on support vector machines | |
CN101515272B (en) | Method and device for extracting webpage content | |
CN109685052A (en) | Method for processing text images, device, electronic equipment and computer-readable medium | |
CN106055667B (en) | It is a kind of based on text-label densities web page core content extracting method | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN101593200A (en) | Chinese Web page classification method based on the keyword frequency analysis | |
WO2017080090A1 (en) | Extraction and comparison method for text of webpage | |
CN103488724A (en) | Book-oriented reading field knowledge map construction method | |
CN106021392A (en) | News key information extraction method and system | |
CN103324622A (en) | Method and device for automatic generating of front page abstract | |
WO2011072434A1 (en) | System and method for web content extraction | |
EP2425353A1 (en) | Method and apparatus for identifying synonyms and using synonyms to search | |
CN110457579B (en) | Webpage denoising method and system based on cooperative work of template and classifier | |
CN103166981A (en) | Wireless webpage transcoding method and device | |
CN109543126A (en) | Web page text information extracting method based on block text accounting | |
CN102651002A (en) | Webpage information extracting method and system | |
EP2790111A1 (en) | Method and device for acquiring structured information in layout file | |
CN105320734A (en) | Web page core content extraction method | |
CN108959204B (en) | Internet financial project information extraction method and system | |
CN109492177A (en) | A kind of web page release method based on web page semantics structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |