CN104598577A - Extraction method for webpage text - Google Patents

Extraction method for webpage text

Info

Publication number
CN104598577A
Authority
CN
China
Prior art keywords
text
text block
web page
density
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510017223.3A
Other languages
Chinese (zh)
Other versions
CN104598577B (en)
Inventor
汤奇峰
刘作涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Original Assignee
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority to CN201510017223.3A
Publication of CN104598577A
Application granted
Publication of CN104598577B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a method for extracting the body text of a web page. The method comprises the following steps: 1) extracting the web page title with a regular expression; 2) preprocessing the web page; 3) dynamically dividing the text into text blocks; 4) scoring the text blocks and selecting the best one; 5) iteratively expanding the best text block. The method extracts quickly, performs well on pages from news portals, personal blogs, and community forums, and is accurate and robust.

Description

A method for extracting the body text of a web page
Technical field
The present invention relates to the field of information processing, and in particular to a general-purpose, efficient method for extracting the body text of web pages.
Technical background
Web page body text extraction is the process by which a computer system identifies structured text information, such as the title and body text, in unstructured web pages that differ in content and layout. For web information retrieval systems, and search engines in particular, body text extraction is a fundamental module. If extraction is inaccurate, for example if large parts of the body text are missed or non-body content is identified as body text, subsequent matching against query terms cannot be precise and users' needs are hard to meet.
In recent years, chat and sharing platforms such as WeChat and personalized reading applications have become widespread. These applications need to extract the body text from third-party web pages so that page content can be adapted to screens of different sizes, improving the user experience. In addition, with the development of technologies such as targeted advertising, demand for big-data text mining keeps growing, and a prerequisite for such mining is being able to identify the body text of the web pages behind the vast number of URLs on the Internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities, and so on; the data volume is large and the content wide-ranging.
These needs show how important web page body text extraction is, and they also require an extraction method to be fast, accurate, and general. Existing web page extraction methods can be summarized as follows:
(1) Rule-based methods. Extraction rules, such as regular expressions or XPath expressions, are written by hand for specific websites. The advantage is high accuracy. The drawbacks are that only pages from fixed websites or with fixed formats can be parsed, that writing the rules is time-consuming and laborious, and that once a page layout changes the breakage is hard to detect and the rules are hard to maintain.
(2) Parsing the DOM (Document Object Model) tree of the HTML. A DOM tree is built for the HTML page and traversed to identify and discard non-body content, and the body text is extracted according to rules based on page layout, text density, and the like. The biggest drawback is that building the DOM tree is slow, so extraction is inefficient. Moreover, as web authoring techniques evolve, the HTML of many websites is becoming more complex and less standards-compliant; this approach is not only slow but will, with some probability, fail to build a DOM tree at all, causing extraction to fail.
(3) Methods based on implicit annotations and visual information in the HTML (such as text color and font). These methods are not general enough and still need many hand-written rules. They usually work well only on news portals and the like, and have a lower success rate on pages whose style varies widely, such as personal blogs.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a general-purpose, efficient method for extracting the body text of web pages. It processes web pages efficiently, extracts the body text, and is well suited to tasks such as search engines and big-data text mining.
The object of the invention is achieved through the following technical solution:
A method for extracting the body text of a web page, comprising the following steps:
Step 1: extract the web page title with a regular expression.
Step 2: preprocess the web page, that is, remove irrelevant characters and web page tags to obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS (cascading style sheets), and comments. Removing the tags comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break.
Step 3: dynamically divide the text into text blocks. (1) Find all blank lines in the text fragments obtained and, for each run of blank lines, count how many consecutive blank lines it contains, which yields an array; (2) sort the array in ascending order and select the value at the one-fifth position from the small end as the threshold; (3) use this threshold as the number of consecutive blank lines at which to split the text fragments, yielding multiple dynamic text blocks.
Step 4: score the text blocks and select the best one. For each text block, count its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position among all the text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is then scored; the higher the score, the better the block, and the best text block is selected accordingly.
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
Step 5: iteratively expand the text block. Starting from the best text block, merge the text block immediately before or after it and recompute the score. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score no longer rises, and the text block finally obtained is the body text to be extracted.
In summary, by adopting the above technical solution, the present invention has the following beneficial effects compared with the prior art:
(1) Efficient. The extraction method does not need to build a DOM tree or recurse over a tree structure. A few passes over the HTML suffice to complete preprocessing and obtain the text fragments, so extraction is very fast.
(2) General. The invention does not depend on a specific page layout and makes few assumptions about web page tags, comments, and the like, so it generalizes well. Experiments show good extraction results on pages from news portals, personal blogs, and community forums.
(3) Accurate. Because the method scores text blocks, it distinguishes and filters out non-body content well, and the strategy of iteratively merging outward from the best text block minimizes the omission of high-quality body text.
(4) Robust. The method makes no assumptions about the correctness of the HTML, so it can extract body text even from non-standard HTML.
Brief description of the drawings
Fig. 1 is a flowchart of the web page body text extraction method of the present invention.
Embodiments
The specific embodiments of the invention are described in detail below with reference to the drawing and examples.
Referring to Fig. 1, the method is divided into two main stages: (1) text block separation; (2) statistical filtering and merging. The purpose of text block separation is to separate body-text segments from non-body segments; the purpose of statistical filtering and merging is to screen the body-text segments. Text block separation comprises three steps: extracting the web page title with a regular expression, preprocessing the web page, and dynamically dividing the text into text blocks. Statistical filtering and merging comprises two steps: selecting the best text block and iteratively expanding it.
First, the web page title is extracted from the original page in preparation for the later scoring of text blocks. The title can be extracted with general-purpose regular expressions, for example the following extraction rules in Python:
_title = re.compile(r'<title>(.*?)</title>', re.I|re.S)
and
_title = re.compile(r'<h1>(.*?)</h1>', re.I|re.S)
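As a minimal sketch of how these two rules might be applied, the snippet below tries the title rule first and falls back to the h1 rule. The function name extract_title, the fallback order, and the whitespace cleanup are assumptions for illustration; the text gives only the two patterns.

```python
import re

# The two patterns come from the text above. Trying <title> before <h1>
# is an assumption; the text does not say how the rules are combined.
_TITLE_RE = re.compile(r'<title>(.*?)</title>', re.I | re.S)
_H1_RE = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)


def extract_title(html: str) -> str:
    """Return the first <title> match, else the first <h1> match, else ''."""
    for pattern in (_TITLE_RE, _H1_RE):
        match = pattern.search(html)
        if match:
            # re.S lets the match span several lines, so collapse whitespace.
            return ' '.join(match.group(1).split())
    return ''
```

Note that the h1 pattern as written matches only a bare `<h1>` tag; a heading with attributes would need a looser pattern.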
After the title is extracted, the web page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and web page tags and obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS (cascading style sheets), and comments; these never appear in the body text and do not help extraction. Because the HTML of different websites varies widely, with some pages containing line breaks and others not, the original line breaks in the page are removed before text fragments are extracted. To separate the page text and distinguish body text from non-body content, the tags in the page are also removed, and each removed tag is replaced with a new line break. After this step, the page is a sequence of continuous text fragments, one per line, that differ in size and in the spacing between them.
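The preprocessing step could be sketched roughly as follows. The helper name preprocess and the exact regular expressions are assumptions; the text specifies only what is removed and that each tag is replaced with a line break.

```python
import re

# Hypothetical patterns: the text names scripts, style sheets and comments
# as irrelevant content but does not give the expressions used.
_IRRELEVANT_RE = re.compile(
    r'<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->',
    re.I | re.S,
)
_TAG_RE = re.compile(r'<[^>]*>')


def preprocess(html: str) -> list:
    """Return the page as lines; empty strings mark where tags used to be."""
    # 1. Drop scripts, style sheets and comments together with their content.
    text = _IRRELEVANT_RE.sub('', html)
    # 2. Remove the page's own line breaks so they cannot affect spacing.
    text = text.replace('\r', '').replace('\n', '')
    # 3. Replace every remaining tag with a fresh line break.
    text = _TAG_RE.sub('\n', text)
    return [line.strip() for line in text.split('\n')]
```

Because every tag becomes one line break, densely tagged regions such as navigation bars end up separated by long runs of blank lines, which the next step exploits. For space-delimited languages one might replace the original line breaks with a space instead of deleting them; the text says only that they are removed.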
Next comes dynamic text block division, whose purpose is to aggregate text fragments that lie close together. Because different pages are structured differently, the sparseness of the text fragments produced by preprocessing varies greatly. To handle this, a dynamic threshold is set based on how sparse the text in the page is, and the page is divided into multiple text blocks using this threshold. Specifically: first, find all blank lines in the text fragments obtained and, for each run of blank lines, count how many consecutive blank lines it contains, which yields an array such as [1,1,3,3,1,1,5,1,1]; then sort the array in ascending order and select the value at the one-fifth position from the small end as the threshold; finally, use this threshold as the number of consecutive blank lines at which to split the text fragments, yielding multiple dynamic text blocks.
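One possible reading of this step is sketched below. Two details are assumptions rather than statements from the text: the threshold is taken at index len // 5 of the sorted array, and a new block starts only when a gap is strictly longer than the threshold. The function names are hypothetical.

```python
def blank_runs(lines):
    """Lengths of each run of consecutive blank lines between text lines."""
    runs, current, seen_text = [], 0, False
    for line in lines:
        if line.strip():
            if seen_text and current:
                runs.append(current)
            seen_text, current = True, 0
        else:
            current += 1
    return runs


def dynamic_threshold(runs):
    """Value one fifth of the way into the sorted runs (assumed: index len // 5)."""
    if not runs:
        return 0
    ordered = sorted(runs)
    return ordered[len(ordered) // 5]


def split_blocks(lines):
    """Group text lines into blocks; a gap longer than the threshold splits."""
    threshold = dynamic_threshold(blank_runs(lines))
    blocks, current, gap = [], [], 0
    for line in lines:
        if not line.strip():
            gap += 1
            continue
        if current and gap > threshold:  # assumed comparison: strictly greater
            blocks.append(current)
            current = []
        current.append(line.strip())
        gap = 0
    if current:
        blocks.append(current)
    return blocks
```

For the example array [1,1,3,3,1,1,5,1,1], the sorted array is [1,1,1,1,1,1,3,3,5] and this reading gives a threshold of 1, so fragments separated by a single blank line stay together while gaps of 3 or 5 blank lines start a new block.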
After the text blocks are separated, they are scored and the best one is selected. Specifically, for each text block the following are counted: its text length, number of hyperlinks, number of punctuation marks, number of stop words (words that often appear in non-body regions, such as "advertisement", "about us", and "friendly links"), similarity to the title, and relative position among all the text (for example, the first text block has position 1 and the last has position 0). These are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is scored, and the best text block is selected accordingly.
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
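The formula can be written as a small function, sketched here under stated assumptions. The BlockFeatures container and the floor parameter are hypothetical: the text does not say how a block with zero hyperlinks or zero stop words is handled, so this sketch clamps both divisors to at least 1, treating the features as raw counts.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """The six per-block statistics named in the text."""
    text_density: float
    link_density: float
    punctuation_density: float
    stopword_density: float
    title_match_rate: float
    position_rate: float


def score(f: BlockFeatures, floor: float = 1.0) -> float:
    """Score a block with the formula from the text.

    `floor` is an assumed guard, not part of the text: it keeps the two
    divisors from being zero.
    """
    link = max(f.link_density, floor)
    stop = max(f.stopword_density, floor)
    return (
        f.text_density
        * f.position_rate
        * (1 + f.title_match_rate)
        * math.sqrt(f.punctuation_density)
        / link
        / math.sqrt(stop)
    )
```

With this shape, a long, well-punctuated block with few links scores high, while a link-heavy block with no punctuation scores zero because of the sqrt(punctuation_density) factor.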
Starting from the best text block, the text block immediately before or after it is merged and the score is recomputed. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score no longer rises, and the text block finally obtained is the body text to be extracted. The number of iterations is bounded and cannot exceed the number of text blocks. In this way as much high-quality body text as possible is merged, reducing omissions.
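The expansion loop might look like the following sketch. The function name expand, the scoring callable passed as a parameter, joining blocks with a line break, and treating an unchanged score as "no improvement" are all assumptions; the text describes only the merge-and-rescore behaviour and the iteration bound.

```python
from typing import Callable, Sequence


def expand(blocks: Sequence[str], best: int,
           score: Callable[[str], float]) -> str:
    """Grow the best block by absorbing neighbours while the score rises."""
    lo = hi = best                    # the merged range is blocks[lo..hi]
    merged = blocks[best]
    current = score(merged)
    for _ in range(len(blocks)):      # bounded by the number of text blocks
        improved = False
        if lo > 0:                    # try the block in front
            candidate = blocks[lo - 1] + '\n' + merged
            s = score(candidate)
            if s > current:
                lo, merged, current, improved = lo - 1, candidate, s, True
        if hi < len(blocks) - 1:      # try the block behind
            candidate = merged + '\n' + blocks[hi + 1]
            s = score(candidate)
            if s > current:
                hi, merged, current, improved = hi + 1, candidate, s, True
        if not improved:              # neither neighbour helped, so stop
            break
    return merged
```

Passing the scoring function in keeps the sketch independent of how the six statistics are computed for a merged block, which the text leaves open.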
The above embodiments only illustrate the technical ideas and features of this patent. Their purpose is to enable those skilled in the art to understand and implement its content; they do not limit the scope of its claims. All equivalent changes or modifications made in accordance with the spirit disclosed in this patent still fall within the scope of its claims.

Claims (5)

1. A method for extracting the body text of a web page, characterized in that it comprises the following steps:
Step 1: extracting the web page title with a regular expression;
Step 2: preprocessing the web page;
Step 3: dynamically dividing the text into text blocks;
Step 4: scoring the text blocks and selecting the best text block;
Step 5: iteratively expanding the text block.
2. The method according to claim 1, characterized in that step 2 comprises: removing irrelevant characters and web page tags to obtain the continuous text fragments in the page;
removing the web page tags comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;
the irrelevant characters include, but are not limited to, page scripts, CSS (cascading style sheets), and comments.
3. The method according to claim 1 or 2, characterized in that step 3 comprises:
(1) finding all blank lines in the text fragments obtained and, for each run of blank lines, counting how many consecutive blank lines it contains, thereby obtaining an array;
(2) sorting the array in ascending order and selecting the value at the one-fifth position from the small end as the threshold;
(3) using this threshold as the number of consecutive blank lines at which to split the text fragments, thereby obtaining multiple dynamic text blocks.
4. The method according to claim 1, characterized in that step 4 comprises: for each text block, counting its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position among all the text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively; scoring each text block; and selecting the best text block accordingly;
the scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
5. The method according to claim 1, characterized in that step 5 comprises: starting from the best text block, merging the text block immediately before or after it and recomputing the score; if the score rises, taking the merged block as the new best text block, and if the score falls, discarding the merge; repeating until the score no longer rises, the text block finally obtained being the body text to be extracted.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510017223.3A CN104598577B (en) 2015-01-14 2015-01-14 A kind of extracting method of Web page text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510017223.3A CN104598577B (en) 2015-01-14 2015-01-14 A kind of extracting method of Web page text

Publications (2)

Publication Number Publication Date
CN104598577A true CN104598577A (en) 2015-05-06
CN104598577B CN104598577B (en) 2017-09-15

Family

ID=53124362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510017223.3A Active CN104598577B (en) 2015-01-14 2015-01-14 A kind of extracting method of Web page text

Country Status (1)

Country Link
CN (1) CN104598577B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4606439B2 (en) * 2007-05-28 2011-01-05 モバイダーズ・インコーポレイテッド File conversion apparatus and method for converting HTML file into flash image
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content
CN102810097A (en) * 2011-06-02 2012-12-05 高德软件有限公司 Method and device for extracting webpage text content
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320734A (en) * 2015-07-14 2016-02-10 中国互联网络信息中心 Web page core content extraction method
CN105320734B (en) * 2015-07-14 2019-02-22 中国互联网络信息中心 A kind of web page core content extracting method
CN105740355A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Aggregated text density based webpage body text extraction method and apparatus
CN105740355B (en) * 2016-01-26 2019-03-26 中国人民解放军国防科学技术大学 Webpage context extraction method and device based on aggregation text density
CN106844441A (en) * 2016-12-15 2017-06-13 北京容联光辉科技有限公司 A kind of method and device of Information Sharing
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density
CN107273491A (en) * 2017-06-15 2017-10-20 华中师范大学 Webpage splitting method, device and electronic equipment
CN107273491B (en) * 2017-06-15 2020-07-24 华中师范大学 Webpage segmentation method and device and electronic equipment
CN110020312A (en) * 2017-12-11 2019-07-16 北京京东尚科信息技术有限公司 The method and apparatus for extracting Web page text
CN108897749A (en) * 2018-04-19 2018-11-27 中国科学院计算技术研究所 Method for abstracting web page information and system based on syntax tree and text block density
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN108763591B (en) * 2018-06-21 2021-01-08 湖南星汉数智科技有限公司 Webpage text extraction method and device, computer device and computer readable storage medium
CN109635219A (en) * 2018-12-05 2019-04-16 云孚科技(北京)有限公司 A kind of webpage content extracting method
CN110795933A (en) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 Method and device for identifying and processing webpage text
CN110795933B (en) * 2019-09-30 2023-10-31 奇安信科技集团股份有限公司 Webpage text recognition processing method and device
CN113051390A (en) * 2019-12-26 2021-06-29 百度在线网络技术(北京)有限公司 Knowledge base construction method and device, electronic equipment and medium
CN113051390B (en) * 2019-12-26 2023-09-26 百度在线网络技术(北京)有限公司 Knowledge base construction method, knowledge base construction device, electronic equipment and medium
CN113537091A (en) * 2021-07-20 2021-10-22 东莞市盟大塑化科技有限公司 Webpage text recognition method and device, electronic equipment and storage medium
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium
CN115408594A (en) * 2022-11-01 2022-11-29 长沙火线云网络科技有限公司 Webpage title extraction method and system

Also Published As

Publication number Publication date
CN104598577B (en) 2017-09-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant