CN104598577B - Method for extracting the body text of a web page - Google Patents

Publication number
CN104598577B
CN104598577B CN201510017223.3A CN201510017223A CN104598577B CN 104598577 B CN104598577 B CN 104598577B CN 201510017223 A CN201510017223 A CN 201510017223A CN 104598577 B CN104598577 B CN 104598577B
Authority
CN
China
Prior art keywords
text
text block
web page
block
density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510017223.3A
Other languages
Chinese (zh)
Other versions
CN104598577A (en)
Inventor
汤奇峰
刘作涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Original Assignee
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority to CN201510017223.3A
Publication of CN104598577A
Application granted
Publication of CN104598577B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method for extracting the body text of a web page, comprising the following steps: step 1, extract the page title with a regular expression; step 2, preprocess the page; step 3, dynamically divide the text into blocks; step 4, score the text blocks and select the best one; step 5, iteratively expand the best text block. The method extracts quickly, performs well on news portals, personal blogs, forum communities, and other kinds of pages, and is accurate and robust.

Description

Method for extracting the body text of a web page
Technical field
The present invention relates to the field of information processing, and in particular to a general-purpose, efficient method for extracting the body text of a web page.
Technical background
Web page body-text extraction refers to methods by which a computer system identifies structured text information, such as the title and the body text, in internet pages that are unstructured and differ in content and layout. For web information retrieval systems, and search engines in particular, body-text extraction is a very important basic module. If extraction is inaccurate, for example if large parts of the body are omitted or non-body content is identified as body text, then the subsequent matching against query terms cannot be guaranteed to be accurate, and it becomes difficult to meet users' needs.
In recent years, chat and sharing platforms such as WeChat and personalized reading applications have become widespread. These applications need to extract the body text of third-party web pages so that the content can be adapted to screens of various sizes, improving the user experience. In addition, the development of technologies such as precision advertising has increased the demand for big-data text mining, and a precondition of big-data text mining is the ability to identify the body text of the pages behind the massive number of URLs on the internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities, and so on; the data volume is large and the content is wide-ranging.
These demands show how important body-text extraction is, and they also require an extraction method to be fast, accurate, and general. Existing extraction methods can be summarized as follows:
(1) Rule-based approaches. Extraction rules, such as regular expressions or XPath expressions, are specified manually for a particular website. The advantage is high accuracy; the disadvantages are that only pages of a fixed website or fixed format can be parsed, that writing the rules is time-consuming and laborious, and that once the page layout changes it is hard to detect the change and update the rules.
(2) Parsing the HTML DOM (Document Object Model) tree. A DOM tree is built for the HTML page and traversed; non-body information is identified and removed, and the body text is extracted according to rules such as page layout and text density. The biggest drawback is that building the DOM tree is slow, so extraction efficiency is very low. Moreover, as web authoring techniques evolve, the HTML of many websites is becoming more complex and less standards-compliant; this kind of parsing is not only slow but also has some probability of failing to build the DOM tree at all, causing extraction to fail.
(3) Approaches based on implicit annotations and visual information in the HTML (such as text color and font). These are not general enough and still require many manual rules. They usually work well only on pages such as news portals, and have a low success rate on pages whose style varies widely, such as personal blogs.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a general-purpose, efficient method for extracting the body text of a web page. It processes pages efficiently, extracts the body text, and is well suited to tasks such as search engines and big-data text mining.
The object of the present invention is achieved through the following technical solution:
A method for extracting the body text of a web page, comprising the following steps:
Step 1: extract the page title with a regular expression.
Step 2: preprocess the page. Remove irrelevant characters and web page tags to obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments. Removing the tags involves: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break.
Step 3: dynamically divide the text into blocks. (1) Count all blank lines in the text fragments obtained, checking how many consecutive blank lines each run of blank lines contains, to obtain an array of run lengths; (2) sort the array in ascending order and select the value at the one-fifth position (the 1/5-th smallest value) as the threshold; (3) split the text fragments wherever this threshold number of consecutive blank lines occurs, obtaining multiple dynamic text blocks.
Step 4: score the text blocks and select the best one. For each text block, collect its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is then scored; the higher the score, the better the block is as body text, and the best text block is selected on this basis.
The score of a text block is computed as:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)
Step 5: iteratively expand the text block. Starting from the best text block, merge the text block immediately before or after it and recompute the score. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score can no longer be raised; the text block finally obtained is the body text to be extracted.
In summary, by adopting the above technical solution, the present invention has the following beneficial effects compared with the prior art:
(1) Efficient. The method does not need to build a DOM tree or recurse over a tree structure; preprocessing and the collection of text fragments are completed in only a few scans of the HTML, so extraction is very fast.
(2) General. The method does not depend on a specific page layout and does not need to make many assumptions about tags, comments, and the like, so it generalizes well. Experiments show good extraction results on all kinds of pages, whether news portals, personal blogs, or forum communities.
(3) Accurate. Because the method scores text blocks, non-body content can be distinguished and filtered out well, and the strategy of iterative merging starting from the best text block keeps omissions of high-quality body text to a minimum.
(4) Robust. The method makes no assumptions about the correctness of the HTML, so body text can be extracted even from non-standard HTML.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention for extracting the body text of a web page.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawing and examples.
Referring to Fig. 1, the method of the present invention for extracting the body text of a web page is divided into two main stages: first, text block separation; second, statistical filtering and merging. The purpose of text block separation is to separate body-text segments from non-body segments; the purpose of statistical filtering and merging is to screen the body-text segments. Text block separation comprises three steps: extracting the page title with a regular expression, preprocessing the page, and dynamically dividing the text into blocks. Statistical filtering and merging comprises two steps: selecting the best text block and iteratively expanding it.
The page title must first be extracted from the original page, in preparation for the later scoring of text blocks. This can be done with general-purpose regular expressions, for example the following extraction rules in Python:
_title = re.compile(r'<title>(.*)</title>', re.I | re.S)
and
_title = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)
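The two rules above can be combined into a small helper. The following is a minimal sketch, not the patent's implementation: the function name extract_title, the fallback order (title element first, then h1), and the whitespace clean-up are assumptions made for illustration. It also uses a non-greedy group in both patterns, a small deviation from the first rule as quoted, so that a page containing more than one closing title tag is not over-matched.

```python
import re

# The two patterns quoted in the text. Using a non-greedy group for <title>
# is an assumption of this sketch, not something the source specifies.
_TITLE_RE = re.compile(r'<title>(.*?)</title>', re.I | re.S)
_H1_RE = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)


def extract_title(html: str) -> str:
    """Return the page title, or an empty string when neither pattern matches."""
    # Assumed fallback order: prefer <title>, then the first <h1>.
    for pattern in (_TITLE_RE, _H1_RE):
        match = pattern.search(html)
        if match:
            # Collapse internal whitespace so multi-line titles become one line.
            return " ".join(match.group(1).split())
    return ""
```

Note that the h1 pattern, as quoted, only matches a bare opening tag; an h1 element carrying attributes would need a broader pattern.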
After the title has been extracted, the page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and web page tags and obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments; these characters are not part of the body text and do not help its extraction. Because the HTML of different websites varies greatly, some with line breaks and some without, the original line breaks in the page are removed before the text fragments are extracted. To separate the page text, so that body and non-body text can be distinguished, the tags in the page are also removed, each being replaced with a new line break. After this step, the page has become a sequence of lines of continuous text fragments that differ in size and in how far apart they are.
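The preprocessing step can be sketched with a few regular-expression passes over the raw HTML. This is an illustrative sketch only: the function name preprocess and the exact patterns are assumptions, and original line breaks are replaced with a space rather than deleted outright, a deliberate deviation so that words in space-delimited languages are not fused together.

```python
import re
from typing import List

# Patterns for content that never belongs to the body text (an assumed set).
_SCRIPT_RE = re.compile(r'<script\b.*?</script\s*>', re.I | re.S)
_STYLE_RE = re.compile(r'<style\b.*?</style\s*>', re.I | re.S)
_COMMENT_RE = re.compile(r'<!--.*?-->', re.S)
_TAG_RE = re.compile(r'<[^>]*>')


def preprocess(html: str) -> List[str]:
    """Return the page as a list of lines; empty strings are blank lines."""
    for pattern in (_SCRIPT_RE, _STYLE_RE, _COMMENT_RE):
        html = pattern.sub('', html)
    # (1) Drop the line breaks that were in the source HTML
    #     (replaced by a space here; see the note above).
    html = html.replace('\r', ' ').replace('\n', ' ')
    # (2) Replace every remaining tag with a fresh line break.
    text = _TAG_RE.sub('\n', html)
    return [line.strip() for line in text.split('\n')]
```

Because every tag becomes a line break, densely tagged regions such as navigation bars produce long runs of blank lines, while paragraphs of body text stay close together. The next step relies on that difference in spacing.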
Next, the text is dynamically divided into blocks. The purpose is to aggregate text fragments that lie close together. Because different pages are structured differently, the sparsity of the text fragments produced by preprocessing varies greatly. To address this, a dynamic threshold is set based on the sparsity of the text in the page, and the page is divided into multiple text blocks using this threshold. Specifically: first count all blank lines in the text fragments obtained, checking how many consecutive blank lines each run contains, which yields an array such as [1, 1, 3, 3, 1, 1, 5, 1, 1]; then sort the array in ascending order and select the value at the one-fifth position (the 1/5-th smallest value) as the threshold; finally, split the text fragments wherever this threshold number of consecutive blank lines occurs, obtaining multiple dynamic text blocks.
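The dynamic division can be sketched as follows. Several details are assumptions, because the text leaves them open: the threshold is taken as the element at index len // 5 of the sorted run lengths, a run splits the text when its length is at least the threshold (a literal reading; a strict comparison would be a one-character change), and leading or trailing blank lines are ignored. The function names are hypothetical.

```python
from typing import List


def blank_run_lengths(lines: List[str]) -> List[int]:
    """Lengths of the runs of blank lines that sit between non-blank lines."""
    runs, current, seen_text = [], 0, False
    for line in lines:
        if line.strip():
            if seen_text and current:
                runs.append(current)
            current, seen_text = 0, True
        elif seen_text:
            current += 1
    return runs


def dynamic_threshold(runs: List[int]) -> int:
    """Value one fifth of the way through the sorted run lengths (assumed indexing)."""
    if not runs:
        return 1  # assumed default for pages without any blank runs
    return sorted(runs)[len(runs) // 5]


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Group non-blank lines, starting a new block at each long enough blank run."""
    threshold = dynamic_threshold(blank_run_lengths(lines))
    blocks, current, blanks = [], [], 0
    for line in lines:
        if not line.strip():
            blanks += 1
            continue
        # Assumption: a run of at least `threshold` blank lines separates blocks.
        if current and blanks >= threshold:
            blocks.append(current)
            current = []
        current.append(line)
        blanks = 0
    if current:
        blocks.append(current)
    return blocks
```

For the example array in the text, sorting gives [1, 1, 1, 1, 1, 1, 3, 3, 5], and under the indexing assumed here the threshold is 1.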
After the text blocks have been separated, they are scored and the best one is selected. Specifically: for each text block, collect its text length, number of hyperlinks, number of punctuation marks, number of stop words (words that frequently appear in non-body regions, such as "advertisement", "about us", and "friendly links"), similarity to the title, and relative position within the whole text (for example, the first text block has position 1 and the last has position 0). These are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is scored, and the best text block is selected on this basis.
The score of a text block is computed as:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)
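The score formula translates directly into code. In the sketch below, the BlockFeatures container and the eps smoothing term are assumptions introduced for illustration: the formula as written divides by link_density and by the square root of stopword_density, which is undefined for a block with no links or no stop words, and the text does not say how that case is handled. Here eps is added to both denominators.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    text_density: float          # text length of the block
    link_density: float          # number of hyperlinks
    punctuation_density: float   # number of punctuation marks
    stopword_density: float      # number of non-body words such as "about us"
    title_match_rate: float      # similarity to the page title
    position_rate: float         # 1 for the first block, 0 for the last


def score(f: BlockFeatures, eps: float = 1.0) -> float:
    """Evaluate the score formula; eps (an assumption) avoids division by zero."""
    return (
        f.text_density
        * f.position_rate
        * (1 + f.title_match_rate)
        * math.sqrt(f.punctuation_density)
        / (f.link_density + eps)
        / math.sqrt(f.stopword_density + eps)
    )
```

With these definitions, a long, punctuated block without links scores far above a short, link-heavy navigation block, which matches the intent described in the text. Because position_rate is 0 for the last block, that block always scores 0 under the formula as given.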
Starting from the best text block, merge the text block immediately before or after it and recompute the score. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score can no longer be raised; the text block finally obtained is the body text to be extracted. The number of iterations is finite and does not exceed the number of text blocks. In this way as much high-quality body text as possible is merged, reducing omissions.
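The expansion loop can be sketched as a greedy search over a contiguous span of blocks. The function name expand_best_block, the representation of blocks as strings, the order in which the two neighbours are tried (front first, then back), and the score_fn parameter are assumptions; any scoring function over merged text can be passed in.

```python
from typing import Callable, Sequence, Tuple


def expand_best_block(
    blocks: Sequence[str],
    best: int,
    score_fn: Callable[[str], float],
) -> Tuple[int, int]:
    """Grow the span [lo, hi] around blocks[best] while the score keeps rising."""
    lo = hi = best
    current = score_fn(blocks[best])
    improved = True
    while improved:
        improved = False
        # Try merging the block in front of the current span.
        if lo > 0:
            candidate = score_fn("\n".join(blocks[lo - 1:hi + 1]))
            if candidate > current:
                lo, current, improved = lo - 1, candidate, True
        # Try merging the block behind the current span.
        if hi < len(blocks) - 1:
            candidate = score_fn("\n".join(blocks[lo:hi + 2]))
            if candidate > current:
                hi, current, improved = hi + 1, candidate, True
    return lo, hi
```

Each pass either extends the span by at least one block or stops, so the loop runs at most as many times as there are blocks, consistent with the bound stated above. The extracted body text is then the join of blocks[lo:hi + 1].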
The embodiment described above merely illustrates the technical idea and features of this patent; its purpose is to enable those skilled in the art to understand and implement the content of this patent. It does not limit the scope of this patent; that is, all equivalent changes or modifications made in accordance with the spirit disclosed in this patent still fall within the scope of its claims.

Claims (5)

1. A method for extracting the body text of a web page, characterized in that it comprises the following steps:
Step 1: extract the page title with a regular expression;
Step 2: preprocess the page to obtain the continuous text fragments in the page;
Step 3: dynamically divide the text into blocks;
Step 4: score each text block according to a predetermined rule, based on its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text, and select the best text block;
Step 5: iteratively expand the best text block according to a predetermined rule; the expanded best text block is the body text.
2. The method for extracting the body text of a web page according to claim 1, characterized in that step 2 comprises: removing irrelevant characters and web page tags to obtain the continuous text fragments in the page;
removing the web page tags specifically comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;
the irrelevant characters include, but are not limited to, page scripts, CSS, and comments.
3. The method for extracting the body text of a web page according to claim 1 or 2, characterized in that step 3 comprises:
(1) counting all blank lines in the text fragments obtained, checking how many consecutive blank lines each run of blank lines contains, to obtain an array;
(2) sorting the array in ascending order and selecting the value at the one-fifth position (the 1/5-th smallest value) as the threshold;
(3) splitting the text fragments wherever this threshold number of consecutive blank lines occurs, obtaining multiple dynamic text blocks.
4. The method for extracting the body text of a web page according to claim 1, characterized in that, in step 4, the text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text of each text block are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively,
and the score of a text block is computed as:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
5. The method for extracting the body text of a web page according to claim 1, characterized in that step 5 comprises: starting from the best text block obtained, merging the text block immediately before or after it and recomputing the score; if the score rises, the merged block becomes the new best text block; if the score falls, the merge is discarded; this is repeated until the score can no longer be raised, and the text block finally obtained is the body text to be extracted.
CN201510017223.3A 2015-01-14 2015-01-14 A kind of extracting method of Web page text Active CN104598577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510017223.3A CN104598577B (en) 2015-01-14 2015-01-14 A kind of extracting method of Web page text

Publications (2)

Publication Number Publication Date
CN104598577A (en) 2015-05-06
CN104598577B (en) 2017-09-15

Family

ID=53124362

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant