CN104598577B - A kind of extracting method of Web page text - Google Patents
A kind of extracting method of Web page text
- Publication number
- CN104598577B (application CN201510017223.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- text block
- web page
- block
- density
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
Abstract
A method for extracting the body text of a web page, comprising the following steps: step 1, extract the web page title with a regular expression; step 2, preprocess the web page; step 3, dynamically divide the text into blocks; step 4, score the text blocks and choose the best one; step 5, iteratively expand the best text block. The method extracts quickly, works well on news portals, personal blogs, forum communities, and other kinds of web pages, and offers high accuracy and good robustness.
Description
Technical field
The present invention relates to the field of information processing, and in particular to a general and efficient method for extracting the body text of a web page.
Technical background
Web page body-text extraction refers to methods by which a computer system identifies structured text information, such as the title and the body text, in Internet web pages that are unstructured and vary in content and layout. For web information retrieval systems, and search engines in particular, body-text extraction is a very important basic module. If extraction is inaccurate, for example if large parts of the body are omitted or non-body content is identified as body text, then the subsequent matching against query terms cannot be guaranteed to be accurate, and it becomes difficult to meet users' needs.
In recent years, chat and sharing platforms such as WeChat and personalized reading applications have become widespread. These applications need to extract the body text from third-party web pages so that the content can be adapted to screens of various sizes and the user experience improved. Meanwhile, the development of technologies such as targeted advertising has raised the demand for big-data text mining, and one prerequisite for such mining is the ability to identify the body text of the web pages behind the massive number of URLs on the Internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities, and so on; the data volume is large and the content covers a wide range of topics.
These demands show how important body-text extraction is, and they also require an extraction method to be fast, accurate, and general. Existing extraction methods fall into the following categories:
(1) Rule-based methods. Extraction rules, such as regular expressions or XPath expressions, are written by hand for a specific website. The advantage is high accuracy; the disadvantages are that only pages from a fixed website or with a fixed format can be parsed, that writing the rules is time-consuming and laborious, and that once the page layout changes, the breakage is hard to detect and the rules are hard to maintain.
(2) Parsing the HTML DOM (Document Object Model) tree. A DOM tree is built for the HTML page and traversed; non-body information is identified and removed, and the body text is extracted according to rules based on page layout, text density, and the like. The biggest drawback is that building the DOM tree is slow, so extraction is very inefficient. Moreover, as web authoring techniques evolve, the HTML of many websites is becoming more complex and less standards-compliant; this approach is then not only slow but also has some probability of failing to build a DOM tree at all, which makes extraction fail.
(3) Methods based on implicit annotations and visual information in the HTML (such as text color and font). These methods are not general enough and still require many hand-written rules. They usually work well only on news portals and similar sites; on pages whose style varies widely, such as personal blogs, the success rate is low.
Summary of the invention
To overcome the deficiencies of the prior art, the invention provides a general and efficient method for extracting the body text of a web page. It processes web pages efficiently, extracts the body text, and is well suited to tasks such as search engines and big-data text mining.
The object of the invention is achieved through the following technical solution:
A method for extracting the body text of a web page, comprising the following steps:
Step 1: extract the web page title with a regular expression.
Step 2: preprocess the web page by removing irrelevant characters and web page tags to obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments. Tag removal specifically comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break.
Step 3: dynamically divide the text into blocks. (1) Count all blank lines in the obtained text fragments, checking how many consecutive blank lines each run contains, to obtain an array; (2) sort the array in ascending order and select the value at the one-fifth position (counting from the smallest) as the threshold; (3) split the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.
Step 4: score the text blocks and choose the best one. Specifically, for each text block, count the text length, the number of hyperlinks, the number of punctuation marks, the number of stop words, the similarity to the title, and the relative position of the block within the whole text, denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is then scored; the higher the score, the better the block is as body text, and the best text block is chosen accordingly.
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
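Written as a single fraction, the scoring formula reads as follows. This is only a typeset restatement of the same expression, using the six quantity names from the text; nothing is added to it.

```latex
y = \frac{\text{text\_density} \cdot \text{position\_rate} \cdot \left(1 + \text{title\_match\_rate}\right) \cdot \sqrt{\text{punctuation\_density}}}
         {\text{link\_density} \cdot \sqrt{\text{stopword\_density}}}
```

In this form it is easy to see that the score grows with text length, position, title similarity, and punctuation count, and shrinks with hyperlink and stop-word counts. The expression is undefined when link_density or stopword_density is zero; the text does not say how that case is handled.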
Step 5: iteratively expand the text block. Starting from the best text block obtained, merge the text block immediately before or after it and recompute the score. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is abandoned. This is repeated until the score can no longer be raised, and the resulting text block is the body text to be extracted.
In summary, by adopting the above technical solution, the invention has the following beneficial effects compared with the prior art:
(1) Efficient. The extraction method does not need to build a DOM tree or recursively traverse a tree structure; a few scans of the HTML suffice to complete preprocessing and obtain the text fragments, so extraction is fast.
(2) General. The invention does not depend on a specific page layout and does not need to make many assumptions about web page tags, comments, and the like, so it generalizes well. Experiments show good extraction results on all kinds of pages, whether news portals, personal blogs, or forum communities.
(3) Accurate. Because the method scores text blocks, it can distinguish and filter out non-body content well, and the strategy of iteratively merging outward from the best text block minimizes the omission of high-quality body text.
(4) Robust. The method makes no assumption about the correctness of the HTML, so body text can be extracted even from non-standard HTML.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention for extracting the body text of a web page.
Embodiment
Specific embodiments of the invention are described in detail below with reference to the drawing and examples.
Referring to Fig. 1, the method is divided into two main stages: first, text block separation; second, statistical filtering and merging. The purpose of text block separation is to separate body segments from non-body segments; the purpose of statistical filtering and merging is to screen the body segments. Text block separation comprises three steps: extracting the web page title with a regular expression, preprocessing the web page, and dynamically dividing the text into blocks. Statistical filtering and merging comprises two steps: choosing the best text block and iteratively expanding it.
First, the web page title is extracted from the original page in preparation for the later scoring of text blocks. This can be done with general-purpose regular expressions, for example the following two extraction rules in Python, one for the <title> element and one for the <h1> element (the groups are non-greedy so that a match stops at the first closing tag):
_title = re.compile(r'<title>(.*?)</title>', re.I | re.S)
_title = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)
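The title step can be sketched as a small function. This is a minimal illustration rather than the patent's implementation: the function name `extract_title`, the `[^>]*` allowance for tag attributes, the entity decoding, and the choice to try `<title>` before `<h1>` are all assumptions added here.

```python
import html
import re

# Patterns modelled on the two rules quoted in the text. The [^>]* part,
# which tolerates attributes such as <h1 class="...">, is an assumption.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def extract_title(page: str) -> str:
    """Return the <title> text, falling back to the first <h1>, else ''."""
    for pattern in (TITLE_RE, H1_RE):
        match = pattern.search(page)
        if match:
            # Drop nested tags, decode entities, collapse whitespace.
            text = html.unescape(TAG_RE.sub("", match.group(1)))
            text = " ".join(text.split())
            if text:
                return text
    return ""
```

Preferring `<title>` over `<h1>` is a design choice of this sketch; the text lists the two patterns without ranking them.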
After the title is extracted, the web page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and web page tags and obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments; these characters are not part of the body text and do not help extraction. Because the HTML of different websites varies widely, with some containing line breaks and some not, the original line breaks in the page are removed before text fragments are extracted. To separate the page text so that body and non-body content can be told apart, the tags in the page are also removed, and each removed tag is replaced with a new line break. After this step, the text is available as line-by-line fragments of varying size, separated by gaps of varying length.
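The preprocessing described above can be sketched with three regular-expression passes. This is a hedged illustration under stated assumptions: the name `preprocess` and the patterns are hypothetical, and original line breaks are replaced by a space rather than deleted outright so that space-delimited languages keep their word boundaries (the text only says the line breaks are removed).

```python
import re
from typing import List

# Content that is never body text: scripts, style sheets, and comments.
NOISE_RE = re.compile(
    r"<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->",
    re.I | re.S,
)
TAG_RE = re.compile(r"<[^>]*>")


def preprocess(page: str) -> List[str]:
    """Return the page as lines; empty strings mark where tags used to be."""
    text = NOISE_RE.sub("", page)
    # (1) Neutralise the page's own line breaks (assumption: use a space).
    text = re.sub(r"[\r\n]+", " ", text)
    # (2) Replace every remaining tag with a fresh line break.
    text = TAG_RE.sub("\n", text)
    return [line.strip() for line in text.split("\n")]
```

Because every tag becomes one line break, deeply nested markup between two fragments yields a longer run of blank lines than a simple paragraph boundary, which is the signal the block-division step relies on.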
Next, the text is dynamically divided into blocks. The purpose is to aggregate text fragments that lie close together. However, because different pages are structured differently, the sparseness of the text fragments produced by preprocessing varies greatly. To solve this problem, a dynamic threshold is set according to how sparse the text in the page is, and the page is divided into multiple text blocks using this threshold. Specifically: first count all blank lines in the obtained text fragments, checking how many consecutive blank lines each run contains, to obtain an array such as [1,1,3,3,1,1,5,1,1]; then sort the array in ascending order and select the value at the one-fifth position (counting from the smallest) as the threshold; finally, split the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.
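The three sub-steps above can be sketched as follows. Several details are assumptions of this sketch because the text leaves them open: the threshold is taken at 0-based index `n // 5` of the sorted run lengths, a run of at least `threshold` blank lines separates two blocks, and the threshold defaults to 1 when the page has no blank runs. All function names are hypothetical.

```python
from typing import List


def blank_runs(lines: List[str]) -> List[int]:
    """Lengths of each run of consecutive blank lines between text lines."""
    runs: List[int] = []
    current, seen_text = 0, False
    for line in lines:
        if line.strip():
            if seen_text and current:
                runs.append(current)
            seen_text, current = True, 0
        else:
            current += 1
    return runs


def dynamic_threshold(runs: List[int]) -> int:
    """Value one fifth of the way into the ascending-sorted run lengths."""
    if not runs:
        return 1  # assumption: fallback when there are no blank runs
    ordered = sorted(runs)
    return ordered[len(ordered) // 5]  # assumption: 0-based index n // 5


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Start a new block wherever a blank run reaches the threshold."""
    threshold = dynamic_threshold(blank_runs(lines))
    blocks: List[List[str]] = []
    current: List[str] = []
    gap = 0
    for line in lines:
        if not line.strip():
            gap += 1
            continue
        if current and gap >= threshold:
            blocks.append(current)
            current = []
        current.append(line.strip())
        gap = 0
    if current:
        blocks.append(current)
    return blocks
```

With the example array from the text, `dynamic_threshold([1, 1, 3, 3, 1, 1, 5, 1, 1])` picks index 1 of the sorted array, which is 1, so on such a page every blank line separates blocks and only directly adjacent lines stay together.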
After text block separation is complete, the text blocks are scored and the best one is chosen. Specifically: for each text block, count the text length, the number of hyperlinks, the number of punctuation marks, the number of stop words (words that frequently appear in non-body regions, such as "advertisement", "about us", and "friendly links"), the similarity to the title, and the relative position of the block within the whole text (for example, 1 for the first block and 0 for the last). These are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively. Each text block is scored, and the best text block is chosen accordingly.
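One way to collect the six statistics for a block is sketched below. The text names the quantities but not how each is computed, so everything concrete here is an assumption: the punctuation set, the three-entry stop-word list, the character-overlap measure of title similarity, the linear position scale, and the idea that the hyperlink count is recorded during preprocessing and passed in as `link_count`.

```python
from dataclasses import dataclass

PUNCTUATION = set(",.;:!?，。；：！？、")  # illustrative set only
STOPWORDS = ("advertisement", "about us", "friendly link")  # illustrative only


@dataclass
class BlockStats:
    text_density: float
    link_density: float
    punctuation_density: float
    stopword_density: float
    title_match_rate: float
    position_rate: float


def block_stats(text: str, link_count: int, index: int, total: int,
                title: str) -> BlockStats:
    """Collect the six statistics for block number `index` of `total`."""
    lowered = text.lower()
    title_chars = set(title.lower()) - {" "}
    # Assumption: similarity = share of title characters found in the block.
    overlap = (len(title_chars & set(lowered)) / len(title_chars)
               if title_chars else 0.0)
    # First block scores 1, last block 0, linear in between.
    position = 1.0 if total <= 1 else 1.0 - index / (total - 1)
    return BlockStats(
        text_density=float(len(text)),
        link_density=float(link_count),
        punctuation_density=float(sum(ch in PUNCTUATION for ch in text)),
        stopword_density=float(sum(lowered.count(w) for w in STOPWORDS)),
        title_match_rate=overlap,
        position_rate=position,
    )
```

A production system would likely use a proper stop-word lexicon and a token-level similarity; the sketch only shows where each named quantity comes from.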
The scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
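The formula translates directly into code. In this sketch the function name `block_score` and the `floor` parameter are hypothetical; `floor` keeps the two divisors from being zero for blocks with no links or no stop words, a case the text does not address.

```python
import math


def block_score(text_density: float, position_rate: float,
                title_match_rate: float, punctuation_density: float,
                link_density: float, stopword_density: float,
                floor: float = 1.0) -> float:
    """Evaluate the scoring formula for one text block."""
    # Assumption: clamp the divisors so link-free blocks do not divide by zero.
    links = max(link_density, floor)
    stops = max(stopword_density, floor)
    return (text_density * position_rate * (1.0 + title_match_rate)
            * math.sqrt(punctuation_density) / links / math.sqrt(stops))
```

For example, a 400-character block at position 1.0 with title similarity 0.5, 16 punctuation marks, 2 links, and 4 stop words scores 400 * 1.0 * 1.5 * 4 / 2 / 2 = 600.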
Starting from the best text block obtained, merge the text block immediately before or after it and recompute the score. If the score rises, the merged block becomes the new best text block; if the score falls, the merge is abandoned. This is repeated until the score can no longer be raised; the resulting text block is the body text to be extracted. The number of iterations is finite and does not exceed the number of text blocks. In this way as much high-quality body text as possible is merged in, reducing omissions.
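The expansion loop can be sketched as a greedy search over a contiguous range of blocks. The function name `expand_best_block`, the representation of blocks as strings, and the `score` callable that rates any piece of text are assumptions of this sketch; the text does not specify in which order the preceding and following neighbours are tried, so the preceding one is tried first here.

```python
from typing import Callable, List, Tuple


def expand_best_block(blocks: List[str],
                      score: Callable[[str], float]) -> Tuple[int, int]:
    """Return the inclusive (start, end) block range of the extracted body."""
    if not blocks:
        raise ValueError("no text blocks")
    best = max(range(len(blocks)), key=lambda i: score(blocks[i]))
    start = end = best
    current = score(blocks[best])
    improved = True
    while improved:
        improved = False
        candidates = []
        if start > 0:
            candidates.append((start - 1, end))  # merge the preceding block
        if end < len(blocks) - 1:
            candidates.append((start, end + 1))  # merge the following block
        for lo, hi in candidates:
            merged = score("\n".join(blocks[lo:hi + 1]))
            if merged > current:
                # Keep the merge only when the score rises.
                start, end, current = lo, hi, merged
                improved = True
                break
    return start, end
```

Each accepted merge widens the range by one block, so the loop accepts at most `len(blocks) - 1` merges, which matches the bound on the number of iterations stated above.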
The embodiment described above merely illustrates the technical idea and features of this patent. Its purpose is to enable those skilled in the art to understand the content of this patent and implement it accordingly; it does not limit the scope of this patent. Any equivalent change or modification made in the spirit disclosed by this patent still falls within the scope of the claims of this patent.
Claims (5)
1. A method for extracting the body text of a web page, characterized by comprising the following steps:
Step 1: extract the web page title with a regular expression;
Step 2: preprocess the web page to obtain the continuous text fragments in the page;
Step 3: dynamically divide the text into blocks;
Step 4: score each text block according to predetermined rules, based on its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text, and choose the best text block;
Step 5: iteratively expand the best text block according to predetermined rules; the expanded best text block is the body text.
2. The method for extracting the body text of a web page according to claim 1, characterized in that step 2 comprises: removing irrelevant characters and web page tags to obtain the continuous text fragments in the page;
removing web page tags specifically comprises: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;
the irrelevant characters include, but are not limited to, page scripts, CSS, and comments.
3. The method for extracting the body text of a web page according to claim 1 or 2, characterized in that step 3 comprises:
(1) counting all blank lines in the obtained text fragments, checking how many consecutive blank lines each run contains, to obtain an array;
(2) sorting the array in ascending order and selecting the value at the one-fifth position (counting from the smallest) as the threshold;
(3) splitting the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.
4. The method for extracting the body text of a web page according to claim 1, characterized in that, in step 4, the text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text of each text block are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate respectively;
the scoring formula for a text block is:
y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).
5. The method for extracting the body text of a web page according to claim 1, characterized in that step 5 comprises: starting from the best text block obtained, merging the text block immediately before or after it and recomputing the score; if the score rises, the merged block becomes the new best text block; if the score falls, the merge is abandoned; this is repeated until the score can no longer be raised, and the resulting text block is the body text to be extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017223.3A CN104598577B (en) | 2015-01-14 | 2015-01-14 | A kind of extracting method of Web page text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104598577A (en) | 2015-05-06 |
CN104598577B (en) | 2017-09-15 |
Family
ID=53124362
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017223.3A Active CN104598577B (en) | 2015-01-14 | 2015-01-14 | A kind of extracting method of Web page text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104598577B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN104598577A (en) | 2015-05-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||