CN105320734B - A kind of web page core content extracting method - Google Patents
- Publication number
- CN105320734B (application CN201510413180.0A)
- Authority
- CN
- China
- Prior art keywords
- paragraph
- core
- web page
- value
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a method for extracting the core content of a web page, comprising the following steps: 1) dividing the page content into multiple paragraphs according to the HTML tags in the page source; 2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph; 3) calculating a core feature value for each paragraph from these feature values and, from the distribution of core feature values across the page, finding the range in which the values are most concentrated. Paragraphs whose core feature value falls within this threshold range are the core paragraphs of the page, and together they give its core content. Compared with the prior art, the method has the following advantages. It does not rely solely on HTML tags but takes full account of the characteristics of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high. It does not depend on a particular type of web page, so it is general and can handle all kinds of common pages on the internet. It is simple to implement, requires little computation, and is efficient.
Description
Technical field
The present invention relates to the field of information technology, in particular to internet information processing, and specifically to a method for extracting the core content of a web page.
Background art
With the development of the internet, the number of web pages and of internet users keeps growing, and web content has become an indispensable channel through which people obtain information. Driven by commercial considerations, however, websites that provide original information also place additional material in the pages that carry the valuable data, such as advertisements and links to related content on other websites (these advertisements and links may be text, images, or even plug-ins). The steady addition of advertisements and links clutters pages that should be concise, and the addition of all kinds of page tools and dynamic elements makes the internal structure of the page complex.
The growing complexity of page content and structure degrades the reading experience and consumes a large amount of internet bandwidth. This extra data not only lowers the efficiency of browsing but also reduces retrieval accuracy when the pages are used for search. How to analyze a page accurately and quickly to obtain its core content has therefore become an urgent problem for many web content processing applications (such as search engines, web archiving, and information collection systems).
In addition, with the rapid growth of the mobile internet, browsing web pages on mobile terminals has become the prevailing trend. Because mobile terminals have small screens and limited data allowances, they cannot display all the content of a conventional web page, which makes effective extraction of the core content even more urgent.
The prior art generally uses the following kinds of methods to extract the core content of a web page:
1. Determination from the character counts of successive lines of the page
1) For the web page, determine the total number of characters and the number of Chinese characters in lines i and (i+1);
2) Calculate the text density of lines i and (i+1), for example as the number of Chinese characters divided by the total number of characters;
3) Compare the calculated text density with a preset threshold;
4) If the text density is not less than the preset threshold, lines i and (i+1) are determined to be core content; if it is less than the threshold, lines i and (i+1) are determined to be non-core content;
5) If lines i and (i+1) are determined to be core content, determine in the same way whether lines i, (i+1) and (i+2) are core content;
6) If lines i and (i+1) are determined to be non-core content, determine in the same way whether lines (i+2) and (i+3) are core content;
7) Repeat the above steps until all lines of the page have been traversed.
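The prior-art procedure in steps 1) to 7) can be sketched roughly as follows. This is an illustrative reading, not the cited method's actual code: the names `text_density` and `classify_lines`, the threshold of 0.5, and the Unicode range used to recognize Chinese characters are all assumptions, the density is computed over the two lines joined together, and step 5) is interpreted as moving on to the next overlapping pair of lines.

```python
import re

# Assumption: "Chinese character" means the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def text_density(text: str) -> float:
    """Chinese characters divided by total characters (0.0 for empty text)."""
    total = len(text)
    return len(CJK.findall(text)) / total if total else 0.0


def classify_lines(lines, threshold=0.5):
    """Return one boolean per line: True if the line was judged core content."""
    core = [False] * len(lines)
    i = 0
    while i + 1 < len(lines):
        pair = lines[i] + lines[i + 1]
        if text_density(pair) >= threshold:
            core[i] = core[i + 1] = True
            i += 1  # core: continue with the overlapping pair (i+1, i+2)
        else:
            i += 2  # non-core: skip ahead to the pair (i+2, i+3)
    return core


lines = ["这是正文内容第一行", "这是正文内容第二行", "<a href='x'>ad</a>", "<div class='nav'>"]
print(classify_lines(lines))
```

Because every decision depends on a single fixed threshold, a dense but irrelevant block (a disclaimer, say) passes as easily as the body text, which is the weakness the patent goes on to describe.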
When extracting core content, the above prior-art method treats any run of consecutive lines whose text density is not less than the preset threshold as body content. Many pages today, however, contain highly distracting non-core content such as personal profiles, short summaries, and disclaimers. Such content also has a high text density, quite possibly above the preset threshold, and is therefore mistaken for core content. If the threshold is adjusted instead, core content may be misjudged as non-core, so extraction accuracy drops.
In addition, because the algorithm is relatively cumbersome, extracting the core content of a page that carries a large amount of content may take a long time. This harms the user experience and cannot meet today's demand for fast, efficient information processing.
2. Segmenting the page into regions using its structural layout information and extracting the content of the core blocks
This approach uses the layout of the page to divide it into multiple parts and then classifies the parts according to their features. A layout-based method of this kind does not suit all web pages, however, and requires processing templates to be set up in advance. Jiangsu Xinruifeng Information Technology Co., Ltd. improved on this approach by dividing the page into regions based on HTML tags and then extracting the text content (patent application No. 201210213554.0). That method relies solely on HTML tags and does not consider the relationships within the text content of the page itself. In practice it handles only news pages effectively (by its own description, its processing success rate on news pages is 80% to 85%).
3. Extracting the core content of the page based on the Document Object Model (DOM)
This approach extracts the Document Object Model from the web document and takes the page content from specific nodes of the object model. In practice, however, the content nodes in each page's DOM are defined by the page author, so this method cannot be applied to all web pages.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method for extracting the core content of a web page. The method divides the page content into paragraphs and locates the core content from the length of each paragraph, the text distance between paragraphs, and the text density inside each paragraph.
To achieve this object, the invention adopts the following scheme:
A method for extracting the core content of a web page, comprising the following steps:
1) dividing the page content into multiple paragraphs according to the HTML tags in the page source;
2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph;
3) calculating the core feature value of each paragraph from the feature values; from the distribution of the core feature values of the paragraphs in the page, obtaining the range in which the core feature values are most concentrated; the paragraphs whose core feature value lies within this threshold range are the core paragraphs of the page, which gives the core content of the page.
Further, step 1) divides the web page into paragraphs according to HTML tags (including <p></p>, <div></div>, <span></span>, <br>, <br/>, etc.).
The spacing distance of adjacent paragraphs comprises two values: the distance between a paragraph and the paragraph before it, and the distance between the paragraph and the paragraph after it.
Further, the spacing distance of adjacent paragraphs is defined as (number of characters between the paragraphs) + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph.
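As a minimal sketch of this distance, assuming a lookup table keyed by the (end tag, start tag) pair: the patent states only that M depends on the tag pair and gives no numeric values, so every number in `M_TABLE`, the fallback `DEFAULT_M`, and the function names are hypothetical.

```python
import re

# Hypothetical M values: the source does not publish any numbers.
M_TABLE = {
    ("p", "p"): 1,      # e.g. </p> followed by <p>
    ("hn", "p"): 2,     # e.g. </h2> followed by <p>
    ("div", "div"): 5,  # e.g. </div> followed by <div>
}
DEFAULT_M = 3  # hypothetical fallback for pairs not listed above


def tag_name(tag: str) -> str:
    """Reduce '</h2>' or '<p class="x">' to a bare name, folding h1..h6 into 'hn'."""
    name = re.sub(r"[<>/]", "", tag).split()[0].lower()
    return "hn" if re.fullmatch(r"h[1-6]", name) else name


def paragraph_distance(chars_between: int, prev_end_tag: str, start_tag: str) -> int:
    """Spacing distance = characters between the paragraphs + M for the tag pair."""
    key = (tag_name(prev_end_tag), tag_name(start_tag))
    return chars_between + M_TABLE.get(key, DEFAULT_M)


print(paragraph_distance(10, "</p>", "<p>"))
```

Keeping M in a table makes the tag-pair weighting easy to tune per site or per corpus without touching the distance formula itself.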
Further, the internal density of a paragraph is defined as (total number of Chinese and English characters in the paragraph) / Q, where Q = (total number of Chinese and English characters in the paragraph) + (number of punctuation marks in the paragraph) × Q1 + (length of HTML tag 1) × Q1 + (length of HTML tag 2) × Q2 + ... + (length of HTML tag P) × QP; the values Q1, Q2, ..., QP are determined by the type of HTML tag.
Further, the core feature value of a paragraph is defined as (character length of the paragraph) × (internal density of the paragraph) / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph).
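Written compactly, the two definitions above read as follows. The symbols C, S, T_k, L, D, H, N_prev and N_next are notation introduced here for readability and are not the patent's own; Q and Q_1 ... Q_P come from the text, and Q_1 appears on both the punctuation term and the first tag term because that is how the source states the formula.

```latex
\begin{aligned}
Q &= C + S\,Q_1 + T_1 Q_1 + T_2 Q_2 + \dots + T_P Q_P \\
D &= \frac{C}{Q} \\
H &= \frac{L \cdot D}{N_{\text{prev}} + N_{\text{next}}}
\end{aligned}
```

Here C is the number of Chinese and English characters in the paragraph, S the number of punctuation marks, T_k the length of the k-th HTML tag inside the paragraph, L the character length of the paragraph, and N_prev and N_next the spacing distances to the preceding and following paragraphs. Assuming the weights are non-negative, Q is at least C, so D lies between 0 and 1 and shrinks as a paragraph carries more markup relative to text.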
Further, in step 3) the core feature value of each paragraph is calculated from the feature values and, according to the distribution of the core feature values, the paragraphs whose core feature value lies within a certain threshold range are selected as core paragraphs; combined, these paragraphs form the core text.
Further, the threshold range is chosen on the following basis: the core feature value of a paragraph characterizes the core content of the page, and the core paragraphs of one page have similar feature values, whereas non-core content such as advertisements, disclaimers, and promoted links shows no such concentration. The range in which the paragraph core feature values are most concentrated is therefore chosen as the threshold range for selecting core paragraphs.
By adopting the above technical scheme, the invention has the following advantages over the prior art:
1. It does not rely solely on HTML tags but takes full account of the characteristics of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high.
2. It does not depend on a particular type of web page, so it is general and can handle all kinds of common pages on the internet.
3. It is simple to implement, requires little computation, and is efficient.
Brief description of the drawings
Fig. 1 is a flow diagram of web page core content extraction according to the invention.
Fig. 2a is the first part of the schematic diagram of web page core content extraction in Embodiment 2 of the invention.
Fig. 2b is the second part of the schematic diagram of web page core content extraction in Embodiment 2 of the invention.
Specific embodiments
To make the above features and advantages of the invention clearer, embodiments are described in detail below with reference to the accompanying drawings.
First, the core concept of the invention is explained:
1. The page source is divided into paragraphs using HTML tags.
An HTML tag (HyperText Markup Language tag) is the most basic unit of the HTML language and the most important component of HTML, which is an application of the Standard Generalized Markup Language.
HTML tags usually have the following characteristics:
1) A tag is a keyword surrounded by angle brackets, such as <html>.
2) Tags usually occur in pairs, such as <div> and </div>.
3) The first tag of a pair is the start tag and the second is the end tag.
4) Start and end tags are also called opening and closing tags.
5) Some tags stand alone, such as <img src=".GIF"/>.
6) For paired tags, the content lies between the two tags; for standalone tags, values are assigned in the tag attributes. Examples: <h1>title</h1> and <input type="text" value="button"/>.
7) The content of a page must be inside the <html> tag; the title, character encoding, language, compatibility, keywords, description and similar information go in the <head> tag, and the content to be displayed must be nested in the <body> tag. Code that is not written to the standard may still display correctly, but as a matter of professional practice one should still form the habit of writing to the standard.
Given these characteristics, the page source is divided using HTML tags, and the resulting paragraphs have the following property:
Each paragraph is enclosed by one of the following tags:
<p></p>
<div></div>
<span></span>
<br> (or <br/>)
<h1></h1> (<h2></h2> ... <hn></hn>)
These paragraphs are selected, and the text distance between paragraphs is calculated from the visual and character distance between them.
Then the text density of each paragraph is calculated from how tightly the characters inside the paragraph are packed.
2. The calculation uses four feature values: the text length of the paragraph, the text distance between the paragraph and the preceding paragraph, the text distance between the paragraph and the following paragraph, and the internal density of the paragraph. The result determines whether the paragraph is core text content of the page.
The prior art judges whether each line or paragraph of the page source is core content from character density alone, or from HTML tags alone. The present invention instead combines the paragraph text length, the text distance between paragraphs, and the internal paragraph density. Compared with the prior art, it takes full account not only of the characteristics of the HTML document itself and of the visual presentation of the page, but also of the structure of Chinese text. It can handle all kinds of text on the internet (including but not limited to general-purpose pages, news pages, blog pages, encyclopedia pages, and e-commerce sites) with good results. To verify the effect, we sampled Chinese websites across the global internet, obtained 100,000 Chinese web pages at random, and processed them with the method of the invention. The experiment shows that the accuracy of core content extraction across all types of pages is as high as 90%. In terms of efficiency, under the same computing capacity the method takes only 25% more time than processing with HTML tags alone, and 50% less time than processing with the Document Object Model.
The processing flow of the web page core content extraction method of the invention is described below with reference to Fig. 1:
First, the page content is divided into multiple paragraphs according to the HTML tags in the page source. In this process the HTML tags in the source are analyzed, and the parts enclosed by the following tags are taken as paragraphs:
The part enclosed by <hn></hn>
The part enclosed by <p></p>
The part enclosed by <div></div>
The part enclosed by <span></span>
The part between the end of the previous paired tag and a <br> (or <br/>) tag
Each of the above parts is an independent paragraph.
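One way to sketch this splitting step uses the `HTMLParser` class from Python's standard library and treats every start or end of a listed tag as a paragraph boundary. The class name `ParagraphSplitter` and the helper `split_paragraphs` are hypothetical, and the sketch simplifies the patent's description: it keeps only the text of each paragraph, does not record which tags enclosed it, and does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser

# Tags named in the text: <hn>, <p>, <div>, <span>, and <br>.
BOUNDARY_TAGS = {"p", "div", "span", "br"} | {f"h{n}" for n in range(1, 7)}


class ParagraphSplitter(HTMLParser):
    """Collects the text found between consecutive boundary tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BOUNDARY_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_paragraphs(html: str) -> list:
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text that follows the last boundary tag
    return parser.paragraphs


page = "<h1>Title</h1><div><p>First para.</p><p>Second<br>Third</p></div><span>ad</span>"
print(split_paragraphs(page))
```

A fuller implementation would also keep each paragraph's enclosing tags and its offset in the source, since the distance calculation needs both.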
Then the feature values are obtained: the paragraph length, the text distance between paragraphs, and the paragraph density. The text distance between paragraphs is given by:
Paragraph distance N = number of characters between the paragraphs + M
The value of M depends on the end tag of the preceding paragraph and the start tag of the current paragraph; different tag combinations give different values of M. The combinations are as follows:
</hn> with <hn>
</hn> with <p>
</hn> with <span>
</hn> with <div>
</hn> with the first character of a <br> paragraph
</p> with <p>
</p> with <hn>
</p> with <span>
</p> with <div>
</p> with the first character of a <br> paragraph
</span> with <span>
</span> with <hn>
</span> with <p>
</span> with <div>
</span> with the first character of a <br> paragraph
</div> with <div>
</div> with <hn>
</div> with <p>
</div> with <span>
</div> with the first character of a <br> paragraph
Then the text density of each paragraph is calculated:
Text density of a paragraph = total number of Chinese and English characters in the paragraph / Q
Q = total number of Chinese and English characters in the paragraph + number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP
(Q1, Q2, ..., QP differ according to the HTML tag)
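The density formula can be sketched as below. Several details are assumptions rather than statements from the patent: "Chinese and English characters" is read as CJK ideographs plus ASCII letters, the punctuation set is a small illustrative sample, the "length" of a tag is taken to be the length of its source string, and all weights (`PUNCT_WEIGHT`, which the source writes as Q1, and the per-tag values in `TAG_WEIGHTS`) are hypothetical because the source gives no numbers.

```python
import re

CHAR = re.compile(r"[A-Za-z\u4e00-\u9fff]")       # assumed character classes
PUNCT = re.compile(r"[,.;:!?，。；：！？、]")        # illustrative punctuation set
TAG = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

PUNCT_WEIGHT = 0.5                      # hypothetical; written Q1 in the source
TAG_WEIGHTS = {"a": 2.0, "img": 3.0}    # hypothetical per-tag weights
DEFAULT_TAG_WEIGHT = 1.0                # hypothetical weight for other tags


def paragraph_density(raw_html: str) -> float:
    """Characters / Q, where Q adds weighted punctuation and weighted tag lengths."""
    text = TAG.sub("", raw_html)
    chars = len(CHAR.findall(text))
    if chars == 0:
        return 0.0
    q = chars + len(PUNCT.findall(text)) * PUNCT_WEIGHT
    for match in TAG.finditer(raw_html):
        weight = TAG_WEIGHTS.get(match.group(1).lower(), DEFAULT_TAG_WEIGHT)
        q += len(match.group(0)) * weight
    return chars / q


print(paragraph_density("hello"))
print(paragraph_density("<a href='x'>ad</a>"))
```

Plain running text scores close to 1, while a short link wrapped in markup scores far lower, which is the contrast the density feature is meant to capture.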
Finally, the core paragraphs of the page, and hence its core content, are determined from the length of each paragraph, its distances to the preceding and following paragraphs, and its internal density. The calculation is as follows:
Core feature value of a paragraph = paragraph length × internal paragraph density / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph)
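The formula translates directly into a small function. The function name is hypothetical, and the guard against a zero denominator is an assumption added here, since the source does not say what happens when both distances are zero.

```python
def core_feature_value(length: int, density: float,
                       dist_prev: float, dist_next: float) -> float:
    """length * density / (dist_prev + dist_next).

    Assumption: a combined distance below 1 is clamped to 1 to avoid
    division by zero; the source does not cover this case.
    """
    return length * density / max(dist_prev + dist_next, 1)


# A long, dense paragraph close to its neighbours scores high...
print(core_feature_value(length=100, density=0.8, dist_prev=3, dist_next=2))
# ...while a short, markup-heavy, isolated one scores low.
print(core_feature_value(length=12, density=0.1, dist_prev=40, dist_next=60))
```

The value rises with length and density and falls with isolation, so body paragraphs that sit next to one another tend to land in the same band of values.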
The paragraphs whose core feature value lies within a certain threshold range are then selected as core text content, according to the distribution of paragraph core feature values in the page.
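The source does not say how the "most concentrated" range is located, so the following is only one possible reading: slide a window of fixed relative width over the sorted values and keep the window that covers the most paragraphs. The function names and the width parameter `ratio` are hypothetical.

```python
def densest_range(values, ratio=2.0):
    """Return (low, high) with high = low * ratio covering the most positive values."""
    ordered = sorted(v for v in values if v > 0)
    best_low, best_count, right = 0.0, 0, 0
    for left, low in enumerate(ordered):
        # Advance the right edge while values still fit inside [low, low * ratio].
        while right < len(ordered) and ordered[right] <= low * ratio:
            right += 1
        if right - left > best_count:
            best_low, best_count = low, right - left
    return best_low, best_low * ratio


def select_core(values, ratio=2.0):
    """Indices of the paragraphs whose value falls inside the densest range."""
    low, high = densest_range(values, ratio)
    return [i for i, v in enumerate(values) if v > 0 and low <= v <= high]


scores = [0.1, 15.0, 16.0, 18.0, 0.5, 20.0, 3.0]
print(densest_range(scores))
print(select_core(scores))
```

A relative window is used so that the range scales with the magnitude of the values; a histogram or a clustering step would be equally plausible readings of the text.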
A search engine using the method of the invention can process massive numbers of web pages efficiently and extract their core content without storing the original page content. This saves large amounts of storage and computation and allows search results to return the core content of pages accurately.
An information collection system using the method of the invention is unaffected by advertisements and dynamic page elements, and can collect the core content of web pages conveniently and efficiently.
The system obtains the page source and divides the text content of the page into paragraphs P1 to Pn according to the HTML tags. Using the method above it computes, for each paragraph, the length Lp1 to Lpn, the text distance to the preceding paragraph Dp(before)1 to Dp(before)n, the text distance to the following paragraph Dp(after)1 to Dp(after)n, and the paragraph density Mp1 to Mpn. From these four feature values it calculates the paragraph core feature values Hp1 to Hpn, selects by threshold, and obtains the core paragraphs Px, Px+1, ..., Py, that is, the core content of the page. For the calculation process, see Fig. 2a and Fig. 2b.
Claims (6)
1. A method for extracting the core content of a web page, comprising the following steps:
1) dividing the page content into multiple paragraphs according to the HTML tags in the page source;
2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph; the spacing distance of adjacent paragraphs comprises the distance between a paragraph and the paragraph before it and the distance between the paragraph and the paragraph after it; the spacing distance of adjacent paragraphs is defined as (number of characters between the paragraphs) + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph;
3) calculating the core feature value of each paragraph from the feature values; from the distribution of the core feature values of the paragraphs in the page, obtaining the range in which the core feature values are most concentrated; the paragraphs whose core feature value lies within this threshold range are the core paragraphs of the page, which gives the core content of the page.
2. The method for extracting the core content of a web page according to claim 1, characterized in that the HTML tags in step 1) include <p>, </p>, <div>, </div>, <span>, </span>, <br>, <br/>.
3. The method for extracting the core content of a web page according to claim 1, characterized in that the internal density of a paragraph is defined as (total number of Chinese and English characters in the paragraph) / Q, where Q = (total number of Chinese and English characters in the paragraph) + (number of punctuation marks in the paragraph) × Q1 + (length of HTML tag 1) × Q1 + (length of HTML tag 2) × Q2 + ... + (length of HTML tag P) × QP; Q1, Q2, ..., QP are determined by the type of HTML tag.
4. The method for extracting the core content of a web page according to claim 3, characterized in that the core feature value of a paragraph is defined as (character length of the paragraph) × (internal density of the paragraph) / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph).
5. The method for extracting the core content of a web page according to claim 1, characterized in that calculating the core feature value of the paragraphs from the feature values in step 3) comprises: selecting, according to the distribution of the paragraph core feature values, the paragraphs whose core feature value lies within a certain threshold range as core paragraphs, these paragraphs combined forming the core text.
6. The method for extracting the core content of a web page according to claim 5, characterized in that the threshold range is chosen on the following basis: the range in which the paragraph core feature values are most concentrated is selected as the threshold range for core paragraphs.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510413180.0A CN105320734B (en) | 2015-07-14 | 2015-07-14 | A kind of web page core content extracting method |
PCT/CN2015/098464 WO2017008448A1 (en) | 2015-07-14 | 2015-12-23 | Method for extracting core content of web page |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510413180.0A CN105320734B (en) | 2015-07-14 | 2015-07-14 | A kind of web page core content extracting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105320734A CN105320734A (en) | 2016-02-10 |
CN105320734B true CN105320734B (en) | 2019-02-22 |
Family
ID=55248123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510413180.0A Active CN105320734B (en) | 2015-07-14 | 2015-07-14 | A kind of web page core content extracting method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105320734B (en) |
WO (1) | WO2017008448A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107357496B (en) * | 2017-07-19 | 2019-03-26 | 掌阅科技股份有限公司 | Annotation process method, electronic equipment and computer storage medium |
CN109543126B (en) * | 2018-11-19 | 2022-04-29 | 四川长虹电器股份有限公司 | Webpage text information extraction method based on block character ratio |
CN109684642B (en) * | 2018-12-26 | 2023-01-13 | 重庆电信系统集成有限公司 | Abstract extraction method combining page parsing rule and NLP text vectorization |
CN111435405A (en) * | 2019-01-15 | 2020-07-21 | 北京行数通科技有限公司 | Method and device for automatically labeling key sentences of article |
CN110443814B (en) * | 2019-07-30 | 2022-12-27 | 北京百度网讯科技有限公司 | Loss assessment method, device, equipment and storage medium for vehicle |
CN111046302A (en) * | 2019-12-30 | 2020-04-21 | 珠海趣印科技有限公司 | Method and device for extracting webpage content |
CN113537091B (en) * | 2021-07-20 | 2024-05-03 | 东莞盟大集团有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
CN115098804B (en) * | 2022-06-24 | 2023-11-03 | 上海上班族数字科技有限公司 | Webpage search history record intelligent management system based on big data analysis |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101093487A (en) * | 2006-06-22 | 2007-12-26 | 上海新纳广告传媒有限公司 | Method for extracting content of text based on HTML characteristics |
CN101408898A (en) * | 2008-11-07 | 2009-04-15 | 北大方正集团有限公司 | Method and device for extracting web page text |
CN103020129A (en) * | 2012-11-20 | 2013-04-03 | 中兴通讯股份有限公司 | Text content extraction method and text content extraction device |
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130283148A1 (en) * | 2010-10-26 | 2013-10-24 | Suk Hwan Lim | Extraction of Content from a Web Page |
CN102737017B (en) * | 2011-03-31 | 2015-03-11 | 北京百度网讯科技有限公司 | Method and apparatus for extracting page theme |
CN103365935A (en) * | 2012-04-11 | 2013-10-23 | 腾讯科技(深圳)有限公司 | Method and server for confirming page readability |
CN103810251B (en) * | 2014-01-21 | 2017-05-10 | 南京财经大学 | Method and device for extracting text |
Also Published As
Publication number | Publication date |
---|---|
CN105320734A (en) | 2016-02-10 |
WO2017008448A1 (en) | 2017-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105320734B (en) | A kind of web page core content extracting method | |
US8819028B2 (en) | System and method for web content extraction | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN106055667B (en) | It is a kind of based on text-label densities web page core content extracting method | |
CN102270206A (en) | Method and device for capturing valid web page contents | |
CN101681251A (en) | Semantic analysis of documents to rank terms | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN107346433A (en) | A kind of text data sorting technique and server | |
JP2016042349A (en) | Automatic method for division into chapters and sections | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
CN104050158B (en) | Automatic quotation extraction method and device with semantic integrity kept | |
CN108874870A (en) | A kind of data pick-up method, equipment and computer can storage mediums | |
CN103491116A (en) | Method and device for processing text-related structural data | |
CN102207974A (en) | Method for combining context web pages | |
CN102346748A (en) | Automatic identification method for network literature directory type web pages | |
CN106485525A (en) | Information processing method and device | |
CN104572874B (en) | A kind of abstracting method and device of webpage information | |
CN105335382B (en) | The extracting method and device of Web page text | |
CN105183730B (en) | The treating method and apparatus of webpage information | |
CN110795933B (en) | Webpage text recognition processing method and device | |
CN103729354B (en) | web information processing method and device | |
CN104850609B (en) | A kind of filter method for rising space class keywords | |
JP5317638B2 (en) | Web document main content extraction apparatus and program | |
CN104462151A (en) | Method for evaluating web page publishing time and related device | |
Lin et al. | Combining a segmentation-like approach and a density-based approach in content extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||