CN105320734B - Method for extracting the core content of a web page - Google Patents

Method for extracting the core content of a web page

Info

Publication number
CN105320734B
CN105320734B CN201510413180.0A CN201510413180A CN105320734B CN 105320734 B CN105320734 B CN 105320734B CN 201510413180 A CN201510413180 A CN 201510413180A CN 105320734 B CN105320734 B CN 105320734B
Authority
CN
China
Prior art keywords
paragraph
core
web page
value
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510413180.0A
Other languages
Chinese (zh)
Other versions
CN105320734A (en)
Inventor
陈勇
耿光刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Internet Network Information Center
Original Assignee
China Internet Network Information Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Internet Network Information Center
Priority to CN201510413180.0A
Priority to PCT/CN2015/098464 (WO2017008448A1)
Publication of CN105320734A
Application granted
Publication of CN105320734B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a method for extracting the core content of a web page, comprising the following steps: 1) dividing the web page content into multiple paragraphs according to the HTML tags in the page source; 2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph; 3) calculating a core feature value for each paragraph from these feature values. From the distribution of the core feature values across the paragraphs of the page, the range in which the values are most concentrated is obtained; paragraphs whose core feature value falls within this threshold range are the core paragraphs of the page, and together they give its core content. Compared with the prior art, the method has the following advantages. It does not rely on HTML tags alone but takes full account of the features of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high. It does not depend on any particular type of web page, so it is general and can handle the common kinds of pages found on the internet. It is simple to implement, requires little computation, and processes pages efficiently.

Description

Method for extracting the core content of a web page
Technical field
The present invention relates to the field of information technology, in particular to internet information processing, and specifically to a method for extracting the core content of a web page.
Background
With the development of the internet, the number of websites, web pages, and internet users keeps growing, and web content has become an indispensable channel through which people obtain information. Driven by commercial considerations, however, a website that provides original information to its users tends to add extra material to the pages that carry the valuable data, such as advertisements and links to related content on other sites (these advertisements and links may be text, images, or even plug-ins). The steady accumulation of such advertisements and links makes pages that ought to be concise very cluttered, and the addition of all kinds of page widgets and dynamic elements makes the internal structure of the page complicated as well.
The growing complexity of page content and structure degrades the reading experience and consumes a large amount of network bandwidth. This extra data not only slows down the browsing of page information; when pages are used for retrieval, it also lowers retrieval accuracy. How to analyze a page quickly and accurately to obtain its core content has become an urgent problem for many web content processing applications (such as search engines, web archiving, and information collection systems).
In addition, with the rapid growth of the mobile internet, browsing the web on mobile terminals has become the prevailing trend. Because mobile terminals have small screens and limited data allowances, they cannot display everything in a conventional page, which makes effective extraction of core content all the more urgent.
The prior art generally extracts the core content of a web page in one of the following ways:
1. Determination from the character counts of adjacent lines of the page
1) For the web page, determine the total number of characters and the number of Chinese characters in the content of line i and line (i+1);
2) Calculate the text density of the content of lines i and (i+1), for example by dividing the number of Chinese characters by the total number of characters;
3) Compare the calculated text density with a preset threshold;
4) If the text density is not less than the preset threshold, lines i and (i+1) are determined to be core content; if it is less than the preset threshold, lines i and (i+1) are determined to be non-core content;
5) If lines i and (i+1) are determined to be core content, determine in the same way whether lines i, (i+1), and (i+2) are core content;
6) If lines i and (i+1) are determined to be non-core content, determine in the same way whether lines (i+2) and (i+3) are core content;
7) Repeat the above steps until all lines of the web page have been traversed.
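As a rough illustration of the line-pair procedure in steps 1) to 7), the following Python sketch computes the ratio of Chinese characters to all characters over a window of lines and grows the window while the ratio stays at or above the threshold. The function names, the 0.5 threshold, the CJK code-point range used to detect Chinese characters, and the choice to resume right after a core run ends are assumptions made for illustration; the prior-art description does not specify them.

```python
def is_chinese(ch):
    # Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def text_density(lines):
    """Chinese characters divided by all characters over the given lines."""
    total = sum(len(line) for line in lines)
    if total == 0:
        return 0.0
    chinese = sum(1 for line in lines for ch in line if is_chinese(ch))
    return chinese / total


def core_line_indices(lines, threshold=0.5):
    """Return the indices of lines judged to be core content."""
    core = set()
    i, n = 0, len(lines)
    while i + 1 < n:
        j = i + 1
        if text_density(lines[i:j + 1]) >= threshold:
            # Step 5: keep adding the next line while the window stays dense.
            while j + 1 < n and text_density(lines[i:j + 2]) >= threshold:
                j += 1
            core.update(range(i, j + 1))
            i = j + 1  # assumption: resume after the core run
        else:
            # Step 6: skip the non-core pair and test the next pair.
            i += 2
    return sorted(core)
```

Note that a dense block of boilerplate, such as a disclaimer, passes this test just as easily as body text does.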
When the above prior-art method extracts core content, a run of consecutive lines is regarded as body text as long as its text density is not less than the preset threshold. Many pages today, however, contain highly distracting non-core content such as personal information, short summaries, and disclaimers. Such content also has a high text density, is likely to exceed the preset threshold, and is therefore mistaken for core content. Adjusting the threshold, on the other hand, may cause core content to be misjudged as non-core, so the extraction accuracy drops.
Moreover, because the algorithm is rather cumbersome, a page carrying a large amount of content may take a long time to process. This harms the user experience and cannot meet the current demand for increasingly fast and efficient information processing.
2. Segmenting the page into regions using its structural layout information and extracting the content of the core blocks
The layout of the page is used to partition it into several parts, which are then classified according to their features. This layout-based approach does not suit all pages, however, and requires processing templates to be set up in advance. Jiangsu Xinruifeng Information Technology Co., Ltd. improved on this approach by proposing to partition the page into regions based on HTML tags and then extract the text content (patent application No. 201210213554.0). That method relies solely on HTML tags and does not consider the relevance of the text content itself. In practice it handles only news pages effectively (by its own description, its success rate on news pages is 80% to 85%).
3. Extracting the core content of the page based on the Document Object Model (DOM)
The Document Object Model of the web document is extracted, and the page content is then extracted from specific nodes of the object model. In practice, the content nodes in the DOM of each page are defined by the page's author, so this method cannot be applied to all pages.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method for extracting the core content of a web page. The method divides the page content into paragraphs and locates the core content from the length of each paragraph, the text distance between paragraphs, and the text density within each paragraph.
To achieve this object, the invention adopts the following scheme:
A method for extracting the core content of a web page, comprising the following steps:
1) According to the HTML tags in the page source, divide the web page content into multiple paragraphs;
2) Compute the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph as feature values;
3) Calculate the core feature value of each paragraph from the feature values. From the distribution of the core feature values across the paragraphs of the page, obtain the range in which the values are most concentrated; paragraphs whose core feature value falls within this threshold range are the core paragraphs of the page, which gives the core content of the page.
Further, in step 1) the page is divided into paragraphs according to HTML tags (including <p></p>, <div></div>, <span></span>, <br>, <br/>, etc.).
The spacing distance of adjacent paragraphs comprises the distance between a paragraph and the paragraph before it and the distance between that paragraph and the paragraph after it.
Further, the spacing distance of adjacent paragraphs is defined as the number of characters between the paragraphs + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph.
Further, the internal density of a paragraph is defined as the total number of Chinese and English characters in the paragraph / Q, where Q is defined as the total number of Chinese and English characters in the paragraph + the number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP; Q1, Q2, ..., QP are determined by the type of HTML tag.
Further, the core feature value of a paragraph is defined as the character length of the paragraph × the internal density of the paragraph / (the distance between the paragraph and the preceding paragraph + the distance between the paragraph and the following paragraph).
Further, in step 3) the core feature value of each paragraph is calculated from the feature values; according to the distribution of the core feature values, the paragraphs whose core feature value lies within a certain threshold range are selected as core paragraphs, and these paragraphs are combined into the core text.
Further, the threshold range is chosen on the following basis: the core feature value of a paragraph represents the characteristics of core content in the page. Within one page the core paragraphs have similar feature values, whereas non-core content such as advertisements, disclaimers, and promoted links does not cluster in this way. The range in which the paragraph core feature values are most concentrated is therefore selected as the threshold range for core paragraphs.
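For reference, the quantities defined above can be written compactly as follows. The symbols L (paragraph character length), c (number of characters between two paragraphs), C (count of Chinese and English characters in the paragraph), S (count of punctuation marks), T_k (length of the k-th HTML tag), D (internal density), and H (core feature value) are notation introduced here for readability. M, N, Q, and Q_1 ... Q_P follow the text, including its use of Q_1 as the weight of both the punctuation count and the first tag.

```latex
\begin{aligned}
N &= c + M \\
Q &= C + S \cdot Q_1 + \sum_{k=1}^{P} T_k \cdot Q_k \\
D &= \frac{C}{Q} \\
H &= \frac{L \cdot D}{N_{\mathrm{prev}} + N_{\mathrm{next}}}
\end{aligned}
```

Since Q is at least C whenever the weights are non-negative, D lies between 0 and 1, and H grows with paragraph length and density and shrinks as the paragraph sits farther from its neighbours.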
By adopting the above technical scheme, the invention has the following advantages over the prior art:
1. It does not rely on HTML tags alone but takes full account of the features of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high.
2. It does not depend on any particular type of web page, so it is general and can handle the common kinds of pages found on the internet.
3. It is simple to implement, requires little computation, and processes pages efficiently.
Brief description of the drawings
Fig. 1 is a flow diagram of the web page core content extraction of the present invention.
Fig. 2a is the first part of the schematic diagram of web page core content extraction in Embodiment 2 of the present invention.
Fig. 2b is the second part of the schematic diagram of web page core content extraction in Embodiment 2 of the present invention.
Detailed description
To make the above features and advantages of the invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
First, the core idea of the invention is explained:
1. Divide the page source into paragraphs using HTML tags.
HTML tags (HyperText Markup Language tags) are the most basic units of the HTML language and the most important component of HTML, which is an application of the Standard Generalized Markup Language.
HTML tags usually have the following characteristics:
1) A tag is a keyword surrounded by angle brackets, such as <html>.
2) Tags usually occur in pairs, such as <div> and </div>.
3) The first tag of a pair is the start tag and the second is the end tag.
4) Start and end tags are also called opening and closing tags.
5) Some tags stand alone, such as <img src=".GIF"/>.
6) For tags that occur in pairs, the content lies between the two tags; for stand-alone tags, values are assigned in the tag attributes. Examples are <h1>title</h1> and <input type="text" value="button"/>.
7) The content of a page must lie within the <html> tag. Information such as the title, character format, language, compatibility, keywords, and description goes in the <head> tag, while the content to be displayed must be nested in the <body> tag. Code that is not written to the standard may sometimes still display correctly, but as a matter of professional practice one should form the habit of writing standard-conforming code.
Based on these characteristics, the page source is divided using HTML tags, and the resulting paragraphs have the following feature:
Each is enclosed by one of the following tags:
<p></p>
<div></div>
<span></span>
<br> (or <br/>)
<h1></h1> (<h2></h2> ... <hn></hn>)
These paragraphs are selected, and the text distance between paragraphs is calculated from the visual and character distance between them.
Then, the text density value of each paragraph is calculated from how tightly the characters inside the paragraph are packed:
2. The four feature values, namely the text length of the paragraph, the text distance between the paragraph and the preceding paragraph, the text distance between the paragraph and the following paragraph, and the density value inside the paragraph, are combined in a calculation, and the result determines whether the paragraph is core text content of the page.
Unlike the prior art, which judges whether each line or paragraph of the page source is core content from character density alone or from HTML tags alone, the present invention combines several feature values: paragraph text length, text distance between paragraphs, and density inside the paragraph. Compared with the prior art, it takes full account not only of the characteristics of the HTML document itself and of the visual presentation of the page, but also of the structural features of Chinese text. It can handle all kinds of text on the internet (including but not limited to comprehensive pages, news pages, blog pages, encyclopedia pages, and e-commerce sites) with fairly good results. To verify the effect, we sampled Chinese websites across the global internet, obtained 100,000 Chinese web pages at random, and processed them with the method of the invention. The experiment shows that the invention extracts the core content of the various kinds of pages with an accuracy of up to 90%. In terms of efficiency, with the same computing power it takes only 25% more time than processing by HTML tags alone, and 50% less time than processing with the Document Object Model.
The processing flow of the web page core content extraction method of the invention is described below with reference to Fig. 1:
First, the page content is divided into multiple paragraphs according to the HTML tags in the page source. In this step the HTML tags in the source are analyzed, and the parts enclosed by the following tags are treated as paragraphs:
The part enclosed by <hn></hn>
The part enclosed by <p></p>
The part enclosed by <div></div>
The part enclosed by <span></span>
The part between the end of the preceding paired tag and a <br> (or <br/>) tag
Each of these parts is treated as an independent paragraph.
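The paragraph division described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, `<hn>` is taken to mean `<h1>` to `<h6>`, every start or end of a listed tag is treated as a paragraph boundary (so nested containers yield their innermost text runs), and the contents of `<script>` and `<style>` are skipped. All of these choices are assumptions.

```python
from html.parser import HTMLParser

# Tags whose start or end closes the current paragraph (assumed reading of the list above).
BOUNDARY_TAGS = {"p", "div", "span", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class ParagraphSplitter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        """Store the buffered text as one paragraph if it is not empty."""
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BOUNDARY_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BOUNDARY_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def split_paragraphs(html):
    """Return the text of each paragraph found in the HTML source."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last boundary tag
    return parser.paragraphs
```

Inline tags such as `<a>` or `<b>` are not boundaries here, so their text stays inside the surrounding paragraph.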
Then, feature values such as the paragraph length, the text distance between paragraphs, and the paragraph density are obtained. The text distance between two paragraphs is given by:
Distance between paragraphs N = number of characters between the paragraphs + M
The value of M depends on the end tag of the preceding paragraph and the start tag of the current paragraph; different tag combinations give different values of M. The combinations are as follows:
</hn> with <hn>
</hn> with <p>
</hn> with <span>
</hn> with <div>
</hn> with the first character of a <br> paragraph
</p> with <p>
</p> with <hn>
</p> with <span>
</p> with <div>
</p> with the first character of a <br> paragraph
</span> with <span>
</span> with <hn>
</span> with <p>
</span> with <div>
</span> with the first character of a <br> paragraph
</div> with <div>
</div> with <hn>
</div> with <p>
</div> with <span>
</div> with the first character of a <br> paragraph
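The distance formula above lends itself to a lookup keyed by the (end tag, start tag) pair. The sketch below shows one way to arrange this in Python. The text lists the tag combinations but gives no numeric values for M, so every number in `M_TABLE` and `DEFAULT_M`, along with the function names, is a placeholder assumption.

```python
import re

# Hypothetical M values: the source names the combinations but not the numbers.
M_TABLE = {
    ("hn", "p"): 2,
    ("p", "p"): 1,
    ("p", "hn"): 4,
    ("p", "br"): 1,
    ("span", "span"): 0,
    ("div", "div"): 6,
}
DEFAULT_M = 3  # assumed fallback for combinations missing from the table


def tag_class(tag):
    """Map a raw tag such as '</h2>' or '<div class="x">' to 'hn', 'div', etc."""
    match = re.match(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)", tag)
    name = match.group(1).lower() if match else ""
    return "hn" if re.fullmatch(r"h[1-6]", name) else name


def paragraph_distance(chars_between, prev_end_tag, next_start_tag):
    """N = number of characters between the paragraphs + M."""
    key = (tag_class(prev_end_tag), tag_class(next_start_tag))
    return chars_between + M_TABLE.get(key, DEFAULT_M)
```

For example, with these placeholder values, a heading followed 12 characters later by a `<p>` paragraph gets a distance of 14.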
Next, we calculate the text density value of each paragraph:
Text density value of a paragraph = total number of Chinese and English characters in the paragraph / Q
Q = total number of Chinese and English characters in the paragraph + number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP
(Q1, Q2, ..., QP differ according to the HTML tag)
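A possible Python rendering of the density formula is sketched below. It takes the inner HTML of one paragraph, counts Chinese and English characters in the visible text, and penalises punctuation and embedded tags. The weights are not given in the text, so `punct_weight`, `tag_weights`, and `default_tag_weight` are hypothetical parameters; the text uses Q1 for both the punctuation term and the first tag, which this sketch can mimic by passing the same number for both. Detecting tags with a regular expression and Chinese characters by the basic CJK range are also simplifying assumptions.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def is_chinese_or_english(ch):
    # Assumption: basic CJK ideographs plus ASCII letters.
    return "\u4e00" <= ch <= "\u9fff" or (ch.isascii() and ch.isalpha())


def paragraph_density(inner_html, tag_weights=None, punct_weight=1.0,
                      default_tag_weight=1.0):
    """Density = C / Q, where Q adds weighted punctuation and tag lengths to C."""
    tag_weights = tag_weights or {}
    tags = list(TAG_RE.finditer(inner_html))
    text = TAG_RE.sub("", inner_html)

    letters = sum(1 for ch in text if is_chinese_or_english(ch))
    punctuation = sum(1 for ch in text
                      if unicodedata.category(ch).startswith("P"))
    tag_penalty = sum(
        len(m.group(0)) * tag_weights.get(m.group(1).lower(), default_tag_weight)
        for m in tags
    )

    q = letters + punctuation * punct_weight + tag_penalty
    return letters / q if q else 0.0
```

A paragraph of plain prose scores close to 1, while a short link wrapped in a long `<a ...>` tag scores much lower.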
Finally, the core paragraphs of the page, and hence its core content, are determined from the paragraph length, the distances to the preceding and following paragraphs, and the density inside the paragraph. The calculation is as follows:
Paragraph core feature value = paragraph length × density value inside the paragraph / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph)
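The core feature value is a single expression, shown here as a small Python function. The function and parameter names are illustrative, and the `min_distance` floor that guards against a zero denominator is an assumption; the text does not say how that case is handled.

```python
def core_feature_value(length, density, dist_prev, dist_next, min_distance=1.0):
    """length * density / (distance to previous + distance to next paragraph)."""
    # Assumption: clamp the denominator so adjacent paragraphs with zero
    # total distance do not cause a division by zero.
    denominator = max(dist_prev + dist_next, min_distance)
    return length * density / denominator
```

For instance, a 100-character paragraph with density 0.8 that lies 10 and 30 units from its neighbours has a core feature value of 2.0.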
Finally, according to the distribution of paragraph core feature values within the page, the paragraphs within a certain threshold range are selected as the core content.
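The text does not state how the most concentrated range is found. One plausible approach, sketched below in Python, slides a multiplicative window over the sorted values and keeps the window that covers the most paragraphs. The function names, the window `ratio` of 2.0, and the decision to ignore non-positive values are all assumptions made for this sketch.

```python
def densest_range(values, ratio=2.0):
    """Return (low, high) of the window [v, v * ratio] holding the most values."""
    positive = sorted(v for v in values if v > 0)
    if not positive:
        return None
    best_count, best_low, best_high = 0, positive[0], positive[0]
    j = 0
    for i, low in enumerate(positive):
        # Advance j to the first value that falls outside the window.
        while j < len(positive) and positive[j] <= low * ratio:
            j += 1
        if j - i > best_count:
            best_count, best_low, best_high = j - i, low, positive[j - 1]
    return best_low, best_high


def select_core_paragraphs(values, ratio=2.0):
    """Return the indices of paragraphs whose value lies in the densest range."""
    bounds = densest_range(values, ratio)
    if bounds is None:
        return []
    low, high = bounds
    return [k for k, v in enumerate(values) if low <= v <= high]
```

The returned indices are in document order, so joining the corresponding paragraphs yields the core text.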
A search engine using the method of the invention can process massive numbers of pages efficiently and extract their core content without storing the original page content, which saves large amounts of storage and computation, and it can return the core content of pages accurately in search results.
An information collection system using the method of the invention is not affected by advertisements or dynamic page elements and can collect the core content of pages conveniently and efficiently.
The system obtains the page source and divides the text content of the page into paragraphs P1 to Pn according to HTML tags. Using the method described above, it calculates the length of each paragraph, Lp1 to Lpn; the text distance between each paragraph and the preceding paragraph, Dp-before-1 to Dp-before-n; the text distance between each paragraph and the following paragraph, Dp-after-1 to Dp-after-n; and the paragraph density, Mp1 to Mpn. From these four feature values it calculates the paragraph core feature values Hp1 to Hpn, selects according to the threshold, and obtains the core paragraphs Px, Px+1, ..., Py, that is, the core content of the page. The calculation is illustrated in Fig. 2a and Fig. 2b.

Claims (6)

1. A method for extracting the core content of a web page, comprising the following steps:
1) according to the HTML tags in the page source, dividing the web page content into multiple paragraphs;
2) computing the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph as feature values; the spacing distance of adjacent paragraphs comprises the distance between a paragraph and the paragraph before it and the distance between that paragraph and the paragraph after it; the spacing distance of adjacent paragraphs is defined as the number of characters between the paragraphs + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph;
3) calculating the core feature value of each paragraph from the feature values; from the distribution of the core feature values across the paragraphs of the page, obtaining the range in which the core feature values are most concentrated, paragraphs whose core feature value falls within this threshold range being the core paragraphs of the page, thereby obtaining the core content of the page.
2. The method for extracting the core content of a web page according to claim 1, characterized in that the HTML tags in step 1) include <p>, </p>, <div>, </div>, <span>, </span>, <br>, <br/>.
3. The method for extracting the core content of a web page according to claim 1, characterized in that the internal density of a paragraph is defined as the total number of Chinese and English characters in the paragraph / Q, where Q is defined as the total number of Chinese and English characters in the paragraph + the number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP; Q1, Q2, ..., QP are determined by the type of HTML tag.
4. The method for extracting the core content of a web page according to claim 3, characterized in that the core feature value of a paragraph is defined as the character length of the paragraph × the internal density of the paragraph / (the distance between the paragraph and the preceding paragraph + the distance between the paragraph and the following paragraph).
5. The method for extracting the core content of a web page according to claim 1, characterized in that calculating the core feature value of each paragraph from the feature values in step 3) includes: according to the distribution of the paragraph core feature values, selecting the paragraphs whose core feature value lies within a certain threshold range as core paragraphs, the combination of these paragraphs being the core text.
6. The method for extracting the core content of a web page according to claim 5, characterized in that the threshold range is chosen by selecting the range in which the paragraph core feature values are most concentrated as the threshold range for core paragraphs.
CN201510413180.0A 2015-07-14 2015-07-14 A kind of web page core content extracting method Active CN105320734B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510413180.0A CN105320734B (en) 2015-07-14 2015-07-14 A kind of web page core content extracting method
PCT/CN2015/098464 WO2017008448A1 (en) 2015-07-14 2015-12-23 Method for extracting core content of web page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510413180.0A CN105320734B (en) 2015-07-14 2015-07-14 A kind of web page core content extracting method

Publications (2)

Publication Number Publication Date
CN105320734A CN105320734A (en) 2016-02-10
CN105320734B true CN105320734B (en) 2019-02-22

Family

ID=55248123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510413180.0A Active CN105320734B (en) 2015-07-14 2015-07-14 A kind of web page core content extracting method

Country Status (2)

Country Link
CN (1) CN105320734B (en)
WO (1) WO2017008448A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357496B (en) * 2017-07-19 2019-03-26 掌阅科技股份有限公司 Annotation process method, electronic equipment and computer storage medium
CN109543126B (en) * 2018-11-19 2022-04-29 四川长虹电器股份有限公司 Webpage text information extraction method based on block character ratio
CN109684642B (en) * 2018-12-26 2023-01-13 重庆电信系统集成有限公司 Abstract extraction method combining page parsing rule and NLP text vectorization
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN110443814B (en) * 2019-07-30 2022-12-27 北京百度网讯科技有限公司 Loss assessment method, device, equipment and storage medium for vehicle
CN111046302A (en) * 2019-12-30 2020-04-21 珠海趣印科技有限公司 Method and device for extracting webpage content
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium
CN115098804B (en) * 2022-06-24 2023-11-03 上海上班族数字科技有限公司 Webpage search history record intelligent management system based on big data analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN101408898A (en) * 2008-11-07 2009-04-15 北大方正集团有限公司 Method and device for extracting web page text
CN103020129A (en) * 2012-11-20 2013-04-03 中兴通讯股份有限公司 Text content extraction method and text content extraction device
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130283148A1 (en) * 2010-10-26 2013-10-24 Suk Hwan Lim Extraction of Content from a Web Page
CN102737017B (en) * 2011-03-31 2015-03-11 北京百度网讯科技有限公司 Method and apparatus for extracting page theme
CN103365935A (en) * 2012-04-11 2013-10-23 腾讯科技(深圳)有限公司 Method and server for confirming page readability
CN103810251B (en) * 2014-01-21 2017-05-10 南京财经大学 Method and device for extracting text

Also Published As

Publication number Publication date
CN105320734A (en) 2016-02-10
WO2017008448A1 (en) 2017-01-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant