CN105320734B

CN105320734B - Method for extracting the core content of a web page

Info

Publication number: CN105320734B
Application number: CN201510413180.0A
Authority: CN
Inventors: 陈勇; 耿光刚
Original assignee: China Internet Network Information Center
Current assignee: China Internet Network Information Center
Priority date: 2015-07-14
Filing date: 2015-07-14
Publication date: 2019-02-22
Anticipated expiration: 2035-07-14
Also published as: CN105320734A; WO2017008448A1

Abstract

The present invention provides a method for extracting the core content of a web page, comprising the following steps: 1) dividing the web page content into multiple paragraphs according to the HTML tags in the web page code; 2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph; 3) calculating the core feature value of each paragraph from these feature values, then, according to the distribution of the core feature values of the paragraphs in the web page, obtaining the range in which the core feature values are most concentrated; paragraphs whose core feature value lies within this threshold range are the core paragraphs of the web page, which yields the core content of the web page. Compared with the prior art, the method has the following advantages: it does not rely on HTML tags alone but takes full account of the features of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high; its implementation does not depend on a particular type of web page, so it is general and can handle the common kinds of web pages on the Internet; and it is simple to implement, computationally light, and efficient.

Description

Method for extracting the core content of a web page

Technical field

The present invention relates to the field of information technology, more particularly to the field of Internet information processing, and specifically to a method for extracting the core content of a web page.

Background art

With the development of the Internet, the number of web pages and of Internet users keeps growing, and web content has become an indispensable channel through which people obtain information. Under commercial pressure, however, websites that provide original information to users also place additional information in the pages that carry the valuable data, such as advertisements and links to related content on other websites (these advertisements and links may be text, images, or even plug-ins). The continual addition of such advertisements and links makes pages that should be concise very cluttered, and the addition of all kinds of page widgets and dynamic elements makes the internal structure of the page complex.

The growing complexity of web page content and structure harms the reading experience and consumes a large amount of Internet bandwidth. This extra data not only lowers the efficiency of browsing web information but, when the pages are used for retrieval, also reduces retrieval accuracy. How to analyze a page accurately and quickly and obtain its core content has become an urgent problem for many web content processing applications (such as search engines, web archiving, and information collection systems).

In addition, with the rapid growth of the mobile Internet, browsing the web on mobile terminals has become the prevailing trend. Because mobile terminals have small screens and limited data allowances, they cannot display all the content of a conventional web page, which makes effective extraction of web page core content even more urgent.

The prior art generally uses the following kinds of methods to extract web page core content:

1. Determination based on the character counts of adjacent lines of the web page

1) For the web page, determine the total number of characters and the number of Chinese characters in the content of line i and line (i+1);

2) Calculate the text density of the content of line i and line (i+1), for example by dividing the number of Chinese characters by the total number of characters;

3) Compare the calculated text density with a preset threshold;

4) If the text density is not less than the preset threshold, determine that the content of line i and line (i+1) is core content; if the text density is less than the preset threshold, determine that the content of line i and line (i+1) is non-core content;

5) If the content of line i and line (i+1) is determined to be core content, determine in the same way whether the content of line i, line (i+1), and line (i+2) is core content;

6) If the content of line i and line (i+1) is determined to be non-core content, determine in the same way whether the content of line (i+2) and line (i+3) is core content;

7) Repeat the above steps until all lines of the web page have been traversed.
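For illustration only, steps 1) to 7) of this prior-art procedure can be sketched in Python as follows. The sketch is not taken from the cited method: the function names, the threshold of 0.5, the definition of a Chinese character as a code point in U+4E00 to U+9FFF, and the rule for where to restart after a grown window stops qualifying are all assumptions.

```python
import re

# Assumption: a "Chinese character" is a code point in the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def text_density(lines):
    """Chinese-character count divided by total character count over the given lines."""
    total = sum(len(line) for line in lines)
    if total == 0:
        return 0.0
    chinese = sum(len(CJK.findall(line)) for line in lines)
    return chinese / total


def mark_core_lines(lines, threshold=0.5):
    """Return one boolean per line: True if the line was judged core content."""
    is_core = [False] * len(lines)
    start = 0
    while start + 1 < len(lines):
        end = start + 2  # the window lines[start:end] begins as a pair of lines
        if text_density(lines[start:end]) < threshold:
            start += 2  # step 6): the pair is non-core, move to the next pair
            continue
        # step 5): the pair is core, so keep adding the next line while density holds
        while end < len(lines) and text_density(lines[start:end + 1]) >= threshold:
            end += 1
        for k in range(start, end):
            is_core[k] = True
        start = end  # assumption: restart at the first line that was not accepted
    return is_core


page = [
    "首页 | 登录 | Home | Login | About",
    "abc def ghi jkl",
    "这是正文的第一句话。",
    "这是正文的第二句话。",
    "Copyright 2015 example",
]
flags = mark_core_lines(page)
```

With this input only the two body sentences are flagged, which also shows the weakness discussed next: any short block of dense Chinese text, such as a disclaimer, would be flagged in the same way.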

When extracting web page core content with the above prior-art method, if the text density of several consecutive lines is not less than the preset threshold, those lines are taken to be body text. Many web pages today, however, contain non-core content that interferes strongly, such as personal profiles, short summaries, and disclaimers. Such non-core content also has a high text density and may well exceed the preset threshold, so it is mistaken for core content; and if the threshold is adjusted, core content may instead be misjudged as non-core content, so the accuracy of core content extraction falls.

In addition, because the algorithm of the above method is rather cumbersome, a page that carries a large amount of content may take a long time to process before its core content is extracted. This harms the user experience and cannot meet the current demand for increasingly fast and efficient information processing.

2. Region segmentation of the web page using page layout information, followed by extraction of the content of the core blocks

The layout of the web page is used to partition the page into multiple parts, which are then classified according to their features. This layout-based method does not suit all web pages, however, and requires processing templates to be set up in advance. Jiangsu Xinruifeng Information Technology Co., Ltd. improved on this method and proposed dividing the web page into regions based on HTML tags and then extracting the text content (patent application No. 201210213554.0). That method relies solely on HTML tags and does not consider the relevance of the text content itself; in practice it can process only news pages effectively (by its own description, its success rate on news pages is 80% to 85%).

3. Extraction of the core content of the web page based on the Document Object Model (DOM)

The document object model of the web document is extracted, and the page content is extracted from specific nodes of that model. In practice, the content nodes in each page's document object model are defined by the page author, so this method cannot be applied to all web pages.

Summary of the invention

To solve the above problems, the object of the present invention is to provide a method for extracting the core content of a web page. The method divides the web page content into paragraphs and locates the core content of the page by means of the length of each paragraph, the text distance between paragraphs, and the text density inside each paragraph.

To achieve the above object, the present invention adopts the following scheme:

A method for extracting the core content of a web page, comprising the following steps:

1) dividing the web page content into multiple paragraphs according to the HTML tags in the web page code;

2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph;

3) calculating the core feature value of each paragraph from the feature values; then, according to the distribution of the core feature values of the paragraphs in the web page, obtaining the range in which the core feature values are most concentrated; paragraphs whose core feature value lies within this threshold range are the core paragraphs of the web page, which yields the core content of the web page.

Further, step 1) divides the web page into paragraphs according to HTML tags (including <p></p>, <div></div>, <span></span>, <div></div>, <br>, <br/>, etc.).

The spacing distances of adjacent paragraphs are of two kinds: the distance between a paragraph and the paragraph before it, and the distance between the paragraph and the paragraph after it.

Further, the spacing distance between adjacent paragraphs is defined as the number of characters between the paragraphs + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph.

Further, the internal density of a paragraph is defined as (total number of Chinese and English characters in the paragraph) / Q, where Q = total number of Chinese and English characters in the paragraph + number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP; Q1, Q2, ..., QP are determined by the type of the HTML tag.

Further, the core feature value of a paragraph is defined as character length of the paragraph × internal density of the paragraph / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph).

Further, in step 3) the core feature value of each paragraph is calculated from the feature values; according to the distribution of the core feature values of the paragraphs, the paragraphs whose core feature value lies within a certain threshold range are selected as core paragraphs, and these paragraphs together form the core text.

Further, the threshold range is chosen on the following basis: the core feature value of a paragraph represents the characteristics of core content in the web page. Within one page the feature values of the core paragraphs are similar, whereas non-core content such as advertisements, disclaimers, and promoted links does not show this concentration. The part in which the paragraph core feature values are most concentrated is therefore selected as the threshold range for choosing core paragraphs.

By adopting the above technical scheme, the present invention has the following advantages over the prior art:

1. It does not rely on HTML tags alone but takes full account of the features of the text paragraphs themselves and of the layout between paragraphs, so its accuracy is high.

2. The implementation does not depend on a particular type of web page; it is general and can handle the common kinds of web pages on the Internet.

3. It is simple to implement, computationally light, and efficient.

Brief description of the drawings

Fig. 1 is a flow diagram of web page core content extraction according to the present invention.

Fig. 2a is the first part of the schematic diagram of web page core content extraction in Embodiment 2 of the present invention.

Fig. 2b is the second part of the schematic diagram of web page core content extraction in Embodiment 2 of the present invention.

Detailed description of the embodiments

To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

First, the core concept of the present invention is explained:

1. The web page code is divided into paragraphs using HTML tags.

An HTML tag (HyperText Markup Language tag) is the most basic unit of the HTML language. HTML is an application of the Standard Generalized Markup Language (SGML), and tags are its most important component.

HTML tags usually have the following characteristics:

1) A tag is a keyword surrounded by angle brackets, such as <html>.

2) Tags usually occur in pairs, such as <div> and </div>.

3) The first tag of a pair is the start tag and the second is the end tag.

4) Start and end tags are also called opening tags and closing tags.

5) Some tags appear alone, such as <img src=".GIF"/>.

6) For tags that occur in pairs, the content lies between the two tags. For tags that appear alone, values are assigned in the tag attributes. Examples: <h1>title</h1> and <input type="text" value="button"/>.

7) The content of a web page must be inside the <html> tag; information such as the title, character format, language, compatibility, keywords, and description goes in the <head> tag, and the content to be displayed must be nested in the <body> tag. Code that is not written to the standard can sometimes still be displayed normally, but as a matter of professionalism one should form the habit of writing well-formed code.

Given these characteristics, the web page code is divided using HTML tags, and the resulting paragraphs have the following feature:

They are enclosed by the following tags:

<p></p>

<br> (or <br/>)

<h1></h1> (<h2></h2> ... <hn></hn>)

These paragraphs are selected, and the text distance between them is calculated from their visual distance and character distance.

Then, the text density value of each paragraph is calculated from how tightly the characters inside the paragraph are packed.

2. A calculation is made from four feature values: the text length of the paragraph, the text distance between the paragraph and the preceding paragraph, the text distance between the paragraph and the following paragraph, and the density value inside the paragraph. The result determines whether the paragraph is core text content of the web page.

Unlike the prior art, the present invention does not judge whether each line or paragraph of the web page code is core content from character density alone or from HTML tags alone. Instead it combines the feature values of paragraph text length, text distance between paragraphs, and density inside the paragraph. Compared with the prior art, it takes full account not only of the characteristics of the HTML document itself and of the visual presentation of the page, but also of the structural features of Chinese text. It can handle all kinds of text on the Internet (including but not limited to composite web pages, news pages, blog pages, encyclopedia pages, and e-commerce sites) and achieves comparatively good results. To verify the effect, we sampled Chinese websites across the global Internet, obtained 100,000 Chinese web pages at random, and processed them with the method of the present invention. The experiment shows that the accuracy of the present invention in extracting the core content of the various kinds of pages reaches 90%. In terms of processing efficiency, with the same computing power the method takes only 25% more time than processing with HTML tags alone and 50% less time than processing with the document object model.

The processing flow of the web page core content extraction method of the present invention is described below with reference to Fig. 1:

First, the web page content is divided into multiple paragraphs according to the HTML tags in the web page code. In this process the HTML tags in the web page code are analyzed, and the parts enclosed by the following tags are divided into paragraphs:

The part enclosed by <hn></hn>

The part enclosed by <p></p>

The part enclosed by <div></div>

The part enclosed by <span></span>

The part between the end of the preceding paired tag and a <br> (or <br/>) tag

Each of the above parts is treated as an independent paragraph.
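As a minimal sketch of this division step, the following Python code uses the standard-library `html.parser.HTMLParser` and starts a new paragraph at every `<hn>`, `<p>`, `<div>`, `<span>`, or `<br>` boundary. The class and function names are hypothetical, and skipping the contents of `<script>` and `<style>` is an added assumption that the description does not mention.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h[1-6]$")
BOUNDARY_TAGS = {"p", "div", "span", "br"}


def is_boundary(tag):
    """True for the tags that delimit paragraphs in the description."""
    return tag in BOUNDARY_TAGS or bool(HEADING.match(tag))


class ParagraphSplitter(HTMLParser):
    """Collects the text between paragraph-delimiting tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0  # assumption: ignore text inside <script>/<style>

    def flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif is_boundary(tag):
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif is_boundary(tag):
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def split_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last tag
    return parser.paragraphs


sample = (
    "<html><body><h1>Title</h1><div><p>First paragraph.</p>"
    "<p>Second line<br/>third line</p></div><span>footer</span></body></html>"
)
paragraphs = split_paragraphs(sample)
```

A fuller implementation would also record each paragraph's start tag and end tag, since the distance calculation that follows depends on them; this sketch keeps only the text for brevity.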

Then, feature values such as the length of each paragraph, the text distance between paragraphs, and the paragraph density are obtained. The text distance between one paragraph and another is given by:

Paragraph distance N = number of characters between the paragraphs + M

The value of M depends on the end tag of the preceding paragraph and the start tag of the current paragraph; different tag combinations give different values of M. The combinations are the following:

</hn> with <hn>

</hn> with <p>

</hn> with <span>

</hn> with <div>

</hn> with the first character of a <br> paragraph

</p> with <p>

</p> with <hn>

</p> with <span>

</p> with <div>

</p> with the first character of a <br> paragraph

</span> with <span>

</span> with <hn>

</span> with <p>

</span> with <div>

</span> with the first character of a <br> paragraph

</div> with <div>

</div> with <hn>

</div> with <p>

</div> with <span>

</div> with the first character of a <br> paragraph
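The description lists the twenty tag combinations but gives no numeric values for M. The following sketch therefore uses purely hypothetical weights to show how a lookup keyed by (end tag of the preceding paragraph, start tag of the current paragraph) could feed the formula N = number of characters between the paragraphs + M. Deriving M as the sum of two per-tag weights is also an assumption made only to keep the table short.

```python
import re

# Hypothetical weights: the source does not disclose any M values.
END_WEIGHT = {"hn": 3, "p": 2, "div": 2, "span": 1}
START_WEIGHT = {"hn": 3, "p": 2, "div": 2, "span": 1, "br": 1}

# One entry per combination listed above: 4 end tags x 5 start tags = 20 pairs.
M_TABLE = {
    (end, start): END_WEIGHT[end] + START_WEIGHT[start]
    for end in END_WEIGHT
    for start in START_WEIGHT
}


def normalize(tag):
    """Map h1..h6 to the generic 'hn' used in the list of combinations."""
    return "hn" if re.fullmatch(r"h[1-6]", tag) else tag


def paragraph_distance(chars_between, prev_end_tag, this_start_tag):
    """N = number of characters between the paragraphs + M."""
    m = M_TABLE[(normalize(prev_end_tag), normalize(this_start_tag))]
    return chars_between + m


# A heading followed, five characters later, by a <p> paragraph.
n = paragraph_distance(5, "h2", "p")
```

Keeping M in a table makes it straightforward to replace the placeholder weights with values tuned on real pages.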

Next, the text density value of each paragraph itself is calculated:

Text density value of a paragraph = total number of Chinese and English characters in the paragraph / Q

Q = total number of Chinese and English characters in the paragraph + number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP

(Q1, Q2, ..., QP differ according to the HTML tag.)
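The density formula can be sketched as below, under several assumptions: "Chinese and English characters" are taken to be CJK ideographs and ASCII letters, punctuation is a small fixed set, the length of a tag is the length of its markup, and all weights are hypothetical. The formula writes Q1 for both the punctuation weight and the weight of HTML tag 1; the sketch keeps a separate `PUNCT_WEIGHT` so the two can be set equal or tuned independently.

```python
import re

CHAR = re.compile(r"[A-Za-z\u4e00-\u9fff]")  # assumption: letters and CJK ideographs
PUNCT = re.compile(r"[,.;:!?，。；：！？、]")  # assumption: a small punctuation set
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Hypothetical weights; the source only says they depend on the tag type.
TAG_WEIGHTS = {"a": 2.0, "img": 3.0, "b": 0.5}
DEFAULT_TAG_WEIGHT = 1.0
PUNCT_WEIGHT = 0.5


def internal_density(paragraph_html):
    """Character total divided by Q, where Q adds weighted punctuation and tag lengths."""
    text = TAG.sub("", paragraph_html)
    chars = len(CHAR.findall(text))
    if chars == 0:
        return 0.0
    q = chars + len(PUNCT.findall(text)) * PUNCT_WEIGHT
    for match in TAG.finditer(paragraph_html):
        weight = TAG_WEIGHTS.get(match.group(1).lower(), DEFAULT_TAG_WEIGHT)
        q += len(match.group(0)) * weight  # tag length = length of its markup
    return chars / q


plain = internal_density("Hello world.")
linked = internal_density('<a href="x">ad</a>')
```

Here the plain sentence scores close to 1, while the link-heavy fragment scores far lower because the markup of the `<a>` tags inflates Q.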

Finally, the core paragraphs of the web page are determined from the length of each paragraph, the distances between the paragraph and its preceding and following paragraphs, and the density inside the paragraph, and the core content of the web page is thereby determined. The calculation is as follows:

Paragraph core feature value = paragraph length × density value inside the paragraph / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph)
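A small worked example of this formula follows. The `Paragraph` class, its field names, and the sample numbers are hypothetical. The description does not say how the first and last paragraphs, which lack one neighbor, are handled, so the sketch simply requires a positive total spacing.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    length: int        # character length of the paragraph
    density: float     # internal density of the paragraph
    dist_prev: float   # distance to the preceding paragraph
    dist_next: float   # distance to the following paragraph


def core_feature_value(p):
    """length x density / (distance to previous + distance to next)."""
    spacing = p.dist_prev + p.dist_next
    if spacing <= 0:
        raise ValueError("total spacing must be positive")
    return p.length * p.density / spacing


# A long, dense body paragraph that sits close to its neighbors ...
body = Paragraph(length=200, density=0.9, dist_prev=4, dist_next=4)
# ... and a short, tag-heavy navigation item that is far from its neighbors.
nav = Paragraph(length=6, density=0.2, dist_prev=10, dist_next=12)

body_value = core_feature_value(body)  # 200 * 0.9 / 8
nav_value = core_feature_value(nav)    # 6 * 0.2 / 22
```

Long, dense paragraphs that are tightly packed with their neighbors score high, while short, isolated, markup-heavy fragments score low, which is what separates the two groups in the selection step.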

Finally, according to the distribution of the paragraph core feature values in the web page, the content within a certain threshold range is selected as the core text.
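The description does not state how the most concentrated range is found. One possible reading, offered only as an assumption, is to slide a window of fixed relative width over the sorted core feature values and keep the window that contains the most paragraphs. The function names and the `ratio` parameter below are hypothetical.

```python
def densest_range(values, ratio=2.0):
    """Return (low, high) for the window [v, v * ratio] holding the most values."""
    positives = sorted(v for v in values if v > 0)
    if not positives:
        return None
    best_count, best_low, best_high = 0, positives[0], positives[0]
    right = 0
    for left, low in enumerate(positives):
        # advance the right edge while values still fit in [low, low * ratio]
        while right < len(positives) and positives[right] <= low * ratio:
            right += 1
        count = right - left
        if count > best_count:
            best_count, best_low, best_high = count, low, positives[right - 1]
    return best_low, best_high


def select_core(values, ratio=2.0):
    """Indices of the paragraphs whose core feature value lies in the densest range."""
    found = densest_range(values, ratio)
    if found is None:
        return []
    low, high = found
    return [i for i, v in enumerate(values) if low <= v <= high]


# Hypothetical core feature values for eight paragraphs of one page.
h_values = [0.05, 0.3, 21.0, 22.5, 19.8, 0.9, 24.0, 3.1]
core_indices = select_core(h_values)
```

A relative window is used here so that many tiny values near zero do not automatically form the densest cluster; a histogram over the values or a clustering method would be equally plausible readings of the text.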

Using the method of the present invention, a search engine can process massive numbers of web pages efficiently and extract their core content without storing the original page content, which saves large amounts of storage and computation, and it can return web page core content accurately in search results.

Using the method of the present invention, an information collection system is not affected by advertisements or dynamic page elements and can collect web page core content conveniently and efficiently.

The system obtains the web page code and divides the text content of the page into paragraphs P1 to Pn according to the HTML tags. Using the method described above, it calculates the length of each paragraph, Lp1 to Lpn; the text distance between each paragraph and the preceding paragraph, Dp(before)1 to Dp(before)n; the text distance between each paragraph and the following paragraph, Dp(after)1 to Dp(after)n; and the density of each paragraph, Mp1 to Mpn. From these four feature values it calculates the paragraph core feature values Hp1 to Hpn, selects by threshold, and obtains the core paragraphs Px, Px+1, ..., Py, that is, the core content of the page. The calculation process is shown in Fig. 2a and Fig. 2b.

Claims

1. A method for extracting the core content of a web page, comprising the following steps:

1) dividing the web page content into multiple paragraphs according to the HTML tags in the web page code;

2) computing, as feature values, the character length of each paragraph, the spacing distance between adjacent paragraphs, and the internal density of each paragraph; the spacing distances of adjacent paragraphs are of two kinds: the distance between a paragraph and the paragraph before it, and the distance between the paragraph and the paragraph after it; the spacing distance between adjacent paragraphs is defined as the number of characters between the paragraphs + M, where the value of M is determined by the end tag of the preceding paragraph and the start tag of the current paragraph;

3) calculating the core feature value of each paragraph from the feature values; according to the distribution of the core feature values of the paragraphs in the web page, obtaining the range in which the core feature values are most concentrated; paragraphs whose core feature value lies within this threshold range are the core paragraphs of the web page, which yields the core content of the web page.

2. The method for extracting the core content of a web page according to claim 1, characterized in that the HTML tags in step 1) include <p>, </p>, <div>, </div>, <span>, </span>, <div>, </div>, <br>, <br/>.

3. The method for extracting the core content of a web page according to claim 1, characterized in that the internal density of a paragraph is defined as (total number of Chinese and English characters in the paragraph) / Q, where Q = total number of Chinese and English characters in the paragraph + number of punctuation marks in the paragraph × Q1 + length of HTML tag 1 × Q1 + length of HTML tag 2 × Q2 + ... + length of HTML tag P × QP; Q1, Q2, ..., QP are determined by the type of the HTML tag.

4. The method for extracting the core content of a web page according to claim 3, characterized in that the core feature value of a paragraph is defined as character length of the paragraph × internal density of the paragraph / (distance between the paragraph and the preceding paragraph + distance between the paragraph and the following paragraph).

5. The method for extracting the core content of a web page according to claim 1, characterized in that calculating the core feature value of each paragraph from the feature values in step 3) includes, according to the distribution of the core feature values of the paragraphs, selecting the paragraphs whose core feature value lies within a certain threshold range as core paragraphs, these paragraphs together forming the core text.

6. The method for extracting the core content of a web page according to claim 5, characterized in that the threshold range is chosen on the following basis: the part in which the paragraph core feature values are most concentrated is selected as the threshold range for choosing core paragraphs.