CN107463696A

CN107463696A - Method for extracting the largest block of a web page

Info

Publication number: CN107463696A
Application number: CN201710694534.2A
Authority: CN
Inventors: 苑聪虎; 程国艮; 李世奇
Original assignee: Mandarin Technology (beijing) Co Ltd
Current assignee: Mandarin Technology (beijing) Co Ltd
Priority date: 2017-08-15
Filing date: 2017-08-15
Publication date: 2017-12-12

Abstract

The invention belongs to the field of web page design and discloses a method for extracting the largest block of a web page, comprising: first obtaining the web page source code and using regular expressions to replace everything that is not displayed on the page, such as tags, with blank lines; then counting the characters on each line, which gives a text distribution function over line number that forms a curve; then summing every 3 lines (step length 3) into one block, which gives a new, more concentrated text distribution function; and finally locating the points of sharp rise and sharp drop in the concentrated function, extracting the text between them as the body text of the web page, and saving it. The invention can greatly reduce the labor and maintenance cost of web content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.

Description

Method for extracting the largest block of a web page

Technical field

The invention belongs to the field of web page design, and in particular relates to a method for extracting the largest block of a web page.

Background art

At present, the main approach to page extraction is to write customized rules for each website. The page to be crawled is analyzed first, and extraction rules are hand-written according to its structure. The rules are written in XPath or regular expressions; the page content is then extracted according to these rules and saved locally, which completes the extraction of one web page. The maintenance cost of this approach is too high: every website that is crawled needs its own set of rules. Once the page format changes, the rules written for that website become invalid, which makes content extraction difficult.

In summary, the problem with the prior art is as follows. Current techniques extract content with rules specific to each web page, so extracting content from a large number of pages involves writing, optimizing, and maintaining a large number of extraction rules. This consumes substantial human resources, and the original rules fail as soon as the page format changes. The present invention is proposed to avoid extraction failures caused by page changes, to reduce development and maintenance cost, and to make web content extraction more intelligent and adaptive.

Summary of the invention

In view of the problems of the prior art, the invention provides a method for extracting the largest block of a web page.

The invention is realized as a method for extracting the largest block of a web page, which specifically includes:

The HTML code of the web page is obtained and pre-processed: the content is encoded, scripts are removed, and special characters are removed. After the HTML source is obtained, the content is encoded according to the encoding declared by the website; if the website declares no content encoding in a meta tag, the content is encoded as utf8 by default. Once encoding is complete, the regular expression '<script[^>]*>.*</script>' is used to remove the scripts enclosed in <script></script>, and special comment text <!--...--> is removed as well. The escaped special characters &nbsp;, &lt;, &gt;, &amp;, &quot;, &apos; and so on are unescaped into the corresponding space, <, >, &, ", ' and so on.

Format tags are removed to obtain the text of the whole page: the regular expression "<[^>]*>" is applied to the text to remove all tags in the HTML. In the resulting rough text distribution, the tags and scripts that were processed have become blank lines.

The distribution function of the character count per line block is calculated.

The points of sharp rise and sharp drop are found from the changes in the distribution function, which gives the valuable text.

The text is processed: according to the line numbers obtained, the content of each line is joined in line-number order to form the required body content. The body content obtained this way still contains empty lines, so the regular expression '^\n' is used to replace them with nothing. The article body is finally obtained.
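The pre-processing and tag-removal steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the script pattern is made non-greedy and multi-line (an assumption; the text gives it as '<script[^>]*>.*</script>'), and entities are unescaped after tag removal so that an escaped `&lt;` is not mistaken for a tag. Removed spans are replaced by the same number of newlines so that the surviving text keeps its original line numbers.

```python
import html
import re

# Assumption: non-greedy, multi-line variants of the patterns in the text.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def _blank_keep_lines(match):
    # Replace a match with as many newlines as it contained, so removed
    # scripts, comments, and tags turn into blank lines, not missing lines.
    return "\n" * match.group(0).count("\n")


def rough_text(page_source):
    """Return the rough text of a page, one list entry per source line."""
    text = SCRIPT_RE.sub(_blank_keep_lines, page_source)
    text = COMMENT_RE.sub(_blank_keep_lines, text)
    text = TAG_RE.sub(_blank_keep_lines, text)
    # Unescape &nbsp; &lt; &gt; &amp; &quot; &apos; and map nbsp to a space.
    text = html.unescape(text).replace("\xa0", " ")
    return [line.strip() for line in text.split("\n")]
```

Each entry of the returned list corresponds to one line of the original source, which is what the later line-number-based steps rely on.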

Further, calculating the distribution function of the character count per line block includes: establishing a line-number-based text length distribution function for the rough text. The function is plotted on coordinate axes, where the x-axis represents the text line number LN and the y-axis represents the text length LL of that line.

Then, with a step of 3, the text lengths of every 3 consecutive lines are summed into one block, giving LN/3 blocks. This concentrates the text length and makes the sharp rise and sharp drop points easier to distinguish.
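A minimal sketch of the line-length distribution and the 3-line block merge, assuming the rough text is already available as a list of lines. The function names are hypothetical, and letting the last block be shorter when LN is not a multiple of 3 is an assumption the text does not address.

```python
def line_lengths(lines):
    """LL for each line number LN: the character count of that line."""
    return [len(line) for line in lines]


def block_lengths(lengths, step=3):
    """Sum every `step` consecutive line lengths into one block.

    Blocks do not overlap, so LN lines give about LN / step blocks; the
    final block may cover fewer than `step` lines.
    """
    return [sum(lengths[i:i + step]) for i in range(0, len(lengths), step)]
```

For example, line lengths of 1, 4, 5 followed by 0, 0, 2 merge into two blocks with lengths 10 and 2.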

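Further, the method of finding the points of sharp rise and sharp drop from the changes in the distribution function and obtaining the valuable text includes:

Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The distance from each point to this curve, that is, the sum of squared deviations, is:

$$R^2 = \sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]^2$$

To find the values of $a$ that fit the data, the above expression is differentiated with respect to each $a_j$ ($j = 0, 1, \dots, k$) and the derivative is set to zero:

$$-2\sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]x_i^j = 0$$

After simplification the result is:

$$a_0\sum_{i=1}^{n} x_i^{j} + a_1\sum_{i=1}^{n} x_i^{j+1} + \dots + a_k\sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$

In matrix form these equations read $X^{T}XA = X^{T}Y$, which is the least-squares form of the system:

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

That is, $XA = Y$, so $A = (X^{T}X)^{-1}X^{T}Y$. This gives the coefficient matrix $A$, and with it the parameters of the fitted curve. Once the fitted curve is obtained, differentiating it gives its gradient function:

$$y' = a_1 + 2a_2 x + \dots + k\,a_k x^{k-1}$$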

Substituting every line number of the page into the gradient function gives the gradient of the curve at each point. The difference between each pair of adjacent gradients is then calculated and its absolute value taken, and the line numbers with the largest differences are found.

These line numbers are the points where the text rises and drops sharply. They delimit the main body of the page, from which the body text is finally extracted.
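The fitting and gradient-difference steps can be sketched without external libraries as follows. This is an illustration under assumptions, not the patented implementation: the function names, the default polynomial degree of 3, and returning the 2 largest differences (one rise, one drop) are all hypothetical choices, since the text does not fix them. The normal equations are solved by Gauss-Jordan elimination.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients a_0..a_k from the normal equations."""
    size = degree + 1
    # Augmented matrix [X^T X | X^T Y]; entry (r, c) is sum(x ** (r + c)).
    rows = [
        [sum(x ** (r + c) for x in xs) for c in range(size)]
        + [sum((x ** r) * y for x, y in zip(xs, ys))]
        for r in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[r][size] / rows[r][r] for r in range(size)]


def gradient(coeffs, x):
    """y'(x) = a_1 + 2*a_2*x + ... + k*a_k*x**(k-1)."""
    return sum(j * a * x ** (j - 1) for j, a in enumerate(coeffs) if j > 0)


def sharpest_changes(xs, ys, degree=3, count=2):
    """Indices i where |y'(xs[i+1]) - y'(xs[i])| is largest."""
    coeffs = fit_polynomial(xs, ys, degree)
    grads = [gradient(coeffs, x) for x in xs]
    diffs = [abs(grads[i + 1] - grads[i]) for i in range(len(grads) - 1)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:count])
```

With raw line numbers in the hundreds and a high degree, the normal equations become ill-conditioned, so in practice the x values would likely need rescaling before fitting.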

Another object of the present invention is to provide a web page largest-block extraction system.

The invention exploits the fact that the useful information displayed on a web page is largely concentrated in contiguous blocks. The largest-block method can quickly and effectively locate the body content of a page and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload: if one website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial investment.

For example, when website content is extracted with existing rule-based methods, 1000 websites need 10 people to write the corresponding rules, which takes about a week, and maintaining the content crawling needs another 2 people. The present invention requires no rule writing and needs only 1 person for crawler maintenance, which saves the workload of 11 people.

Rule-based methods also face the risk of changes in page structure: once a website's pages change, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance cost of web content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.

Brief description of the drawings

Fig. 1 is a flow chart of the method for extracting the largest block of a web page provided by an embodiment of the present invention.

Fig. 2 is a schematic diagram of the distribution obtained for a page by largest-block analysis, provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

Current techniques extract content with rules specific to each web page, so extracting content from a large number of pages involves writing, optimizing, and maintaining a large number of extraction rules. This consumes substantial human resources, and the original rules fail as soon as the page format changes. The present invention is proposed to avoid extraction failures caused by page changes, to reduce development and maintenance cost, and to make web content extraction more intelligent and adaptive.

The application principle of the invention is described in detail below with reference to the accompanying drawings.

The method for extracting the largest block of a web page provided by an embodiment of the present invention includes:

The web page source code is obtained first, and regular expressions are used to replace everything that is not displayed on the page, such as tags, with blank lines. The number of characters on each line is then counted, which gives a function distribution like the one in Fig. 2; this distribution forms a curve. Blocks are then combined with a step of 3 (the length of each block is the sum of 3 lines), which gives a new, more concentrated text distribution function. Finally, the points of sharp rise and sharp drop are located in the concentrated function, and the text between them is extracted as the body text of the web page and saved.

As shown in Fig. 1, the method for extracting the largest block of a web page provided by an embodiment of the present invention specifically includes:

S101: Obtain the HTML code of the web page, then pre-process it (encode the content, remove scripts, remove special characters).

S102: Remove format tags to obtain the text of the whole page.

S103: Calculate the distribution function of the character count per line block.

S104: Find the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text.

S105: Process the text, removing the blank lines within it, to obtain the effective body content, and save it.

The invention is further described below with reference to a specific embodiment.

The method for extracting the largest block of a web page provided by an embodiment of the present invention includes:

1) Obtaining the HTML source: the target site is crawled with a general-purpose crawler, and the returned content is converted into string data.

2) After the HTML source is obtained, the content is encoded according to the encoding declared by the website; if the website declares no content encoding in a meta tag, the content is encoded as utf8 by default. Once encoding is complete, regular expressions are used to remove the scripts enclosed in <script></script> and special comment text <!--...-->. A regular expression is then applied to the text to remove all tags in the HTML. This produces the rough text distribution, in which the tags, scripts, and so on processed above have all become blank lines.

3) A line-number-based text length distribution function is established for the rough text. The function is plotted on coordinate axes, where the x-axis represents the text line number LN and the y-axis represents the text length LL of that line. Then, to make the jumps in the text distribution stand out, the text lengths of every 3 consecutive lines are merged, with a step of 3, into one block (for example, if line 1 has length 1, line 2 has length 4, and line 3 has length 5, the first block has length 10 after merging; lines 4, 5, and 6 form the next block, and so on). There are therefore LN/3 blocks. This concentrates the text length so that the sharp rise and sharp drop points can be distinguished more easily.
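The encoding rule in step 2) can be sketched as follows: use the character set declared in a meta tag if there is one, and fall back to UTF-8 otherwise. The function name and the regular expression are assumptions made for this illustration; the text does not say how the meta declaration is located.

```python
import re

# Assumption: matches both <meta charset="gbk"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312"> form.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_page(raw_bytes, default="utf-8"):
    """Decode page bytes with the meta-declared charset, else `default`."""
    match = META_CHARSET_RE.search(raw_bytes)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows; use the default.
        return raw_bytes.decode(default, errors="replace")
```

Decoding with `errors="replace"` keeps the line structure intact even when a page contains bytes that are invalid in its declared encoding, which matters because the later steps depend on line numbers.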

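A page analyzed with the largest-block method yields the distribution shown in Fig. 2.

According to the above distribution function, the sharp rise point and the sharp drop point of the text are found. The principle of the curve fitting is as follows. Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The sum of squared deviations of the points from this curve is:

$$R^2 = \sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]^2$$

Differentiating with respect to each $a_j$ ($j = 0, 1, \dots, k$) and setting the derivative to zero gives:

$$-2\sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]x_i^j = 0$$

After simplification the result is:

$$a_0\sum_{i=1}^{n} x_i^{j} + a_1\sum_{i=1}^{n} x_i^{j+1} + \dots + a_k\sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$

In matrix form these equations read $X^{T}XA = X^{T}Y$, where row $i$ of $X$ is $(1, x_i, \dots, x_i^k)$, $A = (a_0, a_1, \dots, a_k)^{T}$, and $Y = (y_1, y_2, \dots, y_n)^{T}$, so $A = (X^{T}X)^{-1}X^{T}Y$. This gives the parameters of the fitted curve. Differentiating the fitted curve gives its gradient function:

$$y' = a_1 + 2a_2 x + \dots + k\,a_k x^{k-1}$$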

Substituting every line number of the page into the gradient function gives the gradient of the curve at each point. The difference between each pair of adjacent gradients is then calculated and its absolute value taken, and the line numbers with the largest differences are found.

These line numbers are the points where the text rises and drops sharply, for example lines 150-225 in Fig. 2. This gives the main body of the page; extracting its text completes the process.
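Once the rise and drop positions are known, the final extraction step amounts to slicing the rough text and dropping blank lines. A minimal sketch follows, with hypothetical function names; it assumes the positions are 0-based, and it filters empty lines directly rather than with a regular expression, which has the same effect.

```python
def block_to_rows(first_block, last_block, step=3):
    """Map an inclusive block range back to an inclusive row range."""
    return first_block * step, last_block * step + step - 1


def extract_body(lines, first_row, last_row):
    """Join rows first_row..last_row (inclusive), skipping blank lines."""
    selected = lines[first_row:last_row + 1]
    return "\n".join(line for line in selected if line.strip())
```

`block_to_rows` is only needed when the rise and drop points were located on the merged 3-line blocks instead of on individual lines.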

The invention exploits the fact that the useful information displayed on a web page is largely concentrated in contiguous blocks. The largest-block method can quickly and effectively locate the body content of a page and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload: if one website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial investment. Rule-based methods also face the risk of changes in page structure: once a website's pages change, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance cost of web content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for extracting the largest block of a web page, characterized in that the method comprises:

first obtaining the web page source code, and using regular expressions to replace everything that is not displayed on the page, such as tags, with blank lines;

then calculating the function formed by the number of characters on each line, the resulting function distribution forming a curve;

then combining blocks with a step length of 3, each block being the sum of 3 lines, to form a new, more concentrated text distribution function;

then locating the points of sharp rise and sharp drop in the concentrated function, extracting the text between them as the body text of the web page, and saving it.

2. The method for extracting the largest block of a web page as claimed in claim 1, characterized in that the method specifically comprises:

obtaining the HTML code of the web page and pre-processing it by encoding the content, removing scripts, and removing special characters; after the HTML source is obtained, encoding the content according to the website's encoding; once encoding is complete, using the regular expression '<script[^>]*>.*</script>' to remove the scripts enclosed in <script></script> and removing special comment text <!--...-->; unescaping the escaped special characters &nbsp;, &lt;, &gt;, &amp;, &quot;, &apos; into the corresponding space, <, >, &, ", ';

removing format tags to obtain the text of the whole page: applying the regular expression "<[^>]*>" to the text to remove all tags in the HTML, so that in the resulting rough text distribution the processed tags and scripts have become blank lines;

calculating the distribution function of the character count per line block;

finding the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text;

processing the text: according to the line numbers obtained, joining the content of each line in line-number order to form the body content; the body content so obtained still contains empty lines, which are removed by replacing the regular expression '^\n' with nothing; and finally obtaining the article body.

3. The method for extracting the largest block of a web page as claimed in claim 2, characterized in that calculating the distribution function of the character count per line block comprises: establishing a line-number-based text length distribution function for the rough text, the function being plotted on coordinate axes where the x-axis represents the text line number LN and the y-axis represents the text length LL of that line;

then, with a step of 3, summing the text lengths of every 3 consecutive lines into one block, giving LN/3 blocks, which concentrates the text length and makes the sharp rise and sharp drop points easier to distinguish.

4. The method for extracting the largest block of a web page as claimed in claim 2, characterized in that finding the points of sharp rise and sharp drop from the changes in the distribution function and obtaining the valuable text comprises:

letting the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$, the distance from each point to this curve, that is, the sum of squared deviations, being:

$$R^2 = \sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]^2;$$

to find the values of $a$ that fit the data, differentiating the above expression with respect to each $a_j$ ($j = 0, 1, \dots, k$), which gives:

$$-2\sum_{i=1}^{n}\left[y_i - \left(a_0 + a_1 x_i + \dots + a_k x_i^k\right)\right]x_i^j = 0;$$

the result after simplification being:

$$a_0\sum_{i=1}^{n} x_i^{j} + a_1\sum_{i=1}^{n} x_i^{j+1} + \dots + a_k\sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i;$$

then converting the equations to matrix form, $X^{T}XA = X^{T}Y$, where row $i$ of $X$ is $(1, x_i, \dots, x_i^k)$, $A = (a_0, a_1, \dots, a_k)^{T}$, and $Y = (y_1, y_2, \dots, y_n)^{T}$;

that is, $XA = Y$ in the least-squares sense, so $A = (X^{T}X)^{-1}X^{T}Y$, which gives the coefficient matrix $A$ and with it the parameters of the fitted curve; after the fitted curve is obtained, differentiating it to obtain its gradient function:

$$y' = a_1 + 2a_2 x + \dots + k\,a_k x^{k-1};$$

substituting every line number of the page into the gradient function to obtain the gradient of the curve at each point, then calculating the difference between each pair of adjacent gradients, taking its absolute value, and finding the line numbers with the largest differences;

these line numbers being the points where the text rises and drops sharply, which gives the main body of the page, from which the text is finally extracted.

5. A web page largest-block extraction system using the method for extracting the largest block of a web page as claimed in claim 1.