CN101237465A

CN101237465A - A Webpage Text Extraction Method Based on Fast Fourier Transform

Info

Publication number: CN101237465A
Application number: CNA2007100631827A
Authority: CN
Inventors: 王劲林; 李蕾; 李晔; 白鹤; 胡晶晶
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2007-01-30
Filing date: 2007-01-30
Publication date: 2008-08-06
Anticipated expiration: 2027-01-30
Also published as: CN101237465B

Abstract

The invention discloses a method for extracting the main text of a web page based on the fast Fourier transform, comprising: reading in an HTML file, converting it to Unicode, and storing it in a character array; dividing the character array into windows; statistically analyzing the positions at which each character occurs in the document and, according to the result, converting each character into a text strength value by strength encoding, so that each window character segment corresponds to a sequence of strength values; applying a fast Fourier transform to each strength value sequence to obtain a vector F in the frequency domain; calculating the distance between any two window character segments; defining intervals over the window character segments, where an interval is a run of consecutive windows represented by a number pair (b, e), and calculating the weight of each interval from the distances between window character segments; and sorting the weights of all intervals and selecting the best text interval according to the weights. The method extracts web page text with high accuracy and effectively separates the text from the other parts of the page.

Description

A Webpage Text Extraction Method Based on Fast Fourier Transform

Technical Field

The invention relates to text information processing, and in particular to a method for extracting the main text of a web page based on the fast Fourier transform.

Background Art

With the continued growth of the Internet and the sharp increase in the number of Web pages, the Web has become a huge and widely distributed source of information. Much information is buried in the vast Web, and helping people extract the useful part quickly has become an important problem.

Given the characteristics of HTML pages, a page needs to be divided into regions using its structural layout information and parsed in a way that simulates how the IE browser displays it. Following principles of human vision, the system divides the parsed result into blocks and then extracts the content of the blocks the user needs. Page segmentation is therefore a common means of extracting useful information from web pages. The segmentation methods in common use today mainly include the following:

1. Segmentation based on positional relationships: This method divides a page into blocks according to its layout, splitting it into five parts (top, bottom, left, right, and middle), and then classifies the page according to the features of these five parts. Real page structures are much more complicated, however, so this layout-based method does not apply to all pages. The segmentation is also coarse-grained, which may destroy the inherent features of the page and makes it hard to capture the semantic features of the whole page.

2. Segmentation based on the Document Object Model (DOM): This method finds specific tags in the HTML document of a page and uses them to represent the document as a DOM tree; it then extracts the useful tree node data according to specific tags such as heading, table, paragraph, and list. In many cases, however, the DOM is not meant to represent the content structure of a page, so this method cannot accurately identify the semantic information of each block. Further description of such methods can be found in Reference 1: Wang Qi, Tang Shiwei, Yang Dongqing, "Automatic Extraction of Web Page Topic Information Based on DOM" [J], Computer Research and Development, 2004, 41(10): 1786-1791;

Reference 2: Hu Fei, "Web Page Region Division and Search Method Based on Tag Tree" [J], Computer Science, 2005, 32(8): 182-185; Reference 3: Chang Yuhong, Jiang Zhe, Zhu Xiaoyan, "Page Structure Analysis Based on Tag Tree Representation" [J], Computer Engineering and Applications, 2004(16): 129-132.

Summary of the Invention

The purpose of the invention is to overcome the defect that existing text extraction methods cannot accurately define the text region and therefore cannot accurately extract the text, and to provide a text extraction method based on the fast Fourier transform.

To achieve this purpose, the invention provides a method for extracting the main text of a web page based on the fast Fourier transform, comprising the following steps:

Step 10): read in an HTML file, convert it to Unicode, and store it in a character array;

Step 20): divide the character array obtained in step 10) into windows, each resulting window character segment containing a fixed number of characters;

Step 30): statistically analyze the positions at which each character occurs in the document and, according to the result, convert each character by strength encoding to obtain its text strength value, so that each window character segment corresponds to a strength value sequence;

Step 40): apply a fast Fourier transform to the strength value sequence of each window character segment obtained in step 30) to obtain a vector F in the frequency domain;

Step 50): calculate the distance between any two window character segments from the results of the fast Fourier transform;

Step 60): define intervals over the window character segments, where an interval is a run of consecutive windows represented by a number pair (b, e), and calculate the weight of each interval from the distances between any two window character segments obtained in step 50);

Step 70): sort the weights of all intervals calculated in step 60) and select the best text interval according to the weights.

In the above technical solution, in step 30), the results of the statistical analysis include the mean and standard deviation of the positions at which a character occurs, and the number of times the character occurs in the document.

The strength value sequence is calculated as follows:

$I_{i,j} = M(W_{i,j},\ i \cdot l + j) = M(S_{i \cdot l + j},\ i \cdot l + j), \quad i = 0, \ldots, w-1, \quad j = 0, \ldots, l-1$

where M calculates the strength value of one character, W is the two-dimensional array of window character segments, S is the character array, i is the number of the window character segment, j is the position within the window character segment, l is the length of a window character segment, and w is the number of window character segments.

When M is calculated, the text strength value of a character c occurring at position x is:

$M(c, x) = \begin{cases} N_c \cdot \exp\left(-\left(\frac{x - \mu_c}{\sigma_c}\right)^2\right), & c \in \text{'a'} \sim \text{'z'},\ \text{'A'} \sim \text{'Z'},\ \text{0x0100} \sim \text{0xFFFF} \\ N_c \cdot \left(\exp\left(-\left(\frac{x - \mu_c}{\sigma_c}\right)^2\right) - 1\right), & \text{otherwise} \end{cases}$

In this formula, $\mu_c$ is the mean of the positions at which character c occurs, $\sigma_c$ is the standard deviation of those positions, and $N_c$ is the number of times character c occurs.

In the above technical solution, in step 50), the distance between any two segments is the sum of the Euclidean distances at each frequency, calculated as follows:

$D_{i,j} = \mathrm{distance}(F_i, F_j) = \sum_{k=0}^{l-1} \sqrt{\lVert F_{i,k} - F_{j,k} \rVert^2}$

where F is the result of the fast Fourier transform in step 40).

In step 60), the weight of an interval is the sum of the inter-group differences minus the sum of the intra-group differences, calculated as follows:

$V(b, e) = \mathrm{InterGroup}(b, e) - \mathrm{IntraGroup}(b, e)$

$\mathrm{InterGroup}(b, e) = \sum_{\mathrm{Group}(i) \neq \mathrm{Group}(j)} D_{i,j}$

$\mathrm{IntraGroup}(b, e) = \sum_{\mathrm{Group}(i) = \mathrm{Group}(j)} D_{i,j}$

where InterGroup is the inter-group difference, IntraGroup is the intra-group difference, and $D_{i,j}$ is the distance between any two window character segments calculated in step 50).
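As an illustration, the interval weight V(b, e) can be computed directly from a distance matrix by brute force. This is a minimal sketch, not the patent's implementation: the function name `interval_weight` and the list-of-lists layout of `d` are hypothetical, and it assumes that each unordered pair of windows is counted once, which the formulas leave open.

```python
def interval_weight(d, b, e):
    """Brute-force V(b, e) = InterGroup - IntraGroup for the interval [b, e).

    d is a symmetric w x w matrix of window distances (assumed layout).
    Windows b .. e-1 form group A; all other windows form group B.
    Each unordered pair (i, j), i < j, is counted once (an assumption).
    """
    w = len(d)
    if not 0 <= b < e <= w:
        raise ValueError("interval must satisfy 0 <= b < e <= w")

    def in_a(i):
        return b <= i < e

    inter = 0.0
    intra = 0.0
    for i in range(w):
        for j in range(i + 1, w):
            if in_a(i) == in_a(j):
                intra += d[i][j]   # both windows in A, or both in B
            else:
                inter += d[i][j]   # one window in A, the other in B
    return inter - intra


# Toy distance matrix for three windows (made-up values).
d = [[0.0, 1.0, 5.0],
     [1.0, 0.0, 4.0],
     [5.0, 4.0, 0.0]]
weight = interval_weight(d, 0, 2)   # A = {0, 1}, B = {2}
```

If ordered pairs were counted instead, both sums would double for a symmetric matrix with a zero diagonal, so the ranking of intervals would not change.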

In step 60), the weight of each interval is calculated with an acceleration algorithm based on cumulative distances, as follows:

$\bar{D}_{i,j} = \sum_{x=0}^{i-1} \sum_{y=0}^{j-1} D_{x,y}, \quad i = 1, \ldots, w, \quad j = 1, \ldots, w$

$\sum_{i=a}^{b-1} \sum_{j=c}^{d-1} D_{i,j} = \bar{D}_{b,d} - \bar{D}_{a,d} - \bar{D}_{b,c} + \bar{D}_{a,c}$

where $D_{x,y}$ is the distance between segment x and segment y, and $\bar{D}_{i,j}$ is the cumulative distance between window character segments 0, 1, ..., (i-1) and window character segments 0, 1, ..., (j-1).
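The cumulative distance is a two-dimensional prefix sum, so the sum over any rectangular block of the distance matrix takes four lookups. The sketch below shows one way this could be used to evaluate the interval weight for every (b, e) and pick the largest. All function names are hypothetical. It assumes a symmetric distance matrix with a zero diagonal, treats cumulative entries with index 0 as empty sums equal to 0, and counts each unordered pair once.

```python
def cumulative_distances(d):
    """acc[i][j] = sum of d[x][y] for x < i and y < j; row 0 and column 0 are 0."""
    w = len(d)
    acc = [[0.0] * (w + 1) for _ in range(w + 1)]
    for i in range(1, w + 1):
        for j in range(1, w + 1):
            acc[i][j] = (d[i - 1][j - 1] + acc[i - 1][j]
                         + acc[i][j - 1] - acc[i - 1][j - 1])
    return acc


def block_sum(acc, a, b, c, d_end):
    """Sum of d[i][j] for a <= i < b and c <= j < d_end, using four lookups."""
    return acc[b][d_end] - acc[a][d_end] - acc[b][c] + acc[a][c]


def fast_interval_weight(acc, b, e):
    """V(b, e) for the interval [b, e), computed from cumulative distances."""
    w = len(acc) - 1
    # Inter-group sum: rows in A = [b, e), columns in B = [0, b) and [e, w).
    inter = block_sum(acc, b, e, 0, b) + block_sum(acc, b, e, e, w)
    # Symmetric matrix with zero diagonal: all unordered pairs sum to total / 2.
    all_pairs = block_sum(acc, 0, w, 0, w) / 2.0
    intra = all_pairs - inter
    return inter - intra


def best_interval(d):
    """Return the (b, e) with the largest weight (first one found on ties)."""
    w = len(d)
    acc = cumulative_distances(d)
    candidates = [(b, e) for b in range(w) for e in range(b + 1, w + 1)]
    return max(candidates, key=lambda be: fast_interval_weight(acc, be[0], be[1]))


# Toy distance matrix for three windows (made-up values).
d = [[0.0, 1.0, 5.0],
     [1.0, 0.0, 4.0],
     [5.0, 4.0, 0.0]]
acc = cumulative_distances(d)
corner = block_sum(acc, 0, 2, 2, 3)   # d[0][2] + d[1][2]
best = best_interval(d)
```

Building the cumulative table costs O(w²) once, after which each of the O(w²) candidate intervals is scored in constant time.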

In the above technical solution, in step 70), the interval with the largest weight is selected as the best text interval.

In the above technical solution, in step 70), the intervals whose weights are greater than 0 are selected from the results of step 60) in descending order of weight, a weighted average is taken over the weights of these intervals, and the best text interval is selected according to the result of the weighted average.

The text information in the web page is represented in a multi-byte character set, including Japanese, Korean, and Chinese.

The advantages of the invention are:

1. The invention uses the frequency-domain features of a web page to segment the page and filter out noise, and thereby extracts the useful information.

2. When the main text is long, the method extracts the text of the page effectively and separates it from the other parts of the page with high accuracy, even if the page structure is complex and contains many kinds of interfering information.

3. The invention extracts the text content of a page without analyzing the specific page structure, so it is general and applies to pages of different styles and topics.

Brief Description of the Drawings

Fig. 1 is a flow chart of the web page text extraction method based on the fast Fourier transform according to the invention;

Fig. 2a and Fig. 2b are schematic diagrams of the text strength function used for text strength encoding in the invention;

Fig. 3 is a schematic diagram of the acceleration algorithm that uses cumulative distances to quickly compute the sum of distances over consecutive intervals when interval weights are calculated.

Detailed Description

The invention is further described below with reference to the drawings and specific embodiments.

Before the web page text extraction method based on the fast Fourier transform is described, web pages are first classified by page structure into the following categories:

Home page type: the home page of a website, which generally contains several columns, pictures, animations, and links to a number of article titles. Example: the NetEase home page.

List type: the information is given as a list, generally several entries laid out as a table, often with pagination. Example: the list of article titles on a forum board.

Body-text type: a bottom-level page that contains main text, generally no more than one article, with no comments or only a few. Example: the bottom-level pages of all kinds of websites that carry one specific article.

Comment type: in addition to the main text, the page contains a number of comments after it; forums are typical.

The invention mainly targets content extraction from the body-text type Chinese pages described above. Such pages usually contain a large block of text information, preceded and followed by formatting information (for example navigation information, interactive elements, and JavaScript).

Text information has the following characteristics:

1. It is located in the middle of the HTML source file;

2. It consists mainly of Chinese characters and English letters;

3. It is fairly continuous text;

4. The signal characteristics of different pieces of text information are similar;

5. The signal characteristics of text information differ from those of formatting information.

Formatting information has the following characteristics:

1. It is located at the beginning and end of the HTML source file;

2. It consists mainly of punctuation marks and English letters;

3. The signal characteristics of different pieces of formatting information are similar;

4. The signal characteristics of formatting information differ from those of text information.

Analysis of the HTML document model shows that a document is a mixture of three kinds of signal:

1) HTML tags (TAG), of the form "<tag><tag attribute=value></tag>".

For example:

2) Natural-language text (TEXT), that is, sentences made up of Chinese and English characters. For example: 关于我们 ("About us") Aboutus.

3) Script code (SCRIPT). For example: function MM_findObj(n,d){var p,i,x;if(!d)}

Based on the structural features of body-text pages, the invention turns the text extraction problem into the following: given the HTML source file of a bottom-level page, find the best text interval. The concrete steps of the method are described below using an example Chinese web page:

Step 10: Read in the HTML file, convert it to Unicode, and store it in a character array. After conversion, English letters lie in 'a'~'z' and 'A'~'Z', and Chinese characters lie in 0x0100~0xFFFF. The converted characters are stored in the character array $S^0$, whose length is $s^0$.
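Step 10 can be sketched in a few lines. The patent does not say how the source encoding of the page is determined, so the `gb18030` default below is purely an assumption for illustration, and the names `decode_html` and `is_text_char` are hypothetical. The classifier mirrors the character ranges stated for step 10.

```python
def decode_html(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw page bytes into a Unicode string.

    The default encoding is an assumption for illustration only.
    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode(encoding, errors="replace")


def is_text_char(c: str) -> bool:
    """True for 'a'~'z', 'A'~'Z', and code points 0x0100~0xFFFF."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z") or (0x0100 <= ord(c) <= 0xFFFF)


# A made-up byte string standing in for a downloaded page.
raw = "<title>丽江</title>".encode("gb18030")
chars = list(decode_html(raw))               # the character array
flags = [is_text_char(c) for c in chars]     # letters and CJK -> True
```

Digits, angle brackets, and other ASCII punctuation fall outside the three ranges, so they are treated as non-text characters in the later strength encoding.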

Suppose a page about Shangri-La, Yunnan, from the NetEase travel channel is read in. After the page is converted to Unicode, the result is as follows (the original is long, so only excerpts are shown in the example below):

“<!DOCTYPE html PUBLIC"-//W3C//DTD XHTML 1.0 Transitional//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<title>为喇嘛做广东菜_芒果网易旅游</title> [title: "Cooking Cantonese Food for the Lamas", Mango NetEase Travel]

...

<!--page-->

<!--

<span><a href="">上一页</a></span><span class="fB"><a href="">1</a></span><span class="fB"><a href="">2</a></span><span class="fBcDRed">3</span><span><a href="">下一页</a></span> [上一页: "previous page"; 下一页: "next page"]

</div>

-->

------------------------------------------------------------

The above is formatting information

7.14丽江古城的人流就像广州的上下九步行街一样，全是游人，毫不夸张。只在清晨的石板街才见为数不多的纳西族老人，女族人更能坚守古老的信念，才披着七星坎肩。当然也能看见看着挺专业敬业的哥们端着长枪短炮的设备轰炸丽江的早晨渺渺美景。

[English: 7.14 The crowds in Lijiang Old Town are just like those on Guangzhou's Shangxiajiu pedestrian street: tourists everywhere, no exaggeration. Only on the flagstone streets in the early morning do you see the few elderly Naxi; the women hold more firmly to the old beliefs and still wear the seven-star cape. Of course, you can also see rather professional, dedicated fellows with their long and short lenses bombarding Lijiang's misty morning scenery.]

<br>&nbsp;&nbsp;&nbsp;&nbsp;虽然小桥流水依旧，夜色大红灯笼也很诱惑，都在寻找着一种过剩的激情。我独自听完颇有韵味的纳西古乐走在街上的时候，甚至有个很奇怪的看我一个人凑上前来，热情的介绍有摩梭族的女孩走婚和跳艳舞，问是否需要看表演，？那是个惊讶，果然开放带来一切，当然我并不相信那些女孩是摩梭族的，都找的别地女人充数而已。当然我并没有去.后来在酒吧随便坐的时候，听一些驴友说甚至还有广东的走婚团就是为了体验走婚去的，足见人们的追求各异。

[English: Although the small bridges and flowing water remain and the big red lanterns at night are alluring, everyone is looking for a kind of excess passion. When I walked down the street alone after listening to the charming Naxi ancient music, an odd man, seeing that I was alone, even came up and enthusiastically told me that Mosuo girls were offering "walking marriage" and erotic dances, and asked whether I wanted to see a show. That was a surprise; openness really does bring everything. Of course I did not believe those girls were Mosuo; they were just women from elsewhere making up the numbers. Of course I did not go. Later, sitting around in a bar, I heard some backpackers say there were even "walking marriage" tour groups from Guangdong that came just for the experience, which shows how much people's pursuits differ.]

<br>&nbsp;&nbsp;&nbsp;&nbsp;我又想起在香格里拉的7月还有青梅子，是新鲜的。喜欢酸的朋友都知道，在东部6月，青梅子就熟透了，应该是气候的原因延迟了它的季节。但香格里拉的青梅子皮已经黄了，肉却不会软，仍然结实甚至是硬得很，那个酸啊，叫喜欢酸的朋友爱死了，叫怕酸的朋友简直可以把你酸死。一点也不夸张地。买了一斤3块钱，你已经知道我是极爱酸的，竟然吃了3天。别的喇嘛吃半个就受不了，口腔膜都要酸脱落一层的。但还是怀念那种味道。

[English: I also remember that in Shangri-La there are still green plums in July, and fresh ones. Friends who like sour things know that in the east green plums are fully ripe by June; the climate must have delayed their season. But the skin of Shangri-La's green plums had already turned yellow while the flesh was not soft, still firm, even quite hard. And the sourness! Friends who love sour would adore it, and it could sour friends who fear it to death, no exaggeration at all. I bought a jin for 3 yuan; you already know I love sour things, yet it took me 3 days to eat them. The other lamas could not stand even half of one; the acid would strip a layer off the lining of the mouth. But I still miss that taste.]

------------------------------------------------------------

The above is the main text

------------------------------------------------------------

<!--page-->

<!--

</div>

-->

...

//-->

</script>

</noscript>

<!--END NNR Site Census V5.1-->

</body>

</html>

------------------------------------------------------------

The above is formatting information

------------------------------------------------------------”

After the content of the page above is converted to Unicode, it is stored in a character array.

Step 20: Divide the character array obtained in step 10 into windows. The windows are used for sampling, so that equal-length runs of characters are selected for the Fourier transform in later steps. Let the window size be l. The file held in the character array $S^0$ is cut into consecutive character segments of length l, giving w segments in total; the trailing characters that do not fill a window of length l are dropped, giving a new character array S of length s. With W denoting the two-dimensional array of windows, i the window number, and j the position within the window, the windows are calculated as follows:

$S_i = S^0_i, \quad i = 0, \ldots, s-1$

$W_{i,j} = S_{i \cdot l + j}, \quad i = 0, \ldots, w-1, \quad j = 0, \ldots, l-1$

Taking the Shangri-La page above as the example again, the character array storing the page is divided into windows. With the window size set to 32, the result is as follows:

“Window[0]: ??<!DOCTYPE html PUBLIC"-//W3C/

Window[1]: /DTD XHTML 1.0 Transitional//EN"

Window[2]: ??"http://www.w3.org/TR/xhtml1/

Window[3]: DTD/xhtml1-transitional.dtd">??<

Window[4]: html xmlns="http://www.w3.org/19

Window[5]: 99/xhtml">??<head>??<title>为喇嘛做广 [为喇嘛做广: "Cooking Can(tonese) for the lamas"]

Window[6]: 东菜_芒果网易旅游</title>??<meta http-eq [东菜_芒果网易旅游: "(Can)tonese food_Mango NetEase Travel"]

...

Window[195]: span><span><a href="">下一页</a></ [下一页: "next page"]

Window[196]: span>??</div>??-->??<div class="

Window[197]: text"id="articlebody">7.14丽江古城的 [7.14 Lijiang Old Town's]

Window[198]: 人流就像广州的上下九步行街一样，全是游人，毫不夸张。只在清晨的石 [crowds are just like those on Guangzhou's Shangxiajiu pedestrian street: tourists everywhere, no exaggeration. Only on the early-morning flag-]

Window[199]: 板街才见为数不多的纳西族老人，女族人更能坚守古老的信念，才披着七 [stone streets do you see the few elderly Naxi; the women hold more firmly to the old beliefs and still wear the seven-]

Window[200]: 星坎肩。当然也能看见看着挺专业敬业的哥们端着长枪短炮的设备轰炸丽 [star cape. Of course, you can also see rather professional, dedicated fellows with long and short lenses bombarding Li-]

Window[201]: 江的早晨渺渺美景。??<br>&nbsp;&nbsp;&nbsp [jiang's misty morning scenery.]

Window[202]: ;&nbsp;虽然小桥流水依旧，夜色大红灯笼也很诱惑，都在寻找着 [Although the small bridges and flowing water remain and the big red lanterns at night are alluring, everyone is looking for]

Window[203]: 一种过剩的激情。我独自听完颇有韵味的纳西古乐走在街上的时候，甚至 [a kind of excess passion. When I walked down the street alone after listening to the charming Naxi ancient music, there was even]

Window[204]: 有个很奇怪的看我一个人凑上前来，热情的介绍有摩梭族的女孩走婚和跳 [an odd man who, seeing I was alone, came up and enthusiastically told me that Mosuo girls were offering "walking marriage" and]

Window[205]: 艳舞，问是否需要看表演，？那是个惊讶，果然开放带来一切，当然我并 [erotic dances, and asked whether I wanted to see a show. That was a surprise; openness really does bring everything. Of course I did]

Window[206]: 不相信那些女孩是摩梭族的，都找的别地女人充数而已。当然我并没有去 [not believe those girls were Mosuo; they were just women from elsewhere making up the numbers. Of course I did not go]

Window[207]: .后来在酒吧随便坐的时候，听一些驴友说甚至还有广东的走婚团就是为 [Later, sitting around in a bar, I heard some backpackers say there were even "walking marriage" tour groups from Guangdong that came just]

Window[208]: 了体验走婚去的，足见人们的追求各异。??<br>&nbsp;&n [for the experience, which shows how much people's pursuits differ.]

...

Window[599]: rite(_rsCL);??//-->??</script>??

Window[600]: <noscript>??<img src="//secure-c

Window[601]: n.imrworldwide.com/cgi-bin/m?ci=

Window[602]: cn-netease&amp;cg=0"alt="">??</

Window[603]: noscript>??<!--END NNR Site Cen

Window[604]: sus V5.1-->??</body>??</html>??”

As the example shows, the page is divided into 605 segments.
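The window segmentation of step 20 can be sketched as a simple slicing operation. The function name `split_into_windows` is hypothetical, and the short sample string is made up for illustration; the patent's example uses a window size of 32.

```python
from typing import List


def split_into_windows(text: str, window_len: int) -> List[str]:
    """Cut text into consecutive windows of window_len characters.

    Trailing characters that do not fill a whole window are dropped,
    so window i holds text[i * window_len : (i + 1) * window_len].
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    w = len(text) // window_len   # number of complete windows
    return [text[i * window_len:(i + 1) * window_len] for i in range(w)]


# A 36-character stand-in for a page, cut into windows of 8 characters.
sample = "<html><body>abcdefghij</body></html>"
windows = split_into_windows(sample, 8)
```

Here the last 4 characters are discarded because they do not fill a complete window.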

Step 30: Convert the characters into strength codes using statistical principles. In this step, for every character that appears in the document, the pattern of its occurrences across the whole document is analyzed: a statistical analysis of the positions at which the character occurs yields the mean and standard deviation of those positions and the number of times the character occurs in the document. The text strength value is calculated from this mean, standard deviation, and count. The character segments produced in step 20 serve as the unit of calculation. For one character segment, encoding its characters yields a strength value sequence I, calculated as follows:

$I_{i,j} = M(W_{i,j},\ i \cdot l + j) = M(S_{i \cdot l + j},\ i \cdot l + j), \quad i = 0, \ldots, w-1, \quad j = 0, \ldots, l-1$

where M calculates the strength value of one character. For a character c occurring at position x, the text strength value is:

$M(c, x) = \begin{cases} N_c \cdot \exp\left(-\left(\frac{x - \mu_c}{\sigma_c}\right)^2\right), & c \in \text{'a'} \sim \text{'z'},\ \text{'A'} \sim \text{'Z'},\ \text{0x0100} \sim \text{0xFFFF} \\ N_c \cdot \left(\exp\left(-\left(\frac{x - \mu_c}{\sigma_c}\right)^2\right) - 1\right), & \text{otherwise} \end{cases}$

The main text contains more Chinese and English characters and fewer punctuation marks. In the formula above, therefore, if the character is a text character (that is, 'a'~'z', 'A'~'Z', or 0x0100~0xFFFF), the normal-distribution formula is used as its encoding function and the result is non-negative; for all other characters, the normal-distribution formula plus an offset is used and the result is non-positive. The formula is thus positive for Chinese and English characters and negative for punctuation marks. Because the main text lies in the middle of the document while punctuation marks lie at both ends, the formula concentrates the signal strength of Chinese and English characters in the middle of the document and spreads that of punctuation marks toward the two ends. The text strength function is shown in Figure 2. As the figure shows, frequently occurring Chinese and English characters usually reflect common words, so their signal strength increases proportionally; frequently occurring punctuation marks, such as the less-than and greater-than signs, usually reflect layout formatting, so their signal strength increases proportionally in the negative direction.
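The strength encoding can be sketched as follows: gather the count, mean position, and standard deviation of positions for each character, then evaluate N_c * exp(-((x - mu_c) / sigma_c)^2) for text characters and the same expression with 1 subtracted inside the bracket for all others. Two points are assumptions not settled by the text: the population standard deviation is used, and a character whose positions have zero deviation (for example one that occurs once) is given a normalized offset of 0. All function names are hypothetical.

```python
import math
from collections import defaultdict


def is_text_char(c):
    """True for 'a'~'z', 'A'~'Z', and code points 0x0100~0xFFFF."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z") or (0x0100 <= ord(c) <= 0xFFFF)


def char_stats(text):
    """Map each character to (count, mean position, std of positions)."""
    positions = defaultdict(list)
    for x, c in enumerate(text):
        positions[c].append(x)
    stats = {}
    for c, xs in positions.items():
        n = len(xs)
        mu = sum(xs) / n
        # Population standard deviation (an assumption).
        sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
        stats[c] = (n, mu, sigma)
    return stats


def strength(c, x, stats):
    """Text strength M(c, x) of character c at position x."""
    n, mu, sigma = stats[c]
    # Assumption: with zero deviation, use a normalized offset of 0.
    z = (x - mu) / sigma if sigma > 0 else 0.0
    g = math.exp(-z * z)
    return n * g if is_text_char(c) else n * (g - 1.0)


def intensity_sequences(text, window_len):
    """One strength value sequence per complete window of the text."""
    stats = char_stats(text)
    w = len(text) // window_len
    return [[strength(text[i * window_len + j], i * window_len + j, stats)
             for j in range(window_len)]
            for i in range(w)]


demo = "<<ab文ab>>"                     # made-up 9-character document
rows = intensity_sequences(demo, 3)     # three windows of three values
```

In this toy document the angle brackets produce negative values and the letters and the Chinese character produce positive ones, matching the sign behaviour described above.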

Applying the same operation to every character segment yields a strength value sequence for each of them.

Step 40: Apply a fast Fourier transform to the strength value sequence of each window character segment obtained in step 30 to obtain a vector F in the frequency domain. The formula is:

$F_i = \mathrm{FFT}(I_i)$

The implementation of the fast Fourier transform is mature prior art and is not described in detail in this embodiment.
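The patent leaves the choice of FFT implementation open. For illustration only, the sketch below uses the textbook recursive radix-2 Cooley-Tukey algorithm, which requires the window length to be a power of two (as the example window size of 32 is). The function name `fft` is hypothetical, and any standard FFT routine could be substituted.

```python
import cmath


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])   # transform of even-indexed samples
    odd = fft(values[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A unit impulse has a flat spectrum: every frequency bin equals 1.
spectrum = fft([1.0, 0.0, 0.0, 0.0])
```

Each window's strength value sequence would be passed through such a transform to give one complex vector per window.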

Step 50: Calculate the distance between any two character segments. The distance between two character segments is the sum of the Euclidean distances at each frequency. The calculation formula is as follows:

$D_{i,j} = \mathrm{distance}(F_i, F_j) = \sum_{k=0}^{l-1} \sqrt{\lVert F_{i,k} - F_{j,k} \rVert^{2}}$

As the formula shows, calculating the distance between any two segments amounts to taking the difference between the two values at each corresponding frequency position of the two segments and then summing over all positions. For example, the distance between window A and window B is built from the difference between a0 and b0, ..., and the difference between a31 and b31; taking the square root of the sum of squares of these differences gives the sum of the Euclidean distances.
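A minimal Python sketch of step 50 follows. It reads the distance as the sum, over frequencies, of the modulus of the complex difference between the two F vectors, which is how the formula in claim 4 is written; this reading and the function names are assumptions.

```python
def segment_distance(f_a, f_b):
    """Sum over frequencies of |F_a[k] - F_b[k]| for two equal-length F vectors."""
    if len(f_a) != len(f_b):
        raise ValueError("F vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(f_a, f_b))


def distance_matrix(f_vectors):
    """Build D[i][j] for every pair of window character segments."""
    w = len(f_vectors)
    return [[segment_distance(f_vectors[i], f_vectors[j]) for j in range(w)]
            for i in range(w)]
```

The resulting matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed when performance matters.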

Step 60: Set intervals over the character segments and calculate the weight of each interval. An interval is a combination of several consecutive windows, represented by a number pair $(b, e)$, which indicates that the interval consists of windows $W_b$ to $W_{e-1}$, where $0 \le b < e \le w$. Once an interval is set, all window segments in the file fall into two groups: the group inside the interval and the group outside it. The inside group A contains $W_b$ to $W_{e-1}$; the outside group B contains $W_0$ to $W_{b-1}$ and $W_e$ to $W_{w-1}$. The full sequence of windows therefore consists of the front part of group B, $\{W_0, W_1, \ldots, W_{b-1}\}$, then group A, $\{W_b, W_{b+1}, \ldots, W_{e-1}\}$, then the rear part of group B, $\{W_e, W_{e+1}, \ldots, W_{w-1}\}$.

The weight of an interval is the sum of the inter-group differences minus the sum of the intra-group differences. The inter-group difference is obtained by taking the difference between any segment of the inside group A and any segment of the outside group B and summing all such differences. The intra-group difference is obtained by taking, within group A and within group B separately, the difference between any two segments of the same group and summing all such differences. The interval weight is calculated as follows:

$V(b, e) = \mathrm{InterGroup}(b, e) - \mathrm{IntraGroup}(b, e)$

$\mathrm{InterGroup}(b, e) = \sum_{\mathrm{Group}(i) \ne \mathrm{Group}(j)} D_{i,j}$

$\mathrm{IntraGroup}(b, e) = \sum_{\mathrm{Group}(i) = \mathrm{Group}(j)} D_{i,j}$
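The weight definition can be checked with a direct, unoptimised Python sketch. It assumes the sums run over ordered pairs (i, j) of windows; counting unordered pairs instead would halve every weight without changing the ranking. The function name is hypothetical.

```python
def interval_weight(dist, b, e):
    """V(b, e) = inter-group sum minus intra-group sum, by direct enumeration."""
    w = len(dist)
    if not 0 <= b < e <= w:
        raise ValueError("require 0 <= b < e <= w")

    def inside(i):
        return b <= i < e  # window i belongs to the inside group A

    inter = 0.0
    intra = 0.0
    for i in range(w):
        for j in range(w):
            if inside(i) != inside(j):
                inter += dist[i][j]   # one window in A, the other in B
            else:
                intra += dist[i][j]   # both windows in the same group
    return inter - intra
```

For example, with three windows where window 1 is far from windows 0 and 2, the interval (1, 2) that isolates window 1 receives the largest weight. Each call costs O(w²), which motivates a faster scheme when all intervals must be scored.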

In this step, a preferred way to calculate the interval weights is an accelerated algorithm based on cumulative distances, which quickly computes the sum of the differences between two groups of consecutive segments. As shown in Figure 3, the algorithm is calculated as follows:

$\bar{D}_{i,j} = \sum_{x=0}^{i-1} \sum_{y=0}^{j-1} D_{x,y}, \quad i = 1 \ldots w,\ j = 1 \ldots w$

$\sum_{i=a}^{b-1} \sum_{j=c}^{d-1} D_{i,j} = \bar{D}_{b,d} - \bar{D}_{a,d} - \bar{D}_{b,c} + \bar{D}_{a,c}$

Here $D_{x,y}$ is the distance between segment x and segment y, and $\bar{D}_{i,j}$ is the cumulative distance between window character segments 0, 1, ..., (i-1) and window character segments 0, 1, ..., (j-1). These formulas speed up the calculation of the inter-group and intra-group differences: the cumulative value table is computed first, and the inter-group and intra-group differences are then obtained quickly by table lookups and simple algebra. The values $\bar{D}_{i,j}$, $i = 1 \ldots w$, $j = 1 \ldots w$, form the cumulative value table.
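The cumulative value table is a two-dimensional prefix sum, and a Python sketch of it follows. The table is padded with a zero row and column (index 0) so that the four-term lookup also works when a or c is 0; that padding, the use of symmetry of D in `interval_weight_fast`, and all function names are assumptions of this sketch.

```python
def cumulative_table(dist):
    """table[i][j] = sum of dist[x][y] for x < i and y < j (row and column 0 are zero)."""
    w = len(dist)
    table = [[0.0] * (w + 1) for _ in range(w + 1)]
    for i in range(1, w + 1):
        for j in range(1, w + 1):
            table[i][j] = (dist[i - 1][j - 1] + table[i - 1][j]
                           + table[i][j - 1] - table[i - 1][j - 1])
    return table


def block_sum(table, a, b, c, d):
    """Sum of dist[i][j] for a <= i < b and c <= j < d, using four lookups."""
    return table[b][d] - table[a][d] - table[b][c] + table[a][c]


def interval_weight_fast(table, b, e):
    """Interval weight in O(1) per interval, assuming a symmetric distance matrix."""
    w = len(table) - 1
    total = block_sum(table, 0, w, 0, w)
    a_rows = block_sum(table, b, e, 0, w)   # group A against every window
    a_to_a = block_sum(table, b, e, b, e)   # group A against itself
    inter = 2.0 * (a_rows - a_to_a)         # A against B plus B against A
    intra = total - inter
    return inter - intra
```

Building the table costs O(w²) once, after which each of the roughly w²/2 candidate intervals is scored with a constant number of lookups.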

Step 70: Sort the weights of all intervals calculated in step 60; the interval with the largest weight is the best text interval. Because the intervals set in step 60 cover every possible combination of consecutive windows, step 60 produces the weights of many intervals. These weights are sorted in descending order, and the interval with the largest weight is selected as the best text interval. The content contained in the best text interval is the text that the present invention ultimately extracts from the web page.

For the aforementioned web page about Shangri-La, the results of the weight calculation in step 60 give a maximum weight of 1.8671557984059033E9, and the interval with that weight has b = 197 and e = 395. This interval is the best text interval sought.
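Step 70 can be sketched in Python as an enumeration of every pair (b, e) with 0 ≤ b < e ≤ w followed by a descending sort. The `weight` callable stands in for whatever computes V(b, e); it and the function name are hypothetical.

```python
def rank_intervals(weight, w):
    """Return [((b, e), V(b, e)), ...] for all intervals, largest weight first."""
    scored = [((b, e), weight(b, e))
              for b in range(w)
              for e in range(b + 1, w + 1)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy weight function that peaks at the interval (2, 5).
    toy = lambda b, e: -((b - 2) ** 2) - (e - 5) ** 2
    best_interval, best_weight = rank_intervals(toy, 6)[0]
    print(best_interval, best_weight)
```

When only the single best interval is needed, `max` over the same enumeration avoids the sort; the full ranking is useful for the weighted-average variant.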

In one embodiment, another way to sort the interval weights and select the best text interval is to take a weighted average, which gives the best text interval in the average sense. In practice, the weighted average is usually taken over the intervals whose weight is greater than 0, yielding the best text interval in the average sense, $(b^*, e^*)$. The weighted average is calculated as follows:

$(b^*, e^*) = \frac{\sum_{V(b,e) > 0} V(b,e) \cdot (b,e)}{\sum_{V(b,e) > 0} V(b,e)}$

where $V(b, e)$ denotes the interval weight.
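The weighted average can be sketched in Python as below. The input format, a list of ((b, e), weight) pairs, and the function name are assumptions; the numbers in the usage lines are made up for illustration and are not taken from the experiment.

```python
def weighted_average_interval(scored):
    """Average (b, e) over intervals with positive weight, weighted by V(b, e)."""
    positive = [(b, e, v) for (b, e), v in scored if v > 0]
    total = sum(v for _, _, v in positive)
    if total == 0:
        raise ValueError("no interval has a positive weight")
    b_star = sum(b * v for b, _, v in positive) / total
    e_star = sum(e * v for _, e, v in positive) / total
    return b_star, e_star


if __name__ == "__main__":
    sample = [((2, 6), 3.0), ((4, 8), 1.0), ((0, 10), -5.0)]
    print(weighted_average_interval(sample))
```

The result is generally fractional, so an implementation has to decide how to round it back to window indices; the description does not say which rounding is used.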

Taking the aforementioned web page about Shangri-La as an example again, suppose the weight calculation in step 60 yields 100 intervals with a weight greater than 0. These weights and their intervals are as follows:

No.1: Area{b=197, e=395, w=1.8671557984059033E9}

No.2: Area{b=198, e=395, w=1.865928902944519E9}

No.3: Area{b=197, e=394, w=1.863446434026815E9}

No.4: Area{b=198, e=394, w=1.8620946999597936E9}

No.5: Area{b=197, e=396, w=1.8534012640629482E9}

No.6: Area{b=196, e=395, w=1.8533969765727189E9}

No.7: Area{b=198, e=396, w=1.852261927708008E9}

No.8: Area{b=199, e=395, w=1.8511999688045855E9}

No.9: Area{b=197, e=393, w=1.8500594430878716E9}

No.10: Area{b=196, e=394, w=1.849788102344682E9}

No.11: Area{b=198, e=393, w=1.848510799009436E9}

No.12: Area{b=199, e=394, w=1.8471652124879038E9}

No.13: Area{b=197, e=397, w=1.8453086053177962E9}

No.14: Area{b=195, e=395, w=1.845281305908179E9}

No.15: Area{b=198, e=397, w=1.8442583536949947E9}

No.16: Area{b=195, e=394, w=1.8417764283302329E9}

No.17: Area{b=197, e=392, w=1.8413777475416255E9}

No.18: Area{b=198, e=392, w=1.8396801709467006E9}

No.19: Area{b=196, e=396, w=1.8396421919565065E9}

No.20: Area{b=200, e=395, w=1.838057893744711E9}

No.21: Area{b=199, e=396, w=1.8377040837184753E9}

No.22: Area{b=196, e=393, w=1.8365645973901665E9}

No.23: Area{b=200, e=394, w=1.8338399474557528E9}

No.24: Area{b=199, e=393, w=1.8333431983722968E9}

No.25: Area{b=201, e=395, w=1.832882136920093E9}

No.26: Area{b=194, e=395, w=1.8327158264980187E9}

No.27: Area{b=197, e=398, w=1.8317380757017245E9}

No.28: Area{b=196, e=397, w=1.8315166911690896E9}

No.29: Area{b=195, e=396, w=1.8314938196044166E9}

No.30: Area{b=198, e=398, w=1.8307755060003867E9}

No.31: Area{b=202, e=395, w=1.830544198380903E9}

No.32: Area{b=199, e=397, w=1.829861304277684E9}

No.33: Area{b=194, e=394, w=1.829311678505044E9}

No.34: Area{b=195, e=393, w=1.828719245958915E9}

No.35: Area{b=201, e=394, w=1.8285160965672174E9}

No.36: Area{b=196, e=392, w=1.828012158947821E9}

No.37: Area{b=197, e=391, w=1.8270460014801817E9}

No.38: Area{b=202, e=394, w=1.8260582076294603E9}

No.39: Area{b=198, e=391, w=1.8251564260435276E9}

No.40: Area{b=200, e=396, w=1.824723891564548E9}

No.41: Area{b=199, e=392, w=1.8243151092166026E9}

No.42: Area{b=195, e=397, w=1.8233264390587733E9}

No.43: Area{b=203, e=395, w=1.822325780416904E9}

No.44: Area{b=195, e=392, w=1.8202939671958587E9}

No.45: Area{b=200, e=393, w=1.8198227669199252E9}

No.46: Area{b=201, e=396, w=1.8196575937589269E9}

No.47: Area{b=193, e=395, w=1.8191558800920327E9}

No.48: Area{b=194, e=396, w=1.8189200336928308E9}

No.49: Area{b=197, e=399, w=1.8179459850346885E9}

No.50: Area{b=196, e=398, w=1.8179439755481179E9}

No.51: Area{b=203, e=394, w=1.8176838100122943E9}

No.52: Area{b=202, e=396, w=1.8174102756958842E9}

No.53: Area{b=198, e=399, w=1.817070992891399E9}

No.54: Area{b=200, e=397, w=1.8170506741580334E9}

No.55: Area{b=199, e=398, w=1.8165496617398362E9}

No.56: Area{b=194, e=393, w=1.8164182449130914E9}

No.57: Area{b=193, e=394, w=1.8158518234459796E9}

No.58: Area{b=201, e=393, w=1.8143022038707862E9}

No.59: Area{b=196, e=391, w=1.8138511011079237E9}

No.60: Area{b=197, e=390, w=1.813416825235355E9}

No.61: Area{b=201, e=397, w=1.812101903347275E9}

No.62: Area{b=202, e=393, w=1.8116598519465666E9}

No.63: Area{b=198, e=390, w=1.8113552225214372E9}

No.64: Area{b=194, e=397, w=1.810719247254324E9}

No.65: Area{b=200, e=392, w=1.8106092331069574E9}

No.66: Area{b=202, e=397, w=1.8099494719208207E9}

No.67: Area{b=195, e=398, w=1.809720873865331E9}

No.68: Area{b=199, e=391, w=1.8095815493579323E9}

No.69: Area{b=203, e=396, w=1.8093194340361586E9}

No.70: Area{b=204, e=395, w=1.8091673410619712E9}

No.71: Area{b=194, e=392, w=1.8081203284794781E9}

No.72: Area{b=195, e=391, w=1.8062889464577138E9}

No.73: Area{b=192, e=395, w=1.8055887898178735E9}

No.74: Area{b=193, e=396, w=1.8053577759523911E9}

No.75: Area{b=201, e=392, w=1.8049212023955352E9}

No.76: Area{b=197, e=400, w=1.804362583413403E9}

No.77: Area{b=204, e=394, w=1.8043406024255657E9}

No.78: Area{b=196, e=399, w=1.8041515829944117E9}

No.79: Area{b=200, e=398, w=1.8039011525318637E9}

No.80: Area{b=198, e=400, w=1.8035751666398578E9}

No.81: Area{b=193, e=393, w=1.80312176475147E9}

No.82: Area{b=203, e=393, w=1.8030793314788742E9}

No.83: Area{b=199, e=399, w=1.8030163410122762E9}

No.84: Area{b=192, e=394, w=1.8023851986898751E9}

No.85: Area{b=202, e=392, w=1.8021209078151228E9}

No.86: Area{b=203, e=397, w=1.8019899976293116E9}

No.87: Area{b=196, e=390, w=1.8003818327393115E9}

No.88: Area{b=201, e=398, w=1.799061835030309E9}

No.89: Area{b=191, e=395, w=1.797390318129374E9}

No.90: Area{b=193, e=397, w=1.7971241276820748E9}

No.91: Area{b=194, e=398, w=1.797104678286477E9}

No.92: Area{b=202, e=398, w=1.797000014978798E9}

No.93: Area{b=204, e=396, w=1.796316784871037E9}

No.94: Area{b=195, e=399, w=1.7958957929261835E9}

No.95: Area{b=200, e=391, w=1.7956939769691014E9}

No.96: Area{b=199, e=390, w=1.7955746426529288E9}

No.97: Area{b=205, e=395, w=1.7951057911539783E9}

No.98: Area{b=193, e=392, w=1.7949530569627554E9}

No.99: Area{b=194, e=391, w=1.7942824448319867E9}

No.100: Area{b=191, e=394, w=1.79426301425113E9}

According to the formula above, the weighted average of these weights gives begin = 182.3652086633145 and end = 404.76999807248177, from which the corresponding best text interval is obtained.

The method of the present invention achieves good practical results:

In one example, "text-style" pages were randomly selected for experiment from five websites: NetEase Travel (http://ok.travel.163.com/itinerar/list.jsp), e-Travel World (http://www.eyooworld.com/index.html), Hongxiu Tianxiang (http://www.hongxiu.com/), Shuimu Forum (www.newsmth.net), and Keyuan Xingkong Forum (www.kyxk.net). Fifty pages were selected from each site, 250 pages in total.

The positions where the body text begins and ends in the source code were determined by manual inspection; this correct text interval is denoted $(B, E)$. The interval with the largest weight given by the program, that is, the best text interval, is denoted $(b_1, e_1)$. The interval obtained by weighted averaging, that is, the best text interval in the average sense, is denoted $(b^*, e^*)$. With the total number of segments in the processed HTML source denoted $w$, the accuracy $R$ of the best text interval found by the weight method and the accuracy $R^*$ of the best text interval found by the weighted average method are:

$R = 1 - \frac{|b_1 - B| + |e_1 - E|}{2w}, \qquad R^* = 1 - \frac{|b^* - B| + |e^* - E|}{2w}$
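The accuracy measure is simple enough to state as a short Python function. The same function serves for both R and R*, depending on which interval is passed in; the function name and the numbers in the usage lines are hypothetical and are not taken from the experiment.

```python
def interval_accuracy(found, truth, w):
    """1 - (|b - B| + |e - E|) / (2w) for a found interval (b, e) and true interval (B, E)."""
    (b, e), (big_b, big_e) = found, truth
    return 1 - (abs(b - big_b) + abs(e - big_e)) / (2 * w)


if __name__ == "__main__":
    # A found interval two windows late at the start and five early at the end,
    # in a page of 500 windows.
    print(interval_accuracy((197, 395), (195, 400), 500))
```

The measure equals 1 when both boundaries match exactly and decreases linearly with the total boundary error, normalised by twice the number of windows.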

Table 1 below gives the accuracy of text interval extraction for the pages above.

| | NetEase Travel | e-Travel World | Hongxiu Tianxiang | Shuimu Forum | Keyuan Xingkong Forum |
|---|---|---|---|---|---|
| Mean $R$ | 0.988153706 | 0.913867141 | 0.985827381 | 0.968767584 | 0.972534604 |
| Mean $R^*$ | 0.944079847 | 0.882985277 | 0.91369307 | 0.958057645 | 0.929748895 |

Table 1

The experimental results show that the algorithm extracts body text accurately from web pages of different structures. The mean $R$ of every site is above 90%, and the mean $R$ across the five websites is about 96.583%. Four of the five sites have a mean $R^*$ above 90%, and the mean across the five websites is about 91.957%.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims

1. A webpage text extraction method based on fast Fourier transform, comprising the following steps:

Step 10), reading in an HTML file, converting the file to Unicode format, and storing it in a character array;

Step 20), performing window segmentation on the character array obtained in step 10), each resulting window character segment containing a fixed number of characters;

Step 30), statistically analyzing the positions of characters in the document and, according to the statistical results, performing intensity encoding conversion on the characters to obtain the text intensity value of each character, each window character segment corresponding to one intensity value sequence;

Step 40), performing a fast Fourier transform on the intensity value sequence of each window character segment obtained in step 30) to obtain an F vector in the frequency domain;

Step 50), calculating the distance between any two window character segments from the results of the fast Fourier transform;

Step 60), setting intervals over the window character segments, each interval being a combination of several consecutive windows and represented by a number pair (b, e), and calculating the weight of each interval from the distances between any two window character segments obtained in step 50);

Step 70), sorting the weights of all intervals calculated in step 60) and selecting the best text interval according to the weights.

2. The webpage text extraction method based on fast Fourier transform according to claim 1, characterized in that, in step 30), the statistical results comprise the mean and the standard deviation of the positions at which a character occurs, and the number of times the character occurs in the document.

3. The webpage text extraction method based on fast Fourier transform according to claim 2, characterized in that the intensity value sequence is calculated as follows:

$I_{i,j} = M(W_{i,j},\ i \cdot l + j) = M(S_{i \cdot l + j},\ i \cdot l + j), \quad i = 0 \ldots (w-1),\ j = 0 \ldots (l-1)$;

wherein M calculates the intensity value of one character, W denotes the two-dimensional array of window character segments, S denotes the character string array, i denotes the index of a window character segment, j denotes the position within the window character segment, l denotes the length of a window character segment, and w denotes the number of window character segments;

when calculating M, for a character c occurring at position x, its text intensity value is:

In the above formula, $\mu_c$ is the mean of the positions at which character c occurs, $\sigma_c$ is the standard deviation of those positions, and $N_c$ is the number of times character c occurs.

4. The webpage text extraction method based on fast Fourier transform according to claim 1, characterized in that, in step 50), calculating the distance between any two segments means calculating the sum of the Euclidean distances at each frequency, by the following formula:

$D_{i,j} = \mathrm{distance}(F_i, F_j) = \sum_{k=0}^{l-1} \sqrt{\lVert F_{i,k} - F_{j,k} \rVert^{2}}$

wherein F is the result of the fast Fourier transform in step 40).

5. The webpage text extraction method based on fast Fourier transform according to claim 4, characterized in that, in step 60), the interval weight is calculated by subtracting the sum of the intra-group differences from the sum of the inter-group differences, by the following formulas:

$V(b, e) = \mathrm{InterGroup}(b, e) - \mathrm{IntraGroup}(b, e)$

$\mathrm{InterGroup}(b, e) = \sum_{\mathrm{Group}(i) \ne \mathrm{Group}(j)} D_{i,j}$

$\mathrm{IntraGroup}(b, e) = \sum_{\mathrm{Group}(i) = \mathrm{Group}(j)} D_{i,j}$

wherein InterGroup denotes the inter-group difference, IntraGroup denotes the intra-group difference, and $D_{i,j}$ denotes the distance between any two window character segments calculated in step 50).

6. The webpage text extraction method based on fast Fourier transform according to claim 5, characterized in that, in step 60), the weight of each interval is calculated with an accelerated algorithm based on cumulative distances, by the following formulas:

$\bar{D}_{i,j} = \sum_{x=0}^{i-1} \sum_{y=0}^{j-1} D_{x,y}, \quad i = 1 \ldots w,\ j = 1 \ldots w$

$\sum_{i=a}^{b-1} \sum_{j=c}^{d-1} D_{i,j} = \bar{D}_{b,d} - \bar{D}_{a,d} - \bar{D}_{b,c} + \bar{D}_{a,c}$

wherein $D_{x,y}$ denotes the distance between segment x and segment y, and $\bar{D}_{i,j}$ denotes the cumulative distance between window character segments 0, 1, ..., (i-1) and window character segments 0, 1, ..., (j-1).

7. The webpage text extraction method based on fast Fourier transform according to claim 1, characterized in that, in step 70), the interval with the largest weight is selected as the best text interval.

8. The webpage text extraction method based on fast Fourier transform according to claim 1, characterized in that, in step 70), the intervals whose weight is greater than 0 are selected from the results of step 60) in descending order of weight, a weighted average is taken using the weights of these intervals, and the best text interval is selected according to the result of the weighted average.

9. The webpage text extraction method based on fast Fourier transform according to claim 1, characterized in that the text information in the webpage is represented in a multi-byte character set, including Japanese, Korean, and Chinese.