CN101237465A - A Webpage Text Extraction Method Based on Fast Fourier Transform - Google Patents
- Publication number
- CN101237465A CNA2007100631827A CN200710063182A
- Authority
- CN
- China
- Prior art keywords
- character
- window
- interval
- fourier transform
- fast fourier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a webpage text extraction method based on the fast Fourier transform, comprising: reading in an HTML file, converting it to Unicode, and storing it in a character array; dividing the character array into windows; statistically analyzing the positions of characters in the document and, based on the result, converting each character into an intensity code to obtain its text strength value, so that each window character segment corresponds to a sequence of intensity values; applying the fast Fourier transform to each intensity value sequence to obtain an F vector in the frequency domain; calculating the distance between any two window character segments; defining intervals over the window character segments, each interval being a combination of several consecutive windows represented by a number pair (b, e), and calculating the weight of each interval from the distances between window character segments; and sorting the weights of all intervals and selecting the best text interval according to the weights. The invention extracts webpage text with high accuracy and can effectively distinguish the text from the other parts of the webpage.
Description
Technical field
The invention relates to text information processing, and in particular to a webpage text extraction method based on the fast Fourier transform.
Background art
With the continuing development of the Internet and the sharp increase in the number of Web pages, the Web has become a huge, widely distributed source of information. A great deal of information is contained in this vast body of pages, and helping people quickly extract the useful part has become a very important problem.
Given the characteristics of HTML pages, a page is usually divided into regions using its structural layout information and parsed by simulating the way the IE browser displays it. Following the principles of human vision, the system divides the parsed page into blocks and then extracts the content of the blocks that the user needs. Webpage segmentation is therefore a common means of extracting useful information from webpages. The segmentation methods in common use today are mainly the following:
1. Segmentation based on positional relationships: this method divides a page into blocks according to its layout, splitting it into five parts (top, bottom, left, right, and middle) and then classifying according to the features of these five parts. Real page structures are far more complicated, however, so this layout-based method does not apply to all webpages. The segmentation is also coarse-grained, which may destroy the intrinsic features of the page and makes it hard to capture the semantic features of the page as a whole.
2. Segmentation based on the Document Object Model (DOM): this method finds specific tags in the HTML document of the page and uses them to represent the document as a DOM tree; it then extracts the useful tree-node data according to specific tags such as heading, table, paragraph, and list. In many cases, however, the DOM is not meant to represent the content structure of a page, so this method cannot accurately identify the semantic information of each block. For further discussion of such methods see Reference 1: Wang Qi, Tang Shiwei, Yang Dongqing, "Automatic Extraction of Webpage Topic Information Based on DOM" [J], Computer Research and Development, 2004, 41(10): 1786-1791;
Reference 2: Hu Fei, "Web Page Region Division and Search Method Based on Tag Tree" [J], Computer Science, 2005, 32(8): 182-185; Reference 3: Chang Yuhong, Jiang Zhe, Zhu Xiaoyan, "Page Structure Analysis Based on Tag Tree Representation" [J], Computer Engineering and Applications, 2004(16): 129-132.
Summary of the invention
The purpose of the invention is to overcome the defect that existing text extraction methods cannot accurately define the text region and therefore cannot accurately extract the text, and thereby to provide a text extraction method based on the fast Fourier transform.
To achieve this purpose, the invention provides a webpage text extraction method based on the fast Fourier transform, comprising the following steps:
Step 10): read in an HTML file, convert it to Unicode, and store it in a character array;
Step 20): divide the character array obtained in step 10) into windows, each resulting window character segment containing a fixed number of characters;
Step 30): statistically analyze the positions of characters in the document and, according to the result, convert each character into an intensity code to obtain its text strength value, so that each window character segment corresponds to one intensity value sequence;
Step 40): apply the fast Fourier transform to the intensity value sequence of each window character segment obtained in step 30) to obtain an F vector in the frequency domain;
Step 50): calculate the distance between any two window character segments from the results of the fast Fourier transform;
Step 60): define intervals over the window character segments, each interval being a combination of several consecutive windows represented by a number pair (b, e), and calculate the weight of each interval from the distances between window character segments obtained in step 50);
Step 70): sort the weights of all intervals calculated in step 60) and select the best text interval according to the weights.
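Step 40) can be illustrated with a small, dependency-free sketch. The names `fft` and `window_spectra` are illustrative and not taken from the patent, and the sketch assumes the window length is a power of two (as with the window size of 32 used in the worked example), which the radix-2 algorithm below requires.

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])   # transform of even-indexed samples
    odd = fft(values[1::2])    # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def window_spectra(intensity_rows):
    """Return one frequency-domain F vector per window intensity sequence."""
    return [fft(row) for row in intensity_rows]
```

Each row passed to `window_spectra` would be the intensity value sequence of one window character segment, so the result has one complex vector of the same length per window.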
In the above technical solution, in step 30), the results of the statistical analysis include the mean and the standard deviation of the positions at which a character occurs, and the number of times the character occurs in the document.
The intensity value sequence is calculated as follows:
$$I_{i,j} = M(W_{i,j},\ i \cdot l + j) = M(S_{i \cdot l + j},\ i \cdot l + j), \quad i = 0, \ldots, w-1,\ j = 0, \ldots, l-1$$
where M calculates the intensity value of a single character, W denotes the two-dimensional array of window character segments, S denotes the character array, i denotes the number of the window character segment, j denotes the position within the window character segment, l denotes the length of a window character segment, and w denotes the number of window character segments.
When M is calculated for a character c appearing at position x, the text strength value is computed from the following quantities: $\mu_c$, the mean of the positions at which character c occurs; $\sigma_c$, the standard deviation of those positions; and $N_c$, the number of times character c occurs.
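The per-character statistics used by M can be gathered in one pass over the character array. The sketch below is illustrative only: the function name `position_stats` is hypothetical, and it assumes the population form of the standard deviation (dividing by the count), which the text does not specify.

```python
from collections import defaultdict
from math import sqrt


def position_stats(chars):
    """Return {char: (mean position, std deviation of positions, count)}."""
    positions = defaultdict(list)
    for index, ch in enumerate(chars):
        positions[ch].append(index)
    stats = {}
    for ch, idx in positions.items():
        count = len(idx)
        mean = sum(idx) / count
        # Assumption: population variance (divide by count, not count - 1).
        variance = sum((p - mean) ** 2 for p in idx) / count
        stats[ch] = (mean, sqrt(variance), count)
    return stats
```

For the string `"abca"`, the character `a` occurs at positions 0 and 3, so its mean position is 1.5, its standard deviation is 1.5, and its count is 2.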
In the above technical solution, in step 50), the distance between any two segments is the sum, over all frequencies, of the Euclidean distances between their F vectors, where F is the result of the fast Fourier transform in step 40).
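A minimal sketch of the distance in step 50) follows. It reads "the sum of the Euclidean distances at each frequency" as the sum of the moduli of the differences between corresponding complex coefficients; that reading, and the names `spectral_distance` and `distance_matrix`, are assumptions rather than the patent's own formula.

```python
def spectral_distance(f_a, f_b):
    """Sum over frequencies of |F_a[k] - F_b[k]| for two complex F vectors."""
    if len(f_a) != len(f_b):
        raise ValueError("F vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(f_a, f_b))


def distance_matrix(spectra):
    """Symmetric matrix of pairwise distances between all window F vectors."""
    w = len(spectra)
    dist = [[0.0] * w for _ in range(w)]
    for i in range(w):
        for j in range(i + 1, w):
            dist[i][j] = dist[j][i] = spectral_distance(spectra[i], spectra[j])
    return dist
```

Only the upper triangle is computed explicitly, since the distance between segments i and j equals the distance between j and i.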
In step 60), the weight of an interval is the sum of the between-group differences minus the sum of the within-group differences:
$$V(b,e) = \mathrm{InterGroup}(b,e) - \mathrm{IntraGroup}(b,e)$$
where InterGroup denotes the between-group difference, IntraGroup denotes the within-group difference, and $D_{i,j}$ denotes the distance between any two window character segments calculated in step 50), from which both terms are computed.
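The definitions of InterGroup and IntraGroup are not spelled out in this text, so the sketch below is one plausible reading, stated as an assumption: IntraGroup sums the distances between every pair of windows inside the interval, InterGroup sums the distances between windows inside the interval and windows outside it, and both b and e are inclusive. The function name `interval_weight` is hypothetical.

```python
def interval_weight(dist, b, e):
    """V(b, e) = InterGroup(b, e) - IntraGroup(b, e), with b and e inclusive.

    dist is the square matrix of pairwise window distances D[i][j].
    """
    w = len(dist)
    inside = range(b, e + 1)
    # Assumed IntraGroup: distances among windows inside the interval.
    intra = sum(dist[i][j] for i in inside for j in inside)
    # Assumed InterGroup: distances from inside windows to outside windows.
    inter = sum(dist[i][j] for i in inside for j in range(w) if j < b or j > e)
    return inter - intra
```

Under this reading an interval scores highly when its windows resemble one another and differ from everything outside the interval, which matches the stated characteristics of text information and format information.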
In step 60), the weight of each interval is calculated with an acceleration algorithm based on cumulative distances, where $D_{x,y}$ denotes the distance between segment x and segment y, and the cumulative distance $D_{i,j}$ denotes the distance between window character segments 0, 1, ..., (i-1) and window character segments 0, 1, ..., (j-1).
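The cumulative distance described above has the shape of a two-dimensional prefix sum, and the sketch below shows how such a table lets any block of pairwise distances be summed in constant time. The recurrence is the standard prefix-sum one and is offered as an assumption about the acceleration algorithm, not as the patent's formula; the InterGroup and IntraGroup definitions used in `fast_interval_weight` are likewise assumed, and all function names are hypothetical.

```python
def cumulative_distances(dist):
    """cum[i][j] = sum of dist[x][y] for all x < i and y < j."""
    w = len(dist)
    cum = [[0.0] * (w + 1) for _ in range(w + 1)]
    for i in range(1, w + 1):
        for j in range(1, w + 1):
            cum[i][j] = (dist[i - 1][j - 1] + cum[i - 1][j]
                         + cum[i][j - 1] - cum[i - 1][j - 1])
    return cum


def block_sum(cum, r0, r1, c0, c1):
    """Sum of dist[x][y] for r0 <= x < r1 and c0 <= y < c1, in O(1)."""
    return cum[r1][c1] - cum[r0][c1] - cum[r1][c0] + cum[r0][c0]


def fast_interval_weight(cum, b, e):
    """Interval weight for inclusive (b, e) using only table lookups."""
    w = len(cum) - 1
    intra = block_sum(cum, b, e + 1, b, e + 1)   # inside against inside
    rows_total = block_sum(cum, b, e + 1, 0, w)  # inside against all windows
    inter = rows_total - intra                   # inside against outside
    return inter - intra
```

Building the table costs time proportional to the square of the number of windows, after which each of the candidate intervals is scored with a fixed number of lookups instead of a fresh double loop.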
In the above technical solution, in step 70), the interval with the largest weight is selected as the best text interval.
In the above technical solution, in step 70), the intervals whose weights are greater than 0 are selected from the results of step 60) in descending order of weight, a weighted average is taken over the weights of these intervals, and the best text interval is selected according to the result of the weighted average.
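The first selection rule in step 70) is straightforward to sketch, as is the ranking of positive-weight intervals that the second rule starts from. The weighted-average step itself is not described in enough detail here to implement, so it is left out. The function names and the dictionary layout (interval pair mapped to weight) are assumptions.

```python
def best_interval(weights):
    """Return the (b, e) pair with the largest weight.

    weights maps an interval (b, e) to its weight V(b, e).
    """
    return max(weights, key=weights.get)


def ranked_positive_intervals(weights):
    """Intervals with weight > 0, sorted from largest to smallest weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(interval, value) for interval, value in ranked if value > 0]
```

With weights `{(0, 0): -1.0, (1, 2): 4.0, (1, 1): 2.0}`, `best_interval` returns `(1, 2)` and `ranked_positive_intervals` keeps only the two intervals with positive weight.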
The text information in the webpage is represented in a multi-byte character set, including Japanese, Korean, and Chinese.
The advantages of the invention are:
1. The invention uses the frequency-domain features of a webpage to segment the page, filter out noise, and then extract the useful information.
2. When the text content is long, the method can extract the webpage text effectively and separate it from the other parts of the page even if the page structure is complex and contains many kinds of interfering information, so the extraction accuracy is high.
3. The invention extracts the text content without analyzing the specific structure of the page, so it is general and applies to webpages of different styles and topics.
Brief description of the drawings
Figure 1 is a flow chart of the webpage text extraction method based on the fast Fourier transform according to the invention;
Figures 2a and 2b are schematic diagrams of the text strength function used for text strength encoding in the invention;
Figure 3 is a schematic diagram of the acceleration algorithm that uses cumulative distances to quickly compute the sum of distances over consecutive intervals when interval weights are calculated.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Before the webpage text extraction method based on the fast Fourier transform is described, webpages are first classified by page structure into the following categories:
Home page type: the home page of a website, which generally contains multiple sections, pictures, animations, and a number of article title links. Example: the NetEase home page.
List type: information is presented as a list, generally several entries laid out in a table, often with pagination. Example: the list of article titles on a forum board.
Text type: a bottom-level webpage that contains text content, generally no more than one article, with few or no comments. Example: the bottom-level page of any website that carries a specific article.
Comment type: in addition to the text, the page contains a number of comments following it; forums are typical.
The invention mainly targets the extraction of content from Chinese webpages of the "text type" described above. A text-type Chinese webpage usually contains a large block of text information, preceded and followed by format information (for example navigation information, interactive information, and JavaScript scripts).
Text information has the following characteristics:
1. It is located in the middle of the HTML source file;
2. It consists mainly of Chinese characters and English letters;
3. It is relatively continuous text;
4. The signal characteristics of text information are similar to one another;
5. The signal characteristics of text information differ from those of format information.
Format information has the following characteristics:
1. It is located at the beginning and end of the HTML source file;
2. It consists mainly of punctuation marks and English letters;
3. The signal characteristics of format information are similar to one another;
4. The signal characteristics of format information differ from those of text information.
Analysis of the HTML document model shows that a document is a mixture of three kinds of signal:
1) HTML tags (TAG), of the form "<tag><tag attribute=value></tag>".
For example:
<table width="756" border="0" align="center"></table>
2) Natural-language text (TEXT), that is, sentences made up of Chinese and English characters. For example: 关于我们 About us.
3) Script programs (SCRIPT). For example: function MM_findObj(n,d){var p,i,x;if(!d)}
Based on the structural features of text-type pages, the invention recasts text extraction as the following problem: given the HTML source file of a bottom-level webpage, find the best text interval. The concrete implementation steps of the method are described below using a Chinese webpage as an example:
Step 10: read in the HTML file, convert it to Unicode, and store it in a character array. After conversion, English letters fall in the ranges 'a' to 'z' and 'A' to 'Z', and Chinese characters fall in the range 0x0100 to 0xFFFF. The converted characters are stored in a character array $S^0$ whose length is $s^0$.
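A minimal sketch of step 10 follows. The source encoding of the page is not stated in the text, so the default of `gb18030` is purely an assumption for Chinese pages, and the names `to_unicode` and `is_text_char` are hypothetical. The second function encodes the character ranges given above.

```python
def to_unicode(raw, encoding="gb18030"):
    """Decode raw HTML bytes into a list of Unicode characters.

    Assumption: the page is encoded as gb18030 unless the caller says otherwise.
    """
    return list(raw.decode(encoding, errors="replace"))


def is_text_char(ch):
    """True for 'a'-'z', 'A'-'Z', or a code point in 0x0100-0xFFFF."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or (0x0100 <= ord(ch) <= 0xFFFF)
```

Note that under the stated ranges, digits and ASCII punctuation are not text-type characters, while full-width punctuation such as `。` (U+3002) falls inside 0x0100 to 0xFFFF and is therefore treated as text.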
Suppose a webpage about Shangri-La, Yunnan, from the NetEase travel channel is read in. After the page is converted to Unicode, the result is as follows (the original is long, so only part of it is excerpted here; the Chinese text is the sample input and is reproduced as-is):
“<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>为喇嘛做广东菜_芒果网易旅游</title>
……
<!--page-->
<!--
<div class="tpage">
<span><a href="">上一页</a></span><span class="fB"><a href="">1</a></span><span class="fB"><a href="">2</a></span><span class="fBcDRed">3</span><span><a href="">下一页</a></span>
</div>
-->
<div class="text" id="articlebody">
------------------------------------------------------------
The above is format information
------------------------------------------------------------
7.14丽江古城的人流就像广州的上下九步行街一样,全是游人,毫不夸张。只在清晨的石板街才见为数不多的纳西族老人,女族人更能坚守古老的信念,才披着七星坎肩。当然也能看见看着挺专业敬业的哥们端着长枪短炮的设备轰炸丽江的早晨渺渺美景。
<br>&nbsp;&nbsp;&nbsp;&nbsp;虽然小桥流水依旧,夜色大红灯笼也很诱惑,都在寻找着一种过剩的激情。我独自听完颇有韵味的纳西古乐走在街上的时候,甚至有个很奇怪的看我一个人凑上前来,热情的介绍有摩梭族的女孩走婚和跳艳舞,问是否需要看表演,?那是个惊讶,果然开放带来一切,当然我并不相信那些女孩是摩梭族的,都找的别地女人充数而已。当然我并没有去.后来在酒吧随便坐的时候,听一些驴友说甚至还有广东的走婚团就是为了体验走婚去的,足见人们的追求各异。
<br>&nbsp;&nbsp;&nbsp;&nbsp;我又想起在香格里拉的7月还有青梅子,是新鲜的。喜欢酸的朋友都知道,在东部6月,青梅子就熟透了,应该是气候的原因延迟了它的季节。但香格里拉的青梅子皮已经黄了,肉却不会软,仍然结实甚至是硬得很,那个酸啊,叫喜欢酸的朋友爱死了,叫怕酸的朋友简直可以把你酸死。一点也不夸张地。买了一斤3块钱,你已经知道我是极爱酸的,竟然吃了3天。别的喇嘛吃半个就受不了,口腔膜都要酸脱落一层的。但还是怀念那种味道。
------------------------------------------------------------
The above is the text part
------------------------------------------------------------
<br></div>
<!--page-->
<!--
<div class="tpage">
<span><a href="">上一页</a></span><span class="fB"><a href="">1</a></span><span class="fB"><a href="">2</a></span><span class="fBcDRed">3</span><span><a href="">下一页</a></span>
</div>
-->
……
//-->
</script>
<noscript>
<img src="//secure-cn.imrworldwide.com/cgi-bin/m?ci=cn-netease&amp;cg=0" alt="">
</noscript>
<!--END NNR Site Census V5.1-->
</body>
</html>
------------------------------------------------------------
The above is format information
------------------------------------------------------------”
After the information of the above webpage is converted to Unicode, it is stored in a character array.
Step 20: divide the character array obtained in step 10 into windows. The windows serve as the sampling unit: segments of characters of equal length are selected so that the Fourier transform can be applied in the subsequent steps. Let the window size be l. The file held in the character array $S^0$ is cut into consecutive character segments of length l, w segments in total, and the trailing characters that do not fill a whole window (fewer than l) are discarded, giving a new character array S of length s. With W denoting the two-dimensional array of windows, i the window number, and j the position within the window, the windows are given by:
$$W_{i,j} = S_{i \cdot l + j}, \quad i = 0, \ldots, w-1,\ j = 0, \ldots, l-1$$
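The windowing of step 20 can be sketched in a few lines. The function name `split_windows` is hypothetical; the behaviour (fixed-length consecutive segments, incomplete tail dropped) follows the description above.

```python
def split_windows(chars, length):
    """Cut chars into consecutive windows of `length`, dropping the short tail.

    Window i, position j corresponds to chars[i * length + j].
    """
    if length <= 0:
        raise ValueError("window length must be positive")
    count = len(chars) // length   # number of complete windows, w
    return [chars[i * length:(i + 1) * length] for i in range(count)]
```

For instance, a character array of 19375 characters with a window size of 32 yields 605 complete windows, and the last 15 characters are discarded.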
Taking the Shangri-La webpage above as the example again, the character array holding the page is divided into windows. With the window size set to 32, the result is as follows:
“Window[0]:??<!DOCTYPE html PUBLIC "-//W3C/
Window[1]:/DTD XHTML 1.0 Transitional//EN"
Window[2]:??"http://www.w3.org/TR/xhtml1/
Window[3]:DTD/xhtml1-transitional.dtd">??<
Window[4]:html xmlns="http://www.w3.org/19
Window[5]:99/xhtml">??<head>??<title>为喇嘛做广
Window[6]:东菜_芒果网易旅游</title>??<meta http-eq
……
Window[195]:span><span><a href="">下一页</a></
Window[196]:span>??</div>??-->??<div class="
Window[197]:text" id="articlebody">7.14丽江古城的
Window[198]:人流就像广州的上下九步行街一样,全是游人,毫不夸张。只在清晨的石
Window[199]:板街才见为数不多的纳西族老人,女族人更能坚守古老的信念,才披着七
Window[200]:星坎肩。当然也能看见看着挺专业敬业的哥们端着长枪短炮的设备轰炸丽
Window[201]:江的早晨渺渺美景。??<br>&nbsp;&nbsp;&nbsp
Window[202]:;&nbsp;虽然小桥流水依旧,夜色大红灯笼也很诱惑,都在寻找着
Window[203]:一种过剩的激情。我独自听完颇有韵味的纳西古乐走在街上的时候,甚至
Window[204]:有个很奇怪的看我一个人凑上前来,热情的介绍有摩梭族的女孩走婚和跳
Window[205]:艳舞,问是否需要看表演,?那是个惊讶,果然开放带来一切,当然我并
Window[206]:不相信那些女孩是摩梭族的,都找的别地女人充数而已。当然我并没有去
Window[207]:.后来在酒吧随便坐的时候,听一些驴友说甚至还有广东的走婚团就是为
Window[208]:了体验走婚去的,足见人们的追求各异。??<br>&nbsp;&n
……
Window[599]:rite(_rsCL);??//-->??</script>??
Window[600]:<noscript>??<img src="//secure-c
Window[601]:n.imrworldwide.com/cgi-bin/m?ci=
Window[602]:cn-netease&amp;cg=0" alt="">??</
Window[603]:noscript>??<!--END NNR Site Cen
Window[604]:sus V5.1-->??</body>??</html>??”
As this example shows, the webpage is divided into 605 segments.
Step 30: Using statistical principles, convert the characters into intensity codes. For every character that appears in the document, this step analyzes its pattern of occurrence across the whole document. A statistical analysis of the positions at which the character occurs yields the mean and standard deviation of those positions, together with the number of times the character occurs in the document. The body-text intensity value is computed from this mean, standard deviation, and occurrence count, taking the character segments produced in Step 20 as the unit of calculation. For each character segment, the characters are encoded to obtain an intensity value sequence I, calculated as follows:
$$I_{i,j} = M(W_{i,j},\; i \cdot l + j) = M(S_{i \cdot l + j},\; i \cdot l + j), \qquad i = 0, \ldots, w-1, \quad j = 0, \ldots, l-1$$
where M computes the intensity value of a single character. For a character c occurring at position x, the body-text intensity value is:
In the above formula, $\mu_c$ is the mean of the positions at which character c occurs, $\sigma_c$ is the standard deviation of those positions, and $N_c$ is the number of times character c occurs.
Body text contains more Chinese and English characters and fewer punctuation marks. Accordingly, if a character is of text type (that is, 'a' to 'z', 'A' to 'Z', or 0x0100 to 0xFFFF), the normal distribution formula is used as its encoding function and the result is non-negative; for all other characters, the normal distribution formula plus an offset is used as the encoding function and the result is non-positive. The formula is therefore positive for Chinese and English characters and negative for punctuation marks. Because body text sits in the middle of the document while punctuation marks sit toward both ends, the formula concentrates the signal strength of Chinese and English characters in the middle of the document and spreads the signal strength of punctuation marks toward the two ends. The intensity function is shown in Figure 2. As the figure shows, frequently occurring Chinese and English characters usually reflect commonly used words, so their signal strength increases proportionally; frequently occurring punctuation marks, such as the less-than and greater-than signs, usually reflect typesetting and formatting, so their signal strength increases proportionally in the negative direction.
Applying the same operation to every character segment yields an intensity value sequence for each segment.
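The encoding in Step 30 can be sketched roughly as follows. The position statistics (mean, standard deviation, count) and the text-type character ranges come from the description; the exact form of M is not available in this text, so `intensity` below is a hypothetical stand-in: an $N_c$-scaled Gaussian for text characters, and the same Gaussian shifted down by its peak value for other characters so that the result is non-positive. The `min_sigma` floor and the zero-padding of the last window are likewise assumptions.

```python
import math
from collections import defaultdict


def is_text_char(ch):
    """Text-type characters per the description: a-z, A-Z, or code points 0x0100-0xFFFF."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or (0x0100 <= ord(ch) <= 0xFFFF)


def position_stats(doc):
    """Return {char: (mean, std, count)} of the positions at which each character occurs."""
    positions = defaultdict(list)
    for x, ch in enumerate(doc):
        positions[ch].append(x)
    stats = {}
    for ch, xs in positions.items():
        n = len(xs)
        mu = sum(xs) / n
        var = sum((x - mu) ** 2 for x in xs) / n
        stats[ch] = (mu, math.sqrt(var), n)
    return stats


def intensity(ch, x, stats, min_sigma=1.0):
    """Hypothetical M(c, x); the patent's exact formula is not reproduced here."""
    mu, sigma, n = stats[ch]
    sigma = max(sigma, min_sigma)  # assumption: avoid division by zero for rare characters
    peak = 1.0 / (sigma * math.sqrt(2 * math.pi))
    pdf = peak * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    # Text characters: non-negative. Other characters: shifted by the peak, so non-positive.
    return n * pdf if is_text_char(ch) else n * (pdf - peak)


def intensity_windows(doc, l=32):
    """Encode every character, then cut the sequence into windows of l values."""
    stats = position_stats(doc)
    values = [intensity(ch, x, stats) for x, ch in enumerate(doc)]
    values += [0.0] * (-len(values) % l)  # assumption: zero-pad the final window
    return [values[i:i + l] for i in range(0, len(values), l)]
```

With this stand-in, markup characters such as `<` and `>` contribute values at or below zero that grow in magnitude away from their mean position, while letters and CJK characters contribute positive values that peak near their mean position.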
Step 40: Apply a fast Fourier transform to the intensity value sequence of each window segment obtained in Step 30 to obtain a frequency-domain vector F, calculated as follows:
$$F_i = \mathrm{FFT}(I_i)$$
The implementation of the fast Fourier transform is mature prior art and is not described in detail in this embodiment.
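For illustration only, Step 40 can be sketched with a textbook recursive radix-2 FFT from the standard library. The description does not say which FFT implementation is used; the function names here are hypothetical, and the sketch assumes the window length is a power of two (as with the 32-character windows in the example).

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def transform_windows(intensity_windows):
    """Compute F_i = FFT(I_i) for every window i."""
    return [fft(window) for window in intensity_windows]
```

In practice a library routine would normally replace the hand-written `fft`; the per-window structure of `transform_windows` is the part that mirrors the step.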
Step 50: Calculate the distance between any two character segments. The distance between two segments is the sum of the Euclidean distances at each frequency, calculated as follows:
As the formula shows, the distance between two segments is obtained by taking the difference between the two values at each corresponding frequency position and then combining all of the differences. For example, the distance between window A and window B uses the difference between a0 and b0, and so on up to the difference between a31 and b31; taking the square root of the sum of the squares of these differences gives the total Euclidean distance.
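A minimal sketch of the Step 50 distance, following the worked description (differences at corresponding frequency positions, squared, summed, then square-rooted). Because FFT outputs are complex, the sketch uses the squared modulus of each difference; that choice, and the function names, are assumptions rather than details confirmed by the text.

```python
import math


def window_distance(fa, fb):
    """Square root of the summed squared differences at corresponding frequencies."""
    if len(fa) != len(fb):
        raise ValueError("frequency vectors must have the same length")
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def distance_matrix(spectra):
    """Pairwise distances between all windows; symmetric with a zero diagonal."""
    w = len(spectra)
    d = [[0.0] * w for _ in range(w)]
    for x in range(w):
        for y in range(x + 1, w):
            d[x][y] = d[y][x] = window_distance(spectra[x], spectra[y])
    return d
```

The full matrix holds w × w values, so for the 605-window example it has roughly 366,000 entries, which is why the later steps reuse it through a lookup table.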
Step 60: Define intervals over the character segments and calculate the weight of each interval. An interval is a run of consecutive windows, denoted by a number pair (b, e), meaning that the interval consists of windows $W_b$ through $W_{e-1}$, where $0 \le b < e \le w$. Once an interval is set, all window segments in the file fall into two groups: the inside group A, comprising $W_b$ to $W_{e-1}$, and the outside group B, comprising $W_0$ to $W_{b-1}$ and $W_e$ to $W_{w-1}$. The full sequence of windows thus consists of the front part of group B, $\{W_0, W_1, \ldots, W_{b-1}\}$; group A, $\{W_b, W_{b+1}, \ldots, W_{e-1}\}$; and the rear part of group B, $\{W_e, W_{e+1}, \ldots, W_{w-1}\}$.
The weight of an interval is the sum of the inter-group differences minus the sum of the intra-group differences. The inter-group difference is obtained by taking the difference between any segment in the inside group A and any segment in the outside group B and summing over all such pairs. The intra-group difference is obtained by taking the difference between any two segments within group A, and between any two segments within group B, and summing over all such pairs. The interval weight is calculated as follows:
$$V(b, e) = \mathrm{InterGroup}(b, e) - \mathrm{IntraGroup}(b, e)$$
In this step, a preferred way to calculate the interval weights is an accelerated cumulative-distance algorithm, which quickly computes the sum of the differences between two contiguous groups. As shown in Figure 3, the algorithm is calculated as follows:
where $D_{x,y}$ denotes the distance between segment x and segment y, and the cumulative value $D_{i,j}$ denotes the distance between window segments 0, 1, ..., (i-1) and window segments 0, 1, ..., (j-1). The formula speeds up the calculation of the inter-group and intra-group differences: the cumulative value table is computed first, after which the inter-group and intra-group differences can be obtained quickly by table lookup and simple algebra. The values $D_{i,j}$, $i = 1, \ldots, w$, $j = 1, \ldots, w$, form the cumulative value table.
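One way to realize the cumulative value table is a two-dimensional prefix sum over the pairwise distance matrix, sketched below. The patent's own formula is not available in this text, so the details are assumptions: the table `p` has an extra zero row and column, the distance matrix is taken to be symmetric with a zero diagonal, and each pair of segments is counted once in both the inter-group and intra-group sums.

```python
def cumulative_table(d):
    """p[i][j] = sum of d[x][y] over x < i and y < j (2-D prefix sum)."""
    w = len(d)
    p = [[0.0] * (w + 1) for _ in range(w + 1)]
    for i in range(1, w + 1):
        for j in range(1, w + 1):
            p[i][j] = d[i - 1][j - 1] + p[i - 1][j] + p[i][j - 1] - p[i - 1][j - 1]
    return p


def block_sum(p, r0, r1, c0, c1):
    """Sum of d[x][y] for r0 <= x < r1 and c0 <= y < c1, by four table lookups."""
    return p[r1][c1] - p[r0][c1] - p[r1][c0] + p[r0][c0]


def interval_weight(p, b, e):
    """V(b, e) = InterGroup - IntraGroup, with each unordered pair counted once."""
    w = len(p) - 1
    total = block_sum(p, 0, w, 0, w)      # all ordered pairs
    aa = block_sum(p, b, e, b, e)         # ordered pairs inside group A
    ab = block_sum(p, b, e, 0, w) - aa    # pairs with one window in A, one in B
    bb = total - aa - 2 * ab              # ordered pairs inside group B
    return ab - (aa + bb) / 2
```

Building the table costs O(w²) once; after that each interval weight is a constant number of lookups, so all O(w²) candidate intervals can be scored without re-summing distances.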
Step 70: Sort the weights of all intervals calculated in Step 60; the interval with the largest weight is the best body-text interval. Because the intervals set in Step 60 cover every possible combination of consecutive windows, weights are obtained for many intervals. These weights are sorted in descending order, and the interval with the largest weight is selected as the best body-text interval. The content of that interval is the body text that the invention ultimately extracts from the web page.
For the Shangri-La web page described earlier, the weight calculation in Step 60 gives a maximum weight of 1.8671557984059033E9, for the interval with b = 197 and e = 395. This interval is the best body-text interval.
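The selection in Step 70 amounts to a sort or an arg-max over the interval weights. A small sketch, using three of the weights reported for the Shangri-La page; the dictionary layout and function names are illustrative assumptions:

```python
def ranked_intervals(weights):
    """Return [((b, e), V), ...] sorted by weight, largest first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


def best_interval(weights):
    """Return the (b, e) pair with the largest weight."""
    return max(weights, key=weights.get)


# Three of the weights listed for the example page, keyed by (b, e).
weights = {
    (198, 395): 1.865928902944519e9,
    (197, 395): 1.8671557984059033e9,
    (197, 394): 1.863446434026815e9,
}
print(best_interval(weights))  # (197, 395)
```

If only the single best interval is needed, `best_interval` avoids the full sort; `ranked_intervals` matches the descending ordering described in the step.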
In one embodiment, an alternative way to select the best body-text interval is to take a weighted average using the interval weights and so obtain the best body-text interval in an average sense. In practice, the weighted average is usually taken over the intervals whose weight is greater than 0, giving the average-sense best body-text interval (b*, e*). The weighted average is calculated as follows:
where V(b, e) denotes the interval weight.
Taking the Shangri-La web page again as an example, suppose the weight calculation in Step 60 yields 100 intervals with weight greater than 0. The weights and their intervals are as follows:
No.1: Area{b=197 e=395 w=1.8671557984059033E9}
No.2: Area{b=198 e=395 w=1.865928902944519E9}
No.3: Area{b=197 e=394 w=1.863446434026815E9}
No.4: Area{b=198 e=394 w=1.8620946999597936E9}
No.5: Area{b=197 e=396 w=1.8534012640629482E9}
No.6: Area{b=196 e=395 w=1.8533969765727189E9}
No.7: Area{b=198 e=396 w=1.852261927708008E9}
No.8: Area{b=199 e=395 w=1.8511999688045855E9}
No.9: Area{b=197 e=393 w=1.8500594430878716E9}
No.10: Area{b=196 e=394 w=1.849788102344682E9}
No.11: Area{b=198 e=393 w=1.848510799009436E9}
No.12: Area{b=199 e=394 w=1.8471652124879038E9}
No.13: Area{b=197 e=397 w=1.8453086053177962E9}
No.14: Area{b=195 e=395 w=1.845281305908179E9}
No.15: Area{b=198 e=397 w=1.8442583536949947E9}
No.16: Area{b=195 e=394 w=1.8417764283302329E9}
No.17: Area{b=197 e=392 w=1.8413777475416255E9}
No.18: Area{b=198 e=392 w=1.8396801709467006E9}
No.19: Area{b=196 e=396 w=1.8396421919565065E9}
No.20: Area{b=200 e=395 w=1.838057893744711E9}
No.21: Area{b=199 e=396 w=1.8377040837184753E9}
No.22: Area{b=196 e=393 w=1.8365645973901665E9}
No.23: Area{b=200 e=394 w=1.8338399474557528E9}
No.24: Area{b=199 e=393 w=1.8333431983722968E9}
No.25: Area{b=201 e=395 w=1.832882136920093E9}
No.26: Area{b=194 e=395 w=1.8327158264980187E9}
No.27: Area{b=197 e=398 w=1.8317380757017245E9}
No.28: Area{b=196 e=397 w=1.8315166911690896E9}
No.29: Area{b=195 e=396 w=1.8314938196044166E9}
No.30: Area{b=198 e=398 w=1.8307755060003867E9}
No.31: Area{b=202 e=395 w=1.830544198380903E9}
No.32: Area{b=199 e=397 w=1.829861304277684E9}
No.33: Area{b=194 e=394 w=1.829311678505044E9}
No.34: Area{b=195 e=393 w=1.828719245958915E9}
No.35: Area{b=201 e=394 w=1.8285160965672174E9}
No.36: Area{b=196 e=392 w=1.828012158947821E9}
No.37: Area{b=197 e=391 w=1.8270460014801817E9}
No.38: Area{b=202 e=394 w=1.8260582076294603E9}
No.39: Area{b=198 e=391 w=1.8251564260435276E9}
No.40: Area{b=200 e=396 w=1.824723891564548E9}
No.41: Area{b=199 e=392 w=1.8243151092166026E9}
No.42: Area{b=195 e=397 w=1.8233264390587733E9}
No.43: Area{b=203 e=395 w=1.822325780416904E9}
No.44: Area{b=195 e=392 w=1.8202939671958587E9}
No.45: Area{b=200 e=393 w=1.8198227669199252E9}
No.46: Area{b=201 e=396 w=1.8196575937589269E9}
No.47: Area{b=193 e=395 w=1.8191558800920327E9}
No.48: Area{b=194 e=396 w=1.8189200336928308E9}
No.49: Area{b=197 e=399 w=1.8179459850346885E9}
No.50: Area{b=196 e=398 w=1.8179439755481179E9}
No.51: Area{b=203 e=394 w=1.8176838100122943E9}
No.52: Area{b=202 e=396 w=1.8174102756958842E9}
No.53: Area{b=198 e=399 w=1.817070992891399E9}
No.54: Area{b=200 e=397 w=1.8170506741580334E9}
No.55: Area{b=199 e=398 w=1.8165496617398362E9}
No.56: Area{b=194 e=393 w=1.8164182449130914E9}
No.57: Area{b=193 e=394 w=1.8158518234459796E9}
No.58: Area{b=201 e=393 w=1.8143022038707862E9}
No.59: Area{b=196 e=391 w=1.8138511011079237E9}
No.60: Area{b=197 e=390 w=1.813416825235355E9}
No.61: Area{b=201 e=397 w=1.812101903347275E9}
No.62: Area{b=202 e=393 w=1.8116598519465666E9}
No.63: Area{b=198 e=390 w=1.8113552225214372E9}
No.64: Area{b=194 e=397 w=1.810719247254324E9}
No.65: Area{b=200 e=392 w=1.8106092331069574E9}
No.66: Area{b=202 e=397 w=1.8099494719208207E9}
No.67: Area{b=195 e=398 w=1.809720873865331E9}
No.68: Area{b=199 e=391 w=1.8095815493579323E9}
No.69: Area{b=203 e=396 w=1.8093194340361586E9}
No.70: Area{b=204 e=395 w=1.8091673410619712E9}
No.71: Area{b=194 e=392 w=1.8081203284794781E9}
No.72: Area{b=195 e=391 w=1.8062889464577138E9}
No.73: Area{b=192 e=395 w=1.8055887898178735E9}
No.74: Area{b=193 e=396 w=1.8053577759523911E9}
No.75: Area{b=201 e=392 w=1.8049212023955352E9}
No.76: Area{b=197 e=400 w=1.804362583413403E9}
No.77: Area{b=204 e=394 w=1.8043406024255657E9}
No.78: Area{b=196 e=399 w=1.8041515829944117E9}
No.79: Area{b=200 e=398 w=1.8039011525318637E9}
No.80: Area{b=198 e=400 w=1.8035751666398578E9}
No.81: Area{b=193 e=393 w=1.80312176475147E9}
No.82: Area{b=203 e=393 w=1.8030793314788742E9}
No.83: Area{b=199 e=399 w=1.8030163410122762E9}
No.84: Area{b=192 e=394 w=1.8023851986898751E9}
No.85: Area{b=202 e=392 w=1.8021209078151228E9}
No.86: Area{b=203 e=397 w=1.8019899976293116E9}
No.87: Area{b=196 e=390 w=1.8003818327393115E9}
No.88: Area{b=201 e=398 w=1.799061835030309E9}
No.89: Area{b=191 e=395 w=1.797390318129374E9}
No.90: Area{b=193 e=397 w=1.7971241276820748E9}
No.91: Area{b=194 e=398 w=1.797104678286477E9}
No.92: Area{b=202 e=398 w=1.797000014978798E9}
No.93: Area{b=204 e=396 w=1.796316784871037E9}
No.94: Area{b=195 e=399 w=1.7958957929261835E9}
No.95: Area{b=200 e=391 w=1.7956939769691014E9}
No.96: Area{b=199 e=390 w=1.7955746426529288E9}
No.97: Area{b=205 e=395 w=1.7951057911539783E9}
No.98: Area{b=193 e=392 w=1.7949530569627554E9}
No.99: Area{b=194 e=391 w=1.7942824448319867E9}
No.100: Area{b=191 e=394 w=1.79426301425113E9}
Applying the formula above, the weighted average of these weights gives begin = 182.3652086633145 and end = 404.76999807248177, from which the corresponding best body-text interval is obtained.
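The weighted-average selection can be sketched as below. The patent's formula is not available in this text, so the sketch assumes the common form in which each endpoint is averaged with its interval's weight as the coefficient, b* = Σ V·b / Σ V and e* = Σ V·e / Σ V, restricted to intervals with V > 0 as the description states. The function name and the `(b, e, V)` tuple layout are hypothetical.

```python
def weighted_average_interval(areas):
    """areas: iterable of (b, e, V). Returns (b*, e*) over intervals with V > 0."""
    positive = [(b, e, v) for b, e, v in areas if v > 0]
    total = sum(v for _, _, v in positive)
    if total == 0:
        raise ValueError("no interval with positive weight")
    b_star = sum(b * v for b, _, v in positive) / total
    e_star = sum(e * v for _, e, v in positive) / total
    return b_star, e_star
```

The result is generally fractional, as with the begin and end values reported above, so mapping it back to whole window indices (for example by rounding) is a further choice that this sketch leaves to the caller.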
The method of the invention achieves good results in practice:
In one example, "body-text" pages were randomly selected from five websites for the experiment: NetEase Travel (http://ok.travel.163.com/itinerar/list.jsp), e-Travel World (http://www.eyooworld.com/index.html), Hongxiu Tianxiang (http://www.hongxiu.com/), the Shuimu forum (www.newsmth.net), and the Keyuan Xingkong forum (www.kyxk.net). Fifty pages were selected from each site, for 250 pages in total.
The positions at which the body text starts and ends in the source code were determined by manual inspection; this correct body-text interval is denoted (B, E). The interval with the largest weight reported by the program, i.e. the best body-text interval, is denoted (b1, e1). The interval obtained by weighted averaging, i.e. the best body-text interval in an average sense, is denoted (b*, e*). With w denoting the total number of segments in the processed HTML source, this gives the accuracy R of the maximum-weight method and the accuracy R* of the weighted-average method.
Table 1 below gives the accuracy of body-text interval extraction on these pages.
Table 1
The experimental results show that the algorithm extracts body-text content accurately from web pages of differing structure. The mean R is above 90% for every site, and the mean R over the five sites is about 96.583%. The mean R* is above 90% for four of the sites, and the mean R* over the five sites is about 91.957%.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope all fall within the scope of the claims of the invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007100631827A CN101237465B (en) | 2007-01-30 | 2007-01-30 | A webpage context extraction method based on quick Fourier conversion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101237465A true CN101237465A (en) | 2008-08-06 |
CN101237465B CN101237465B (en) | 2010-11-03 |
Family
ID=39920823
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2007100631827A Expired - Fee Related CN101237465B (en) | 2007-01-30 | 2007-01-30 | A webpage context extraction method based on quick Fourier conversion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101237465B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20010098265A (en) * | 2000-04-29 | 2001-11-08 | 윤종용 | Home page moving service method and device thereof |
CN100442278C (en) * | 2003-09-18 | 2008-12-10 | 富士通株式会社 | Method and device for extracting webpage information block |
CN100432996C (en) * | 2004-12-07 | 2008-11-12 | 国际商业机器公司 | System, method and program for extracting web page core content based on web page layout |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101436309B (en) * | 2008-12-15 | 2011-03-30 | 北大方正集团有限公司 | Method and device for modifying formula operator |
CN102591612A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
CN102591612B (en) * | 2011-12-27 | 2014-12-03 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
US10255253B2 (en) | 2013-08-07 | 2019-04-09 | Microsoft Technology Licensing, Llc | Augmenting and presenting captured data |
US10776501B2 (en) | 2013-08-07 | 2020-09-15 | Microsoft Technology Licensing, Llc | Automatic augmentation of content through augmentation services |
US10817613B2 (en) | 2013-08-07 | 2020-10-27 | Microsoft Technology Licensing, Llc | Access and management of entity-augmented content |
CN105117500A (en) * | 2015-10-10 | 2015-12-02 | 成都携恩科技有限公司 | Data query and acquisition method under big data background |
CN105117500B (en) * | 2015-10-10 | 2018-07-06 | 成都携恩科技有限公司 | A kind of data query acquisition methods under big data background |
CN106951505A (en) * | 2017-03-16 | 2017-07-14 | 北京搜狐新媒体信息技术有限公司 | Info web preparation method and system |
CN114817639A (en) * | 2022-05-18 | 2022-07-29 | 山东大学 | A method and system for sorting web graph convolution documents based on contrastive learning |
CN114817639B (en) * | 2022-05-18 | 2024-05-10 | 山东大学 | Webpage diagram convolution document ordering method and system based on contrast learning |
Also Published As
Publication number | Publication date |
---|---|
CN101237465B (en) | 2010-11-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20101103 Termination date: 20130130 |