CN103853834B - Text structure analysis-based Web document abstract generation method - Google Patents

Info

Publication number: CN103853834B
Authority: CN (China)
Prior art keywords: text, semantic, sentence, segmentation, webpage
Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201410090200.0A
Other languages: Chinese (zh)
Other versions: CN103853834A (en)
Inventors: 沈怡涛, 顾君忠, 林晨
Current Assignee: East China Normal University (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: East China Normal University

Application filed by East China Normal University
Priority to CN201410090200.0A
Publication of CN103853834A
Application granted
Publication of CN103853834B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL as input, extracts the body text of the web page by combining visual features and text features, divides the body text into several semantic segments, and summarizes each segment, so that the generated summary has higher coverage. The method addresses the problems that web page structure is complex, that the body text is hard to identify, and that automatic summarization of Chinese text is still at an exploratory stage, and it produces a better-quality text summary from a web page.

Description

A Method for Generating Web Document Summaries Based on Text Structure Analysis

Technical Field

The invention relates to the technical fields of web page body-text extraction, natural language processing, and automatic summarization of Chinese text, and specifically to a method for generating Web document summaries based on text structure analysis.

Background

The Internet has become the main source from which people obtain information. With the rapid growth of user-generated content (UGC) in recent years, the amount of information on the Internet is increasing explosively. A search engine can return results for a user's query, but the user still has to pick the most suitable pages from the result list. Widespread search engine optimization and reposting on the Internet make it hard for users to find information quickly and accurately.

An automatic summarization system uses a computer to process a Web document quickly and extract its core content at a given compression ratio. From the summary, users can learn the topic of the document and judge its value, which makes searching for information more efficient.

Web documents contain a large amount of noise, such as advertisements, navigation bars, user toolbars, related recommendations, copyright notices, and other content unrelated to the topic. A Web document is semi-structured: it has some structure, but its semantics cannot be determined from that structure. The representation of content in the HTML source can differ greatly from the page as finally rendered. The wide use of JavaScript (JS) and AJAX in recent years means that page data is no longer static HTML but is generated dynamically, and may even change in response to user actions. Extracting topic-relevant, correctly structured content from a Web document is therefore difficult.

Research on automatic summarization of Chinese text has a history of some twenty years, but it is still at an exploratory stage and the results remain far from satisfactory. Automatic summarization methods fall into two main categories: understanding-based and extraction-based. Because natural language processing has not yet made a major breakthrough, understanding-based methods cannot truly achieve automatic summarization.

Research on automatic summarization of Web documents has an even shorter history. "Compared with traditional text, the text structure of web pages is loose, titles are named less rigorously, a sentence may end without a terminator, and there is a large amount of content unrelated to the body text, all of which makes summary generation more difficult."

Summary of the Invention

The object of the invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines visual feature analysis, natural language analysis, and text structure analysis to generate a semantics-based, good-quality summary for each web page in a set of search results, for the user's reference.

The object of the invention is achieved as follows:

A method for generating Web document summaries based on text structure analysis, comprising the following steps:

1) Input the URL of the web page to be summarized;

2) Extract the body text from the web page based on visual analysis, specifically:

2.1) Parse and render the Web document with a browser core;

2.2) Partition the web page into blocks with the visual tree (VIPS) algorithm, obtaining the position and area of each block;

2.3) Segment the text of each block into words;

2.4) Analyze the text features of each block;

2.5) Score each block on whether it contains body text;

2.6) Concatenate, in order, the text of the blocks whose score exceeds a threshold;

2.7) Output the body text of the Web document;

3) Summarize the extracted body text automatically based on text structure analysis, specifically:

3.1) Obtain the body text from step 2);

3.2) Perform word segmentation and part-of-speech tagging on the body text;

3.3) Preprocess the text: identify the basic structure of the body text, that is, identify the article title and split the text into sentences and paragraphs;

3.4) Divide the body text into semantic segments, using text structure analysis to identify the positions where the semantics shift; these positions mark the segment boundaries;

3.5) For each semantic segment, use a generalization of TF-IDF to measure the importance of each sentence within its segment, then, according to the required summary length, extract the sentences that best represent the topic of the segment;

3.6) Concatenate the extracted sentences in order and output the summary.

The text features in step 2.4) are the character count, font size, number of declarative sentences, number of non-declarative sentences, and number of text fragments.

In step 2.5), the score of each block is computed with the following formula:

$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$

where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value computed from the size and position of the block, $(x_1, y_1)$ are the coordinates of the upper-left corner of the block, and $(x_2, y_2)$ are the coordinates of the lower-right corner.
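
The block score can be sketched in a few lines of Python. This is a minimal illustration that assumes the sentence counts and the position value P have already been computed for a block; the function name `block_score` and its parameter names are hypothetical and do not come from the patent.

```python
def block_score(declarative: int, non_declarative: int, position_value: float) -> float:
    """V(S) = S^2 * P / (N + 1) for one page block."""
    return declarative ** 2 * position_value / (non_declarative + 1)


# Hypothetical blocks with the same position value: one with many declarative
# sentences (article-like) and one made mostly of fragments (navigation-like).
article_like = block_score(declarative=12, non_declarative=2, position_value=1000.0)
nav_like = block_score(declarative=1, non_declarative=9, position_value=1000.0)
```

Squaring S rewards blocks with many declarative sentences, while dividing by N + 1 penalizes blocks dominated by non-declarative text without risking division by zero.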

In step 3.4), the positions where the semantics shift are identified as follows:

3.4-1) Split document D into sentences; the boundary between every two adjacent sentences is a candidate split point;

3.4-2) Score each candidate split point with the formula:

$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$

where $R(s_i, s_j)$ is the inter-sentence semantic relatedness of sentences $s_i$ and $s_j$, and $p_i$ is the candidate split point between sentences $s_{i-1}$ and $s_i$. If $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the split-point score and is therefore a split point between semantic segments of the text. The adjustable empirical parameter a sets the range of the semantic analysis, that is, a sentences before and a sentences after the split point are considered.

3.4-3) If the score of a candidate split point exceeds a threshold and is a local maximum, that is, higher than the scores of the candidate split points immediately before and after it, that point is a semantic segment boundary, namely the position where the semantics shift referred to in step 3.4).
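
Steps 3.4-1) to 3.4-3) can be sketched as follows. This is an illustrative reading, not the patent's implementation: sentences use 0-based indices, the summation bounds are taken literally from the formula as printed, the first and last candidate points are skipped because they lack two neighbours, and the names `split_scores` and `find_split_points` are hypothetical. The relatedness function is passed in as a parameter.

```python
from typing import Callable, List, Sequence


def split_scores(
    sentences: Sequence[str],
    relatedness: Callable[[str, str], float],
    a: int,
) -> List[float]:
    """Return Q(p_i) for i = 1 .. len(sentences) - 1.

    p_i is the candidate split point between sentences[i - 1] and sentences[i].
    Index ranges follow the formula as printed: i + 1 < j <= i + a for the
    following sentences and i - a <= j < i for the preceding ones.
    """
    n = len(sentences)
    scores = []
    for i in range(1, n):
        after = sum(
            relatedness(sentences[i], sentences[j])
            for j in range(i + 2, min(i + a, n - 1) + 1)
        )
        before = sum(
            relatedness(sentences[i], sentences[j])
            for j in range(max(i - a, 0), i)
        )
        scores.append(after - before)
    return scores


def find_split_points(scores: Sequence[float], threshold: float) -> List[int]:
    """Positions k where scores[k] exceeds the threshold and both neighbours."""
    return [
        k
        for k in range(1, len(scores) - 1)
        if scores[k] > threshold
        and scores[k] > scores[k - 1]
        and scores[k] > scores[k + 1]
    ]
```

Because `scores[k]` holds $Q(p_{k+1})$, a returned position k marks a boundary between `sentences[k]` and `sentences[k + 1]`.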

The inter-sentence semantic relatedness used in step 3.4-2) is computed as follows:

3.4-2-1) Split each sentence into a set of words;

3.4-2-2) Compute the inter-sentence semantic relatedness with the formula:

$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$

where $R(w_i, w_j)$ is the inter-word semantic relatedness of words $w_i$ and $w_j$.
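
A minimal sketch of this sum-of-maxima measure is given below. It assumes the sentences are already segmented into words and that some word-level relatedness function is available; the name `sentence_relatedness` and the choice to return 0.0 for an empty second sentence are assumptions made for the illustration.

```python
from typing import Callable, Iterable


def sentence_relatedness(
    words1: Iterable[str],
    words2: Iterable[str],
    word_relatedness: Callable[[str, str], float],
) -> float:
    """R(s1, s2): for each word of s1, take its best match in s2, then sum."""
    words2 = list(words2)
    if not words2:
        return 0.0  # assumption: nothing to match against
    return sum(max(word_relatedness(w1, w2) for w2 in words2) for w1 in words1)
```

Note that the measure is not symmetric: the sum runs over the words of the first sentence only, so R(s1, s2) and R(s2, s1) can differ when the sentences have different lengths.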

In step 3.5), the importance of each sentence within its semantic segment is computed with the formula:

$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$

where, when computing TFIDF(w), each paragraph is treated as a separate document and the paragraphs of the whole article are treated as the document collection.
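
The sentence score can be sketched as below. The patent does not spell out which TF-IDF variant it uses, so this example assumes tf = count / paragraph length and idf = log(number of paragraphs / number of paragraphs containing the word), and it sums over the distinct words of the sentence. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def paragraph_tfidf(paragraphs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """TF-IDF weight of every word in each paragraph.

    Each paragraph (a list of words) is one document; the list of paragraphs
    is the collection. Assumed variant: tf = count / length,
    idf = log(paragraph count / paragraphs containing the word).
    """
    n = len(paragraphs)
    df = Counter(word for para in paragraphs for word in set(para))
    weights = []
    for para in paragraphs:
        counts = Counter(para)
        weights.append(
            {w: (c / len(para)) * math.log(n / df[w]) for w, c in counts.items()}
        )
    return weights


def sentence_score(sentence: Sequence[str], weights: Dict[str, float]) -> float:
    """V(S1): sum of TFIDF(w) over the distinct words w of the sentence."""
    return sum(weights.get(w, 0.0) for w in set(sentence))
```

With this variant, a word that occurs in every paragraph gets weight zero, so sentences score highly only when they contain words concentrated in their own paragraph.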

The invention filters out text, links, and other content unrelated to the topic of a web page and identifies the body text of the article the page contains, with high accuracy and robustness. The summarization stage uses automatic summarization based on text structure analysis, so the generated summaries cover the source well and read fairly fluently.

Given only the URL of the web page to be summarized and a compression ratio specified by the user, the invention produces, within a few seconds, a reasonably accurate and fluent summary that covers the meaning of the original text, helping users find information on the Internet quickly and accurately.

Brief Description of the Drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a flowchart of the web page preprocessing of the invention;

Fig. 3 is a flowchart of the automatic summarization of the invention.

Detailed Description

The invention discloses a search-engine-oriented method for generating Web document summaries, which automatically analyzes a web page and generates a text summary reflecting the topic of the page.

The invention comprises a web page body-text extraction stage that combines visual features and text features, and an automatic text summarization stage that divides the text into subtopics through text structure analysis.

The invention takes a URL as input and produces a text summary after two stages: body-text extraction and automatic summarization.

The algorithms of the two stages are described below, using the summarization of a news web page as an example:

Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, comprising the web page preprocessing flow and the automatic summarization flow.

In the embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the preprocessing flow locates the body text of the page more accurately and is more robust than other methods. It also takes into account text features, text relatedness analysis, HTML tag features, semantic features, and other features to further improve the accuracy of body-text extraction.

The page rendering step reads the web page at the input URL. In this embodiment, the IE11 browser core processes the HTML tags and renders the page. On the rendered page, the visual tree analysis step applies the VIPS algorithm to obtain the position and area of each block. In this embodiment, this step divides the news page into 6 blocks: a top block, a bottom block, a navigation block, an advertisement block, and two blocks containing body text. The word segmentation step segments the text of each block into words. The text feature analysis step then analyzes the text features of the segmentation result. Finally, the combined analysis step analyzes the block features obtained from the visual tree analysis together with the text features and outputs the body text of the news article.

In this embodiment, $P(x_1, y_1, x_2, y_2)$ is computed with the following formula:

$$P(x_1, y_1, x_2, y_2) = (x_2 - x_1)(y_2 - y_1) - x_1 y_1$$

where $(x_1, y_1)$ are the coordinates of the upper-left corner of the block and $(x_2, y_2)$ are the coordinates of the lower-right corner. The V(S) value of each block is then computed:

$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$

The V(S) values of the above 6 blocks, from largest to smallest, are $3.7 \times 10^{6}$, $2.3 \times 10^{6}$, $7.5 \times 10^{5}$, $5.4 \times 10^{6}$, $3.7 \times 10^{5}$, $1.6 \times 10^{5}$, $1.2 \times 10^{4}$.

In this embodiment the threshold is $10^{6}$, so the blocks with V(S) greater than $10^{6}$ are selected, namely the two blocks with the largest V(S) values. In this embodiment these are the two blocks containing body text, so the news body text is extracted correctly.
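
The position value and the threshold test can be sketched together as follows. The block dictionaries, their keys, the coordinates, and the sentence counts are all made up for illustration; only the two formulas and the threshold of 10^6 come from the embodiment.

```python
from typing import Dict, List


def position_value(x1: float, y1: float, x2: float, y2: float) -> float:
    """P = (x2 - x1) * (y2 - y1) - x1 * y1: block area minus a corner-offset term."""
    return (x2 - x1) * (y2 - y1) - x1 * y1


def select_body_blocks(blocks: List[Dict], threshold: float = 1e6) -> List[str]:
    """Return the text of blocks whose V(S) exceeds the threshold, in page order."""
    selected = []
    for block in blocks:
        p = position_value(*block["box"])
        v = block["declarative"] ** 2 * p / (block["non_declarative"] + 1)
        if v > threshold:
            selected.append(block["text"])
    return selected


# Hypothetical blocks: a navigation bar, an article block, and an advertisement.
blocks = [
    {"text": "Home | News | Sports", "box": (0, 0, 1000, 40),
     "declarative": 0, "non_declarative": 3},
    {"text": "First half of the article ...", "box": (0, 100, 700, 600),
     "declarative": 10, "non_declarative": 1},
    {"text": "Buy now!", "box": (720, 100, 1000, 400),
     "declarative": 1, "non_declarative": 1},
]
body = select_body_blocks(blocks)
```

In this toy page only the article block passes the threshold: it is large, sits near the top-left of the page, and consists mainly of declarative sentences.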

After the news body text has been extracted, the automatic summarization flow follows (see Fig. 3). It comprises text preprocessing, inter-word relatedness computation, inter-sentence relatedness computation, semantic segmentation, and summary generation.

The text preprocessing step identifies the basic structure of the body text, that is, it identifies the article title and splits the text into sentences and paragraphs. In this embodiment, the news body text contains 8 paragraphs and 23 sentences.

The inter-word relatedness step draws on the computational semantics knowledge provided by HowNet and obtains the relatedness of two words by computing the similarity of their sememes, using the formula:

$$R(w_1, w_2) = \max_{C_i \in w_1,\; C_j \in w_2} \mathrm{Rele}(C_i, C_j)$$

where $R(w_1, w_2)$ is the semantic relatedness of the two words and $\mathrm{Rele}(C_i, C_j)$ is the relatedness of two sememes; the maximum over all sememe pairs is taken as the semantic relatedness of the two words.
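
The maximum over sememe pairs can be sketched as below. The sememe lookup is modelled as a plain dictionary and the sememe relatedness as a caller-supplied function; neither reflects HowNet's actual data format or any real HowNet API, and treating words missing from the lexicon as unrelated is an assumption.

```python
from typing import Callable, Dict, List


def word_relatedness(
    w1: str,
    w2: str,
    sememes: Dict[str, List[str]],
    rele: Callable[[str, str], float],
) -> float:
    """R(w1, w2): maximum Rele(Ci, Cj) over the sememes Ci of w1 and Cj of w2."""
    pairs = [(c1, c2) for c1 in sememes.get(w1, []) for c2 in sememes.get(w2, [])]
    if not pairs:
        return 0.0  # assumption: words missing from the lexicon are unrelated
    return max(rele(c1, c2) for c1, c2 in pairs)


# Toy lexicon and sememe relatedness, invented for the example.
toy_sememes = {
    "doctor": ["human", "medical"],
    "nurse": ["human", "medical"],
    "stone": ["material"],
}
toy_rele = lambda c1, c2: 1.0 if c1 == c2 else 0.2
```

For example, `word_relatedness("doctor", "nurse", toy_sememes, toy_rele)` picks the best-matching sememe pair, so two words that share any sememe receive the highest score under this toy measure.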

The inter-sentence relatedness step obtains the relatedness of two sentences by analyzing the relatedness between their words:

$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$

where $R(s_1, s_2)$ is the relatedness of the two sentences. For each word in sentence 1, the word in sentence 2 most related to it is found and the relatedness of that pair is computed; these maxima are then summed to give the relatedness of the two sentences.

The semantic segmentation step follows the paper "Research on Text Structure Analysis Method Based on Content Correlation Calculation" (《基于内容相关度计算的文本结构分析方法研究》) for text structure analysis. A split point between semantic segments is characterized by the first sentence after the split point having low relatedness to the several sentences before it and high relatedness to the several sentences after it. The following formula is used to score the 22 candidate split points between the 23 sentences of this embodiment and to find the local maxima of $Q(p_i)$:

$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$

In this embodiment, $Q(p_i)$ has 2 local maxima, and the news article is divided at these two points into 3 semantic segments. Each semantic segment covers one subtopic of the news article: the first segment is an overview of the news event, and the other two are the comments of two parties on the event.
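
Once the split points are known, cutting the sentence list into semantic segments is straightforward. The sketch below uses 23 placeholder sentences and two hypothetical split positions (the patent does not say where the two maxima fall); the function name `split_into_segments` is illustrative.

```python
from typing import List, Sequence


def split_into_segments(
    sentences: Sequence[str], split_points: Sequence[int]
) -> List[List[str]]:
    """Cut the sentence list before each index in split_points.

    A split point i means a boundary between sentences[i - 1] and sentences[i].
    """
    segments = []
    start = 0
    for point in sorted(split_points):
        segments.append(list(sentences[start:point]))
        start = point
    segments.append(list(sentences[start:]))
    return segments


# 23 placeholder sentences and two made-up split points.
sentences = [f"s{k}" for k in range(23)]
segments = split_into_segments(sentences, [9, 16])
```

Two split points always yield three segments, which matches the three subtopics described in the embodiment.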

The summary generation step extracts a summary from the plain-text body at the proportion requested by the user.

In this embodiment, the summary generation step uses the inter-sentence relatedness step to compute, for each subtopic, the sum of the relatedness between its sentences and the word sequence of the article title, which determines the value of the subtopic. The number of sentences extracted from a subtopic is proportional to the relatedness between that subtopic and the article title.

In this embodiment, the proportion specified by the user is 0.2, that is, 5 of the 23 sentences are extracted to form the summary. The values of the 3 subtopics are computed, and 2, 1, and 1 sentences are extracted from the 3 semantic segments respectively. Finally, the summary generation step concatenates the 5 selected summary sentences in order to form the summary and outputs it.
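
One way to turn "proportional to the subtopic's value" into whole sentence counts is largest-remainder rounding, sketched below. The rounding scheme, the function name `allocate_sentences`, and the subtopic values are assumptions; the patent states only the proportionality and does not give the values it computed.

```python
import math
from typing import List, Sequence


def allocate_sentences(values: Sequence[float], total: int) -> List[int]:
    """Distribute `total` summary sentences over segments in proportion to `values`.

    Uses largest-remainder rounding (an assumption; the patent only states
    that the counts are proportional to each subtopic's value).
    """
    weight = sum(values)
    if weight <= 0 or total <= 0:
        return [0] * len(values)
    exact = [total * v / weight for v in values]
    counts = [math.floor(x) for x in exact]
    # Hand the remaining sentences to the segments with the largest fractional parts.
    leftovers = sorted(
        range(len(values)), key=lambda k: exact[k] - counts[k], reverse=True
    )
    for k in leftovers[: total - sum(counts)]:
        counts[k] += 1
    return counts


summary_length = round(0.2 * 23)  # 4.6 is rounded to a whole number of sentences
quota = allocate_sentences([6.0, 3.0, 3.0], summary_length)  # hypothetical values
```

Largest-remainder rounding guarantees that the per-segment counts add up exactly to the requested summary length, which plain rounding of each share does not.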

Claims (5)

1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:

1) Input the URL of the web page to be summarized;

2) Extract the body text from the web page based on visual analysis, specifically:

2.1) Parse and render the Web document with a browser core;

2.2) Partition the web page into blocks with the visual tree algorithm, obtaining the position and area of each block;

2.3) Segment the text of each block into words;

2.4) Analyze the text features of each block;

2.5) Score each block on whether it contains body text, computing the score with the following formula:

$$V(S) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$

where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value computed from the size and position of the block, $(x_1, y_1)$ are the coordinates of the upper-left corner of the block, and $(x_2, y_2)$ are the coordinates of the lower-right corner;

2.6) Concatenate, in order, the text of the blocks whose score exceeds a threshold;

2.7) Output the body text of the Web document;

3) Summarize the extracted body text automatically based on text structure analysis, specifically:

3.1) Obtain the body text from step 2);

3.2) Perform word segmentation and part-of-speech tagging on the body text;

3.3) Preprocess the text: identify the basic structure of the body text, that is, identify the article title and split the text into sentences and paragraphs;

3.4) Divide the body text into semantic segments, using text structure analysis to identify the positions where the semantics shift; these positions mark the segment boundaries;

3.5) For each semantic segment, use a generalization of TF-IDF to measure the importance of each sentence within its segment, then, according to the required summary length, extract the sentences that best represent the topic of the segment;

3.6) Concatenate the extracted sentences in order and output the summary.

2. The method according to claim 1, characterized in that the text features in step 2.4) are the character count, font size, number of declarative sentences, number of non-declarative sentences, and number of text fragments.

3. The method according to claim 1, characterized in that in step 3.4) the positions where the semantics shift are identified as follows:

3.4-1) Split document D into sentences; the boundary between every two adjacent sentences is a candidate split point;

3.4-2) Score each candidate split point with the formula:

$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$

where $R(s_i, s_j)$ is the inter-sentence semantic relatedness of sentences $s_i$ and $s_j$, and $p_i$ is the candidate split point between sentences $s_{i-1}$ and $s_i$; if $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the split-point score and is therefore a split point between semantic segments of the text; a is an adjustable empirical parameter that sets the range of the semantic analysis, that is, a sentences before and a sentences after the split point are considered;

3.4-3) If the score of a candidate split point exceeds a threshold and is a local maximum, that is, higher than the scores of the candidate split points immediately before and after it, that point is a semantic segment boundary, namely the position where the semantics shift referred to in step 3.4).

4. The method according to claim 3, characterized in that the computation of the inter-sentence semantic relatedness in step 3.4-2) comprises the following steps:

3.4-2-1) Split each sentence into a set of words;

3.4-2-2) Compute the inter-sentence semantic relatedness with the formula:

$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$

where $R(w_i, w_j)$ is the inter-word semantic relatedness of words $w_i$ and $w_j$.

5. The method according to claim 1, characterized in that in step 3.5) the importance of each sentence within its semantic segment is computed with the formula:

$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$

where, when computing TFIDF(w), each paragraph is treated as a separate document and the paragraphs of the whole article are treated as the document collection.
CN201410090200.0A 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method Expired - Fee Related CN103853834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Publications (2)

Publication Number Publication Date
CN103853834A CN103853834A (en) 2014-06-11
CN103853834B true CN103853834B (en) 2017-02-08

Family

ID=50861489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410090200.0A Expired - Fee Related CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Country Status (1)

Country Link
CN (1) CN103853834B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677764B (en) * 2015-12-30 2020-05-08 百度在线网络技术(北京)有限公司 Information extraction method and device
CN106484768B (en) * 2016-09-09 2019-12-31 天津海量信息技术股份有限公司 Local feature extraction method and system for text content saliency region
CN106844340B (en) * 2017-01-10 2020-04-07 北京百度网讯科技有限公司 News abstract generating and displaying method, device and system based on artificial intelligence
CN108959312B (en) 2017-05-23 2021-01-29 华为技术有限公司 Method, device and terminal for generating multi-document abstract
CN107346335B (en) * 2017-06-28 2020-04-14 浙江大学 A method for web page topic block recognition based on combined features
CN107622046A (en) * 2017-09-01 2018-01-23 广州慧睿思通信息科技有限公司 A kind of algorithm according to keyword abstraction text snippet
CN107766325B (en) * 2017-09-27 2021-05-28 百度在线网络技术(北京)有限公司 Text splicing method and device
CN108427761B (en) * 2018-03-21 2022-01-14 腾讯科技(深圳)有限公司 News event processing method, terminal, server and storage medium
CN110889280B (en) * 2018-09-06 2023-09-26 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN110968752A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Data acquisition method and device, storage medium and electronic equipment
CN113515627B (en) * 2021-05-19 2023-07-25 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN114330315A (en) * 2021-12-28 2022-04-12 浙江大华技术股份有限公司 Method and device for processing secure text, storage medium and electronic device
CN114417808B (en) * 2022-02-25 2023-04-07 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method and system for extracting and processing network information
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930376B2 (en) * 2008-02-15 2015-01-06 Yahoo! Inc. Search result abstract quality using community metadata

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
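"Research on Text Structure Analysis Methods Based on Content Relevance Computation"; Zhong Maosheng (钟茂生); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20101015 (No. 10); I138-81 *
"Research on a Block-Based Algorithm for Extracting Web Page Body Text"; Huang Wenbei (黄文蓓) et al.; Journal of Computer Applications (计算机应用); 20070601; Vol. 6 (No. S1); 24-26 *
"Research on Multi-Web-Page Automatic Summarization Based on Latent Semantic Analysis"; He Yuanyuan (何媛媛); China Master's Theses Full-text Database, Information Science and Technology; 20080115 (No. 01); I138-1310 *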

Also Published As

Publication number Publication date
CN103853834A (en) 2014-06-11

Similar Documents

Publication Publication Date Title
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN106649818B (en) Application search intent identification method, device, application search method and server
TWI695277B (en) Automatic website data collection method
CN102200975B (en) Vertical search engine system using semantic analysis
CN103049435A (en) Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device
CN106126619A (en) A kind of video retrieval method based on video content and system
CN103810251B (en) Method and device for extracting text
WO2020074017A1 (en) Deep learning-based method and device for screening for keywords in medical document
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
CN105975639B (en) Search result ordering method and device
CN103544266A (en) Method and device for generating search suggestion words
CN109446313B (en) Sequencing system and method based on natural language analysis
CN105608075A (en) Related knowledge point acquisition method and system
CN106156143A (en) Page processor and web page processing method
CN107871002A (en) A Cross-lingual Plagiarism Detection Method Based on Fingerprint Fusion
CN111199151A (en) Data processing method and data processing device
CN111191413A (en) Method, device and system for automatically marking event core content based on graph sequencing model
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN105426379A (en) Keyword weight calculation method based on position of word
CN106372232B (en) Information mining method and device based on artificial intelligence
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN118170933B (en) A method and device for constructing multimodal corpus data in scientific fields
CN103514194B (en) Determine method and apparatus and the classifier training method of the dependency of language material and entity
Shah et al. An automatic text summarization on Naive Bayes classifier using latent semantic analysis
CN110019814B (en) A news information aggregation method based on data mining and deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

Termination date: 20200312