CN103793491B - Chinese news story segmentation method based on flexible semantic similarity measurement - Google Patents


Info

Publication number
CN103793491B
CN103793491B (application CN201410027012.3A)
Authority
CN
China
Prior art keywords
word
sim
semantic similarity
semantic
flexible
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410027012.3A
Other languages
Chinese (zh)
Other versions
CN103793491A (en)
Inventor
冯伟
万亮
聂学成
高晓妮
党建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING HONGBO ZHIWEI SCIENCE & TECHNOLOGY Co.,Ltd.
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201410027012.3A priority Critical patent/CN103793491B/en
Publication of CN103793491A publication Critical patent/CN103793491A/en
Application granted granted Critical
Publication of CN103793491B publication Critical patent/CN103793491B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese news story segmentation method based on flexible semantic similarity measurement. The method comprises the following steps: a target text collection is input, and word segmentation is performed on each news story script Ti in the collection; a context relation graph is built; the context relevance between words is iteratively propagated over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix; the flexible semantic similarity between sentences is defined through the flexible semantic relevance matrix; the Chinese news stories are segmented using the flexible semantic similarity. This flexible measure can more reasonably express the semantic similarity between words and between word sets. Experiments show that, under the same segmentation criteria, the flexible semantic similarity measure improves the accuracy of Chinese news story segmentation by 3% to 10% compared with traditional similarity measures.

Description

A Chinese news story segmentation method based on flexible semantic similarity measurement
Technical field
The present invention relates to the field of Chinese news story segmentation, and in particular to a Chinese news story segmentation method based on flexible semantic similarity measurement.
Background technology
With the popularization and development of networks, multimedia content such as broadcast news, meeting minutes and online open courses is growing rapidly, and an effective method is urgently needed to organize such multimedia data automatically for topic-based text retrieval and analysis. A multimedia document, for example an hour-long broadcast news program, generally consists of multiple stories. For efficient semantic retrieval, it is important to guide users to the beginning and end of the topics they are interested in; meanwhile, splitting a multimedia document is an important prerequisite for higher-level semantic browsing such as topic tracking[1], classification and summarization[2]. The purpose of news story segmentation is to divide a news story script into topically coherent stories. Technically, the effectiveness of news story segmentation depends on two factors: one is how the similarity between words, and between sentence sets, is measured; the other is the criterion used to split the news story script.
Much previous work has focused on designing reasonable segmentation criteria, for example TextTiling[3][4], the minimum normalized cuts criterion (minimum ncuts)[5][6] and the maximum lexical cohesion criterion[7]. In contrast to the widely studied segmentation criteria, most current work uses a simple repetition-based rigid similarity measure, i.e. the similarity between identical words is 1 and the similarity between different words is 0. Clearly, such a repetition-based rigid similarity measure ignores the latent semantic relevance between different words, making the semantic relation measure, and hence the resulting Chinese news story segmentation, inaccurate. A more reasonable semantic similarity measure is therefore needed to improve the effectiveness and precision of segmentation.
Content of the invention
The invention provides a Chinese news story segmentation method based on flexible semantic similarity measurement. The invention can reasonably express the semantic similarity between words and can significantly improve the precision of Chinese news story segmentation, as described below:
A Chinese news story segmentation method based on flexible semantic similarity measurement, the method comprising the following steps:
(1) inputting a target collection and performing word segmentation on each news story script t_i in the collection;
(2) building a context relation graph;
(3) iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity.
The step of building the context relation graph is specifically:
1) reading in each news story script in turn and computing word-frequency statistics for the words it contains;
2) deleting high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, building an edge between the two words, the set of edges being E; if not, judging another pair of words, until all words in the set have been traversed;
5) representing the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
6) the context relation graph being expressed as g = <V, E, S_c>.
The weight sim_c(a, b) between words is specifically:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1.
The self-weight of a word is sim_c(a, a) = 1.
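As an illustrative sketch (not part of the patent), the co-occurrence weight sim_c(a, b) above can be computed from tokenized scripts as follows; the window size, the ε value and the romanized sample tokens are assumptions of the example.

```python
from collections import Counter

def context_weights(scripts, window=2, eps=1e-6):
    """Compute sim_c(a, b) = freq(a, b) / (freq_max + eps) for word pairs
    co-occurring within `window` positions in the same script."""
    freq = Counter()
    for tokens in scripts:
        for i, a in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if a != tokens[j]:
                    freq[frozenset((a, tokens[j]))] += 1
    freq_max = max(freq.values()) if freq else 1
    return {pair: n / (freq_max + eps) for pair, n in freq.items()}

# Toy script: "economy growth policy economy policy" (romanized tokens)
w = context_weights([["jingji", "zengzhang", "zhengce", "jingji", "zhengce"]])
```

Because of the ε in the denominator, even the most frequent pair gets a weight strictly below 1, as the definition requires.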
The step of iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain the flexible semantic relevance matrix is specifically:
1) defining the semantic similarity between words in the context relation graph as sim_s(a, b), satisfying the following three criteria:
the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1; sim_s(a, b) is positively correlated with sim_c(a, b); sim_s(a, b) is proportional to the similarity between the neighbors of a and b;
2) defining the iterative propagation of semantic similarity:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (C/Z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b);

where u~a and v~b mean that u and v are neighbor nodes of word a and word b in the context relation graph, Z is a normalization factor, C is a control factor, sim_s^(t)(a, b) is the semantic similarity of words a and b at iteration t, sim_s^(t-1)(a, b) is that at iteration t-1, and sim_s^(0)(a, b) is the initialization;
3) solving the relations defined in 2) with the SimRank algorithm to obtain the semantic relevance; the semantic relevance is computed for every pair of words, and these relevance values constitute the flexible semantic relevance matrix, denoted S_s.
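The three relations above describe a SimRank-style propagation (refs [10][11]). A minimal NumPy sketch, assuming a decay factor c and a fixed iteration count in place of the limit (illustrative, not the patent's GPU implementation):

```python
import numpy as np

def propagate_similarity(S_c, adjacency, c=0.8, iters=10):
    """Iterate sim_s(t)(a, b) = (c/Z) * sum over neighbor pairs (u, v)
    of sim_s(t-1)(u, v), with sim_s(a, a) pinned to 1 each round.
    Z = |N(a)| * |N(b)| is the normalization factor."""
    n = len(adjacency)
    S = np.array(S_c, dtype=float)
    for _ in range(iters):
        S_new = np.eye(n)  # sim_s(a, a) = 1
        for a in range(n):
            for b in range(a + 1, n):
                Na, Nb = adjacency[a], adjacency[b]
                if Na and Nb:
                    z = len(Na) * len(Nb)
                    s = sum(S[u, v] for u in Na for v in Nb)
                    S_new[a, b] = S_new[b, a] = c * s / z
        S = S_new
    return S

# Path graph 0-1-2: words 0 and 2 never co-occur but share neighbor 1.
S_c = np.eye(3)
S_c[0, 1] = S_c[1, 0] = 0.5
S_c[1, 2] = S_c[2, 1] = 0.5
S_s = propagate_similarity(S_c, [[1], [0, 2], [1]])
```

Here the rigid measure would leave sim(0, 2) at zero, while propagation assigns the two words a positive similarity through their shared neighbor.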
The step of defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix is specifically:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

where s_i and s_j are sentences, ||f_i|| and ||f_j|| are the two-norms of the two sentence word-frequency vectors, and T denotes transposition.
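A sketch of this sentence-level measure, assuming NumPy term-frequency vectors over a shared vocabulary; with S_s equal to the identity matrix it reduces to the ordinary cosine similarity:

```python
import numpy as np

def flexible_similarity(f_i, f_j, S_s):
    """sim(s_i, s_j | S_s) = f_i^T S_s f_j / (||f_i|| * ||f_j||)."""
    denom = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    return float(f_i @ S_s @ f_j) / denom if denom else 0.0

f1 = np.array([1.0, 1.0, 0.0])  # sentence using words 0 and 1
f2 = np.array([0.0, 0.0, 1.0])  # sentence using word 2 only
S = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.0],
              [0.6, 0.0, 1.0]])  # words 0 and 2 semantically related
cos = flexible_similarity(f1, f2, np.eye(3))  # rigid: no shared words
flex = flexible_similarity(f1, f2, S)         # flexible: related words count
```

The two sentences share no words, so the rigid (cosine) score is zero, while the flexible score is positive because S_s links words 0 and 2.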
The beneficial effect of the technical scheme provided by the invention is as follows: the invention proposes an unsupervised semantic similarity measure based on the SimRank algorithm, so that latent semantic relations between words can be incorporated to improve the traditional cosine similarity, and uses this flexible semantic similarity to improve Chinese news story segmentation. The proposed flexible measure can more reasonably express the semantic similarity between words and between word sets. Experiments show that, in Chinese news story segmentation under the same segmentation criteria, the flexible semantic similarity measure improves segmentation precision by 3%-10% compared with traditional similarity measures.
Brief description
Fig. 1 is a flow chart of Chinese news story segmentation based on flexible semantic similarity;
Fig. 2 is a schematic diagram of the context relation graph;
Fig. 3 compares the ratio of between-story to within-story sentence similarity on the standard data sets CCTV and TDT2;
Fig. 4 compares the results of three different similarity measures used by the Chinese news story segmentation algorithm over 100 groups of random parameters on the standard data set cctv-75-s.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Semantic similarity measurement is a highly challenging research topic in natural language processing. Existing methods fall into two classes: supervised and unsupervised. Supervised methods mainly include WordNet[8][9] and DISCO. WordNet is used to measure the similarity between any two English words. WordNet relies on annotated corpora and organizes nouns, verbs, adjectives and adverbs hierarchically, the division being based on linguists' semantic definitions of these words. Owing to its simplicity and effectiveness, WordNet has been widely applied in natural language processing tasks. Similar to WordNet, DISCO is another common supervised method for retrieving the similarity between any two given words. Compared with WordNet, DISCO supports more languages, for example English, German, French and Spanish. Supervised methods can be used directly on the predefined language space without any extra computation, and they cover almost all common words. However, supervised methods depend on the knowledge of linguists, their word-similarity measures are largely defined subjectively, and they are not suitable for applications based on specific corpora. Unsupervised methods mainly include PMI, LSA and PLSA. PMI counts, from web search engine statistics, the number of pages in which two words co-occur; the more co-occurrences, the higher the PMI score of the two words. LSA is also an unsupervised semantic similarity measure, which incorporates a mechanism of human knowledge learning to obtain the similarity between words or text fragments. The key step of LSA is dimensionality reduction by singular value decomposition. LSA can also handle the synonymy problem in natural language processing. PLSA is an improved LSA algorithm. Unlike the LSA algorithm, which comes from linear algebra, PLSA, while inheriting the advantages of LSA, analyzes the correlation between word pairs with probabilistic methods and handles synonymy and ambiguity well. Compared with LSA, PLSA is more general.
In recent years, the development of graph theory has attracted the attention of natural language processing researchers. Widdows et al. proposed an unsupervised graph-model method for obtaining semantic similarity, in which nodes represent words and edges represent relations between words. Moreover, this graph model is based on a specific corpus and can handle word ambiguity. Ambwani et al. proposed another graph model for measuring word semantic similarity, in which each word is represented as a series of nodes, each node corresponding to one sentence in the word's coverage, and edge weights represent the relevance between words. This model incorporates the mutual influence between words and determines the relevance between words according to their context. The unsupervised semantic-similarity methods discussed above are all based on specific corpora and are better suited to specific applications than supervised methods. Among these unsupervised methods, the simplicity and efficiency of graph models have drawn the attention of more and more natural language processing researchers to graph-based semantic similarity computation.
Measuring semantic similarity between word sets (such as paragraphs and texts) is also a problem demanding a prompt solution. The common measure of word-set semantic similarity is cosine similarity. Under the bag-of-words assumption, each word set is expressed as a word-frequency vector, and cosine similarity measures the angle between word-frequency vectors: the larger the angle, the smaller the similarity, and vice versa. Because cosine similarity is simple and effective, it is widely used to measure word-set semantic similarity; however, it only considers the relations between identical words and ignores the relevance between different words within the word sets, which makes the word-set similarity measure inaccurate. To make word-set similarity measurement more accurate and meaningful, the relevance between words should be taken into account when measuring similarity between word sets. A method that incorporates the relevance between words into word-set similarity measurement is therefore urgently needed.
To reasonably express the semantic similarity between words and significantly improve the precision of Chinese news story segmentation, the embodiment of the present invention provides a Chinese news story segmentation method based on flexible semantic similarity measurement. Referring to Fig. 1, both the semantic similarity computation and the news story segmentation in this method operate on a specific data set. Meanwhile, to demonstrate the reasonableness of the flexible semantic similarity measure, a verification criterion is designed to verify it, as described below:
101: input the target collection and perform word segmentation on each news story script t_i in the collection;
Through this step, each sentence in a news story script is split into words; this step is well known to those skilled in the art and is not elaborated here.
102: build the context relation graph;
1) read in the news story scripts in turn and compute word-frequency statistics for the words they contain;
2) delete high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) take the retained words as the nodes of the context relation graph, their set being V;
4) judge whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, build an edge between the two words, the set of edges being E; if not, judge another pair of words, until all words in the set have been traversed;
5) represent the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
The weight sim_c(a, b) between words is defined by the following formula:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1 (a ≠ b).
Meanwhile, the self-weight of a word is sim_c(a, a) = 1.
6) The context relation graph can therefore be expressed as g = <V, E, S_c>. Fig. 2 is a schematic diagram of the context relation graph of a document in the data set, where w_u denotes the u-th word in t_i and the lines denote relations between words.
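Steps 1)-6) above can be sketched as follows; the frequency thresholds and distance threshold are illustrative assumptions, since the patent does not fix their values.

```python
from itertools import combinations

def build_context_graph(scripts, min_freq=2, max_freq=100, dist_thresh=10):
    """Return the node set V and edge set E of the context relation graph
    g = <V, E, S_c> from tokenized news story scripts."""
    freq = {}
    for tokens in scripts:
        for w in tokens:
            freq[w] = freq.get(w, 0) + 1
    # Drop high- and low-frequency words; retained words are the nodes.
    V = {w for w, n in freq.items() if min_freq <= n <= max_freq}
    E = set()
    for tokens in scripts:
        positions = [(i, w) for i, w in enumerate(tokens) if w in V]
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and abs(i - j) <= dist_thresh:
                E.add(frozenset((a, b)))  # co-occur within the threshold
    return V, E

V, E = build_context_graph([["a", "b", "c", "a", "d"], ["a", "b", "a"]])
```

In the toy input, only "a" and "b" pass the frequency filter, and they co-occur within the distance threshold, giving a single edge.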
103: iteratively propagate the context relevance between words over the context relation graph with the SimRank algorithm[10][11] to obtain the flexible semantic relevance matrix;
1) define the semantic similarity between words in the context relation graph as sim_s(a, b), which satisfies the following three criteria:
the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1;
sim_s(a, b) is positively correlated with sim_c(a, b);
sim_s(a, b) is proportional to the similarity between the neighbors of a and b.
2) the iterative propagation of semantic similarity is defined by the following relations:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (C/Z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b)

where u~a and v~b mean that u and v are neighbor nodes of word a and word b in the context relation graph, Z is a normalization factor, C is a control factor, sim_s^(t)(a, b) is the semantic similarity of words a and b at iteration t, sim_s^(t-1)(a, b) is that at iteration t-1, and sim_s^(0)(a, b) is the initialization.
3) solve the relations defined in 2) with the SimRank algorithm to obtain the semantic relevance; the semantic relevance is computed for every pair of words, and these relevance values constitute the flexible semantic relevance matrix, denoted S_s. Similarly, the traditional rigid semantic similarity is defined as S_h = I, where I denotes the identity matrix.
The SimRank algorithm is based on the hypothesis that if the neighbor words of two words are more similar (relevance greater than or equal to 0.5), then the two words are also more similar;
The SimRank algorithm takes the context relation graph as input; its complexity is O(k|V|^2), where k is the average degree in g, |V| is the number of nodes in the context relation graph, and O denotes algorithmic complexity.
In this invention, the SimRank algorithm is implemented as a fully parallel algorithm on GPU. Experiments show that, with the same context relation graph as input, the GPU implementation of SimRank runs about 1000 times faster than the traditional CPU implementation.
The output of the SimRank algorithm is the flexible semantic relevance matrix after iterative propagation, S_s = {sim_s(a, b)}_{a,b∈V}.
104: define the flexible semantic similarity between sentences (a sentence being a set of consecutive words in a news story script) through the flexible semantic relevance matrix;
In news story segmentation, besides word semantic similarity, the similarity between word sets, i.e. sentences, also needs to be measured. Each sentence in a story can be represented as a word-frequency vector recording the number of occurrences of each word in the sentence. Given the flexible semantic relevance matrix, the flexible semantic similarity between sentences is defined as follows:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

where s_i and s_j are sentences, ||f_i|| and ||f_j|| are the two-norms of the two sentence word-frequency vectors, and T denotes transposition. This definition improves the traditional cosine similarity by taking the latent semantic relevance between different words into account, and can therefore more reasonably express the semantic similarity between sentences.
105: segment the Chinese news stories using the flexible semantic similarity.
1) The segmentation criterion used for Chinese news story segmentation is normalized cuts[5][6].
Normalized cuts is based on a graph model: sentences are taken as the nodes of the graph, relations between sentences as the edges, and the similarity between sentences as the edge weights; the news story segmentation problem is thereby converted into a graph partitioning problem.
2) The news story scripts in the input data set are segmented into Chinese news stories using the flexible semantic similarity between sentences.
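As a simplified illustration of the normalized-cuts criterion[5][6] (restricted to a single boundary, unlike the patent's full method), one can scan candidate boundaries and pick the one minimizing the normalized cut over the sentence-similarity matrix:

```python
import numpy as np

def best_boundary(W):
    """Return the boundary index b minimizing
    ncut(b) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),
    where A = sentences before b, B = sentences from b on,
    and W is the (flexible) sentence-similarity matrix."""
    n = W.shape[0]
    best_b, best_cost = None, float("inf")
    for b in range(1, n):
        A, B = np.arange(b), np.arange(b, n)
        cut = W[np.ix_(A, B)].sum()
        cost = cut / W[A].sum() + cut / W[B].sum()
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b

# Two 3-sentence stories: high similarity within, low across.
W = np.full((6, 6), 0.1)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
boundary = best_boundary(W)
```

With a block-structured similarity matrix, the minimum normalized cut falls exactly between the two stories.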
The feasibility of the Chinese news story segmentation method based on flexible semantic similarity measurement provided by the present invention is verified below with specific experiments:
Experiments on standard data sets:
To verify the validity of this method, it was tested on two standard data sets, CCTV and TDT2. The CCTV data set contains 71 Chinese news story scripts in total; according to story length and recognition error rate, it is divided into 8 subsets, denoted cctv_59_f/s, cctv_66_f/s, cctv_75_f/s and cctv_ref_f/s, where f denotes the long-story sets, s the short-story sets, and ref the reference set. The TDT2 data set contains 177 Chinese news story scripts; according to recognition error rate, the 177 scripts are divided into two subsets, denoted tdt2_ref and tdt2_rcg. The rigid semantic similarity S_h, the context semantic similarity S_c and the flexible semantic similarity S_s were each used to segment the news stories in the CCTV and TDT2 data sets, and their segmentation precision was compared, precision being reported as the F1 score. Table 1 lists the segmentation precision of the different similarity measures on the CCTV and TDT2 data sets.
Table 1
From Table 1 it can be observed that, compared with the traditional rigid semantic similarity, the flexible semantic similarity significantly improves segmentation precision, by about 3% to 10%. It is also found that the context semantic similarity outperforms the rigid semantic similarity, and that the SimRank algorithm further improves on the context semantic similarity. To show the robustness of this method, another stricter experiment was conducted on the cctv_75_s data set, comparing the segmentation precision of different methods over 100 groups of random parameters; Fig. 4 shows the result of this experiment. In news story segmentation, the quality of inter-sentence similarity can be measured by the ratio of between-story similarity to within-story similarity, which reflects the discriminability of sentences: the smaller the ratio, the better the similarity. The ratio is defined by the following formula:

r(C | S_s) = exp( mean_{lab(s_i)≠lab(s_j)} sim(s_i, s_j | S_s) ) / exp( mean_{lab(s_i)=lab(s_j)} sim(s_i, s_j | S_s) )

where lab(s_i) and lab(s_j) are the labels of the stories to which sentences s_i and s_j belong, and mean denotes averaging.
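The ratio r above can be sketched as follows, with an assumed sentence-similarity matrix and story labels:

```python
import numpy as np

def similarity_ratio(S, labels):
    """r(C | S_s) = exp(mean cross-story sim) / exp(mean within-story sim);
    diagonal (self-similarity) entries are excluded from the within mean."""
    labels = np.asarray(labels)
    diff = labels[:, None] != labels[None, :]
    same = ~diff & ~np.eye(len(labels), dtype=bool)
    return float(np.exp(S[diff].mean()) / np.exp(S[same].mean()))

# Two stories of three sentences each; sharper separation gives smaller r.
S = np.full((6, 6), 0.1)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
r = similarity_ratio(S, [0, 0, 0, 1, 1, 1])
```

A ratio below 1 indicates that sentences within a story are, on average, more similar to each other than to sentences of other stories.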
The rigid semantic similarity S_h, the context semantic similarity S_c and the flexible semantic similarity S_s were compared on the standard data sets; the comparison is shown in Fig. 3. Experiments show that the r ratio obtained with the flexible semantic similarity S_s is lower than with the other two similarities, and that the context semantic similarity S_c gives a lower ratio than the rigid semantic similarity S_h. This experiment shows that the flexible semantic similarity S_s is more reasonable than the traditional rigid semantic similarity S_h, and that the flexible semantic similarity solved by the SimRank algorithm (i.e. the semantic similarity after iterative propagation) is more reasonable still. Applying this method to Chinese news story segmentation significantly increases segmentation precision.
Bibliography:
[1] J. Allan, Ed., Topic Detection and Tracking: Event-based Information Organization, Kluwer Academic Publishers, 2002.
[2] L.-S. Lee and B. Chen, "Spoken document understanding and organization," vol. 22, no. 5, pp. 42-60, 2005.
[3] S. Banerjee and A. I. Rudnicky, "A TextTiling based approach to topic boundary detection in meetings," in INTERSPEECH, 2006.
[4] L. Xie, J. Zeng, and W. Feng, "Multi-scale TextTiling for automatic story segmentation in Chinese broadcast news," in AIRS, 2008.
[5] I. Malioutov and R. Barzilay, "Minimum cut model for spoken lecture segmentation," in ACL, 2006.
[6] J. Zhang, L. Xie, W. Feng, and Y. Zhang, "A subword normalized cut approach to automatic story segmentation of Chinese broadcast news," in AIRS, 2009.
[7] Z. Liu, L. Xie, and W. Feng, "Maximum lexical cohesion for fine-grained news story segmentation," in INTERSPEECH, 2010.
[8] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - measuring the relatedness of concepts," in AAAI (Intelligent Systems Demonstration), 2004.
[9] Christiane Fellbaum, Ed., WordNet: An Electronic Lexical Database, MIT Press, 1998.
[10] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in ACM SIGKDD, 2002.
[11] G. He, H. Feng, C. Li, and H. Chen, "Parallel SimRank computation on large graphs with iterative aggregation," in ACM SIGKDD, 2010.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merit of the embodiments.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A Chinese news story segmentation method based on flexible semantic similarity measurement, characterized in that the method comprises the following steps:
(1) inputting a target collection and performing word segmentation on each news story script t_i in the collection;
(2) building a context relation graph;
(3) iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity;
the step of building the context relation graph being specifically:
1) reading in each news story script in turn and computing word-frequency statistics for the words it contains;
2) deleting high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, building an edge between the two words, the set of edges being E; if not, judging another pair of words, until all words in the set have been traversed;
5) representing the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
6) the context relation graph being expressed as g = <V, E, S_c>.
2. The method according to claim 1, characterized in that the weight sim_c(a, b) between words is specifically:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1.
3. The method according to claim 1, characterized in that the self-weight of a word is sim_c(a, a) = 1.
4. method according to claim 1 is it is characterised in that described calculated by described context relation figure and quicksort Method the context dependence between word is iterated propagate the step obtaining flexible semantic dependency matrix particularly as follows:
1) defining the Semantic Similarity between context relation in figure word is sims(a, b), following three criterions of satisfaction:
Word is 1 with the similitude of itself, that is,
sims(a, a)=1;sims(a, b) and simc(a, b) positive correlation;simsSimilitude between (a, b) and their neighbours just becomes Than;
2) define the iterative diffusion process of the semantic similarity:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (c / z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b);

Wherein u~a and v~b denote that u and v are neighbour nodes of word a and word b in the context-relation graph respectively, z is the normalization factor, c is the control factor, sim_s^(t)(a, b) denotes the semantic similarity of word a and word b at the t-th iteration, sim_s^(t-1)(u, v) denotes the semantic similarity of the neighbours u and v at the (t-1)-th iteration, and sim_s^(0)(a, b) denotes the initialization;
3) use the quicksort algorithm to solve the relation defined in 2) and obtain the semantic correlation; the semantic correlation is computed for every pair of words, and these correlations constitute the flexible semantic-correlation matrix, denoted S_s.
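The diffusion recurrence above resembles a SimRank-style update and can be sketched with a direct fixed-point iteration (rather than the claimed quicksort-based solver, whose details the claim does not spell out); the normalization z = deg(a)·deg(b) and the damping value c are assumptions:

```python
import numpy as np

def diffuse_similarity(S_c, A, c=0.8, iters=20):
    """Sketch of claim 4's iterative diffusion.

    S_c : initial word-similarity matrix (claim 2 weights, diagonal = 1).
    A   : binary adjacency matrix of the context-relation graph.
    Implements sim^(t)(a, b) = (c/z) * sum_{u~a, v~b} sim^(t-1)(u, v),
    assuming z = deg(a) * deg(b), which the claim text does not fix.
    """
    deg = A.sum(axis=1)
    S = S_c.copy()
    for _ in range(iters):
        # Neighbour-pair sums for every (a, b) at once: A @ S @ A^T.
        num = A @ S @ A.T
        z = np.outer(deg, deg)
        S = c * np.divide(num, z, out=np.zeros_like(num), where=z > 0)
        np.fill_diagonal(S, 1.0)  # criterion 1: sim_s(a, a) = 1
    return S
```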
5. The method according to claim 1, characterized in that the described step of defining the flexible semantic similarity between sentences through the described flexible semantic-correlation matrix is specifically:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

Wherein s_i and s_j denote the two sentences respectively, f_i and f_j are their word-frequency vectors, ||f_i|| and ||f_j|| denote the two-norms of the two sentence word-frequency vectors respectively, and T denotes the transpose.
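A direct implementation of this sentence-similarity formula, assuming f_i and f_j are word-frequency vectors over the graph vocabulary and S_s is the matrix from claim 4:

```python
import numpy as np

def flexible_sentence_similarity(f_i, f_j, S_s):
    """Claim 5's flexible semantic similarity between two sentences:
    sim(s_i, s_j | S_s) = f_i^T S_s f_j / (||f_i|| * ||f_j||).
    With S_s = I this reduces to ordinary cosine similarity.
    """
    denom = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    if denom == 0:
        return 0.0  # an empty sentence has no defined similarity
    return float(f_i @ S_s @ f_j) / denom
```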
CN201410027012.3A 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement Active CN103793491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410027012.3A CN103793491B (en) 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement


Publications (2)

Publication Number Publication Date
CN103793491A CN103793491A (en) 2014-05-14
CN103793491B true CN103793491B (en) 2017-01-25

Family

ID=50669157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410027012.3A Active CN103793491B (en) 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement

Country Status (1)

Country Link
CN (1) CN103793491B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019023893A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting a sentence
CN110750617A (en) * 2018-07-06 2020-02-04 北京嘀嘀无限科技发展有限公司 Method and system for determining relevance between input text and interest points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"News story unit segmentation based on contextual information"; Ji Zhong; Journal of Tianjin University; 28 Feb. 2009; Vol. 42, No. 2; abstract, p. 153 col. 1 para. 1 - p. 158 col. 1 para. 2 *
"Research on a flexible similarity calculation method based on ontology concepts"; Zhang Ye; Computer Technology and Development; 30 Sep. 2012; Vol. 22, No. 9; abstract, p. 103 col. 1 para. 2 - p. 106 col. 1 para. 4 *

Also Published As

Publication number Publication date
CN103793491A (en) 2014-05-14

Similar Documents

Publication Publication Date Title
Liu et al. Mining quality phrases from massive text corpora
Gupta et al. Analyzing the dynamics of research by extracting key aspects of scientific papers
Christensen et al. An analysis of open information extraction based on semantic role labeling
CN103399901B (en) A kind of keyword abstraction method
Liu et al. Measuring similarity of academic articles with semantic profile and joint word embedding
CN105808525A (en) Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
Kim et al. Interpreting semantic relations in noun compounds via verb semantics
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
Jang et al. Metaphor detection in discourse
CN106528524A (en) Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
KR101396131B1 (en) Apparatus and method for measuring relation similarity based pattern
Paiva et al. Discovering semantic relations from unstructured data for ontology enrichment: Asssociation rules based approach
Momtaz et al. Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents.
CN103793491B (en) Chinese news story segmentation method based on flexible semantic similarity measurement
Celebi et al. Segmenting hashtags using automatically created training data
CN103455638A (en) Behavior knowledge extracting method and device combining reasoning and semi-automatic learning
Wan et al. Chinese shallow semantic parsing based on multilevel linguistic clues
Gayen et al. Automatic identification of Bengali noun-noun compounds using random forest
Nie et al. Measuring semantic similarity by contextual word connections in Chinese news story segmentation
Thilagavathi et al. Document clustering in forensic investigation by hybrid approach
Reshadat et al. Confidence measure estimation for open information extraction
Zolotarev Research and development of linguo-statistical methods for forming a portrait of a subject area
CN112328811A (en) Word spectrum clustering intelligent generation method based on same type of phrases
CN105808521A (en) Semantic feature based semantic relation mode acquisition method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210628

Address after: No.48, 1st floor, No.58, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088

Patentee after: BEIJING HONGBO ZHIWEI SCIENCE & TECHNOLOGY Co.,Ltd.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Patentee before: Tianjin University

TR01 Transfer of patent right