CN102789479A

CN102789479A - Vocabulary relevance calculating method on basis of semantic analysis of search result

Info

Publication number: CN102789479A
Application number: CN2012101884759A
Authority: CN
Inventors: 曾剑平; 段江娇
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2012-06-08
Filing date: 2012-06-08
Publication date: 2012-11-21

Abstract

The invention belongs to the technical field of computational linguistics and specifically relates to a method for calculating word relatedness based on semantic analysis of search results. Following a defined search strategy, the method automatically submits search commands to an Internet search engine, obtains the search results, and applies Web page information extraction and text semantic analysis to calculate the degree of word co-occurrence, from which the relatedness of the words is obtained. The invention avoids building and maintaining a knowledge base system locally, its relatedness results can reflect changes over time, and it also handles mixed terms that contain digits. The method is suitable for any application that requires the semantic relatedness of words.

Description

A Lexical Correlation Calculation Method Based on Semantic Analysis of Search Results

Technical Field

The invention belongs to the technical field of computational linguistics and specifically relates to a method for quantitatively calculating word relatedness.

Background

With the development of Web 2.0 applications, blogs, online forums, and social media continue to appear, and a large volume of text is produced every day, such as news reports, product descriptions, and product reviews. Whether the goal is to find hot topics in news reports or to analyze opinions in product reviews automatically, a key technical question arises: how to calculate the relatedness of two words effectively. Calculating word relatedness has therefore become a fundamental problem in processing online text. Several approaches currently exist. The first is a statistical approach based on a large-scale corpus: a set of feature words is chosen, each word is represented as a point in the vector space defined by those feature words, and relatedness is computed with a similarity measure such as the cosine of the included angle [6, 7]. The second approach uses a semantic dictionary: following the organizational structure of the dictionary, each word is expressed as a set of semantic vectors from which the relatedness of the given words is calculated; commonly used dictionaries include WordNet and HowNet [3, 5]. A third approach is based on LSA (latent semantic analysis), which maps words into a lower-dimensional semantic space and computes relatedness there; it relies on the SVD (singular value decomposition) of a matrix. More recently, as online encyclopedias such as Wikipedia have matured, using them as knowledge bases for relatedness calculation has also attracted attention; these methods explicitly represent a text or a word as a weighted vector in the Wikipedia concept space [1, 4].

Although these methods can calculate word relatedness well in some settings, practical applications require extensive computation to construct the feature space, the knowledge base must be maintained and updated, and the relatedness calculation depends heavily on the knowledge representation and the structure of the base. As a result, real-world performance is unsatisfactory. Specifically, the problems are as follows:

1. Constructing the semantic vectors of words. The elements of a semantic vector are words selected from a semantic dictionary or corpus, and they represent the set of words that are closely related to the word being evaluated. Feature analysis and calculation must be performed over a large body of text. For Chinese in particular, word segmentation is also required, and mixed terms such as "113米栏" (digits combined with Chinese characters) easily introduce errors into the vector calculation.

2. The need to build a huge knowledge base system. Statistical relatedness methods require building and maintaining a huge knowledge base system, which demands considerable additional processing for data storage and retrieval.

3. Adaptability to different knowledge base systems. The knowledge base is the foundation of the relatedness calculation, but current methods rely mainly on the English Wikipedia, and their concept feature extraction does not suit Chinese words. After switching to another knowledge base, the method for computing semantic vectors must be redefined, which limits the practical value of these methods to some extent.

It follows that, when calculating the semantic relatedness of words, it is necessary for practical use to take the real costs of building and maintaining a knowledge base into account and to improve the ability to handle different types of words.

Summary of the Invention

The purpose of the invention is to address the shortcomings of existing word relatedness calculation methods by proposing a method that combines search engine technology with text semantic analysis.

Following a defined search strategy, the proposed method automatically submits search commands to an Internet search engine, obtains the search results, and applies Web page information extraction and text semantic analysis to calculate the degree of word co-occurrence, from which the relatedness of the words is obtained.

The specific steps of the proposed word relatedness calculation method are as follows:

(1) Set the two words w1 and w2 whose relatedness is to be calculated, and the record-count threshold ξ;

(2) Depending on whether the words are Chinese or English, generate a search command in the format accepted by www.bing.com, specified as a site-restricted search whose scope is set to en.wikipedia.org or baike.baidu.com;

(3) Automatically establish an HTTP (Hypertext Transfer Protocol) network connection and send the search command to the Bing search system over that connection;

(4) Receive and process the returned search results, which are HTML (Hypertext Markup Language) text. When the records on one page have been processed, automatically process the records on the next page, until all records have been processed or a set number of records has been reached. Web information extraction [2] is used to obtain the search records on each page automatically, and word frequency information is counted from the summary text of each record;

(5) From the word frequency information, calculate the relatedness of the two words and report the result.

The flow of the invention is shown in Figure 1.

In the invention, step (2) uses the Bing search engine as the means of obtaining contextual information about the words, with en.wikipedia.org and baike.baidu.com serving as the English and Chinese knowledge bases, respectively.

In the invention, step (3) establishes an HTTP network connection, builds a search command in the required format, and sends it to the Bing search system over that connection.
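A minimal sketch of how the site-restricted search command of steps (2) and (3) could be assembled. The function name `build_search_url` and the `quote_terms` flag are hypothetical, and only the `q` and `first` parameters from the example URLs in this document are reproduced; whether the search engine still accepts this exact form is not assumed.

```python
from urllib.parse import urlencode


def build_search_url(w1, w2, site, first=1, quote_terms=False):
    """Build a site-restricted query URL in the style of the document's examples.

    site        -- search scope, e.g. "en.wikipedia.org" or "baike.baidu.com"
    first       -- 1-based index of the first record on the requested page
    quote_terms -- wrap each word in double quotes (as in the Chinese example)
    """
    if quote_terms:
        w1, w2 = f'"{w1}"', f'"{w2}"'
    query = f"site:{site} {w1} & {w2}"
    # urlencode percent-encodes the query as UTF-8 and turns spaces into "+".
    return "http://cn.bing.com/search?" + urlencode({"q": query, "first": first})


print(build_search_url("doctor", "nurse", "en.wikipedia.org"))
```

Requesting the next page then amounts to calling the same function with a larger `first` value.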

In the invention, step (4) extracts each record from the search result page, extracts the summary text of the record, and splits that text at the separator "…" to obtain several segments. Word frequency information is counted for each segment.
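The segment-level counting can be sketched as follows. This is an illustration under stated assumptions: the helper name `count_in_summary` is hypothetical, a word is treated as "appearing" when it is a case-sensitive substring of the segment, and both the single-character ellipsis "…" and three ASCII dots are accepted as separators.

```python
import re


def count_in_summary(summary, w1, w2):
    """Return the (T1, T2, TC) increments contributed by one summary text."""
    t1 = t2 = tc = 0
    # Split the summary into segments at "…" or "..." (assumed equivalent).
    for segment in re.split(r"…|\.\.\.", summary):
        has1 = w1 in segment
        has2 = w2 in segment
        t1 += has1            # w1 appears in this segment
        t2 += has2            # w2 appears in this segment
        tc += has1 and has2   # both appear in the same segment
    return t1, t2, tc


print(count_in_summary("A doctor and a nurse … the nurse left … weather", "doctor", "nurse"))
```

Because matching is by substring, no word segmentation is needed, which is how mixed terms containing digits can be counted directly.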

In the invention, step (4) decides whether to fetch more records according to whether the condition endRec < TotalRec holds and whether Trec is less than the set record-count threshold ξ. Here TotalRec is the total number of records in the search results, endRec is the record number of the last record on the current page, and Trec is the number of records already processed.
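The two conditions combine into a single test, sketched below. The function name `need_more_records` is hypothetical, and the sketch assumes that both conditions must hold for another page to be requested.

```python
def need_more_records(end_rec, total_rec, t_rec, xi):
    """True when another result page should be requested.

    end_rec   -- record number of the last record on the current page
    total_rec -- total number of records reported for the query
    t_rec     -- number of records processed so far
    xi        -- record-count threshold
    """
    return end_rec < total_rec and t_rec < xi


print(need_more_records(end_rec=10, total_rec=250, t_rec=10, xi=60))
```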

In the invention, step (5) calculates the relatedness of the two words w1 and w2 with the following formula:

R(w1, w2) = TC*2 / (T1+T2)

where T1 is the number of occurrences of w1, T2 is the number of occurrences of w2, and TC is the number of times the two occur together.
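The formula has the same form as the Dice coefficient, and because TC cannot exceed either T1 or T2, the result lies between 0 and 1. A minimal sketch follows; returning 0.0 when neither word was observed is an assumption, since the document does not define that case.

```python
def relatedness(t1, t2, tc):
    """R(w1, w2) = TC*2 / (T1+T2)."""
    if t1 + t2 == 0:
        # Assumption: treat "no occurrences at all" as zero relatedness.
        return 0.0
    return 2 * tc / (t1 + t2)


print(relatedness(t1=10, t2=10, tc=5))
```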

In the invention, a training set is constructed and the Pearson correlation coefficient between the computed relatedness values and the labeled values is calculated, in order to determine the record-count threshold needed in the calculation.

The invention has substantive features and represents notable progress: (1) By using a search engine, it avoids building a large knowledge base system locally. Existing Wikipedia-based relatedness methods require downloading the entire content of the site, which creates new problems of storage space and of system maintenance and updating. The invention relies on the word-matching capability of search engines, so it can obtain the information needed for the relatedness calculation without building and maintaining such a local repository. In addition, by changing the search scope, relatedness can easily be calculated against different knowledge bases, so the method works for both English and Chinese words. (2) Relatedness is obtained from a relatively simple co-occurrence calculation rather than from complex semantic analysis or semantic-vector construction, so the method can handle mixed terms that contain digits. For Chinese words, no word segmentation or similar preprocessing is required. (3) Because search engines automatically refresh their index of recently modified Web content, a relatedness method based on search results reflects how the relatedness of two words changes over time more effectively; that is, it is time-aware.

The invention uses search engine technology and simple semantic analysis to establish a word relatedness calculation method. It avoids building and maintaining a knowledge base system locally, its results can reflect changes over time, and it also handles mixed terms that contain digits. The method is suitable for any application that requires the semantic relatedness of words.

Brief Description of the Drawings

Fig. 1 is a flow chart of the invention.

Detailed Description

(1) Set the two words w1 and w2 whose relatedness is to be calculated, and the record-count threshold ξ.

(2) If the words are Chinese, first encode them in UTF-8. Specify the search scope: en.wikipedia.org or baike.baidu.com. From this information, generate a search command for the Bing search engine (www.bing.com).

(3) Establish an HTTP network connection and send the search command to the Bing search system over that connection.

(4) Initialize the variables: T1=0, T2=0, TC=0, Trec=0.

(5) Receive and process the results returned by the search engine by extracting content from the HTML text of the page. Use the regular expression "[0-9,]+ - [0-9,]+ 条结果\\(共 [0-9,]+ 条\\)" to extract the total number of search results and the range of record numbers on the current page, as reported on the page, and store them in the three variables TotalRec, beginRec, and endRec.

(6) Locate and extract each record on the current page using the delimiter "<li class=\"sa_wr\"><div class=\"sa_cc\">" that separates the search records, and extract the summary text of each record. Split the text at the separator "…" to obtain several segments, and count word frequency information for each segment.

If w1 appears in a segment, increase the count T1 of w1 by 1; if w2 appears in the segment, increase the count T2 of w2 by 1; if w1 and w2 both appear in the segment, increase the co-occurrence count TC by 1.

Increase the number of processed records Trec by 1.

(7) Check whether the condition Trec < ξ holds. If it does not hold, go to step (9).

(8) Check whether the condition endRec < TotalRec holds, that is, whether the last page has not yet been reached. If it holds, generate the search command for the next page, send it to the Bing search system, and repeat steps (5), (6), (7), and (8). Otherwise, display the message "Insufficient information; relatedness cannot be calculated." and end the process.

(9) Calculate the relatedness of the two words with the following formula:

R(w1, w2) = TC*2 / (T1+T2)
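Steps (4) through (9) can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: fetching and record extraction are left out, so `run` takes pages that have already been parsed into `(end_rec, total_rec, summaries)` tuples (a hypothetical structure); the capture groups in `COUNT_RE` are added to the pattern quoted in step (5); and the threshold test of step (7) is applied after each complete page.

```python
import re

# Pattern from step (5), with capture groups added for the three numbers.
COUNT_RE = re.compile(r"([0-9,]+) - ([0-9,]+) 条结果\(共 ([0-9,]+) 条\)")


def parse_counts(html):
    """Return (beginRec, endRec, TotalRec), or None if the header is absent."""
    match = COUNT_RE.search(html)
    if match is None:
        return None
    begin_rec, end_rec, total_rec = (int(g.replace(",", "")) for g in match.groups())
    return begin_rec, end_rec, total_rec


def run(pages, w1, w2, xi):
    """Return R(w1, w2), or None when results run out before xi records."""
    t1 = t2 = tc = t_rec = 0                        # step (4)
    for end_rec, total_rec, summaries in pages:     # step (5), pre-parsed
        for summary in summaries:                   # step (6)
            for segment in summary.split("…"):
                has1, has2 = w1 in segment, w2 in segment
                t1 += has1
                t2 += has2
                tc += has1 and has2
            t_rec += 1
        if t_rec >= xi:                             # step (7): enough records
            return 2 * tc / (t1 + t2) if t1 + t2 else 0.0   # step (9)
        if end_rec >= total_rec:                    # step (8): last page reached
            return None                             # insufficient information
    return None


pages = [
    (2, 4, ["doctor and nurse … nurse", "doctor"]),
    (4, 4, ["nurse … doctor nurse", "x"]),
]
print(parse_counts("1 - 10 条结果(共 1,230 条)"))
print(run(pages, "doctor", "nurse", xi=4))
```

Keeping the network and HTML extraction outside `run` makes the counting and stopping logic easy to check on fixed input.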

Method for setting the threshold ξ: First determine a word set containing a number of word pairs together with their labeled relatedness values X. For these word pairs, select different search scopes and adjust the value of ξ to obtain the computed relatedness values Y, then calculate the Pearson correlation coefficient r between X and Y:

r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² · Σ(y_i − ȳ)² ), with each sum taken over i = 1, ..., n

where x_i and y_i are the i-th elements of X and Y, x̄ and ȳ are their means, and n is the number of elements in each set. The value of r lies in [-1, +1]. When r reaches a reasonable range (generally r greater than 0.4), the chosen parameter ξ is considered acceptable.
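A minimal sketch of the Pearson coefficient used to judge a candidate ξ, written with the standard library only. The function name `pearson` and the sample values are illustrative; the 0.4 cut-off is the one stated in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


labeled = [0.9, 0.2, 0.6, 0.4]      # X: illustrative human labels
computed = [0.8, 0.1, 0.7, 0.3]     # Y: illustrative outputs for one xi
r = pearson(labeled, computed)
print(r, r > 0.4)
```

Repeating this for several candidate values of ξ and keeping one for which r exceeds 0.4 follows the selection procedure described above.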

As the implementation above shows, the invention introduces simple semantic processing of search engine results, which provides the information needed for the relatedness calculation without building and maintaining a comparable local repository. A calculation based on search engine results reflects how the relatedness of two words changes over time more effectively, and it can handle mixed terms that contain digits. Determining the maximum number of records to retrieve by means of the Pearson correlation calculation allows the relatedness to be computed more reasonably under the chosen threshold.

Concrete examples:

Suppose the two words are "doctor" and "nurse" and the English Wikipedia, en.wikipedia.org, is the search scope. The following initial search command is generated automatically, and the search is performed over the HTTP connection.

http://cn.bing.com/search?q=site%3aen.wikipedia.org+doctor+%26+nurse&qs=n&pq=site%3aen.wikipedia.org+doctor+%26+nurse&sc=0-0&sp=-1&sk=&first=1

By automatically incrementing the trailing first=1 parameter of the command to page through the results, a relatedness of 0.6613 is obtained at 60 records.

As another example, suppose the two words are "计算机" and "电脑" (both meaning "computer") and Baidu Baike, baike.baidu.com, is the search scope. The following initial search command is generated automatically, and the search is performed over the HTTP connection.

http://cn.bing.com/search?q=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&qs=n&pq=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&sc=0-0&sp=-1&sk=&first=1

By automatically incrementing the trailing first=1 parameter of the command, a relatedness of 0.9111 is obtained after retrieving 140 records.
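The percent-encoded query in the second example can be decoded to confirm what is being searched. The sketch below uses only the standard library; the helper name `query_of` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

EXAMPLE_URL = (
    "http://cn.bing.com/search?q=site%3abaike.baidu.com+"
    "%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22"
    "&qs=n&pq=site%3abaike.baidu.com+"
    "%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22"
    "&sc=0-0&sp=-1&sk=&first=1"
)


def query_of(url):
    """Return the decoded value of the q parameter of a search URL."""
    return parse_qs(urlsplit(url).query)["q"][0]


print(query_of(EXAMPLE_URL))
```

The decoded query is a site restriction followed by the two quoted words joined by "&", with the Chinese words carried as UTF-8 bytes, consistent with step (2) of the detailed description.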

References:

[1] E. Gabrilovich, S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

[2] X. W. Ji, J. P. Zeng, S. Y. Zhang, C. R. Wu. Tag Tree Template for Web Information and Schema Extraction. Expert Systems With Applications, 2010, 37(12): 8492-8498.

[3] A. Budanitsky, G. Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 2006, 32(1): 13-47.

[4] M. Strube, S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In AAAI'06, 2006.

[5] Jiang Min, Xiao Shibin, Wang Hongwei, Shi Shuicai. An Improved Word Semantic Similarity Calculation Based on HowNet. Journal of Chinese Information Processing, 2008, No. 5.

[6] Lu Song. Unsupervised Acquisition of Word Relatedness Knowledge in Natural Language and Construction of a Balanced Classifier. Doctoral thesis, Institute of Computing Technology, Chinese Academy of Sciences, 2001.

[7] Liu Qun, Li Sujian. Word Semantic Similarity Calculation Based on HowNet. Computational Linguistics and Chinese Language Processing, 2002, 7(2): 59-76.

Claims

1. A word relatedness calculation method based on semantic analysis of search results, characterized in that the specific steps are as follows:

(1) setting the two words w1 and w2 whose relatedness is to be calculated, and a record-count threshold ξ;

(2) depending on whether the words are Chinese or English, generating a search command in the format accepted by www.bing.com, specified as a site-restricted search whose scope is set to en.wikipedia.org or baike.baidu.com;

(3) automatically establishing an HTTP network connection and sending the search command to the Bing search system over that connection;

(4) receiving and processing the returned search results, which are HTML text; when the records on one page have been processed, automatically processing the records on the next page, until all records have been processed or a set number of records has been reached; using Web information extraction to obtain the search records on each page automatically, and counting word frequency information from the summary text of each record;

(5) from the word frequency information, calculating the relatedness of the two words and reporting the result.

2. The word relatedness calculation method based on semantic analysis of search results as claimed in claim 1, characterized in that the word frequency information in step (4) is counted as follows: each record is extracted from the search result page; the summary text of the record is extracted and split at the separator "…" to obtain several segments; and word frequency information is counted for each segment.

3. The word relatedness calculation method based on semantic analysis of search results as claimed in claim 1, characterized in that, in step (4), whether to fetch more records is decided according to whether the condition endRec < TotalRec holds and whether Trec is less than the set record-count threshold ξ; wherein TotalRec is the total number of records in the search results, endRec is the record number of the last record on the current page, and Trec is the number of records already processed.

4. The word relatedness calculation method based on semantic analysis of search results as claimed in claim 1, characterized in that, in step (5), the relatedness of the two words w1 and w2 is calculated with the following formula:

R(w1, w2) = TC*2 / (T1+T2)

wherein T1 is the number of occurrences of w1, T2 is the number of occurrences of w2, and TC is the number of times the two occur together.

5. The word relatedness calculation method based on semantic analysis of search results as claimed in claim 1, characterized in that a training set is constructed and the Pearson correlation coefficient between the computed relatedness values and the labeled values is calculated, thereby determining the record-count threshold ξ needed in the calculation.