CN102789479A - Vocabulary relevance calculating method on basis of semantic analysis of search result - Google Patents
Vocabulary relevance calculating method on basis of semantic analysis of search result
- Publication number: CN102789479A
- Application number: CN2012101884759A
- Authority: CN (China)
- Prior art keywords: vocabulary, search, semantic analysis, correlation, record
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of computational linguistics and specifically relates to a method for calculating vocabulary relevance based on semantic analysis of search results. Following a defined retrieval strategy, the method automatically submits search commands to an Internet search engine, obtains the search results, and applies Web page information extraction and text semantic analysis to calculate the degree of word co-occurrence, from which the relevance of the words is obtained. The invention avoids building and maintaining a knowledge base system locally, its relevance results can reflect changes over time, and it also handles mixed-type words containing digits well. The method is suitable for any application that requires the semantic relevance of words.
Description
Technical field
The invention belongs to the technical field of computational linguistics, and in particular relates to a method for the quantitative calculation of vocabulary relevance.
Background
With the development of Web 2.0 applications and the continuing emergence of blogs, online forums, and interactive social media, a large amount of text content is generated every day, such as news reports, product introductions, and product reviews. Whether the goal is to discover hot topics in news reports or to analyze opinions in product reviews automatically, a key technical question arises: how to effectively calculate the relevance of two words. Calculating vocabulary relevance has therefore become a key foundational problem in processing online text. Several methods currently exist. The first is a statistical method based on a large-scale corpus: a set of feature words is chosen, each word is represented as a point in the vector space defined by those feature words, and relevance is computed with a similarity measure such as the cosine of the angle between vectors [6, 7]. The second uses a semantic dictionary: following the dictionary's organizational structure, words are expressed as sets of semantic vectors from which the relevance of given words is calculated; commonly used dictionaries include WordNet and HowNet [3, 5]. A third method, based on LSA (latent semantic analysis), maps words into a lower-dimensional semantic space and computes relevance there; it relies on the SVD (singular value decomposition) of a matrix. More recently, as online encyclopedias such as Wikipedia have matured, using such knowledge bases for relevance calculation has also attracted attention; these methods explicitly represent a text or word as a weighted vector in the space of Wikipedia concepts [1, 4].
Although these methods can calculate vocabulary relevance reasonably well in some settings, in practice they require extensive computation to construct the feature space and require the knowledge base to be maintained and updated. Their relevance calculations also depend heavily on the knowledge representation and the structure of the knowledge base, so real-world results are unsatisfactory. Specifically, the problems are as follows:
1. Constructing the semantic vectors of words. The elements of a semantic vector are words selected from a semantic dictionary or corpus, representing the set of words closely related to the word under consideration. Feature analysis and calculation must be carried out over a large body of text, and Chinese applications additionally require word segmentation; mixed-type words such as “113米栏” (literally "113-meter hurdles", a term combining digits and Chinese characters) easily introduce errors into the vector calculation.
2. The need to build a huge knowledge base system. Statistical relevance methods require a huge knowledge base system to be built and maintained, with considerable extra processing for data storage and retrieval.
3. Adaptability to different knowledge base systems. The knowledge base is the foundation of the relevance calculation, but current methods rely mainly on the English Wikipedia, and their concept feature extraction does not suit Chinese words. After switching to another knowledge base, the calculation of semantic vectors must be redefined, which limits the practical value of these methods.
It follows that, when calculating the semantic relevance of words, it is essential for practical use to take into account the real problems of building and maintaining a knowledge base and to improve the ability to handle different types of words.
Summary of the invention
The purpose of the present invention is to address the shortcomings of existing vocabulary relevance calculation methods by proposing a method that combines search engine technology with text semantic analysis.
The proposed method follows a defined retrieval strategy: it automatically submits search commands to an Internet search engine, obtains the search results, and applies Web page information extraction and text semantic analysis to calculate the degree of word co-occurrence, from which the relevance of the words is obtained.
The specific steps of the proposed method are as follows:
(1) Set the two words w1, w2 whose relevance is to be calculated, and the record-count threshold ξ;
(2) Depending on whether the words are Chinese or English, generate a search command in the format accepted by www.bing.com, restricted to a single site: en.wikipedia.org or baike.baidu.com;
(3) Automatically establish an HTTP (Hypertext Transfer Protocol) network connection and send the search command to the Bing search system over it;
(4) Receive and process the returned search results, that is, the HTML (Hypertext Markup Language) text. When the records on one page have been processed, automatically process the records on the next page, until all records have been processed or a set number of records has been reached. Web information extraction [2] is used to obtain the search records on each page automatically, and word frequency information is counted from the summary text of each record;
(5) From the word frequency counts, calculate the relevance of the two words and display the related information.
The flow of the invention is shown in Figure 1.
In step (2), the Bing search engine serves as the means of obtaining the context of the words, with en.wikipedia.org or baike.baidu.com as the English or Chinese knowledge base, respectively.
In step (3), an HTTP network connection is established, a search command in the required format is constructed, and the command is sent to the Bing search system over the connection.
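As a rough illustration of the search command described here, the Python sketch below assembles a site-restricted query URL shaped like the ones in the concrete examples of this document. The function name `build_search_url` and its parameters are hypothetical. The extra parameters seen in those examples (`qs`, `pq`, `sc`, `sp`, `sk`) are left out on the assumption that they are optional, and nothing here guarantees that the search engine still accepts this URL format.

```python
from urllib.parse import urlencode

# Host taken from the example URLs in this document.
BING_SEARCH = "http://cn.bing.com/search"


def build_search_url(w1, w2, site, first=1, quote_terms=False):
    """Return a site-restricted search URL for a word pair (hypothetical helper)."""
    if quote_terms:
        # The Chinese example in this document wraps each word in double quotes.
        w1, w2 = f'"{w1}"', f'"{w2}"'
    query = f"site:{site} {w1} & {w2}"
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    return BING_SEARCH + "?" + urlencode({"q": query, "first": first})


print(build_search_url("doctor", "nurse", "en.wikipedia.org"))
print(build_search_url("计算机", "电脑", "baike.baidu.com", quote_terms=True))
```

Python emits the percent-escapes with uppercase hex digits, whereas the example URLs in this document use lowercase; the two forms are equivalent. Passing a larger `first` value requests a later page of results.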
In step (4), each record on the search result page is extracted along with its summary text, and the summary is split at the separator “…” into several segments. Word frequency counts are taken over each segment.
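The segment-level counting can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `count_segment_hits` is hypothetical, and testing for occurrence by case-insensitive substring matching is an assumption, since the text does not say how a word is detected inside a segment.

```python
def count_segment_hits(summaries, w1, w2):
    """Count segments containing w1 (T1), w2 (T2), and both (TC).

    `summaries` is a list of summary strings, one per search record.
    Assumption: a word "appears" if it is a case-insensitive substring.
    """
    w1, w2 = w1.casefold(), w2.casefold()
    t1 = t2 = tc = 0
    for summary in summaries:
        # The separator is the single ellipsis character U+2026.
        for segment in summary.split("…"):
            text = segment.casefold()
            has1, has2 = w1 in text, w2 in text
            t1 += int(has1)
            t2 += int(has2)
            tc += int(has1 and has2)
    return t1, t2, tc


summaries = [
    "A doctor and a nurse … the nurse left",
    "Doctor who … nothing here",
]
print(count_segment_hits(summaries, "doctor", "nurse"))  # (2, 2, 1)
```

Note that substring matching also counts inflected forms such as "nurses"; a stricter tokenizer could be substituted if that is not wanted.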
In step (4), whether to fetch more records is decided by two conditions: endRec < TotalRec, and Trec less than the record-count threshold ξ. Here TotalRec is the total number of records in the search results, endRec is the number of the last record on the current page, and Trec is the number of records already processed.
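Expressed as code, the decision is a conjunction of the two conditions. The helper name `need_more_records` below is hypothetical; the parameter names mirror the variables in the text.

```python
def need_more_records(end_rec, total_rec, t_rec, xi):
    """Return True when another result page should be fetched.

    Both conditions must hold: records remain beyond the current page
    (end_rec < total_rec) and fewer than xi records have been processed.
    """
    return end_rec < total_rec and t_rec < xi


print(need_more_records(end_rec=10, total_rec=1230, t_rec=10, xi=60))  # True
print(need_more_records(end_rec=60, total_rec=1230, t_rec=60, xi=60))  # False
```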
In step (5), the relevance of the two words w1, w2 is calculated by the following formula:
R(w1, w2) = TC*2 / (T1+T2)
where T1 is the number of occurrences of w1, T2 is the number of occurrences of w2, and TC is the number of times the two occur together.
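A short Python sketch of this formula follows. It has the same shape as the Dice coefficient, so the result lies in [0, 1] whenever TC does not exceed either count. The function name `relevance` is hypothetical, and returning 0.0 when neither word was seen is an assumption; the text does not define that case.

```python
def relevance(t1, t2, tc):
    """R(w1, w2) = TC*2 / (T1+T2), using the counts defined in the text."""
    if t1 + t2 == 0:
        # Assumption: with no occurrences at all, treat the words as unrelated.
        return 0.0
    return 2 * tc / (t1 + t2)


print(relevance(2, 2, 1))  # 0.5
print(relevance(3, 3, 3))  # 1.0
```

For instance, if w1 appears in 2 segments, w2 in 2 segments, and both appear together in 1 segment, then R = 2*1 / (2+2) = 0.5.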
A training set is constructed, and the Pearson correlation coefficient between the calculated relevance results and the human-labeled results is computed in order to determine the record-count threshold used in the calculation.
The invention has substantive features and represents notable progress: (1) Using a search engine avoids building a large knowledge base system locally. Existing Wikipedia-based relevance methods require the entire content of the site to be downloaded, which raises new problems of storage and of system maintenance and updating. By relying on the search engine's word-matching capability, the invention obtains the information needed for relevance calculation without building and maintaining a comparable local repository. Moreover, changing the search scope makes it easy to calculate relevance against different knowledge base systems, so the method suits both English and Chinese words. (2) No complex semantic analysis or construction of semantic vectors is required; relevance is obtained from a relatively simple co-occurrence calculation, so the method applies to mixed-type words such as those containing digits. For Chinese words, no word segmentation is needed. (3) Because search engines automatically refresh recently modified site content, a relevance calculation based on search results reflects the relevance of two words over time more effectively; the method is time-aware.
The invention uses search engine technology and simple semantic analysis to calculate vocabulary relevance. It avoids building and maintaining a knowledge base system locally, its results can reflect changes over time, and it also handles mixed-type words containing digits well. The method is suitable for any application that requires the semantic relevance of words.
Brief description of the drawings
Figure 1 is a flow chart of the invention.
Detailed description
(1) Set the two words w1, w2 whose relevance is to be calculated, and the record-count threshold ξ.
(2) If the words are Chinese, first encode them as UTF-8. Specify the search scope: en.wikipedia.org or baike.baidu.com. From this information, generate a search command for the Bing search engine (www.bing.com).
(3) Establish an HTTP network connection and send the search command to the Bing search system over it.
(4) Initialize the variables: T1=0, T2=0, TC=0, Trec=0.
(5) Receive and process the results returned by the search engine, extracting content from the HTML text of the page. The regular expression "[0-9,]+ - [0-9,]+ 条结果\\(共 [0-9,]+ 条\\)" extracts the total number of records and the range of record numbers on the current page, as displayed on the page; these are stored in the three variables TotalRec, beginRec, and endRec.
(6) Using the delimiter “<li class=\"sa_wr\"><div class=\"sa_cc\">” that separates search records, locate and extract each record on the current page and its summary text. Split the summary at the separator “…” into several segments, and take word frequency counts over each segment.
If w1 appears in a segment, increase its count T1 by 1; if w2 appears in a segment, increase its count T2 by 1; if w1 and w2 both appear in the same segment, increase the co-occurrence count TC by 1.
Increase the number of processed records, Trec, by 1.
(7) Test whether Trec < ξ holds. If it does not, go to step (9).
(8) Test whether endRec < TotalRec holds, that is, whether pages remain after the current one. If it holds, generate the search command for the next page, send it to the Bing search system, and repeat steps (5), (6), (7), and (8). Otherwise, display “Insufficient information; relevance cannot be calculated.” and end the process.
(9) Calculate the relevance of the two words by the following formula:
R(w1, w2) = TC*2 / (T1+T2)
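Steps (5) and (6) above can be sketched in Python as follows. The regular expression and the record delimiter are the ones quoted in those steps, with capture groups added so the three numbers can be read out. The function names `parse_counts` and `split_records` are hypothetical, and the page markup dates from the time of filing, so the sketch should not be expected to match current result pages.

```python
import re

# The regular expression from step (5), with capture groups added.
COUNT_RE = re.compile(r"([0-9,]+) - ([0-9,]+) 条结果\(共 ([0-9,]+) 条\)")
# The delimiter from step (6) that separates search records.
RECORD_DELIM = '<li class="sa_wr"><div class="sa_cc">'


def parse_counts(html):
    """Return (beginRec, endRec, TotalRec) as integers, or None if absent."""
    match = COUNT_RE.search(html)
    if match is None:
        return None
    # Numbers on the page may contain thousands separators such as "1,230".
    begin_rec, end_rec, total_rec = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return begin_rec, end_rec, total_rec


def split_records(html):
    """Split a result page into raw record chunks, one per search record."""
    # Everything before the first delimiter is page header, so it is dropped.
    return html.split(RECORD_DELIM)[1:]


page = (
    "<div>1 - 10 条结果(共 1,230 条)</div><ul>"
    + RECORD_DELIM + "first summary</div></li>"
    + RECORD_DELIM + "second summary</div></li></ul>"
)
print(parse_counts(page))        # (1, 10, 1230)
print(len(split_records(page)))  # 2
```

Extracting the summary text from each raw chunk is left out, because the text does not specify which markup surrounds the summary.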
Setting the threshold ξ: first determine a vocabulary set containing a number of word pairs together with their labeled relevance values X. For these word pairs, select different search scopes and adjust the value of ξ to obtain the calculated relevance results Y, then compute the Pearson correlation coefficient r between X and Y:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where n is the number of elements in each set, x_i and y_i are the i-th elements of X and Y, and \bar{x} and \bar{y} are their means. The value of r lies in [-1, +1]. When r reaches a reasonable range (generally r greater than 0.4), the chosen parameter ξ is considered acceptable.
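The threshold selection can be illustrated with a small Python sketch that computes the Pearson correlation coefficient and keeps the candidate thresholds whose r exceeds 0.4. The names `pearson_r` and `acceptable_thresholds` are hypothetical, as is the `scores_by_xi` mapping from each candidate ξ to the relevance values calculated with it; the numbers in the usage example are made up for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def acceptable_thresholds(labels, scores_by_xi, min_r=0.4):
    """Return the candidate thresholds whose scores correlate with the labels."""
    return [
        xi
        for xi, scores in sorted(scores_by_xi.items())
        if pearson_r(labels, scores) > min_r
    ]


labels = [0.9, 0.1, 0.5]                               # labeled relevance X
scores_by_xi = {20: [0.8, 0.2, 0.5], 40: [0.1, 0.9, 0.5]}  # calculated Y per xi
print(acceptable_thresholds(labels, scores_by_xi))  # [20]
```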
As the implementation above shows, the invention introduces simple semantic processing of search engine results, obtaining the information needed for relevance calculation without building and maintaining a comparable local repository. A calculation based on search engine results reflects the relevance of two words over time more effectively and applies to mixed-type words containing digits. Determining the maximum number of records to retrieve by means of the Pearson correlation coefficient allows relevance to be calculated more soundly against the set threshold.
Concrete examples:
Suppose the two words are “doctor” and “nurse” and the English Wikipedia, en.wikipedia.org, is the search scope. The following initial search command is generated automatically and submitted over the HTTP connection.
http://cn.bing.com/search?q=site%3aen.wikipedia.org+doctor+%26+nurse&qs=n&pq=site%3aen.wikipedia.org+doctor+%26+nurse&sc=0-0&sp=-1&sk=&first=1
The final parameter first=1 in the command is then incremented automatically; at 60 records, the relevance obtained is 0.6613.
As another example, suppose the two words are “计算机” and “电脑” (both meaning "computer") and Baidu Baike, baike.baidu.com, is the search scope. The following initial search command is generated automatically and submitted over the HTTP connection.
http://cn.bing.com/search?q=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&qs=n&pq=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&sc=0-0&sp=-1&sk=&first=1
The final parameter first=1 in the command is again incremented automatically; after 140 records have been retrieved, the relevance obtained is 0.9111.
References:
[1] E. Gabrilovich, S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[2] X. W. Ji, J. P. Zeng, S. Y. Zhang, C. R. Wu. Tag Tree Template for Web Information and Schema Extraction. Expert Systems With Applications, 2010, 37(12): 8492-8498.
[3] A. Budanitsky, G. Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 2006, 32(1): 13-47.
[4] M. Strube, S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In AAAI'06, 2006.
[5] Jiang Min, Xiao Shibin, Wang Hongwei, Shi Shuicai. An Improved Word Semantic Similarity Calculation Based on HowNet. Journal of Chinese Information Processing, 2008, No. 5.
[6] Lu Song. Unsupervised Acquisition of Word Relatedness Knowledge in Natural Language and Construction of a Balanced Classifier. Doctoral dissertation, Institute of Computing Technology, Chinese Academy of Sciences, 2001.
[7] Liu Qun, Li Sujian. Word Semantic Similarity Calculation Based on HowNet [J]. Computational Linguistics and Chinese Language Processing, 2002, 7(2): 59-76.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012101884759A CN102789479A (en) | 2012-06-08 | 2012-06-08 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102789479A true CN102789479A (en) | 2012-11-21 |
Family
ID=47154882
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012101884759A Pending CN102789479A (en) | 2012-06-08 | 2012-06-08 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102789479A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104317846A (en) * | 2014-10-13 | 2015-01-28 | 安徽华贞信息科技有限公司 | Semantic analysis and marking method and system |
CN105335505A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information searching method based on natural language |
CN105335504A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information retrieval method based on natural language |
CN109299292A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | A Text Retrieval Method Based on Matrix-Weighted Association Rules Mixed Expansion of Context and Context |
CN109299278A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | A Text Retrieval Method for Mining Rule Antecedents Based on Confidence-Correlation Coefficient Framework |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184233A (en) * | 2011-05-12 | 2011-09-14 | 西北工业大学 | Query-result-based semantic correlation degree computing method |
Non-Patent Citations (2)
Title |
---|
史天艺: ""基于维基百科的搜索引擎检索结果聚类"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》, 15 December 2011 (2011-12-15), pages 138 - 2061 * |
沙芸等: ""基于词间语义相关度的搜索结果聚类算法"", 《郑州大学学报(理学版)》, vol. 41, no. 1, 31 March 2009 (2009-03-31), pages 73 - 76 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104317846A (en) * | 2014-10-13 | 2015-01-28 | 安徽华贞信息科技有限公司 | Semantic analysis and marking method and system |
CN105335505A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information searching method based on natural language |
CN105335504A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information retrieval method based on natural language |
CN109299292A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | A Text Retrieval Method Based on Matrix-Weighted Association Rules Mixed Expansion of Context and Context |
CN109299278A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | A Text Retrieval Method for Mining Rule Antecedents Based on Confidence-Correlation Coefficient Framework |
CN109299292B (en) * | 2018-11-26 | 2022-02-15 | 广西财经学院 | A Text Retrieval Method Based on Matrix-Weighted Association Rules Mixed Expansion of Context and Context |
CN109299278B (en) * | 2018-11-26 | 2022-02-15 | 广西财经学院 | Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121121 |