WO2016127458A1 - Improved semantic dictionary-based word similarity calculation method and apparatus - Google Patents

Improved semantic dictionary-based word similarity calculation method and apparatus

Info

Publication number
WO2016127458A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
phrase
similarity value
words
extended
Prior art date
Application number
PCT/CN2015/073841
Other languages
English (en)
French (fr)
Inventor
张贯京
陈兴明
葛新科
张少鹏
方静芳
高伟明
梁艳妮
周荣
梁昊原
周亮
Original Assignee
深圳市前海安测信息技术有限公司
深圳市易特科信息技术有限公司
深圳市贝沃德克生物技术研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市前海安测信息技术有限公司, 深圳市易特科信息技术有限公司, 深圳市贝沃德克生物技术研究院有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools

Definitions

  • The invention relates to the field of natural language processing in computer science, and in particular to an improved semantic dictionary-based word similarity calculation method.
  • Word similarity calculation is widely used in natural speech processing, intelligent retrieval, text clustering, text classification, automatic question answering, word sense disambiguation, and machine translation.
  • At present, one approach to word similarity calculation used in China and abroad is calculation based on a semantic dictionary.
  • For Chinese, commonly used semantic dictionaries include HowNet (知网), the synonym thesaurus 《同义词词林》, and the Chinese Concept Dictionary.
  • The main object of the present invention is to provide an improved semantic dictionary-based word similarity calculation method that improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.
  • To this end, the present invention provides an improved semantic dictionary-based word similarity calculation method comprising the following steps:
  • S10: obtain word A and word B to be compared;
  • S20: when word A and word B are both present in the semantic dictionary, calculate the similarity value of word A and word B; otherwise perform step S30;
  • S30: establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary; calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; take the maximum similarity value as the similarity value of word A and word B.
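  • Preferably, step S30 is specifically:
  • when word A is present in the semantic dictionary and word B is not, the extended phrase b[N] of word B is established, the similarity value between word A and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
  • when word A is not present in the semantic dictionary and word B is, the extended phrase a[M] of word A is established, the similarity value between each word in a[M] and word B is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
  • when neither word A nor word B is present in the semantic dictionary, the extended phrase a[M] of word A and the extended phrase b[N] of word B are both established, the similarity value between each word in a[M] and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B.
  • Preferably, the improved semantic dictionary-based word similarity calculation method further comprises the following step:
  • S40: when the words in a[M] and/or the words in b[N] are not present in the semantic dictionary, split word A and/or word B into single-character words and establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B; calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q], and take that value as the similarity value of word A and word B.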
  • Preferably, step S40 is specifically:
  • when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
  • when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
  • when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.
  • The present invention also provides an improved semantic dictionary-based word similarity calculation apparatus, which includes:
  • a word acquisition module configured to obtain word A and word B to be compared;
  • a first word similarity calculation module configured to calculate the similarity value of word A and word B when both are present in the semantic dictionary;
  • a second word similarity calculation module configured, when at least one of word A and word B is not present in the semantic dictionary, to establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary, to calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], and to take the maximum similarity value as the similarity value of word A and word B.
  • Preferably, the second word similarity calculation module is specifically configured so that:
  • when word A is present in the semantic dictionary and word B is not, the extended phrase b[N] of word B is established, the similarity value between word A and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
  • when word A is not present in the semantic dictionary and word B is, the extended phrase a[M] of word A is established, the similarity value between each word in a[M] and word B is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
  • when neither word A nor word B is present in the semantic dictionary, the extended phrase a[M] of word A and the extended phrase b[N] of word B are both established, the similarity value between each word in a[M] and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B.
  • Preferably, the improved semantic dictionary-based word similarity calculation apparatus further comprises:
  • a third word similarity calculation module configured, when the words in a[M] and/or the words in b[N] are not present in the semantic dictionary, to split word A and/or word B into single-character words, to establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B, to calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q], and to take that value as the similarity value of word A and word B.
  • Preferably, the third word similarity calculation module is specifically configured so that:
  • when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
  • when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
  • when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.
  • The technical effect of the above technical solution is as follows: when one of word A and word B to be compared is not present in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B is established using a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then calculated; and the maximum similarity value is taken as the similarity value of word A and word B.
  • By applying synonym expansion to word A and/or word B, embodiments of the invention improve the accuracy of similarity calculation between words and thereby raise the intelligence level of intelligent interactive systems.
  • FIG. 1 is a schematic flow chart of a first embodiment of the improved semantic dictionary-based word similarity calculation method of the present invention;
  • FIG. 2 is a schematic structural diagram of a first embodiment of the improved semantic dictionary-based word similarity calculation apparatus of the present invention.
  • The main object of the present invention is to provide an improved semantic dictionary-based word similarity calculation method that improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.
  • To this end, the present invention provides an improved semantic dictionary-based word similarity calculation method.
  • Refer to FIG. 1, the schematic flow chart of the first embodiment of the method.
  • In one embodiment, as shown in FIG. 1, the improved semantic dictionary-based word similarity calculation method includes the following steps:
  • S10: obtain word A and word B to be compared;
  • Word A and word B can be obtained in various ways. For example, when matching questions in an intelligent interactive system, word A is obtained from the client and word B from the server-side database; when calculating sentence similarity, word A is taken from sentence 1 and word B from sentence 2.
  • S20: when word A and word B are both present in the semantic dictionary, calculate the similarity value of word A and word B; otherwise perform step S30;
  • The preset semantic dictionary is the HowNet semantic dictionary, which includes a glossary.dat file. Whether word A and word B are present in the preset semantic dictionary is determined by looking up each of them in glossary.dat. If both are in the semantic dictionary, their similarity is calculated by the conventional word similarity method.
  • The conventional word similarity method referred to here is the semantic dictionary-based word similarity calculation method disclosed in the prior art.
  • S30: establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary; calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; take the maximum similarity value as the similarity value of word A and word B.
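  • When word A and/or word B is not in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B must be established from the preset synonym dictionary. Here a[M] is the synonym-extended phrase of word A, and M is a natural number giving the number of words in a[M]; b[N] is the synonym-extended phrase of word B, and N is a natural number giving the number of words in b[N].
  • The preset synonym dictionary in this embodiment may be based on the existing 《同义词词林》 or on another synonym dictionary.
  • Step S30 requires different processing in the following three cases. The conventional word similarity method mentioned below is the semantic dictionary-based word similarity calculation method disclosed in the prior art.
  • (1) when word A is present in the semantic dictionary and word B is not, only b[N] needs to be established; the conventional method is used to calculate the similarity value between word A and each word in b[N] in turn, and the maximum is taken as the similarity value of word A and word B;
  • (2) when word A is not present in the semantic dictionary and word B is, only a[M] needs to be established; the conventional method is used to calculate the similarity value between each word in a[M] and word B in turn, and the maximum is taken as the similarity value of word A and word B;
  • (3) when neither word A nor word B is present in the semantic dictionary, both a[M] and b[N] must be established; the conventional method is used to calculate the similarity value between each word in a[M] and each word in b[N] in turn, and the maximum is taken as the similarity value of word A and word B.
  • In this embodiment, when one of word A and word B to be compared is not present in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B is established using a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then calculated; and the maximum similarity value is taken as the similarity value of word A and word B.
  • By applying synonym expansion to word A and/or word B, embodiments of the invention improve the accuracy of similarity calculation between words and thereby raise the intelligence level of intelligent interactive systems.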
  • In a preferred embodiment, when the similarity of word A and word B calculated by the method of the first embodiment is still 0, the improved semantic dictionary-based word similarity calculation method further includes the following step:
  • S40: when the words in a[M] and/or the words in b[N] are not present in the semantic dictionary, split word A and/or word B into single-character words and establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B; calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q], and take that value as the similarity value of word A and word B.
  • Step S40 is specifically:
  • when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
  • when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
  • when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.
  • The single-character phrase of word A is aa[P] (aa[0], aa[1], aa[2], ..., aa[P-1]); the single-character phrase of word B is bb[Q].
  • In this embodiment, the words in the extended phrase a[M] of word A and/or the words in the extended phrase b[N] of word B are further analyzed, word A and/or word B is split into single-character words, the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B is established, and the similarity of word A and word B is calculated according to the above algorithm. This further improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.
  • The pseudo code of the preferred embodiment of the improved semantic dictionary-based word similarity calculation method is as follows, where the sim function is the conventional word similarity algorithm and the sim2 function is based on equation (1). Both parameters of sim2 are arrays of strings; if one of the arguments is a string, it can be treated as an array of strings of length one:
  • The present invention also provides an improved semantic dictionary-based word similarity calculation apparatus.
  • FIG. 2 is a schematic structural diagram of a first embodiment of the improved semantic dictionary-based word similarity calculation apparatus of the present invention.
  • The improved semantic dictionary-based word similarity calculation apparatus includes:
  • a word acquisition module 10 configured to obtain word A and word B to be compared;
  • Word A and word B can be obtained in various ways. For example, when matching questions in an intelligent interactive system, word A is obtained from the client and word B from the server-side database; when calculating sentence similarity, word A is taken from sentence 1 and word B from sentence 2.
  • a first word similarity calculation module 20 configured to calculate the similarity value of word A and word B when both are present in the semantic dictionary;
  • The preset semantic dictionary is the HowNet semantic dictionary, which includes a glossary.dat file. Whether word A and word B are present in the preset semantic dictionary is determined by looking up each of them in glossary.dat. If both are in the semantic dictionary, their similarity is calculated by the conventional word similarity method.
  • The conventional word similarity method referred to here is the semantic dictionary-based word similarity calculation method disclosed in the prior art.
  • a second word similarity calculation module 30 configured, when at least one of word A and word B is not present in the semantic dictionary, to establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary, to calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], and to take the maximum similarity value as the similarity value of word A and word B.
  • Here a[M] is the synonym-extended phrase of word A, and M is a natural number giving the number of words in a[M]; b[N] is the synonym-extended phrase of word B, and N is a natural number giving the number of words in b[N].
  • The second word similarity calculation module is specifically configured to perform different processing in three cases: word A is in the semantic dictionary and word B is not; word B is in the semantic dictionary and word A is not; neither word is in the semantic dictionary. The conventional word similarity method used in each case is the semantic dictionary-based word similarity calculation method disclosed in the prior art.
  • In this embodiment, when one of word A and word B to be compared is not present in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B is established using a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then calculated; and the maximum similarity value is taken as the similarity value of word A and word B.
  • By applying synonym expansion to word A and/or word B, embodiments of the invention improve the accuracy of similarity calculation between words and thereby raise the intelligence level of intelligent interactive systems.
  • The improved semantic dictionary-based word similarity calculation apparatus further includes:
  • a third word similarity calculation module configured, when the similarity of word A and word B calculated by the method of the first embodiment is still 0, to further determine whether the words in a[M] and/or the words in b[N] are present in the semantic dictionary; if they are not, to split word A and/or word B into single-character words and establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B; to calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q]; and to take that value as the similarity value of word A and word B.
  • The third word similarity calculation module is specifically configured so that:
  • when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
  • when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
  • when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.
  • The single-character phrase of word A is aa[P] (aa[0], aa[1], aa[2], ..., aa[P-1]); the single-character phrase of word B is bb[Q].
  • In this embodiment, the words in the extended phrase a[M] of word A and/or the words in the extended phrase b[N] of word B are further analyzed, word A and/or word B is split into single-character words, the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B is established, and the similarity of word A and word B is calculated according to the above algorithm. This further improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.
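
The text above describes the calling convention of sim2 (both parameters are arrays of strings, and a single string counts as an array of length one), but neither the pseudo code nor equation (1) is reproduced here. The following is a minimal Python sketch of that convention only. The names `as_group`, `make_sim2`, and `overlap` are hypothetical, and `overlap` (a Jaccard ratio) is a placeholder, not equation (1).

```python
from typing import Callable, List, Sequence, Union

WordOrGroup = Union[str, Sequence[str]]


def as_group(arg: WordOrGroup) -> List[str]:
    """Treat a single string as an array of strings of length one."""
    # The isinstance check matters: list("送到") would split the word into characters.
    return [arg] if isinstance(arg, str) else list(arg)


def make_sim2(
    combine: Callable[[List[str], List[str]], float]
) -> Callable[[WordOrGroup, WordOrGroup], float]:
    """Wrap a group-level similarity so it accepts strings or arrays of strings."""
    def sim2(a: WordOrGroup, b: WordOrGroup) -> float:
        return combine(as_group(a), as_group(b))
    return sim2


def overlap(a: List[str], b: List[str]) -> float:
    # PLACEHOLDER combiner (Jaccard ratio); equation (1) is not available in this text.
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


sim2 = make_sim2(overlap)
print(sim2("送到", ["送", "到"]))        # a string is compared as the one-element group ["送到"]
print(sim2(["送", "达"], ["送", "到"]))  # two single-character groups
```

Separating the argument normalization from the combiner keeps the sketch valid whatever formula equation (1) actually specifies.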

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

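The invention discloses an improved word similarity calculation method. When one of word A and word B to be compared is not present in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B is established using a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then calculated; and the maximum similarity value is taken as the similarity value of word A and word B. By applying synonym expansion to word A and/or word B, embodiments of the invention improve the accuracy of similarity calculation between words and thereby raise the intelligence level of intelligent interactive systems.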

Description

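Improved semantic dictionary-based word similarity calculation method and apparatus
Technical Field
The invention relates to the field of natural language processing in computer science, and in particular to an improved semantic dictionary-based word similarity calculation method.
Background
Word similarity calculation is widely used in natural speech processing, intelligent retrieval, text clustering, text classification, automatic question answering, word sense disambiguation, and machine translation. At present, one approach to word similarity calculation used in China and abroad is calculation based on a semantic dictionary. For Chinese, commonly used semantic dictionaries include HowNet (知网), 《同义词词林》, and the Chinese Concept Dictionary.
However, to compare the similarity of two words, each word must first be looked up in the vocabulary of the semantic dictionary. If a word is not found, as with “送到” ("send to") and “送达” ("deliver") when “送达” is not in the vocabulary, the similarity cannot be calculated and the similarity between the two words defaults to zero.
It is therefore necessary to provide an improved semantic dictionary-based word similarity calculation method that improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.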
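Summary of the Invention
The main object of the present invention is to provide an improved semantic dictionary-based word similarity calculation method that improves the accuracy of similarity calculation between words and thereby raises the intelligence level of intelligent interactive systems.
To this end, the present invention provides an improved semantic dictionary-based word similarity calculation method comprising the following steps:
S10: obtain word A and word B to be compared;
S20: when word A and word B are both present in the semantic dictionary, calculate the similarity value of word A and word B; otherwise perform step S30;
S30: establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary; calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; take the maximum similarity value as the similarity value of word A and word B.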
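Preferably, step S30 is specifically:
when word A is present in the semantic dictionary and word B is not, the extended phrase b[N] of word B is established, the similarity value between word A and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
when word A is not present in the semantic dictionary and word B is, the extended phrase a[M] of word A is established, the similarity value between each word in a[M] and word B is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
when neither word A nor word B is present in the semantic dictionary, the extended phrase a[M] of word A and the extended phrase b[N] of word B are both established, the similarity value between each word in a[M] and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B.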
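
Steps S20 and S30, including the three cases above, can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: `base_sim` stands in for the conventional semantic dictionary-based similarity, and `in_dictionary`, `synonyms`, the toy vocabulary, and all scores are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable


def expanded_similarity(
    word_a: str,
    word_b: str,
    in_dictionary: Callable[[str], bool],
    synonyms: Callable[[str], Iterable[str]],
    base_sim: Callable[[str, str], float],
) -> float:
    """Steps S20/S30: fall back to synonym expansion for out-of-dictionary words."""
    # S20: both words are known, so the conventional method applies directly.
    if in_dictionary(word_a) and in_dictionary(word_b):
        return base_sim(word_a, word_b)
    # S30: expand only the word(s) missing from the semantic dictionary.
    # A known word is kept as a one-element group, which covers all three cases.
    group_a = [word_a] if in_dictionary(word_a) else list(synonyms(word_a))
    group_b = [word_b] if in_dictionary(word_b) else list(synonyms(word_b))
    # Take the maximum pairwise similarity; 0.0 when an expansion is empty.
    return max((base_sim(x, y) for x, y in product(group_a, group_b)), default=0.0)


# Hypothetical toy data: "送达" is missing from the semantic dictionary.
VOCAB = {"送到", "递送"}
SYNONYMS = {"送达": ["送到", "递送"]}
SCORES = {("送到", "送到"): 1.0, ("送到", "递送"): 0.8}


def toy_sim(x: str, y: str) -> float:
    # Stand-in for the conventional method; unknown pairs score 0.
    return SCORES.get((x, y), SCORES.get((y, x), 0.0))


score = expanded_similarity(
    "送到", "送达",
    in_dictionary=VOCAB.__contains__,
    synonyms=lambda w: SYNONYMS.get(w, []),
    base_sim=toy_sim,
)
print(score)
```

Treating a known word as a one-element group lets a single maximum over pairs serve all three cases instead of three separate branches.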
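Preferably, the improved semantic dictionary-based word similarity calculation method further comprises the following step:
S40: when the words in a[M] and/or the words in b[N] are not present in the semantic dictionary, split word A and/or word B into single-character words and establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B; calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q], and take that value as the similarity value of word A and word B.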
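Preferably, step S40 is specifically:
when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.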
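
The single-character fallback of step S40 might look like the sketch below. The text here does not state how a similarity between a single-character phrase and a word (or another phrase) is computed, so `group_sim` uses an assumed aggregation (average of best matches in both directions) purely for illustration. All function names and the toy `char_sim` are hypothetical.

```python
from typing import Callable, List, Sequence


def split_chars(word: str) -> List[str]:
    """Split a word into single-character words, e.g. "送达" -> ["送", "达"]."""
    return list(word)


def group_sim(
    group_a: Sequence[str],
    group_b: Sequence[str],
    base_sim: Callable[[str, str], float],
) -> float:
    """ASSUMED aggregation, not the patent's formula: each item is scored by its
    best match in the other group, and the scores are averaged over both groups."""
    if not group_a or not group_b:
        return 0.0
    best_a = [max(base_sim(x, y) for y in group_b) for x in group_a]
    best_b = [max(base_sim(x, y) for x in group_a) for y in group_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def single_char_fallback(
    word_a: str,
    word_b: str,
    a_group_known: bool,
    b_group_known: bool,
    base_sim: Callable[[str, str], float],
) -> float:
    """Step S40: split only the word whose extended phrase had no dictionary entries."""
    group_a = [word_a] if a_group_known else split_chars(word_a)
    group_b = [word_b] if b_group_known else split_chars(word_b)
    return group_sim(group_a, group_b, base_sim)


def char_sim(x: str, y: str) -> float:
    # Toy stand-in for the conventional method: identical items score 1, others 0.
    return 1.0 if x == y else 0.0


# Third case: neither extended phrase has any word in the dictionary.
score = single_char_fallback("送达", "送到", False, False, char_sim)
print(score)
```

Here `a_group_known` and `b_group_known` indicate whether at least one word of a[M] or b[N] was found in the semantic dictionary, which selects among the three cases.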
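In addition, to achieve the above object, the present invention also provides an improved semantic dictionary-based word similarity calculation apparatus, which includes:
a word acquisition module configured to obtain word A and word B to be compared;
a first word similarity calculation module configured to calculate the similarity value of word A and word B when both are present in the semantic dictionary;
a second word similarity calculation module configured, when at least one of word A and word B is not present in the semantic dictionary, to establish the extended phrase a[M] of word A and/or the extended phrase b[N] of word B using a preset synonym dictionary, to calculate the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], and to take the maximum similarity value as the similarity value of word A and word B.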
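Preferably, the second word similarity calculation module is specifically configured so that:
when word A is present in the semantic dictionary and word B is not, the extended phrase b[N] of word B is established, the similarity value between word A and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
when word A is not present in the semantic dictionary and word B is, the extended phrase a[M] of word A is established, the similarity value between each word in a[M] and word B is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B;
when neither word A nor word B is present in the semantic dictionary, the extended phrase a[M] of word A and the extended phrase b[N] of word B are both established, the similarity value between each word in a[M] and each word in b[N] is calculated in turn, and the maximum similarity value is taken as the similarity value of word A and word B.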
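Preferably, the improved semantic dictionary-based word similarity calculation apparatus further includes:
a third word similarity calculation module configured, when the words in a[M] and/or the words in b[N] are not present in the semantic dictionary, to split word A and/or word B into single-character words, to establish the single-character phrase aa[P] of word A and/or the single-character phrase bb[Q] of word B, to calculate the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q], and to take that value as the similarity value of word A and word B.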
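Preferably, the third word similarity calculation module is specifically configured so that:
when none of the words in a[M] is present in the semantic dictionary and at least one word in b[N] is, word A is split into single-character words to establish the single-character phrase aa[P] of word A; the similarity value between aa[P] and word B is calculated and taken as the similarity value of word A and word B;
when none of the words in b[N] is present in the semantic dictionary and at least one word in a[M] is, word B is split into single-character words to establish the single-character phrase bb[Q] of word B; the similarity value between word A and bb[Q] is calculated and taken as the similarity value of word A and word B;
when neither the words in a[M] nor the words in b[N] are present in the semantic dictionary, word A and word B are each split into single-character words to establish aa[P] and bb[Q]; the similarity value between aa[P] and bb[Q] is calculated and taken as the similarity value of word A and word B.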
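The technical effect of the above technical solution is as follows: when one of word A and word B to be compared is not present in the semantic dictionary, the extended phrase a[M] of word A and/or the extended phrase b[N] of word B is established using a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then calculated; and the maximum similarity value is taken as the similarity value of word A and word B. By applying synonym expansion to word A and/or word B, embodiments of the invention improve the accuracy of similarity calculation between words and thereby raise the intelligence level of intelligent interactive systems.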
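Brief Description of the Drawings
Figure 1 is a flow diagram of a first embodiment of the improved semantic-dictionary-based word similarity calculation method of the present invention;
Figure 2 is a structural diagram of a first embodiment of the improved semantic-dictionary-based word similarity calculation device of the present invention.
The realization of the object, the functional features and the advantages of the present invention are described further with reference to the embodiments and the accompanying drawings.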
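Detailed Description
It should be understood that the specific embodiments described here merely explain the present invention and are not intended to limit it.
The main object of the present invention is to provide an improved semantic-dictionary-based word similarity calculation method that improves the accuracy of word similarity calculation and thereby raises the intelligence level of intelligent interaction systems.
To achieve the above object, the present invention provides an improved semantic-dictionary-based word similarity calculation method.
Refer to Figure 1, which is a flow diagram of the first embodiment of the improved semantic-dictionary-based word similarity calculation method of the present invention.
In one embodiment, as shown in Figure 1, the improved semantic-dictionary-based word similarity calculation method comprises the following steps: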
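S10: obtain the words A and B to be compared;
Specifically, the words A and B to be compared can be obtained in several ways. For example, when an intelligent interaction system performs question matching, word A is obtained from the client and word B from a database on the server side; or, when sentence similarity is being calculated, word A is obtained from sentence 1 and word B from sentence 2.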
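S20: when both word A and word B exist in the semantic dictionary, compute the similarity value of word A and word B; otherwise go to step S30;
Specifically, the preset semantic dictionary is the HowNet semantic dictionary, which includes the file glossary.dat. Whether word A and word B exist in the preset semantic dictionary is determined separately for each, that is, word A and word B are each looked up in glossary.dat. If both are in the semantic dictionary, their similarity is computed with the conventional word similarity method. Here, the conventional word similarity method refers to the semantic-dictionary-based word similarity methods disclosed in the prior art.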
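Step S20 amounts to a membership test followed by a call to the conventional measure. The following Python sketch illustrates this under stated assumptions: the dictionary entries are taken to be already loaded into a set (the layout of glossary.dat is not described here, so parsing is left out), and `toy_sim` is only a stand-in for the conventional measure. The names `in_dictionary`, `step_s20`, `toy_lexicon` and `toy_sim` are hypothetical.

```python
from typing import Callable, Optional, Set


def in_dictionary(word: str, lexicon: Set[str]) -> bool:
    """Return True when the word is an entry of the semantic dictionary."""
    return word in lexicon


def step_s20(a: str, b: str, lexicon: Set[str],
             sim: Callable[[str, str], float]) -> Optional[float]:
    """Return sim(a, b) when both words are dictionary entries.

    Returns None otherwise, signalling that step S30 has to handle the pair.
    """
    if in_dictionary(a, lexicon) and in_dictionary(b, lexicon):
        return sim(a, b)
    return None


# Toy stand-ins (assumptions, not HowNet data and not the HowNet measure).
toy_lexicon = {"doctor", "physician", "hospital"}
toy_sim = lambda x, y: 1.0 if x == y else 0.5

both_present = step_s20("doctor", "hospital", toy_lexicon, toy_sim)
one_missing = step_s20("doctor", "medic", toy_lexicon, toy_sim)
```

Returning `None` rather than 0 keeps "not computed yet" distinct from a genuine similarity of zero, which is a design choice of this sketch and not something the description prescribes.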
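S30: build the extended word group a[M] of word A and/or the extended word group b[N] of word B from a preset synonym dictionary; compute the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; take the maximum similarity value as the similarity value of word A and word B.
Specifically, when word A and/or word B is not in the semantic dictionary, the extended word group a[M] of word A and/or the extended word group b[N] of word B must be built from the preset synonym dictionary. Here a[M] is the synonym-extended word group of word A, where M is a natural number giving the number of words in a[M]; b[N] is the synonym-extended word group of word B, where N is a natural number giving the number of words in b[N]. The preset synonym dictionary in the embodiments of the present invention may be based on the existing Tongyici Cilin (《同义词词林》) or on another synonym dictionary.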
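One way to picture the construction of an extended word group is a lookup from each word to the other members of its synonym groups. The Python sketch below assumes the synonym dictionary has already been read into plain lists of synonymous words; the actual file layout of a synonym dictionary such as Tongyici Cilin is not covered. The names `build_synonym_index`, `expand` and `toy_groups` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_synonym_index(groups: Iterable[List[str]]) -> Dict[str, List[str]]:
    """Map every word to the other members of the synonym groups it belongs to."""
    index: Dict[str, List[str]] = defaultdict(list)
    for group in groups:
        for word in group:
            for other in group:
                if other != word and other not in index[word]:
                    index[word].append(other)
    return dict(index)


def expand(word: str, index: Dict[str, List[str]]) -> List[str]:
    """Return the extended word group of `word` (empty if it has no synonyms)."""
    return list(index.get(word, []))


# Toy synonym groups (assumed data for illustration only).
toy_groups = [["medic", "doctor", "physician"], ["clinic", "hospital"]]
index = build_synonym_index(toy_groups)

a_group = expand("medic", index)   # plays the role of a[M], here with M = 2
unknown = expand("xyzzy", index)   # a word without synonyms yields an empty group
```

In this sketch a word with no entry simply gets an empty group, so later steps need a fallback for that case.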
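Step S30 is handled differently in the following three cases. The conventional word similarity method mentioned below refers to the semantic-dictionary-based word similarity methods disclosed in the prior art.
(1) When word A exists in the semantic dictionary and word B does not, only the extended word group b[N] of word B needs to be built; the conventional word similarity method is used to compute in turn the similarity value between word A and each word in b[N], and the maximum similarity value is taken as the similarity value of word A and word B;
(2) When word A does not exist in the semantic dictionary and word B does, only the extended word group a[M] of word A needs to be built; the conventional word similarity method is used to compute in turn the similarity value between each word in a[M] and word B, and the maximum similarity value is taken as the similarity value of word A and word B;
(3) When neither word A nor word B exists in the semantic dictionary, both the extended word group a[M] of word A and the extended word group b[N] of word B must be built; the conventional word similarity method is used to compute in turn the similarity values between all words in a[M] and all words in b[N], and the maximum similarity value is taken as the similarity value of word A and word B.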
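The three cases can be handled by one routine if a word that is already in the dictionary is treated as a candidate list of length one. The Python sketch below is a minimal illustration under assumptions: `synonyms` is a plain word-to-synonyms mapping, `toy_sim` stands in for the conventional measure, and returning 0.0 when no candidate pair can be scored is a choice of this sketch. All names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Set


def step_s30(a: str, b: str, lexicon: Set[str],
             synonyms: Dict[str, List[str]],
             sim: Callable[[str, str], float]) -> float:
    """Maximum similarity over the candidate words of A and of B."""
    # A word found in the dictionary stands for itself; otherwise use its synonyms.
    cands_a = [a] if a in lexicon else synonyms.get(a, [])
    cands_b = [b] if b in lexicon else synonyms.get(b, [])
    # Only candidates that are dictionary entries can be scored by sim.
    scores = [sim(x, y) for x, y in product(cands_a, cands_b)
              if x in lexicon and y in lexicon]
    return max(scores, default=0.0)


# Toy data (assumed for illustration only).
lexicon = {"doctor", "physician", "hospital", "clinic"}
synonyms = {"medic": ["doctor", "physician"], "infirmary": ["hospital", "clinic"]}
table = {("doctor", "hospital"): 0.6, ("physician", "hospital"): 0.7,
         ("doctor", "clinic"): 0.4, ("physician", "clinic"): 0.5}


def toy_sim(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return table.get((x, y), table.get((y, x), 0.0))


case1 = step_s30("hospital", "medic", lexicon, synonyms, toy_sim)   # A present, B absent
case2 = step_s30("medic", "clinic", lexicon, synonyms, toy_sim)     # A absent, B present
case3 = step_s30("medic", "infirmary", lexicon, synonyms, toy_sim)  # both absent
```

Case (3) scores up to M×N pairs, so its cost grows with the product of the two group sizes.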
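In the embodiments of the present invention, when at least one of the words A and B to be compared does not exist in the semantic dictionary, the extended word group a[M] of word A and/or the extended word group b[N] of word B is built from a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then computed, and the maximum similarity value is taken as the similarity value of word A and word B. By expanding word A and/or word B with synonyms, the embodiments of the present invention improve the accuracy of word similarity calculation and thereby raise the intelligence level of intelligent interaction systems.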
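In a preferred embodiment, further, when the similarity of word A and word B computed by the method of the first embodiment is still 0, the improved semantic-dictionary-based word similarity calculation method further comprises the following step:
S40: when the words in group a[M] and/or the words in group b[N] do not exist in the semantic dictionary, split word A and/or word B into single-character words to build the single-character word group aa[P] of word A and/or the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q]; and take that similarity value as the similarity value of word A and word B.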
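Preferably, step S40 is specifically:
When none of the words in group a[M] exists in the semantic dictionary and at least one word in group b[N] does, split word A into single-character words to build the single-character word group aa[P] of word A; compute the similarity value between aa[P] and word B, and take it as the similarity value of word A and word B;
When none of the words in group b[N] exists in the semantic dictionary and at least one word in group a[M] does, split word B into single-character words to build the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], and take it as the similarity value of word A and word B;
When neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, split word A and word B each into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B; compute the similarity value between aa[P] and bb[Q], and take it as the similarity value of word A and word B.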
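Specifically, in one embodiment, if neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, word A and word B are each split into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B. Suppose the single-character word group of word A is aa[P] (aa[0], aa[1], aa[2], ..., aa[P-1]) and that of word B is bb[Q] (bb[0], bb[1], bb[2], ..., bb[Q-1]). The similarity between aa[i] (0≤i≤P-1) and bb[j] (0≤j≤Q-1) can then be written sim(aa[i], bb[j]), and the similarity sim2(A, B) between word A and word B is given by formula (1):
Formula (1)
[Formula (1) and the definitions of its terms are not reproduced in this text.]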
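In this embodiment, when the similarity of word A and word B computed by the method of the first embodiment is still 0, the words in the extended word group a[M] of word A and/or in the extended word group b[N] of word B are analyzed further: word A and/or word B is split into single-character words, the single-character word group aa[P] of word A and/or bb[Q] of word B is built, and the similarity of word A and word B is computed with the above algorithm. This further improves the accuracy of word similarity calculation and thereby raises the intelligence level of intelligent interaction systems.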
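The single-character fallback can be sketched in Python as follows. Formula (1) is not reproduced in this text, so the aggregation inside `sim2` below is an assumption chosen purely for illustration (the mean of the best character matches in both directions) and should not be read as the formula of the invention. `split_chars`, `sim2` and `char_sim` are hypothetical names, and `char_sim` is a toy stand-in for the conventional measure applied to single characters.

```python
from typing import Callable, List, Sequence


def split_chars(word: str) -> List[str]:
    """Split a word into its single-character words, e.g. aa[P] for word A."""
    return list(word)


def sim2(xs: Sequence[str], ys: Sequence[str],
         sim: Callable[[str, str], float]) -> float:
    """ASSUMED aggregation over two single-character word groups.

    Averages, in both directions, the best match each character finds in the
    other group. This is NOT formula (1), which is unavailable here.
    """
    if not xs or not ys:
        return 0.0
    forward = sum(max(sim(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(sim(x, y) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


# Toy character-level measure (assumption): identical characters score 1.
char_sim = lambda x, y: 1.0 if x == y else 0.0

aa = split_chars("网购")   # ['网', '购'], so P = 2
bb = split_chars("网店")   # ['网', '店'], so Q = 2
score = sim2(aa, bb, char_sim)
```

Any other symmetric aggregation could be substituted for the body of `sim2` without changing how the surrounding steps call it.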
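Pseudocode for the preferred embodiment of the improved semantic-dictionary-based word similarity calculation method of the present invention is given below. The function sim is the conventional word similarity algorithm; the function sim2 is modeled on formula (1). Both parameters of sim2 are string arrays; when one argument is a single string, it can be treated as a string array of length one: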
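Dim f AS FILE //the preset semantic dictionary
Dim A, B AS STRING //the words to be compared
Dim r AS STRING //the comparison result
Dim a[M], b[N], aa[P], bb[Q] AS STRING[] //synonym-extended word groups and single-character word groups of words A and B
IF (A IN f) AND (B IN f) THEN
    r = sim(A, B) //sim is the conventional word similarity algorithm
ELSE IF (A NOT IN f) AND (B IN f) THEN
    A->a[M] //expand A with synonyms
    IF (a[i] IN f) THEN //0≤i≤M-1
        r = max(sim(a[i], B))
    ELSE
        A->aa[P] //split A into single-character words to build its single-character word group
        r = sim2(B, aa[P]) //similarity between a word and a single-character word group
    ENDIF
ELSE IF (A IN f) AND (B NOT IN f) THEN
    B->b[N] //expand B with synonyms
    IF (b[j] IN f) THEN //0≤j≤N-1
        r = max(sim(A, b[j]))
    ELSE
        B->bb[Q] //split B into single-character words to build its single-character word group
        r = sim2(A, bb[Q]) //similarity between a word and a single-character word group
    ENDIF
ELSE
    A->a[M] //expand A with synonyms
    B->b[N] //expand B with synonyms
    IF (a[i] IN f) AND (b[j] IN f) THEN //0≤i≤M-1, 0≤j≤N-1
        r = max(sim(a[i], b[j])) //take the maximum as the similarity value of A and B
    ELSE IF (a[i] NOT IN f) AND (b[j] IN f) THEN
        A->aa[P] //split A into single-character words to build its single-character word group
        r = max(sim2(b[j], aa[P])) //similarity between a word and a single-character word group
    ELSE IF (a[i] IN f) AND (b[j] NOT IN f) THEN
        B->bb[Q] //split B into single-character words to build its single-character word group
        r = max(sim2(a[i], bb[Q])) //similarity between a word and a single-character word group
    ELSE
        A->aa[P] //split A into single-character words
        B->bb[Q] //split B into single-character words
        r = sim2(aa[P], bb[Q]) //similarity between two single-character word groups
    ENDIF
ENDIF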
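The pseudocode above can be rendered as a runnable Python function. This is a sketch under assumptions, not a definitive implementation: the dictionary is a set, the synonym dictionary is a plain mapping, and both `sim` and `sim2` are passed in as parameters because neither the conventional measure nor formula (1) is specified here. The function `word_similarity` and the toy measures are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set

Sim = Callable[[str, str], float]
Sim2 = Callable[[Sequence[str], Sequence[str]], float]


def word_similarity(a: str, b: str, lexicon: Set[str],
                    synonyms: Dict[str, List[str]],
                    sim: Sim, sim2: Sim2) -> float:
    """Dispatch over the branches of the pseudocode."""
    if a in lexicon and b in lexicon:
        return sim(a, b)

    def usable(word: str) -> List[str]:
        # The word itself if it is an entry, else its in-dictionary synonyms.
        if word in lexicon:
            return [word]
        return [w for w in synonyms.get(word, []) if w in lexicon]

    side_a, side_b = usable(a), usable(b)
    if side_a and side_b:
        return max(sim(x, y) for x in side_a for y in side_b)
    # Fallback: a side with no usable word is replaced by its single characters.
    if not side_a and not side_b:
        return sim2(list(a), list(b))
    if not side_a:
        return max(sim2([y], list(a)) for y in side_b)
    return max(sim2([x], list(b)) for x in side_a)


# Toy stand-ins (assumptions for illustration only).
def toy_sim(x: str, y: str) -> float:
    if x == y:
        return 1.0
    if {x, y} == {"大夫", "医生"}:
        return 0.8
    return 0.3


def toy_sim2(xs: Sequence[str], ys: Sequence[str]) -> float:
    return max(toy_sim(x, y) for x in xs for y in ys)


lexicon = {"医生", "医院", "大夫"}
synonyms = {"郎中": ["大夫", "医生"]}

r_both = word_similarity("医生", "医院", lexicon, synonyms, toy_sim, toy_sim2)
r_expand = word_similarity("郎中", "医生", lexicon, synonyms, toy_sim, toy_sim2)
r_chars = word_similarity("网购", "网店", lexicon, synonyms, toy_sim, toy_sim2)
```

A single word is wrapped in a one-element list before it is handed to `sim2`, which mirrors the remark that a string argument can be treated as a string array of length one.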
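To achieve the above object, the present invention provides an improved semantic-dictionary-based word similarity calculation device.
Refer to Figure 2, which is a structural diagram of the first embodiment of the improved semantic-dictionary-based word similarity calculation device of the present invention.
In one embodiment, as shown in Figure 2, the improved semantic-dictionary-based word similarity calculation device comprises: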
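A word acquisition module 10, configured to obtain the words A and B to be compared;
Specifically, the words A and B to be compared can be obtained in several ways. For example, when an intelligent interaction system performs question matching, word A is obtained from the client and word B from a database on the server side; or, when sentence similarity is being calculated, word A is obtained from sentence 1 and word B from sentence 2.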
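A first word similarity calculation module 20, configured to compute the similarity value of word A and word B when both exist in the semantic dictionary;
Specifically, the preset semantic dictionary is the HowNet semantic dictionary, which includes the file glossary.dat. Whether word A and word B exist in the preset semantic dictionary is determined separately for each, that is, word A and word B are each looked up in glossary.dat. If both are in the semantic dictionary, their similarity is computed with the conventional word similarity method. Here, the conventional word similarity method refers to the semantic-dictionary-based word similarity methods disclosed in the prior art.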
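A second word similarity calculation module 30, configured to, when at least one of word A and word B does not exist in the semantic dictionary, build the extended word group a[M] of word A and/or the extended word group b[N] of word B from a preset synonym dictionary; compute the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; and take the maximum similarity value as the similarity value of word A and word B.
Specifically, when word A and/or word B is not in the semantic dictionary, the extended word group a[M] of word A and/or the extended word group b[N] of word B must be built from the preset synonym dictionary. Here a[M] is the synonym-extended word group of word A, where M is a natural number giving the number of words in a[M]; b[N] is the synonym-extended word group of word B, where N is a natural number giving the number of words in b[N].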
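The second word similarity calculation module is specifically configured to handle the following three cases differently. The conventional word similarity method mentioned below refers to the semantic-dictionary-based word similarity methods disclosed in the prior art.
(1) When word A exists in the semantic dictionary and word B does not, only the extended word group b[N] of word B needs to be built; the conventional word similarity method is used to compute in turn the similarity value between word A and each word in b[N], and the maximum similarity value is taken as the similarity value of word A and word B;
(2) When word A does not exist in the semantic dictionary and word B does, only the extended word group a[M] of word A needs to be built; the conventional word similarity method is used to compute in turn the similarity value between each word in a[M] and word B, and the maximum similarity value is taken as the similarity value of word A and word B;
(3) When neither word A nor word B exists in the semantic dictionary, both the extended word group a[M] of word A and the extended word group b[N] of word B must be built; the conventional word similarity method is used to compute in turn the similarity values between all words in a[M] and all words in b[N], and the maximum similarity value is taken as the similarity value of word A and word B.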
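The division of the device into modules 10, 20 and 30 could map onto code roughly as follows. This Python sketch only illustrates the routing between modules under assumptions: the class and function names (`acquire_words`, `FirstModule`, `SecondModule`, `Device`) are hypothetical, the dictionary is a set, the synonym dictionary is a plain mapping, and the toy `sim` replaces the conventional measure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Sim = Callable[[str, str], float]


def acquire_words(client_word: str, stored_word: str) -> Tuple[str, str]:
    """Module 10 (sketch): return the pair of words to compare."""
    return client_word.strip(), stored_word.strip()


@dataclass
class FirstModule:
    """Module 20: used when both words are dictionary entries."""
    lexicon: Set[str]
    sim: Sim

    def applies(self, a: str, b: str) -> bool:
        return a in self.lexicon and b in self.lexicon

    def compute(self, a: str, b: str) -> float:
        return self.sim(a, b)


@dataclass
class SecondModule:
    """Module 30: at least one word is missing, so expand it with synonyms."""
    lexicon: Set[str]
    synonyms: Dict[str, List[str]]
    sim: Sim

    def _candidates(self, word: str) -> List[str]:
        if word in self.lexicon:
            return [word]
        return [w for w in self.synonyms.get(word, []) if w in self.lexicon]

    def compute(self, a: str, b: str) -> float:
        scores = [self.sim(x, y)
                  for x in self._candidates(a) for y in self._candidates(b)]
        return max(scores, default=0.0)  # 0.0 when nothing is comparable (assumed)


@dataclass
class Device:
    first: FirstModule
    second: SecondModule

    def similarity(self, a: str, b: str) -> float:
        if self.first.applies(a, b):
            return self.first.compute(a, b)
        return self.second.compute(a, b)


lexicon = {"doctor", "physician", "hospital"}
sim = lambda x, y: 1.0 if x == y else 0.5
device = Device(FirstModule(lexicon, sim),
                SecondModule(lexicon, {"medic": ["doctor", "physician"]}, sim))
a, b = acquire_words(" medic ", "doctor")
result = device.similarity(a, b)
```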
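In the embodiments of the present invention, when at least one of the words A and B to be compared does not exist in the semantic dictionary, the extended word group a[M] of word A and/or the extended word group b[N] of word B is built from a preset synonym dictionary; the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N], is then computed, and the maximum similarity value is taken as the similarity value of word A and word B. By expanding word A and/or word B with synonyms, the embodiments of the present invention improve the accuracy of word similarity calculation and thereby raise the intelligence level of intelligent interaction systems.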
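In a preferred embodiment, the improved semantic-dictionary-based word similarity calculation device further comprises:
A third word similarity calculation module, configured to, when the similarity of word A and word B computed by the method of the first embodiment is still 0, further determine whether the words in group a[M] and/or the words in group b[N] are all absent from the semantic dictionary and, if so, split word A and/or word B into single-character words to build the single-character word group aa[P] of word A and/or the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q]; and take that similarity value as the similarity value of word A and word B.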
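Preferably, the third word similarity calculation module is specifically configured to:
When none of the words in group a[M] exists in the semantic dictionary and at least one word in group b[N] does, split word A into single-character words to build the single-character word group aa[P] of word A; compute the similarity value between aa[P] and word B, and take it as the similarity value of word A and word B;
When none of the words in group b[N] exists in the semantic dictionary and at least one word in group a[M] does, split word B into single-character words to build the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], and take it as the similarity value of word A and word B;
When neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, split word A and word B each into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B; compute the similarity value between aa[P] and bb[Q], and take it as the similarity value of word A and word B.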
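Specifically, in one embodiment, if neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, word A and word B are each split into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B. Suppose the single-character word group of word A is aa[P] (aa[0], aa[1], aa[2], ..., aa[P-1]) and that of word B is bb[Q] (bb[0], bb[1], bb[2], ..., bb[Q-1]). The similarity between aa[i] (0≤i≤P-1) and bb[j] (0≤j≤Q-1) can then be written sim(aa[i], bb[j]), and the similarity sim2(A, B) between word A and word B is given by formula (1):
Formula (1)
[Formula (1) and the definitions of its terms are not reproduced in this text.]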
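In this embodiment, when the similarity of word A and word B computed by the method of the first embodiment is still 0, the words in the extended word group a[M] of word A and/or in the extended word group b[N] of word B are analyzed further: word A and/or word B is split into single-character words, the single-character word group aa[P] of word A and/or bb[Q] of word B is built, and the similarity of word A and word B is computed with the above algorithm. This further improves the accuracy of word similarity calculation and thereby raises the intelligence level of intelligent interaction systems.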
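The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.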

Claims (12)

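  1. An improved semantic-dictionary-based word similarity calculation method, characterized in that the method comprises the following steps:
    S10: obtain the words A and B to be compared;
    S20: when both word A and word B exist in the semantic dictionary, compute the similarity value of word A and word B; otherwise go to step S30;
    S30: build the extended word group a[M] of word A and/or the extended word group b[N] of word B from a preset synonym dictionary; compute the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; take the maximum similarity value as the similarity value of word A and word B.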
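  2. The improved semantic-dictionary-based word similarity calculation method according to claim 1, characterized in that step S30 is specifically:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.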
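  3. The improved semantic-dictionary-based word similarity calculation method according to claim 1, characterized in that the method further comprises the following step:
    S40: when the words in group a[M] and/or the words in group b[N] do not exist in the semantic dictionary, split word A and/or word B into single-character words to build the single-character word group aa[P] of word A and/or the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q]; and take that similarity value as the similarity value of word A and word B.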
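  4. The improved semantic-dictionary-based word similarity calculation method according to claim 3, characterized in that step S30 is specifically:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.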
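  5. The improved semantic-dictionary-based word similarity calculation method according to claim 3, characterized in that step S40 is specifically:
    When none of the words in group a[M] exists in the semantic dictionary and at least one word in group b[N] does, split word A into single-character words to build the single-character word group aa[P] of word A; compute the similarity value between aa[P] and word B, and take it as the similarity value of word A and word B;
    When none of the words in group b[N] exists in the semantic dictionary and at least one word in group a[M] does, split word B into single-character words to build the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], and take it as the similarity value of word A and word B;
    When neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, split word A and word B each into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B; compute the similarity value between aa[P] and bb[Q], and take it as the similarity value of word A and word B.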
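  6. The improved semantic-dictionary-based word similarity calculation method according to claim 5, characterized in that step S30 is specifically:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.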
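  7. An improved semantic-dictionary-based word similarity calculation device, characterized in that the device comprises:
    A word acquisition module, configured to obtain the words A and B to be compared;
    A first word similarity calculation module, configured to compute the similarity value of word A and word B when both exist in the semantic dictionary;
    A second word similarity calculation module, configured to, when at least one of word A and word B does not exist in the semantic dictionary, build the extended word group a[M] of word A and/or the extended word group b[N] of word B from a preset synonym dictionary; compute the similarity value between word A and each word in b[N], or between each word in a[M] and word B, or between each word in a[M] and each word in b[N]; and take the maximum similarity value as the similarity value of word A and word B.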
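  8. The improved semantic-dictionary-based word similarity calculation device according to claim 7, characterized in that the second word similarity calculation module is specifically configured to:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.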
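  9. The improved semantic-dictionary-based word similarity calculation device according to claim 7, characterized in that the device further comprises:
    A third word similarity calculation module, configured to, when the words in group a[M] and/or the words in group b[N] do not exist in the semantic dictionary, split word A and/or word B into single-character words to build the single-character word group aa[P] of word A and/or the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], or between aa[P] and word B, or between aa[P] and bb[Q]; and take that similarity value as the similarity value of word A and word B.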
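  10. The improved semantic-dictionary-based word similarity calculation device according to claim 9, characterized in that the second word similarity calculation module is specifically configured to:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.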
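  11. The improved semantic-dictionary-based word similarity calculation device according to claim 9, characterized in that the third word similarity calculation module is specifically configured to:
    When none of the words in group a[M] exists in the semantic dictionary and at least one word in group b[N] does, split word A into single-character words to build the single-character word group aa[P] of word A; compute the similarity value between aa[P] and word B, and take it as the similarity value of word A and word B;
    When none of the words in group b[N] exists in the semantic dictionary and at least one word in group a[M] does, split word B into single-character words to build the single-character word group bb[Q] of word B; compute the similarity value between word A and bb[Q], and take it as the similarity value of word A and word B;
    When neither the words in group a[M] nor the words in group b[N] exist in the semantic dictionary, split word A and word B each into single-character words to build the single-character word group aa[P] of word A and the single-character word group bb[Q] of word B; compute the similarity value between aa[P] and bb[Q], and take it as the similarity value of word A and word B.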
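  12. The improved semantic-dictionary-based word similarity calculation device according to claim 11, characterized in that the second word similarity calculation module is specifically configured to:
    When word A exists in the semantic dictionary and word B does not, build the extended word group b[N] of word B, compute in turn the similarity value between word A and each word in b[N], and take the maximum similarity value as the similarity value of word A and word B;
    When word A does not exist in the semantic dictionary and word B does, build the extended word group a[M] of word A, compute in turn the similarity value between each word in a[M] and word B, and take the maximum similarity value as the similarity value of word A and word B;
    When neither word A nor word B exists in the semantic dictionary, build the extended word group a[M] of word A and the extended word group b[N] of word B, compute in turn the similarity values between all words in a[M] and all words in b[N], and take the maximum similarity value as the similarity value of word A and word B.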
PCT/CN2015/073841 2015-02-15 2015-03-07 Improved semantic-dictionary-based word similarity calculation method and device WO2016127458A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510083760.8 2015-02-15
CN201510083760.8A CN104699667A (zh) 2015-02-15 2015-02-15 Improved semantic-dictionary-based word similarity calculation method and device

Publications (1)

Publication Number Publication Date
WO2016127458A1 true WO2016127458A1 (zh) 2016-08-18

Family

ID=53346806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/073841 WO2016127458A1 (zh) 2015-02-15 2015-03-07 Improved semantic-dictionary-based word similarity calculation method and device

Country Status (2)

Country Link
CN (1) CN104699667A (zh)
WO (1) WO2016127458A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815484A (zh) * 2018-12-21 2019-05-28 平安科技(深圳)有限公司 Semantic similarity matching method and matching device based on a cross-attention mechanism

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
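CN106802918A (zh) * 2016-12-13 2017-06-06 成都数联铭品科技有限公司 Domain dictionary generation system for natural language processing
CN108664464B (zh) * 2017-03-27 2021-07-16 中国移动通信有限公司研究院 Method and device for determining semantic relatedness
CN108932222B (zh) * 2017-05-22 2021-11-19 中国移动通信有限公司研究院 Method and device for obtaining word relatedness
CN108153735B (zh) * 2017-12-28 2021-05-18 北京奇艺世纪科技有限公司 Method and system for obtaining near-synonyms
CN109472019B (zh) * 2018-10-11 2023-02-10 厦门快商通信息技术有限公司 Short-text similarity matching method and system based on a synonym dictionary
CN112528666A (zh) * 2019-08-30 2021-03-19 北京猎户星空科技有限公司 Semantic recognition method, device and electronic equipment
CN110737469B (zh) * 2019-09-29 2021-09-03 南京大学 Source code similarity evaluation method based on semantic information at function granularity
CN111339262B (zh) * 2020-05-21 2020-08-18 北京金山数字娱乐科技有限公司 Method and device for selecting words in a sentence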

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
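CN101288071A (zh) * 2005-02-25 2008-10-15 西门子企业通讯有限责任两合公司 Method and computer unit for determining computer service names
CN102622338A (zh) * 2012-02-24 2012-08-01 北京工业大学 Computer-aided method for calculating the semantic distance between short texts
CN102880600A (zh) * 2012-08-30 2013-01-16 北京航空航天大学 Method for predicting the semantic orientation of words based on a general knowledge network
CN102968409A (zh) * 2012-11-23 2013-03-13 海信集团有限公司 Semantic analysis method and interaction system for intelligent human-computer interaction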

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682898B2 (en) * 2010-04-30 2014-03-25 International Business Machines Corporation Systems and methods for discovering synonymous elements using context over multiple similar addresses
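CN103377239B (zh) * 2012-04-26 2020-08-07 深圳市世纪光速信息技术有限公司 Method and device for calculating similarity between texts
CN103678272B (zh) * 2012-09-17 2016-04-06 北京信息科技大学 Method for processing out-of-vocabulary words in a Chinese dependency treebank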

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
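CN109815484A (zh) * 2018-12-21 2019-05-28 平安科技(深圳)有限公司 Semantic similarity matching method and matching device based on a cross-attention mechanism
CN109815484B (zh) * 2018-12-21 2022-03-15 平安科技(深圳)有限公司 Semantic similarity matching method and matching device based on a cross-attention mechanism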

Also Published As

Publication number Publication date
CN104699667A (zh) 2015-06-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15881620

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15881620

Country of ref document: EP

Kind code of ref document: A1