CN101847141A - Method for measuring semantic similarity of Chinese words - Google Patents

Publication number
CN101847141A
Authority
CN
China
Prior art keywords
similarity
former
justice
depth
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010191677
Other languages
Chinese (zh)
Inventor
张玥杰
彭琳
金城
薛向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN 201010191677 priority Critical patent/CN101847141A/en
Publication of CN101847141A publication Critical patent/CN101847141A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the technical field of natural language processing and specifically discloses a method for measuring the semantic similarity of Chinese words. The method comprises the following steps: first, extracting the rich semantic information of HowNet by using HowNet's KDML language; second, calculating sememe similarity by using an optimized sememe similarity formula; and finally, calculating the similarity between concepts by using a maximum matching algorithm to obtain the semantic similarity of the Chinese words. Compared with traditional methods, the method discriminates semantic similarity better, and its results agree more closely with human subjective judgment.

Description

Method for measuring semantic similarity of Chinese words
Technical field
The invention belongs to the technical field of natural language processing and specifically relates to a method for measuring the semantic similarity of words.
Background art
The words of a natural language are linked by very complex relations, such as synonymy, opposition, antonymy, whole-part, and hyponymy. In practical applications there is a pressing need to measure these complex relations accurately with one simple quantity, namely word semantic similarity (Word Lexical Semantic Similarity). Word semantic similarity is widely used in many fields, such as image retrieval, text classification, word sense disambiguation, and machine translation. It uses a single numerical value to measure how semantically similar two words are, and it is precisely the complexity of the relations between words that makes an accurate measurement difficult. Semantic knowledge sources in dictionary form are created manually and are "correct, unbiased, and complete", whereas the corpora used for semantic computation suffer from domain imbalance and data sparseness, and no sufficiently large semantically annotated corpus exists. Statistics-based measurement of word semantic similarity therefore has many limitations, while the approach based on machine-readable dictionaries, which is intuitive, simple, and not restricted by word domain, shows growing advantages as large-scale semantic dictionaries continue to improve.
Research strategies for measuring word semantic similarity, both in China and abroad, fall broadly into two classes: approaches based on machine-readable dictionaries and approaches based on statistics.
Measurement based on a machine-readable dictionary is a rationalist approach grounded in linguistics and artificial intelligence. A semantic dictionary is organized according to the hierarchical relations between concepts, and similarity is computed from the hypernym-hyponym relations and coordinate terms between concepts in such a linguistic resource.
Measurement based on statistics is an empirical approach. It rests on the assumption that words with similar meanings should also occur in similar contexts. It gathers statistics from a large-scale corpus and mainly uses the probability distribution of contextual information as the reference for measuring word semantic similarity, so some of the literature also calls it distributional similarity (Distributional Similarity). The vector space model is one of the more widely used statistics-based methods. It selects a set of feature words in advance and then computes, over a large real-world corpus, the relatedness between this set of feature words and each word (generally measured by how often the feature words appear in the word's context). Each word thereby obtains a feature-word relatedness vector, and the similarity between the vectors (a commonly used method is the cosine) is taken as the similarity of the words. However, most current computation methods are slow, and because of the limitations of the corpus or of the machine-readable dictionary, the accuracy of semantic similarity measurement still leaves considerable room for improvement.
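The vector space model described above can be sketched in a few lines. This is only an illustration of the cosine comparison, not part of the claimed method; the feature-word counts below are invented for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature-word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A word that never co-occurs with any feature word has no direction;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts of three feature words in each word's context.
doctor_vec = [12, 3, 0]
nurse_vec = [10, 4, 1]
tennis_vec = [0, 1, 15]

sim_close = cosine_similarity(doctor_vec, nurse_vec)
sim_far = cosine_similarity(doctor_vec, tennis_vec)
```

With these made-up counts, the two medical words point in nearly the same direction, so `sim_close` is much larger than `sim_far`.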
Summary of the invention
The object of the invention is to provide a method for measuring the semantic similarity of Chinese words that is fast to compute and accurate.
The method proposed by the invention is a new HowNet-based method for measuring word semantic similarity. Compared with earlier similar algorithms, it makes fuller use of HowNet's KDML (Knowledge Database Mark-up Language) to extract HowNet's rich semantic information, adopts layered calculation combined with maximum matching, and optimizes the sememe similarity algorithm, so that the results are more discriminative and agree more closely with human subjective judgment.
Overview of the KDML language
In HowNet a word may have several concepts, and each concept is represented by one record of the following form:
NO.=021739
W_C=beats
G_C=V
E_C=~ball ,~tennis ,~basketball ,~shuttlecock ,~board ,~playing card ,~mahjong ,~swing ,~taijiquan, ball~get very well
W_E=play
G_E=V
E_E=
DEF={exercise|exercise: domain={sport|physical culture}}
In the record above, NO. is the record number; W_C, G_C, and E_C are the Chinese word, its part of speech, and its examples; W_E, G_E, and E_E are the English word, its part of speech, and its examples; DEF is the semantic expression of the concept, whose form is governed by the Knowledge Database Mark-up Language (KDML). KDML has the following four important components:
1) Sememe: the words used in KDML descriptions are called sememes. In the example, "exercise|exercise" and "sport|physical culture" are two sememes, organized according to the KDML syntax rules. Sememes are unambiguous; they are "the most basic units of meaning, not easily divided further", that is, the smallest units of description, extracted from Chinese characters (including single-morpheme words).
2) Primary sememe: the first sememe in a semantic expression is also called the primary sememe; in this example it is "exercise|exercise". The primary sememe necessarily states the most basic meaning of the concept and can be regarded as having the strongest descriptive power for it.
3) Semantic expression: "DEF={...}" is the core of the whole record. It defines and describes the concept and is called the semantic expression. To guarantee that concept descriptions can be complex yet remain consistent and accurate, KDML is used to standardize how semantic expressions are written.
4) Primary sememe framework: in brief, HowNet also defines most sememes by semantic expressions, just as it does words (see Fig. 1). For the sememe "thing|all things", for example, the primary sememe framework is "{entity|entity: {ExistAppear|exist or appear: existent={~}}}", and the description strictly follows KDML syntax.
In a concept description built with KDML, sememes at different bracket levels differ in how strongly they describe the word's meaning. Sememes in the outer brackets have stronger, more general descriptive power; sememes in the inner brackets are specific explanations of the sememe one level above, so they describe the concept only indirectly and have relatively weak descriptive power. They therefore need to be treated differently when measuring word semantic similarity.
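The record layout and the notion of bracket levels can be made concrete with a small parser. This is a minimal sketch, not HowNet's own tooling: it assumes each record field sits on its own "KEY=value" line and that the DEF value uses braces exactly as in the example record; the function names `parse_record` and `sememes_by_level` are hypothetical.

```python
import re

# A shortened copy of the example record from the text (English fields only).
RECORD = """\
NO.=021739
W_E=play
G_E=V
E_E=
DEF={exercise|exercise: domain={sport|physical culture}}
"""

def parse_record(text):
    """Split 'KEY=value' lines into a dict; only the first '=' separates key and value."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

# Either an optional 'role=' prefix plus an opening brace and the sememe after it,
# or a closing brace.
TOKEN = re.compile(r"(?:(\w+)\s*=\s*)?\{\s*([^:{},]*)|\}")

def sememes_by_level(expression):
    """Return (level, role, sememe) triples; level 1 is the outermost brace."""
    level = 0
    found = []
    for match in TOKEN.finditer(expression):
        if match.group(0) == "}":
            level -= 1
        else:
            level += 1
            # role is None when the sememe has no 'role=' prefix
            found.append((level, match.group(1), match.group(2).strip()))
    return found

record = parse_record(RECORD)
levels = sememes_by_level(record["DEF"])
```

Here `levels` places the primary sememe "exercise|exercise" at level 1 and "sport|physical culture", with role `domain`, at level 2, which is the distinction the layered calculation relies on.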
Sememe similarity measurement
As an important basis for word similarity measurement, sememe similarity is calculated according to the sememe hierarchy (that is, the hypernym-hyponym relation). Based on this tree-shaped hierarchy, the path length between nodes is considered together with the hierarchical depth of the nodes, giving the sememe similarity formula:
$$\mathrm{Sim}(S_1, S_2) = \frac{\alpha \times \min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2))}{\alpha \times \min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2)) + \mathrm{Dist}(S_1, S_2)} \tag{1}$$
where $S_1$ and $S_2$ denote two sememes; $\mathrm{Dist}(S_1, S_2)$ is the path length between the two sememes; $\alpha$ is an adjustable parameter, representing the path length at which the similarity equals 0.5; $\mathrm{Depth}(S_1)$ and $\mathrm{Depth}(S_2)$ are the hierarchical depths of $S_1$ and $S_2$; and $\min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2))$ takes the smaller of the two depths. Sememes differ in the amount of semantic information they carry: nodes at lower levels carry richer semantic information, and nodes at higher levels are more abstract, so sememes at different levels should be treated differently.
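Formula (1) can be tried out on a toy hierarchy. The sketch below is an illustration under several assumptions that the text does not fix: the taxonomy `PARENT` is invented, the root is given depth 1, the path length is counted in edges through the nearest common ancestor, and the default `alpha=0.5` is an arbitrary choice.

```python
# Hypothetical hypernym tree: child -> parent ("entity" is the root).
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}

def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(node, parent):
    """Hierarchical depth, counting the root as depth 1 (an assumption)."""
    return len(path_to_root(node, parent))

def distance(a, b, parent):
    """Number of edges on the path from a to b via their nearest common ancestor."""
    path_a = path_to_root(a, parent)
    index_in_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for i, n in enumerate(path_a):
        if n in index_in_b:
            return i + index_in_b[n]
    raise ValueError("nodes are not in the same tree")

def sememe_similarity(s1, s2, parent, alpha=0.5):
    """Formula (1): alpha*min_depth / (alpha*min_depth + Dist)."""
    scaled = alpha * min(depth(s1, parent), depth(s2, parent))
    return scaled / (scaled + distance(s1, s2, parent))

deep_pair = sememe_similarity("dog", "cat", PARENT)          # siblings at depth 3
shallow_pair = sememe_similarity("animal", "plant", PARENT)  # siblings at depth 2
```

Both pairs are two edges apart, yet `deep_pair` exceeds `shallow_pair`, which is the effect of the depth term: siblings lower in the tree count as more similar than equally distant siblings near the root. In this form the similarity equals 0.5 exactly when the path length equals `alpha` times the smaller depth.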
Semantic similarity measurement
Chinese has polysemy, so word semantic similarity should be computed from the similarity between concepts: the semantic similarity of two isolated words (not placed in any context) is the maximum of the similarities between all of their concepts.
$$\mathrm{Sim}(W_1, W_2) = \max_{i = 1 \ldots n,\; j = 1 \ldots m} \mathrm{Sim}(C_{1i}, C_{2j}) \tag{2}$$
where $W_1$ denotes word 1, which has $n$ concepts; $W_2$ denotes word 2, which has $m$ concepts; $C_{1i}$ is the $i$-th concept of $W_1$; and $C_{2j}$ is the $j$-th concept of $W_2$. According to the structural characteristics of KDML, concept semantic similarity is computed in three parts:
$$\mathrm{Sim}(C_1, C_2) = w_1 P_1 + w_2 P_2 + w_3 P_3 \tag{3}$$
where $P_1$ is the similarity between the primary sememes of the two concepts; $P_2$ is the similarity of the whole semantic expressions; $P_3$ is the similarity computed over the primary sememe frameworks of the two DEFs; and $w_1$, $w_2$, and $w_3$ are the weights of the three parts, which should satisfy the constraints $w_1 + w_2 + w_3 = 1$, $w_2 > w_1$, and $w_2 > w_3$.
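Formulas (2) and (3) combine as follows. This is a sketch only: the text gives the constraints on the weights but no values, so the default `(0.3, 0.5, 0.2)` is an assumed example, and the concept names and pairwise scores in `TOY_SCORES` are invented.

```python
import math

def concept_similarity(p1, p2, p3, weights=(0.3, 0.5, 0.2)):
    """Formula (3): weighted sum of the three partial similarities."""
    w1, w2, w3 = weights
    if not math.isclose(w1 + w2 + w3, 1.0) or not (w2 > w1 and w2 > w3):
        raise ValueError("weights must sum to 1 with w2 > w1 and w2 > w3")
    return w1 * p1 + w2 * p2 + w3 * p3

def word_similarity(concepts1, concepts2, concept_sim):
    """Formula (2): maximum similarity over all pairs of concepts."""
    return max(concept_sim(c1, c2) for c1 in concepts1 for c2 in concepts2)

# Hypothetical (P1, P2, P3) values for every concept pair of two polysemous words.
TOY_SCORES = {
    ("w1_sense_a", "w2_sense_a"): (1.0, 0.5, 1.0),
    ("w1_sense_a", "w2_sense_b"): (0.2, 0.1, 0.2),
    ("w1_sense_b", "w2_sense_a"): (0.4, 0.3, 0.4),
    ("w1_sense_b", "w2_sense_b"): (0.1, 0.1, 0.1),
}

def toy_concept_sim(c1, c2):
    return concept_similarity(*TOY_SCORES[(c1, c2)])

result = word_similarity(
    ["w1_sense_a", "w1_sense_b"], ["w2_sense_a", "w2_sense_b"], toy_concept_sim
)
```

The best-matching pair of senses decides the word-level score, so an unrelated second sense of either word does not lower the result.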
The calculation of concept similarity is illustrated with "infant" and "pediatrician", whose semantic expressions are, respectively:
DEF={human|people: domain={medical|doctor}, modifier={child|juvenile}, {SufferFrom|suffers from: experience={~}}, {doctor|cures: patient={~}}}
DEF={human|people: HostOf={Occupation|position}, domain={medical|doctor}, {doctor|cures: agent={~}, patient={human|people: modifier={child|juvenile}}}}
In formula (3), P_1 is the similarity between the two primary sememes, that is, between the first sememes of the two semantic expressions, here "human|people" and "human|people", and is calculated by formula (1). As explained above, the primary sememe has the most direct descriptive power, so treating it as a separate part of the concept is well justified.
For P_2: a semantic expression is a complete unit with its own syntax rules, so its similarity must be calculated as a whole with reference to the KDML rules. This is the most complex part of the whole measurement and carries the largest weight, because the entire semantic expression has to be considered. The calculation has two stages. First, the sememes in the semantic expression are divided into layers according to the KDML syntax rules (the braces "{ }" determine the layer each sememe belongs to), and ZeroRole is prefixed to every sememe that has no dynamic role, as shown in Table 1. Then the similarity of each layer is calculated by the maximum matching method.
Table 1: Sememe hierarchy of "infant" and "pediatrician"
In the maximum matching method, take the second layer of Table 1 as an example. First, the similarity of every pair of sememes is calculated and the pair with the largest value is selected. In this example the pairs "domain={medical|doctor}, domain={medical|doctor}" and "ZeroRole=doctor|cures, ZeroRole=doctor|cures" both have similarity 1, so either may be chosen, for instance "domain={medical|doctor}, domain={medical|doctor}". Next, the pair with the largest similarity among the remaining sememes is selected, which is "ZeroRole=doctor|cures, ZeroRole=doctor|cures". Continuing in the same way, the third round selects "modifier={child|juvenile}, HostOf={Occupation|position}" and the fourth round selects "SufferFrom|suffers from, NULL". When the two concepts have different numbers of sememes in the same layer, a sememe is paired with an empty element; such pairs are uniformly assigned a small value r (a preset parameter). Finally, the similarities of the selected pairs are summed and averaged to give the value of P_2.
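The rounds described above amount to a greedy matching, which can be sketched for a single layer as follows. Several things here are assumptions made for illustration: the function name `layer_similarity`, the value `r=0.2`, the toy similarity function (1.0 for identical items, 0.1 otherwise), and the shorthand labels used for the layer-2 items; a real implementation would score pairs with the sememe similarity of formula (1).

```python
def layer_similarity(items1, items2, sim, r=0.2):
    """Greedy maximum matching within one layer, averaged over all selected pairs.

    Repeatedly picks the remaining pair with the highest similarity; items left
    without a partner are paired with an empty element and scored r.
    """
    left, right = list(items1), list(items2)
    scores = []
    while left and right:
        # max() keeps the first of several equally good pairs, which is fine
        # because the text allows any of the tied pairs to be chosen.
        best_score, i, j = max(
            ((sim(a, b), i, j) for i, a in enumerate(left) for j, b in enumerate(right)),
            key=lambda t: t[0],
        )
        scores.append(best_score)
        left.pop(i)
        right.pop(j)
    scores.extend([r] * (len(left) + len(right)))
    if not scores:
        raise ValueError("both layers are empty")
    return sum(scores) / len(scores)

def toy_sim(a, b):
    # Stand-in for formula (1): identical items score 1.0, anything else 0.1.
    return 1.0 if a == b else 0.1

infant_layer2 = ["domain=medical", "modifier=child", "ZeroRole=SufferFrom", "ZeroRole=doctor"]
pediatrician_layer2 = ["HostOf=Occupation", "domain=medical", "ZeroRole=doctor"]

layer2_score = layer_similarity(infant_layer2, pediatrician_layer2, toy_sim)
```

With these stand-in scores the four rounds select the same pairs as in the text (the two identical pairs, then modifier with HostOf, then SufferFrom with the empty element), giving (1.0 + 1.0 + 0.1 + 0.2) / 4 for the layer.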
P_3 is calculated in the same way as P_2. Measuring similarity over the primary sememe frameworks is in effect another way of calculating the similarity of the primary sememes, and it again emphasizes the direct descriptive power of the primary sememe for the concept.
Finally, from these three partial similarities, the semantic similarity of each pair of concepts is calculated by formula (3), and the maximum is then taken by formula (2) as the semantic similarity of the words.
Description of drawings
Fig. 1 is a diagram of semantic expression definitions.
Fig. 2 is a flowchart of the algorithm.
Embodiment
The flow of the method is shown in Fig. 2. Its steps are:
a) input two words;
b) obtain all HowNet records of the two words;
c) take the DEF expression out of each record;
d) calculate the similarity of the sememes in the expressions according to formula (1);
e) calculate the similarity of two DEF expressions according to formula (3);
f) calculate the similarity of the two words according to formula (2).
For step b), the official C/C++ development interface can be used. The HowNet system provides a bilingual Chinese-English knowledge dictionary and the link libraries needed for development.
Looking up a word's records in HowNet takes the following steps:
(1) First call HowNet_Initial(). This function initializes the data of the HowNet knowledge system and must be called before any other function. It needs the index file hownet.idx of the HowNet knowledge system; if this file does not exist, initialization fails.
(2) Then call HowNet_Search_Keyword(char* ApStr, S_SEARCH_MODE sHowNet_SearchMode) to search for the keyword; ApStr is the keyword to search for. Given the specified search mode and keyword, this function returns the number of records found in the HowNet knowledge base. To obtain the actual search results, it is normally used together with HowNet_Get_SearchResult and HowNet_Get_Unit_Item.
(3) Next call HowNet_Get_SearchResult() to obtain the search results; the return value is an array of the record numbers of the records found. This function places the record numbers of all records found by HowNet_Search_Keyword in an array and returns it.
(4) Finally, using the record numbers obtained in step (3), call HowNet_Get_Unit_Item(const DWORD AdwUnitID, const BYTE AItemID, char* ApRlt) to obtain the specified part of a specified record. AdwUnitID is the record number, and AItemID specifies which part of the record to fetch; the value HOWNET_ITEM_ID_ALL denotes the complete content of the record.
For step d), the sememes of the form " " are taken out of the DEF expression and the calculation is applied to them.

Claims (1)

1. A method for calculating the semantic similarity of Chinese words, characterized in that the concrete steps are: first, extracting the rich semantic information of HowNet by using HowNet's KDML language; then calculating sememe similarity by the sememe similarity formula; and finally, calculating the similarity between concepts by the maximum matching algorithm formula, thereby obtaining the semantic similarity of the Chinese words; wherein:
The sememe similarity formula is:
$$\mathrm{Sim}(S_1, S_2) = \frac{\alpha \times \min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2))}{\alpha \times \min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2)) + \mathrm{Dist}(S_1, S_2)} \tag{1}$$
wherein $S_1$ and $S_2$ denote two sememes; $\mathrm{Dist}(S_1, S_2)$ is the path length between the two sememes; $\alpha$ is an adjustable parameter, representing the path length at which the similarity equals 0.5; $\mathrm{Depth}(S_1)$ and $\mathrm{Depth}(S_2)$ are the hierarchical depths of $S_1$ and $S_2$; and $\min(\mathrm{Depth}(S_1), \mathrm{Depth}(S_2))$ takes the smaller of the two depths;
The formula of the maximum matching algorithm is:
$$\mathrm{Sim}(W_1, W_2) = \max_{i = 1 \ldots n,\; j = 1 \ldots m} \mathrm{Sim}(C_{1i}, C_{2j}) \tag{2}$$
wherein $W_1$ denotes word 1, which has $n$ concepts; $W_2$ denotes word 2, which has $m$ concepts; $C_{1i}$ is the $i$-th concept of $W_1$; and $C_{2j}$ is the $j$-th concept of $W_2$; according to the structural characteristics of KDML, concept semantic similarity is computed in three parts:
$$\mathrm{Sim}(C_1, C_2) = w_1 P_1 + w_2 P_2 + w_3 P_3 \tag{3}$$
wherein $P_1$ is the similarity between the primary sememes of the two concepts; $P_2$ is the similarity of the whole semantic expressions; $P_3$ is the similarity computed over the primary sememe frameworks of the two DEFs; and $w_1$, $w_2$, and $w_3$ are the weights of the three parts, satisfying the constraints $w_1 + w_2 + w_3 = 1$, $w_2 > w_1$, and $w_2 > w_3$.
CN 201010191677 2010-06-03 2010-06-03 Method for measuring semantic similarity of Chinese words Pending CN101847141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010191677 CN101847141A (en) 2010-06-03 2010-06-03 Method for measuring semantic similarity of Chinese words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010191677 CN101847141A (en) 2010-06-03 2010-06-03 Method for measuring semantic similarity of Chinese words

Publications (1)

Publication Number Publication Date
CN101847141A true CN101847141A (en) 2010-09-29

Family

ID=42771764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010191677 Pending CN101847141A (en) 2010-06-03 2010-06-03 Method for measuring semantic similarity of Chinese words

Country Status (1)

Country Link
CN (1) CN101847141A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN104133812A (en) * 2014-07-17 2014-11-05 北京信息科技大学 User-query-intention-oriented Chinese sentence similarity hierarchical calculation method and user-query-intention-oriented Chinese sentence similarity hierarchical calculation device
CN105183716A (en) * 2015-09-21 2015-12-23 上海智臻智能网络科技股份有限公司 Intelligent interaction method based on abstract semantics
CN105808522A (en) * 2016-03-08 2016-07-27 浪潮软件股份有限公司 Method and apparatus for semantic association
CN106021286A (en) * 2016-04-29 2016-10-12 东北电力大学 Method for language understanding based on language structure
CN106339371A (en) * 2016-08-30 2017-01-18 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN106610934A (en) * 2016-07-08 2017-05-03 四川用联信息技术有限公司 Novel semantic similarity solving method in intelligent manufacturing industry
CN106815197A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 The determination method and apparatus of text similarity
CN109446518A (en) * 2018-10-09 2019-03-08 清华大学 The coding/decoding method and decoder of language model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
《信息技术》 20100331 赵应秋等 基于知网的词语语义相关度计算 第92-93页 1 , 2 *
《微计算机信息》 20100131 徐猛等 一种基于知网语义相似度计算的应用研究 第201页 1 第26卷, 第1-3期 2 *
《电子技术》 20100531 曹立勇等 基于知网的语义相似度的改进算法 第1-2页 1 , 2 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN104133812B (en) * 2014-07-17 2017-03-08 北京信息科技大学 A kind of Chinese sentence similarity layered calculation method of user oriented query intention and device
CN104133812A (en) * 2014-07-17 2014-11-05 北京信息科技大学 User-query-intention-oriented Chinese sentence similarity hierarchical calculation method and user-query-intention-oriented Chinese sentence similarity hierarchical calculation device
CN105183716B (en) * 2015-09-21 2017-12-15 上海智臻智能网络科技股份有限公司 A kind of intelligent interactive method based on abstract semantics
CN105183716A (en) * 2015-09-21 2015-12-23 上海智臻智能网络科技股份有限公司 Intelligent interaction method based on abstract semantics
CN106815197A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 The determination method and apparatus of text similarity
CN106815197B (en) * 2015-11-27 2020-07-31 北京国双科技有限公司 Text similarity determination method and device
CN105808522A (en) * 2016-03-08 2016-07-27 浪潮软件股份有限公司 Method and apparatus for semantic association
CN106021286A (en) * 2016-04-29 2016-10-12 东北电力大学 Method for language understanding based on language structure
CN106610934A (en) * 2016-07-08 2017-05-03 四川用联信息技术有限公司 Novel semantic similarity solving method in intelligent manufacturing industry
CN106339371A (en) * 2016-08-30 2017-01-18 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN106339371B (en) * 2016-08-30 2019-04-30 齐鲁工业大学 A kind of English-Chinese meaning of a word mapping method and device based on term vector
CN109446518A (en) * 2018-10-09 2019-03-08 清华大学 The coding/decoding method and decoder of language model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Similar Documents

Publication Publication Date Title
CN101847141A (en) Method for measuring semantic similarity of Chinese words
Grefenstette Explorations in automatic thesaurus discovery
CN106844658A (en) A kind of Chinese text knowledge mapping method for auto constructing and system
Yao et al. Question generation with minimal recursion semantics
Lee et al. A grammar-based semantic similarity algorithm for natural language sentences
Chakrabarti et al. Optimizing scoring functions and indexes for proximity search in type-annotated corpora
CN112487202B (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
CN110427478B (en) Knowledge graph-based question and answer searching method and system
CN109918676A (en) It is a kind of to detect the method and device for being intended to regular expression, terminal device
Guha et al. Removing the training wheels: A coreference dataset that entertains humans and challenges computers
CN112949293B (en) Similar text generation method, similar text generation device and intelligent equipment
CN112632250A (en) Question and answer method and system under multi-document scene
Hong Verb Sense Discovery in Mandarin Chinese--A Corpus Based Knowledge-Intensive Approach
Vandevoorde On semantic differences: a multivariate corpus-based study of the semantic field of inchoativity in translated and non-translated Dutch
Mollaei et al. Question classification in Persian language based on conditional random fields
KR102363131B1 (en) Multi-dimensional knowledge searching method and system for expert systems
Kennedy et al. Evaluation of automatic updates of Roget’s Thesaurus
CN108959269B (en) A kind of sentence auto ordering method and device
Van Tu A Deep Learning Model of Multiple Knowledge Sources Integration for Community Question Answering
Gunasiri Automated cricket news generation in Sri Lankan style using natural language generation
Wijaya VerbKB: a knowledge base of verbs for natural language understanding
Becerra-Bonache et al. A gold standard to measure relative linguistic complexity with a grounded language learning model
Bindu et al. Design and development of a named entity based question answering system for Malayalam language
Lapshin Question-answering systems: Development and prospects
Zhou et al. Medical text classification system based on deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100929