CN106610948A - Improved lexical semantic similarity solution algorithm - Google Patents

Improved lexical semantic similarity solution algorithm

Info

Publication number
CN106610948A
Authority
CN
China
Prior art keywords
word
compared
maximum
similarity
context word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610838940.7A
Other languages
Chinese (zh)
Inventor
金平艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Publication of CN106610948A
Legal status: Pending

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides an improved lexical semantic similarity solution algorithm. Starting from a successfully initialized statistical method module, the algorithm finds the maximum-weight context word of each word to be compared, extracts the keyword with the maximum similarity from those context words, and finally computes the relatedness between the extracted keyword and each word to be compared to obtain the similarity result. The computed semantic similarity is basically consistent with the semantic similarity actually judged by humans, better reflects objective reality, and better satisfies user needs.

Description

Improved lexical semantic similarity solution algorithm
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to an improved lexical semantic similarity solution algorithm.
Background art
Since the beginning of the 21st century, the global Internet industry has entered a new period of rapid development, and new technologies keep emerging. Natural language processing, an important technology connecting computers and people, has also developed rapidly. Traditional semantic relatedness computation methods fall roughly into two classes: semantic similarity computation based on semantic dictionaries and semantic similarity computation based on corpora. Both belong to statistics-based semantic similarity computation, which is an empirical approach: statistical research on semantic similarity is built on observable linguistic facts rather than relying solely on the intuition of linguists. It rests on the assumption that two words are semantically similar if and only if they appear in similar contexts, and it uses large-scale corpora, taking the contextual information of words as the reference basis for measuring semantic similarity. Such methods can measure the semantic similarity between words fairly accurately and effectively, but they depend heavily on the corpus used for training, require a large amount of computation with relatively complex procedures, and are strongly affected by data sparseness and noise, so obvious errors sometimes occur. To meet the above needs, the present invention provides an improved lexical semantic similarity solution algorithm.
Summary of the invention
To address the problem of similarity between words, the invention provides an improved lexical semantic similarity solution algorithm.
In order to solve the above problem, the present invention is realized through the following technical solution:
Step 1: Initialize the statistical method module; the corpus here may be 《Word Dictionary》, 《Cilin》 (word forest), HowNet, 《Baidu Baike》 or a similar corpus.
Step 2: Input the words to be compared (c_1, c_2) into the initialized statistical method module.
Step 3: In the statistical module, find the maximum-weight context words (c_sx1, c_sx2) in the contexts adjacent to the words to be compared (c_1, c_2).
Step 4: Based on the similarity between the maximum-weight context words (c_sx1, c_sx2) corresponding to the two words to be compared, extract the keyword c_sx with the maximum similarity.
Step 5: Compute the relatedness between the maximum-similarity keyword c_sx and each of the words to be compared (c_1, c_2).
Step 6: Using the relatedness values obtained in Step 5, derive the similarity value sim(c_1, c_2) of the words to be compared.
The present invention has the following beneficial effects:
1. The computed semantic similarity is basically consistent with the semantic similarity actually judged by humans.
2. Objective reality is better reflected.
3. User needs are better satisfied.
Description of the drawings
Fig. 1 is the structural flow chart of the improved lexical semantic similarity solution algorithm.
Specific embodiments
To solve the problem of semantic similarity between the words (c_1, c_2), the present invention is described in detail below with reference to Fig. 1. The specific implementation steps are as follows:
Step 1: Initialize the statistical method module; the corpus here may be 《Word Dictionary》, 《Cilin》, 《HowNet》, 《Baidu Baike》 or a similar corpus.
Step 2: Input the words to be compared (c_1, c_2) into the initialized statistical method module.
Step 3: In the statistical module, find the maximum-weight context words (c_sx1, c_sx2) in the contexts adjacent to the words to be compared (c_1, c_2).
Find the maximum-weight context words (c_sx1, c_sx2) corresponding to the words to be compared (c_1, c_2) in the corpus; the specific calculation process is as follows:
Context words are searched according to constraint conditions. For example, in Chinese, part-of-speech pairs with a strong contextual constraint relation include adjective-noun, verb-noun, noun-verb, adjective-verb, etc.
$$\mathrm{weight}_{sx1,2} = p(c_{1,2}/c_{sx1,2})\,\log_2\bigl[p(c_{1,2}/c_{sx1,2})+1\bigr]$$
In the above formula, c_{sx1,2} denotes the context words that are adjacent to the words to be compared (c_1, c_2) and stand in a certain relation to them, p(c_{1,2}/c_{sx1,2}) is the conditional probability that c_{1,2} presents that relation with the context word c_{sx1,2}, and the 1 in the formula is a smoothing coefficient.
The conditional probability is computed from corpus counts:
$$p(c_{1,2}/c_{sx1,2}) = \frac{n(c_{1,2}/c_{sx1,2})}{n(c_{1,2},\,c_{sx1,2})}$$
where n(c_{1,2}/c_{sx1,2}) is the number of co-occurrences in the corpus in which the word to be compared presents the given relation with its context word c_{sx1,2}, and n(c_{1,2}, c_{sx1,2}) is the total number of co-occurrences of the word to be compared with the context word c_{sx1,2}.
In summary, the following is obtained:
$$\mathrm{MAXweight}_{sx1,2} = \max\bigl\{p(c_{1,2}/c_{sx1,2})\,\log_2[p(c_{1,2}/c_{sx1,2})+1]\bigr\}$$
According to the above formula, the optimal context collocations (c_sx1, c_sx2) of the words to be compared (c_1, c_2) are found: c_sx1 is the optimal context collocation that presents a certain relation with c_1, and likewise c_sx2 is the optimal context collocation that presents a certain relation with c_2.
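To make the Step 3 weight computation concrete, the following is a minimal Python sketch, assuming co-occurrence counts for the constrained part-of-speech relations have already been collected from the corpus; the function names and the counts dictionary layout are illustrative and are not part of the patent.

```python
import math

def context_weight(n_relation, n_cooccur):
    # weight = p * log2(p + 1), with p = n(c/c_sx) / n(c, c_sx); the 1 is the smoothing term.
    if n_cooccur == 0:
        return 0.0
    p = n_relation / n_cooccur
    return p * math.log2(p + 1)

def max_weight_context_word(word, counts):
    """counts maps (word, context_word) -> (n_relation, n_cooccur); returns the
    adjacent context word with the largest weight for the given word."""
    best = None
    for (w, ctx), (n_rel, n_co) in counts.items():
        if w != word:
            continue
        weight = context_weight(n_rel, n_co)
        if best is None or weight > best[1]:
            best = (ctx, weight)
    return best

# Illustrative counts for one word to be compared and two candidate context words.
counts = {
    ("apple", "eat"): (30, 50),   # verb-noun relation
    ("apple", "red"): (12, 40),   # adjective-noun relation
}
print(max_weight_context_word("apple", counts))  # -> ('eat', ...)
```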
Step 4: Based on the similarity between the maximum-weight context words (c_sx1, c_sx2) corresponding to the two words to be compared, extract the keyword c_sx with the maximum similarity.
The keyword information c_sx can be found from the total word-frequency count contained in the two maximum-weight context terms (c_sx1, c_sx2), i.e. from the similarity Sim(c_sx1, c_sx2) computed over their shared word content.
In that formula, nf(c_sx1, c_sx2) is the total number of words contained by the two maximum-weight context terms (c_sx1, c_sx2), and the remaining two quantities are the entry lengths of c_sx1 and c_sx2 respectively.
From the above formula:
$$f(c_{sx}) = \max\bigl[\mathrm{Sim}(c_{sx1}, c_{sx2})\bigr]$$
According to the maximum of f(c_sx), the best-matching keyword c_sx can be found.
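The explicit form of Sim(c_sx1, c_sx2) is not reproduced in the text above, so the following sketch only illustrates the idea of Step 4 with an assumed Dice-style overlap between the two maximum-weight context terms; both functions and the tie-break for choosing c_sx are assumptions.

```python
def sim_context_terms(term1: str, term2: str) -> float:
    # Assumed Dice-style measure: shared characters normalised by the two entry
    # lengths, standing in for the patent's Sim built on nf(c_sx1, c_sx2) and the
    # entry lengths (that exact formula is not reproduced here).
    if not term1 or not term2:
        return 0.0
    shared = len(set(term1) & set(term2))
    return 2.0 * shared / (len(term1) + len(term2))

def best_match_keyword(candidate_pairs):
    """candidate_pairs: iterable of (c_sx1, c_sx2) pairs; picks the pair maximising
    f(c_sx) = max[Sim(c_sx1, c_sx2)] and keeps its shorter entry as keyword c_sx."""
    c_sx1, c_sx2 = max(candidate_pairs, key=lambda pair: sim_context_terms(*pair))
    return min((c_sx1, c_sx2), key=len)

print(best_match_keyword([("ripe apple", "red apple"), ("run fast", "blue sky")]))  # -> 'red apple'
```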
Step 5: Compute the relatedness between the maximum-similarity keyword c_sx and each of the words to be compared (c_1, c_2).
From the co-occurrence probability of c_sx with the words to be compared (c_1, c_2) in the corpus, and from the semantic structure relation between c_sx and (c_1, c_2), the relatedness of the maximum keyword c_sx with each word to be compared can be obtained. The specific calculation process is as follows:
Step 5.1) The co-occurrence probability p(c_{1,2}/c_sx) of c_sx with the words to be compared (c_1, c_2) in the corpus is obtained by the same principle as the conditional probability above.
Step 5.2) The semantic structure relation between c_sx and the words to be compared (c_1, c_2): its path vector can be derived from the hierarchical relations in 《HowNet》.
Step 5.3) In summary, the relatedness of the maximum keyword c_sx with each of the words to be compared (c_1, c_2) can be obtained.
In the relativity formula, relativity(c_sx, c_1) and relativity(c_sx, c_2) are respectively the relatedness of the maximum keyword c_sx with the words to be compared c_1 and c_2, and the other two quantities are the adjustment factor for the semantic structure relation between c_sx and the words to be compared and the adjustment factor for the co-occurrence probability p(c_{1,2}/c_sx).
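The relativity formula and its two adjustment factors are likewise not reproduced in the text above, so the following sketch assumes a simple weighted combination of the co-occurrence probability with a HowNet path-length factor; the weights, the path factor, and the function names are all assumptions, not the patent's own formula.

```python
def cooccurrence_probability(n_relation: int, n_cooccur: int) -> float:
    # p(c/c_sx): same conditional-probability form as in Step 3.
    return n_relation / n_cooccur if n_cooccur else 0.0

def path_factor(path_length: int, alpha: float = 1.0) -> float:
    # Assumed semantic-structure factor: the longer the HowNet path between the
    # keyword c_sx and the word to be compared, the weaker the relation.
    return alpha / (alpha + path_length)

def relativity(p_cooccur: float, path_length: int,
               w_struct: float = 0.5, w_prob: float = 0.5) -> float:
    # Hypothetical combination of the two signals, with w_struct and w_prob
    # standing in for the patent's two adjustment factors.
    return w_struct * path_factor(path_length) + w_prob * p_cooccur

# Relatedness of the keyword c_sx with each of the two words to be compared.
rel_c1 = relativity(cooccurrence_probability(18, 40), path_length=2)
rel_c2 = relativity(cooccurrence_probability(9, 40), path_length=4)
print(rel_c1, rel_c2)
```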
Step 6: Using the relatedness values obtained in Step 5, derive the similarity value sim(c_1, c_2) of the words to be compared (c_1, c_2).
For the improved lexical semantic similarity solution algorithm, the pseudocode of the calculation process is as follows:
Input: the selected and initialized statistical module, and the words to be compared (c_1, c_2).
Output: the semantic similarity between the words to be compared (c_1, c_2).
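The body of the pseudocode is not reproduced in the extracted text; the following self-contained Python sketch wires Steps 1 to 6 together under the same assumptions as the sketches above. The StatModule fields, the keyword tie-break, and the final combination of the two relatedness values into sim(c_1, c_2) are illustrative only.

```python
import math
from dataclasses import dataclass

@dataclass
class StatModule:
    # Stand-in for the initialised statistical method module (Step 1).
    context_counts: dict   # (word, context_word) -> (n_relation, n_cooccur)
    hownet_paths: dict     # (word, keyword) -> path length in the HowNet hierarchy

def _weight(n_rel, n_co):
    # Step 3 weight: p * log2(p + 1).
    p = n_rel / n_co if n_co else 0.0
    return p * math.log2(p + 1)

def word_similarity(c1: str, c2: str, stats: StatModule) -> float:
    # Step 3: maximum-weight context word adjacent to each word to be compared.
    def max_ctx(word):
        pairs = [(ctx, _weight(*cnt))
                 for (w, ctx), cnt in stats.context_counts.items() if w == word]
        return max(pairs, key=lambda x: x[1])[0]
    c_sx1, c_sx2 = max_ctx(c1), max_ctx(c2)
    # Step 4: keep the shorter of the two context terms as keyword c_sx (assumed tie-break).
    c_sx = min((c_sx1, c_sx2), key=len)
    # Step 5: relatedness of c_sx with each word, combining the co-occurrence
    # probability with a HowNet path factor (assumed combination).
    def rel(word):
        n_rel, n_co = stats.context_counts.get((word, c_sx), (0, 0))
        p = n_rel / n_co if n_co else 0.0
        path = stats.hownet_paths.get((word, c_sx), 10)
        return 0.5 * p + 0.5 / (1 + path)
    r1, r2 = rel(c1), rel(c2)
    # Step 6: combine the two relatedness values into sim(c1, c2) (assumed harmonic mean).
    return 2 * r1 * r2 / (r1 + r2) if (r1 + r2) else 0.0
```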

Claims (5)

1. An improved lexical semantic similarity solution algorithm. The present invention relates to the field of Semantic Web technology, and in particular to an improved lexical semantic similarity solution algorithm, characterized in that it comprises the following steps:
Step 1: Initialize the statistical method module; the corpus here may be 《Word Dictionary》, 《Cilin》, HowNet, 《Baidu Baike》 or a similar corpus;
Step 2: Input the words to be compared (c_1, c_2) into the initialized statistical method module;
Step 3: In the statistical module, find the maximum-weight context words (c_sx1, c_sx2) in the contexts adjacent to the words to be compared (c_1, c_2);
Step 4: Based on the similarity between the maximum-weight context words (c_sx1, c_sx2) corresponding to the two words to be compared, extract the keyword c_sx with the maximum similarity;
Step 5: Compute the relatedness between the maximum-similarity keyword c_sx and each of the words to be compared (c_1, c_2);
Step 6: Using the relatedness values obtained in Step 5, derive the similarity value sim(c_1, c_2) of the words to be compared.
2. The improved lexical semantic similarity solution algorithm according to claim 1, characterized in that the specific calculation process in Step 3 above is as follows:
Step 3: In the statistical module, find the maximum-weight context words (c_sx1, c_sx2) in the contexts adjacent to the words to be compared (c_1, c_2).
Find the maximum-weight context words (c_sx1, c_sx2) corresponding to the words to be compared (c_1, c_2) in the corpus; the specific calculation process is as follows:
Context words are searched according to constraint conditions; for example, in Chinese, part-of-speech pairs with a strong contextual constraint relation include adjective-noun, verb-noun, noun-verb, adjective-verb, etc.
$$\mathrm{weight}_{sx1,2} = p(c_{1,2}/c_{sx1,2})\,\log_2\bigl[p(c_{1,2}/c_{sx1,2})+1\bigr]$$
In the above formula, c_{sx1,2} denotes the context words that are adjacent to the words to be compared (c_1, c_2) and stand in a certain relation to them, p(c_{1,2}/c_{sx1,2}) is the conditional probability that c_{1,2} presents that relation with the context word c_{sx1,2}, and the 1 in the formula is a smoothing coefficient.
The conditional probability is computed from corpus counts, where n(c_{1,2}/c_{sx1,2}) is the number of co-occurrences in the corpus in which the word to be compared presents the given relation with its context word c_{sx1,2}, and n(c_{1,2}, c_{sx1,2}) is the total number of co-occurrences of the word to be compared with the context word c_{sx1,2}.
In summary, the following is obtained:
$$\mathrm{MAXweight}_{sx1,2} = \max\bigl\{p(c_{1,2}/c_{sx1,2})\,\log_2[p(c_{1,2}/c_{sx1,2})+1]\bigr\}$$
According to the above formula, the optimal context collocations (c_sx1, c_sx2) of the words to be compared (c_1, c_2) are found: c_sx1 is the optimal context collocation that presents a certain relation with c_1, and likewise c_sx2 is the optimal context collocation that presents a certain relation with c_2.
3. The improved lexical semantic similarity solution algorithm according to claim 1, characterized in that the specific calculation process in Step 4 above is as follows:
Step 4: Based on the similarity between the maximum-weight context words (c_sx1, c_sx2) corresponding to the two words to be compared, extract the keyword c_sx with the maximum similarity.
The keyword information c_sx can be found from the total word-frequency count contained in the two maximum-weight context terms (c_sx1, c_sx2), i.e. from the similarity Sim(c_sx1, c_sx2) computed over their shared word content.
In that formula, nf(c_sx1, c_sx2) is the total number of words contained by the two maximum-weight context terms (c_sx1, c_sx2), and the remaining two quantities are the entry lengths of c_sx1 and c_sx2 respectively.
From the above formula:
$$f(c_{sx}) = \max\bigl[\mathrm{Sim}(c_{sx1}, c_{sx2})\bigr]$$
According to the maximum of f(c_sx), the best-matching keyword c_sx can be found.
4. The improved lexical semantic similarity solution algorithm according to claim 1, characterized in that the specific calculation process in Step 5 above is as follows:
Step 5: Compute the relatedness between the maximum-similarity keyword c_sx and each of the words to be compared (c_1, c_2).
From the co-occurrence probability of c_sx with the words to be compared (c_1, c_2) in the corpus, and from the semantic structure relation between c_sx and (c_1, c_2), the relatedness of the maximum keyword c_sx with each word to be compared can be obtained; the specific calculation process is as follows:
Step 5.1) The co-occurrence probability p(c_{1,2}/c_sx) of c_sx with the words to be compared (c_1, c_2) in the corpus is obtained by the same principle as the conditional probability above.
Step 5.2) The semantic structure relation between c_sx and the words to be compared (c_1, c_2): its path vector can be derived from the hierarchical relations in 《HowNet》.
Step 5.3) In summary, the relatedness of the maximum keyword c_sx with each of the words to be compared (c_1, c_2) can be obtained.
In the relativity formula, relativity(c_sx, c_1) and relativity(c_sx, c_2) are respectively the relatedness of the maximum keyword c_sx with the words to be compared c_1 and c_2, and the other two quantities are the adjustment factor for the semantic structure relation between c_sx and the words to be compared and the adjustment factor for the co-occurrence probability p(c_{1,2}/c_sx).
5. The improved lexical semantic similarity solution algorithm according to claim 1, characterized in that the specific calculation process in Step 6 above is as follows:
Step 6: Using the relatedness values obtained in Step 5, derive the similarity value sim(c_1, c_2) of the words to be compared (c_1, c_2).
CN201610838940.7A 2016-07-20 2016-09-21 Improved lexical semantic similarity solution algorithm Pending CN106610948A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2016105759690 2016-07-20
CN201610575969 2016-07-20

Publications (1)

Publication Number Publication Date
CN106610948A 2017-05-03

Family

ID=58615306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610838940.7A Pending CN106610948A (en) 2016-07-20 2016-09-21 Improved lexical semantic similarity solution algorithm

Country Status (1)

Country Link
CN (1) CN106610948A (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
崔春华 et al.: "Improvement of ontology-based concept similarity calculation", 《世界科技研究与发展》 (World Sci-Tech R&D) *
鲁松 et al.: "Quantitative description of the effective range of word context in natural language processing", 《计算机学报》 (Chinese Journal of Computers) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052509A (en) * 2018-01-31 2018-05-18 北京神州泰岳软件股份有限公司 A kind of Text similarity computing method, apparatus and server
CN108052509B (en) * 2018-01-31 2019-06-28 北京神州泰岳软件股份有限公司 A kind of Text similarity computing method, apparatus and server

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN107085581B (en) Short text classification method and device
Dos Santos et al. Deep convolutional neural networks for sentiment analysis of short texts
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
JP5936698B2 (en) Word semantic relation extraction device
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN106611041A (en) New text similarity solution method
JP2005505869A (en) Identifying character strings
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN110750646B (en) Attribute description extracting method for hotel comment text
CN111046660B (en) Method and device for identifying text professional terms
CN112599128A (en) Voice recognition method, device, equipment and storage medium
CN110991180A (en) Command identification method based on keywords and Word2Vec
CN107102985A (en) Multi-threaded keyword extraction techniques in improved document
CN106610954A (en) Text feature word extraction method based on statistics
CN106610952A (en) Mixed text feature word extraction method
CN106610949A (en) Text feature extraction method based on semantic analysis
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN114742069A (en) Code similarity detection method and device
Lee et al. Character-level feature extraction with densely connected networks
CN114491062A (en) Short text classification method fusing knowledge graph and topic model
CN106610953A (en) Method for solving text similarity based on Gini index
CN112632272B (en) Microblog emotion classification method and system based on syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20170503