CN106610948A - Improved lexical semantic similarity solution algorithm - Google Patents
- Publication number: CN106610948A (application CN201610838940.7A)
- Authority: CN (China)
- Prior art keywords: word, compared, maximum, similarity, context word
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides an improved algorithm for computing lexical semantic similarity. After a statistical-method module is initialized, the algorithm extracts a best-matching keyword by comparing the maximum-weight context words of the words to be compared and their similarity, and finally computes the relatedness between the extracted keyword and each word to obtain the similarity result. The computed semantic similarity is largely consistent with semantic similarity as judged by humans, better reflects objective reality, and better satisfies user needs.
Description
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to an improved algorithm for computing lexical semantic similarity.
Background technology
Since the beginning of the 21st century, the global internet industry has entered a new period of rapid development, and new technologies continually emerge. Natural language processing, an important bridge between computers and people, has likewise developed quickly. Traditional semantic relatedness computation methods fall roughly into two classes: semantic similarity methods based on a semantic dictionary and semantic similarity methods based on a corpus. The corpus-based methods are statistical, empiricist methods: the statistical study of semantic similarity is built on observable linguistic facts rather than relying solely on the intuition of linguists. It rests on the assumption that two words are semantically similar if and only if they occur in similar contexts; using a large-scale corpus, the contextual information of a word serves as the reference for measuring semantic similarity. Such methods can measure the semantic similarity between words fairly accurately and effectively, but they depend heavily on the training corpus, are computationally intensive and comparatively complex, and are strongly affected by data sparseness and noise, sometimes producing obvious errors. To meet the above needs, the invention provides an improved lexical semantic similarity algorithm.
Summary of the invention
Addressing the problem of similarity between words, the invention provides an improved lexical semantic similarity algorithm.
In order to solve the above problems, the present invention is achieved by the following technical solutions:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to (c1, c2), extract the maximum-similarity keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
The present invention has the following advantages:
1. The computed semantic similarity is largely consistent with semantic similarity as actually judged by humans.
2. Objective reality is better reflected.
3. User needs are better satisfied.
Description of the drawings
Fig. 1 is a flow chart of the improved lexical semantic similarity algorithm.
Detailed description of the embodiments
To solve the problem of semantic similarity between words (c1, c2), the present invention is described in detail below with reference to Fig. 1. The concrete implementation steps are as follows:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Find the context words (csx1, csx2) with maximum weight for the words to be compared (c1, c2) in the corpus. The concrete calculation is as follows:
Context words are searched under constraints; for example, in Chinese, part-of-speech pairs with strong contextual constraint relations include adjective-noun, verb-noun, noun-verb, and adjective-verb.
weight_sx1,2 = p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1]
In the above formula, csx1,2 are the context words adjacent, under a given relation, to the words to be compared (c1, c2); p(c1,2/csx1,2) is the conditional probability that c1,2 and the context word csx1,2 exhibit that relation; and the 1 inside the logarithm is a smoothing coefficient.
p(c1,2/csx1,2) = n(c1,2/csx1,2) / n(c1,2, csx1,2)
where n(c1,2/csx1,2) is the number of co-occurrences in the corpus in which the word to be compared c1,2 and the context word csx1,2 exhibit the given relation, and n(c1,2, csx1,2) is the total number of co-occurrences of c1,2 with csx1,2.
In summary, the following formula is obtained:
MAXweight_sx1,2 = MAX{ p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1] }
By the above formula, the optimal context collocations (csx1, csx2) of the words to be compared (c1, c2) are found: csx1 is the optimal context collocation exhibiting the given relation with c1, and likewise csx2 is the optimal context collocation exhibiting the given relation with c2.
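The weight computation of step 3 can be sketched in Python; this is a minimal illustration, with hypothetical counts n_rel (co-occurrences exhibiting the constrained part-of-speech relation) and n_total (all co-occurrences) standing in for real corpus statistics.

```python
import math

def context_weight(n_rel: int, n_total: int) -> float:
    """weight = p * log2(p + 1), where p = n_rel / n_total is the
    conditional probability of the relation and the +1 inside the
    logarithm is the smoothing coefficient from the formula."""
    if n_total == 0:
        return 0.0
    p = n_rel / n_total
    return p * math.log2(p + 1)

def max_weight_context(candidates: dict) -> str:
    """Return the candidate context word with the maximum weight.
    candidates maps a context word to its (n_rel, n_total) counts."""
    return max(candidates, key=lambda w: context_weight(*candidates[w]))

# Toy counts for context words adjacent to one word to be compared.
counts = {"fruit": (8, 10), "run": (2, 10), "red": (6, 10)}
best = max_weight_context(counts)  # "fruit" has the highest weight
```

Since p · log2(p + 1) increases with p, the maximum-weight context word is the one whose relation-constrained co-occurrences dominate its total co-occurrences.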
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to the words to be compared (c1, c2), extract the maximum-similarity keyword csx.
From the total number of words shared by the two maximum-weight context entries (csx1, csx2), the keyword information csx can be found: in the similarity formula Sim(csx1, csx2), nf(csx1, csx2) is the total number of words shared by (csx1, csx2), and the entry lengths of csx1 and csx2 normalize the count.
From the above,
f(csx) = max[Sim(csx1, csx2)]
and the maximum of f(csx) identifies the best-matching keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
From the probability of co-occurrence of csx with (c1, c2) in the corpus and the semantic structure relation between csx and (c1, c2), the relatedness of csx to each of (c1, c2) can be obtained. The concrete calculation is as follows:
Step 5.1) The co-occurrence probability p(c1,2/csx) of csx with the words to be compared (c1, c2) in the corpus is obtained by the same principle as the probability calculation above.
Step 5.2) The semantic structure relation between csx and (c1, c2): its path vector can be derived from the hierarchical relations in HowNet.
Step 5.3) Combining the above, the relatedness of the keyword csx with each of the words to be compared (c1, c2) is obtained.
relativity(csx, c1) and relativity(csx, c2) are the relatedness of the keyword csx with c1 and c2 respectively; in the formula, one regulatory factor adjusts the semantic structure relation between csx and (c1, c2), and another adjusts the co-occurrence probability p(c1,2/csx).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
The pseudo-code of the improved lexical semantic similarity algorithm is as follows:
Input: the initialized statistical module and the words to be compared (c1, c2).
Output: the semantic similarity between the words to be compared (c1, c2).
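Putting steps 3 through 6 together, the pseudo-code above can be sketched end to end. The corpus statistics are toy values; keeping the higher-weight candidate as the keyword and averaging the two relatedness scores into sim(c1, c2) are simplifying assumptions, since those combining rules are not spelled out in the text.

```python
import math

def weight(n_rel: int, n_total: int) -> float:
    # weight = p * log2(p + 1); the +1 is the smoothing coefficient
    p = n_rel / n_total if n_total else 0.0
    return p * math.log2(p + 1)

def lexical_similarity(c1, c2, c1_ctx, c2_ctx, relate):
    """Steps 3-6 of the algorithm.
    c1_ctx / c2_ctx map candidate context words to (n_rel, n_total);
    relate(keyword, word) is a relatedness oracle standing in for the
    step-5 computation."""
    # Step 3: maximum-weight context word for each word to be compared
    csx1 = max(c1_ctx, key=lambda w: weight(*c1_ctx[w]))
    csx2 = max(c2_ctx, key=lambda w: weight(*c2_ctx[w]))
    # Step 4: extract the keyword; the higher-weight candidate is kept
    csx = csx1 if weight(*c1_ctx[csx1]) >= weight(*c2_ctx[csx2]) else csx2
    # Steps 5-6: relatedness to each word, averaged into sim(c1, c2)
    return 0.5 * (relate(csx, c1) + relate(csx, c2))

sim_value = lexical_similarity(
    "apple", "pear",
    {"fruit": (8, 10), "eat": (3, 10)},
    {"fruit": (7, 10), "tree": (4, 10)},
    lambda k, c: 0.6,  # constant oracle for illustration
)
```

With these toy counts, "fruit" wins for both input words, and the constant oracle yields sim_value = 0.6.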
Claims (5)
1. An improved lexical semantic similarity algorithm, relating to the field of Semantic Web technology, characterized by comprising the following steps:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Step 4: From the similarity between the maximum-weight context words (csx1, csx2), extract the maximum-similarity keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
2. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 3 is as follows:
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to the words to be compared (c1, c2).
Find the context words with maximum weight for (c1, c2) in the corpus; the concrete calculation is as follows:
Context words are searched under constraints; for example, in Chinese, part-of-speech pairs with strong contextual constraint relations include adjective-noun, verb-noun, noun-verb, and adjective-verb.
weight_sx1,2 = p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1]
In the above formula, csx1,2 are the context words adjacent, under a given relation, to the words to be compared (c1, c2); p(c1,2/csx1,2) is the conditional probability that c1,2 and csx1,2 exhibit that relation; the 1 inside the logarithm is a smoothing coefficient.
p(c1,2/csx1,2) = n(c1,2/csx1,2) / n(c1,2, csx1,2)
where n(c1,2/csx1,2) is the number of co-occurrences in the corpus in which c1,2 and csx1,2 exhibit the given relation, and n(c1,2, csx1,2) is the total number of co-occurrences of c1,2 with csx1,2.
In summary, the following formula is obtained:
MAXweight_sx1,2 = MAX{ p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1] }
By the above formula, the optimal context collocations (csx1, csx2) of (c1, c2) are found: csx1 is the optimal context collocation exhibiting the given relation with c1, and likewise csx2 with c2.
3. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 4 is as follows:
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to the words to be compared (c1, c2), extract the maximum-similarity keyword csx.
From the total number of words shared by the two maximum-weight context entries (csx1, csx2), the keyword information csx can be found: in the similarity formula Sim(csx1, csx2), nf(csx1, csx2) is the total number of words shared by (csx1, csx2), and the entry lengths of csx1 and csx2 normalize the count.
From the above,
f(csx) = max[Sim(csx1, csx2)]
and the maximum of f(csx) identifies the best-matching keyword csx.
4. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 5 is as follows:
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
From the probability of co-occurrence of csx with (c1, c2) in the corpus and the semantic structure relation between csx and (c1, c2), the relatedness of csx to each of (c1, c2) can be obtained; the concrete calculation is as follows:
Step 5.1) The co-occurrence probability p(c1,2/csx) of csx with (c1, c2) in the corpus is obtained by the same principle as the probability calculation above.
Step 5.2) The semantic structure relation between csx and (c1, c2): its path vector can be derived from the hierarchical relations in HowNet.
Step 5.3) Combining the above, the relatedness values relativity(csx, c1) and relativity(csx, c2) of the keyword csx with the words to be compared c1 and c2 are obtained; in the formula, one regulatory factor adjusts the semantic structure relation between csx and (c1, c2), and another adjusts the co-occurrence probability p(c1,2/csx).
5. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 6 is as follows:
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN2016105759690 | 2016-07-20 | |
CN201610575969 | 2016-07-20 | |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106610948A true CN106610948A (en) | 2017-05-03 |
Family
ID=58615306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201610838940.7A (Pending) | Improved lexical semantic similarity solution algorithm | 2016-07-20 | 2016-09-21
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106610948A (en) |
Non-Patent Citations (2)
Title |
---|
Cui Chunhua et al., "Improvement of ontology-based concept similarity calculation", World Sci-Tech R&D * |
Lu Song et al., "Quantitative description of the effective range of word context in natural language processing", Chinese Journal of Computers * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108052509A (en) * | 2018-01-31 | 2018-05-18 | 北京神州泰岳软件股份有限公司 | A kind of Text similarity computing method, apparatus and server |
CN108052509B (en) * | 2018-01-31 | 2019-06-28 | 北京神州泰岳软件股份有限公司 | A kind of Text similarity computing method, apparatus and server |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170503 |