CN106610951A - Improved text similarity solving algorithm based on semantic analysis - Google Patents

Improved text similarity solving algorithm based on semantic analysis

Info

Publication number
CN106610951A
CN106610951A (application CN201610864853.9A)
Authority
CN
China
Prior art keywords
text
word
information
similarity
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610864853.9A
Other languages
Chinese (zh)
Inventor
金平艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Priority to CN201610864853.9A priority Critical patent/CN106610951A/en
Publication of CN106610951A publication Critical patent/CN106610951A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an improved text similarity solving algorithm based on semantic analysis. The algorithm performs word segmentation and stop-word removal on two texts; computes the weight of each word in its text with an improved information-theoretic method; obtains additional weights from each word's position and part of speech; constructs, from these three factors, an objective function for extracting the feature words of each text; and finally reduces the dimensionality of the two feature-word sets according to semantic similarity, yielding two feature-word vectors from which the text similarity sim(W1, W2) between the texts (W1, W2) is computed with the Pearson correlation coefficient. Compared with traditional text similarity computation methods, the algorithm is more accurate, more widely applicable, and of higher practical value: it accurately computes the contribution of different words to the idea of a text, handles polysemy and synonymy, agrees more closely with empirical judgments, and provides a good theoretical basis for subsequent text clustering.

Description

Improved text similarity solving algorithm based on semantic analysis
Technical field
The present invention relates to the field of semantic web technology, and in particular to an improved text similarity solving algorithm based on semantic analysis.
Background technology
At present, there are two main approaches to computing text similarity: methods based on mathematical statistics and methods based on semantic analysis. Statistical methods compute similarity from word forms and word frequencies, while semantic analysis exploits the inherent semantic relations among the words within a text. The vector space model (VSM) is the classical method for computing text similarity, but it considers neither the semantic information of words nor the semantic links between them, so it cannot truly reflect the similarity between texts; moreover, VSM ignores the semantic status of a word within its text and the word's contribution to expressing the text's central idea. Computing text similarity with the vector space model is therefore flawed. To improve the accuracy of text similarity computation and to address polysemy (one word with many senses) and synonymy (one sense expressed by many words), the present invention provides an improved text similarity solving algorithm based on semantic analysis.
Content of the invention
Addressing the differing importance of different feature words to a text, the problems of polysemy and synonymy, and the accuracy of text similarity computation, the invention provides an improved text similarity solving algorithm based on semantic analysis.
To solve the above problems, the present invention is achieved through the following technical scheme:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2).
Step 2: Using an information-theoretic method, compute each word's weight w_i in the text.
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text.
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2), and extract the feature words of each text.
Step 5: Using the word similarity sim(c_1i, c_1(i+1)), reduce the dimensionality of the feature-word sets obtained above.
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient.
The present invention has the following advantages:
1. The method is more accurate than traditional text similarity computation methods and agrees better with manually extracted results.
2. The method is better suited to fields such as information retrieval, machine translation, and automatic question answering.
3. The algorithm has greater practical value.
4. The method precisely computes the contribution of each feature word to the text's idea.
5. The contributions of different feature words to the text's idea are computed with higher accuracy.
6. The method provides a good theoretical basis for subsequent text clustering.
7. The method handles the problems of polysemy and synonymy.
8. The method computes the similarity between two texts from the angle of semantic analysis, which agrees better with empirical judgments.
Description of the drawings
Fig. 1: Structural flow chart of the improved text similarity solving algorithm based on semantic analysis.
Fig. 2: Flow chart of Chinese text preprocessing.
Fig. 3: Diagram of the n-gram segmentation method.
Specific embodiment
To solve the differing importance of different feature words to a text, the problems of polysemy and synonymy, and the accuracy of text similarity computation, the present invention is described in detail with reference to Figs. 1-3. Its specific implementation steps are as follows:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2). The specific process is as follows:
Combine word segmentation with stop-word removal; the Chinese text preprocessing flow is shown in Fig. 2.
The segmentation method used here is a Chinese word segmentation algorithm based on information theory. Its segmentation and stop-word removal steps are as follows:
Step 1.1: Remove stop words from the texts (W1, W2) using a stop-word list.
Step 1.2: Using the segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries. Specifically:
Scan the character string to be segmented from start to end, looking up matches in the system dictionary; whenever a dictionary word is encountered, it is marked as identified; characters with no dictionary match are split off as single-character words; this continues until the character string is exhausted.
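As an illustration, this scan can be sketched in Python; the toy dictionary and the greedy longest-first matching order are assumptions, since the text does not fix how competing matches are resolved:

def dictionary_scan(sentence, dictionary, max_word_len=4):
    """Split `sentence` into dictionary words, splitting off single
    characters whenever no dictionary entry matches."""
    tokens, i = [], 0
    while i < len(sentence):
        # try the longest candidate first, down to a single character
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(dictionary_scan("研究生命起源", {"研究", "研究生", "生命", "起源"}))
# ['研究生', '命', '起源'] - greedy matching can err, which is exactly
# why Steps 1.3-1.5 weigh all candidate segmentations against each other.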
Step 1.3: According to probability statistics, expand the sentence to be segmented into a network structure, which yields the n possible sentence segmentations. Each sequence of nodes of this structure is defined in turn as S M_1 M_2 M_3 M_4 M_5 E; the structure is shown in Fig. 3.
Step 1.4: Using the information-theoretic method, assign a weight to each edge of the above network structure. The calculation is as follows:
From the dictionary words matched against the segmentation dictionary and the unmatched single characters, the i-th path contains n_i words, so the word counts of the n paths form the set (n_1, n_2, ..., n_n).
Obtain min(.) = min(n_1, n_2, ..., n_n).
Among the remaining (n - m) retained paths, solve for the weight of each adjacent edge.
In the statistical corpus, compute the information content X(C_i) of each word and then the co-occurrence information X(C_i, C_(i+1)) of adjacent words on a path:
X(C_i) = |x(C_i)_1 - x(C_i)_2|
where x(C_i)_1 is the information content of the word C_i in the text corpus and x(C_i)_2 is the information content of the texts containing C_i:
x(C_i)_1 = -p(C_i)_1 ln p(C_i)_1
where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i;
x(C_i)_2 = -p(C_i)_2 ln p(C_i)_2
where p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus.
Similarly, X(C_i, C_(i+1)) = |x(C_i, C_(i+1))_1 - x(C_i, C_(i+1))_2|
where x(C_i, C_(i+1))_1 is the co-occurrence information of the adjacent words (C_i, C_(i+1)) in the text corpus and x(C_i, C_(i+1))_2 is the information content of the texts in which they co-occur:
x(C_i, C_(i+1))_1 = -p(C_i, C_(i+1))_1 ln p(C_i, C_(i+1))_1
where p(C_i, C_(i+1))_1 is the co-occurrence probability of (C_i, C_(i+1)) in the text corpus and m is the number of texts in which the pair co-occurs;
x(C_i, C_(i+1))_2 = -p(C_i, C_(i+1))_2 ln p(C_i, C_(i+1))_2
where p(C_i, C_(i+1))_2 is the proportion of texts in which the adjacent words (C_i, C_(i+1)) co-occur.
Summing up, the weight of each adjacent edge is
w(C_i, C_(i+1)) = X(C_i) + X(C_(i+1)) - 2X(C_i, C_(i+1))
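These quantities can be sketched in Python as follows; the corpus is assumed to be a list of tokenized texts, and estimating p(.)_1 as a token-level relative frequency and p(.)_2 as a document-level proportion is one plausible reading of the definitions above:

import math

def info(p):
    """Shannon self-information term -p*ln(p); 0 when p == 0."""
    return -p * math.log(p) if p > 0 else 0.0

def word_info(word, corpus):
    total = sum(len(doc) for doc in corpus)
    p1 = sum(doc.count(word) for doc in corpus) / total        # p(C_i)_1
    p2 = sum(word in doc for doc in corpus) / len(corpus)      # p(C_i)_2
    return abs(info(p1) - info(p2))                            # X(C_i)

def pair_info(w1, w2, corpus):
    total = sum(max(len(doc) - 1, 0) for doc in corpus)
    adj = sum(sum(1 for a, b in zip(doc, doc[1:]) if (a, b) == (w1, w2))
              for doc in corpus)                               # adjacency count
    cotexts = sum((w1 in doc and w2 in doc) for doc in corpus)
    return abs(info(adj / total) - info(cotexts / len(corpus)))  # X(C_i, C_(i+1))

def edge_weight(w1, w2, corpus):
    # w(C_i, C_(i+1)) = X(C_i) + X(C_(i+1)) - 2 X(C_i, C_(i+1))
    return word_info(w1, corpus) + word_info(w2, corpus) - 2 * pair_info(w1, w2, corpus)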
Step 1.5: Find the path of maximum weight; it gives the segmentation result of the sentence. The calculation is as follows:
There are n paths, each of a different length; suppose the path set is (L_1, L_2, ..., L_n).
Through the operation of keeping the paths with the fewest words, m paths are eliminated (m < n), leaving (n - m) paths; suppose their path-length set is (S_1, S_2, ..., S_(n-m)).
The weight of each remaining path is then the sum of the weights of its edges, each computed one by one according to Step 1.4, divided by the path length S_j of the j-th path among the remaining (n - m) paths.
The path of maximum weight is taken as the segmentation.
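A sketch of the path selection, under the assumption suggested by the symbol definitions (the printed formula is not reproduced in this text) that a path's weight is the mean of its edge weights:

def best_segmentation(paths, corpus):
    """paths: candidate segmentations (token lists) left after the
    fewest-words filter; edge_weight() is the Step 1.4 sketch above."""
    def score(path):
        edges = list(zip(path, path[1:]))
        if not edges:
            return 0.0
        return sum(edge_weight(a, b, corpus) for a, b in edges) / len(edges)
    return max(paths, key=score)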
Step 2: Using the information-theoretic method, compute each word's weight w_i in the text. The calculation is as follows:
Based on information theory, the word-frequency information of a word follows the same -p ln p form as in Step 1.4: it is the information the word carries in a document by virtue of its word frequency, where p(c_1), p(c_2) are the probabilities of the words c_1, c_2 in the text.
Likewise, the document-frequency information is the information the word carries in the document library by virtue of its document frequency, where n_1, n_2 are the numbers of documents containing c_1, c_2 respectively, and N is the total number of documents in the library.
In sum, the information-theoretic term-weight function is obtained from these two quantities and normalized.
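A sketch of the Step 2 weight; since the printed formulas are not reproduced here, the word-frequency and document-frequency information terms are assumed to follow the -p*ln(p) form of Step 1.4 and to combine additively before normalization:

def term_weights(text_tokens, corpus):
    """Information-theoretic in-text weights; info() is the -p*ln(p)
    helper from the Step 1.4 sketch above."""
    raw = {}
    for w in set(text_tokens):
        p_tf = text_tokens.count(w) / len(text_tokens)        # in-text prob.
        p_df = sum(w in doc for doc in corpus) / len(corpus)  # doc. frequency
        raw[w] = info(p_tf) + info(p_df)
    norm = sum(raw.values()) or 1.0
    return {w: v / norm for w, v in raw.items()}              # weights sum to 1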
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text. The calculation is as follows:
Survey data show that the earlier a feature word appears in a text, the better it represents the text's central idea, and the more often a feature word occurs in the text, the more representative it is of the text's meaning. With the in-text weights obtained in Step 2, take the top n feature words and assign position weights to them.
Each feature word appears at least once in the text; the text feature words c_(1,2)i form a position vector.
From the viewpoint of part of speech, nouns typically act as subject or object, verbs as predicate, and adjectives and adverbs as modifiers. Because of these differences, words of different parts of speech differ in their power to express the content of a text or sentence. Through surveys of experts in the relevant fields, in-text weight coefficients a_i for parts of speech such as nouns, verbs, adjectives, and adverbs can be obtained.
The combined position-and-POS weighting function of each feature word c_i is then
w(c_i) = a_i * SUM_(h=1..k) q_h n_h
where k is the number of paragraphs in which the feature word c_i appears in the text, q_h is the contribution of the h-th paragraph containing c_i to the text's idea, a_i is the contribution of the part of speech to the text's idea (the values of a_i and q_h are determined through surveys by experts in the corresponding text field), and n_h is the number of times c_i appears in paragraph h.
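A sketch of this weighting function; the coefficients a_i and q_h are expert-supplied, so the values below are placeholders only:

POS_COEFF = {"noun": 0.9, "verb": 0.7, "adj": 0.5, "adv": 0.4}  # a_i (assumed)

def position_pos_weight(word, pos, paragraphs, para_contrib):
    """paragraphs: list of token lists; para_contrib[h]: expert value q_h.
    Paragraphs not containing the word contribute 0 via count() == 0."""
    a_i = POS_COEFF.get(pos, 0.3)
    return a_i * sum(q_h * para.count(word)                    # q_h * n_h
                     for para, q_h in zip(paragraphs, para_contrib))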
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text. The calculation is as follows:
The objective function for extracting the feature words of texts (W1, W2) is constructed from the three weights above.
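The printed objective function is not reproduced in this text; one plausible reading of "considering the three factors together" is the product of the Step 2 information weight and the Step 3 position-and-POS weight, as in this sketch, which keeps the top n words by the combined score:

def extract_feature_words(text_tokens, pos_of, paragraphs, para_contrib,
                          corpus, n=20):
    """pos_of: word -> POS tag mapping (a hypothetical input); reuses
    term_weights() and position_pos_weight() from the sketches above."""
    w_info = term_weights(text_tokens, corpus)                 # Step 2
    def combined(word):
        return w_info[word] * position_pos_weight(word, pos_of.get(word, "noun"),
                                                  paragraphs, para_contrib)
    return sorted(set(text_tokens), key=combined, reverse=True)[:n]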
Step 5: Using the word similarity sim(c_1i, c_1(i+1)), reduce the dimensionality of the feature-word sets obtained above. The similarity sim(g_1, g_2) between concepts must be computed first. The calculation is as follows:
Using the HowNet database, suppose the feature words (c_1i, c_1(i+1)) correspond to two concept sets. Compare these concepts pairwise and find the two concepts of maximum similarity; that maximum is taken as the similarity sim(c_1i, c_1(i+1)) between the feature words.
Step 5.1) Compute the similarity sim(g_1, g_2) between concepts with the information-theoretic method.
Information-content similarity measures similarity by the amount of information the concepts contain. A concept inherits from its ancestor nodes and refines them, so the information shared by two concepts can be measured by the information content of their common ancestor.
Solve for the information content I(pr) of the common parent in the tree hierarchy.
From Fig. 2, obtain the probability of the common parent of the two ontology concepts (g_1, g_2) at each level of the tree hierarchy:
p(pr) = (p_1(pr), p_2(pr), ..., p_k(pr))
where k is the number of levels of the common parent of the two ontology concepts (g_1, g_2) in the tree hierarchy, and E[p(pr)] is the probability mean of the common parent in the tree hierarchy.
Solve the information contents I(g_1), I(g_2) of the two ontology concepts (g_1, g_2) in the tree hierarchy. The solution process is as follows:
In the same way, from Fig. 2, obtain the per-level probabilities of the two ontology concepts (g_1, g_2) in the tree hierarchy:
p(g_1) = (p_1(g_1), p_2(g_1), ..., p_i(g_1))
p(g_2) = (p_1(g_2), p_2(g_2), ..., p_j(g_2))
where i is the number of levels of the ontology concept g_1 in the tree hierarchy and, likewise, j is the number of levels of g_2; E[p(g_1)], E[p(g_2)] are the probability means of the two ontology concepts (g_1, g_2) in the tree hierarchy.
From these, the information contents I(g_1), I(g_2) of the two ontology concepts in the tree hierarchy are obtained.
From the information contents, the semantic similarity sim(g_1, g_2) between the two ontology concepts follows. The calculation is as follows:
The information content of the common parent of the two ontology concepts (g_1, g_2) represents exactly the information the two concepts share. From experience, the semantic similarity sim(g_1, g_2) between the two ontology concepts (g_1, g_2) can then be obtained.
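The exact I(.) and sim(.) expressions are not reproduced in this text; the sketch below substitutes the classic information-content similarity of Lin, which is built from the same ingredients: the information content of the two concepts and of their common parent. The lcp function and the concept probabilities are assumed inputs, e.g. estimated over the HowNet hierarchy:

import math

def information_content(concept, concept_prob):
    """concept_prob: corpus-estimated probability of each concept node."""
    return -math.log(concept_prob[concept])

def concept_similarity(g1, g2, common_parent, concept_prob):
    # Lin similarity: shared information relative to total information
    i_pr = information_content(common_parent, concept_prob)
    denom = (information_content(g1, concept_prob) +
             information_content(g2, concept_prob))
    return 2 * i_pr / denom if denom else 1.0

def word_similarity(concepts1, concepts2, lcp, concept_prob):
    # max pairwise concept similarity, as prescribed for sim(c_1i, c_1(i+1));
    # lcp(a, b) returns the lowest common parent of two concepts (assumed)
    return max(concept_similarity(a, b, lcp(a, b), concept_prob)
               for a in concepts1 for b in concepts2)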
Step 5.2) From Step 5.1, the concept similarity matrix can be drawn up.
Dimensionality reduction is then applied to the feature-word set, using the rule
sim(c_1i, c_1(i+1)) >= α
When the pairwise similarity of two feature words meets the set threshold α, they are merged into a single word, namely one of the two most-similar words, and its weight value must be redistributed accordingly.
In the same way, the dimension-reduced vector of the feature-word set of text 2 is obtained.
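A sketch of this reduction; since the redistribution formula is not reproduced here, it is assumed that the lower-weighted word of a merged pair folds its weight into the higher-weighted one. The sim argument would be the word similarity of Step 5.1:

def reduce_dimensions(words, weights, sim, alpha=0.8):
    """Merge feature words whose pairwise similarity reaches alpha;
    returns the kept words and the redistributed weights."""
    words = sorted(words, key=lambda w: weights[w], reverse=True)
    kept = []
    for w in words:
        match = next((k for k in kept if sim(k, w) >= alpha), None)
        if match is not None:
            weights[match] += weights.pop(w)   # redistribute merged weight
        else:
            kept.append(w)
    return kept, weights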
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient.
From the feature-word weights computed in Step 4, experts in the relevant field select the top m keywords (here m < 20), so that each of the texts (W1, W2) has a corresponding feature-word vector.
The average weight function of the feature words of text W1 is computed, and likewise the average weight function of the feature words of text W2.
From the Pearson correlation coefficient, the text similarity sim(W1, W2) between the texts (W1, W2) is then obtained.
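A sketch of the final step: the Pearson correlation of the two feature-weight vectors, aligned over the union of their keywords (the alignment rule is an assumption):

import math

def pearson_similarity(weights1, weights2):
    """sim(W1, W2) as the Pearson correlation of two weight dicts."""
    keys = sorted(set(weights1) | set(weights2))
    x = [weights1.get(k, 0.0) for k in keys]
    y = [weights2.get(k, 0.0) for k in keys]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0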
The pseudo-code of the improved text similarity solving algorithm based on semantic analysis:
Input: the texts to be compared (W1, W2).
Output: the similarity sim(W1, W2) between the texts (W1, W2).

Claims (4)

1. An improved text similarity solving algorithm based on semantic analysis, relating to the field of semantic web technology and in particular to an improved text similarity solving algorithm based on semantic analysis, characterized in that it comprises the following steps:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2). The specific process is as follows:
Combine word segmentation with stop-word removal; the Chinese text preprocessing flow is shown in Fig. 2.
The segmentation method used here is a Chinese word segmentation algorithm based on information theory; its segmentation and stop-word removal steps are as follows:
Step 1.1: Remove stop words from the texts (W1, W2) using a stop-word list.
Step 1.2: Using the segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries. Specifically: scan the character string to be segmented from start to end, looking up matches in the system dictionary; whenever a dictionary word is encountered, it is marked as identified; characters with no dictionary match are split off as single-character words; this continues until the character string is exhausted.
Step 1.3: According to probability statistics, expand the sentence to be segmented into a network structure, which yields the n possible sentence segmentations; each sequence of nodes of this structure is defined in turn, as shown in Fig. 3.
Step 1.4: Using the information-theoretic method, assign a weight to each edge of the above network structure. The calculation is as follows:
From the dictionary words matched against the segmentation dictionary and the unmatched single characters, the i-th path contains a certain number of words; the word counts of the n paths form a set.
Among the remaining (n - m) retained paths, solve for the weight of each adjacent edge.
In the statistical corpus, compute the information content of each word and then the co-occurrence information of adjacent words on a path, where x(C_i)_1 is the information content of the word C_i in the text corpus and x(C_i)_2 is the information content of the texts containing C_i; p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i; p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus; similarly, x(C_i, C_(i+1))_1 is the co-occurrence information of the adjacent words (C_i, C_(i+1)) in the text corpus and x(C_i, C_(i+1))_2 is the information content of the texts in which they co-occur; p(C_i, C_(i+1))_1 is the co-occurrence probability of the pair in the text corpus and m is the number of texts in which the pair co-occurs; p(C_i, C_(i+1))_2 is the proportion of texts in which the adjacent words co-occur.
Summing up, the weight of each adjacent edge is obtained.
Step 1.5: Find the path of maximum weight; it gives the segmentation result of the sentence. The calculation is as follows:
There are n paths, each of a different length; suppose the path set is given.
Through the operation of keeping the paths with the fewest words, m paths are eliminated (m < n), leaving (n - m) paths with a given path-length set.
The weight of each remaining path is then the sum of the weights of its edges, each computed one by one according to Step 1.4, divided by the length of that path.
The path of maximum weight is taken as the segmentation.
Step 2: Using the information-theoretic method, compute each word's weight in the text. The calculation is as follows:
Based on information theory, the word-frequency information is computed, namely the information a word carries in a document by virtue of its word frequency, from the probabilities of the words in the text.
Based on information theory, the document-frequency information is computed, namely the information a word carries in the document library by virtue of its document frequency, from the numbers of documents containing the words and the total number N of documents in the library.
In sum, the information-theoretic term-weight function is obtained and normalized.
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text.
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text.
Step 5: Using the word similarity, reduce the dimensionality of the feature-word sets obtained above.
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient. The calculation is as follows:
From the feature-word weights computed in Step 4, experts in the relevant field select the top m keywords (here m < 20), so that each of the texts (W1, W2) has a corresponding feature-word vector.
The average weight function of the feature words of text W1 is computed, and likewise that of text W2.
From the Pearson correlation coefficient, the text similarity sim(W1, W2) between the texts (W1, W2) is then obtained.
2. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 3 above is as follows:
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text. The calculation is as follows:
Survey data show that the earlier a feature word appears in a text, the better it represents the text's central idea, and the more often a feature word occurs in the text, the more representative it is of the text's meaning. With the in-text weights obtained in Step 2, take the top n feature words and assign position weights to them.
Each feature word appears at least once in the text; the text feature words form a position vector.
From the viewpoint of part of speech, nouns typically act as subject or object, verbs as predicate, and adjectives and adverbs as modifiers; because of these differences, words of different parts of speech differ in their power to express the content of a text or sentence. Through surveys of experts in the relevant fields, in-text weight coefficients for parts of speech such as nouns, verbs, adjectives, and adverbs can be obtained.
The combined position-and-POS weighting function of each feature word is then given, where k is the number of paragraphs in which the feature word appears in the text, q_h is the contribution of the h-th paragraph containing the feature word to the text's idea, a_i is the contribution of the part of speech to the text's idea (the values of a_i and q_h being determined through surveys by experts in the corresponding text field), and n_h is the number of times the feature word appears in paragraph h.
3. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 4 above is as follows:
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text. The calculation is as follows:
The objective function for extracting the feature words of texts (W1, W2) is constructed from the three weights above.
4. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 5 above is as follows:
Step 5: Using the word similarity, reduce the dimensionality of the feature-word set obtained above. The similarity between concepts must be computed first. The calculation is as follows:
Using the HowNet database, suppose the feature words correspond to two concept sets. Compare these concepts pairwise and find the two concepts of maximum similarity; that maximum is taken as the similarity between the feature words.
Step 5.1) Compute the similarity between concepts with the information-theoretic method.
Information-content similarity measures similarity by the amount of information the concepts contain; a concept inherits from its ancestor nodes and refines them, so the information shared by two concepts can be measured by the information content of their common ancestor.
Solve for the information content of the common parent in the tree hierarchy.
From Fig. 2, obtain the probability of the common parent of the two ontology concepts at each level of the tree hierarchy, where k is the number of levels of the common parent of the two ontology concepts in the tree hierarchy, and E[p(pr)] is the probability mean of the common parent in the tree hierarchy.
Solve the information contents of the two ontology concepts in the tree hierarchy. The solution process is as follows:
In the same way, from Fig. 2, obtain the per-level probabilities of the two ontology concepts in the tree hierarchy, where i is the number of levels of the first ontology concept in the tree hierarchy and, likewise, j is the number of levels of the second; E[p(g_1)], E[p(g_2)] are the probability means of the two ontology concepts in the tree hierarchy.
From these, the information contents of the two ontology concepts in the tree hierarchy are obtained.
From the information contents, the semantic similarity between the two ontology concepts follows. The calculation is as follows:
The information content of the common parent of the two ontology concepts represents exactly the information the two concepts share; from experience, the semantic similarity between the two ontology concepts can then be obtained.
Step 5.2) From Step 5.1, the concept similarity matrix can be drawn up.
Dimensionality reduction is then applied to the feature-word set: when the pairwise similarity of two feature words meets the set threshold, they are merged into a single word, namely one of the two most-similar words, and its weight value must be redistributed accordingly.
In the same way, the dimension-reduced vector of the feature-word set of text 2 is obtained.
CN201610864853.9A 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis Pending CN106610951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610864853.9A CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610864853.9A CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Publications (1)

Publication Number Publication Date
CN106610951A true CN106610951A (en) 2017-05-03

Family

ID=58615303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610864853.9A Pending CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Country Status (1)

Country Link
CN (1) CN106610951A (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BECK_ZHOU: "Chinese word segmentation language models and dynamic programming", CSDN blog, HTTPS://BLOG.CSDN.BET/ZHOUBL668/ARTICLE/DETAILS/68964 *
刘景方 et al.: "Research on an improved semantic similarity algorithm for ontology concepts", Journal of Wuhan University of Technology (《武汉理工大学学报》) *
蒋健洪 et al.: "Research and application of a Chinese word segmentation model combining dictionary and statistical methods", Computer Engineering and Design (《计算机工程与设计》) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291808A (en) * 2017-05-16 2017-10-24 南京邮电大学 It is a kind of that big data sorting technique is manufactured based on semantic cloud
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
CN109697452A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Processing method, processing unit and the processing system of data object
CN107943965A (en) * 2017-11-27 2018-04-20 福建中金在线信息科技有限公司 Similar article search method and device
CN108153730A (en) * 2017-12-25 2018-06-12 北京奇艺世纪科技有限公司 A kind of polysemant term vector training method and device
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment
CN110232185A (en) * 2019-01-07 2019-09-13 华南理工大学 Towards financial industry software test knowledge based map semantic similarity calculation method
CN110232185B (en) * 2019-01-07 2023-09-19 华南理工大学 Knowledge graph semantic similarity-based computing method for financial industry software testing
CN110222192A (en) * 2019-05-20 2019-09-10 国网电子商务有限公司 Corpus method for building up and device
CN110309263A (en) * 2019-06-06 2019-10-08 中国人民解放军军事科学院军事科学信息研究中心 A kind of semantic-based working attributes content of text judgement method for confliction detection and device
CN110705248A (en) * 2019-10-09 2020-01-17 厦门今立方科技有限公司 Text similarity calculation method, terminal device and storage medium
CN110874392A (en) * 2019-11-20 2020-03-10 中山大学 Text network information fusion embedding method based on deep bidirectional attention mechanism
CN110874392B (en) * 2019-11-20 2023-10-24 中山大学 Text network information fusion embedding method based on depth bidirectional attention mechanism
CN112231439A (en) * 2020-09-27 2021-01-15 中国人民解放军军事科学院军事科学信息研究中心 Text semantic analysis and characteristic value extraction method
CN112348535A (en) * 2020-11-04 2021-02-09 新华中经信用管理有限公司 Traceability application method and system based on block chain technology
CN112348535B (en) * 2020-11-04 2023-09-12 新华中经信用管理有限公司 Traceability application method and system based on blockchain technology
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN115858765A (en) * 2023-01-08 2023-03-28 山东谷联网络技术有限公司 Automatic grading intelligent examination platform based on data contrast analysis
CN115905506A (en) * 2023-02-21 2023-04-04 江西省科技事务中心 Basic theory file pushing method and system, computer and readable storage medium
CN117725146A (en) * 2023-12-22 2024-03-19 中信出版集团股份有限公司 Network information processing system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN106611041A (en) New text similarity solution method
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN106570112A (en) Improved ant colony algorithm-based text clustering realization method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
Rahimi et al. An overview on extractive text summarization
CN106528621A (en) Improved density text clustering algorithm
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN106610952A (en) Mixed text feature word extraction method
CN106610953A (en) Method for solving text similarity based on Gini index
CN110705247A (en) Based on x2-C text similarity calculation method
CN106610954A (en) Text feature word extraction method based on statistics
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN106610949A (en) Text feature extraction method based on semantic analysis
Gupta Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents
Al-Azzawy et al. Arabic words clustering by using K-means algorithm
CN111428031A (en) Graph model filtering method fusing shallow semantic information
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN107092595A (en) New keyword extraction techniques
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN107102986A (en) Multi-threaded keyword extraction techniques in document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170503