CN106598940A - Text similarity solution algorithm based on global optimization of keyword quality - Google Patents
- Publication number
- CN106598940A CN106598940A CN201610939853.0A CN201610939853A CN106598940A CN 106598940 A CN106598940 A CN 106598940A CN 201610939853 A CN201610939853 A CN 201610939853A CN 106598940 A CN106598940 A CN 106598940A
- Authority
- CN
- China
- Prior art keywords
- word
- text
- similarity
- weight
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention provides a text similarity computation algorithm based on global optimization of keyword quality. The algorithm comprises the following steps: perform word segmentation and stop-word removal on a text; jointly consider a keyword's weight, density, depth, part of speech, word position, relevance to core vocabulary, and other factors in the text corpus; regard each keyword (formula as shown in the specification) as a multi-dimensional vector; reduce the dimensionality of the keyword set in the multi-dimensional space under constraint conditions; extract and compute the similarity (formula as shown in the specification) of the two highest-weight keywords in the two texts' keyword sets; and set a threshold condition to extract the feature vocabulary vectors of the two texts. Compared with the traditional term frequency-inverse document frequency method, the algorithm is more accurate, and it overcomes the drawback that the information gain method can extract features for only one category. Its constraint conditions are precise enough to compute the contribution of different words to a text accurately, so the individual quality of each keyword is guaranteed, the overall quality of the keyword set is optimized globally, and the similarity computed between texts is more accurate.
Description
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to a text similarity computation algorithm based on global optimization of keyword quality.
Background technology
Text similarity computation applies to text classification, text clustering, information retrieval, question answering systems, duplicate-webpage removal, and many other fields. At present, many similarity computation methods have been proposed and put into practice in different fields, such as the vector space model, the Boolean model, string-matching models such as the latent semantic indexing statistical model, and models based on semantic understanding. More and more researchers now study text similarity computation, because an effective computation can improve retrieval precision, help detect article plagiarism, and save storage space. Many problems in the field remain to be solved, especially for Chinese text similarity, where research has not yet reached a satisfactory level. For natural language understanding by computer, Chinese is harder to process than English: unlike English, Chinese words have no explicit separators, a single meaning is often expressed by several consecutive characters, and differences in context easily cause ambiguity. To improve the effectiveness and accuracy of text similarity computation, the present invention provides a text similarity algorithm based on global optimization of keyword quality.
The content of the invention
To address the insufficient effectiveness and accuracy of text similarity computation, the invention provides a text similarity algorithm based on global optimization of keyword quality.
To solve the above problems, the present invention is realized through the following technical solution:
Step 1: Apply a Chinese word segmentation algorithm to the two texts (W1, W2).
Step 2: Remove stop words from the text vocabulary according to a stop-word list.
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn); regard the contribution of each keyword Ci to the text as a multi-dimensional vector.
Step 4: Apply constraint conditions to reduce the dimensionality of the keyword feature set in the multi-dimensional space, and extract the optimized text keyword sets J1 and J2.
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2.
Step 6: Compute the pairwise similarity between the words of J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs that satisfy the condition.
The present invention has the following advantages:
1. The feature vocabulary set it obtains is more accurate than that of the traditional term frequency-inverse document frequency method.
2. It overcomes the drawback that the information gain method is only suited to extracting text features of a single category.
3. The feature words selected by this algorithm are more valuable.
4. It accurately computes the contribution of different words to the idea of the text within the feature vocabulary.
5. Its conditions are stricter than those of earlier methods, so its results are more precise.
6. It not only guarantees the individual quality of each keyword but also optimizes the overall quality of the keyword set globally.
7. The similarity results between texts are more accurate and better match empirical values.
Description of the drawings
Fig. 1: Flow chart of the text similarity algorithm based on global optimization of keyword quality.
Fig. 2: Illustration of the n-gram segmentation network.
Fig. 3: Flow chart of the Chinese text preprocessing process.
Specific embodiment
To address the insufficient effectiveness and accuracy of text similarity computation, the present invention is described in detail with reference to Fig. 1. The specific implementation steps are as follows:
Step 1: Apply Chinese word segmentation to the two texts (W1, W2). The segmentation process is as follows:
Step 1.1: Scan the character string to be segmented once, looking up candidates in the system's segmentation dictionary. Every substring found in the dictionary is identified as a word; if a character has no match in the dictionary, it is split off as a single-character word. Continue until the character string is exhausted.
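The dictionary scan of Step 1.1 can be sketched as forward maximum matching. The dictionary and sentence below are illustrative assumptions, not from the patent:

```python
# A minimal sketch of dictionary-based segmentation (Step 1.1): at each position,
# emit the longest dictionary word, falling back to a single character.

def forward_max_match(sentence, dictionary, max_len=4):
    """Scan the string once; longest dictionary match wins, else a single char."""
    words = []
    i = 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"文本", "相似", "相似度", "计算"}
print(forward_max_match("文本相似度计算", dictionary))  # → ['文本', '相似度', '计算']
```

Note that plain maximum matching is greedy; the patent resolves ambiguity in Steps 1.2-1.4 by scoring all candidate paths instead.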
Step 1.2: Based on probability statistics, split the sentence to be segmented into a network (lattice) structure, yielding the n possible sentence compositions. The sequential nodes of this structure are defined in order as S, M1, M2, M3, M4, M5, E; the structure is shown in Fig. 2.
Step 1.3: Using information theory, assign a weight to each edge of the above network. The computation is as follows:
According to the segmentation dictionary, count the matched dictionary words and the unmatched single characters: the i-th path contains n_i words, so the word counts of the n paths form the set (n1, n2, ..., nn).
Obtain min(·) = min(n1, n2, ..., nn).
For the remaining (n − m) paths, compute the weight of every adjacent-word edge.
In the statistical corpus, compute the information quantity X(C_i) of each word, and then the co-occurrence information quantity X(C_i, C_{i+1}) of adjacent words along a path. The formulas are:

X(C_i) = |x(C_i)_1 − x(C_i)_2|

where x(C_i)_1 is the information quantity of word C_i in the text corpus and x(C_i)_2 is the information quantity of the texts containing C_i:

x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1

where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i;

x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2

where p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus.

Similarly, X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 − x(C_i, C_{i+1})_2|

where x(C_i, C_{i+1})_1 is the co-occurrence information quantity of the pair (C_i, C_{i+1}) in the text corpus and x(C_i, C_{i+1})_2 is the information quantity of the texts in which the adjacent words co-occur:

x(C_i, C_{i+1})_1 = −p(C_i, C_{i+1})_1 ln p(C_i, C_{i+1})_1

where p(C_i, C_{i+1})_1 is the co-occurrence probability of (C_i, C_{i+1}) in the corpus and m is the number of corpus texts in which they co-occur;

x(C_i, C_{i+1})_2 = −p(C_i, C_{i+1})_2 ln p(C_i, C_{i+1})_2

where p(C_i, C_{i+1})_2 is the proportion of texts in which the adjacent words (C_i, C_{i+1}) co-occur.

In summary, the weight of each adjacent-word edge is

w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2 X(C_i, C_{i+1})
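The edge-weight formulas of Step 1.3 can be sketched as follows. The probabilities are taken directly as inputs here (the patent derives them from the corpus counts n, m, and N); all numbers are invented for illustration:

```python
import math

# Sketch of Step 1.3: word information quantities and the adjacent-word edge weight.

def info(p):
    """Self-information-style quantity -p ln p (0 when p == 0)."""
    return -p * math.log(p) if p > 0 else 0.0

def word_quantity(p_corpus, p_docs):
    """X(C_i) = |x(C_i)_1 - x(C_i)_2| from the two probability estimates."""
    return abs(info(p_corpus) - info(p_docs))

def edge_weight(Xi, Xj, Xij):
    """w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2 X(C_i, C_{i+1})."""
    return Xi + Xj - 2 * Xij

Xi = word_quantity(0.02, 0.10)   # illustrative corpus / document-frequency probabilities
Xj = word_quantity(0.01, 0.05)
Xij = word_quantity(0.005, 0.02)
print(edge_weight(Xi, Xj, Xij))
```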
Step 1.4: Find the path of maximum weight; it gives the segmentation result of the sentence. The computation is as follows:
There are n paths of different lengths; let the path set be (L1, L2, ..., Ln).
After the minimum-word-count operation above, m paths are eliminated (m < n), leaving (n − m) paths, whose lengths form a set.
The weight of each remaining path then combines the weight values of its edges, computed one by one as in Step 1.3, where S_j is the length of the j-th remaining path (formula as shown in the specification).
The path of maximum weight is selected (formula as shown in the specification).
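Selecting the maximum-weight path in Step 1.4 amounts to a best-path search over the segmentation lattice. The sketch below uses a plain additive dynamic program over an invented edge list (the patent's exact formula, including the path-length term S_j, is not reproduced here):

```python
# Sketch of Step 1.4: pick the segmentation path of maximum total edge weight.
# Edges are (start, end, weight) over character positions of the sentence.

def best_path(n_chars, edges):
    """DP over the lattice: best[i] = max-weight path covering characters [0, i)."""
    best = [float("-inf")] * (n_chars + 1)
    back = [None] * (n_chars + 1)
    best[0] = 0.0
    for i in range(n_chars):
        if best[i] == float("-inf"):
            continue
        for (s, e, w) in edges:
            if s == i and best[i] + w > best[e]:
                best[e] = best[i] + w
                back[e] = s
    # Recover the word boundaries of the winning path.
    cuts, i = [], n_chars
    while i > 0:
        cuts.append((back[i], i))
        i = back[i]
    return best[n_chars], list(reversed(cuts))

edges = [(0, 2, 1.5), (0, 1, 0.2), (1, 2, 0.3), (2, 4, 2.0), (2, 3, 0.4), (3, 4, 0.9)]
print(best_path(4, edges))  # → (3.5, [(0, 2), (2, 4)])
```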
Step 2: Remove stop words from the text vocabulary according to the stop-word list, as follows:
Stop words are words that occur frequently in a text but contribute little to characterizing it. Removing them means comparing each feature term against the stop-word list and deleting the term if it matches.
Combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
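The stop-word comparison of Step 2 is a simple set-membership filter. The stop list below is an illustrative fragment, not the list used in the patent:

```python
# Sketch of Step 2: drop tokens that appear in the stop-word list.

STOP_WORDS = {"的", "了", "和", "是", "在"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["文本", "的", "相似度", "是", "计算"]))  # → ['文本', '相似度', '计算']
```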
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn). The contribution of each keyword Ci to the text is regarded as a multi-dimensional object vector, computed as follows:
Object vector: (formula as shown in the specification)
Here f_i is the weighting function of keyword C_i in the text corpus (formula as shown in the specification), in which the occurrence count of C_i in the text, the total length L_total of the text, the total number N_total of texts in the corpus, the information quantity I_j(C_i) of C_i in the j-th corpus text, and the average information quantity of C_i in the corpus all appear.
In the object vector, s_i is the maximum depth of the ontology concept corresponding to C_i in the ontology network structure.
In the object vector, m_i is the maximum density of the ontology concept corresponding to C_i in the ontology network structure.
In the object vector, x_i is the part-of-speech weight of C_i: by experience, the weights of nouns, verbs, adjectives, and adverbs are β1, β2, β3, and β4 respectively, with β1 > β2 > β3 > β4.
In the object vector, w_i1 is the position weight of the first occurrence of C_i in the text; statistical investigation yields a series of position weight values (α1, α2, ..., αn), with the weight of the title largest and the first paragraph next.
In the object vector, n_1 is the number of occurrences of C_i in the paragraph of its first appearance in the text.
The final component of the object vector is the co-occurrence probability of C_i with the maximum-weight word in the text.
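The factors of Step 3 can be gathered into one record per keyword. The field names below paraphrase the patent's factors and the values are invented:

```python
# Sketch of the Step 3 object vector: one record bundling a keyword's quality factors.
from dataclasses import dataclass

@dataclass
class KeywordVector:
    weight: float          # f_i: corpus weighting function
    depth: float           # s_i: max ontology-concept depth
    density: float         # m_i: max ontology-concept density
    pos_weight: float      # x_i: part-of-speech weight (noun > verb > adj > adverb)
    position: float        # w_i1: first-occurrence position weight
    first_par_count: int   # n_1: occurrences in the paragraph of first appearance
    cooccur: float         # co-occurrence probability with the max-weight word

v = KeywordVector(0.8, 3, 0.4, 0.9, 1.0, 2, 0.35)
print(v.weight)  # → 0.8
```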
Step 4: Using constraint conditions, reduce the dimensionality of the keyword feature set in the multi-dimensional space, and extract the optimized text keyword sets J1 and J2. The computation is as follows:
The keyword vectors above are mapped into the multi-dimensional space, and keywords that satisfy the following constraints are merged in order of decreasing weight:
distance between two points: d < γ;
similarity between two points: (formula as shown in the specification).
γ and α are thresholds set by experts; suitable values can be obtained by experimental testing.
The keyword and weight after merging are those of the maximum-weight point (vector) in the merged region.
The merged keywords are sorted by weight in descending order, and a high-quality keyword feature set is extracted using the following constraint: the number of extracted words is capped by a threshold k, so that only the top k high-quality keywords are kept and redundancy is avoided.
The result is the high-quality keyword feature sets, each ordered by descending weight.
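The reduction of Step 4 can be sketched as a greedy merge: walk the keywords in descending weight and keep a point only if it is at least γ away from everything already kept, then retain the top k. Only the distance constraint is shown (the patent also applies a similarity constraint with threshold α); vectors and thresholds are illustrative:

```python
import math

# Sketch of Step 4: merge nearby keyword vectors, keep the heavier point, top-k cap.

def dist(a, b):
    """Euclidean distance between two keyword vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reduce_keywords(items, gamma=0.5, k=2):
    """items: list of (word, vector, weight). Greedy merge by descending weight."""
    kept = []
    for word, vec, w in sorted(items, key=lambda t: -t[2]):
        if all(dist(vec, v) >= gamma for _, v, _ in kept):
            kept.append((word, vec, w))
    return kept[:k]

items = [
    ("net", (0.9, 0.1), 0.9),
    ("network", (0.85, 0.15), 0.7),  # within gamma of "net": merged away
    ("graph", (0.1, 0.9), 0.6),
]
print([w for w, _, _ in reduce_keywords(items)])  # → ['net', 'graph']
```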
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2. The computation is as follows:
Each keyword is described by the vector factors defined in Step 3. The similarity between the two maximum-weight words is computed with the following formula (as shown in the specification), in which A and B are the combined weight proportions assigned to concept depth and density and to the other vector elements respectively, with in general A > B and A + B = 1, together with a distance factor (as shown in the specification).
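Step 5 combines a depth/density term (weight A) with the remaining terms (weight B), A + B = 1. The combination below is a sketch; the patent's exact factor definitions are in its unreproduced formulas:

```python
# Sketch of the Step 5 weighted combination with A > B and A + B = 1.

def word_similarity(depth_density_term, other_term, A=0.6, B=0.4):
    """Weighted mix of the concept depth/density factor and the other factors."""
    assert A > B and abs(A + B - 1.0) < 1e-12  # constraints stated in the patent
    return A * depth_density_term + B * other_term

print(word_similarity(0.8, 0.5))  # → 0.68
```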
Step 6: Compute the pairwise similarity between the words of the keyword sets J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs meeting the condition. The computation is as follows:
The keyword sets are written in matrix form (as shown in the specification). As in Step 5, the pairwise similarity between the words of the two keyword sets is computed, giving the following similarity matrix (as shown in the specification).
Words satisfying the following threshold condition on word-pair similarity are extracted (formula as shown in the specification), where j ∈ (2, 3, ..., k) and C is a threshold set by experts; each word pair meeting the condition increments the counter n′, whose initial value is 0, and C can be tuned iteratively by experiment.
In summary, the similarity between the two texts is computed (formula as shown in the specification), where x and y are the respective weights, with x > y and x + y = 1; n′ is the number of keywords meeting the threshold condition, and k is the number of high-quality keywords after optimization.
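Since the final formula is not reproduced in this extraction, the sketch below assumes (as the stated weights x > y, x + y = 1 suggest) that the text similarity mixes the top-pair similarity with the fraction n′/k of word pairs that clear the threshold:

```python
# Hedged sketch of the Step 6 combination: top-pair similarity mixed with the
# matched-pair fraction n'/k, under weights x > y with x + y = 1 (an assumption).

def text_similarity(sim_top_pair, n_matched, k, x=0.6, y=0.4):
    assert x > y and abs(x + y - 1.0) < 1e-12  # constraints stated in the patent
    return x * sim_top_pair + y * (n_matched / k)

print(text_similarity(0.9, 3, 5))  # 0.6*0.9 + 0.4*(3/5)
```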
Claims (3)
1. A text similarity algorithm based on global optimization of keyword quality, relating to the field of Semantic Web technology and specifically to a text similarity algorithm based on global optimization of keyword quality, characterized by comprising the following steps:
Step 1: Apply Chinese word segmentation to the two texts (W1, W2); the segmentation process is as follows:
Step 1.1: Scan the character string to be segmented once against the system's segmentation dictionary; every substring found in the dictionary is identified as a word, and a character with no dictionary match is split off as a single-character word, until the character string is exhausted.
Step 1.2: Based on probability statistics, split the sentence to be segmented into a network structure, yielding the n possible sentence compositions; the sequential nodes of this structure are defined in order as S, M1, M2, M3, M4, M5, E, as shown in Fig. 2.
Step 1.3: Using information theory, assign a weight to each edge of the network: according to the segmentation dictionary, count the matched dictionary words and the unmatched single characters, so that the i-th path contains n_i words and the word counts of the n paths form a set; for the remaining (n − m) paths, compute the weight of every adjacent-word edge. In the statistical corpus, compute the information quantity X(C_i) = |x(C_i)_1 − x(C_i)_2| of each word and the co-occurrence information quantity X(C_i, C_{i+1}) of adjacent words, where x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1 with p(C_i)_1 the probability of C_i in the text corpus and n the number of corpus texts containing C_i; x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2 with p(C_i)_2 the proportion of texts containing C_i and N the total number of corpus texts; analogously, x(C_i, C_{i+1})_1 and x(C_i, C_{i+1})_2 are computed from the co-occurrence probabilities, m being the number of corpus texts in which (C_i, C_{i+1}) co-occur. In summary, the weight of each adjacent-word edge is w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2 X(C_i, C_{i+1}).
Step 1.4: Find the path of maximum weight, which gives the segmentation result: of the n paths of different lengths, m are eliminated by the minimum-word-count operation (m < n), leaving (n − m) paths whose lengths form a set; the weight of each remaining path combines the weight values of its edges, computed one by one as in Step 1.3, with S_j the length of the j-th remaining path; the path of maximum weight is selected.
Step 2: Remove stop words from the text vocabulary according to the stop-word list: stop words occur frequently in a text but contribute little to characterizing it; each feature term is compared against the stop-word list and deleted if it matches; combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn), and regard the contribution of each keyword Ci to the text as a multi-dimensional object vector whose components are: the weighting function f_i of C_i in the corpus, in which the occurrence count of C_i in the text, the total text length L_total, the total number N_total of corpus texts, the information quantity I_j(C_i) of C_i in the j-th corpus text, and its average information quantity all appear; the maximum depth s_i and maximum density m_i of the corresponding ontology concept in the ontology network structure; the part-of-speech weight x_i, with the empirical weights of nouns, verbs, adjectives, and adverbs being β1, β2, β3, β4 and β1 > β2 > β3 > β4; the position weight w_i1 of the first occurrence in the text, drawn from statistically determined position weight values with the title largest and the first paragraph next; the occurrence count n_1 in the paragraph of first appearance; and the co-occurrence probability of C_i with the maximum-weight word in the text.
Step 4: Using constraint conditions, reduce the dimensionality of the keyword feature set in the multi-dimensional space and extract the optimized text keyword sets J1, J2: map the keyword vectors into the space and merge, in order of decreasing weight, keywords satisfying the constraints that the distance d between two points is below a threshold γ and the similarity between two points meets a threshold α, where γ and α are expert-set thresholds for which suitable values can be obtained by experimental testing; the keyword and weight after merging are those of the maximum-weight point; sort the merged keywords by descending weight and keep the top k high-quality keywords, k being an expert-set threshold on the number of extracted words, avoiding redundancy, to obtain the high-quality keyword feature sets ordered by descending weight.
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2.
Step 6: Compute the pairwise similarity between the words of J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs satisfying the condition.
2. The text similarity algorithm based on global optimization of keyword quality according to claim 1, characterized in that the computation of Step 5 is as follows:
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2: each keyword is described by the vector factors defined in Step 3, and the similarity between the two maximum-weight words is computed with a formula in which A and B are the combined weight proportions assigned to concept depth and density and to the other vector elements respectively, with in general A > B and A + B = 1, together with a distance factor.
3. The text similarity algorithm based on global optimization of keyword quality according to claim 1, characterized in that the computation of Step 6 is as follows:
Step 6: Write the keyword sets J1, J2 in matrix form and, as in Step 5, compute the pairwise similarity between their words, giving a similarity matrix; extract the words satisfying a threshold C on word-pair similarity, where j ∈ (2, 3, ..., k) and C is set by experts; each word pair meeting the condition increments the counter n′, whose initial value is 0, and C can be tuned iteratively by experiment; in summary, the similarity between the two texts is computed with the respective weights x and y, where x > y and x + y = 1, n′ is the number of keywords satisfying the threshold condition, and k is the number of high-quality keywords after optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610939853.0A CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610939853.0A CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106598940A true CN106598940A (en) | 2017-04-26 |
Family
ID=58589621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610939853.0A Pending CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598940A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107357916A (en) * | 2017-07-19 | 2017-11-17 | 北京金堤科技有限公司 | Data processing method and system |
CN108228546A (en) * | 2018-01-19 | 2018-06-29 | 北京中关村科金技术有限公司 | A kind of text feature, device, equipment and readable storage medium storing program for executing |
CN108804512A (en) * | 2018-04-20 | 2018-11-13 | 平安科技(深圳)有限公司 | Generating means, method and the computer readable storage medium of textual classification model |
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model |
CN109086262A (en) * | 2017-06-14 | 2018-12-25 | 财团法人资讯工业策进会 | Lexical analysis device, method and its computer storage medium |
CN109977196A (en) * | 2019-03-29 | 2019-07-05 | 云南电网有限责任公司电力科学研究院 | A kind of detection method and device of magnanimity document similarity |
CN110245118A (en) * | 2019-06-27 | 2019-09-17 | 重庆市筑智建信息技术有限公司 | BIM data information three-dimensional gridding retrieval filing method and filing system thereof |
CN110956039A (en) * | 2019-12-04 | 2020-04-03 | 中国太平洋保险(集团)股份有限公司 | Text similarity calculation method and device based on multi-dimensional vectorization coding |
CN111353301A (en) * | 2020-02-24 | 2020-06-30 | 成都网安科技发展有限公司 | Auxiliary secret fixing method and device |
CN111625468A (en) * | 2020-06-05 | 2020-09-04 | 中国银行股份有限公司 | Test case duplicate removal method and device |
CN111898380A (en) * | 2020-08-17 | 2020-11-06 | 上海熙满网络科技有限公司 | Text matching method and device, electronic equipment and storage medium |
CN112035621A (en) * | 2020-09-03 | 2020-12-04 | 江苏经贸职业技术学院 | Enterprise name similarity detection method based on statistics |
CN112256843A (en) * | 2020-12-22 | 2021-01-22 | 华东交通大学 | News keyword extraction method and system based on TF-IDF method optimization |
CN112348535A (en) * | 2020-11-04 | 2021-02-09 | 新华中经信用管理有限公司 | Traceability application method and system based on block chain technology |
CN113392637A (en) * | 2021-06-24 | 2021-09-14 | 青岛科技大学 | TF-IDF-based subject term extraction method, device, equipment and storage medium |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113836942A (en) * | 2021-02-08 | 2021-12-24 | 宏龙科技(杭州)有限公司 | Text matching method based on hidden keywords |
CN116244496A (en) * | 2022-12-06 | 2023-06-09 | 山东紫菜云数字科技有限公司 | Resource recommendation method based on industrial chain |
CN117669513A (en) * | 2024-01-30 | 2024-03-08 | 江苏古卓科技有限公司 | Data management system and method based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937471A (en) * | 2010-09-21 | 2011-01-05 | 上海大学 | Multidimensional space evaluation method of keyword extraction algorithm |
CN103617157A (en) * | 2013-12-10 | 2014-03-05 | 东北师范大学 | Text similarity calculation method based on semantics |
CN103970733A (en) * | 2014-04-10 | 2014-08-06 | 北京大学 | New Chinese word recognition method based on graph structure |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
Non-Patent Citations (2)
Title |
---|
BECK_ZHOU: ""中文分词语言模型和动态规划"", 《CSDN博客HTTPS://BLOG.CSDN.BET/ZHOUBL668/ARTICLE/DETAILS/68964》 * |
蒋健洪 等: ""词典与统计方法结合的中文分词模型研究及应用"", 《计算机工程与设计》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN110413986B (en) | Text clustering multi-document automatic summarization method and system for improving word vector model | |
Dos Santos et al. | Deep convolutional neural networks for sentiment analysis of short texts | |
CN106610951A (en) | Improved text similarity solving algorithm based on semantic analysis | |
CN106970910B (en) | Keyword extraction method and device based on graph model | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN106598941A (en) | Algorithm for globally optimizing quality of text keywords | |
CN106611041A (en) | New text similarity solution method | |
Alwehaibi et al. | A study of the performance of embedding methods for Arabic short-text sentiment analysis using deep learning approaches | |
CN107423282A (en) | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character | |
CN106528621A (en) | Improved density text clustering algorithm | |
CN106570112A (en) | Improved ant colony algorithm-based text clustering realization method | |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications | |
Lytvyn et al. | Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian | |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords | |
CN109815400A (en) | Person interest extraction method based on long text | |
CN106610952A (en) | Mixed text feature word extraction method | |
CN107102985A (en) | Improved multi-threaded keyword extraction technique in documents | |
CN109344403A (en) | Document representation method with enhanced semantic feature embedding | |
CN106610949A (en) | Text feature extraction method based on semantic analysis | |
CN106446147A (en) | Emotion analysis method based on structuring features | |
CN106570120A (en) | Method for realizing search engine optimization through improved keyword optimization | |
CN110705247A (en) | Text similarity calculation method based on χ²-C | |
CN106610954A (en) | Text feature word extraction method based on statistics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20170426 |