CN103049569A - Text similarity matching method on basis of vector space model - Google Patents
- Publication number
- CN103049569A, CN2012105931481A, CN201210593148A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- similarity
- vector space
- space model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity matching method based on a vector space model. The method extracts keywords from texts, clusters all the keywords, and generates a keyword concept tree; it then computes text similarity from the concept tree built over the keywords of the text to be translated, and retrieves texts from a translation reference library ranked by that similarity, the retrieved texts matching the text to be translated. This scheme reflects the relations among texts relatively accurately, and thus captures text similarity more fully.
Description
Technical field
The present invention relates to computer technology, and specifically to a text similarity matching method based on a vector space model.
Background art
Commonly used text retrieval models fall into two families: retrieval models based on the literal text and retrieval models based on structure. Text-based retrieval models include the vector space model, approximate models, probabilistic models, and statistical language retrieval models; structure-based retrieval models include internal-structure and external-structure retrieval models.
The similarity of texts is a numerical measure of how alike two texts are. For two texts D1 and D2, take (D1 ∩ D2)/(D1 ∪ D2); the closer this value is to 1, the more similar the two texts, and vice versa. In text retrieval, similarity computation is mainly used to measure the degree of resemblance between text objects, and is a basic operation in data mining and natural language processing. Its two key techniques are the feature representation of objects and the similarity relation between feature sets. Information retrieval, web-page duplicate detection, recommender systems, and the like all involve computing similarity between objects or between an object and an object set. Different application scenarios, constrained by data scale, time-space overhead, and so on, call for different choices of similarity computation method.
The most commonly used similarity computation method is the vector space model (VSM). This model extracts keywords from a text and assigns them weights, representing the text as a vector of weighted keywords; the similarity of two texts is then obtained by computing the distance between their vectors.
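As a reference point for the discussion that follows, a minimal Python sketch of this traditional VSM computation (the vocabulary, keywords, and raw-count weights are illustrative assumptions, not from the patent):

```python
from collections import Counter
from math import sqrt

def keyword_vector(keywords, vocabulary):
    """Represent a text as the vector of weights (here raw counts) of each vocabulary keyword."""
    counts = Counter(keywords)
    return [counts[k] for k in vocabulary]

def cosine_similarity(a, b):
    """Classic VSM similarity: cosine of the angle between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["translation", "memory", "retrieval", "neural"]
d1 = keyword_vector(["translation", "memory", "memory"], vocab)
d2 = keyword_vector(["translation", "retrieval"], vocab)
print(round(cosine_similarity(d1, d2), 3))  # → 0.316
```

Texts that share no keyword at all score exactly 0 here, which is precisely the weakness the rest of the document addresses.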
Because keywords commonly exhibit synonymy, polysemy, and similar phenomena, the similarity computed with the traditional vector space model is not precise, and the results are often unsatisfactory. Moreover, the keyword weighting algorithm only captures the relation between a text and its keywords; it cannot relate keywords laterally across different texts. This causes the following problems for text retrieval:
(1) Keywords cannot accurately express the user's need.
It is hard for users to choose precise search keywords, because this involves the semantic mapping between a query and its concepts; the keywords a user supplies often fail to reflect the user's intent.
(2) Keywords cannot fully reflect text content.
If a keyword's extension is too broad, related texts become hard or impossible to retrieve.
(3) Polysemy.
Keyword matching has difficulty resolving words with multiple senses, so it tends to retrieve large amounts of irrelevant information.
(4) A keyword may occur in a text only through synonyms.
A user's search keyword sometimes does not appear in the text directly but appears as a synonym, a near-synonym, or another word formation of the keyword; such texts cannot be retrieved. When the search keyword and the text's feature words form a hypernym-hyponym relation, retrieval is harder still.
Summary of the invention
The technical problem solved by the invention is to provide a text similarity matching method based on a vector space model that reflects the relations among texts relatively accurately and therefore captures text similarity more fully.
The technical scheme is as follows:
A text similarity matching method based on a vector space model comprises:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree;
computing text similarity from the concept tree built over the keywords of the text to be translated, and retrieving matching texts from the translation reference library ranked by similarity.
Further, the step of generating the keyword concept tree comprises:
extracting all keywords of the document to be classified and of the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into concept-class sets, and generating the keyword concept tree from these concept-class sets.
Further, if the probability p(k_i) with which keyword k_i occurs satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept; P1 and P2 are preset probability thresholds.
Further, the concrete steps of generating the keyword concept tree comprise:
extracting all keywords of the document to be classified and of the reference library to obtain the keyword set C = {k1, k2, ..., kn}, and computing for each keyword k in C the probability p(k) with which it occurs in the reference library, namely the ratio of the number of texts containing k to the total number of texts in the set;
filtering the keywords by preset thresholds, keeping those with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords by p(k) in descending order and taking each keyword as a set of its own, which yields m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
among these m keywords, computing the probability that keyword k_j also occurs in a text in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously;
merging the sets to be merged, generating the keyword concept tree whose root node is the keyword set C.
Further, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2; when p(k_i) and p(k_j|k_i) reach the preset thresholds, keywords k_i and k_j express the same concept, satisfying one merging condition of their sets; and, for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2. If two sets satisfy both conditions, their concepts are highly similar and the sets can be merged; merging them generates a concept-class set of the next layer up.
Further, the process of finding matching texts in the reference library comprises: extracting the keywords of all documents in the reference library to form a keyword set; and, from the structure of the keyword concept tree, computing the similarity between the text to be classified and each text in the reference library with the improved text similarity formula, returning result texts in descending order of similarity.
Further, the concrete steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes having at least the root node as a common parent;
computing the product of any two keywords as k_i × k_j = depth(com(k_i, k_j))/H;
given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j);
computing the similarity of two texts as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
Compared with the prior art, the technical effects include:
In the prior art, when similarity is computed with the vector space model, two texts represented as d1 = {k1, k2, k3} and d2 = {k4, k5, k6} have similarity 0, because the two vectors are orthogonal. Yet the keywords being compared may stand in synonymy or hypernym-hyponym relations, so matching only identical keywords cannot effectively capture the relation between the texts.
In the present invention, therefore, keywords are clustered by concept so that conceptually similar keywords are grouped together, and with an improved vector cosine computation the similarity of mutually orthogonal vectors need not be 0. This reflects the relations among texts relatively accurately and, compared with the traditional vector space method, captures text similarity more fully.
Description of drawings
Fig. 1 is a schematic diagram of a 4-layer concept tree constructed in the present invention.
Embodiment
The present invention mainly concerns text similarity technology within text retrieval. Text retrieval is an interdisciplinary field: at the broad level it spans computer science, information science, and mathematical statistics, and its concrete research directions include text retrieval, natural language processing, data mining, and machine learning.
A translation reference library (reference library for short) is a huge resource base holding massive amounts of text. Running an elaborate similarity retrieval over it for a text to be translated, in order to find a set of similar reference texts, is very slow and hard to make fast. Yet a comparatively simple VSM vector-space retrieval has very low precision. The present method therefore uses an improved VSM which, while preserving the retrieval speed of the VSM method, raises retrieval precision substantially and yields a relatively accurate set of similar reference documents.
The present invention provides a text similarity computation method based on the vector space model.
Step 1: extract all keywords of the text to be classified and the keywords of all documents in the reference library to form a keyword set; cluster all the keywords and generate the keyword concept tree.
The technical scheme supplies a suitable clustering algorithm; the generation of the keyword concept tree is described in detail below.
Step 11: extract all keywords of the text to be classified and of the reference library, obtaining the keyword set C = {k1, k2, ..., kn}.
Step 12: cluster the keywords in the keyword set, aggregating keywords of the same concept into the same concept set.
If two keywords often occur in the same text simultaneously, that is, if the probability of their co-occurring in the same text exceeds a certain threshold, we take them to express the same concept and treat them as mergeable. Specifically, if the probability p(k_i) with which keyword k_i occurs in the text set satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept and are merged (P1 and P2 are preset probability thresholds).
Similarly, for two keyword sets C1 and C2 to be merged, the following two conditions must hold:
Condition 1: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2. When p(k_i) and p(k_j|k_i) reach their respective thresholds, we take k_i and k_j to express the same concept, which satisfies one merging condition of their sets.
Condition 2: for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2.
If Condition 1 and Condition 2 hold simultaneously, we take the concepts of the two sets to be sufficiently similar; the sets are mergeable, and merging them generates a concept-class set of the next layer up.
When no two remaining keyword sets satisfy the conditions above, merging stops; the parent node of the remaining sets is the set C formed by all keywords.
The keyword clustering steps are as follows:
Step 121: extract all keywords, obtaining the keyword set C = {k1, k2, ..., kn}; compute for each keyword k in C the probability with which it occurs, namely the ratio of the number of texts containing k to the total number of texts, denoted p(k).
Step 122: filter the keywords by the preset thresholds.
Keep the keywords with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m (p_max and p_min are the preset upper and lower bound thresholds, used to remove extremely high-frequency and extremely low-frequency words).
Step 123: sort the filtered keywords by p(k) in descending order and take each keyword as a set of its own, yielding m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m}.
Step 124: among these m keywords, compute the probability that keyword k_j also occurs in the texts in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j).
p(k_j|k_i) is computed as p(k_j|k_i) = p(k_j k_i)/p(k_i), where p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously.
Step 125: merge sets I and J (I and J being sets to be merged) when the following two conditions hold simultaneously:
i. there exist k_i ∈ I and k_j ∈ J satisfying p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2;
ii. for any k_i in I ∪ J, |{k_j ∈ I ∪ J : p(k_j|k_i) ≥ P2}| ≥ (|I| + |J|)/2, where |X| denotes the number of elements of set X.
Step 126: when no two sets satisfy these two conditions, merging finishes, and the first-layer clustered keyword sets C = {C1, C2, ..., Cq} are obtained.
Step 127: for C = {C1, C2, ..., Cq}, take a threshold P3 < P2 and cluster again following the steps above (steps 125 and 126), generating the concept sets of the next layer up.
Repeat this process until the cluster sets can no longer be clustered; these final concept sets become the child nodes of the root node C, which yields a keyword concept tree whose root node is the keyword set C.
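The repeated merging of steps 125 to 127 amounts to a bottom-up agglomeration loop; a simplified sketch follows (a single merge predicate stands in for the per-layer threshold schedule P1/P2/P3, which is an assumption):

```python
def build_concept_tree(leaf_sets, mergeable):
    """Repeatedly merge any two mergeable keyword sets; when no pair merges
    any more, the survivors become children of the root (the full set C)."""
    layer = [frozenset(s) for s in leaf_sets]
    children = {}                        # node -> its child nodes
    merged = True
    while merged:
        merged = False
        for i in range(len(layer)):
            for j in range(i + 1, len(layer)):
                if mergeable(layer[i], layer[j]):
                    parent = layer[i] | layer[j]
                    children[parent] = [layer[i], layer[j]]
                    layer = [s for k, s in enumerate(layer) if k not in (i, j)]
                    layer.append(parent)
                    merged = True
                    break
            if merged:
                break
    root = frozenset().union(*layer)
    if len(layer) > 1:
        children[root] = layer
    return root, children

# Toy predicate: only {"a"} and {"b"} are considered the same concept.
root, children = build_concept_tree([{"a"}, {"b"}, {"c"}],
                                    lambda x, y: x | y == frozenset({"a", "b"}))
print(sorted(root))  # the root node is the whole keyword set
```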
Fig. 1 shows a schematic diagram of a 4-layer concept tree constructed in the present invention.
Step 2: from the concept tree built over the keywords of the text to be translated, find matching texts in the translation reference library.
The present invention defines a vector cosine computation based on the keyword concept tree, that is, a new text similarity computation method.
Step 21: from the structure of the keyword concept tree, compute the similarity of different keywords with the new method.
Step 22: with the improved cosine similarity method, compute the similarity between the text to be translated and the texts in the translation reference library.
In the VSM vector space model, any two keywords k_i and k_j are completely orthogonal, so their product is 0. In the concept tree of the present invention, by contrast, any two concepts k_i and k_j need not be orthogonal; their product is determined by the distance of their nearest common parent node from the root node. For example, in Fig. 1 the nearest common parent of k_1 and k_2 is C11, whose distance from the root is 2; the height of the tree is 3, so k_1 × k_2 = 2/3.
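The Fig. 1 example can be reproduced with a small parent-pointer tree (the node names follow the figure; the dictionary encoding is an illustrative assumption):

```python
def keyword_product(parent, height, a, b):
    """k_i x k_j = depth(com(k_i, k_j)) / H: depth of the nearest common
    ancestor (in edges from the root) divided by the tree height."""
    def chain(n):                       # node and all its ancestors up to the root
        nodes = [n]
        while n in parent:
            n = parent[n]
            nodes.append(n)
        return nodes
    on_a_path = set(chain(a))
    node = b
    while node not in on_a_path:        # climb from b until we hit a's path
        node = parent[node]
    return (len(chain(node)) - 1) / height

# Fig. 1: root C -> C1 -> C11 -> {k1, k2}; C11 is the nearest common parent,
# its depth is 2 and the tree height H is 3, so k1 x k2 = 2/3.
parent = {"k1": "C11", "k2": "C11", "C11": "C1", "C1": "C"}
print(keyword_product(parent, 3, "k1", "k2"))  # 2/3
```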
1. Define H as the height of the generated concept tree.
2. Define depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node.
3. Define com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j; any two nodes have at least the root node as a common parent.
4. The product of any two keywords is computed as k_i × k_j = depth(com(k_i, k_j))/H.
5. Given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, define the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j).
6. The similarity of two texts is computed as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
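Putting the definitions above together, the improved cosine can be sketched as follows (the 2-keyword product matrix, with diagonal 1.0 for a keyword paired with itself, is an illustrative assumption):

```python
from math import sqrt

def soft_dot(a, b, product):
    """Generalized inner product: sum of a_i * b_j * (k_i x k_j) over all pairs."""
    return sum(ai * bj * product[i][j]
               for i, ai in enumerate(a) for j, bj in enumerate(b))

def tree_cosine(d1, d2, product):
    """Improved cosine: vectors with disjoint keywords can still score above 0
    when their keywords are related through the concept tree."""
    den = sqrt(soft_dot(d1, d1, product)) * sqrt(soft_dot(d2, d2, product))
    return soft_dot(d1, d2, product) / den if den else 0.0

product = [[1.0, 2/3],
           [2/3, 1.0]]                 # product[i][j] = depth(com(k_i, k_j)) / H
d1, d2 = [1, 0], [0, 1]                # disjoint keywords: plain cosine gives 0
print(round(tree_cosine(d1, d2, product), 3))  # → 0.667
```

This shows the effect claimed above: the plain VSM cosine of d1 and d2 is 0, while the tree-based cosine is 2/3 because the two keywords share a nearby concept node.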
Step 23: return result texts in descending order of similarity.
Concrete applications of the technical scheme are described below.
Application 1: optimizing translator retrieval through similarity matching against translators' past work
Each translator has many documents he or she has translated; these form the translator's document library, and the libraries of many translators form a huge "translator achievement library". When a suitable translator must be found for a document to be translated, the document is similarity-matched against the achievement library; the documents with high similarity point to their translators, who are the suitable candidates, and the similarity ranking yields a ranking of translator suitability. Because such a translator has already translated similar documents, the translation can be done faster and better.
Application 2: automated document classification through similarity matching against a classified document library
Build a standard document library organized by a fixed classification scheme, with a number of sample documents per class. Similarity-match an unclassified document against the library, collect all library documents whose similarity exceeds a predetermined value, tally the class membership of these similar documents, and feed the tallies into a weighted computation model to score each candidate class; the highest-scoring class is the document's most probable class. If the second-highest score is close to the first, it can serve as a secondary class.
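The voting scheme of Application 2 can be sketched as follows (the similarity threshold, the runner-up margin, and the class labels are illustrative assumptions; the patent does not specify its weighted computation model further):

```python
from collections import defaultdict

def classify(matches, threshold=0.5, margin=0.1):
    """matches: (class_label, similarity) pairs for library documents that
    matched the unclassified document. Documents above the threshold vote for
    their class, weighted by similarity; a close runner-up becomes the
    secondary class."""
    scores = defaultdict(float)
    for label, sim in matches:
        if sim >= threshold:
            scores[label] += sim
    if not scores:
        return None, None
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    best = ranked[0][0]
    second = (ranked[1][0]
              if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= margin
              else None)
    return best, second

matches = [("law", 0.9), ("law", 0.7), ("finance", 0.8),
           ("finance", 0.75), ("sport", 0.3)]
print(classify(matches))  # ('law', 'finance'): finance is a close secondary class
```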
Application 3: a manuscript-splitting strategy combining subject-domain division with similarity retrieval
In a large translation task, breaking a large manuscript into several smaller translation fragments is a common way to divide labor and raise translation efficiency, but the strategy for "smashing" the manuscript is the key link. The method adopted here does not split the content by simple paragraphs; instead, it judges the subject domain of each paragraph from its keywords, partitions the manuscript content preliminarily by domain, then runs similarity retrieval of the divided fragments against the historical achievement library to determine which translators suit each fragment, and finally consolidates the fragments by translator: fragments suited to the same translator, or the same class of translators, are merged wholly or partly. The resulting fragmentation is well suited to task assignment and helps guarantee translation quality.
Claims (7)
1. A text similarity matching method based on a vector space model, comprising:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree; and
computing text similarity from the concept tree built over the keywords of the text to be translated, and retrieving matching texts from the translation reference library ranked by similarity.
2. The text similarity matching method based on a vector space model of claim 1, characterized in that the step of generating the keyword concept tree comprises:
extracting all keywords of the document to be classified and of the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into concept-class sets, and generating the keyword concept tree from these concept-class sets.
3. The text similarity matching method based on a vector space model of claim 2, characterized in that, if the probability p(k_i) with which keyword k_i occurs satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept; P1 and P2 are preset probability thresholds.
4. The text similarity matching method based on a vector space model of claim 3, characterized in that the concrete steps of generating the keyword concept tree comprise:
extracting all keywords of the document to be classified and of the reference library to obtain the keyword set C = {k1, k2, ..., kn}, and computing for each keyword k in C the probability with which it occurs, namely the ratio of the number of texts containing k to the total number of texts, denoted p(k);
filtering the keywords by the preset thresholds, keeping those with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords by p(k) in descending order and taking each keyword as a set of its own, which yields m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
among these m keywords, computing the probability that keyword k_j also occurs in a text in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously;
merging the sets to be merged, generating the keyword concept tree whose root node is the keyword set C.
5. The text similarity matching method based on a vector space model of claim 4, characterized in that, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2; when p(k_i) and p(k_j|k_i) reach the preset thresholds, keywords k_i and k_j express the same concept, satisfying one merging condition of their sets; and, for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2; if two sets satisfy both conditions, their concepts are highly similar and the sets can be merged, merging them generating a concept-class set of the next layer up.
6. The text similarity matching method based on a vector space model of claim 1, characterized in that the process of finding matching texts in the translation reference library comprises: extracting the keywords of all documents in the translation reference library to form a keyword set; and, from the structure of the keyword concept tree, computing the similarity between the text to be classified and each text in the reference library with the improved text similarity formula, returning result texts in descending order of similarity.
7. The text similarity matching method based on a vector space model of claim 6, characterized in that the concrete steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes having at least the root node as a common parent;
computing the product of any two keywords as k_i × k_j = depth(com(k_i, k_j))/H;
given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j);
computing the similarity of two texts as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012105931481A CN103049569A (en) | 2012-12-31 | 2012-12-31 | Text similarity matching method on basis of vector space model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103049569A true CN103049569A (en) | 2013-04-17 |
Family
ID=48062209
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Theme-vocabulary distribution establishing method and system based on document segmenting |
CN103678287A (en) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | Method for unifying keyword translation |
CN103761264A (en) * | 2013-12-31 | 2014-04-30 | 浙江大学 | Concept hierarchy establishing method based on product review document set |
CN104424279A (en) * | 2013-08-30 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Text relevance calculating method and device |
CN104572645A (en) * | 2013-10-11 | 2015-04-29 | 高德软件有限公司 | Method and device for POI (Point Of Interest) data association |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104866631A (en) * | 2015-06-18 | 2015-08-26 | 北京京东尚科信息技术有限公司 | Method and device for aggregating counseling problems |
CN105138521A (en) * | 2015-08-27 | 2015-12-09 | 武汉传神信息技术有限公司 | General translator recommendation method for risk project in translation industry |
CN105279147A (en) * | 2015-09-29 | 2016-01-27 | 武汉传神信息技术有限公司 | Translator document quick matching method |
CN106250412A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | The knowledge mapping construction method merged based on many source entities |
CN106372122A (en) * | 2016-08-23 | 2017-02-01 | 温州大学瓯江学院 | Wiki semantic matching-based document classification method and system |
CN106503457A (en) * | 2016-10-26 | 2017-03-15 | 清华大学 | The integrated technical data introduction method of clinical data based on translational medicine analysis platform |
CN106776563A (en) * | 2016-12-21 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of is the method for treating manuscript of a translation part matching interpreter |
CN106802881A (en) * | 2016-12-25 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | A kind of is to treat the method that manuscript of a translation part matches interpreter based on vocabulary is disabled |
CN106844304A (en) * | 2016-12-26 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | It is a kind of to be categorized as treating the method that manuscript of a translation part matches interpreter based on the manuscript of a translation |
CN106844303A (en) * | 2016-12-23 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | A kind of is to treat the method that manuscript of a translation part matches interpreter based on similarity mode algorithm |
CN107562854A (en) * | 2017-08-28 | 2018-01-09 | 云南大学 | A kind of modeling method of quantitative analysis Party building data |
CN108182182A (en) * | 2017-12-27 | 2018-06-19 | 传神语联网网络科技股份有限公司 | Document matching process, device and computer readable storage medium in translation database |
CN109284486A (en) * | 2018-08-14 | 2019-01-29 | 重庆邂智科技有限公司 | Text similarity measure, device, terminal and storage medium |
CN109636199A (en) * | 2018-12-14 | 2019-04-16 | 语联网(武汉)信息技术有限公司 | A kind of method and system to match interpreter to manuscript of a translation part |
CN110019785A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | A kind of file classification method and device |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Towards financial industry based on deep learning text similarity detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1828610A (en) * | 2006-04-13 | 2006-09-06 | 北大方正集团有限公司 | Improved file similarity measure method based on file structure |
CN101004761A (en) * | 2007-01-10 | 2007-07-25 | 复旦大学 | Hierarchy clustering method of successive dichotomy for document in large scale |
US20110213777A1 (en) * | 2010-02-01 | 2011-09-01 | Alibaba Group Holding Limited | Method and Apparatus of Text Classification |
Non-Patent Citations (1)
Title |
---|
Lü Yue'e, "Document Classification and Retrieval in the Chinese Sci-Tech Periodical Database", Journal of Linyi Normal University * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424279A (en) * | 2013-08-30 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Text relevance calculation method and device |
CN104424279B (en) * | 2013-08-30 | 2018-11-20 | 腾讯科技(深圳)有限公司 | Text relevance calculation method and device |
CN104572645A (en) * | 2013-10-11 | 2015-04-29 | 高德软件有限公司 | Method and device for POI (Point Of Interest) data association |
CN103678287B (en) * | 2013-11-30 | 2016-12-07 | 语联网(武汉)信息技术有限公司 | Method for unifying keyword translation |
CN103678287A (en) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | Method for unifying keyword translation |
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Topic-vocabulary distribution construction method and system based on document segmentation |
CN103761264A (en) * | 2013-12-31 | 2014-04-30 | 浙江大学 | Concept hierarchy construction method based on a product review document set |
CN103761264B (en) * | 2013-12-31 | 2017-01-18 | 浙江大学 | Concept hierarchy construction method based on a product review document set |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104778158B (en) * | 2015-03-04 | 2018-07-17 | 新浪网技术(中国)有限公司 | Text representation method and device |
CN104866631A (en) * | 2015-06-18 | 2015-08-26 | 北京京东尚科信息技术有限公司 | Method and device for aggregating consulting questions |
CN105138521A (en) * | 2015-08-27 | 2015-12-09 | 武汉传神信息技术有限公司 | General translator recommendation method for risk projects in the translation industry |
CN105138521B (en) * | 2015-08-27 | 2017-12-22 | 武汉传神信息技术有限公司 | General translator recommendation method for risk projects in the translation industry |
CN105279147A (en) * | 2015-09-29 | 2016-01-27 | 武汉传神信息技术有限公司 | Method for quickly matching translators to documents |
CN105279147B (en) * | 2015-09-29 | 2018-02-23 | 语联网(武汉)信息技术有限公司 | Method for quickly matching translators to documents |
CN106250412B (en) * | 2016-07-22 | 2019-04-23 | 浙江大学 | Knowledge graph construction method based on multi-source entity fusion |
CN106250412A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | Knowledge graph construction method based on multi-source entity fusion |
CN106372122A (en) * | 2016-08-23 | 2017-02-01 | 温州大学瓯江学院 | Wiki semantic matching-based document classification method and system |
CN106503457B (en) * | 2016-10-26 | 2018-12-11 | 清华大学 | Method for importing integrated clinical data based on a translational medicine analysis platform |
CN106503457A (en) * | 2016-10-26 | 2017-03-15 | 清华大学 | Method for importing integrated clinical data based on a translational medicine analysis platform |
CN106776563A (en) * | 2016-12-21 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated |
CN106844303A (en) * | 2016-12-23 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on a similarity matching algorithm |
CN106802881A (en) * | 2016-12-25 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on stop words |
CN106844304A (en) * | 2016-12-26 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on manuscript classification |
CN107562854B (en) * | 2017-08-28 | 2020-09-22 | 云南大学 | Modeling method for quantitative analysis of Party building data |
CN107562854A (en) * | 2017-08-28 | 2018-01-09 | 云南大学 | Modeling method for quantitative analysis of Party building data |
CN110019785A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | Text classification method and device |
CN110019785B (en) * | 2017-09-29 | 2022-03-01 | 北京国双科技有限公司 | Text classification method and device |
CN108182182A (en) * | 2017-12-27 | 2018-06-19 | 传神语联网网络科技股份有限公司 | Document matching method and device in a translation database, and computer-readable storage medium |
CN109284486A (en) * | 2018-08-14 | 2019-01-29 | 重庆邂智科技有限公司 | Text similarity measurement method, device, terminal and storage medium |
CN109284486B (en) * | 2018-08-14 | 2023-08-22 | 重庆邂智科技有限公司 | Text similarity measurement method, device, terminal and storage medium |
CN109636199A (en) * | 2018-12-14 | 2019-04-16 | 语联网(武汉)信息技术有限公司 | Method and system for matching a translator to a manuscript to be translated |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Deep-learning-based text similarity detection method for the financial industry |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049569A (en) | Text similarity matching method on basis of vector space model | |
CN104391942B (en) | Short text feature expansion method based on semantic graphs | |
CN106649260B (en) | Product feature structure tree construction method based on review text mining | |
US10437867B2 (en) | Scenario generating apparatus and computer program therefor | |
CN107122413A (en) | Keyword extraction method and device based on a graph model | |
CN109376352B (en) | Patent text modeling method based on word2vec and semantic similarity | |
CN102591988B (en) | Short text classification method based on semantic graphs | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN106776562A (en) | Keyword extraction method and system | |
US10095685B2 (en) | Phrase pair collecting apparatus and computer program therefor | |
CN106156272A (en) | Information retrieval method based on multi-source semantic analysis | |
WO2021051518A1 (en) | Text data classification method and apparatus based on neural network model, and storage medium | |
CN105653518A (en) | Specific group discovery and expansion method based on microblog data | |
CN103970729A (en) | Multi-topic extraction method based on semantic categories | |
CN102637192A (en) | Natural language question answering method | |
CN107122349A (en) | Text feature word extraction method based on word2vec-LDA models | |
CN102495892A (en) | Webpage information extraction method | |
CN103970730A (en) | Method for extracting multiple topic terms from a single Chinese text | |
CN101097570A (en) | Advertisement classification method capable of automatically recognizing classified advertisement types | |
CN102253982A (en) | Query suggestion method based on query semantics and click-through data | |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords | |
CN111221968B (en) | Author disambiguation method and device based on topic tree clustering | |
CN102054029A (en) | Person name disambiguation method based on social networks and name context | |
CN104899188A (en) | Question similarity calculation method based on question topics and focuses | |
CN106484797A (en) | Event summary extraction method based on sparse learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2013-04-17 |