CN101576904A - Method for calculating similarity of text content based on a weighted graph - Google Patents

Method for calculating similarity of text content based on a weighted graph Download PDF

Info

Publication number
CN101576904A
CN101576904A CNA2009100787872A CN200910078787A
Authority
CN
China
Prior art keywords
similarity
document
weight
term
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2009100787872A
Other languages
Chinese (zh)
Other versions
CN101576904B (en)
Inventor
杜小勇
刘红岩
何军
李佩
李直旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN2009100787872A priority Critical patent/CN101576904B/en
Publication of CN101576904A publication Critical patent/CN101576904A/en
Application granted granted Critical
Publication of CN101576904B publication Critical patent/CN101576904B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system and a method for calculating the similarity of text content based on a weighted graph. The system comprises an input unit, a construction unit, a calculation unit, and an output unit. The input unit is used to input the document set whose similarities are to be calculated; the construction unit is used to build the weighted graph; the calculation unit is used to calculate the similarity between any two nodes according to the weighted graph obtained by the construction unit; and the output unit is used to return the similarity results to the user.

Description

Method for calculating text content similarity based on a weighted graph
Technical field
The present invention relates to the fields of data mining and information retrieval, and in particular to a method for calculating text content similarity based on a weighted graph.
Background technology
In practical applications, similarity calculation is an indispensable part of any information retrieval or collaborative filtering system. Within information retrieval, document retrieval is a long-standing research topic, and similarity calculation based on text content is its core. For instance, in a digital library, when a user consults a document, the user often needs to examine other documents related to it, or the retrieval system needs to present other relevant documents to the user automatically; both tasks involve text content similarity calculation. In scientific research, text content similarity has been a focus since the very beginning of the information retrieval field, because early information retrieval systems were almost all document retrieval systems. A typical example is the SMART system, developed at Cornell University in the United States in the 1960s.
A typical text similarity application scenario takes as input a given set of documents, each containing a certain number of terms, and outputs the pairwise similarity values between these documents. This is exactly the application scenario addressed by the present invention.
In prior techniques, the most common approach treats the document collection as a vector space (the Vector Space Model) and each document as a vector, so that document similarity reduces to computing the cosine of the angle between the corresponding vectors. This method was first proposed in the SMART system mentioned above and has since gained wide acceptance in the industry. However, it performs poorly in one situation: when there are many documents but each document is short. The reason is that with many documents there are many terms, yet each document contains only a few of them, so the term overlap between any two documents is very small and the computed similarity values end up too low. Unfortunately, as the World Wide Web becomes ever more popular, this phenomenon worsens: the number of web pages grows sharply while individual pages are not very long. It is precisely for this situation that the present invention proposes weighted-graph-based text content similarity calculation, which successfully solves the above problem.
The key idea of the present invention is to treat each document as a node in a graph; if two documents share one identical term, an edge of weight 1 is considered to exist between them, and weights accumulate over shared terms. Clearly, with this construction we finally obtain a weighted undirected graph, and the similarity between documents can then be calculated with a link-based similarity method.
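The construction described above can be sketched as follows, assuming documents are given as sets of preprocessed terms (the document contents below are purely illustrative):

```python
from itertools import combinations

def build_weighted_graph(docs):
    """Build a weighted undirected graph over documents.

    docs: dict mapping document id -> set of terms.
    Each term shared by two documents contributes weight 1 to the
    edge between them; weights accumulate, as described above.
    Returns a dict {(id1, id2): weight} with id1 < id2; pairs with
    no shared term get no edge (weight 0).
    """
    edges = {}
    for d1, d2 in combinations(sorted(docs), 2):
        shared = len(docs[d1] & docs[d2])
        if shared > 0:
            edges[(d1, d2)] = shared
    return edges

docs = {
    "doc1": {"graph", "similarity", "text"},
    "doc2": {"graph", "similarity", "image"},
    "doc3": {"music"},
}
graph = build_weighted_graph(docs)
# doc1 and doc2 share two terms, so their edge has weight 2;
# doc3 shares nothing and remains an isolated node.
print(graph)  # {('doc1', 'doc2'): 2}
```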
Summary of the invention
The present invention was conceived in view of the above technical problems. An object of the present invention is to propose a method for calculating text content similarity based on a weighted graph.
In one aspect, the method for calculating text content similarity based on a weighted graph according to the present invention comprises: A. inputting the document set whose similarities are to be calculated; B. constructing the weighted graph; C. calculating the similarity between any two nodes in the graph according to the weighted graph obtained in step B; D. returning the document similarity results to the user.
In this aspect, step B further comprises: B1. preprocessing the documents; B2. collecting term statistics; B3. building the weighted graph.
In this aspect, step B1 is divided into two substeps: removing non-terms and stemming.
In this aspect, step B2 mainly collects two kinds of statistics: which terms a particular document contains and how many times each of them occurs, and in which documents a particular term occurs and with what frequency.
In this aspect, in step B3, if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained.
In this aspect, step C further comprises: C1. constructing the transition matrix; C2. performing one iteration of the calculation on the transition matrix to obtain the similarity of the current iteration; C3. comparing the similarity obtained in the current iteration with that obtained in the previous iteration; if the values have converged, the iteration stops, otherwise the process returns to step C2 for the next iteration.
In this aspect, in step C1 the weighted graph is represented by an adjacency matrix, and each row is then normalized so that its weights sum to 1.
In another aspect, the system for calculating text content similarity based on a weighted graph according to the present invention comprises: an input unit for inputting the document set whose similarities are to be calculated; a construction unit for constructing the weighted graph; a calculation unit for calculating the similarity between any two nodes in the graph according to the weighted graph obtained by the construction unit; and an output unit for returning the similarity results to the user.
In this aspect, the construction unit further comprises: a preprocessing module for preprocessing the documents; a statistics module for collecting term statistics; and a weighted-graph building module for building the weighted graph.
In this aspect, the preprocessing performed by the preprocessing module is divided into two substeps: removing non-terms and stemming.
In this aspect, the statistics module mainly collects two kinds of statistics: which terms a particular document contains and how many times each of them occurs, and in which documents a particular term occurs and with what frequency.
In this aspect, in the weighted-graph building module, if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained.
In this aspect, the calculation unit further comprises: a transition matrix construction module for constructing the transition matrix; an iterative calculation module for performing one iteration of the calculation on the transition matrix to obtain the similarity of the current iteration; and a judgment module for comparing the similarity obtained in the current iteration with that obtained in the previous iteration; if the values have converged, the iteration stops, otherwise the current iteration result is fed back to the iterative calculation module for the next iteration.
In this aspect, the transition matrix construction module represents the weighted graph by an adjacency matrix and normalizes each row so that its weights sum to 1.
Description of drawings
The above and other objects, features, and advantages of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 gives the flowchart of the method for calculating text content similarity based on a weighted graph according to the present invention;
Fig. 2 gives a sub-flowchart of the method according to the present invention;
Fig. 3 gives another sub-flowchart of the method according to the present invention;
Fig. 4 gives the block diagram of the system for calculating text content similarity based on a weighted graph according to the present invention;
Fig. 5 gives a more detailed block diagram of the construction unit according to the present invention;
Fig. 6 gives a more detailed block diagram of the calculation unit according to the present invention;
Fig. 7 gives a schematic diagram of an example environment in which the present invention can be implemented.
Embodiment
For a more complete understanding of the present invention and its advantages, the present invention is explained in further detail below in conjunction with the drawings and specific embodiments.
First, with reference to Fig. 1, the method for calculating text content similarity based on a weighted graph according to the present invention is described.
As shown in Fig. 1, the method for calculating text content similarity based on a weighted graph according to the present invention comprises the following steps:
A. The user inputs the document set whose similarities are to be calculated.
B. The weighted graph is constructed: each document is treated as a node in the graph, and if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained. This step is described in detail later with reference to Fig. 2.
C. The similarity is calculated iteratively: according to the weighted graph obtained in step B, the similarity between any two nodes in the graph is calculated with a link-based similarity method. This step is described in detail later with reference to Fig. 3.
D. The similarity values of step C are converted into similarities between documents and returned to the user.
Next, with reference to Fig. 2, the process of constructing the weighted graph is explained in greater detail. Fig. 2 gives the flowchart of constructing the weighted graph according to the present invention.
As shown in Fig. 2, the step of constructing the weighted graph further comprises:
B1. The documents are preprocessed. Specifically, words that are meaningless in a document and cannot be considered terms are first removed, such as "的" and "是" (roughly, "of" and "is"). Secondly, English words often show inflections such as past tense, past participle, nominalization, or adjectivization; these prefixes and suffixes are removed by stemming. That is, this step is divided into two substeps: removing non-terms and stemming. The non-terms of the present invention (also called stopwords) are taken from the SMART system mentioned above; they are mostly words that are not specific to any domain and therefore have little discriminative power. Stemming uses the Porter Stemmer algorithm, a commonly used stemming method.
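A minimal sketch of the two preprocessing substeps; note that the tiny stopword list and suffix rules below are simplified stand-ins for the SMART stopword list and the Porter Stemmer that the description actually specifies:

```python
import re

# Illustrative stand-ins only: the patent uses the SMART stopword
# list and the Porter Stemmer; this short list and these crude
# suffix rules are simplifications for demonstration.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "are"}

def crude_stem(word):
    """Strip a few common English suffixes (a rough Porter stand-in)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Substep 1: drop non-terms (stopwords); substep 2: stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The graphs are linking documents"))
# → ['graph', 'link', 'document']
```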
B2. Term statistics are collected. This step mainly collects two kinds of statistics: which terms a particular document contains and how many times (how frequently) each of them occurs; and in which documents a particular term occurs and with what frequency. This statistics process prepares for building the weighted graph in the next step; its result takes the form of a "document-term" matrix whose entries are the occurrence counts of the corresponding terms in the corresponding documents.
B3. The weighted graph is built. If two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained. If two documents share no term, the edge weight between them is 0. That is, every document in the document set is treated as a node. From the "document-term" matrix obtained in the previous step (call it T), the normalized T is multiplied by the transpose of the normalized T to obtain the weight matrix between documents. Normalization here means dividing all elements in a row of the matrix by a coefficient so that the row sums to 1.
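Steps B2 and B3 can be sketched together as follows, using an illustrative "document-term" count matrix; the row normalization and the product with the transpose follow the description above:

```python
import numpy as np

# "Document-term" matrix from step B2: rows are documents, columns
# are terms, entries are raw occurrence counts (values illustrative).
T = np.array([
    [2.0, 1.0, 0.0],   # document 1
    [0.0, 1.0, 1.0],   # document 2
    [1.0, 0.0, 3.0],   # document 3
])

# Normalize each row so it sums to 1 (the "standardization" of B3).
T_norm = T / T.sum(axis=1, keepdims=True)

# Weight matrix between documents: normalized T times its transpose.
W = T_norm @ T_norm.T
print(np.round(W, 3))
```

The resulting W is symmetric, and a zero entry means the two documents share no term, matching the edge-weight rule above.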
Now, with reference to Fig. 3, the process of iteratively calculating the similarity is explained in greater detail. Fig. 3 gives the flowchart of the iterative similarity calculation according to the present invention.
As shown in Fig. 3, the step of iteratively calculating the similarity further comprises:
C1. The transition matrix is constructed. Specifically, the weighted graph is represented by an adjacency matrix, and each row is then normalized so that its weights sum to 1. In fact, the final result of the graph construction in Fig. 2 is exactly a transition matrix, characterized by every row being normalized to 1.
C2. One iteration of the calculation is performed on the transition matrix to obtain the similarity of the current iteration. At the beginning of the iterative calculation, the similarity matrix is set to the identity matrix; the calculation then proceeds, with the result of the k-th iteration serving as the input of the (k+1)-th:
S_{k+1}(DV) = d · T(DV,DV) · S_k(DV) · (T(DV,DV))^T + M
where S_{k+1}(DV) is the result of the current iteration and S_k(DV) is the result of the previous iteration; when k = 0, the similarity matrix is the identity matrix. d is a decay factor representing the decay of similarity propagation in each iteration, generally set to 0.8. T(DV,DV) is the transition matrix, and M is a correction matrix that, after each calculation, makes every diagonal entry of the result matrix equal to 1. Those of ordinary skill in the art will readily appreciate that the iterative calculation is not limited to the above formulation. For example, if there are only two nodes in the graph, then at the first iteration S_0(DV) = {{1,0},{0,1}}. Suppose the result of d · T(DV,DV) · S_0(DV) · (T(DV,DV))^T is {{0.8,0.7},{0.6,0.9}}; then the correction matrix is M = {{0.2,0},{0,0.1}}, which makes the diagonal entries of the final result equal to 1.
C3. The similarity obtained in the current iteration is compared with that obtained in the previous iteration; if the values have converged (for example, the largest difference does not exceed 0.001), the iteration stops; otherwise the process returns to step C2 for the next iteration. The user may also define other termination conditions, for example stopping automatically after more than 10 iterations. If the stop decision is negative, the iteration continues and the flow jumps to the beginning of the next iteration.
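Steps C1 to C3 can be sketched together as follows; the two-node transition matrix is illustrative, the decay factor is 0.8 as above, and iteration stops when the largest entrywise difference does not exceed 0.001 or after 10 iterations:

```python
import numpy as np

def iterate_similarity(T, d=0.8, eps=1e-3, max_iter=10):
    """Iterate S_{k+1} = d * T @ S_k @ T.T + M until convergence.

    T is the row-stochastic transition matrix of step C1 (each row
    sums to 1).  M is the correction matrix that resets the diagonal
    of the result to 1 after every step.  Iteration stops when the
    largest entrywise change is at most eps, or after max_iter
    iterations, as in step C3.
    """
    n = T.shape[0]
    S = np.eye(n)                       # S_0 is the identity matrix
    for _ in range(max_iter):
        A = d * T @ S @ T.T
        M = np.diag(1.0 - np.diag(A))   # correction: diagonal -> 1
        S_next = A + M
        if np.max(np.abs(S_next - S)) <= eps:
            return S_next
        S = S_next
    return S

# Illustrative two-node transition matrix; each row sums to 1.
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
S = iterate_similarity(T)
print(np.round(S, 4))
```

With these values the off-diagonal similarity climbs from 0 toward its fixed point while the correction matrix keeps every self-similarity pinned at 1, exactly the behavior the random-walk description attributes to the iteration.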
Next, with reference to Fig. 4, the system for calculating text content similarity based on a weighted graph according to the present invention is described in detail.
As shown in Fig. 4, the system for calculating text content similarity based on a weighted graph according to the present invention comprises an input unit, a construction unit, a calculation unit, and an output unit.
The input unit lets the user input the document set whose similarities are to be calculated.
The construction unit is used to construct the weighted graph: each document is treated as a node in the graph, and if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained. The construction unit is described in detail later with reference to Fig. 5.
The calculation unit is used to calculate the similarity iteratively: according to the weighted graph obtained by the construction unit, the similarity between any two nodes in the graph is calculated with a link-based similarity method. The calculation unit is described in detail later with reference to Fig. 6.
The output unit is used to return the similarity results to the user.
Next, with reference to Fig. 5, the construction unit is explained in greater detail. Fig. 5 gives a more detailed block diagram of the construction unit according to the present invention.
As shown in Fig. 5, the construction unit comprises a preprocessing module, a statistics module, and a weighted-graph building module.
The preprocessing module is used to preprocess the documents. As mentioned above, words that are meaningless in a document and cannot be considered terms are first removed, such as "的" and "是" (roughly, "of" and "is"). Secondly, English words often show inflections such as past tense, past participle, nominalization, or adjectivization; these prefixes and suffixes are removed by stemming. That is, this step is divided into two substeps: removing non-terms and stemming.
The statistics module is used to collect term statistics. It mainly collects two kinds of statistics: which terms a particular document contains and how many times (how frequently) each of them occurs; and in which documents a particular term occurs and with what frequency. This statistics process prepares for building the weighted graph in the next step; its result takes the form of a "document-term" matrix whose entries are the occurrence counts of the corresponding terms in the corresponding documents.
The weighted-graph building module is used to build the weighted graph. If two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained. That is, every document in the document set is treated as a node. From the "document-term" matrix obtained in the previous step (call it T), the normalized T is multiplied by the transpose of the normalized T to obtain the weight matrix between documents. Normalization here means dividing all elements in a row of the matrix by a coefficient so that the row sums to 1.
Now, with reference to Fig. 6, the calculation unit is explained in greater detail. Fig. 6 gives a more detailed block diagram of the calculation unit according to the present invention.
As shown in Fig. 6, the calculation unit comprises a transition matrix construction module, an iterative calculation module, and a judgment module.
The transition matrix construction module is used to construct the transition matrix. Specifically, the weighted graph is represented by an adjacency matrix, and each row is then normalized so that its weights sum to 1. In fact, the final result of the graph construction in Fig. 2 or Fig. 5 is exactly a transition matrix, characterized by every row being normalized to 1.
The iterative calculation module is used to perform one iteration of the calculation on the transition matrix to obtain the similarity of the current iteration. At the first iteration the similarity matrix is the identity matrix, so the similarity of two different documents starts at 0. The transition matrix describes the probability that a random walker moves between nodes; as the iteration proceeds, the similarity between different documents grows from 0 under the random walk and keeps approaching the convergence value of the iteration. Specifically, at the beginning of the iterative calculation, the similarity matrix is set to the identity matrix; the calculation then proceeds, with the result of the k-th iteration serving as the input of the (k+1)-th:
S_{k+1}(DV) = d · T(DV,DV) · S_k(DV) · (T(DV,DV))^T + M
where S_{k+1}(DV) is the result of the current iteration and S_k(DV) is the result of the previous iteration; when k = 0, the similarity matrix is the identity matrix. d is a decay factor representing the decay of similarity propagation in each iteration, generally set to 0.8. T(DV,DV) is the transition matrix, and M is a correction matrix that, after each calculation, makes every diagonal entry of the result matrix equal to 1. Those of ordinary skill in the art will readily appreciate that the iterative calculation is not limited to the above formulation. For example, if there are only two nodes in the graph, then at the first iteration S_0(DV) = {{1,0},{0,1}}. Suppose the result of d · T(DV,DV) · S_0(DV) · (T(DV,DV))^T is {{0.8,0.7},{0.6,0.9}}; then the correction matrix is M = {{0.2,0},{0,0.1}}, which makes the diagonal entries of the final result equal to 1.
The judgment module compares the similarity obtained in the current iteration with that obtained in the previous iteration; if the values have converged (for example, the largest difference does not exceed 0.001), the iteration stops; otherwise the current iteration result is fed back to the iterative calculation module for the next iteration. The user may also define other termination conditions, for example stopping automatically after more than 10 iterations. If the stop decision is negative, the iteration continues and the flow jumps to the beginning of the next iteration.
An example environment in which the present invention can be implemented is described below with reference to Fig. 7. Fig. 7 gives a schematic diagram of an example environment in which the present invention can be implemented.
The hardware required by the present invention is mainly computer equipment; for most real-world applications (within 100,000 nodes), a slightly better configured PC is sufficient.
As shown in Fig. 7, storage device 1 is, for example, a memory or hard disk used to store the document set given by the user, generally in the form of files or database records. Storage device 2 is, for example, a memory used to store the weighted graph obtained by processing these documents according to a certain algorithm; when the weighted graph is too large to fit in memory, it is placed on the hard disk. The CPU controls the components to carry out the flows illustrated in Figs. 1-3 above. The output device is, for example, a display or printer, used to output the similarity results to the user.
This concludes the description of the main modules and detailed flows of the present invention. In lateral comparison with methods of the same class, the method proposed by the present invention significantly reduces time complexity, which is its biggest advantage. Secondly, the present invention is built on a sound theoretical model and achieves this significant performance improvement with almost no loss of accuracy.
Other advantages and modifications will be apparent to those of ordinary skill in the art. Therefore, the present invention in its broader aspects is not limited to the specific and exemplary embodiments shown and described here. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept defined by the following claims and their equivalents.

Claims (14)

1. A method for calculating text content similarity based on a weighted graph, comprising the steps of:
A. inputting the document set whose similarities are to be calculated;
B. constructing the weighted graph;
C. calculating the similarity between any two nodes in the graph according to the weighted graph obtained in step B;
D. returning the document similarity results to the user.
2. The method according to claim 1, wherein step B further comprises:
B1. preprocessing the documents;
B2. collecting term statistics;
B3. building the weighted graph.
3. The method according to claim 2, wherein step B1 is divided into two substeps: removing non-terms and stemming.
4. The method according to claim 3, wherein step B2 mainly collects two kinds of statistics: which terms a particular document contains and how many times each of them occurs, and in which documents a particular term occurs and with what frequency.
5. The method according to claim 4, wherein in step B3, if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained.
6. The method according to claim 1, wherein step C further comprises:
C1. constructing the transition matrix;
C2. setting the initial similarity matrix to the identity matrix and, based on the transition matrix, performing one iteration of the calculation to obtain the similarity of the current iteration;
C3. comparing the similarity obtained in the current iteration with that obtained in the previous iteration; if the values have converged, the iteration stops, otherwise the process returns to step C2 for the next iteration.
7. The method according to claim 6, wherein in step C1 the weighted graph is represented by an adjacency matrix, and each row is then normalized so that its weights sum to 1.
8. A system for calculating text content similarity based on a weighted graph, comprising:
an input unit for inputting the document set whose similarities are to be calculated;
a construction unit for constructing the weighted graph;
a calculation unit for calculating the similarity between any two nodes in the graph according to the weighted graph obtained by the construction unit;
an output unit for returning the similarity results to the user.
9. The system according to claim 8, wherein the construction unit further comprises:
a preprocessing module for preprocessing the documents;
a statistics module for collecting term statistics;
a weighted-graph building module for building the weighted graph.
10. The system according to claim 9, wherein the preprocessing performed by the preprocessing module is divided into two substeps: removing non-terms and stemming.
11. The system according to claim 10, wherein the statistics module mainly collects two kinds of statistics: which terms a particular document contains and how many times each of them occurs, and in which documents a particular term occurs and with what frequency.
12. The system according to claim 11, wherein in the weighted-graph building module, if two documents share one identical term, the edge weight between them is taken to be 1; weights accumulate, so that finally a weighted graph is obtained.
13. The system according to claim 8, wherein the calculation unit further comprises:
a transition matrix construction module for constructing the transition matrix;
an iterative calculation module for performing one iteration of the calculation on the transition matrix to obtain the similarity of the current iteration;
a judgment module for comparing the similarity obtained in the current iteration with that obtained in the previous iteration; if the values have converged, the iteration stops, otherwise the current iteration result is fed back to the iterative calculation module for the next iteration.
14. The system according to claim 13, wherein the transition matrix construction module represents the weighted graph by an adjacency matrix and normalizes each row so that its weights sum to 1.
CN2009100787872A 2009-03-03 2009-03-03 Method and system for calculating similarity of text content based on a weighted graph Expired - Fee Related CN101576904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100787872A CN101576904B (en) 2009-03-03 2009-03-03 Method and system for calculating similarity of text content based on a weighted graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100787872A CN101576904B (en) 2009-03-03 2009-03-03 Method and system for calculating similarity of text content based on a weighted graph

Publications (2)

Publication Number Publication Date
CN101576904A true CN101576904A (en) 2009-11-11
CN101576904B CN101576904B (en) 2012-04-11

Family

ID=41271837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100787872A Expired - Fee Related CN101576904B (en) 2009-03-03 2009-03-03 Method and system for calculating similarity of text content based on authorized graph

Country Status (1)

Country Link
CN (1) CN101576904B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214180A (en) * 2010-04-12 2011-10-12 无锡科利德斯科技有限公司 Retrieval method and method using same for establishing text semantic extraction module
CN104077365A (en) * 2014-06-12 2014-10-01 百度移信网络技术(北京)有限公司 Similarity measuring method and system for enhancing similarity correlation
CN104077365B (en) * 2014-06-12 2018-04-27 百度移信网络技术(北京)有限公司 The associated method for measuring similarity of one kind enhancing similarity and system
CN106844426A (en) * 2016-12-09 2017-06-13 中电科华云信息技术有限公司 Computing system and method based on random walk personnel's cohesion

Also Published As

Publication number Publication date
CN101576904B (en) 2012-04-11

Similar Documents

Publication Publication Date Title
CN106709345B (en) Method, system and equipment for deducing malicious code rules based on deep learning method
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN108073568A (en) keyword extracting method and device
CN107885716B (en) Text recognition method and device
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN102053992A (en) Clustering method and system
CN109697641A (en) The method and apparatus for calculating commodity similarity
CN112580328A (en) Event information extraction method and device, storage medium and electronic equipment
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN112990583A (en) Method and equipment for determining mold entering characteristics of data prediction model
CN113837260A (en) Model training method, object matching method, device and electronic equipment
CN101576904B (en) Method and system for calculating similarity of text content based on authorized graph
CN106919997A (en) A kind of customer consumption Forecasting Methodology of the ecommerce based on LDA
CN104077288B (en) Web page contents recommend method and web page contents recommendation apparatus
CN117745062A (en) Management risk management and control method and system based on consumption order data
CN116467461A (en) Data processing method, device, equipment and medium applied to power distribution network
CN103514167B (en) Data processing method and equipment
CN115495587A (en) Alarm analysis method and device based on knowledge graph
CN115049060A (en) Business process task execution knowledge recommendation method based on deep learning
CN115545753A (en) Partner prediction method based on Bayesian algorithm and related equipment
CN114911952A (en) Data correction method and device, computer equipment and storage medium
CN114840642A (en) Event extraction method, device, equipment and storage medium
CN115048523A (en) Text classification method, device, equipment and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120411

Termination date: 20160303