CN104484391A

CN104484391A - Method and device for calculating similarity of character strings

Info

Publication number: CN104484391A
Application number: CN201410766683.1A
Authority: CN
Inventors: 侯明午
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2014-12-11
Filing date: 2014-12-11
Publication date: 2015-04-01
Anticipated expiration: 2034-12-11
Also published as: CN104484391B

Abstract

The invention discloses a method and a device for calculating the similarity of character strings. The method comprises: cutting a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string; comparing the second substrings with the first character string and deleting from the first character string the parts identical to the second substrings, to obtain a first difference string, and comparing the first substrings with the second character string and deleting from the second character string the parts identical to the first substrings, to obtain a second difference string; and calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string. The method and the device address the low efficiency of string similarity calculation in the prior art and improve calculation efficiency.

Description

Method and device for calculating the similarity of character strings

Technical field

The present invention relates to the field of data processing, and in particular to a method and a device for calculating the similarity of character strings.

Background art

String similarity is of great significance in text analysis. Among existing methods for calculating string similarity, a relatively mature one is the Levenshtein method, which computes the minimum edit distance. The Levenshtein distance between two character strings is the minimum number of editing operations needed to convert one string into the other, where the editing operations are substitution, deletion and insertion. Because the method is based on character-level editing, it carries a certain error, and the path of the similarity calculation is relatively complex, so the efficiency of string similarity calculation is low.
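For reference, the minimum edit distance described above is commonly computed by dynamic programming. The following Python sketch is a standard textbook formulation and is not taken from this application; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Every pair of prefixes is visited, so the work grows with the product of the two string lengths.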

No effective solution has yet been proposed for the low efficiency of string similarity calculation in the prior art.

Summary of the invention

The main object of the present invention is to provide a method and a device for calculating the similarity of character strings, so as to solve the problem of low efficiency of string similarity calculation in the prior art.

To achieve this object, according to one aspect of the embodiments of the present invention, a method for calculating the similarity of character strings is provided.

The method for calculating the similarity of character strings according to the present invention comprises: cutting a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string; comparing the second substrings with the first character string and deleting from the first character string the parts identical to the second substrings, to obtain a first difference string, and comparing the first substrings with the second character string and deleting from the second character string the parts identical to the first substrings, to obtain a second difference string; and calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

Further, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. Obtaining the first difference string comprises repeating a first judging step and a first deleting step until i = n, where the initial value of i is 1. First judging step: judge whether the first character string contains the second substring S2i. First deleting step: when the first character string is judged to contain the second substring S2i, delete the part identical to the second substring S2i from the first character string, and set i = i + 1. Obtaining the second difference string comprises repeating a second judging step and a second deleting step until j = m, where the initial value of j is 1. Second judging step: judge whether the second character string contains the first substring S1j. Second deleting step: when the second character string is judged to contain the first substring S1j, delete the part identical to the first substring S1j from the second character string, and set j = j + 1.

Further, before the first difference string and the second difference string are obtained, the method also comprises: obtaining the length of each first substring and the length of each second substring; and sorting the m first substrings in descending order of length to obtain first substrings S11 to S1m, and sorting the n second substrings in descending order of length to obtain second substrings S21 to S2n.

Further, calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string comprises calculating the similarity according to the formula

A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)} ,

where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.

To achieve this object, according to another aspect of the embodiments of the present invention, a device for calculating the similarity of character strings is provided.

The device for calculating the similarity of character strings according to the present invention comprises: a cutting unit, configured to cut a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string; a processing unit, configured to compare the second substrings with the first character string and delete from the first character string the parts identical to the second substrings, to obtain a first difference string, and to compare the first substrings with the second character string and delete from the second character string the parts identical to the first substrings, to obtain a second difference string; and a computing unit, configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

Further, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. The processing unit comprises a first judging module and a first deleting module that are called repeatedly until i = n to obtain the first difference string, where the initial value of i is 1. The first judging module is configured to judge whether the first character string contains the second substring S2i. The first deleting module is configured to, when the first judging module judges that the first character string contains the second substring S2i, delete the part identical to the second substring S2i from the first character string, and set i = i + 1. The processing unit also comprises a second judging module and a second deleting module that are called repeatedly until j = m to obtain the second difference string, where the initial value of j is 1. The second judging module is configured to judge whether the second character string contains the first substring S1j. The second deleting module is configured to, when the second judging module judges that the second character string contains the first substring S1j, delete the part identical to the first substring S1j from the second character string, and set j = j + 1.

Further, the device also comprises: an obtaining unit, configured to obtain the length of each first substring and the length of each second substring; and a sorting unit, configured to sort the m first substrings in descending order of length to obtain first substrings S11 to S1m, and to sort the n second substrings in descending order of length to obtain second substrings S21 to S2n.

Further, the computing unit comprises a computing module, configured to calculate the similarity according to the formula

A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)} ,

where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.

According to the embodiments of the invention, the first character string and the second character string are cut to obtain first substrings and second substrings; the parts of the first character string identical to the second substrings are deleted to obtain a first difference string, and the parts of the second character string identical to the first substrings are deleted to obtain a second difference string; and the similarity is calculated from the lengths of the two character strings and of the two difference strings. In other words, the strings whose similarity is to be calculated are cut, the parts of each string identical to substrings cut from the other are deleted to obtain difference strings, and the difference strings are used to calculate the similarity. The strings are thus compared through mutual cutting, and the similarity is inferred in reverse from how much they differ. The logic of this calculation is simple, so the similarity of different character strings can be calculated quickly, which solves the problem of low efficiency of string similarity calculation in the prior art and improves calculation efficiency.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for calculating the similarity of character strings according to an embodiment of the present invention; and

Fig. 2 is a schematic diagram of the device for calculating the similarity of character strings according to an embodiment of the present invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.

Embodiment 1

According to an embodiment of the present invention, a method embodiment is provided that may be used to implement the device embodiment of this application. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.

According to an embodiment of the present invention, a method for calculating the similarity of character strings is provided. Fig. 1 is a flowchart of the method; as shown in Fig. 1, the method comprises the following steps S102 to S106:

S102: Cut the first character string and the second character string to obtain first substrings of the first character string and second substrings of the second character string. Specifically, N-Gram cutting may be used. For example, let the first character string be "北京天安门" (Beijing Tiananmen) and the second character string be "天安门东" (Tiananmen East). Cutting the first character string with a 3-gram yields the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门. Cutting the second character string with a 3-gram yields the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东.
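The cutting in S102 can be sketched in Python as follows. This is a minimal sketch, assuming that the example strings are 北京天安门 and 天安门东 in the original language and that "3-gram cutting" means taking, at each start position, the substrings of length 3 down to 1 (which reproduces the 12 and 9 substrings listed in the example). The function name `cut_ngrams` is hypothetical.

```python
def cut_ngrams(text: str, n: int = 3) -> list:
    """For each start position, emit the substrings of length n down to 1."""
    substrings = []
    for start in range(len(text)):
        # Near the end of the string fewer than n characters remain.
        longest = min(n, len(text) - start)
        for length in range(longest, 0, -1):
            substrings.append(text[start:start + length])
    return substrings


first_substrings = cut_ngrams("北京天安门")
second_substrings = cut_ngrams("天安门东")
print(len(first_substrings), len(second_substrings))  # 12 9
```

Duplicate substrings are kept as they occur; the text does not say whether duplicates should be removed.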

S104: Compare the second substrings with the first character string and delete from the first character string the parts identical to the second substrings, to obtain a first difference string; and compare the first substrings with the second character string and delete from the second character string the parts identical to the first substrings, to obtain a second difference string. That is, in this embodiment the second substrings are compared with the first character string to find the parts of the first character string identical to second substrings, and those parts are deleted; what remains of the first character string after the deletion is the first difference string. Likewise, the first substrings are compared with the second character string to find the parts of the second character string identical to first substrings, and those parts are deleted; what remains of the second character string after the deletion is the second difference string.

S106: Calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

In this embodiment, the strings whose similarity is to be calculated are cut, the parts of each string identical to substrings cut from the other are deleted to obtain difference strings, and the difference strings are used to calculate the similarity. The strings are thus compared through mutual cutting, and the similarity is inferred in reverse from how much they differ. The logic of this calculation is simple, so the similarity of different character strings can be calculated quickly, which solves the problem of low efficiency of string similarity calculation in the prior art and improves calculation efficiency.

Preferably, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. Obtaining the first difference string comprises repeating the following first judging step and first deleting step until i = n, where the initial value of i is 1:

First judging step: judge whether the first character string contains the second substring S2i;

First deleting step: when the first character string is judged to contain the second substring S2i, delete the part identical to the second substring S2i from the first character string, and set i = i + 1.

Obtaining the second difference string comprises repeating the following second judging step and second deleting step until j = m, where the initial value of j is 1:

Second judging step: judge whether the second character string contains the first substring S1j;

Second deleting step: when the second character string is judged to contain the first substring S1j, delete the part identical to the first substring S1j from the second character string, and set j = j + 1.

In this embodiment, the first difference string is obtained by comparing each second substring with the first character string, and the second difference string is obtained by comparing each first substring with the second character string. This improves the accuracy of the two difference strings and provides accurate basic data for the subsequent calculation of the similarity of the first character string and the second character string.
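The judging and deleting steps can be sketched as a single loop in Python. This is a sketch under two assumptions that the text leaves open: the index advances to the next substring whether or not a match was found, and "the part identical to the substring" means every occurrence of it. The function name `difference_string` is hypothetical, and the example strings are assumed to be 北京天安门 and 天安门东.

```python
def difference_string(target: str, substrings: list) -> str:
    """Delete from target every part that matches one of the given substrings."""
    remaining = target
    for sub in substrings:          # i runs from 1 to n
        if sub in remaining:        # judging step
            # Deleting step; removing all occurrences is an assumption.
            remaining = remaining.replace(sub, "")
    return remaining


second_substrings = ["天安门", "安门东", "天安", "安门", "门东", "天", "安", "门", "东"]
print(difference_string("北京天安门", second_substrings))  # 北京
```

The same function yields the second difference string when it is called with the second character string and the first substrings.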

Preferably, before the first difference string and the second difference string are obtained, the method provided by this embodiment also comprises:

Obtaining the length of each first substring and the length of each second substring, where the length of a substring is the number of characters it contains. For example, for the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.

Sorting the m first substrings in descending order of length to obtain first substrings S11 to S1m, and sorting the n second substrings in descending order of length to obtain second substrings S21 to S2n. That is, the first substrings are ordered from longest to shortest according to their lengths, and the second substrings are likewise ordered from longest to shortest.

Continuing the above example, the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, sorted from longest to shortest, are: 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门.

The second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, sorted from longest to shortest, are: 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东.
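The sorting step can be sketched with Python's built-in stable sort. This is a minimal sketch; the helper name `sort_by_length_desc` is hypothetical, and the order of substrings of equal length is an assumption, since the text only requires longer substrings to come before shorter ones. The example substrings are assumed to be those of 天安门东.

```python
def sort_by_length_desc(substrings: list) -> list:
    """Order substrings from longest to shortest; ties keep their cutting order."""
    return sorted(substrings, key=len, reverse=True)


second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
print(sort_by_length_desc(second_substrings))
# ['天安门', '安门东', '天安', '安门', '门东', '天', '安', '门', '东']
```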

After sorting, each second substring is compared with the first character string, and the parts of the first character string that repeat a second substring are deleted to obtain the first difference string. Specifically, the sorted second substrings 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东 are compared in turn with the first character string "北京天安门" (Beijing Tiananmen). During the comparison, the first character string is found to contain the second substring 天安门, so 天安门 is deleted from "北京天安门", giving the first difference string: 北京.

Likewise, each first substring is compared with the second character string, and the parts of the second character string that repeat a first substring are deleted to obtain the second difference string. Specifically, the sorted first substrings 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门 are compared in turn with the second character string "天安门东" (Tiananmen East). During the comparison, the second character string is found to contain the first substring 天安门, so 天安门 is deleted from "天安门东", giving the second difference string: 东.

In this embodiment, the second substrings are sorted in descending order of length before being compared with the first character string. Compared with using the second substrings in the order produced by cutting, this means the first character string is compared first with the longest second substrings, so the larger overlapping parts are deleted early. The remaining second substrings are then compared against a shorter first character string, which reduces the amount of content to be compared and improves the efficiency of obtaining the first difference string. Likewise, the first substrings are sorted in descending order of length before being compared with the second character string, so the larger parts of the second character string that repeat first substrings are deleted early, the remaining first substrings are compared against a shorter second character string, and the efficiency of obtaining the second difference string is improved.

In this embodiment, sorting the first substrings and the second substrings improves the efficiency of obtaining the first difference string and the second difference string, and thereby the efficiency of calculating the string similarity.

Specifically, calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string comprises calculating the similarity according to the formula

A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)} ,

where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing the above example, the length L(S1) of the first character string "北京天安门" is 5, the length L(S2) of the second character string "天安门东" is 4, the length L(DS1) of the first difference string 北京 is 2, and the length L(DS2) of the second difference string 东 is 1. The similarity of the first character string and the second character string is therefore

A = 1 - \frac{2 + 1}{5 + 4} = 1 - \frac{1}{3} \approx 0.6667 .
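The whole of Embodiment 1 can be put together in a short Python sketch that reproduces the worked value above. It is an illustration under stated assumptions rather than a definitive implementation: the example strings are assumed to be 北京天安门 and 天安门东, all function names are hypothetical, every occurrence of a matching substring is deleted, and two empty strings are treated as having similarity 1 to avoid division by zero (a case the text does not cover).

```python
def cut_ngrams(text, n=3):
    """S102: substrings of length n down to 1 at every start position."""
    return [text[s:s + k]
            for s in range(len(text))
            for k in range(min(n, len(text) - s), 0, -1)]


def difference_string(target, substrings):
    """S104: delete from target every part matching one of the substrings."""
    for sub in substrings:
        if sub in target:
            target = target.replace(sub, "")
    return target


def similarity(s1, s2, n=3):
    """S106: A = 1 - (L(DS1) + L(DS2)) / (L(S1) + L(S2))."""
    subs1 = sorted(cut_ngrams(s1, n), key=len, reverse=True)
    subs2 = sorted(cut_ngrams(s2, n), key=len, reverse=True)
    ds1 = difference_string(s1, subs2)  # first difference string
    ds2 = difference_string(s2, subs1)  # second difference string
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - (len(ds1) + len(ds2)) / total


print(round(similarity("北京天安门", "天安门东"), 4))  # 0.6667
```

Identical strings give a similarity of 1, and strings with no character in common give 0.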

It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the invention.

From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.

Embodiment 2

According to the embodiment of the present invention, additionally provide a kind of calculation element of similarity of character string of the computing method for implementing above-mentioned similarity of character string, this calculation element is mainly used in the computing method performing the similarity of character string that embodiment of the present invention foregoing provides, and does concrete introduction below to the calculation element of the similarity of character string that the embodiment of the present invention provides:

Fig. 2 is the schematic diagram of the calculation element of similarity of character string according to the embodiment of the present invention, and as shown in Figure 2, this device mainly comprises cutter unit 10, processing unit 20 and computing unit 30, wherein:

Cutter unit 10, for cutting the first character string and the second character string, obtains the first substring of the first character string and the second substring of the second character string.Particularly, N-Gram can be adopted to cut the first character string and the second character string.Such as: the first character string is " Tian An-men, Beijing ", the second character string is " east, Tian An-men ", such as: utilize 3Gram to cut the first character string " Tian An-men, Beijing ", the first substring obtained, is specially sky, Beijing, Beijing, north, capital Tian An, sky, capital, capital, Tian An-men, Tian An, my god, An Men, peace, door; Utilize 3Gram to cut the second character string " east, Tian An-men ", obtain the second substring, be specially Tian An-men, Tian An, sky, An Mendong, An Men, peace, Men Dong, door, east.

Processing unit 20 is for contrasting the second substring with the first character string to delete part identical with the second substring in the first character string, obtain the first difference string, and contrast the first substring with the second character string to delete part identical with the first substring in the second character string, obtain the second difference string.The embodiment of the present invention namely, second substring and the first character string are contrasted, find part identical with the second substring in the first character string, above-mentioned identical part deleted in the first character string, so deleting the first character string after above-mentioned same section is then the first difference string; First substring and the second character string are contrasted, find part identical with the first substring in the second character string, above-mentioned identical part deleted in the second character string, so deleting the second character string after above-mentioned same section is then the second difference string.

Computing unit 30 is configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

Preferably, in this embodiment of the present invention, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers greater than or equal to 2. The processing unit comprises a first judging module and a first removing module that are called repeatedly, and further comprises a second judging module and a second removing module that are called repeatedly.

The first judging module and the first removing module are called repeatedly until i = n, yielding the first difference string, where the initial value of i is 1:

The first judging module is configured to judge whether the first character string contains the second substring S2i.

The first removing module is configured to, when the first judging module judges that the first character string contains the second substring S2i, delete the part identical to the second substring S2i from the first character string and set i = i + 1.

The second judging module and the second removing module are called repeatedly until j = m, yielding the second difference string, where the initial value of j is 1:

The second judging module is configured to judge whether the second character string contains the first substring S1j.

The second removing module is configured to, when the second judging module judges that the second character string contains the first substring S1j, delete the part identical to the first substring S1j from the second character string and set j = j + 1.
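The repeated judging and removing steps can be sketched as a single loop. This is a minimal Python sketch under stated assumptions, not the implementation of the invention: the function name `difference_string` is hypothetical, every occurrence of a matching substring is deleted (the text does not say whether one or all occurrences are removed), and the index is assumed to advance even when the substring is not contained. The example strings are assumed to be "北京天安门" and "天安门东".

```python
def difference_string(target: str, substrings: list[str]) -> str:
    """Delete from `target` every part identical to one of `substrings`.

    Hypothetical helper. Assumptions: all occurrences of a contained
    substring are deleted, and the loop moves on to the next substring
    whether or not a deletion took place.
    """
    remaining = target
    for sub in substrings:                      # i = 1 .. n
        if sub and sub in remaining:            # judging step
            remaining = remaining.replace(sub, "")  # removing step
    return remaining


if __name__ == "__main__":
    second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
    first_substrings = ["北京天", "北京", "北", "京天安", "京天", "京",
                        "天安门", "天安", "天", "安门", "安", "门"]
    print(difference_string("北京天安门", second_substrings))  # first difference string
    print(difference_string("天安门东", first_substrings))     # second difference string
```

Because each deletion shortens the remaining string, later and shorter substrings are tested only against what has not already been removed.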

Preferably, the device for calculating the similarity of character strings provided by this embodiment of the present invention further comprises an acquiring unit and a sequencing unit, wherein:

The acquiring unit is configured to obtain the length of each first substring and the length of each second substring, where the length of a first substring is the number of characters it contains and, likewise, the length of a second substring is the number of characters it contains. For example, for the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.

The sequencing unit is configured to sort the m first substrings in descending order of length to obtain first substring S11 to first substring S1m, and to sort the n second substrings in descending order of length to obtain second substring S21 to second substring S2n. That is, the first substrings are ordered from longest to shortest according to their lengths, and the second substrings are likewise ordered from longest to shortest according to their lengths.
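The sorting performed by the sequencing unit can be sketched as below. This is an illustrative Python sketch only: the function name `sort_by_length` is hypothetical, and the tie-breaking rule (substrings of equal length keep their original relative order) is an assumption, since the text does not specify how ties are ordered.

```python
def sort_by_length(substrings: list[str]) -> list[str]:
    """Return the substrings ordered from longest to shortest.

    Hypothetical helper. Python's sorted() is stable, so substrings of
    equal length keep their original relative order (an assumption about
    tie-breaking, which the text leaves open).
    """
    return sorted(substrings, key=len, reverse=True)


if __name__ == "__main__":
    second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
    print(sort_by_length(second_substrings))
```

Processing the longest substrings first presumably ensures that a long common part is deleted as a whole before any of its shorter fragments is tested.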

Specifically, computing unit 30 comprises a computing module configured to calculate the similarity according to the formula

A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)} ,

where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing the above example, the length L(S1) of the first character string "北京天安门" ("Tian An-men, Beijing") is 5, the length L(S2) of the second character string "天安门东" ("east of Tian An-men") is 4, the length L(DS1) of the first difference string "北京" ("Beijing") is 2, and the length L(DS2) of the second difference string "东" ("east") is 1. The similarity of the first character string and the second character string is therefore

A = 1 - \frac{2 + 1}{5 + 4} = 1 - \frac{1}{3} \approx 0.6667 .
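The arithmetic of the worked example can be checked with a short Python sketch. The function name `similarity` and its parameter names are hypothetical, and the formula is the one implied by the worked example (one minus the combined length of the two difference strings divided by the combined length of the two character strings). Raising an error when both strings are empty is an added assumption, since the text does not address that case.

```python
def similarity(len_s1: int, len_s2: int, len_ds1: int, len_ds2: int) -> float:
    """Compute A = 1 - (L(DS1) + L(DS2)) / (L(S1) + L(S2)).

    Hypothetical helper. The zero-length guard is an assumption; the
    text does not say what happens when both strings are empty.
    """
    total = len_s1 + len_s2
    if total == 0:
        raise ValueError("both character strings are empty")
    return 1 - (len_ds1 + len_ds2) / total


if __name__ == "__main__":
    # Lengths from the worked example: L(S1)=5, L(S2)=4, L(DS1)=2, L(DS2)=1.
    print(round(similarity(5, 4, 2, 1), 4))
```

Two identical strings leave empty difference strings and give a similarity of 1, while two strings with no common substring keep their full lengths and give 0.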

As can be seen from the above description, the present invention solves the problem of low efficiency in calculating the similarity of character strings in the prior art, and thereby improves calculation efficiency.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be made through some interfaces, and the indirect couplings or communication connections between units or modules may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for calculating the similarity of character strings, characterized by comprising:

Cutting a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string;

Comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining a first difference string, and comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining a second difference string; and

Calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

2. The calculation method according to claim 1, characterized in that the number of the first substrings is m and the number of the second substrings is n, m and n being natural numbers greater than or equal to 2, wherein:

Comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining the first difference string, comprises repeating a first judging step and a first deleting step until i = n to obtain the first difference string, where the initial value of i is 1:

The first judging step: judging whether the first character string contains the second substring S2i; and

The first deleting step: when it is judged that the first character string contains the second substring S2i, deleting the part identical to the second substring S2i from the first character string and setting i = i + 1,

Comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining the second difference string, comprises repeating a second judging step and a second deleting step until j = m to obtain the second difference string, where the initial value of j is 1:

The second judging step: judging whether the second character string contains the first substring S1j; and

The second deleting step: when it is judged that the second character string contains the first substring S1j, deleting the part identical to the first substring S1j from the second character string and setting j = j + 1.

3. The calculation method according to claim 2, characterized in that, before comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining the first difference string, and comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining the second difference string, the calculation method further comprises:

Obtaining the length of each first substring and the length of each second substring; and

Sorting the m first substrings in descending order of length to obtain first substring S11 to first substring S1m, and sorting the n second substrings in descending order of length to obtain second substring S21 to second substring S2n.

4. The calculation method according to claim 1, characterized in that calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string comprises:

Calculating the similarity according to the formula A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.

5. A device for calculating the similarity of character strings, characterized by comprising:

A cutter unit, configured to cut a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string;

A processing unit, configured to compare the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining a first difference string, and to compare the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining a second difference string; and

A computing unit, configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.

6. The calculation device according to claim 5, characterized in that the number of the first substrings is m and the number of the second substrings is n, m and n being natural numbers greater than or equal to 2, wherein:

The processing unit comprises a first judging module and a first removing module that are called repeatedly, wherein the first judging module and the first removing module are called repeatedly until i = n to obtain the first difference string, and the initial value of i is 1:

The first judging module, configured to judge whether the first character string contains the second substring S2i;

The first removing module, configured to, when the first judging module judges that the first character string contains the second substring S2i, delete the part identical to the second substring S2i from the first character string and set i = i + 1,

The processing unit further comprises a second judging module and a second removing module that are called repeatedly, wherein the second judging module and the second removing module are called repeatedly until j = m to obtain the second difference string, and the initial value of j is 1:

The second judging module, configured to judge whether the second character string contains the first substring S1j; and

The second removing module, configured to, when the second judging module judges that the second character string contains the first substring S1j, delete the part identical to the first substring S1j from the second character string and set j = j + 1.

7. The calculation device according to claim 6, characterized in that the calculation device further comprises:

An acquiring unit, configured to obtain the length of each first substring and the length of each second substring; and

A sequencing unit, configured to sort the m first substrings in descending order of length to obtain first substring S11 to first substring S1m, and to sort the n second substrings in descending order of length to obtain second substring S21 to second substring S2n.

8. The calculation device according to claim 5, characterized in that the computing unit comprises:

A computing module, configured to calculate the similarity according to the formula A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.