Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present invention are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment is provided that may be used to implement the device embodiment of the present application. It should be noted that the steps shown in the flowchart of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
According to an embodiment of the present invention, a method for calculating string similarity is provided. Fig. 1 is a flowchart of the method for calculating string similarity according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps S102 to S106:
S102: cut the first string and the second string to obtain first substrings of the first string and second substrings of the second string. Specifically, N-Gram may be used to cut the first string and the second string. For example, take the first string "北京天安门" (Beijing Tiananmen) and the second string "天安门东" (Tiananmen East). Cutting the first string "北京天安门" with 3-Gram yields the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门. Cutting the second string "天安门东" with 3-Gram yields the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东.
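The cutting in S102 can be illustrated with a short Python sketch. It assumes, judging from the 3-Gram example, that for every start position the cut emits the substrings of length 3, 2 and 1 that still fit in the string. The function name `cut_ngrams` and its parameter `n` are hypothetical, and the example strings are "北京天安门" (Beijing Tiananmen) and "天安门东" (Tiananmen East).

```python
def cut_ngrams(text: str, n: int = 3) -> list[str]:
    """For each start position, emit the substrings of length n down to 1."""
    pieces = []
    for start in range(len(text)):
        # Near the end of the string fewer than n characters remain.
        longest = min(n, len(text) - start)
        for size in range(longest, 0, -1):
            pieces.append(text[start:start + size])
    return pieces


first_substrings = cut_ngrams("北京天安门")
second_substrings = cut_ngrams("天安门东")
print(first_substrings)
print(second_substrings)
```

Under this assumption the substrings come out in the same order as they are listed in the example: longest first for each start position, moving left to right through the string.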
S104: compare the second substrings with the first string and delete from the first string the parts identical to the second substrings, obtaining a first difference string; and compare the first substrings with the second string and delete from the second string the parts identical to the first substrings, obtaining a second difference string. That is, in this embodiment the second substrings are compared with the first string to find the parts of the first string that are identical to a second substring, and those parts are deleted from the first string; what remains of the first string after the deletion is the first difference string. Likewise, the first substrings are compared with the second string to find the parts of the second string that are identical to a first substring, and those parts are deleted from the second string; what remains of the second string after the deletion is the second difference string.
S106: calculate the similarity between the first string and the second string according to the length of the first string, the length of the second string, the length of the first difference string and the length of the second difference string.
In this embodiment of the present invention, the strings whose similarity is to be calculated are cut, the parts of each string that are identical to the substrings cut from the other string are deleted to obtain difference strings, and the similarity is calculated from the difference strings. The strings are thus compared for differences by cutting each against the other, and the similarity is then derived in reverse from the differences. This way of calculating similarity is logically simple and allows the similarity of different strings to be calculated quickly, which solves the prior-art problem of low efficiency in calculating string similarity and thereby improves computational efficiency.
Preferably, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers greater than or equal to 2. Comparing the second substrings with the first string to delete the parts of the first string identical to the second substrings, and obtaining the first difference string, comprises repeating the following first determining step and first deleting step until i = n, thereby obtaining the first difference string, where the initial value of i is 1:
First determining step: determine whether the first string contains the second substring S2i;
First deleting step: when it is determined that the first string contains the second substring S2i, delete from the first string the part identical to the second substring S2i, and set i = i + 1.
Comparing the first substrings with the second string to delete the parts of the second string identical to the first substrings, and obtaining the second difference string, comprises repeating the following second determining step and second deleting step until j = m, thereby obtaining the second difference string, where the initial value of j is 1:
Second determining step: determine whether the second string contains the first substring S1j;
Second deleting step: when it is determined that the second string contains the first substring S1j, delete from the second string the part identical to the first substring S1j, and set j = j + 1.
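The determining and deleting steps can be sketched in Python as a single loop over the substrings. This is a minimal sketch under two assumptions that the text does not state explicitly: the index advances to the next substring whether or not a match was found, and only the first matching occurrence is deleted. The function name `difference_string` is hypothetical, and the example uses the strings "北京天安门" (Beijing Tiananmen) and "天安门东" (Tiananmen East) with their 3-Gram substrings.

```python
def difference_string(target: str, substrings: list[str]) -> str:
    """Delete from `target` every listed substring that it still contains."""
    remaining = target
    for piece in substrings:        # the index i (or j) visits each substring once
        if piece in remaining:      # determining step
            # Deleting step: remove the first matching occurrence (assumption).
            remaining = remaining.replace(piece, "", 1)
    return remaining


first_substrings = ["北京天", "北京", "北", "京天安", "京天", "京",
                    "天安门", "天安", "天", "安门", "安", "门"]
second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安",
                     "门东", "门", "东"]

first_difference = difference_string("北京天安门", second_substrings)
second_difference = difference_string("天安门东", first_substrings)
print(first_difference, second_difference)
```

Each call compares one string against the substrings cut from the other string, so the two difference strings are obtained by calling the same routine twice with the arguments swapped.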
In this embodiment of the present invention, the first difference string is obtained by comparing each second substring with the first string, and the second difference string is obtained by comparing each first substring with the second string. This improves the accuracy of the resulting first and second difference strings and provides accurate basic data for the subsequent calculation of the similarity of the strings (that is, the first string and the second string).
Preferably, before comparing the second substrings with the first string to delete the parts of the first string identical to the second substrings and obtain the first difference string, and comparing the first substrings with the second string to delete the parts of the second string identical to the first substrings and obtain the second difference string, the method for calculating string similarity provided by this embodiment of the present invention further comprises:
Obtain the length of each first substring and the length of each second substring, where the length of a first substring is the number of characters it contains and, likewise, the length of a second substring is the number of characters it contains. For example, for the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.
Sort the m first substrings in descending order of length to obtain the first substrings S11 to S1m, and sort the n second substrings in descending order of length to obtain the second substrings S21 to S2n. That is, the first substrings are sorted from longest to shortest according to the length of each first substring and, likewise, the second substrings are sorted from longest to shortest according to the length of each second substring.
Continuing the example above, in this embodiment the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, sorted from longest to shortest, become: 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门.
The second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, sorted from longest to shortest, become: 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东.
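The sorting step can be sketched with Python's built-in `sorted`. The text only requires descending order of length and does not say how substrings of equal length are ordered among themselves; the sketch keeps them in their cutting order because `sorted` is stable, so the order of equal-length substrings may differ from the listing in the example. The function name `sort_by_length` is hypothetical, and the substrings are those of "北京天安门" (Beijing Tiananmen) and "天安门东" (Tiananmen East).

```python
def sort_by_length(substrings: list[str]) -> list[str]:
    """Return the substrings ordered from longest to shortest."""
    # sorted() is stable, so substrings of equal length keep their cutting order.
    return sorted(substrings, key=len, reverse=True)


first_substrings = ["北京天", "北京", "北", "京天安", "京天", "京",
                    "天安门", "天安", "天", "安门", "安", "门"]
second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安",
                     "门东", "门", "东"]

sorted_first = sort_by_length(first_substrings)
sorted_second = sort_by_length(second_substrings)
print(sorted_first)
print(sorted_second)
```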
After sorting, each second substring is compared with the first string and the parts of the first string that repeat a second substring are deleted to obtain the first difference string. Specifically, the sorted second substrings 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东 are compared in turn with the first string "北京天安门". During the comparison, the first string "北京天安门" is found to share the part "天安门" with the second substring 天安门, so "天安门" is deleted from the first string "北京天安门", giving the first difference string: 北京.
Likewise, each first substring is compared with the second string and the parts of the second string that repeat a first substring are deleted to obtain the second difference string. Specifically, the sorted first substrings 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门 are compared in turn with the second string "天安门东". During the comparison, the second string "天安门东" is found to share the part "天安门" with the first substring 天安门, so "天安门" is deleted from the second string "天安门东", giving the second difference string: 东.
In this embodiment of the present invention, the second substrings are sorted from longest to shortest before being compared with the first string to obtain the first difference string. Compared with comparing the second substrings with the first string directly in the order in which they were cut, without sorting, this allows the larger repeated parts of the first string to be deleted early, when the first string is compared with the second substrings at the front of the order. The subsequent second substrings are then compared with a first string from which most of the repeated content has already been removed, which reduces the amount of content to compare and thus improves the efficiency of obtaining the first difference string. Likewise, the first substrings are sorted from longest to shortest before being compared with the second string to obtain the second difference string. Compared with comparing the unsorted first substrings with the second string directly, this allows the larger repeated parts of the second string to be deleted early, so that the subsequent first substrings are compared with a second string from which most of the repeated content has already been removed, which reduces the amount of content to compare and thus improves the efficiency of obtaining the second difference string.
In this embodiment of the present invention, sorting the first substrings and the second substrings improves the efficiency of obtaining the first difference string and the second difference string, and thereby improves the efficiency of calculating string similarity.
Specifically, calculating the similarity between the first string and the second string according to the length of the first string, the length of the second string, the length of the first difference string and the length of the second difference string comprises calculating the similarity according to the formula
[formula not reproduced in the source text]
where L(S1) is the length of the first string, L(S2) is the length of the second string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing the example above, the length L(S1) of the first string "北京天安门" is 5, the length L(S2) of the second string "天安门东" is 4, the length L(DS1) of the first difference string "北京" is 2, and the length L(DS2) of the second difference string "东" is 1. The similarity A of the first string "北京天安门" and the second string "天安门东" is obtained by substituting these values into the formula.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiment may be implemented by software together with a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and comprises a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, a device for calculating string similarity is also provided for implementing the above method for calculating string similarity. The device is mainly used to perform the method for calculating string similarity provided above by the embodiment of the present invention. The device for calculating string similarity provided by the embodiment of the present invention is described in detail below:
Fig. 2 is a schematic diagram of the device for calculating string similarity according to an embodiment of the present invention. As shown in Fig. 2, the device mainly comprises a cutting unit 10, a processing unit 20 and a computing unit 30, wherein:
The cutting unit 10 is configured to cut the first string and the second string to obtain first substrings of the first string and second substrings of the second string. Specifically, N-Gram may be used to cut the first string and the second string. For example, take the first string "北京天安门" (Beijing Tiananmen) and the second string "天安门东" (Tiananmen East). Cutting the first string "北京天安门" with 3-Gram yields the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门. Cutting the second string "天安门东" with 3-Gram yields the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东.
The processing unit 20 is configured to compare the second substrings with the first string and delete from the first string the parts identical to the second substrings, obtaining a first difference string, and to compare the first substrings with the second string and delete from the second string the parts identical to the first substrings, obtaining a second difference string. That is, in this embodiment the second substrings are compared with the first string to find the parts of the first string that are identical to a second substring, and those parts are deleted from the first string; what remains of the first string after the deletion is the first difference string. Likewise, the first substrings are compared with the second string to find the parts of the second string that are identical to a first substring, and those parts are deleted from the second string; what remains of the second string after the deletion is the second difference string.
The computing unit 30 is configured to calculate the similarity between the first string and the second string according to the length of the first string, the length of the second string, the length of the first difference string and the length of the second difference string.
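The division into three units can be sketched in Python as three small classes wired together. This is only an illustrative sketch: the class and method names are hypothetical, the cut follows the 3-Gram example (substrings of length 3, 2 and 1 at each start position), only the first matching occurrence is deleted, and the similarity formula is passed in by the caller because it is not reproduced in this text. The example strings are "北京天安门" (Beijing Tiananmen) and "天安门东" (Tiananmen East).

```python
from typing import Callable


class CuttingUnit:
    """Cuts a string into substrings, following the 3-Gram example."""

    def __init__(self, n: int = 3) -> None:
        self.n = n

    def cut(self, text: str) -> list[str]:
        return [text[start:start + size]
                for start in range(len(text))
                for size in range(min(self.n, len(text) - start), 0, -1)]


class ProcessingUnit:
    """Deletes from a string the parts identical to the given substrings."""

    def difference(self, target: str, substrings: list[str]) -> str:
        remaining = target
        for piece in substrings:
            if piece in remaining:
                # Assumption: only the first matching occurrence is removed.
                remaining = remaining.replace(piece, "", 1)
        return remaining


class ComputingUnit:
    """Applies a caller-supplied formula to the four lengths."""

    def __init__(self, formula: Callable[[int, int, int, int], object]) -> None:
        self.formula = formula

    def compute(self, s1: str, s2: str, ds1: str, ds2: str) -> object:
        return self.formula(len(s1), len(s2), len(ds1), len(ds2))


class SimilarityDevice:
    def __init__(self, formula: Callable[[int, int, int, int], object]) -> None:
        self.cutting_unit = CuttingUnit()
        self.processing_unit = ProcessingUnit()
        self.computing_unit = ComputingUnit(formula)

    def run(self, s1: str, s2: str) -> object:
        subs1 = self.cutting_unit.cut(s1)
        subs2 = self.cutting_unit.cut(s2)
        ds1 = self.processing_unit.difference(s1, subs2)
        ds2 = self.processing_unit.difference(s2, subs1)
        return self.computing_unit.compute(s1, s2, ds1, ds2)


# Placeholder in place of the formula: it only echoes the four lengths.
device = SimilarityDevice(lambda l1, l2, d1, d2: (l1, l2, d1, d2))
lengths = device.run("北京天安门", "天安门东")
print(lengths)
```

Passing the formula in as a callable keeps the sketch independent of the actual similarity formula; any function of the four lengths can be substituted for the placeholder.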
In this embodiment of the present invention, the strings whose similarity is to be calculated are cut, the parts of each string that are identical to the substrings cut from the other string are deleted to obtain difference strings, and the similarity is calculated from the difference strings. The strings are thus compared for differences by cutting each against the other, and the similarity is then derived in reverse from the differences. This way of calculating similarity is logically simple and allows the similarity of different strings to be calculated quickly, which solves the prior-art problem of low efficiency in calculating string similarity and thereby improves computational efficiency.
Preferably, in this embodiment of the present invention, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers greater than or equal to 2. The processing unit comprises a first judging module and a first deleting module that are called repeatedly, and further comprises a second judging module and a second deleting module that are called repeatedly.
The first judging module and the first deleting module are called repeatedly until i = n to obtain the first difference string, where the initial value of i is 1:
The first judging module is configured to determine whether the first string contains the second substring S2i;
The first deleting module is configured to, when the first judging module determines that the first string contains the second substring S2i, delete from the first string the part identical to the second substring S2i, and set i = i + 1.
The second judging module and the second deleting module are called repeatedly until j = m to obtain the second difference string, where the initial value of j is 1:
The second judging module is configured to determine whether the second string contains the first substring S1j;
The second deleting module is configured to, when the second judging module determines that the second string contains the first substring S1j, delete from the second string the part identical to the first substring S1j, and set j = j + 1.
In this embodiment of the present invention, the first difference string is obtained by comparing each second substring with the first string, and the second difference string is obtained by comparing each first substring with the second string. This improves the accuracy of the resulting first and second difference strings and provides accurate basic data for the subsequent calculation of the similarity of the strings (that is, the first string and the second string).
Preferably, the device for calculating string similarity provided by this embodiment of the present invention further comprises an acquiring unit and a sorting unit, wherein:
The acquiring unit is configured to obtain the length of each first substring and the length of each second substring, where the length of a first substring is the number of characters it contains and, likewise, the length of a second substring is the number of characters it contains. For example, for the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.
The sorting unit is configured to sort the m first substrings in descending order of length to obtain the first substrings S11 to S1m, and to sort the n second substrings in descending order of length to obtain the second substrings S21 to S2n. That is, the first substrings are sorted from longest to shortest according to the length of each first substring and, likewise, the second substrings are sorted from longest to shortest according to the length of each second substring.
Continuing the example above, in this embodiment the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, sorted from longest to shortest, become: 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门.
The second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, sorted from longest to shortest, become: 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东.
After sorting, each second substring is compared with the first string and the parts of the first string that repeat a second substring are deleted to obtain the first difference string. Specifically, the sorted second substrings 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东 are compared in turn with the first string "北京天安门". During the comparison, the first string "北京天安门" is found to share the part "天安门" with the second substring 天安门, so "天安门" is deleted from the first string "北京天安门", giving the first difference string: 北京.
Likewise, each first substring is compared with the second string and the parts of the second string that repeat a first substring are deleted to obtain the second difference string. Specifically, the sorted first substrings 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门 are compared in turn with the second string "天安门东". During the comparison, the second string "天安门东" is found to share the part "天安门" with the first substring 天安门, so "天安门" is deleted from the second string "天安门东", giving the second difference string: 东.
In this embodiment of the present invention, the second substrings are sorted from longest to shortest before being compared with the first string to obtain the first difference string. Compared with comparing the second substrings with the first string directly in the order in which they were cut, without sorting, this allows the larger repeated parts of the first string to be deleted early, when the first string is compared with the second substrings at the front of the order. The subsequent second substrings are then compared with a first string from which most of the repeated content has already been removed, which reduces the amount of content to compare and thus improves the efficiency of obtaining the first difference string. Likewise, the first substrings are sorted from longest to shortest before being compared with the second string to obtain the second difference string. Compared with comparing the unsorted first substrings with the second string directly, this allows the larger repeated parts of the second string to be deleted early, so that the subsequent first substrings are compared with a second string from which most of the repeated content has already been removed, which reduces the amount of content to compare and thus improves the efficiency of obtaining the second difference string.
In this embodiment of the present invention, sorting the first substrings and the second substrings improves the efficiency of obtaining the first difference string and the second difference string, and thereby improves the efficiency of calculating string similarity.
Specifically, the computing unit 30 comprises a computing module, and the computing module is configured to calculate the similarity according to the formula
[formula not reproduced in the source text]
where L(S1) is the length of the first string, L(S2) is the length of the second string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing the example above, the length L(S1) of the first string "北京天安门" is 5, the length L(S2) of the second string "天安门东" is 4, the length L(DS1) of the first difference string "北京" is 2, and the length L(DS2) of the second difference string "东" is 1. The similarity A of the first string "北京天安门" and the second string "天安门东" is obtained by substituting these values into the formula.
As can be seen from the above description, the present invention solves the prior-art problem of low efficiency in calculating string similarity and thereby improves computational efficiency.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be realized through some interfaces, and the indirect coupling or communication connection between units or modules may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.