CN104484391A - Method and device for calculating similarity of character strings

Info

Publication number
CN104484391A
Authority
CN
China
Prior art keywords
character string
substring
string
length
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410766683.1A
Other languages
Chinese (zh)
Other versions
CN104484391B (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201410766683.1A priority Critical patent/CN104484391B/en
Publication of CN104484391A publication Critical patent/CN104484391A/en
Application granted granted Critical
Publication of CN104484391B publication Critical patent/CN104484391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for calculating the similarity of character strings. The method for calculating the similarity of the character strings comprises the following steps of cutting a first character string and a second character string to obtain first sub strings of the first character string and second sub strings of the second character string; comparing the second sub strings with the first character string to delete the part, the same as the second sub strings, from the first character string to obtain a first difference string, and comparing the first sub strings with the second character string to delete the part, the same as the first sub strings, from the second character string to obtain a second difference string; according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string, calculating the similarity of the first character string and the second character string. By the method and the device provided by the invention, the problem of low efficiency in calculating the similarity of the character strings in the prior art is solved, and the effect of improving the calculation efficiency is achieved.

Description

Method and device for calculating similarity of character strings
Technical field
The present invention relates to the field of data processing, and in particular to a method and a device for calculating the similarity of character strings.
Background art
String similarity is important in text analysis. Among existing methods for calculating string similarity, a relatively mature one is the Levenshtein method, which calculates the minimum edit distance. The Levenshtein distance between two character strings is the minimum number of editing steps needed to convert one string into the other, where the editing operations are substitution, deletion, and insertion. The method is based on editing individual characters, which introduces a certain error, and the path of the similarity calculation is relatively complex, so the efficiency of calculating string similarity is low.
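For reference, the edit distance described above can be sketched with the standard dynamic-programming recurrence. This is only an illustration of the prior-art Levenshtein method, not part of the claimed invention; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the running time grows with the product of the two string lengths.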
No effective solution has yet been proposed for the problem of low efficiency in calculating string similarity in the prior art.
Summary of the invention
The main purpose of the present invention is to provide a method and a device for calculating the similarity of character strings, so as to solve the problem of low efficiency in calculating string similarity in the prior art.
To achieve this purpose, according to one aspect of the embodiments of the present invention, a method for calculating the similarity of character strings is provided.
The method for calculating the similarity of character strings according to the present invention comprises: cutting a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string; comparing the second substrings with the first character string to delete the parts of the first character string that are identical to the second substrings, obtaining a first difference string, and comparing the first substrings with the second character string to delete the parts of the second character string that are identical to the first substrings, obtaining a second difference string; and calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string, and the length of the second difference string.
Further, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. Comparing the second substrings with the first character string to delete the parts of the first character string identical to the second substrings and obtain the first difference string comprises repeating a first judging step and a first deleting step until i = n, at which point the first difference string is obtained, where the initial value of i is 1. First judging step: judge whether the first character string contains the second substring S2i. First deleting step: when the first character string is judged to contain the second substring S2i, delete the part identical to the second substring S2i from the first character string and set i = i + 1. Comparing the first substrings with the second character string to delete the parts of the second character string identical to the first substrings and obtain the second difference string comprises repeating a second judging step and a second deleting step until j = m, at which point the second difference string is obtained, where the initial value of j is 1. Second judging step: judge whether the second character string contains the first substring S1j. Second deleting step: when the second character string is judged to contain the first substring S1j, delete the part identical to the first substring S1j from the second character string and set j = j + 1.
Further, before the second substrings are compared with the first character string to obtain the first difference string and the first substrings are compared with the second character string to obtain the second difference string, the method also comprises: obtaining the length of each first substring and the length of each second substring; and sorting the m first substrings in descending order of length to obtain first substrings S11 to S1m, and sorting the n second substrings in descending order of length to obtain second substrings S21 to S2n.
Further, calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string, and the length of the second difference string comprises calculating the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.
To achieve this purpose, according to another aspect of the embodiments of the present invention, a device for calculating the similarity of character strings is provided.
The device for calculating the similarity of character strings according to the present invention comprises: a cutting unit, configured to cut a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string; a processing unit, configured to compare the second substrings with the first character string to delete the parts of the first character string that are identical to the second substrings, obtaining a first difference string, and to compare the first substrings with the second character string to delete the parts of the second character string that are identical to the first substrings, obtaining a second difference string; and a computing unit, configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string, and the length of the second difference string.
Further, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. The processing unit comprises a first judging module and a first deleting module that are called repeatedly until i = n, at which point the first difference string is obtained, where the initial value of i is 1. The first judging module is configured to judge whether the first character string contains the second substring S2i. The first deleting module is configured to, when the first judging module judges that the first character string contains the second substring S2i, delete the part identical to the second substring S2i from the first character string and set i = i + 1. The processing unit also comprises a second judging module and a second deleting module that are called repeatedly until j = m, at which point the second difference string is obtained, where the initial value of j is 1. The second judging module is configured to judge whether the second character string contains the first substring S1j. The second deleting module is configured to, when the second judging module judges that the second character string contains the first substring S1j, delete the part identical to the first substring S1j from the second character string and set j = j + 1.
Further, the device also comprises: an acquiring unit, configured to obtain the length of each first substring and the length of each second substring; and a sorting unit, configured to sort the m first substrings in descending order of length to obtain first substrings S11 to S1m, and to sort the n second substrings in descending order of length to obtain second substrings S21 to S2n.
Further, the computing unit comprises a computing module, configured to calculate the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.
According to the embodiments of the invention, the first character string and the second character string are cut to obtain first substrings of the first character string and second substrings of the second character string; the second substrings are compared with the first character string to delete the parts of the first character string that are identical to the second substrings, obtaining a first difference string, and the first substrings are compared with the second character string to delete the parts of the second character string that are identical to the first substrings, obtaining a second difference string; and the similarity of the first character string and the second character string is calculated according to the lengths of the first character string, the second character string, the first difference string, and the second difference string. The character strings whose similarity is to be calculated are cut, the parts of each string that are identical to the substrings cut from the other string are deleted to obtain difference strings, and the difference strings are used to calculate the similarity. The strings are thus compared for differences by cutting each against the other, and the similarity is derived in reverse from the differences. This way of calculating similarity has simple logic and can calculate the similarity of different character strings quickly, which solves the problem of low efficiency in calculating string similarity in the prior art and thereby improves calculation efficiency.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for calculating the similarity of character strings according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of the device for calculating the similarity of character strings according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment is provided that can be used to implement the device embodiment of this application. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order.
According to an embodiment of the present invention, a method for calculating the similarity of character strings is provided. Fig. 1 is a flowchart of the method for calculating the similarity of character strings according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps S102 to S106:
S102: Cut the first character string and the second character string to obtain the first substrings of the first character string and the second substrings of the second character string. Specifically, N-gram cutting can be used. For example, let the first character string be "北京天安门" ("Beijing Tiananmen") and the second character string be "天安门东" ("Tiananmen East"). Cutting the first character string "北京天安门" with 3-gram yields the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门. Cutting the second character string "天安门东" with 3-gram yields the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东.
S104: Compare the second substrings with the first character string to delete the parts of the first character string that are identical to the second substrings, obtaining the first difference string; and compare the first substrings with the second character string to delete the parts of the second character string that are identical to the first substrings, obtaining the second difference string. That is, in this embodiment the second substrings are compared with the first character string, the parts of the first character string that are identical to a second substring are found and deleted, and what remains of the first character string after the deletion is the first difference string. Likewise, the first substrings are compared with the second character string, the parts of the second character string that are identical to a first substring are found and deleted, and what remains of the second character string after the deletion is the second difference string.
S106: Calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string, and the length of the second difference string.
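The cutting in step S102 can be sketched as follows. The sketch assumes that "3-gram" cutting means taking, at every start position, each substring of length 3 down to length 1, since that reading reproduces the substring lists given in the example; the text does not define the cutting more precisely. The names `cut` and `max_n` are illustrative and do not come from the patent.

```python
def cut(text: str, max_n: int = 3) -> list[str]:
    """Return every substring of length max_n down to 1 at each start position.

    Assumption: this is how the example's "3-gram" cutting behaves.
    """
    substrings = []
    for start in range(len(text)):
        # Near the end of the string fewer characters are available.
        longest = min(max_n, len(text) - start)
        for n in range(longest, 0, -1):
            substrings.append(text[start:start + n])
    return substrings


print(cut("天安门东"))
# ['天安门', '天安', '天', '安门东', '安门', '安', '门东', '门', '东']
```

Lengths here are counted in characters, so each Chinese character contributes 1.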
In this embodiment, the character strings whose similarity is to be calculated are cut, the parts of each string that are identical to the substrings cut from the other string are deleted to obtain difference strings, and the difference strings are used to calculate the similarity. The strings are thus compared for differences by cutting each against the other, and the similarity is derived in reverse from the differences. This way of calculating similarity has simple logic and can calculate the similarity of different character strings quickly, which solves the problem of low efficiency in calculating string similarity in the prior art and thereby improves calculation efficiency.
Preferably, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. Comparing the second substrings with the first character string to delete the parts of the first character string identical to the second substrings and obtain the first difference string comprises repeating the following first judging step and first deleting step until i = n, at which point the first difference string is obtained, where the initial value of i is 1:
First judging step: judge whether the first character string contains the second substring S2i;
First deleting step: when the first character string is judged to contain the second substring S2i, delete the part identical to the second substring S2i from the first character string and set i = i + 1.
Comparing the first substrings with the second character string to delete the parts of the second character string identical to the first substrings and obtain the second difference string comprises repeating the following second judging step and second deleting step until j = m, at which point the second difference string is obtained, where the initial value of j is 1:
Second judging step: judge whether the second character string contains the first substring S1j;
Second deleting step: when the second character string is judged to contain the first substring S1j, delete the part identical to the first substring S1j from the second character string and set j = j + 1.
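The judging and deleting steps above can be sketched as a single loop. Two details are assumptions made for this sketch because the text does not settle them: every occurrence of a matching substring is deleted, and the loop moves on to the next substring whether or not a match was found. The name `difference_string` is illustrative.

```python
def difference_string(target: str, substrings: list[str]) -> str:
    """Delete from target every part identical to one of the substrings."""
    remaining = target
    for sub in substrings:
        # Judging step: does what is left of the target contain this substring?
        if sub and sub in remaining:
            # Deleting step (assumption: all occurrences are removed).
            remaining = remaining.replace(sub, "")
    return remaining


second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
print(difference_string("北京天安门", second_substrings))  # 北京
```

The same function yields the second difference string when it is called with the second character string and the first substrings.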
In this embodiment, the first difference string is obtained by comparing each second substring with the first character string, and the second difference string is obtained by comparing each first substring with the second character string. This improves the accuracy of the first and second difference strings and provides accurate basic data for the subsequent calculation of the similarity of the character strings (that is, the first character string and the second character string).
Preferably, before the second substrings are compared with the first character string to obtain the first difference string and the first substrings are compared with the second character string to obtain the second difference string, the method for calculating the similarity of character strings provided by this embodiment also comprises:
Obtaining the length of each first substring and the length of each second substring, where the length of a first substring is the number of characters it contains and, likewise, the length of a second substring is the number of characters it contains. For example, for the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.
Sorting the m first substrings in descending order of length to obtain first substrings S11 to S1m, and sorting the n second substrings in descending order of length to obtain second substrings S21 to S2n. That is, the first substrings are sorted from longest to shortest according to the length of each first substring, and the second substrings are likewise sorted from longest to shortest according to the length of each second substring.
Continuing the example above, in this embodiment the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, sorted from longest to shortest, are: 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门.
The second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, sorted from longest to shortest, are: 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东.
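The sorting step can be sketched with a sort keyed on length. The order among substrings of equal length is not specified in the text, so this sketch simply keeps their original relative order (Python's sort is stable); the helper name `sort_by_length` is illustrative.

```python
def sort_by_length(substrings: list[str]) -> list[str]:
    """Sort substrings from longest to shortest.

    Substrings of equal length keep their original relative order,
    which is an assumption; the text does not fix the tie order.
    """
    return sorted(substrings, key=len, reverse=True)


second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
print(sort_by_length(second_substrings))
# ['天安门', '安门东', '天安', '安门', '门东', '天', '安', '门', '东']
```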
After sorting, each second substring is compared with the first character string, and the parts of the first character string that repeat a second substring are deleted to obtain the first difference string. Specifically, the sorted second substrings 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东 are compared in turn with the first character string "北京天安门" ("Beijing Tiananmen"). During the comparison, the first character string "北京天安门" is found to share the part "天安门" with the second substring 天安门, so "天安门" is deleted from the first character string, giving the first difference string: 北京.
Likewise, each first substring is compared with the second character string, and the parts of the second character string that repeat a first substring are deleted to obtain the second difference string. Specifically, the sorted first substrings 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门 are compared in turn with the second character string "天安门东" ("Tiananmen East"). During the comparison, the second character string "天安门东" is found to share the part "天安门" with the first substring 天安门, so "天安门" is deleted from the second character string, giving the second difference string: 东.
In this embodiment, the second substrings are sorted from longest to shortest before being compared with the first character string to obtain the first difference string. Compared with comparing the unsorted second substrings, as produced by the cutting, directly with the first character string, this allows the larger parts of the first character string that repeat a second substring to be deleted quickly when the first character string is compared with the second substrings at the front of the order. The later second substrings are then compared with a first character string from which most of the repeated content has already been deleted, which reduces the amount of content to compare and improves the efficiency of obtaining the first difference string. Likewise, the first substrings are sorted from longest to shortest before being compared with the second character string to obtain the second difference string. Compared with comparing the unsorted first substrings directly with the second character string, this allows the larger parts of the second character string that repeat a first substring to be deleted quickly, so that the later first substrings are compared with a shorter second character string, which improves the efficiency of obtaining the second difference string.
In this embodiment, sorting the first substrings and the second substrings improves the efficiency of obtaining the first difference string and the second difference string, and in turn improves the efficiency of calculating the similarity of the character strings.
Specifically, calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string, and the length of the second difference string comprises calculating the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing the example above, the length L(S1) of the first character string "北京天安门" ("Beijing Tiananmen") is 5, the length L(S2) of the second character string "天安门东" ("Tiananmen East") is 4, the length L(DS1) of the first difference string 北京 is 2, and the length L(DS2) of the second difference string 东 is 1, so the similarity of the first character string and the second character string is $A = 1 - \frac{2 + 1}{5 + 4} = 1 - \frac{1}{3} \approx 0.6667$.
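Putting the steps together, the whole calculation can be sketched end to end. This is a minimal sketch under stated assumptions, not the patented implementation: cutting takes every substring of length 3 down to 1 at each position, all occurrences of a matching substring are deleted, and two empty strings are treated as identical (the text does not cover that case). The function names are illustrative.

```python
def cut(text: str, max_n: int = 3) -> list[str]:
    """All substrings of length max_n down to 1 at each start position."""
    return [text[i:i + n]
            for i in range(len(text))
            for n in range(min(max_n, len(text) - i), 0, -1)]


def difference_string(target: str, substrings: list[str]) -> str:
    """Delete from target every part identical to one of the substrings."""
    for sub in substrings:
        if sub and sub in target:
            target = target.replace(sub, "")
    return target


def similarity(s1: str, s2: str, max_n: int = 3) -> float:
    """A = 1 - (L(DS1) + L(DS2)) / (L(S1) + L(S2))."""
    # Longest substrings first, as in the sorting step.
    subs1 = sorted(cut(s1, max_n), key=len, reverse=True)
    subs2 = sorted(cut(s2, max_n), key=len, reverse=True)
    ds1 = difference_string(s1, subs2)  # first difference string
    ds2 = difference_string(s2, subs1)  # second difference string
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - (len(ds1) + len(ds2)) / total


print(round(similarity("北京天安门", "天安门东"), 4))  # 0.6667
```

With these assumptions the result matches the worked example: the difference strings are 北京 and 东, so A = 1 - 3/9.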
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present invention.
Embodiment 2
According to an embodiment of the present invention, a device for calculating the similarity of character strings is also provided for implementing the above method. The device is mainly used to perform the method for calculating the similarity of character strings provided above, and is described in detail below:
Fig. 2 is a schematic diagram of the device for calculating the similarity of character strings according to an embodiment of the present invention. As shown in Fig. 2, the device mainly comprises a cutting unit 10, a processing unit 20 and a computing unit 30, wherein:
The cutting unit 10 is configured to cut the first character string and the second character string to obtain the first substrings of the first character string and the second substrings of the second character string. Specifically, N-Gram segmentation may be used to cut the two character strings. For example, let the first character string be "北京天安门" (Beijing Tiananmen) and the second character string be "天安门东" (Tiananmen East). Cutting the first character string "北京天安门" with 3-Gram yields the first substrings 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门; cutting the second character string "天安门东" with 3-Gram yields the second substrings 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东.
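The 3-Gram cutting described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `cut` is hypothetical, and the sketch assumes that the example strings are the Chinese originals "北京天安门" and "天安门东" and that, at each start position, every substring of length N down to 1 is emitted, which is the pattern the listed substrings follow.

```python
def cut(s: str, n: int = 3) -> list[str]:
    """Return every substring of s with length 1..n, grouped by start position.

    Assumption: within each start position, longer substrings come first,
    mirroring the order in which the example lists them.
    """
    substrings = []
    for start in range(len(s)):
        longest = min(n, len(s) - start)  # do not run past the end of s
        for length in range(longest, 0, -1):
            substrings.append(s[start:start + length])
    return substrings


first_substrings = cut("北京天安门")   # 12 substrings
second_substrings = cut("天安门东")    # 9 substrings
```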
The processing unit 20 is configured to compare the second substrings with the first character string and delete from the first character string the parts identical to the second substrings, obtaining a first difference string, and to compare the first substrings with the second character string and delete from the second character string the parts identical to the first substrings, obtaining a second difference string. That is, in this embodiment, the second substrings are compared with the first character string to find the parts of the first character string that are identical to a second substring, and those parts are deleted from the first character string; what remains of the first character string is the first difference string. Likewise, the first substrings are compared with the second character string to find the parts of the second character string that are identical to a first substring, and those parts are deleted from the second character string; what remains of the second character string is the second difference string.
The computing unit 30 is configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.
In this embodiment, the character strings whose similarity is to be calculated are cut, the parts of each character string that are identical to the substrings cut from the other are deleted to obtain difference strings, and the similarity is calculated from the difference strings. The two character strings are thus compared for differences by cutting each against the other, and their similarity is derived in reverse from how much they differ. The logic of this calculation is simple and the similarity of different character strings can be calculated quickly, which solves the problem in the prior art of low efficiency in calculating the similarity of character strings and thereby improves calculation efficiency.
Preferably, in this embodiment, the number of first substrings is m and the number of second substrings is n, where m and n are natural numbers of 2 or more. The processing unit comprises a first judging module and a first deletion module that are called repeatedly, and further comprises a second judging module and a second deletion module that are called repeatedly.
The first judging module and the first deletion module are called repeatedly until i = n to obtain the first difference string, where the initial value of i is 1:
The first judging module is configured to judge whether the first character string contains the second substring S2i;
The first deletion module is configured to delete from the first character string the part identical to the second substring S2i when the first judging module judges that the first character string contains the second substring S2i, and to set i = i + 1.
The second judging module and the second deletion module are called repeatedly until j = m to obtain the second difference string, where the initial value of j is 1:
The second judging module is configured to judge whether the second character string contains the first substring S1j;
The second deletion module is configured to delete from the second character string the part identical to the first substring S1j when the second judging module judges that the second character string contains the first substring S1j, and to set j = j + 1.
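The repeated judge-then-delete calls can be read as a simple loop over the substrings of the other character string. The sketch below is one possible reading, under stated assumptions: the name `difference_string` is hypothetical, the index advances whether or not a match is found, and every occurrence of a matching substring is deleted (the text does not say whether only the first occurrence is removed). The example strings are assumed to be the Chinese originals "北京天安门" and "天安门东".

```python
def difference_string(target: str, substrings: list[str]) -> str:
    """Delete from target each substring (of the other string) that it contains."""
    remaining = target
    i = 0  # 0-based here; the text counts i from 1 to n
    while i < len(substrings):
        candidate = substrings[i]
        # Judging step: does the (already shortened) string contain S_i?
        if candidate and candidate in remaining:
            # Deleting step (assumption: remove all occurrences).
            remaining = remaining.replace(candidate, "")
        i += 1
    return remaining


# Second substrings of "天安门东" applied to the first string "北京天安门".
second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
first_difference = difference_string("北京天安门", second_substrings)  # "北京"
```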
In this embodiment, the first difference string is obtained by comparing each second substring with the first character string, and the second difference string is obtained by comparing each first substring with the second character string. This improves the accuracy of the first and second difference strings and provides accurate basic data for the subsequent calculation of the similarity of the character strings (that is, the first character string and the second character string).
Preferably, the device for calculating the similarity of character strings provided by this embodiment further comprises an acquiring unit and a sorting unit, wherein:
The acquiring unit is configured to obtain the length of each first substring and the length of each second substring, where the length of a substring is the number of characters it contains. For example, for the first substrings of "北京天安门" (Beijing Tiananmen), namely 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, the lengths are 3, 2, 1, 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively; for the second substrings of "天安门东" (Tiananmen East), namely 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, the lengths are 3, 2, 1, 3, 2, 1, 2, 1, 1 respectively.
The sorting unit is configured to sort the m first substrings in order of length from longest to shortest, obtaining the first substring S11 to the first substring S1m, and to sort the n second substrings in order of length from longest to shortest, obtaining the second substring S21 to the second substring S2n. That is, the first substrings are ordered from long to short according to the length of each first substring, and the second substrings are likewise ordered from long to short according to the length of each second substring.
Continuing with the above example, in this embodiment the first substrings of "北京天安门" (Beijing Tiananmen), namely 北京天, 北京, 北, 京天安, 京天, 京, 天安门, 天安, 天, 安门, 安, 门, sorted from long to short are: 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门.
The second substrings of "天安门东" (Tiananmen East), namely 天安门, 天安, 天, 安门东, 安门, 安, 门东, 门, 东, sorted from long to short are: 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东.
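Sorting by length from long to short is a one-line operation in most languages. A minimal Python sketch follows; `sort_by_length` is a hypothetical name, and the substrings are assumed to be those of the Chinese original "天安门东". Python's `sorted` is stable, so substrings of equal length keep their original relative order. The text does not specify how ties are ordered, so for other inputs the order among equal-length substrings may differ from a listing produced by another implementation.

```python
def sort_by_length(substrings: list[str]) -> list[str]:
    """Order substrings from longest to shortest (stable for equal lengths)."""
    return sorted(substrings, key=len, reverse=True)


second_substrings = ["天安门", "天安", "天", "安门东", "安门", "安", "门东", "门", "东"]
ordered = sort_by_length(second_substrings)
# ordered == ["天安门", "安门东", "天安", "安门", "门东", "天", "安", "门", "东"]
```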
After sorting, each second substring is compared with the first character string, and the parts of the first character string that repeat a second substring are deleted to obtain the first difference string. Specifically, the sorted second substrings 天安门, 安门东, 天安, 安门, 门东, 天, 安, 门, 东 are compared in turn with the first character string "北京天安门" (Beijing Tiananmen). During the comparison, the first character string "北京天安门" is found to share the part "天安门" with the second substring "天安门", so "天安门" is deleted from the first character string "北京天安门", giving the first difference string: 北京.
Likewise, each first substring is compared with the second character string, and the parts of the second character string that repeat a first substring are deleted to obtain the second difference string. Specifically, the sorted first substrings 北京天, 京天安, 天安门, 京天, 北京, 天安, 安门, 北, 京, 天, 安, 门 are compared in turn with the second character string "天安门东" (Tiananmen East). During the comparison, the second character string "天安门东" is found to share the part "天安门" with the first substring "天安门", so "天安门" is deleted from the second character string "天安门东", giving the second difference string: 东.
In this embodiment, the second substrings are sorted from long to short before being compared with the first character string to obtain the first difference string. Compared with comparing the second substrings with the first character string directly in the order in which they were cut, this allows the larger parts of the first character string that repeat a second substring to be deleted early, when the first character string is compared with the second substrings at the front of the order. The subsequent second substrings are then compared with a first character string from which most of the repeated content has already been removed, which reduces the amount of content to compare and improves the efficiency of obtaining the first difference string. Likewise, the first substrings are sorted from long to short before being compared with the second character string to obtain the second difference string. Compared with comparing the first substrings with the second character string directly in the order in which they were cut, this allows the larger parts of the second character string that repeat a first substring to be deleted early, so that the subsequent first substrings are compared with a second character string from which most of the repeated content has already been removed, which reduces the amount of content to compare and improves the efficiency of obtaining the second difference string.
In this embodiment, sorting the first substrings and the second substrings improves the efficiency of obtaining the first and second difference strings, and thereby improves the efficiency of calculating the similarity of the character strings.
Specifically, the computing unit 30 comprises a computing module configured to calculate the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, where L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity. Continuing with the above example, the length L(S1) of the first character string "北京天安门" (Beijing Tiananmen) is 5, the length L(S2) of the second character string "天安门东" (Tiananmen East) is 4, the length L(DS1) of the first difference string "北京" is 2, and the length L(DS2) of the second difference string "东" is 1. The similarity of the first character string "北京天安门" and the second character string "天安门东" is therefore $A = 1 - \frac{2 + 1}{5 + 4} = 1 - \frac{1}{3} \approx 0.6667$.
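The worked arithmetic above implies the formula A = 1 - (L(DS1) + L(DS2)) / (L(S1) + L(S2)). The following Python sketch strings the cutting, sorting, deletion and calculation steps together to reproduce the example. It is an illustration under assumptions rather than the patented implementation: the function names are hypothetical, the example strings are assumed to be the Chinese originals "北京天安门" and "天安门东", all occurrences of a matching substring are deleted, and two empty input strings are treated as identical (similarity 1.0), a case the text does not cover.

```python
def cut(s: str, n: int = 3) -> list[str]:
    """All substrings of s with length 1..n (N-Gram style cutting)."""
    return [s[i:i + k] for i in range(len(s)) for k in range(min(n, len(s) - i), 0, -1)]


def difference_string(target: str, other: str, n: int = 3) -> str:
    """Delete from target the parts that match substrings of other, longest first."""
    for sub in sorted(cut(other, n), key=len, reverse=True):
        if sub in target:
            target = target.replace(sub, "")  # assumption: remove every occurrence
    return target


def similarity(s1: str, s2: str, n: int = 3) -> float:
    """A = 1 - (L(DS1) + L(DS2)) / (L(S1) + L(S2))."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    ds1 = difference_string(s1, s2, n)  # first difference string
    ds2 = difference_string(s2, s1, n)  # second difference string
    return 1 - (len(ds1) + len(ds2)) / total


print(round(similarity("北京天安门", "天安门东"), 4))  # 0.6667
```

Identical strings score 1.0 and strings with no shared characters score 0.0 under this formula, since the difference strings are then empty or equal to the inputs respectively.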
As can be seen from the above description, the present invention solves the problem in the prior art of low efficiency in calculating the similarity of character strings, and thereby improves calculation efficiency.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between units or modules may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a portable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A method for calculating the similarity of character strings, characterized by comprising:
cutting a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string;
comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining a first difference string, and comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining a second difference string; and
calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.
2. The method according to claim 1, characterized in that the number of the first substrings is m and the number of the second substrings is n, m and n being natural numbers of 2 or more, wherein:
comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining the first difference string, comprises repeating a first judging step and a first deleting step until i = n to obtain the first difference string, wherein the initial value of i is 1:
the first judging step: judging whether the first character string contains the second substring S2i; and
the first deleting step: when it is judged that the first character string contains the second substring S2i, deleting from the first character string the part identical to the second substring S2i, and setting i = i + 1,
comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining the second difference string, comprises repeating a second judging step and a second deleting step until j = m to obtain the second difference string, wherein the initial value of j is 1:
the second judging step: judging whether the second character string contains the first substring S1j; and
the second deleting step: when it is judged that the second character string contains the first substring S1j, deleting from the second character string the part identical to the first substring S1j, and setting j = j + 1.
3. The method according to claim 2, characterized in that, before comparing the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining the first difference string, and comparing the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining the second difference string, the method further comprises:
obtaining the length of each first substring and the length of each second substring; and
sorting the m first substrings in order of length from longest to shortest to obtain the first substring S11 to the first substring S1m, and sorting the n second substrings in order of length from longest to shortest to obtain the second substring S21 to the second substring S2n.
4. The method according to claim 1, characterized in that calculating the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string comprises:
calculating the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, wherein L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.
5. A device for calculating the similarity of character strings, characterized by comprising:
a cutting unit, configured to cut a first character string and a second character string to obtain first substrings of the first character string and second substrings of the second character string;
a processing unit, configured to compare the second substrings with the first character string to delete from the first character string the parts identical to the second substrings, obtaining a first difference string, and to compare the first substrings with the second character string to delete from the second character string the parts identical to the first substrings, obtaining a second difference string; and
a computing unit, configured to calculate the similarity of the first character string and the second character string according to the length of the first character string, the length of the second character string, the length of the first difference string and the length of the second difference string.
6. The device according to claim 5, characterized in that the number of the first substrings is m and the number of the second substrings is n, m and n being natural numbers of 2 or more, wherein:
the processing unit comprises a first judging module and a first deletion module that are called repeatedly, wherein the first judging module and the first deletion module are called repeatedly until i = n to obtain the first difference string, and the initial value of i is 1:
the first judging module is configured to judge whether the first character string contains the second substring S2i;
the first deletion module is configured to, when the first judging module judges that the first character string contains the second substring S2i, delete from the first character string the part identical to the second substring S2i, and set i = i + 1,
the processing unit further comprises a second judging module and a second deletion module that are called repeatedly, wherein the second judging module and the second deletion module are called repeatedly until j = m to obtain the second difference string, and the initial value of j is 1:
the second judging module is configured to judge whether the second character string contains the first substring S1j; and
the second deletion module is configured to, when the second judging module judges that the second character string contains the first substring S1j, delete from the second character string the part identical to the first substring S1j, and set j = j + 1.
7. The device according to claim 6, characterized in that the device further comprises:
an acquiring unit, configured to obtain the length of each first substring and the length of each second substring; and
a sorting unit, configured to sort the m first substrings in order of length from longest to shortest to obtain the first substring S11 to the first substring S1m, and to sort the n second substrings in order of length from longest to shortest to obtain the second substring S21 to the second substring S2n.
8. The device according to claim 5, characterized in that the computing unit comprises:
a computing module, configured to calculate the similarity according to the formula $A = 1 - \frac{L(DS1) + L(DS2)}{L(S1) + L(S2)}$, wherein L(S1) is the length of the first character string, L(S2) is the length of the second character string, L(DS1) is the length of the first difference string, L(DS2) is the length of the second difference string, and A is the similarity.
CN201410766683.1A 2014-12-11 2014-12-11 The computational methods and device of similarity of character string Active CN104484391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410766683.1A CN104484391B (en) 2014-12-11 2014-12-11 The computational methods and device of similarity of character string

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410766683.1A CN104484391B (en) 2014-12-11 2014-12-11 The computational methods and device of similarity of character string

Publications (2)

Publication Number Publication Date
CN104484391A true CN104484391A (en) 2015-04-01
CN104484391B CN104484391B (en) 2017-11-21

Family

ID=52758932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410766683.1A Active CN104484391B (en) 2014-12-11 2014-12-11 The computational methods and device of similarity of character string

Country Status (1)

Country Link
CN (1) CN104484391B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598986A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Similarity calculation method and apparatus
WO2017143907A1 (en) * 2016-02-22 2017-08-31 阿里巴巴集团控股有限公司 Character string distance calculation method and device
CN107451125A (en) * 2017-08-19 2017-12-08 洪志令 A kind of method that quick close semantic matches are carried out for order outlier group
CN112527952A (en) * 2019-09-18 2021-03-19 本田技研工业株式会社 File comparison system
CN116132431A (en) * 2023-04-19 2023-05-16 泰诺尔(北京)科技有限公司 Data transmission method and system
CN116502611A (en) * 2023-06-28 2023-07-28 深圳魔视智能科技有限公司 Labeling method, labeling device, equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1434400A (en) * 2002-01-22 2003-08-06 住友电气工业株式会社 Method, device, program, and recording medium for chararacter similarity calculation
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
US20090049028A1 (en) * 2003-07-30 2009-02-19 Oracle International Corporation Method of determining the similarity of two strings
CN101561813A (en) * 2009-05-27 2009-10-21 东北大学 Method for analyzing similarity of character string under Web environment
CN102982291A (en) * 2012-11-05 2013-03-20 北京奇虎科技有限公司 Methods and device of dependable file digital signature acquisition
CN103399907A (en) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating similarity of Chinese character strings on the basis of edit distance

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1434400A (en) * 2002-01-22 2003-08-06 住友电气工业株式会社 Method, device, program, and recording medium for chararacter similarity calculation
US20090049028A1 (en) * 2003-07-30 2009-02-19 Oracle International Corporation Method of determining the similarity of two strings
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
CN101561813A (en) * 2009-05-27 2009-10-21 东北大学 Method for analyzing similarity of character string under Web environment
CN102982291A (en) * 2012-11-05 2013-03-20 北京奇虎科技有限公司 Methods and device of dependable file digital signature acquisition
CN103399907A (en) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating similarity of Chinese character strings on the basis of edit distance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
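Song Yan et al.: "N-gram-based sentence similarity calculation technique" (基于N-gram的句子相似度计算技术), Proceedings of the 9th National Conference on Computational Linguistics (第九届全国计算语言学学术会议) *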

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598986A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Similarity calculation method and apparatus
CN106598986B (en) * 2015-10-16 2020-11-27 北京国双科技有限公司 Similarity calculation method and device
WO2017143907A1 (en) * 2016-02-22 2017-08-31 阿里巴巴集团控股有限公司 Character string distance calculation method and device
TWI659358B (en) * 2016-02-22 2019-05-11 香港商阿里巴巴集團服務有限公司 Method and device for calculating string distance
US11256756B2 (en) 2016-02-22 2022-02-22 Advanced New Technologies Co., Ltd. Character string distance calculation method and device
CN107451125A (en) * 2017-08-19 2017-12-08 洪志令 A kind of method that quick close semantic matches are carried out for order outlier group
CN107451125B (en) * 2017-08-19 2021-05-18 洪志令 Method for performing rapid close semantic matching aiming at sequence-independent item groups
CN112527952A (en) * 2019-09-18 2021-03-19 本田技研工业株式会社 File comparison system
CN112527952B (en) * 2019-09-18 2024-04-30 本田技研工业株式会社 File comparison system
CN116132431A (en) * 2023-04-19 2023-05-16 泰诺尔(北京)科技有限公司 Data transmission method and system
CN116502611A (en) * 2023-06-28 2023-07-28 深圳魔视智能科技有限公司 Labeling method, labeling device, equipment and readable storage medium
CN116502611B (en) * 2023-06-28 2023-12-05 深圳魔视智能科技有限公司 Labeling method, labeling device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN104484391B (en) 2017-11-21

Similar Documents

Publication Publication Date Title
CN104484391A (en) Method and device for calculating similarity of character strings
CN105389349B (en) Dictionary update method and device
US11138250B2 (en) Method and device for extracting core word of commodity short text
CN105183923B (en) New word discovery method and device
CN104699772B (en) A kind of big data file classification method based on cloud computing
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN106844640B (en) Webpage data analysis processing method
CN106897290B (en) Method and device for establishing keyword model
CN103377239A (en) Method and device for calculating inter-textual similarity
CN105550253B (en) Method and device for acquiring type relationship
CN106296286A (en) The predictor method of ad click rate and estimating device
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN104484449A (en) Web page text extraction method and web page text extraction device
CN112650743A (en) Funnel data analysis method and system, electronic device and storage medium
CN116881430A (en) Industrial chain identification method and device, electronic equipment and readable storage medium
CN110347934B (en) Text data filtering method, device and medium
CN106650610A (en) Human face expression data collection method and device
CN111931848A (en) Data feature extraction method and device, computer equipment and storage medium
CN104462061A (en) Word extraction method and word extraction device
CN105447004A (en) Mining device for query suggestion words, related query method and device
CN106033444B (en) Text content clustering method and device
CN111159213A (en) Data query method, device, system and storage medium
CN112559465A (en) Log compression method and device, electronic equipment and storage medium
CN104657749A (en) Method and device for classifying time series
CN104281710A (en) Network data excavation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and device for calculating similarity of character strings

Effective date of registration: 20190531

Granted publication date: 20171121

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20171121