Disclosure of Invention
The application provides a method for determining whether character strings are the same, which comprises the following steps:
calculating the edit distance between a first character string and a second character string;
adapting the lengths of the first character string and the second character string based on the edit distance, and calculating a similarity based on the adapted lengths of the first character string and the second character string;
determining whether the first character string and the second character string are the same based on the similarity.
Optionally, calculating the edit distance between the first character string and the second character string includes:
performing Unicode encoding on the first character string and the second character string;
and calculating the edit distance between the Unicode-encoded first character string and second character string.
Optionally, the adapting the lengths of the first character string and the second character string based on the edit distance includes:
calculating the maximum value and the minimum value in the lengths of the first character string and the second character string;
subtracting the edit distance from the maximum value, or adding the edit distance to the minimum value, to adapt the lengths of the first and second character strings.
Optionally, calculating the similarity between the first character string and the second character string after the length adaptation includes:
calculating the ratio of the maximum value, after the edit distance has been subtracted from it, to the minimum value, and representing the similarity of the first character string and the second character string based on the ratio; or
calculating the ratio of the maximum value to the minimum value after the edit distance has been added to it, and representing the similarity of the first character string and the second character string based on the ratio.
Optionally, the adapting the lengths of the first character string and the second character string based on the edit distance, and calculating the similarity based on the adapted lengths of the first character string and the second character string includes:
calculating the similarity of the first character string and the second character string based on a preset similarity calculation formula;
the similarity calculation formula includes:
where x represents the first character string and |x| represents the length of the first character string; y represents the second character string and |y| represents the length of the second character string; max(|x|, |y|) represents the maximum value of the lengths of the first character string and the second character string; min(|x|, |y|) represents the minimum value of the lengths of the first character string and the second character string; ds represents the edit distance between the first character string and the second character string; C represents a correction parameter and is a constant greater than or equal to 0.
Optionally, the determining whether the first character string and the second character string are the same based on the similarity includes:
judging whether the calculated similarity reaches a preset threshold value or not;
when the calculated similarity reaches a preset threshold value, judging that the first character string is the same as the second character string;
and when the calculated similarity does not reach a preset threshold value, judging that the first character string is different from the second character string.
The present application also provides a device for determining the same character string, the device including:
a first calculation module, configured to calculate the edit distance between a first character string and a second character string;
a second calculation module, configured to adapt the lengths of the first character string and the second character string based on the edit distance, and to calculate a similarity based on the adapted lengths of the first character string and the second character string;
and the judging module is used for judging whether the first character string is the same as the second character string or not based on the similarity.
Optionally, the first calculating module is specifically configured to:
performing Unicode encoding on the first character string and the second character string;
and calculating the edit distance between the Unicode-encoded first character string and second character string.
Optionally, the second calculating module is specifically configured to:
calculating the maximum value and the minimum value in the lengths of the first character string and the second character string;
subtracting the edit distance from the maximum value, or adding the edit distance to the minimum value, to adapt the lengths of the first and second character strings.
Optionally, the second calculating module is further configured to:
calculating the ratio of the maximum value, after the edit distance has been subtracted from it, to the minimum value, and representing the similarity of the first character string and the second character string based on the ratio; or
calculating the ratio of the maximum value to the minimum value after the edit distance has been added to it, and representing the similarity of the first character string and the second character string based on the ratio.
Optionally, the second calculating module is further configured to:
calculating the similarity of the first character string and the second character string based on a preset similarity calculation formula;
the similarity calculation formula includes:
where x represents the first character string and |x| represents the length of the first character string; y represents the second character string and |y| represents the length of the second character string; max(|x|, |y|) represents the maximum value of the lengths of the first character string and the second character string; min(|x|, |y|) represents the minimum value of the lengths of the first character string and the second character string; ds represents the edit distance between the first character string and the second character string; C represents a correction parameter and is a constant greater than or equal to 0.
Optionally, the determining module is specifically configured to:
judging whether the calculated similarity reaches a preset threshold value or not;
when the calculated similarity reaches a preset threshold value, judging that the first character string is the same as the second character string;
and when the calculated similarity does not reach a preset threshold value, judging that the first character string is different from the second character string.
In the application, the edit distance between a first character string and a second character string is calculated, the lengths of the first character string and the second character string are adapted based on the edit distance, and the similarity is calculated based on the adapted lengths; whether the first character string and the second character string are the same is then determined based on the similarity. Because the lengths are adapted based on the edit distance, the length difference between the first character string and the second character string is reduced. When the similarity is calculated from the adapted lengths, the influence of the length difference on the similarity result is reduced, which improves the accuracy of the similarity calculation and, in turn, the accuracy of the determination of whether the two character strings are the same.
Detailed Description
In the related art, whether two character strings (such as detailed addresses) are the same is usually determined by calculating the similarity between the two character strings and then judging on that similarity.
The similarity between two character strings is generally calculated in one of the following ways:
In one approach, word segmentation is performed on the two character strings to be compared, the two character strings are converted into structured data, and the similarity is calculated based on the structured data; for example, a segment length may be set, the two character strings are split according to that length into text segmentation units of equal length, and the similarity is calculated by comparing the text segmentation units one by one.
However, this approach requires segmenting the character strings and comparing the resulting text segmentation units one by one, so the implementation is cumbersome.
In another approach, the similarity of the two character strings to be compared is calculated based on the edit distance between them, and whether the two character strings are the same is then determined based on the calculated similarity.
When the similarity is calculated based on the edit distance, the similarity is generally defined by the following formula:
$$S = 1 - \frac{ds}{L}$$
In the above formula, S represents the similarity; ds represents the edit distance (Levenshtein distance); L represents a character string length.
When the similarity between the first character string x and the second character string y is calculated by the above formula, the value of L may be any one of min(|x|, |y|), max(|x|, |y|), or |x| + |y|, according to actual requirements.
Here |x| represents the character length of the first character string; |y| represents the character length of the second character string; min(|x|, |y|) represents the smaller of the two character lengths; max(|x|, |y|) represents the larger of the two character lengths.
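The related-art definition can be sketched as follows. This is a minimal illustration, taking the definition to be S = 1 - ds/L as stated later in this description; the function name `baseline_similarity` and the `norm` parameter are hypothetical names chosen for the example.

```python
def baseline_similarity(ds: int, len_x: int, len_y: int, norm: str = "max") -> float:
    """Related-art similarity S = 1 - ds / L for a chosen normaliser L.

    norm selects L: "min" -> min(|x|, |y|), "max" -> max(|x|, |y|),
    "sum" -> |x| + |y|. The names are illustrative, not from the source.
    """
    choices = {
        "min": min(len_x, len_y),
        "max": max(len_x, len_y),
        "sum": len_x + len_y,
    }
    L = choices[norm]
    if L == 0:
        # Assumption: the definition is only applied when L is positive.
        raise ValueError("normaliser L must be positive")
    return 1.0 - ds / L


# Equal-length strings of length 4 that differ in one character.
print(baseline_similarity(ds=1, len_x=4, len_y=4, norm="max"))  # 0.75
```

Because ds is divided by a single length, a pure length difference between two otherwise matching strings lowers S for every choice of L.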
However, for character strings such as addresses, when judging whether two character strings express the same meaning, the above similarity formula sometimes indicates that the two character strings are not similar although they in fact represent the same address.
For example, the detailed addresses of the same user collected on different platforms may differ in character length (the difference may be caused by the user entering the address in a non-standard form on different platforms). Suppose that the first address is 'Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou' and the second address is 'Ant Financial department, Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou'. Although their character lengths differ, the first address and the second address are substantially the same address.
When the similarity between the first address and the second address is calculated by the above formula and the judgment is made on that similarity, the first address and the second address are likely to be misjudged as different addresses.
In view of this, the present application provides a method for determining identical character strings, which includes calculating the edit distance between a first character string and a second character string, adapting the lengths of the first character string and the second character string based on the edit distance, and calculating the similarity based on the adapted lengths; whether the first character string and the second character string are the same is then determined based on the similarity. Because the lengths are adapted based on the edit distance, the length difference between the first character string and the second character string is reduced. When the similarity is calculated from the adapted lengths, the influence of the length difference on the similarity result is reduced, which improves the accuracy of the similarity calculation and, in turn, the accuracy of the determination of whether the two character strings are the same.
The present application is described below with reference to specific embodiments and specific application scenarios.
Referring to FIG. 1, FIG. 1 shows a method for determining identical character strings according to an embodiment of the present application. The method is applied to a server and includes the following steps:
Step 101, calculating the edit distance between a first character string and a second character string;
The server may comprise a single server, a server cluster, or a cloud platform built on a server cluster. For example, in an e-commerce application scenario, the server may be a cloud platform of an e-commerce provider, and the cloud platform may assist a merchant in comparing a work address or home address uploaded by a user with the real recipient address the user has reserved on the platform, to determine whether the uploaded address is a real and valid address of the user and so avoid fraud based on false address information.
The character string may include a detailed address of the user; the first character string and the second character string may be the same detailed address with different lengths.
For example, the first character string and the second character string may be detailed addresses reserved by the user in different platforms, and due to differences in input formats in different platforms, when the user inputs the same detailed address in different platforms, there may be a certain difference in length.
On an e-commerce platform, for instance, the detailed address reserved on the platform by a user may be 'Ant Financial department, Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou', while the detailed address the user provides to the merchant is 'Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou'. Although the lengths of the two addresses differ, they are substantially the same address.
The edit distance characterizes the minimum number of editing operations needed to convert one character string into another. The editing operations on a character string generally include adding, deleting, replacing, and transposing characters.
When one character string is converted into another by adding a character, deleting a character, replacing a character, transposing characters, and so on, the edit distance between the two character strings is obtained by counting these editing operations. For example, assuming that the first character string is ABCD and the second character string is AFCDE, the first character string can be converted into the second by replacing the character B with the character F and adding one character E. Two editing operations (one replacement and one addition) are performed in the whole process, so the edit distance between the first character string and the second character string is 2.
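The counting described above can be sketched with the standard dynamic-programming formulation of the Levenshtein distance. The source leaves the concrete algorithm to the related art, so this is one common choice and not necessarily the one the applicant used; the function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character additions, deletions, and replacements."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # add cb
                prev[j - 1] + cost,  # replace ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("ABCD", "AFCDE"))  # 2: replace B with F, add E
```

Python strings are sequences of Unicode code points, so each Chinese character, letter, or digit counts as one position in this sketch.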
In this example, when the server calculates the edit distance between the first character string and the second character string, the server may count the number of edits when the first character string is converted into the second character string, and then use the counted number of edits as the edit distance between the first character string and the second character string.
In implementation, the edit distance may be the standard Levenshtein distance or the Damerau-Levenshtein distance.
The standard Levenshtein distance counts only additions, deletions, and replacements. When the server adopts the standard Levenshtein distance, it counts the number of edits needed to convert the first character string into the second character string by adding, deleting, and replacing characters, and sets that number as the edit distance between the first character string and the second character string.
The Damerau-Levenshtein distance additionally counts transpositions. When the server adopts the Damerau-Levenshtein distance, it counts the number of edits needed to convert the first character string into the second character string by adding, deleting, replacing, and transposing characters, and sets that number as the edit distance between the first character string and the second character string.
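To illustrate the effect of counting transpositions, the sketch below implements the optimal string alignment (OSA) distance, a commonly used restricted variant of the Damerau-Levenshtein distance in which no substring is edited more than once. It is swapped in here for brevity; the source does not say which variant is intended, and the function name is illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance over add, delete, replace, and adjacent transposition (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # add all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # add
                d[i - 1][j - 1] + cost,  # replace or match
            )
            # Adjacent transposition: "xy" in a lines up with "yx" in b.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ABCD", "ACBD"))  # 1: one transposition of B and C
```

Under the standard Levenshtein distance the same pair costs 2 (two replacements), so the choice of distance changes the value of ds fed into the similarity calculation.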
It should be noted that, in practical applications, the server may count the number of edits needed to convert the first character string into the second character string by means of preset code or an existing algorithm. The details are not described in this application; a person skilled in the art may refer to the related art when putting the technical solution disclosed here into practice.
In addition, the first character string and the second character string may contain Chinese characters, letters, digits, and other characters, and these characters may occupy different numbers of bytes when processed on the platform; for example, a Chinese character may occupy two bytes, while a letter or a digit usually occupies one byte. To keep the different byte widths of the characters in a character string from affecting the calculation result, the server may perform Unicode encoding on the first character string and the second character string and then calculate the edit distance on the Unicode-encoded character strings. Unicode is an industry-wide unified encoding scheme that assigns a unique code to each Chinese character, digit, and letter, so it meets the requirements of cross-language and cross-platform text conversion and processing.
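The byte-width issue can be seen in a small sketch. The sample text is hypothetical, and GBK is used only as an example of an encoding in which a Chinese character occupies two bytes; the source does not name a specific byte encoding.

```python
text = "杭州A1"  # hypothetical sample: two Chinese characters, one letter, one digit

# Byte lengths depend on the encoding and mix one-byte and multi-byte characters.
print(len(text.encode("gbk")))    # 6: two bytes per Chinese character, one per ASCII character
print(len(text.encode("utf-8")))  # 8: three bytes per Chinese character, one per ASCII character

# Unicode code points give exactly one unit per character, whatever its byte width.
code_points = [ord(ch) for ch in text]
print(len(code_points))           # 4
```

Computing the edit distance over code points, as in the last line, means that replacing one Chinese character counts as a single edit instead of two or three byte-level edits.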
Step 102, adapting the lengths of the first character string and the second character string based on the editing distance, and calculating similarity based on the lengths of the first character string and the second character string after adaptation;
in this example, after the server calculates the edit distance between the first character string and the second character string, the server may adapt the lengths of the first character string and the second character string according to the edit distance to reduce the length difference between the first character string and the second character string, so that the influence of the length difference on the calculation result may be reduced to the greatest extent when the similarity between the first character string and the second character string is calculated through the edit distance.
In one embodiment shown, when the server adapts the lengths of the first character string and the second character string based on the calculated edit distance, the server may calculate a maximum value and a minimum value of the lengths of the first character string and the second character string, and then the server may subtract the calculated edit distance from the maximum value or add the calculated edit distance to the minimum value to reduce the length difference between the first character string and the second character string, thereby achieving the purpose of adapting the lengths of the first character string and the second character string.
For example, assume that the first character string is ABCD and the second character string is AFCDEG, so that the length of the first character string is 4, the length of the second character string is 6, and the edit distance between them is 3 (one replacement and two additions). When the server adapts the lengths, it may add the edit distance 3 to the length 4 of the first character string; the adapted length of the first character string is then 7, and its difference from the length 6 of the second character string is reduced from 2 to 1. Alternatively, the server may subtract the edit distance 3 from the length 6 of the second character string; the adapted length of the second character string is then 3, and its difference from the length 4 of the first character string is likewise reduced from 2 to 1.
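The two adaptation options in this example can be written out as a short sketch. The function name and the dictionary keys are illustrative, not taken from the source.

```python
def adapt_lengths(len_x: int, len_y: int, ds: int) -> dict:
    """Return (max side, min side) length pairs after each adaptation option."""
    longer, shorter = max(len_x, len_y), min(len_x, len_y)
    return {
        "subtract_from_max": (longer - ds, shorter),  # max(|x|, |y|) - ds
        "add_to_min": (longer, shorter + ds),         # min(|x|, |y|) + ds
    }


# ABCD (length 4) and AFCDEG (length 6) with edit distance 3.
adapted = adapt_lengths(len("ABCD"), len("AFCDEG"), ds=3)
for option, (max_side, min_side) in adapted.items():
    print(option, max_side, min_side, "gap:", abs(max_side - min_side))
# subtract_from_max 3 4 gap: 1
# add_to_min 6 7 gap: 1
```

In both cases the gap between the two lengths drops from 2 to 1.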
Of course, in practical applications, the server may also adapt the lengths of the first character string and the second character string based on the edit distance in ways other than subtracting the edit distance from the maximum value of the two lengths or adding the edit distance to the minimum value of the two lengths; such alternatives are not described in detail in this embodiment.
In this example, after the server performs adaptation on the length of the first character string and the length of the second character string, the server may calculate the similarity based on the adapted lengths of the first character string and the second character string.
In one embodiment, after the lengths of the first character string and the second character string are adapted, the server may calculate the ratio between the smaller and the larger of the two adapted lengths. This ratio is a value between 0 and 1, so the server may use it to characterize the similarity of the first character string and the second character string.
On one hand, if the server adapts the lengths by subtracting the edit distance from the maximum value of the lengths of the first character string and the second character string, the server may calculate the ratio of the maximum value minus the edit distance to the minimum value, and then characterize the similarity of the first character string and the second character string by that ratio.
On the other hand, if the server adapts the lengths by adding the edit distance to the minimum value of the lengths of the first character string and the second character string, the server may calculate the ratio of the maximum value to the minimum value plus the edit distance, and then characterize the similarity of the first character string and the second character string by that ratio.
Based on this, it is assumed that the first character string is x, the second character string is y, the length of the first character string x is | x |, the length of the second character string is | y |, and the edit distance between the first character string and the second character string is ds.
If the server adapts | x | and | y | by subtracting ds from the maximum value of | x | and | y |, the server can calculate the similarity between the first string x and the second string y by the following equation 1:
if the server adapts | x | and | y | by adding ds to the minimum value of | x | and | y |, the server can calculate the similarity between the first string x and the second string y by the following equation 2:
in the above two formulas, S represents the similarity between the first character string x and the second character string y. max (| x |, | y |) represents the maximum value of the lengths of the first string and the second string; min (| x |, | y |) represents the minimum of the lengths of the first and second strings. C represents a correction parameter introduced in the formula, and the correction parameter may be a constant greater than or equal to 0 (i.e. the formula may introduce a C value or may not introduce a C value), and the calculation result of the formula may be corrected by introducing the correction parameter in the formula.
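The two formulas themselves do not appear in this text. With C = 0 their form follows directly from the ratios described above and agrees with the similarity value of 1 obtained in the address example later in this description. Where C enters is an assumption made here: it is placed in the denominator as an additive smoothing term, which is one natural reading but is not confirmed by the text. Under that assumption, Formula 1 and Formula 2 would read:

```latex
% Formula 1 (edit distance subtracted from the maximum length); position of C assumed
S = \frac{\max(|x|,|y|) - ds}{\min(|x|,|y|) + C}

% Formula 2 (edit distance added to the minimum length); position of C assumed
S = \frac{\max(|x|,|y|)}{\min(|x|,|y|) + ds + C}
```

Since ds is never smaller than the length difference max(|x|, |y|) - min(|x|, |y|) and never larger than max(|x|, |y|), both ratios stay between 0 and 1 for any C greater than or equal to 0.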
The specific value of the correction parameter may be an engineering experience value set by a user according to actual requirements, and is not particularly limited in this application; for example, in implementation, the correction parameter may be a smoothing parameter obtained by the user with a smoothing method, and introducing it into the formula corrects the calculation result and reduces its error. In the above formulas, when the first character string x and the second character string y have the same length, that is, |x| = |y|, max(|x|, |y|) has the same value as min(|x|, |y|). In this case the correction parameter C may take the value 0 (the lengths are the same and no correction is needed), and Formula 1 can be converted into S = 1 - ds/min(|x|, |y|) or S = 1 - ds/max(|x|, |y|).
Because max(|x|, |y|) and min(|x|, |y|) then both equal the common length of the first character string x and the second character string y, Formula 1 can also be written as S = 1 - ds/L, where L represents that common length.
It can be seen that the similarity calculation formula described in the above embodiment, in the case where the lengths of the first character string x and the second character string y are the same, conforms to the definition of the similarity in calculating the similarity based on the edit distance in the related art.
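The reduction to the related-art definition can be checked numerically. In the sketch below the function names are illustrative, and the placement of the correction parameter `c` in the denominator is an assumption, since the formulas are not reproduced in this text; with `c = 0` the functions are exactly the ratios described above.

```python
def similarity_subtract(len_x: int, len_y: int, ds: int, c: float = 0.0) -> float:
    """(max - ds) / (min + c); the position of c is an assumption."""
    return (max(len_x, len_y) - ds) / (min(len_x, len_y) + c)


def similarity_add(len_x: int, len_y: int, ds: int, c: float = 0.0) -> float:
    """max / (min + ds + c); the position of c is an assumption."""
    return max(len_x, len_y) / (min(len_x, len_y) + ds + c)


# Equal lengths (|x| = |y| = 10, ds = 3, c = 0): the subtract form equals 1 - ds / L.
L, ds = 10, 3
print(similarity_subtract(L, L, ds), 1 - ds / L)

# Unequal lengths from the ABCD / AFCDEG example (4, 6, ds = 3).
print(similarity_subtract(4, 6, 3))  # 0.75
print(similarity_add(4, 6, 3))
```

Both functions assume non-empty strings; with `c = 0` an empty shorter string would make the first denominator zero.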
Step 103, determining whether the first character string is the same as the second character string based on the similarity.
In this example, after the server calculates the similarity between the first character string and the second character string, the server may compare the calculated similarity with a preset similarity threshold to determine whether the calculated similarity reaches the similarity threshold. If the calculated similarity value reaches the similarity threshold value, the server side can judge that the first character string is the same as the second character string. On the contrary, if the calculated similarity value is smaller than the similarity threshold, the server may determine that the first character string is different from the second character string.
It should be noted that the similarity threshold may be set by a user according to actual requirements; for example, in implementation, the similarity threshold may be an engineering experience value, and an engineer may manually determine whether a large number of character strings are the same, and then analyze the result of the manual determination to set the similarity threshold; or the result of the manual judgment can be used as a data analysis sample, and the server side performs statistical analysis to set the similarity threshold.
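The threshold comparison, and one simple way a threshold might be derived from manually labelled samples, can be sketched as follows. The default of 0.85 is the value used in the example later in this description; `pick_threshold` and its accuracy criterion are hypothetical, since the source only says the threshold may be set by statistical analysis of manual judgments.

```python
def is_same(similarity: float, threshold: float = 0.85) -> bool:
    """The strings are judged the same when the similarity reaches the threshold."""
    return similarity >= threshold


def pick_threshold(samples):
    """Pick the candidate threshold with the best accuracy on labelled samples.

    samples: list of (similarity, manually_judged_same) pairs.
    Ties are resolved in favour of the lowest candidate.
    """
    candidates = sorted({s for s, _ in samples})

    def accuracy(t):
        return sum((s >= t) == label for s, label in samples) / len(samples)

    return max(candidates, key=accuracy)


labelled = [(0.6, False), (0.7, False), (0.9, True), (1.0, True)]  # made-up sample data
print(pick_threshold(labelled))  # 0.9
print(is_same(1.0), is_same(0.829))  # True False
```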
The technical solutions in the above embodiments are described in detail below by specific examples and in combination with application scenarios.
In this example, it is assumed that the character string is a detailed address of a user, and the server is a cloud platform of an e-commerce provider; such as a treasure house platform.
The cloud platform can assist a merchant in comparing a detailed address uploaded by a user with a real recipient address reserved in the platform by the user to determine whether the address uploaded by the user is a real and effective address of the user, so that fraud caused by uploading false address information by the user is avoided.
Suppose that the first address, uploaded to the merchant by the user, is 'Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou', and the second address, reserved by the user on the cloud platform, is 'Ant Financial department, Block B, 17th Floor, Huanglong Times Square, West Lake District, Hangzhou'. In the original Chinese text, the character length of the first address is 17 (each Chinese character, letter, and digit counts as one character), and the character length of the second address is 24.
The first address can be converted into the second address by adding 7 Chinese characters (the 'Ant Financial department' part), so the edit distance between the first address and the second address calculated by the server is 7.
In existing implementations, the similarity of the first address x and the second address y can be calculated by the following formula:
$$S = 1 - \frac{ds}{L}$$
In the above formula, S represents the similarity; ds represents the edit distance; L represents a character string length. The value of L can be min(|x|, |y|), max(|x|, |y|), or |x| + |y|.
When the value of L is min(|x|, |y|):
$$S = 1 - \frac{7}{17} \approx 0.588$$
When the value of L is max(|x|, |y|):
$$S = 1 - \frac{7}{24} \approx 0.708$$
When the value of L is |x| + |y|:
$$S = 1 - \frac{7}{17 + 24} \approx 0.829$$
in this example, assuming that the preset similarity threshold of the cloud platform is 0.85, the results of the similarities above calculated based on the similarity calculation formula provided in the prior art are all smaller than the similarity threshold.
In this case, when the cloud platform determines whether the first address and the second address are the same address based on the similarity threshold, it may misjudge them as different addresses, although the first address and the second address are in fact the same address and differ only in length.
In this example, if the cloud platform adapts the character lengths of the first address and the second address by the editing distance of the first address and the second address, and calculates the similarity value based on the adapted length, the influence of the length difference of the character strings on the similarity calculation result can be significantly reduced, so that when determining whether the first address and the second address are the same address based on the calculated similarity value, the accuracy of the determination result can be improved, and the occurrence of misjudgment can be avoided.
On the one hand, assuming that the cloud platform adapts the lengths of the first address and the second address by subtracting the edit distance from the maximum value of the two lengths, the cloud platform may calculate the similarity between the first address and the second address by the following formula (taking C = 0 as an example):
S = (max(|x|, |y|) - ds) / min(|x|, |y|) = 1
On the other hand, assuming that the cloud platform adapts the lengths of the first address and the second address by adding the edit distance to the minimum value of the two lengths, the cloud platform may calculate the similarity between the first address and the second address by the following formula:
S = max(|x|, |y|) / (min(|x|, |y|) + ds) = 1
Therefore, after the cloud platform adapts the lengths of the first address and the second address, the calculated similarity is 1, and the accuracy of the similarity is significantly improved.
In this case, the similarity is greater than the similarity threshold of 0.85, so when the cloud platform determines whether the first address and the second address are the same address based on the similarity threshold, it determines them to be the same address, thereby avoiding misjudgment.
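The two length adaptations can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the correction parameter C is assumed to be added to the denominator, the strings are assumed to be non-empty, the lengths 10 and 17 are hypothetical, and the names `adapted_similarity` and `variant` are introduced here for illustration only.

```python
def adapted_similarity(len_x: int, len_y: int, ds: int,
                       c: float = 0.0, variant: str = "subtract") -> float:
    """Similarity computed from lengths adapted by the edit distance.

    Assumes non-empty strings and that the correction parameter c (>= 0)
    sits in the denominator; both are assumptions of this sketch.
    """
    longest, shortest = max(len_x, len_y), min(len_x, len_y)
    if variant == "subtract":
        # Edit distance subtracted from the maximum length.
        return (longest - ds) / (shortest + c)
    if variant == "add":
        # Edit distance added to the minimum length.
        return longest / (shortest + ds + c)
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical lengths: 10 characters versus the same string plus 7 characters.
s_subtract = adapted_similarity(10, 17, 7, variant="subtract")  # (17 - 7) / 10
s_add = adapted_similarity(10, 17, 7, variant="add")            # 17 / (10 + 7)
```

With C = 0 and a pure insertion of 7 characters, both variants evaluate to 1.0, which exceeds the 0.85 threshold of the example. A positive C lowers the result, making the judgment stricter.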
In the above embodiment, the edit distance between a first character string and a second character string is calculated, the lengths of the first character string and the second character string are adapted based on the edit distance, and the similarity is calculated based on the adapted lengths; whether the first character string and the second character string are the same is then determined based on the similarity.
Because the lengths of the first character string and the second character string are adapted based on the edit distance in the present application, the length difference between the two character strings is reduced. When the similarity is calculated for the length-adapted character strings, the influence of the length difference on the similarity calculation result is minimized and the accuracy of the similarity calculation is improved. Determining whether the first character string and the second character string are the same based on this similarity therefore significantly improves the accuracy of the determination result.
Corresponding to the method embodiment, the application also provides an embodiment of the device.
Referring to fig. 2, the present application provides a device 20 for determining the same character string, which is applied to a server. Referring to fig. 3, the hardware architecture of the server that hosts the device 20 generally includes a CPU, a memory, a non-volatile memory, a network interface, an internal bus, and the like. Taking a software implementation as an example, the device 20 for determining the same character string can generally be understood as a computer program loaded in the memory, which, after being run by the CPU, forms a logic device combining software and hardware. The device 20 includes:
a first calculating module 201, configured to calculate an edit distance between the first character string and the second character string;
a second calculating module 202, configured to adapt lengths of the first character string and the second character string based on the edit distance, and calculate a similarity based on the adapted lengths of the first character string and the second character string;
a determining module 203, configured to determine whether the first character string and the second character string are the same based on the similarity.
In this example, the first calculating module 201 is specifically configured to:
performing Unicode encoding on the first character string and the second character string;
and calculating the edit distance between the first character string and the second character string after Unicode encoding.
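The two steps of the first calculating module can be sketched as below. The sketch assumes that the edit distance is the Levenshtein distance (insertions, deletions, and substitutions each cost 1) and that any byte input is UTF-8; `to_unicode` and `edit_distance` are illustrative names. In Python, a `str` is already a sequence of Unicode code points, so each Chinese character counts as one unit rather than several bytes.

```python
def to_unicode(text) -> str:
    """Return the input as a str, i.e. a sequence of Unicode code points."""
    if isinstance(text, bytes):
        return text.decode("utf-8")  # assumption: byte input is UTF-8
    return text


def edit_distance(first, second) -> int:
    """Levenshtein distance counted in Unicode code points."""
    a, b = to_unicode(first), to_unicode(second)
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

For example, `edit_distance("西湖区文三路", "杭州市西湖区文三路")` counts the three prepended characters as three insertions, whereas a byte-level comparison of the UTF-8 encodings would count nine.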
In this example, the second calculating module 202 is specifically configured to:
calculating the maximum value and the minimum value in the lengths of the first character string and the second character string;
subtracting the edit distance from the maximum value, or adding the edit distance to the minimum value, to adapt the lengths of the first and second character strings.
In this example, the second calculating module 202 is further configured to:
calculating the ratio of the maximum value, after the edit distance is subtracted from it, to the minimum value, and representing the similarity of the first character string and the second character string based on the ratio; or
calculating the ratio of the maximum value to the minimum value after the edit distance is added to it, and representing the similarity of the first character string and the second character string based on the ratio.
In this example, the second calculation module 202 is further configured to:
calculating the similarity of the first character string and the second character string based on a preset similarity calculation formula;
the similarity calculation formula includes:
where x represents the first character string and |x| represents the length of the first character string; y represents the second character string and |y| represents the length of the second character string; max(|x|, |y|) represents the maximum value of the lengths of the first character string and the second character string; min(|x|, |y|) represents the minimum value of the lengths of the first character string and the second character string; ds represents the edit distance between the first character string and the second character string; C represents a correction parameter and is a constant greater than or equal to 0.
In this example, the determining module 203 is specifically configured to:
judging whether the calculated similarity reaches a preset threshold value or not;
when the calculated similarity reaches a preset threshold value, judging that the first character string is the same as the second character string;
and when the calculated similarity does not reach a preset threshold value, judging that the first character string is different from the second character string.
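The determination performed by the determining module 203 reduces to a threshold comparison, sketched below. The sketch assumes that "reaches" means greater than or equal to, and it uses the 0.85 threshold of the address example as a default; `is_same_string` is an illustrative name.

```python
def is_same_string(similarity: float, threshold: float = 0.85) -> bool:
    """Judge two strings the same when the similarity reaches the threshold.

    "Reaches" is read as >= here, which is an assumption of this sketch.
    """
    return similarity >= threshold


adapted_result = is_same_string(1.0)   # adapted similarity from the example
boundary_result = is_same_string(0.85) # exactly at the threshold
low_result = is_same_string(0.3)       # a similarity well below the threshold
```

Whether the boundary value itself counts as "the same" is a design choice; a strict comparison would treat a similarity equal to the threshold as different.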
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.