CN106997335A - Method and device for determining identical character strings - Google Patents

Publication number
CN106997335A
Authority
CN
China
Prior art keywords
character string
similarity
length
edit distance
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610052823.8A
Other languages
Chinese (zh)
Other versions
CN106997335B (en)
Inventor
赵科科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201610052823.8A
Publication of CN106997335A
Application granted
Publication of CN106997335B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and device for determining identical character strings. The method includes: calculating the edit distance between a first character string and a second character string; adapting the lengths of the first character string and the second character string based on the edit distance, and calculating a similarity based on the adapted lengths of the first character string and the second character string; and determining, based on the similarity, whether the first character string and the second character string are identical. The application can improve the accuracy of the similarity calculation and, in turn, the accuracy of the determination of whether the first character string and the second character string are identical.

Description

Method and device for determining identical character strings
Technical field
The present application relates to the field of communications, and in particular to a method and device for determining identical character strings.
Background
On e-commerce platforms, to prevent fraud caused by users uploading false address information, the platform usually helps merchants verify the work address or home address uploaded by a user, to determine whether the uploaded address is the user's real and valid address. For example, the platform can compare the address the user provided to the merchant with the real delivery address the user has on file with the platform, to determine whether the two are the same address. In practice, however, the address provided to the merchant and the delivery address on file with the platform may differ somewhat in length while still being, in fact, the same address. In such cases, the platform may make a wrong determination when judging whether the two addresses are the same.
Summary of the invention
The present application proposes a method for determining identical character strings, the method including:
calculating the edit distance between a first character string and a second character string; adapting the lengths of the first character string and the second character string based on the edit distance, and calculating a similarity based on the adapted lengths of the first character string and the second character string; and
determining, based on the similarity, whether the first character string and the second character string are identical.
Optionally, calculating the edit distance between the first character string and the second character string includes:
performing Unicode encoding on the first character string and the second character string; and
calculating the edit distance between the Unicode-encoded first character string and second character string.
Optionally, adapting the lengths of the first character string and the second character string based on the edit distance includes:
determining the maximum and the minimum of the lengths of the first character string and the second character string; and
subtracting the edit distance from the maximum, or adding the edit distance to the minimum, so as to adapt the lengths of the first character string and the second character string.
Optionally, calculating the similarity of the length-adapted first character string and second character string includes:
calculating the ratio of the maximum, after the edit distance has been subtracted from it, to the minimum, and characterizing the similarity of the first character string and the second character string by that ratio; or
calculating the ratio of the maximum to the minimum after the edit distance has been added to it, and characterizing the similarity of the first character string and the second character string by that ratio.
Optionally, adapting the lengths of the first character string and the second character string based on the edit distance, and calculating the similarity based on the adapted lengths of the first character string and the second character string, includes:
calculating the similarity of the first character string and the second character string based on a preset similarity formula.
The similarity formula includes Formula 1, which takes the ratio of max(|x|, |y|) - ds to min(|x|, |y|), or Formula 2, which takes the ratio of max(|x|, |y|) to min(|x|, |y|) + ds, in either case together with the correction parameter C (the formula images themselves are not reproduced in this text).
Here, x denotes the first character string and |x| its length; y denotes the second character string and |y| its length; max(|x|, |y|) denotes the maximum of the lengths of the first character string and the second character string; min(|x|, |y|) denotes the minimum of those lengths; ds denotes the edit distance between the first character string and the second character string; and C denotes a correction parameter, which is a constant greater than or equal to 0.
Optionally, determining, based on the similarity, whether the first character string and the second character string are identical includes:
judging whether the calculated similarity reaches a preset threshold;
when the calculated similarity reaches the preset threshold, determining that the first character string and the second character string are identical; and
when the calculated similarity does not reach the preset threshold, determining that the first character string and the second character string are not identical.
The present application also proposes a device for determining identical character strings, the device including:
a first calculation module, configured to calculate the edit distance between a first character string and a second character string; a second calculation module, configured to adapt the lengths of the first character string and the second character string based on the edit distance, and to calculate a similarity based on the adapted lengths of the first character string and the second character string; and
a determination module, configured to determine, based on the similarity, whether the first character string and the second character string are identical.
Optionally, the first calculation module is specifically configured to:
perform Unicode encoding on the first character string and the second character string; and
calculate the edit distance between the Unicode-encoded first character string and second character string.
Optionally, the second calculation module is specifically configured to:
determine the maximum and the minimum of the lengths of the first character string and the second character string; and
subtract the edit distance from the maximum, or add the edit distance to the minimum, so as to adapt the lengths of the first character string and the second character string.
Optionally, the second calculation module is further configured to:
calculate the ratio of the maximum, after the edit distance has been subtracted from it, to the minimum, and characterize the similarity of the first character string and the second character string by that ratio; or
calculate the ratio of the maximum to the minimum after the edit distance has been added to it, and characterize the similarity of the first character string and the second character string by that ratio.
Optionally, the second calculation module is further configured to:
calculate the similarity of the first character string and the second character string based on a preset similarity formula.
The similarity formula includes Formula 1, which takes the ratio of max(|x|, |y|) - ds to min(|x|, |y|), or Formula 2, which takes the ratio of max(|x|, |y|) to min(|x|, |y|) + ds, in either case together with the correction parameter C (the formula images themselves are not reproduced in this text).
Here, x denotes the first character string and |x| its length; y denotes the second character string and |y| its length; max(|x|, |y|) denotes the maximum of the lengths of the first character string and the second character string; min(|x|, |y|) denotes the minimum of those lengths; ds denotes the edit distance between the first character string and the second character string; and C denotes a correction parameter, which is a constant greater than or equal to 0.
Optionally, the determination module is specifically configured to:
judge whether the calculated similarity reaches a preset threshold;
when the calculated similarity reaches the preset threshold, determine that the first character string and the second character string are identical; and
when the calculated similarity does not reach the preset threshold, determine that the first character string and the second character string are not identical.
In the present application, the edit distance between the first character string and the second character string is calculated; the lengths of the first character string and the second character string are adapted based on the edit distance; a similarity is calculated based on the adapted lengths; and the similarity is then used to determine whether the first character string and the second character string are identical. Because the lengths of the two character strings are adapted based on the edit distance, the difference between their lengths is reduced. When the similarity is calculated from the adapted lengths, the influence of the length difference on the result is therefore minimized and the similarity calculation becomes more accurate, which in turn improves the accuracy of the determination of whether the first character string and the second character string are identical.
Brief description of the drawings
Fig. 1 is a flowchart of a method for determining identical character strings provided by an embodiment of the present application;
Fig. 2 is a logic block diagram of a device for determining identical character strings provided by an embodiment of the present application;
Fig. 3 is a hardware structure diagram of a server carrying the device for determining identical character strings provided by an embodiment of the present application.
Detailed description
In the related art, when judging whether two character strings (for example, delivery addresses) are identical, the similarity between the two character strings is typically calculated, and the similarity is then used to judge whether the two are identical.
The similarity between two character strings can generally be calculated in the following ways.
In one implementation, word segmentation can be applied to the two character strings to be compared, converting them into structured data, and the similarity is then calculated from the structured data. For example, a segment length can be set, and both character strings are split according to that length into text units of equal length; the resulting units are then compared one by one to calculate the similarity of the two character strings.
However, this approach requires splitting the character strings and comparing the resulting text units one by one when calculating the similarity, so it is relatively complex to implement.
In another implementation, the similarity of the two character strings can be calculated from the edit distance between them, and the calculated similarity is then used to judge whether the two character strings are identical.
When the similarity is calculated from the edit distance, it is generally defined by the following formula:
S = 1 - ds/L
In this formula, S denotes the similarity, ds denotes the edit distance (Levenshtein distance), and L denotes the character string length.
When the similarity of a first character string x and a second character string y is calculated with this formula, L can be any one of min(|x|, |y|), max(|x|, |y|) or |x| + |y|, depending on actual requirements.
Here, |x| denotes the character length of the first character string; |y| denotes the character length of the second character string; min(|x|, |y|) denotes the smaller of the two character lengths; and max(|x|, |y|) denotes the larger of the two character lengths.
However, for character strings such as addresses, when judging whether two character strings express the same meaning, the similarity formula S = 1 - ds/L sometimes indicates that the two character strings are dissimilar even though they actually denote the same address.
For example, delivery addresses of the same user collected on different platforms may differ somewhat in character length (the difference may be caused by the user entering the address inconsistently on different platforms). Suppose the first address is "Hangzhou, Xihu District, Huanglong Times Square, Building B, 17th Floor" and the second address is "Hangzhou, Xihu District, Huanglong Times Square, Building B, 17th Floor, Ant Financial Legal Department". Although the two addresses differ in character length, they are essentially the same address.
When the similarity of the first address and the second address is calculated with the above formula and the result is used to judge whether they are the same address, a misjudgment is possible: the first address and the second address may be wrongly judged to be different addresses.
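To make the misjudgment concrete, the following minimal sketch evaluates the prior-art formula S = 1 - ds/L for the three choices of L named above. The function name `baseline_similarity` and the lengths used are hypothetical; the sketch assumes the longer address merely appends extra characters to the shorter one, so that the edit distance equals the length difference.

```python
def baseline_similarity(len_x: int, len_y: int, ds: int, norm: str = "max") -> float:
    """Prior-art similarity S = 1 - ds/L for a chosen normalising length L."""
    if norm == "min":
        L = min(len_x, len_y)
    elif norm == "max":
        L = max(len_x, len_y)
    elif norm == "sum":
        L = len_x + len_y
    else:
        raise ValueError(f"unknown norm: {norm}")
    return 1 - ds / L


# Hypothetical lengths: a 10-character address and a 20-character address
# that only appends 10 characters of extra detail, so ds = 10.
for norm in ("min", "max", "sum"):
    print(norm, round(baseline_similarity(10, 20, 10, norm), 3))
```

With these assumed lengths the score is 0.0, 0.5 or about 0.667 depending on L, so a threshold chosen for near-identical strings would likely reject the pair even though both strings name the same place.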
In view of this, the present application proposes a method for determining identical character strings: the edit distance between the first character string and the second character string is calculated; the lengths of the two character strings are adapted based on the edit distance; a similarity is calculated based on the adapted lengths; and the similarity is then used to determine whether the first character string and the second character string are identical. Because the lengths are adapted based on the edit distance, the difference between the lengths of the two character strings is reduced, which minimizes the influence of the length difference on the similarity calculation, makes the similarity more accurate, and thereby improves the accuracy of the determination.
The present application is described below through specific embodiments in conjunction with specific application scenarios.
Referring to Fig. 1, Fig. 1 shows a method for determining identical character strings provided by an embodiment of the present application. The method is applied to a server and includes the following steps:
Step 101: calculate the edit distance between the first character string and the second character string.
The server may include a single server, a server cluster, or a cloud platform built on a server cluster. Taking an e-commerce scenario as an example, the server may be the cloud platform of an e-commerce provider. The cloud platform can help merchants compare the work address or home address uploaded by a user with the real delivery address the user has on file with the platform, to determine whether the uploaded address is the user's real and valid address, thereby preventing fraud caused by users uploading false address information.
The character strings may include a user's delivery address; the first character string and the second character string may be the same delivery address written with different lengths.
For example, the first character string and the second character string may be delivery addresses that the user has left on different platforms. Because input formats differ between platforms, the same delivery address entered on different platforms may differ somewhat in length.
Suppose that, on an e-commerce platform, the delivery address the user has on file with the platform is "Hangzhou, Xihu District, Huanglong Times Square, Building B, 17th Floor, Ant Financial Legal Department", and the delivery address the user provided to the merchant is "Hangzhou, Xihu District, Huanglong Times Square, Building B, 17th Floor". Although the two addresses differ in length, they are essentially the same address.
The edit distance characterizes the minimum number of edit operations needed to convert one character string into another. Edit operations on a character string generally include insertion, deletion, substitution and transposition.
When one character string is converted into another by inserting a character, deleting a character, substituting a character or transposing characters, the edit distance between the two character strings is obtained by counting these edit operations. For example, suppose the first character string is ABCD and the second character string is AFCDE. The first character string can be converted into the second by substituting F for B and appending E. The whole process performs one substitution and one insertion, so the edit distance between the first character string and the second character string is 2.
In this example, when calculating the edit distance between the first character string and the second character string, the server can count the number of edits needed to convert the first character string into the second, and use that count as the edit distance.
In implementation, the edit distance may be the ordinary Levenshtein distance or the Damerau-Levenshtein distance.
The ordinary Levenshtein distance counts only insertions, deletions and substitutions. When using it, the server counts the number of edits needed to convert the first character string into the second by inserting, deleting and substituting single characters, and sets that number as the edit distance between the first character string and the second character string.
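The source leaves the counting algorithm to the related art. As one possibility, the ordinary Levenshtein distance can be computed with the standard dynamic-programming recurrence; the function name `levenshtein` below is an illustrative choice, not something the source specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("ABCD", "AFCDE"))  # 2: substitute B with F, then append E
```

Keeping only two rows of the table keeps memory proportional to the length of the second string, which is usually sufficient when only the distance itself is needed.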
The Damerau-Levenshtein distance counts insertions, deletions, substitutions and transpositions. When using it, the server counts the number of edits needed to convert the first character string into the second by inserting a character, deleting a character, substituting a character and transposing characters, and sets that number as the edit distance between the first character string and the second character string.
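As a hedged illustration of a distance that also counts transpositions, the sketch below implements the optimal string alignment variant, a commonly used restricted form of the Damerau-Levenshtein distance in which only adjacent characters are transposed and no substring is edited twice. The source does not say which variant is intended; `osa_distance` is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ABCD", "ACBD"))  # 1: B and C are swapped
```

For the same pair the ordinary Levenshtein distance is 2, since the swap has to be expressed as two substitutions.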
It should be noted that, in practice, counting the number of edits needed to convert the first character string into the second can be implemented with preset code or an existing algorithm, which is not described in detail in this application; those skilled in the art may refer to the related art when putting the disclosed technical solution into practice.
In addition, the first character string and the second character string may contain Chinese characters, letters and digits, and these usually occupy different numbers of bytes when processed on the platform; for example, a Chinese character occupies two bytes, whereas a letter or digit usually occupies one byte. Therefore, to prevent differences in the number of bytes occupied by each character from affecting the result, the server can Unicode-encode the first character string and the second character string before calculating the edit distance, and then calculate the edit distance on the Unicode-encoded character strings. Unicode is the industry's unified encoding scheme for Chinese characters, digits and letters; it assigns each a uniform and unique code, which meets the requirements of cross-language, cross-platform text conversion and processing.
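The effect described above can be seen in a short sketch. It assumes that "Unicode encoding" amounts to comparing the strings one code point at a time rather than one byte at a time; the sample text is a hypothetical mixed Chinese and ASCII string, and GBK is used only as an example of a legacy encoding in which Chinese characters take two bytes.

```python
text = "杭州A1"  # two Chinese characters, one letter, one digit

# A Python str is a sequence of Unicode code points, so every character,
# Chinese or ASCII, counts as exactly one unit.
print(len(text))                  # 4

# In a byte-oriented encoding the same text has uneven per-character widths.
print(len(text.encode("gbk")))    # 6: 2 + 2 + 1 + 1 bytes
print(len(text.encode("utf-8")))  # 8: 3 + 3 + 1 + 1 bytes

# The code points that an edit-distance routine would compare.
print([hex(ord(ch)) for ch in text])
```

An edit distance computed over bytes would charge more for changing a Chinese character than for changing a letter, which is the distortion the paragraph above sets out to avoid.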
Step 102: adapt the lengths of the first character string and the second character string based on the edit distance, and calculate the similarity based on the adapted lengths of the first character string and the second character string.
In this example, after calculating the edit distance between the first character string and the second character string, the server can adapt the lengths of the two character strings according to the edit distance so as to reduce the difference between them. As a result, when the similarity of the first character string and the second character string is subsequently calculated from the edit distance, the influence of the length difference on the result is minimized.
In one implementation, when adapting the lengths of the first character string and the second character string based on the calculated edit distance, the server can determine the maximum and the minimum of the two lengths, and then either subtract the calculated edit distance from the maximum or add it to the minimum. This reduces the difference between the lengths of the first character string and the second character string and thereby adapts the two lengths.
For example, suppose the first character string is ABCD and the second character string is AFCDEG. The length of the first character string is 4, the length of the second is 6, and the edit distance between them is 3 (one substitution and two insertions). When adapting the lengths, the server can add the edit distance 3 to the length 4 of the first character string; after adaptation, the adapted length of the first character string is 7, and its difference from the length of the second character string is reduced. Alternatively, the server can subtract the edit distance 3 from the length 6 of the second character string; after adaptation, the adapted length of the second character string is 3, and its difference from the length of the first character string is reduced.
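The two adaptation options in this example can be written down directly. This is a minimal sketch; the function name `adapt_lengths` and the mode labels are illustrative and do not come from the source.

```python
def adapt_lengths(len_x: int, len_y: int, ds: int, mode: str = "shrink_max") -> tuple[int, int]:
    """Return (longer length, shorter length) after adapting one of them by ds."""
    longest, shortest = max(len_x, len_y), min(len_x, len_y)
    if mode == "shrink_max":
        return longest - ds, shortest  # subtract the edit distance from the maximum
    if mode == "grow_min":
        return longest, shortest + ds  # add the edit distance to the minimum
    raise ValueError(f"unknown mode: {mode}")


# ABCD (length 4) and AFCDEG (length 6) with edit distance 3.
print(adapt_lengths(4, 6, 3, "grow_min"))    # (6, 7)
print(adapt_lengths(4, 6, 3, "shrink_max"))  # (3, 4)
```

In both cases the gap between the two lengths drops from 2 to 1, matching the worked example.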
Of course, in practice, when the server adapts the lengths of the first character string and the second character string based on the edit distance, implementations other than subtracting the edit distance from the maximum of the two lengths, or adding the edit distance to the minimum of the two lengths, are also possible; they are not enumerated here.
In this example, once the server has adapted the lengths of the first character string and the second character string, it can calculate the similarity from the adapted lengths.
In one implementation, after the lengths of the first character string and the second character string have been adapted, the server can calculate the ratio between the minimum and the maximum of the adapted lengths. This ratio is a value between 0 and 1, so the server can use it to characterize the similarity of the first character string and the second character string.
On the one hand, if the server adapted the lengths by subtracting the edit distance from the maximum of the two lengths, then when calculating the similarity it can calculate the ratio of the maximum, after the edit distance has been subtracted, to the minimum, and characterize the similarity of the two character strings by that ratio.
On the other hand, if the server adapted the lengths by adding the edit distance to the minimum of the two lengths, then when calculating the similarity it can calculate the ratio of the maximum to the minimum after the edit distance has been added, and characterize the similarity of the two character strings by that ratio.
Based on this, suppose the first character string is x and the second character string is y, the length of the first character string x is |x|, the length of the second character string y is |y|, and the edit distance between them is ds.
If the server adapts |x| and |y| by subtracting ds from the maximum of |x| and |y|, it can calculate the similarity between the first character string x and the second character string y with Formula 1, which takes the ratio of max(|x|, |y|) - ds to min(|x|, |y|) and applies the correction parameter C (the formula image is not reproduced in this text).
If the server adapts |x| and |y| by adding ds to the minimum of |x| and |y|, it can calculate the similarity between the first character string x and the second character string y with Formula 2, which takes the ratio of max(|x|, |y|) to min(|x|, |y|) + ds and applies the correction parameter C (the formula image is not reproduced in this text).
In these two formulas, S denotes the similarity between the first character string x and the second character string y; max(|x|, |y|) denotes the maximum of the lengths of the first character string and the second character string; and min(|x|, |y|) denotes the minimum of those lengths. C denotes the correction parameter introduced into the formulas, which can be a constant greater than or equal to 0 (that is, the formulas may or may not introduce a value for C). Introducing the correction parameter allows the result of the formula to be corrected.
The specific value of the correction parameter can be an engineering empirical value set by the user according to actual requirements and is not specifically limited in this disclosure. For example, in an implementation the correction parameter can be a smoothing parameter that the user obtains by exponential smoothing; introducing the smoothing parameter into the formulas corrects their results and reduces the error of the calculation. When the first character string x and the second character string y have the same length, that is, |x| = |y|, the value of max(|x|, |y|) equals the value of min(|x|, |y|). In that case the correction parameter C can be 0 (equal lengths need no correction), and Formula 1 reduces to S = 1 - ds/min(|x|, |y|) or, equivalently, S = 1 - ds/max(|x|, |y|). Because max(|x|, |y|) and min(|x|, |y|) are equal here and both represent the common length of x and y, Formula 1 can be written as S = 1 - ds/L, where L is the length of the first character string x and of the second character string y.
It can be seen that, when the first character string x and the second character string y have the same length, the similarity formula described in the above example agrees with the prior-art definition of similarity based on edit distance.
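The reduction described above can be written out step by step. The following is only a worked sketch: it takes Formula 1 as the "maximum minus ds" form stated in the claims, and the numbers in the last line (L = 10, ds = 2) are hypothetical values chosen for illustration, not taken from the application.

```latex
% Formula 1 with |x| = |y| = L and C = 0
S = \frac{\max(|x|,|y|) - ds + C}{\min(|x|,|y|) + C}
  = \frac{L - ds + 0}{L + 0}
  = 1 - \frac{ds}{L}

% Hypothetical numbers: L = 10, ds = 2
S = \frac{10 - 2}{10} = 1 - \frac{2}{10} = 0.8
```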
Step 103: judge, based on the similarity, whether the first character string and the second character string are identical.
In this example, after the server calculates the similarity between the first character string and the second character string, it can compare the calculated similarity value with a preset similarity threshold to judge whether the value reaches the threshold. If the calculated similarity value reaches the similarity threshold, the server can determine that the first character string and the second character string are identical. Conversely, if the calculated similarity value is below the similarity threshold, the server can determine that the first character string and the second character string are different.
It should be pointed out that the similarity threshold can be set by the user according to actual requirements. For example, in an implementation the similarity threshold can be an engineering empirical value: engineers can manually judge whether a large number of character strings are identical and then analyze the results of the manual judgment to set the similarity threshold. Alternatively, the results of the manual judgment can be used as data analysis samples on which the server performs statistical analysis to set the similarity threshold.
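The threshold comparison of step 103, and one possible way of deriving a threshold from manually judged samples, can be sketched as follows. This is a minimal illustration, not the method of the application: the function names `is_same` and `pick_threshold` are hypothetical, "reaches the threshold" is read as greater than or equal to, and choosing the candidate with the highest accuracy on the labelled samples is just one assumed form of "statistical analysis".

```python
def is_same(similarity, threshold):
    """Step 103: the strings are judged identical if the similarity reaches the threshold."""
    return similarity >= threshold


def pick_threshold(labelled):
    """Pick a threshold from manually judged samples (assumed approach).

    labelled: list of (similarity, manually_judged_same) pairs.
    Returns the observed similarity value that, used as the threshold,
    reproduces the most manual judgments.
    """
    candidates = sorted({s for s, _ in labelled})

    def accuracy(t):
        hits = sum(is_same(s, t) == same for s, same in labelled)
        return hits / len(labelled)

    return max(candidates, key=accuracy)


# Hypothetical manually judged samples: (similarity, judged identical?)
samples = [(0.95, True), (0.90, True), (0.80, False), (0.60, False)]
threshold = pick_threshold(samples)
print(threshold, is_same(0.92, threshold))
```

With real data the labelled set would be much larger, and a different selection criterion (for example one that weighs false matches more heavily than missed matches) may suit a fraud-prevention setting better.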
The technical solution of the above example is described in detail below through a specific example in an application scenario.
In this example, assume that the character string is a user's mailing address and that the server is the cloud platform of an e-commerce provider, for example the Taobao platform.
The cloud platform can help a merchant compare the mailing address that a user uploads with the real mailing address that the user has reserved on the platform, to determine whether the uploaded address is a genuine and valid address of that user and thereby prevent fraud caused by users uploading false address information.
Assume that the first address uploaded to the merchant by the user is "Floor 17, Building B, Huanglong Times Square, Xihu District, Hangzhou", and that the second address of the user reserved on the cloud platform is "Floor 17, Building B, Huanglong Times Square, Xihu District, Hangzhou, Ant Financial Legal Affairs Department". Measured on the original Chinese text, the character length of the first address is 17 (each Chinese character, letter, and digit counts as one character), and the character length of the second address is 24.
To convert the first address into the second address, the server only needs to append the 7 Chinese characters of "Ant Financial Legal Affairs Department", so the server calculates the edit distance between the two as 7.
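The edit distance in this example can be reproduced with a short dynamic-programming routine. The sketch below assumes the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); the function name `edit_distance` is hypothetical. Because the Chinese address text is not reproduced here, two ASCII stand-in strings with the same lengths (17 and 24) are used, where the second is the first plus a 7-character suffix.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, row by row."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Stand-in strings with the lengths from the address example (assumption).
first = "abcdefghijklmnopq"          # 17 characters
second = first + "rstuvwx"           # 24 characters, 7 appended
print(len(first), len(second), edit_distance(first, second))
```

Appending 7 characters needs exactly 7 insertions, which matches the edit distance of 7 stated for the two addresses.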
In existing implementations, the similarity between the first address x and the second address y can be calculated by the following formula:

$$S = 1 - \frac{ds}{L}$$

In this formula, S represents the similarity, ds represents the edit distance, and L represents the string length, where L can be min(|x|, |y|), max(|x|, |y|), or |x| + |y|.

When L is min(|x|, |y|):

$$S = 1 - \frac{7}{17} \approx 0.588$$

When L is max(|x|, |y|):

$$S = 1 - \frac{7}{24} \approx 0.708$$

When L is |x| + |y|:

$$S = 1 - \frac{7}{41} \approx 0.829$$
In this example, assume that the similarity threshold preset on the cloud platform is 0.85. Every similarity value calculated above with the prior-art formula is below this threshold.
In this case, when the cloud platform judges on the basis of the similarity threshold whether the first address and the second address are the same address, it may misjudge them as different addresses, even though they are in substance the same address and differ only in length.
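The three prior-art results can be checked numerically. The sketch below uses only the values stated in the example (lengths 17 and 24, edit distance 7, threshold 0.85) and the prior-art definition S = 1 - ds/L; the function and variable names are hypothetical.

```python
def prior_art_similarity(ds, length):
    """Prior-art definition: S = 1 - ds / L."""
    return 1 - ds / length


len_x, len_y, ds = 17, 24, 7
threshold = 0.85

# The three choices of L discussed in the example.
choices = {
    "min(|x|,|y|)": min(len_x, len_y),
    "max(|x|,|y|)": max(len_x, len_y),
    "|x|+|y|": len_x + len_y,
}
scores = {name: prior_art_similarity(ds, length) for name, length in choices.items()}

for name, s in scores.items():
    print(f"L = {name}: S = {s:.3f}, reaches threshold: {s >= threshold}")
```

None of the three values reaches 0.85, so a threshold comparison on any of them would treat the two addresses as different.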
In this example, if the cloud platform adapts the character lengths of the first address and the second address using their edit distance and calculates the similarity value from the adapted lengths, the influence of the length difference on the similarity result is greatly reduced. Judging from the resulting similarity value whether the two addresses are the same address is then more accurate, and misjudgment is avoided.
On the one hand, assume that the cloud platform adapts the lengths of the first address and the second address by subtracting the edit distance from the larger of the two lengths. The cloud platform can then calculate the similarity of the two by the following formula (taking C = 0 as an example):

$$S = \frac{24 - 7 + 0}{17 + 0} = 1$$
On the other hand, assume that the cloud platform adapts the lengths of the first address and the second address by adding the edit distance to the smaller of the two lengths. The cloud platform can then calculate the similarity of the two by the following formula (again taking C = 0):

$$S = \frac{24 + 0}{17 + 7 + 0} = 1$$
It can be seen that after the cloud platform adapts the lengths of the first address and the second address, the calculated similarity value is 1, a marked improvement in the accuracy of the similarity.
This similarity value is greater than the similarity threshold of 0.85, so when the cloud platform judges on the basis of the similarity threshold whether the first address and the second address are the same address, it can determine that they are the same address, and the misjudgment is avoided.
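Both adapted formulas can be evaluated on the address example in a few lines. This is a sketch under stated assumptions: the function name `adapted_similarity` and its `mode` argument are hypothetical, the formulas are the two forms given in the claims, and the inputs are the lengths 17 and 24, the edit distance 7, and the threshold 0.85 from the example.

```python
def adapted_similarity(len_x, len_y, ds, c=0.0, mode="max_minus_ds"):
    """Similarity computed from lengths adapted by the edit distance ds.

    mode="max_minus_ds": (max - ds + C) / (min + C)
    mode="min_plus_ds":  (max + C) / (min + ds + C)
    A zero denominator (an empty string with C = 0) is not handled here.
    """
    longest, shortest = max(len_x, len_y), min(len_x, len_y)
    if mode == "max_minus_ds":
        return (longest - ds + c) / (shortest + c)
    if mode == "min_plus_ds":
        return (longest + c) / (shortest + ds + c)
    raise ValueError(f"unknown mode: {mode}")


THRESHOLD = 0.85
first_len, second_len, distance = 17, 24, 7

for mode in ("max_minus_ds", "min_plus_ds"):
    s = adapted_similarity(first_len, second_len, distance, mode=mode)
    verdict = "same address" if s >= THRESHOLD else "different address"
    print(mode, s, verdict)
```

Because the second address is the first address plus 7 appended characters, the adapted lengths coincide and both forms give 1.0. For two strings of equal length 17 that differ by 7 substitutions, the first form gives 10/17, about 0.588, so genuinely different strings of the same length are still rejected at a threshold of 0.85.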
In the above embodiment, the edit distance between the first character string and the second character string is calculated; the lengths of the first character string and the second character string are adapted based on the edit distance, and the similarity is calculated from the adapted lengths; whether the first character string and the second character string are identical is then judged based on the similarity.
Because the application adapts the lengths of the first character string and the second character string based on the edit distance, the length difference between the two strings is reduced. When the similarity of the length-adapted strings is calculated, the influence of the length difference on the similarity result is minimized and the accuracy of the similarity calculation improves, so that judging from the similarity whether the first character string and the second character string are identical yields a markedly more accurate result.
Corresponding to the above method embodiment, the application also provides a device embodiment.
Referring to Fig. 2, the application proposes a device 20 for determining identical character strings, applied to a server. Referring to Fig. 3, the hardware architecture of the server carrying the device 20 generally includes a CPU, memory, non-volatile storage, a network interface, an internal bus, and so on. Taking a software implementation as an example, the device 20 can generally be understood as a computer program loaded in memory, that is, a logical device combining software and hardware that is formed when the CPU runs the program. The device 20 includes:
a first computing module 201, configured to calculate the edit distance between the first character string and the second character string;
a second computing module 202, configured to adapt the lengths of the first character string and the second character string based on the edit distance, and to calculate a similarity based on the adapted lengths of the first character string and the second character string;
a determination module 203, configured to judge, based on the similarity, whether the first character string and the second character string are identical.
In this example, the first computing module 201 is specifically configured to:
perform Unicode encoding on the first character string and the second character string;
calculate the edit distance between the Unicode-encoded first character string and second character string.
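One reading of the Unicode encoding step is that each string is turned into a sequence of Unicode code points before the edit distance is computed, so that a Chinese character, a letter, and a digit each count as exactly one unit. The sketch below illustrates that reading only; the function name `to_code_points` and the sample string are hypothetical, and the application does not state which concrete encoding form is used.

```python
def to_code_points(text):
    """Represent a string as a list of Unicode code points, one per character."""
    return [ord(ch) for ch in text]


# Hypothetical sample mixing a letter, digits, and Chinese characters.
sample = "B座17楼"

code_points = to_code_points(sample)
utf8_bytes = sample.encode("utf-8")

# Five characters give five code points, but nine UTF-8 bytes,
# because each of the two Chinese characters takes three bytes.
print(len(code_points), len(utf8_bytes))
```

Computing lengths and edit distances over code points rather than over encoded bytes keeps one edited Chinese character from counting as several edits.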
In this example, the second computing module 202 is specifically configured to:
calculate the maximum and the minimum of the lengths of the first character string and the second character string;
subtract the edit distance from the maximum, or add the edit distance to the minimum, so as to adapt the lengths of the first character string and the second character string.
In this example, the second computing module 202 is further configured to:
calculate the ratio of the maximum, after the edit distance has been subtracted from it, to the minimum, and use the ratio to characterize the similarity between the first character string and the second character string; or
calculate the ratio of the maximum to the minimum after the edit distance has been added to it, and use the ratio to characterize the similarity between the first character string and the second character string.
In this example, the second computing module 202 is further configured to:
calculate the similarity between the first character string and the second character string based on a preset similarity formula.
The similarity formula includes:

$$S = \frac{\max(|x|,|y|) - ds + C}{\min(|x|,|y|) + C}$$

or

$$S = \frac{\max(|x|,|y|) + C}{\min(|x|,|y|) + ds + C}$$

where x represents the first character string and |x| its length; y represents the second character string and |y| its length; max(|x|, |y|) represents the larger of the lengths of the first character string and the second character string; min(|x|, |y|) represents the smaller of the two lengths; ds represents the edit distance between the first character string and the second character string; and C represents a correction parameter, which is a constant greater than or equal to 0.
In this example, the determination module 203 is specifically configured to:
judge whether the calculated similarity reaches a preset threshold;
when the calculated similarity reaches the preset threshold, determine that the first character string and the second character string are identical;
when the calculated similarity does not reach the preset threshold, determine that the first character string and the second character string are not identical.
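The division of work among modules 201, 202, and 203 can be sketched as three small classes wired together. This is an illustrative arrangement, not the device of the application: all class and method names are hypothetical, module 201 is assumed to use the Levenshtein distance, module 202 uses the "maximum minus ds" form of the formula, and the handling of a zero denominator is an added assumption that the text does not address.

```python
class FirstComputingModule:
    """Module 201: edit distance between two strings (Levenshtein assumed)."""

    def edit_distance(self, x, y):
        prev = list(range(len(y) + 1))
        for i, cx in enumerate(x, start=1):
            curr = [i]
            for j, cy in enumerate(y, start=1):
                cost = 0 if cx == cy else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]


class SecondComputingModule:
    """Module 202: adapts the lengths by ds and computes the similarity."""

    def __init__(self, c=0.0):
        self.c = c  # correction parameter C, a constant >= 0

    def similarity(self, x, y, ds):
        longest, shortest = max(len(x), len(y)), min(len(x), len(y))
        denominator = shortest + self.c
        if denominator == 0:
            # Assumption: an empty string with C = 0 is not covered by the text.
            return 1.0 if longest == 0 else 0.0
        return (longest - ds + self.c) / denominator


class DeterminationModule:
    """Module 203: compares the similarity with a preset threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def is_identical(self, similarity):
        return similarity >= self.threshold


class IdenticalStringDevice:
    """Device 20: chains the three modules."""

    def __init__(self, threshold=0.85, c=0.0):
        self.first = FirstComputingModule()
        self.second = SecondComputingModule(c)
        self.determination = DeterminationModule(threshold)

    def judge(self, x, y):
        ds = self.first.edit_distance(x, y)
        return self.determination.is_identical(self.second.similarity(x, y, ds))


device = IdenticalStringDevice(threshold=0.85)
print(device.judge("abcdefghijklmnopq", "abcdefghijklmnopqrstuvwx"))
print(device.judge("abcdefghij", "abcXefYhiZ"))
```

The first pair differs only by an appended suffix and is judged identical; the second pair has equal length with three substitutions, giving 7/10 = 0.7, and is judged different.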
Those skilled in the art will readily conceive of other embodiments of the application after considering the specification and practicing the invention disclosed herein. The application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application indicated by the following claims.
It should be understood that the application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the application is limited only by the appended claims.
The above are only preferred embodiments of the application and do not limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the application shall fall within its scope of protection.

Claims (12)

1. A method for determining identical character strings, characterized in that the method comprises:
calculating the edit distance between a first character string and a second character string;
adapting the lengths of the first character string and the second character string based on the edit distance, and calculating a similarity based on the adapted lengths of the first character string and the second character string;
judging, based on the similarity, whether the first character string and the second character string are identical.
2. The method according to claim 1, characterized in that calculating the edit distance between the first character string and the second character string comprises:
performing Unicode encoding on the first character string and the second character string;
calculating the edit distance between the Unicode-encoded first character string and second character string.
3. The method according to claim 1, characterized in that adapting the lengths of the first character string and the second character string based on the edit distance comprises:
calculating the maximum and the minimum of the lengths of the first character string and the second character string;
subtracting the edit distance from the maximum, or adding the edit distance to the minimum, so as to adapt the lengths of the first character string and the second character string.
4. The method according to claim 3, characterized in that calculating the similarity between the length-adapted first character string and second character string comprises:
calculating the ratio of the maximum, after the edit distance has been subtracted from it, to the minimum, and using the ratio to characterize the similarity between the first character string and the second character string; or
calculating the ratio of the maximum to the minimum after the edit distance has been added to it, and using the ratio to characterize the similarity between the first character string and the second character string.
5. The method according to claim 1, characterized in that adapting the lengths of the first character string and the second character string based on the edit distance, and calculating the similarity based on the adapted lengths of the first character string and the second character string, comprises:
calculating the similarity between the first character string and the second character string based on a preset similarity formula;
the similarity formula including:

$$S = \frac{\max(|x|,|y|) - ds + C}{\min(|x|,|y|) + C};$$

or

$$S = \frac{\max(|x|,|y|) + C}{\min(|x|,|y|) + ds + C}$$

where x represents the first character string and |x| its length; y represents the second character string and |y| its length; max(|x|, |y|) represents the larger of the lengths of the first character string and the second character string; min(|x|, |y|) represents the smaller of the two lengths; ds represents the edit distance between the first character string and the second character string; and C represents a correction parameter, which is a constant greater than or equal to 0.
6. The method according to claim 1, characterized in that judging, based on the similarity, whether the first character string and the second character string are identical comprises:
judging whether the calculated similarity reaches a preset threshold;
when the calculated similarity reaches the preset threshold, determining that the first character string and the second character string are identical;
when the calculated similarity does not reach the preset threshold, determining that the first character string and the second character string are not identical.
7. A device for determining identical character strings, characterized in that the device comprises:
a first computing module, configured to calculate the edit distance between a first character string and a second character string;
a second computing module, configured to adapt the lengths of the first character string and the second character string based on the edit distance, and to calculate a similarity based on the adapted lengths of the first character string and the second character string;
a determination module, configured to judge, based on the similarity, whether the first character string and the second character string are identical.
8. The device according to claim 7, characterized in that the first computing module is specifically configured to:
perform Unicode encoding on the first character string and the second character string;
calculate the edit distance between the Unicode-encoded first character string and second character string.
9. The device according to claim 7, characterized in that the second computing module is specifically configured to:
calculate the maximum and the minimum of the lengths of the first character string and the second character string;
subtract the edit distance from the maximum, or add the edit distance to the minimum, so as to adapt the lengths of the first character string and the second character string.
10. The device according to claim 9, characterized in that the second computing module is further configured to:
calculate the ratio of the maximum, after the edit distance has been subtracted from it, to the minimum, and use the ratio to characterize the similarity between the first character string and the second character string; or
calculate the ratio of the maximum to the minimum after the edit distance has been added to it, and use the ratio to characterize the similarity between the first character string and the second character string.
11. The device according to claim 7, characterized in that the second computing module is further configured to:
calculate the similarity between the first character string and the second character string based on a preset similarity formula;
the similarity formula including:

$$S = \frac{\max(|x|,|y|) - ds + C}{\min(|x|,|y|) + C};$$

or

$$S = \frac{\max(|x|,|y|) + C}{\min(|x|,|y|) + ds + C}$$

where x represents the first character string and |x| its length; y represents the second character string and |y| its length; max(|x|, |y|) represents the larger of the lengths of the first character string and the second character string; min(|x|, |y|) represents the smaller of the two lengths; ds represents the edit distance between the first character string and the second character string; and C represents a correction parameter, which is a constant greater than or equal to 0.
12. The device according to claim 7, characterized in that the determination module is specifically configured to:
judge whether the calculated similarity reaches a preset threshold;
when the calculated similarity reaches the preset threshold, determine that the first character string and the second character string are identical;
when the calculated similarity does not reach the preset threshold, determine that the first character string and the second character string are not identical.
CN201610052823.8A 2016-01-26 2016-01-26 Identical character string determination method and device Active CN106997335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610052823.8A CN106997335B (en) 2016-01-26 2016-01-26 Identical character string determination method and device

Publications (2)

Publication Number Publication Date
CN106997335A true CN106997335A (en) 2017-08-01
CN106997335B CN106997335B (en) 2020-05-19

Family

ID=59428480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610052823.8A Active CN106997335B (en) 2016-01-26 2016-01-26 Identical character string determination method and device

Country Status (1)

Country Link
CN (1) CN106997335B (en)

Also Published As

Publication number Publication date
CN106997335B (en) 2020-05-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.

TR01 Transfer of patent right