CN108255836B - Character string matching method and device - Google Patents

Info

Publication number
CN108255836B
Authority
CN
China
Prior art keywords
character string
length
string
matching
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201611237454.6A
Other languages
Chinese (zh)
Other versions
CN108255836A (en)
Inventor
闫继东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Potevio Information Technology Co Ltd
Original Assignee
Potevio Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Potevio Information Technology Co Ltd
Priority to CN201611237454.6A
Publication of CN108255836A
Application granted
Publication of CN108255836B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying

Abstract

The embodiment of the invention provides a character string matching method and device. The method comprises the following steps: acquiring character strings to be matched and at least one key character string corresponding to the character strings to be matched, and calculating a matching value of the key character strings, wherein the character strings to be matched comprise a first character string and a second character string; calculating the maximum prefix matching string length of the first character string and the second character string; calculating a first edit distance between the first character string and the second character string by using a preset rule according to the maximum prefix matching string length; and obtaining the similarity of the first character string and the second character string according to the first edit distance and the matching value. The device is configured to perform the method. By calculating the matching value of the key character strings, calculating the first edit distance of the first character string and the second character string with the preset rule, and obtaining the similarity from the first edit distance and the matching value, the embodiment of the invention improves the accuracy of character string matching.

Description

Character string matching method and device
Technical Field
The embodiment of the invention relates to the technical field of text classification processing, in particular to a character string matching method and device.
Background
The Jaro-Winkler algorithm is used for calculating the similarity between two character strings and is the mainstream algorithm for measuring the matching degree of the character strings at present.
The Jaro-Winkler algorithm is calculated as shown in formula (1):
$$W_{ij} = D_{ij} + l\,p\,(1 - D_{ij}) \qquad (1)$$
where $W_{ij}$ is the edit distance of the character strings to be matched $S_i$ and $S_j$; $l$ is the length of the common prefix of $S_i$ and $S_j$, with an upper limit of 4; $p$ is a constant scaling factor, $p = 0.1$; and $D_{ij}$ is the Jaro distance, a kind of edit distance.
$D_{ij}$ is calculated as shown in formula (2):
$$D_{ij} = \frac{1}{3}\left(\frac{m_{ij}}{|S_i|} + \frac{m_{ij}}{|S_j|} + \frac{m_{ij} - t_{ij}}{m_{ij}}\right) \qquad (2)$$
where $m_{ij}$ is the number of matched characters of the character strings to be matched $S_i$ and $S_j$; $|S_i|$ is the character length of $S_i$ and $|S_j|$ is the character length of $S_j$; and $t_{ij}$ is half the number of transposed positions, that is, if the characters a, b appear at the i-th position of one string and b, a appear at the j-th position of the other, the i-th and j-th positions are transposed. For $m_{ij}$, two characters from the two character strings to be matched are considered matched when they are the same and their positions in the respective strings differ by no more than $d_{max}$, the matching window, which is calculated as shown in formula (3):
$$d_{max} = \left\lfloor \frac{\max(|S_i|, |S_j|)}{2} \right\rfloor - 1 \qquad (3)$$
The existing Jaro-Winkler algorithm focuses on the character matches and the edit distance of two character strings (considering only transpositions). It assigns a higher similarity to character strings that share the same initial part, and is particularly suitable for calculating the similarity of short character strings.
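The prior-art calculation described above can be sketched in Python as follows. This is a minimal illustration that assumes formulas (2) and (3) take the standard Jaro form (matching window of half the longer length minus one, transpositions counted in halves); the function names `match_window`, `jaro` and `jaro_winkler` are chosen here for illustration and do not come from the patent.

```python
def match_window(len_i: int, len_j: int) -> int:
    """Formula (3), assumed standard form: floor(max length / 2) - 1, not below 0."""
    return max(max(len_i, len_j) // 2 - 1, 0)


def jaro(s_i: str, s_j: str) -> float:
    """Formula (2), assumed standard form: Jaro distance D_ij in [0, 1]."""
    if not s_i or not s_j:
        return 0.0
    d_max = match_window(len(s_i), len(s_j))
    used_j = [False] * len(s_j)   # positions of s_j already matched
    matched_i = []                # matched characters in s_i order
    for a, ch in enumerate(s_i):
        lo = max(0, a - d_max)
        hi = min(len(s_j), a + d_max + 1)
        for b in range(lo, hi):
            if not used_j[b] and s_j[b] == ch:
                used_j[b] = True
                matched_i.append(ch)
                break
    m = len(matched_i)
    if m == 0:
        return 0.0
    matched_j = [s_j[b] for b in range(len(s_j)) if used_j[b]]
    # t is half the number of matched characters that are out of order
    t = sum(x != y for x, y in zip(matched_i, matched_j)) / 2
    return (m / len(s_i) + m / len(s_j) + (m - t) / m) / 3


def jaro_winkler(s_i: str, s_j: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Formula (1): W_ij = D_ij + l * p * (1 - D_ij), with l capped at max_prefix."""
    d = jaro(s_i, s_j)
    l = 0
    for x, y in zip(s_i, s_j):
        if x != y or l == max_prefix:
            break
        l += 1
    return d + l * p * (1 - d)
```

Because `l` is capped at 4 and `p` is fixed, a long shared prefix and a short one contribute the same bonus once four characters agree, which is the limitation the following paragraph discusses.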
However, the method still treats all characters in a character string equally and does not consider the information content of the string, whereas certain key characters in a string (such as numbers and place names) often have a decisive effect on its nature; in addition, the fixed l and p in the formula cannot accurately measure all character strings. For example, suppose two character strings are considered to come from the same project when their similarity is greater than 0.8. For 'Changsha Rail Transit No. 6 Line Project' and 'Changsha Rail Transit No. 3 Line Project', the existing Jaro-Winkler algorithm gives a similarity of 0.9583, so the two strings are identified as the same project and matched, which is obviously incorrect. As another example, 'Changsha City Rail Transit No. 1 Line Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement' and 'Changsha City Rail Transit No. 1 Line Phase I Project operation period monitoring project postponement announcement' come from the same large project and should be matched, but their similarity is only 0.6119. It can be seen that when string matching is measured only by the similarity obtained from the edit distance, accuracy is high for short strings without numbers or proper names, but the error is too large for strings that contain numbers or proper names or that differ greatly in length.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a character string matching method and device.
In one aspect, an embodiment of the present invention provides a character string matching method, including:
acquiring a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and calculating a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string;
calculating the maximum prefix matching character string length of the first character string and the second character string;
calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string;
and obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value.
In another aspect, an embodiment of the present invention provides a character string matching device, including:
an acquisition module, configured to acquire character strings to be matched and at least one key character string corresponding to the character strings to be matched, and to calculate a matching value of the key character strings, wherein the character strings to be matched comprise a first character string and a second character string;
a string length calculation module, configured to calculate the maximum prefix matching string length of the first character string and the second character string;
an edit distance calculation module, configured to calculate a first edit distance between the first character string and the second character string by using a preset rule according to the maximum prefix matching string length;
and a similarity calculation module, configured to obtain the similarity of the first character string and the second character string according to the first edit distance and the matching value.
According to the character string matching method and device provided by the embodiment of the invention, the matching value of the key character string is calculated, the first editing distance of the first character string and the second character string is calculated by using the preset rule, and finally, the similarity is obtained according to the first editing distance and the matching value, so that the accuracy of character string matching is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a character string matching method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a character string matching device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the physical structure of a character string matching device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a character string matching method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101: acquiring character strings to be matched and at least one key character string corresponding to the character strings to be matched, and calculating a matching value of the key character strings, wherein the character strings to be matched comprise a first character string and a second character string;
Specifically, the bidding projects collected by a web crawler form a huge set, in which many projects are sub-projects of the same large project. The character strings to be matched are obtained from this set and comprise a first character string and a second character string. For example, the first character string is 'Changsha City Rail Transit No. 1 Line Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement' and the second character string is 'Changsha City Rail Transit No. 1 Line Phase I Project operation period monitoring project postponement announcement'. At least one key character string corresponding to the character strings to be matched is also obtained. A key character string is a group of characters that can determine the category, attributes and the like of a character string. For the rail transit bidding announcement project names used in the embodiment of the invention, the names contain three key character strings, 'city', 'number line' and 'period', and the matching value is calculated from these three key character strings. It should be noted that different types of character strings contain different key character strings.
Step 102: calculating the maximum prefix matching character string length of the first character string and the second character string;
Specifically, the characters are compared in sequence, starting from the first character of the first character string and the first character of the second character string. If the first characters are the same, the maximum prefix matching string length is 1; the second characters of the two strings are then compared, and if they are the same, the maximum prefix matching string length is 2; the third characters are then compared, and so on, until the corresponding characters of the first character string and the second character string differ, or until all characters of the first character string or all characters of the second character string have been judged. The resulting count is the maximum prefix matching string length.
Step 103: calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string;
Specifically, the first edit distance between the first character string and the second character string is calculated with the preset rule according to the counted maximum prefix matching string length.
Step 104: and obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value.
Specifically, the product of the first edit distance and the matching value is the similarity of the first character string and the second character string.
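Step 104 can be sketched as a one-line function. This is only an illustration; the name `similarity` and its parameter names are chosen here and are not from the patent.

```python
def similarity(first_edit_distance: float, match_value: float) -> float:
    """Step 104: the similarity is the product of the first edit distance
    and the key-string matching value."""
    return first_edit_distance * match_value


# With a 0-1 matching value, a failed key-string match forces the similarity to 0,
# whatever the edit distance is.
print(similarity(0.93, 1))
print(similarity(0.93, 0))
```

With a 0-1 matching function the matching value acts as a gate: strings whose key character strings disagree are never reported as similar, even when their edit distance is high.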
According to the embodiment of the invention, the matching value of the key character string is calculated, the first editing distance of the first character string and the second character string is calculated by using the preset rule, and finally, the similarity is obtained according to the first editing distance and the matching value, so that the accuracy of character string matching is improved.
On the basis of the above embodiment, the calculating the matching value of the key character string includes:
acquiring a preset number of first characters before the key character string in the first character string and a preset number of second characters before the key character string in the second character string;
if the first character is judged to be the same as the second character, the first character string is matched with the second character string;
and obtaining the matching value according to the matching condition of the first character string and the second character string.
Specifically, for the first character string, 'Changsha City Rail Transit No. 1 Line Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement', the two characters before the key character string 'city' are taken as the first characters. For the second character string, 'Changsha City Rail Transit No. 1 Line Phase I Project operation period monitoring project postponement announcement', the two characters before the key character string 'city' are taken as the second characters. The first characters and the second characters are both 'Changsha' (two characters in the original Chinese), so the key character string 'city' is matched. The same judgment shows that the key character strings 'number line' and 'period' of the first character string and the second character string are also matched. Assume the matching value function is a 0-1 function: if all three key character strings are matched, the matching value is 1; otherwise, the matching value is 0. It should be noted that the matching function may also be another function, such as a normal distribution function, in which case the matching value is determined according to the number of matched key character strings.
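The key-string check with a 0-1 matching function can be sketched as follows. The helper names, the English example strings and the key strings "City", "Line" and "Phase" are hypothetical stand-ins for the Chinese originals; treating a key string that is missing from either string as "not matched" is also an assumption, since the text does not cover that case.

```python
from typing import Optional, Sequence


def chars_before(text: str, key: str, count: int) -> Optional[str]:
    """Return up to `count` characters immediately before the first `key` in `text`."""
    pos = text.find(key)
    if pos < 0:
        return None  # assumption: a missing key string counts as "no match"
    return text[max(0, pos - count):pos]


def key_matches(first: str, second: str, key: str, count: int) -> bool:
    """A key string matches when the characters preceding it are identical."""
    a = chars_before(first, key, count)
    b = chars_before(second, key, count)
    return a is not None and a == b


def match_value(first: str, second: str, keys: Sequence[str], count: int = 2) -> int:
    """0-1 matching function: 1 only if every key string matches, else 0."""
    return int(all(key_matches(first, second, k, count) for k in keys))


# Hypothetical English stand-ins for the project names in the text.
keys = ("City", "Line", "Phase")
line6 = "Changsha City rail transit 6 Line 1 Phase project"
line3 = "Changsha City rail transit 3 Line 1 Phase project"
print(match_value(line6, line3, keys))  # the characters before "Line" differ
print(match_value(line6, line6, keys))
```

Replacing `all(...)` with a count of matched key strings would give the input needed for a graded matching function of the kind the text mentions as an alternative.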
According to the embodiment of the invention, the matching value of the key character string is determined according to the matching of the first character and the second character with the preset number in front of the key character string, and the matching value is used for calculating the similarity of the character string to be matched, so that the accuracy of character string matching is improved.
On the basis of the foregoing embodiment, the calculating a maximum prefix matching string length of the first string and the second string includes:
calculating a first character string length corresponding to the first character string and a second character string length corresponding to the second character string, and initializing the maximum prefix matching character string length;
judging whether a first character in the first character string is the same as a first character in the second character string, if so, adding 1 to the length of the maximum prefix matching character string, and continuously judging whether the next character is the same;
sequentially judging whether the corresponding characters of the first character string and the second character string are the same, until the corresponding characters differ or until the last character of the shorter of the first character string and the second character string has been judged;
and obtaining the maximum prefix matching character string length.
Specifically, the method for calculating the maximum prefix matching string length of the first string and the second string includes the following steps:
(1) Let the first character string be $S_i$ and the second character string be $S_j$; the first string length of $S_i$ is $|S_i|$ and the second string length of $S_j$ is $|S_j|$; initialize the maximum prefix matching string length to $L_{ij} = 0$;
(2) Judge whether the first character $S_i[1]$ of the first character string and the first character $S_j[1]$ of the second character string are the same; if $S_i[1]$ and $S_j[1]$ are different, end the judgment with $L_{ij} = 0$;
(3) If $S_i[1]$ and $S_j[1]$ are the same, set $L_{ij} = 1$ and continue by judging whether $S_i[2]$ and $S_j[2]$ are the same;
(4) Continue until $S_i[k]$ and $S_j[k]$ are different, or until $k > \min(|S_i|, |S_j|)$; at this point $L_{ij} = k - 1$.
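Steps (1) to (4) can be sketched as a short loop. The function name `max_prefix_match_length` is chosen here for illustration; note that Python indexes from 0, whereas the steps above index characters from 1.

```python
def max_prefix_match_length(s_i: str, s_j: str) -> int:
    """Count how many leading characters of s_i and s_j are identical (L_ij)."""
    l_ij = 0                            # step (1): initialise L_ij = 0
    limit = min(len(s_i), len(s_j))     # stop at the end of the shorter string
    # steps (2)-(4): advance while the corresponding characters are the same
    while l_ij < limit and s_i[l_ij] == s_j[l_ij]:
        l_ij += 1
    return l_ij


print(max_prefix_match_length("abcde", "abxyz"))   # first two characters agree
print(max_prefix_match_length("abc", "abcdef"))    # shorter string is exhausted
```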
The embodiment of the invention uses the maximum prefix matching string length to calculate the similarity of the character strings to be matched. It does not use a fixed l and p to measure the importance of the prefix characters, but calculates the actual maximum prefix matching string length of the character strings, thereby improving the accuracy of character string matching.
On the basis of the foregoing embodiment, the calculating an edit distance of the first character string and the second character string by using a preset rule includes:
calculating the matching length and the transposition number of the first character string and the second character string, the first character string length corresponding to the first character string and the second character string length corresponding to the second character string, and calculating a second editing distance according to the matching length, the transposition number, the first character string length and the second character string length;
calculating a first ratio of the maximum prefix matching string length to the first string length and a second ratio of the maximum prefix matching string length to the second string length;
if the maximum prefix matching string length is greater than or equal to a preset threshold, the prefix matching string length is set to the preset threshold; otherwise, the prefix matching string length is equal to the maximum prefix matching string length;
if 1/2 times of the sum of the first ratio and the second ratio is smaller than the product of the prefix matching string length and a constant scaling factor, calculating the first editing distance according to the second editing distance and the product of the prefix matching string length and the constant scaling factor;
if 1/2 times the sum of the first ratio and the second ratio is greater than or equal to the product of the prefix matching string length and the constant scaling factor, calculating the first edit distance according to the second edit distance and 1/2 times the sum of the first ratio and the second ratio.
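The capping of the prefix length and the choice between the two cases can be sketched as below. The function name `prefix_weight` and the word "weight" are introduced here for illustration; the threshold of 4 and p = 0.1 follow the prior-art values quoted in the text. The sketch stops at selecting the quantity that is combined with the second edit distance, because the combining formula itself is not reproduced in the text; non-empty strings are assumed.

```python
from typing import Tuple


def prefix_weight(l_max: int, len_i: int, len_j: int,
                  threshold: int = 4, p: float = 0.1) -> Tuple[int, float, str]:
    """Return (l, weight, case) for the preset rule.

    l      : prefix matching string length, capped at `threshold`
    weight : the quantity the rule combines with the second edit distance
    case   : "lp" if the half ratio sum is below l * p, otherwise "ratio"
    """
    l = min(l_max, threshold)
    half_ratio_sum = 0.5 * (l_max / len_i + l_max / len_j)
    if half_ratio_sum < l * p:
        return l, l * p, "lp"
    return l, half_ratio_sum, "ratio"


print(prefix_weight(14, 42, 25))   # long shared prefix: the ratio case applies
print(prefix_weight(2, 50, 50))    # short shared prefix: the l * p case applies
```

The rule therefore never gives the shared prefix less weight than the fixed `l * p` of the prior art, and gives it more when the shared prefix covers a large fraction of both strings.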
Specifically, assume that the first character string is 'Changsha City Rail Transit No. 1 Line Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement' and the second character string is 'Changsha City Rail Transit No. 1 Line Phase I Project operation period monitoring project postponement announcement'. It follows that:
(1) the first string length of the first character string is 42 and the second string length of the second character string is 25 (counted in characters of the original Chinese strings). According to the matching window formula
$$d_{max} = \left\lfloor \frac{\max(|S_i|, |S_j|)}{2} \right\rfloor - 1$$
$d_{max} = 20$. Two characters from the two character strings are matched when they are the same and their positions in the character strings differ by no more than $d_{max}$. The matching length of the first character string and the second character string is therefore 14 and the transposition number is 0, and the second edit distance calculated from the matching length, the transposition number, the first string length and the second string length is 0.63;
(2) according to the first character string and the second character string, the length of the maximum prefix matching character string is 14, the first ratio of the length of the maximum prefix matching character string to the length of the first character string is 0.33, and the second ratio of the length of the maximum prefix matching character string to the length of the second character string is 0.56;
(3) in the Jaro-Winkler algorithm in the prior art, when the length of the maximum prefix matching character string in the first character string and the second character string is greater than or equal to 4, the length of the prefix matching character string is 4, and as the length of the maximum prefix matching character string is 14 and greater than 4, the length of the prefix matching character string is 4;
(4) 1/2 times the sum of the first ratio and the second ratio is 0.445, and the product of the prefix matching string length and the constant scaling factor is 0.4; since 0.445 is greater than 0.4, the first edit distance is calculated according to the second edit distance and 1/2 times the sum of the first ratio and the second ratio.
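The figures in items (1) to (4) can be checked with a few lines of arithmetic. The sketch below assumes the standard Jaro formula for the second edit distance and the standard matching window; the variable names are illustrative only.

```python
len_i, len_j = 42, 25      # first and second string lengths
m_ij, t_ij = 14, 0         # matching length and transposition number
l_max = 14                 # maximum prefix matching string length

d_max = max(len_i, len_j) // 2 - 1                                # matching window
d_ij = (m_ij / len_i + m_ij / len_j + (m_ij - t_ij) / m_ij) / 3   # second edit distance
first_ratio = l_max / len_i
second_ratio = l_max / len_j
half_sum = (first_ratio + second_ratio) / 2
lp = min(l_max, 4) * 0.1   # prefix length capped at 4, p = 0.1

print(d_max)                                            # 20
print(round(d_ij, 2))                                   # 0.63
print(round(first_ratio, 2), round(second_ratio, 2))    # 0.33 0.56
print(round(half_sum, 3), round(lp, 1), half_sum >= lp)
```

Without intermediate rounding the half sum is about 0.447 rather than the 0.445 obtained from the rounded ratios 0.33 and 0.56; either way it exceeds 0.4, so the same case applies.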
According to the embodiment of the invention, the matching value of the key character string is calculated, the first edit distance of the first character string and the second character string is calculated by using the preset rule, and finally, the similarity is obtained according to the first edit distance and the matching value, so that the accuracy of character string matching is improved.
On the basis of the foregoing embodiment, the calculating an edit distance of the first character string and the second character string by using a preset rule includes:
the first edit distance is:
Figure BDA0001195631830000081
where $W_{ij}$ is the first edit distance; $D_{ij}$ is the second edit distance; $l$ is the prefix matching string length; $p$ is the constant scaling factor, $p = 0.1$; $L_{ij}$ is the maximum prefix matching string length; $|S_i|$ is the first string length; and $|S_j|$ is the second string length.
Specifically, from the above example, the product of the prefix matching string length and the constant scaling factor is $lp = 0.4$, and 1/2 times the sum of the first ratio and the second ratio is
$$\frac{1}{2}\left(\frac{L_{ij}}{|S_i|} + \frac{L_{ij}}{|S_j|}\right) = 0.445$$
Therefore
$$\frac{1}{2}\left(\frac{L_{ij}}{|S_i|} + \frac{L_{ij}}{|S_j|}\right) \geq lp$$
and the first edit distance $W_{ij}$ is calculated with the case of the formula that applies under this condition. The calculation gives $W_{ij} = 0.93$, in contrast to the value obtained with the prior-art calculation. If the edit distance is greater than or equal to 0.8, the first character string and the second character string are judged to belong to the same project; if it is less than 0.8, they are judged not to belong to the same project. The first character string and the second character string are therefore judged to belong to the same project.
The embodiment of the invention can accurately judge whether the character strings to be matched are matched or not through the specific first edit distance calculation formula.
Fig. 2 is a schematic structural diagram of a character string matching apparatus according to an embodiment of the present invention, and as shown in fig. 2, the apparatus includes: an obtaining module 201, a character string length calculating module 202, an edit distance calculating module 203, and a similarity calculating module 204, wherein:
the obtaining module 201 is configured to obtain a to-be-matched character string and at least 1 key character string corresponding to the to-be-matched character string, and calculate a matching value of the key character string, where the to-be-matched character string includes a first character string and a second character string; the string length calculating module 202 is configured to calculate a maximum prefix matching string length of the first string and the second string; the edit distance calculation module 203 is configured to calculate a first edit distance between the first character string and the second character string according to the maximum prefix matching character string length by using a preset rule; the similarity calculation module 204 is configured to obtain a similarity between the first character string and the second character string according to the first edit distance and the matching value.
Specifically, the bidding projects collected by a web crawler form a huge set, in which many projects are sub-projects of the same large project. The obtaining module 201 obtains the character strings to be matched from this set, comprising a first character string and a second character string, obtains at least one key character string corresponding to the character strings to be matched, where a key character string is a group of characters that can determine the category, attributes and the like of a character string, and calculates a matching value according to the key character strings. The string length calculating module 202 compares the characters in sequence, starting from the first character of the first character string and the first character of the second character string: if the first characters are the same, the maximum prefix matching string length is 1; the second characters are then compared, and if they are the same, the maximum prefix matching string length is 2; the third characters are then compared, and so on, until the corresponding characters of the first character string and the second character string differ, or until all characters of the first character string or all characters of the second character string have been judged, which yields the maximum prefix matching string length. The edit distance calculation module 203 calculates the first edit distance between the first character string and the second character string with the preset rule according to the counted maximum prefix matching string length. The similarity calculation module 204 takes the product of the first edit distance and the matching value as the similarity of the first character string and the second character string.
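The division of work among the four modules can be sketched as a small class that wires three caller-supplied functions together. The class name `StringMatcher` and its field names are hypothetical; the edit distance function is injected rather than implemented, and the stand-in functions in the usage lines are placeholders, not the patent's calculations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StringMatcher:
    # Each callable stands in for one module of Fig. 2 and is supplied by the caller.
    key_match_value: Callable[[str, str], float]            # obtaining module 201
    max_prefix_length: Callable[[str, str], int]            # string length calculating module 202
    first_edit_distance: Callable[[str, str, int], float]   # edit distance calculation module 203

    def similarity(self, first: str, second: str) -> float:
        """Similarity calculation module 204: first edit distance times matching value."""
        match = self.key_match_value(first, second)
        l_max = self.max_prefix_length(first, second)
        w = self.first_edit_distance(first, second, l_max)
        return w * match


# Usage with trivial stand-in functions (placeholders for illustration only).
matcher = StringMatcher(
    key_match_value=lambda a, b: 1.0,
    max_prefix_length=lambda a, b: 3,
    first_edit_distance=lambda a, b, l_max: 0.9,
)
print(matcher.similarity("abc-1", "abc-2"))
```

Passing the modules in as functions keeps the sketch independent of any particular matching function or edit distance rule.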
The embodiment of the apparatus provided in the present invention may be specifically configured to execute the processing flows of the above method embodiments, and the functions of the apparatus are not described herein again, and refer to the detailed description of the above method embodiments.
According to the embodiment of the invention, the matching value of the key character string is calculated, the first editing distance of the first character string and the second character string is calculated by using the preset rule, and finally, the similarity is obtained according to the first editing distance and the matching value, so that the accuracy of character string matching is improved.
On the basis of the foregoing embodiment, the obtaining module is specifically configured to:
acquiring a preset number of first characters before the key character string in the first character string and a preset number of second characters before the key character string in the second character string;
if the first character is judged to be the same as the second character, the first character string is matched with the second character string;
and obtaining the matching value according to the matching condition of the first character string and the second character string.
Specifically, for the first character string, 'Changsha City Rail Transit No. 1 Line Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement', the obtaining module takes the two characters before the key character string 'city' as the first characters. For the second character string, 'Changsha City Rail Transit No. 1 Line Phase I Project operation period monitoring project postponement announcement', the obtaining module takes the two characters before the key character string 'city' as the second characters. The first characters and the second characters are both 'Changsha' (two characters in the original Chinese), so the key character string 'city' is matched. The same judgment shows that the key character strings 'number line' and 'period' of the first character string and the second character string are also matched. The matching function is set as a 0-1 function: if all three key character strings are matched, the matching value is 1; otherwise, the matching value is 0. It should be noted that the matching function may also be another function, such as a normal distribution function, in which case the matching value is determined according to the number of matched key character strings.
According to the embodiment of the invention, the matching value of the key character string is determined according to the matching of the first character and the second character with the preset number in front of the key character string, and the matching value is used for calculating the similarity of the character string to be matched, so that the accuracy of character string matching is improved.
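The key character string matching described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the function names `preceding_chars` and `key_match_value`, the sample strings, and the choice to treat a missing key character string as a mismatch are all assumptions made for this sketch.

```python
def preceding_chars(text, key, count):
    """Return up to `count` characters that precede the first occurrence of `key`.

    Returns None when `key` does not occur in `text` (an assumption of this sketch).
    """
    pos = text.find(key)
    if pos < 0:
        return None
    return text[max(0, pos - count):pos]


def key_match_value(first, second, keys, count=2):
    """0-1 matching function: 1 if every key string is preceded by the same
    `count` characters in both strings, otherwise 0."""
    for key in keys:
        a = preceding_chars(first, key, count)
        b = preceding_chars(second, key, count)
        if a is None or b is None or a != b:
            return 0
    return 1


# Illustrative strings chosen for this sketch (not quoted from the source).
keys = ["市", "号线", "期"]
first = "长沙市轨道交通1号线一期工程检测项目"
second = "长沙市轨道交通1号线一期工程防夹装置项目"
third = "株洲市轨道交通1号线一期工程检测项目"

print(key_match_value(first, second, keys))  # 1: all three keys have equal preceding characters
print(key_match_value(first, third, keys))   # 0: the characters before "市" differ
```

A graded matching function, such as the count-based alternative mentioned above, could replace the early `return 0` with a counter of matched key character strings.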
On the basis of the foregoing embodiment, the character string length calculating module is specifically configured to:
calculating a first character string length corresponding to the first character string and a second character string length corresponding to the second character string, and initializing the maximum prefix matching character string length;
judging whether a first character in the first character string is the same as a first character in the second character string, if so, adding 1 to the length of the maximum prefix matching character string, and continuously judging whether the next character is the same;
comparing the corresponding characters of the first character string and the second character string in sequence until the corresponding characters differ, or until the last character of the shorter of the first character string and the second character string has been compared;
and obtaining the maximum prefix matching character string length.
Specifically, the calculation module calculates the first character string length corresponding to the first character string and the second character string length corresponding to the second character string, and initializes the maximum prefix matching string length to 0. Starting from the first character of each character string, it judges whether the two characters are the same; if so, it adds 1 to the maximum prefix matching string length and goes on to compare the next pair of characters. This is repeated until the corresponding characters of the two character strings differ, or until the last character of the shorter of the two character strings has been compared. The maximum prefix matching string length is thereby obtained.
By obtaining the maximum prefix matching string length, the embodiment of the invention calculates the similarity of the character strings to be matched from the actual maximum prefix matching length of the strings, rather than using a fixed l and p to measure the importance of the prefix characters, thereby improving the accuracy of character string matching.
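The prefix comparison steps above can be sketched in a few lines of Python. The function name `max_prefix_match_length` is a hypothetical name chosen for this illustration; the loop stops at the first differing pair of characters or at the end of the shorter string, as the text describes.

```python
def max_prefix_match_length(first, second):
    """Count how many leading characters of the two strings are identical."""
    length = 0  # initialize the maximum prefix matching string length to 0
    # zip() stops at the end of the shorter string, which covers the
    # "last character of the shorter string" stopping condition.
    for a, b in zip(first, second):
        if a != b:
            break
        length += 1
    return length


print(max_prefix_match_length("abcdef", "abcxyz"))  # 3
print(max_prefix_match_length("abc", "abcdef"))     # 3
```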
On the basis of the foregoing embodiment, the edit distance calculating module is specifically configured to:
calculating the matching length and the transposition number of the first character string and the second character string, the first character string length corresponding to the first character string and the second character string length corresponding to the second character string, and calculating a second editing distance according to the matching length, the transposition number, the first character string length and the second character string length;
calculating a first ratio of the maximum prefix matching string length to the first string length and a second ratio of the maximum prefix matching string length to the second string length;
if the length of the maximum prefix matching character string is greater than or equal to a preset threshold, the length of the prefix matching character string is a preset threshold, otherwise, the length of the prefix matching character string is equal to the length of the maximum prefix matching character string;
if 1/2 times of the sum of the first ratio and the second ratio is smaller than the product of the prefix matching string length and a constant scaling factor, calculating the first editing distance according to the second editing distance and the product of the prefix matching string length and the constant scaling factor;
if 1/2 times of the sum of the first ratio and the second ratio is greater than or equal to the product of the prefix matching string length and the constant scaling factor (0.1), calculating the first edit distance according to the second edit distance and 1/2 times of the sum of the first ratio and the second ratio.
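The selection between the two weighting terms can be sketched as below. This is a hedged illustration: the function name `prefix_weight` is hypothetical, and the default threshold of 4 and scaling factor of 0.1 are taken from the worked example of this embodiment. The sketch returns only the selected weighting term; combining it with the second edit distance follows the formula of the embodiment and is not reproduced here.

```python
def prefix_weight(max_prefix_len, first_len, second_len, threshold=4, p=0.1):
    """Select the weighting term used when computing the first edit distance.

    max_prefix_len: maximum prefix matching string length
    first_len, second_len: lengths of the first and second character strings
    threshold: preset threshold that caps the prefix matching string length
    p: constant scaling factor
    """
    if first_len <= 0 or second_len <= 0:
        raise ValueError("string lengths must be positive")
    # Cap the prefix matching string length at the preset threshold.
    l = min(max_prefix_len, threshold)
    lp = l * p
    # 1/2 times the sum of the first ratio and the second ratio.
    ratio_avg = 0.5 * (max_prefix_len / first_len + max_prefix_len / second_len)
    # Use l*p when the averaged ratio is smaller, otherwise the averaged ratio.
    return lp if ratio_avg < lp else ratio_avg


# Lengths 42 and 25 with a 14-character common prefix, as in the worked example:
# the averaged ratio (about 0.45) exceeds l*p = 0.4, so the averaged ratio is selected.
print(prefix_weight(14, 42, 25))
```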
Specifically, assume that the first character string is "Changsha City Rail Transit Line 1 Phase I Project South Gate Station and Loess Ridge Station curved platform anti-pinch device competitive negotiation project postponement announcement" and the second character string is "Changsha City Rail Transit Line 1 Phase I Project inspection project postponement announcement". Then, according to the edit distance calculation module:
(1) the first string length of the first character string is 42 and the second string length of the second character string is 25 (counted in characters of the original Chinese text); according to the matching window formula

$$d_{max} = \left\lfloor \frac{\max(|S_i|, |S_j|)}{2} \right\rfloor - 1$$

$d_{max}$ is 20. Two characters, one from each character string, match if they are the same character and their positions in the two character strings differ by no more than $d_{max}$. The matching length of the first character string and the second character string is therefore 14 and the transposition number is 0, and the second editing distance calculated from the matching length, the transposition number, the first string length and the second string length is 0.63;
(2) according to the first character string and the second character string, the length of the maximum prefix matching character string is 14, the first ratio of the length of the maximum prefix matching character string to the length of the first character string is 0.33, and the second ratio of the length of the maximum prefix matching character string to the length of the second character string is 0.56;
(3) in the Jaro-Winkler algorithm in the prior art, when the length of the maximum prefix matching character string in the first character string and the second character string is greater than or equal to 4, the length of the prefix matching character string is 4, and as the length of the maximum prefix matching character string is 14 and greater than 4, the length of the prefix matching character string is 4;
(4) 1/2 times the sum of the first ratio and the second ratio is 0.445, and the product of the prefix matching string length and the constant scaling factor is 0.4; since 0.445 is greater than 0.4, the first editing distance is calculated according to the second editing distance and 1/2 times the sum of the first ratio and the second ratio.
According to the embodiment of the invention, the matching value of the key character string is calculated, the first editing distance of the first character string and the second character string is calculated by using the preset rule, and finally, the similarity is obtained according to the first editing distance and the matching value, so that the accuracy of character string matching is improved.
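The second editing distance in step (1) can be illustrated with the sketch below. It assumes that the second editing distance is the standard Jaro similarity, (m/|S_i| + m/|S_j| + (m - t)/m) / 3 with matching length m and transposition number t; this assumption is consistent with the values 20 and 0.63 in the example but is not stated explicitly in the text. The function name `jaro_similarity` and the synthetic test strings are hypothetical.

```python
def jaro_similarity(first, second):
    """Standard Jaro similarity, assumed here to be the second editing distance."""
    len1, len2 = len(first), len(second)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Matching window: floor(max(len1, len2) / 2) - 1, never negative.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(first):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Same character, not yet used, and within the window.
            if not matched2[j] and second[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transposition number: half the matched characters that are out of order.
    seq1 = [c for c, m in zip(first, matched1) if m]
    seq2 = [c for c, m in zip(second, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


# Synthetic strings with lengths 42 and 25 that share only a 14-character prefix,
# mirroring the counts in the example (matching length 14, transposition number 0).
prefix = "abcdefghijklmn"
print(round(jaro_similarity(prefix + "X" * 28, prefix + "Y" * 11), 2))  # 0.63
```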
On the basis of the foregoing embodiment, the calculating an edit distance of the first character string and the second character string by using a preset rule includes:
the first edit distance is:
Figure BDA0001195631830000131
where $W_{ij}$ is the first edit distance; $d_{ij}$ is the second edit distance; $l$ is the prefix matching string length; $p$ is the constant scaling factor, with $p = 0.1$; $L_{ij}$ is the maximum prefix matching string length; $|S_i|$ is the first string length; and $|S_j|$ is the second character string length.
Specifically, from the above example, the product of the prefix matching string length and the constant scaling factor is $lp = 4 \times 0.1 = 0.4$, and 1/2 times the sum of the first ratio and the second ratio is

$$\frac{1}{2}\left(\frac{L_{ij}}{|S_i|} + \frac{L_{ij}}{|S_j|}\right) = \frac{1}{2}(0.33 + 0.56) = 0.445$$

Therefore,

$$\frac{1}{2}\left(\frac{L_{ij}}{|S_i|} + \frac{L_{ij}}{|S_j|}\right) \geq lp$$
Thus, the formula
Figure BDA0001195631830000134
is used to calculate the first edit distance $W_{ij}$, which gives $W_{ij} = 0.93$; this can be compared with the value of $W_{ij}$ obtained by the prior-art calculation. If the edit distance is greater than or equal to 0.8, the first character string and the second character string belong to the same project; if the edit distance is less than 0.8, they do not belong to the same project.
The embodiment of the invention can accurately judge whether the character strings to be matched are matched or not through the specific first edit distance calculation formula.
Fig. 3 is a schematic diagram of the physical structure of a character string matching apparatus according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a processor 301, a memory 302, and a bus 303; wherein,
the processor 301 and the memory 302 complete mutual communication through the bus 303;
the processor 301 is configured to call program instructions in the memory 302 to perform the methods provided by the above-mentioned method embodiments, including: acquiring a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and calculating a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string; calculating the maximum prefix matching character string length of the first character string and the second character string; calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string; and obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and calculating a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string; calculating the maximum prefix matching character string length of the first character string and the second character string; calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string; and obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and calculating a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string; calculating the maximum prefix matching character string length of the first character string and the second character string; calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string; and obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the apparatuses and the like are merely illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for string matching, comprising:
acquiring a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and calculating a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string;
calculating the maximum prefix matching character string length of the first character string and the second character string;
calculating a first editing distance between the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string;
obtaining the similarity of the first character string and the second character string according to the first editing distance and the matching value;
the calculating the edit distance of the first character string and the second character string by using a preset rule includes:
calculating the matching length and the transposition number of the first character string and the second character string, the first character string length corresponding to the first character string and the second character string length corresponding to the second character string, and calculating a second editing distance according to the matching length, the transposition number, the first character string length and the second character string length;
calculating a first ratio of the maximum prefix matching string length to the first string length and a second ratio of the maximum prefix matching string length to the second string length;
if the length of the maximum prefix matching character string is greater than or equal to a preset threshold, the length of the prefix matching character string is a preset threshold, otherwise, the length of the prefix matching character string is equal to the length of the maximum prefix matching character string;
if 1/2 times of the sum of the first ratio and the second ratio is smaller than the product of the prefix matching string length and a constant scaling factor, calculating the first editing distance according to the second editing distance and the product of the prefix matching string length and the constant scaling factor;
if 1/2 times the sum of the first ratio and the second ratio is greater than or equal to the product of the prefix matching string length and the constant scaling factor, calculating the first edit distance according to the second edit distance and 1/2 times the sum of the first ratio and the second ratio.
2. The method of claim 1, wherein the calculating the matching value for the key string comprises:
acquiring a preset number of first characters before the key character string in the first character string and a preset number of second characters before the key character string in the second character string;
if the first character is judged to be the same as the second character, the first character string is matched with the second character string;
and obtaining the matching value according to the matching condition of the first character string and the second character string.
3. The method of claim 1, wherein calculating the maximum prefix match string length for the first string and the second string comprises:
calculating a first character string length corresponding to the first character string and a second character string length corresponding to the second character string, and initializing the maximum prefix matching character string length;
judging whether a first character in the first character string is the same as a first character in the second character string, if so, adding 1 to the length of the maximum prefix matching character string, and continuously judging whether the next character is the same;
sequentially judging whether the characters corresponding to the first character string and the second character string are the same or not until the characters corresponding to the first character string and the second character string are different or judging the last character in the character string to be matched with the character string with the smaller length of the first character string and the second character string;
and obtaining the maximum prefix matching character string length.
4. The method according to claim 1, wherein the calculating the edit distance of the first character string and the second character string by using a preset rule comprises:
the first edit distance is:
Figure FDA0002568531060000031
wherein $W_{ij}$ is the first edit distance; $d_{ij}$ is the second edit distance; $l$ is the prefix matching string length; $p$ is the constant scaling factor, and $p$ is 0.1; $L_{ij}$ is the maximum prefix matching string length; $|S_i|$ is the first string length; and $|S_j|$ is the second character string length.
5. A character string matching apparatus, comprising:
an acquisition module, configured to acquire a character string to be matched and at least 1 key character string corresponding to the character string to be matched, and to calculate a matching value of the key character string, wherein the character string to be matched comprises a first character string and a second character string;
a character string length calculating module, configured to calculate a maximum prefix matching character string length of the first character string and the second character string;
an edit distance calculation module, configured to calculate a first editing distance of the first character string and the second character string by using a preset rule according to the length of the maximum prefix matching character string;
a similarity calculation module, configured to obtain the similarity of the first character string and the second character string according to the first editing distance and the matching value;
the edit distance calculation module is specifically configured to:
calculating the matching length and the transposition number of the first character string and the second character string, the first character string length corresponding to the first character string and the second character string length corresponding to the second character string, and calculating a second editing distance according to the matching length, the transposition number, the first character string length and the second character string length;
calculating a first ratio of the maximum prefix matching string length to the first string length and a second ratio of the maximum prefix matching string length to the second string length;
if the length of the maximum prefix matching character string is greater than or equal to a preset threshold, the length of the prefix matching character string is a preset threshold, otherwise, the length of the prefix matching character string is equal to the length of the maximum prefix matching character string;
if 1/2 times of the sum of the first ratio and the second ratio is smaller than the product of the prefix matching string length and a constant scaling factor, calculating the first editing distance according to the second editing distance and the product of the prefix matching string length and the constant scaling factor;
if 1/2 times of the sum of the first ratio and the second ratio is greater than or equal to the product of the prefix matching string length and 0.1, calculating the first edit distance according to the second edit distance and 1/2 times of the sum of the first ratio and the second ratio.
6. The apparatus of claim 5, wherein the obtaining module is specifically configured to:
acquiring a preset number of first characters before the key character string in the first character string and a preset number of second characters before the key character string in the second character string;
if the first character is judged to be the same as the second character, the first character string is matched with the second character string;
and obtaining the matching value according to the matching condition of the first character string and the second character string.
7. The apparatus of claim 5, wherein the string length calculation module is specifically configured to:
calculating a first character string length corresponding to the first character string and a second character string length corresponding to the second character string, and initializing the maximum prefix matching character string length;
judging whether a first character in the first character string is the same as a first character in the second character string, if so, adding 1 to the length of the maximum prefix matching character string, and continuously judging whether the next character is the same;
sequentially judging whether the characters corresponding to the first character string and the second character string are the same or not until the characters corresponding to the first character string and the second character string are different or judging the last character in the character string to be matched with the character string with the smaller length of the first character string and the second character string;
and obtaining the maximum prefix matching character string length.
8. The apparatus of claim 5, wherein the edit distance calculation module is specifically configured to:
the first edit distance is:
Figure FDA0002568531060000051
wherein $W_{ij}$ is the first edit distance; $d_{ij}$ is the second edit distance; $l$ is the prefix matching string length; $p$ is the constant scaling factor, and $p$ is 0.1; $L_{ij}$ is the maximum prefix matching string length; $|S_i|$ is the first string length; and $|S_j|$ is the second character string length.
CN201611237454.6A 2016-12-28 2016-12-28 Character string matching method and device Expired - Fee Related CN108255836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611237454.6A CN108255836B (en) 2016-12-28 2016-12-28 Character string matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611237454.6A CN108255836B (en) 2016-12-28 2016-12-28 Character string matching method and device

Publications (2)

Publication Number Publication Date
CN108255836A CN108255836A (en) 2018-07-06
CN108255836B true CN108255836B (en) 2020-12-25

Family

ID=62720353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611237454.6A Expired - Fee Related CN108255836B (en) 2016-12-28 2016-12-28 Character string matching method and device

Country Status (1)

Country Link
CN (1) CN108255836B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284422B (en) * 2018-08-31 2019-12-27 成都信息工程大学 Construction method of universal character string similarity measurement framework
CN111191087B (en) * 2019-12-31 2023-11-07 歌尔股份有限公司 Character matching method, terminal device and computer readable storage medium
CN112668131B (en) * 2021-01-04 2023-11-17 北京全路通信信号研究设计院集团有限公司 Wiring table generation method, device, equipment and computer readable storage medium
CN116304056B (en) * 2023-04-11 2024-01-30 山西玖邦科技有限公司 Management method for computer software development data
CN117312624B (en) * 2023-11-30 2024-02-20 北京睿企信息科技有限公司 Data processing system for acquiring target data list

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2665594A1 (en) * 2008-05-12 2009-11-12 Telecommunications Research Laboratory An apparatus for secure computation of string comparators
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN103365998A (en) * 2013-07-12 2013-10-23 华东师范大学 Retrieval method of similar character strings
CN106033416A (en) * 2015-03-09 2016-10-19 阿里巴巴集团控股有限公司 A string processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Comparison and Analysis of String Matching Algorithms; Yan Dazhi; Computer CD Software and Applications; 20130115 (No. 2); pp. 138-140 *

Also Published As

Publication number Publication date
CN108255836A (en) 2018-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201225

Termination date: 20211228