CN106933834B - Data matching method and device - Google Patents

Publication number
CN106933834B
Authority
CN
China
Prior art keywords
character string
matched
calculation
matching
maximum common
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511018347.XA
Other languages
Chinese (zh)
Other versions
CN106933834A (en
Inventor
皇甫庆彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Youquan Information Technology Co ltd
Original Assignee
Youxinpai Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youxinpai Beijing Information Technology Co ltd filed Critical Youxinpai Beijing Information Technology Co ltd
Priority to CN201511018347.XA priority Critical patent/CN106933834B/en
Publication of CN106933834A publication Critical patent/CN106933834A/en
Application granted granted Critical
Publication of CN106933834B publication Critical patent/CN106933834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/02 Indexing scheme relating to groups G06F7/02 - G06F7/026
    • G06F2207/025 String search, i.e. pattern matching, e.g. find identical word or best match in a string

Abstract

The application discloses a data matching method and device. In the method, a character string to be matched and a set of character strings to be matched against are first obtained. Matching parameters between each character string in the set and the character string to be matched are then calculated, the matching parameters comprising similarity and/or matching degree. Finally, a target character string in the set that matches the character string to be matched is determined according to the matching parameters. With this scheme, matched character strings need not be completely equal; whether two character strings match is decided from the matching parameters, which improves the matching rate.

Description

Data matching method and device
Technical Field
The present disclosure relates to the field of data matching technologies, and in particular, to a data matching method and apparatus.
Background
With the development of information technology, the volume of data of all kinds keeps growing. Data matching is usually required in order to clarify the relationships between different data. Here, data matching refers to associating data items with one another according to some internal relationship.
In the prior art, data matching usually uses a congruent (exact-equality) matching method: the characters of the two character strings are compared one by one, and the match is considered successful only if the two character strings are completely equal.
However, in the course of research for the present application, the inventor found that the congruent matching method yields a low matching rate: because a match is confirmed only when two character strings are completely equal, a large amount of data cannot be matched.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a data matching method and apparatus.
In order to solve the technical problem, the embodiment of the invention discloses the following technical scheme:
according to a first aspect of the embodiments of the present disclosure, there is provided a data matching method, including:
acquiring a character string to be matched and a character string set needing to be matched;
respectively calculating matching parameters between each character string in the character string set and the character string to be matched, wherein the matching parameters comprise: similarity and/or matching degree;
and determining a target character string matched with the character string to be matched in the character string set according to the matching parameters.
Preferably, if the matching parameter is similarity, calculating the matching parameter between each character string in the character string set and the character string to be matched includes:
21) selecting any character string from the character string set as a calculation character string, and acquiring the length str1_1 of the character string to be matched;
22) acquiring the length str2_1 of the calculation character string;
23) acquiring the maximum common substring of the character string to be matched and the calculation character string, calculating the length of the maximum common substring, and acquiring the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched, respectively;
24) removing the maximum common substring from the calculation character string to obtain a new calculation character string, removing the maximum common substring from the character string to be matched to obtain a new character string to be matched, and returning to step 23); once the character string to be matched and the calculation character string no longer share a common substring, executing step 25);
25) calculating the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
26) selecting another character string from the character string set as the calculation character string, and returning to step 22), until the similarity between every character string in the character string set and the character string to be matched has been obtained.
Preferably, the similarity between the calculation character string and the character string to be matched is calculated by adopting the following formula:
Figure BDA0000894426010000021
Here, str1_1 represents the length of the character string to be matched before any maximum common substring is removed; str2_1 represents the length of the calculation character string before any maximum common substring is removed; m_i represents the length of the i-th maximum common substring acquired when calculating the matching parameters of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; the number of maximum common substrings acquired during the calculation is denoted a, and n represents any value not less than a; and L represents the similarity between the character string to be matched and the calculation character string.
Preferably, if the matching parameter is matching degree, calculating the matching parameter between each character string in the character string set and the character string to be matched includes:
41) selecting any character string from the character string set as a calculation character string;
42) acquiring the maximum common substring of the character string to be matched and the calculation character string, calculating the length of the maximum common substring, and acquiring the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched, respectively;
43) removing the maximum common substring from the calculation character string to obtain a new calculation character string, removing the maximum common substring from the character string to be matched to obtain a new character string to be matched, and returning to step 42); once the character string to be matched and the calculation character string no longer share a common substring, executing step 44);
44) calculating the matching degree between the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
45) selecting another character string from the character string set as the calculation character string, and returning to step 42), until the matching degree between every character string in the character string set and the character string to be matched has been obtained.
Preferably, the matching degree between the calculation character string and the character string to be matched is calculated by adopting the following formula:
Figure BDA0000894426010000031
Here, m_i represents the length of the i-th maximum common substring acquired when calculating the matching parameters of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; the number of maximum common substrings acquired during the calculation is denoted a, and n represents any value not less than a; and E represents the matching degree between the character string to be matched and the calculation character string.
According to a second aspect of the embodiments of the present disclosure, there is provided a data matching apparatus including:
the acquisition module is used for acquiring a character string to be matched and a character string set to be matched;
a calculating module, configured to calculate matching parameters between each character string in the character string set and the character string to be matched, where the matching parameters include: similarity and/or matching degree;
and the determining module is used for determining a target character string matched with the character string to be matched in the character string set according to the matching parameters.
Preferably, if the matching parameter is similarity, the calculating module includes:
a first length obtaining unit, configured to select any character string from the character string set as a calculation character string, and obtain the length str1_1 of the character string to be matched;
a second length obtaining unit, configured to obtain the length str2_1 of the calculation character string;
a first maximum common substring obtaining unit, configured to obtain the maximum common substring of the character string to be matched and the calculation character string, calculate the length of the maximum common substring, and obtain the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched, respectively;
a first removal unit, configured to remove the maximum common substring from the calculation character string to obtain a new calculation character string, remove the maximum common substring from the character string to be matched to obtain a new character string to be matched, and trigger the first maximum common substring obtaining unit again, until the character string to be matched and the calculation character string no longer share a common substring, at which point the similarity calculation unit is triggered;
a similarity calculation unit, configured to calculate the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
and a first selection unit, configured to select another character string from the character string set as the calculation character string, and trigger the second length obtaining unit, until the similarity between every character string in the character string set and the character string to be matched has been obtained.
Preferably, the similarity calculation unit calculates the similarity between the calculation character string and the character string to be matched by using the following formula:
Figure BDA0000894426010000041
Here, str1_1 represents the length of the character string to be matched before any maximum common substring is removed; str2_1 represents the length of the calculation character string before any maximum common substring is removed; m_i represents the length of the i-th maximum common substring acquired when calculating the matching parameters of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; the number of maximum common substrings acquired during the calculation is denoted a, and n represents any value not less than a; and L represents the similarity between the character string to be matched and the calculation character string.
Preferably, if the matching parameter is a matching degree, the calculating module includes:
a calculation character string obtaining unit, configured to select any one character string from the character string set as a calculation character string;
the second maximum common substring obtaining unit is used for obtaining the maximum common substring of the character string to be matched and the calculation character string, calculating the length of the maximum common substring, and respectively obtaining the number of the maximum common substring in the calculation character string and the character string to be matched;
a second removal unit, configured to remove the maximum common substring from the calculation character string to obtain a new calculation character string, remove the maximum common substring from the character string to be matched to obtain a new character string to be matched, and trigger the second maximum common substring obtaining unit again, until the character string to be matched and the calculation character string no longer share a common substring, at which point the matching degree calculation unit is triggered;
a matching degree calculation unit, configured to calculate the matching degree between the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
and a second selection unit, configured to select another character string from the character string set as the calculation character string, and trigger the second maximum common substring obtaining unit, until the matching degree between every character string in the character string set and the character string to be matched has been obtained.
Preferably, the matching degree calculating unit calculates the matching degree between the calculation character string and the character string to be matched by using the following formula:
Figure BDA0000894426010000051
Here, m_i represents the length of the i-th maximum common substring acquired when calculating the matching parameters of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; the number of maximum common substrings acquired during the calculation is denoted a, and n represents any value not less than a; and E represents the matching degree between the character string to be matched and the calculation character string.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
When data matching is performed with the method and device, after the character string to be matched and the character string set are obtained, the target character string that matches the character string to be matched is determined according to the matching parameters between each character string in the set and the character string to be matched. Matched character strings therefore need not be completely equal; whether two character strings match is decided from the matching parameters, which improves the matching rate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic workflow diagram illustrating a data matching method according to an exemplary embodiment;
FIG. 2 is a schematic workflow diagram illustrating the calculation of similarity in a data matching method according to an exemplary embodiment;
FIG. 3 is a schematic workflow diagram illustrating the calculation of matching degree in a data matching method according to an exemplary embodiment;
FIG. 4 is a schematic structural diagram illustrating a data matching apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the problems that the matching rate is not high and a large amount of data cannot be matched when data matching is carried out through the prior art, the application discloses a data matching method and a data matching device.
The embodiment of the application discloses a data matching method. Referring to the workflow diagram shown in fig. 1, the data matching method includes the following steps:
Step S11, acquiring the character string to be matched and the character string set to be matched against.
When data matching is performed, it is often necessary to compare a plurality of character strings with a given character string to determine whether they match it. In this case, the given character string is referred to as the character string to be matched, and the plurality of character strings form the character string set.
Step S12, respectively calculating matching parameters between each character string in the character string set and the character string to be matched, where the matching parameters include: similarity and/or matching degree.
The similarity characterizes how similar each character string in the character string set is to the character string to be matched, and the matching degree characterizes the matching information between each character string in the character string set and the character string to be matched.
Step S13, determining, according to the matching parameters, a target character string in the character string set that matches the character string to be matched.
The first embodiment of the application discloses a data matching method, which is characterized in that after a character string to be matched and a character string set to be matched are obtained, a target character string matched with the character string to be matched is determined according to matching parameters of each character string in the character string set and the character string to be matched. The method is adopted to carry out data matching, the matched character strings are not required to be completely equal, and whether the matching is carried out or not can be determined according to the matching parameters, so that the matching rate is improved.
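As a rough illustration of steps S11 to S13, the following Python sketch scores every candidate against the character string to be matched and keeps the best one. The names `stand_in_score` and `find_target`, the threshold of 0.5, and the use of `difflib.SequenceMatcher.ratio` are all assumptions made for this example. The ratio is only a stand-in score and is not the similarity or matching degree formula of this application, and the selection rule (highest score at or above a threshold) is just one plausible reading of step S13.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional


def stand_in_score(to_match: str, candidate: str) -> float:
    """Placeholder score in [0, 1]; NOT the formula from this application."""
    return SequenceMatcher(None, to_match, candidate).ratio()


def find_target(
    to_match: str,
    candidates: Iterable[str],
    score: Callable[[str, str], float] = stand_in_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> Optional[str]:
    """S11: take the inputs; S12: score each candidate; S13: pick a target."""
    best, best_score = None, 0.0
    for candidate in candidates:
        s = score(to_match, candidate)
        # Keep the highest-scoring candidate that reaches the threshold.
        if s >= threshold and (best is None or s > best_score):
            best, best_score = candidate, s
    return best


target = find_target("BMW 320i 2015", ["BMW 320 i 2015", "Audi A4 2015", "BMW"])
print(target)
```

Unlike congruent matching, the first candidate is selected here even though it differs from the character string to be matched by one space.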
In this application, the matching parameters include similarity and/or matching degree. If the matching parameter is similarity, referring to the workflow diagram shown in FIG. 2, calculating the matching parameter between each character string in the character string set and the character string to be matched includes the following steps:
Step S21, selecting any character string from the character string set as a calculation character string, and obtaining the length str1_1 of the character string to be matched.
Step S22, obtaining the length str2_1 of the calculation character string.
Step S23, judging whether the character string to be matched and the calculation character string share a common substring; if so, executing step S24, and if not, executing step S26.
Step S24, if the character string to be matched and the calculation character string share a common substring, obtaining the maximum common substring of the two, calculating its length, and obtaining the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched, respectively. The maximum common substring is the longest identical substring of the two given character strings (i.e., the character string to be matched and the calculation character string).
Step S25, removing the maximum common substring from the calculation character string to obtain a new calculation character string, removing the maximum common substring from the character string to be matched to obtain a new character string to be matched, and then returning to step S23.
Step S26, calculating the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched, and then executing step S27.
Step S27, judging whether the character string set still contains a character string whose similarity has not been calculated; if so, executing step S28, and if not, executing step S29.
Step S28, selecting another character string from the character string set as the calculation character string, and returning to step S22. This continues until the similarity between every character string in the character string set and the character string to be matched has been obtained.
Step S29, ending the similarity calculation.
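Steps S21 to S25 for a single calculation character string can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, every occurrence of the maximum common substring is removed in step S25, occurrences are counted without overlap, and ties between equally long common substrings are broken by taking the first one found. The sketch stops at collecting the inputs of step S26, because the similarity formula itself is given only as a figure.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest substring shared by a and b ('' if there is none)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def similarity_inputs(to_match: str, calc: str) -> dict:
    """Collect str1_1, str2_1 and one (m_i, k_i1, k_i2) record per round."""
    str1_1, str2_1 = len(to_match), len(calc)               # S21, S22
    records = []
    while True:
        common = longest_common_substring(to_match, calc)   # S23, S24
        if not common:
            break                                           # go on to S26
        records.append(
            (len(common), to_match.count(common), calc.count(common))
        )
        to_match = to_match.replace(common, "")             # S25 (assumed:
        calc = calc.replace(common, "")                     # remove all)
    return {"str1_1": str1_1, "str2_1": str2_1, "records": records}


print(similarity_inputs("abcabcadd", "abcmnf"))
# {'str1_1': 9, 'str2_1': 6, 'records': [(3, 2, 1)]}
```

Each round removes at least one character from both strings, so the loop always terminates.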
In step S26, the similarity between the calculation character string and the character string to be matched is calculated according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched, using the following formula:
Figure BDA0000894426010000071
Here, L represents the similarity between the character string to be matched and the calculation character string; str1_1 represents the length of the character string to be matched before any maximum common substring is removed; str2_1 represents the length of the calculation character string before any maximum common substring is removed.
m_i represents the length of the i-th maximum common substring acquired when calculating the matching parameters of the character string to be matched and the calculation character string. That is, m_1 is the length of the maximum common substring of the character string to be matched and the calculation character string before any removal operation is performed, m_2 is the length of the maximum common substring of the new character string to be matched and the new calculation character string obtained after the first removal, and so on.
k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string. For example, if, before any removal operation is performed, the character string to be matched is "abcabcadd" and the calculation character string is "abcmnf", then the first maximum common substring is "abc". The character string to be matched contains two occurrences of "abc", so k_11 is 2; the calculation character string contains one occurrence of "abc", so k_12 is 1.
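The counts in this example can be checked with a few lines of Python. The variable names are chosen for illustration, and counting with `str.count` assumes that occurrences are counted without overlap, which the text does not state explicitly.

```python
to_match = "abcabcadd"   # character string to be matched
calc = "abcmnf"          # calculation character string
common = "abc"           # first maximum common substring

k_11 = to_match.count(common)  # occurrences in the string to be matched
k_12 = calc.count(common)      # occurrences in the calculation string
print(len(common), k_11, k_12)  # 3 2 1

# str.count counts non-overlapping occurrences only:
overlap_count = "aaa".count("aa")
print(overlap_count)  # 1
```

If overlapping occurrences were meant to be counted, a different counting routine would be needed.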
The number of maximum common substrings acquired when calculating the matching parameters of the character string to be matched and the calculation character string is denoted a, and n represents any value not less than a. For example, suppose the character string to be matched and the calculation character string share a maximum common substring before any removal operation is performed, a maximum common substring of the new character string to be matched and the new calculation character string can still be obtained after the first removal operation, but after the second removal operation the two new character strings no longer share any common substring. Then two maximum common substrings have been acquired, i.e., a is 2, and n is any value not less than 2. In this case, for any i greater than 2, m_i is 0, and k_i1 and k_i2 are 0.
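The relationship between a and n can be illustrated by padding the list of acquired records with zeros. In this sketch the helper `pad_records` and the sample records are hypothetical; each record is a tuple (m_i, k_i1, k_i2).

```python
def pad_records(records, n):
    """Extend a list of (m_i, k_i1, k_i2) records to length n with zeros."""
    a = len(records)  # number of maximum common substrings actually acquired
    if n < a:
        raise ValueError("n must not be less than a")
    # For every i greater than a, m_i, k_i1 and k_i2 are all 0.
    return list(records) + [(0, 0, 0)] * (n - a)


records = [(4, 1, 1), (3, 1, 1)]   # a = 2
padded = pad_records(records, 4)   # n = 4
print(padded)
# [(4, 1, 1), (3, 1, 1), (0, 0, 0), (0, 0, 0)]
```

Because the extra entries are all zero, they add no further common-substring information, which is presumably why n may be any value not less than a.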
If the matching parameter is the matching degree, referring to the workflow diagram shown in fig. 3, the step of calculating the matching parameter between each character string in the character string set and the character string to be matched respectively includes the following steps:
Step S31, selecting any character string from the character string set as a calculation character string.
Step S32, judging whether the character string to be matched and the calculation character string share a common substring; if so, executing step S33, and if not, executing step S35.
Step S33, if the character string to be matched and the calculation character string share a common substring, obtaining the maximum common substring of the two, calculating its length, and obtaining the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched, respectively. The maximum common substring is the longest identical substring of the two given character strings (i.e., the character string to be matched and the calculation character string).
Step S34, removing the maximum common substring from the calculation character string to obtain a new calculation character string, removing the maximum common substring from the character string to be matched to obtain a new character string to be matched, and then returning to step S32.
Step S35, calculating the matching degree between the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched, and then executing step S36.
Step S36, judging whether the character string set still contains a character string whose matching degree has not been calculated; if so, executing step S37, and if not, executing step S38.
Step S37, selecting another character string from the character string set as the calculation character string, and returning to step S32. This continues until the matching degree between every character string in the character string set and the character string to be matched has been obtained.
Step S38, ending the matching degree calculation.
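The outer loop of steps S31 to S38 over the whole character string set might look like the sketch below. It uses `difflib.SequenceMatcher.find_longest_match` from the Python standard library to find the maximum common substring, which is an implementation choice of this example and not something the application prescribes. The function names are hypothetical, all occurrences are removed in step S34 by assumption, and step S35 is left as a comment because the matching degree formula is given only as a figure.

```python
from difflib import SequenceMatcher


def longest_common_block(a: str, b: str) -> str:
    """Longest common substring of a and b via difflib ('' if none)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def matching_degree_inputs(to_match: str, candidates) -> dict:
    """Map each candidate to its list of (m_i, k_i1, k_i2) records."""
    results = {}
    for calc in candidates:                        # S31, S36, S37
        remaining, current = to_match, calc
        records = []
        while True:
            common = longest_common_block(remaining, current)  # S32, S33
            if not common:
                break
            records.append(
                (len(common), remaining.count(common), current.count(common))
            )
            remaining = remaining.replace(common, "")          # S34
            current = current.replace(common, "")
        # S35 would apply the matching degree formula to `records` here.
        results[calc] = records
    return results                                 # S38


print(matching_degree_inputs("abcabcadd", ["abcmnf", "xyz"]))
# {'abcmnf': [(3, 2, 1)], 'xyz': []}
```

Results are keyed by candidate string, so duplicate candidates in the set collapse into one entry in this sketch.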
In step S35, the matching degree between the calculation character string and the character string to be matched is calculated according to the length of each maximum common substring and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched, using the following formula:
Figure BDA0000894426010000091
where E represents the matching degree between the character string to be matched and the calculation character string, and m_i represents the length of the i-th maximum common substring obtained while calculating the matching parameter of the two character strings. That is, m_1 is the length of the first maximum common substring obtained, i.e., the maximum common substring of the character string to be matched and the calculation character string before any removal operation is performed; m_2 is the length of the maximum common substring of the new matching character string and the new calculation character string obtained after the first removal; and so on.
k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched, and k_i2 represents its number of occurrences in the calculation character string. For example, if, before any removal operation is performed, the character string to be matched is "abcabcadd" and the calculation character string is "abcmnf", then the first maximum common substring is "abc"; the character string to be matched contains two occurrences of "abc", so k_11 is 2, and the calculation character string contains one occurrence of "abc", so k_12 is 1.
Let a be the number of maximum common substrings obtained while calculating the matching parameter of the character string to be matched and the calculation character string; n then represents any value not less than a. For example, suppose the two character strings share a maximum common substring before any removal operation is performed, a maximum common substring can still be obtained between the new matching character string and the new calculation character string after the first removal, but none remains after the second removal. Then two maximum common substrings have been obtained, i.e., a is 2, and n is any value not less than 2. In this case, for any i greater than 2, m_i is 0, and k_i1 and k_i2 are 0.
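The counts in the example and the zero terms for i greater than a can be checked with a short sketch. The helper name `pad_terms` and the second triple `(2, 1, 1)` are hypothetical and serve only to illustrate a case where a is 2; each triple holds a length m_i and the two counts k_i1 and k_i2.

```python
def pad_terms(terms, n):
    """Extend a list of (m_i, k_i1, k_i2) triples to length n with zero terms."""
    if n < len(terms):
        raise ValueError("n must not be less than a, the number of terms obtained")
    return list(terms) + [(0, 0, 0)] * (n - len(terms))


# Counts from the example above, taken before any removal operation.
k_11 = "abcabcadd".count("abc")  # occurrences in the character string to be matched
k_12 = "abcmnf".count("abc")     # occurrences in the calculation character string

# Hypothetical case with a = 2 obtained substrings, padded to n = 4.
padded = pad_terms([(3, k_11, k_12), (2, 1, 1)], 4)
```

Because the padded terms are all zero, any n not less than a describes the same set of obtained substrings.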
In addition, if the character string to be matched and the calculation character string share no common substring before any removal operation is performed, the similarity and the matching degree between them are generally taken to be 0.
Step S13 determines, according to the matching parameter, the target character string in the character string set that matches the character string to be matched. Specifically, a character string in the set whose similarity, matching degree, or combination of the two falls within a preset range may be determined as a target character string. Alternatively, after the matching parameters are obtained, the character strings in the set may be sorted in descending order of similarity, matching degree, or a combination of the two to obtain a new character string sequence, and the first N character strings in that sequence are determined as target character strings, where N is a preset positive integer.
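Both selection strategies of step S13 can be sketched in a few lines. The function names `select_in_range` and `select_top_n` are hypothetical, and the scores below are made-up illustrative values rather than results of the formulas in this application.

```python
def select_in_range(scores: dict[str, float], low: float, high: float) -> list[str]:
    """Keep every string whose matching parameter lies within the preset range."""
    return [s for s, value in scores.items() if low <= value <= high]


def select_top_n(scores: dict[str, float], n: int) -> list[str]:
    """Sort strings by matching parameter, largest first, and keep the first n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# Illustrative matching parameters for three candidate strings.
scores = {"abcmnf": 0.4, "abcabc": 0.9, "xyz": 0.0}

in_range = select_in_range(scores, 0.3, 1.0)
top_one = select_top_n(scores, 1)
```

The range form returns however many strings qualify, while the top-N form always returns at most N strings, so the choice depends on whether a fixed number of results is wanted.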
Calculating the matching parameter of the character string to be matched and the calculation character string requires obtaining their maximum common substring. To obtain it, the two character strings may be joined into one character string with a separator symbol, for example in the form "character string 1/character string 2"; the parts before and after the symbol are then matched against each other from left to right, every matched string (i.e., every common substring) is added to an array, and the strings in the array are sorted by length. The longest one is the maximum common substring of the character string to be matched and the calculation character string.
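The collect-then-sort procedure can be sketched as below. This is one possible reading, with hypothetical function names; it compares the two strings directly instead of first joining them with a separator, which is assumed to be equivalent for the purpose of finding common substrings.

```python
def common_substrings(s: str, t: str) -> list[str]:
    """Collect every distinct substring of s that also occurs in t, scanning left to right."""
    found = set()
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            piece = s[start:end]
            if piece in t:
                found.add(piece)
            else:
                break  # a longer piece from this start cannot occur in t either
    return list(found)


def max_common_substring(s: str, t: str) -> str:
    """Sort the collected substrings by length and return the longest ("" if none)."""
    candidates = common_substrings(s, t)
    if not candidates:
        return ""
    # Longest first; ties are broken by earliest position in s to stay deterministic.
    candidates.sort(key=lambda piece: (-len(piece), s.find(piece)))
    return candidates[0]


longest = max_common_substring("abcabcadd", "abcmnf")
```

This brute-force form mirrors the description closely; a dynamic-programming search would find the same substring with less work on long inputs.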
Accordingly, a second embodiment of the present application discloses a data matching apparatus. Referring to the schematic structural diagram shown in Fig. 4, the data matching apparatus includes: an acquisition module 100, a calculation module 200 and a determination module 300.
The acquisition module 100 is configured to acquire a character string to be matched and a character string set to be matched against it. In data matching, it is often necessary to match a plurality of character strings against a certain character string to determine whether they match it; in this case, that certain character string is called the character string to be matched, and the plurality of character strings form the character string set;
the calculation module 200 is configured to calculate the matching parameters between each character string in the character string set and the character string to be matched, where the matching parameters include the similarity and/or the matching degree (also called the matching amount); the similarity represents how similar each character string in the character string set is to the character string to be matched, and the matching degree represents the matching information between each character string in the character string set and the character string to be matched;
the determination module 300 is configured to determine, according to the matching parameters, the target character string in the character string set that matches the character string to be matched. The target character string is the character string that matches the character string to be matched.
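The division into three modules can be pictured with the following sketch. Everything in it is an assumption made for illustration: the class `DataMatcher`, its method names, and in particular `placeholder_score`, which uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in and is not the similarity or matching degree formula of this application.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


def placeholder_score(to_match: str, candidate: str) -> float:
    """Stand-in matching parameter; NOT the formula disclosed in this application."""
    return SequenceMatcher(None, to_match, candidate).ratio()


@dataclass
class DataMatcher:
    score: Callable[[str, str], float] = placeholder_score
    top_n: int = 1

    def acquire(self, to_match, candidates):
        # Acquisition module: take the string to be matched and the string set.
        return to_match, list(candidates)

    def calculate(self, to_match, candidates):
        # Calculation module: one matching parameter per string in the set.
        return {c: self.score(to_match, c) for c in candidates}

    def determine(self, scores):
        # Determination module: keep the top_n strings by matching parameter.
        return sorted(scores, key=scores.get, reverse=True)[: self.top_n]

    def run(self, to_match, candidates):
        to_match, candidates = self.acquire(to_match, candidates)
        return self.determine(self.calculate(to_match, candidates))


targets = DataMatcher().run("abc", ["xyz", "abc", "abd"])
```

Passing the scoring function as a parameter keeps the module structure independent of which matching parameter is used.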
In this application, the matching parameters include the similarity and/or the matching degree. If the matching parameter is the similarity, the calculation module 200 includes:
a first length obtaining unit, configured to select any character string from the character string set as the calculation character string, and obtain the length str1_1 of the character string to be matched;
a second length obtaining unit, configured to obtain the length str2_1 of the calculation character string;
a first maximum common substring obtaining unit, configured to obtain the maximum common substring of the character string to be matched and the calculation character string, calculate its length, and count its occurrences in the calculation character string and in the character string to be matched, respectively, where the maximum common substring is the longest identical substring of two given character strings;
a first removal unit, configured to remove the maximum common substring from the calculation character string to obtain a new calculation character string, remove the maximum common substring from the character string to be matched to obtain a new matching character string, and trigger the first maximum common substring obtaining unit again, until the character string to be matched and the calculation character string no longer share a maximum common substring, at which point it triggers the similarity calculation unit;
a similarity calculation unit, configured to calculate the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
and a first selection unit, configured to select another character string from the character string set as the calculation character string and trigger the second length obtaining unit, until the similarity between every character string in the character string set and the character string to be matched has been obtained.
The similarity calculation unit calculates the similarity between the calculation character string and the character string to be matched by adopting the following formula:
Figure BDA0000894426010000121
where L represents the similarity between the character string to be matched and the calculation character string; str1_1 represents the length of the character string to be matched before any maximum common substring is removed; and str2_1 represents the length of the calculation character string before any maximum common substring is removed.
m_i represents the length of the i-th maximum common substring obtained while calculating the matching parameter of the character string to be matched and the calculation character string. That is, m_1 is the length of the first maximum common substring obtained, i.e., the maximum common substring of the character string to be matched and the calculation character string before any removal operation is performed; m_2 is the length of the maximum common substring of the new matching character string and the new calculation character string obtained after the first removal; and so on.
k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched, and k_i2 represents its number of occurrences in the calculation character string. For example, if, before any removal operation is performed, the character string to be matched is "abcabcadd" and the calculation character string is "abcmnf", then the first maximum common substring is "abc"; the character string to be matched contains two occurrences of "abc", so k_11 is 2, and the calculation character string contains one occurrence of "abc", so k_12 is 1.
Let a be the number of maximum common substrings obtained while calculating the matching parameter of the character string to be matched and the calculation character string; n then represents any value not less than a. For example, suppose the two character strings share a maximum common substring before any removal operation is performed, a maximum common substring can still be obtained between the new matching character string and the new calculation character string after the first removal, but none remains after the second removal. Then two maximum common substrings have been obtained, i.e., a is 2, and n is any value not less than 2. In this case, for any i greater than 2, m_i is 0, and k_i1 and k_i2 are 0.
Further, if the matching parameter is the matching degree, the calculation module 200 includes:
a calculation character string obtaining unit, configured to select any character string from the character string set as the calculation character string;
a second maximum common substring obtaining unit, configured to obtain the maximum common substring of the character string to be matched and the calculation character string, calculate its length, and count its occurrences in the calculation character string and in the character string to be matched, respectively;
a second removal unit, configured to remove the maximum common substring from the calculation character string to obtain a new calculation character string, remove the maximum common substring from the character string to be matched to obtain a new matching character string, and trigger the second maximum common substring obtaining unit again, until the character string to be matched and the calculation character string no longer share a maximum common substring, at which point it triggers the matching degree calculation unit;
a matching degree calculation unit, configured to calculate the matching degree between the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
and a second selection unit, configured to select another character string from the character string set as the calculation character string and trigger the second maximum common substring obtaining unit, until the matching degree between every character string in the character string set and the character string to be matched has been obtained.
The matching degree calculation unit calculates the matching degree of the calculation character string and the character string to be matched by adopting the following formula:
Figure BDA0000894426010000131
where E represents the matching degree between the character string to be matched and the calculation character string.
m_i represents the length of the i-th maximum common substring obtained while calculating the matching parameter of the character string to be matched and the calculation character string. That is, m_1 is the length of the first maximum common substring obtained, i.e., the maximum common substring of the character string to be matched and the calculation character string before any removal operation is performed; m_2 is the length of the maximum common substring of the new matching character string and the new calculation character string obtained after the first removal; and so on.
k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched, and k_i2 represents its number of occurrences in the calculation character string. For example, if, before any removal operation is performed, the character string to be matched is "abcabcadd" and the calculation character string is "abcmnf", then the first maximum common substring is "abc"; the character string to be matched contains two occurrences of "abc", so k_11 is 2, and the calculation character string contains one occurrence of "abc", so k_12 is 1.
Let a be the number of maximum common substrings obtained while calculating the matching parameter of the character string to be matched and the calculation character string; n then represents any value not less than a. For example, suppose the two character strings share a maximum common substring before any removal operation is performed, a maximum common substring can still be obtained between the new matching character string and the new calculation character string after the first removal, but none remains after the second removal. Then two maximum common substrings have been obtained, i.e., a is 2, and n is any value not less than 2. In this case, for any i greater than 2, m_i is 0, and k_i1 and k_i2 are 0.
The second embodiment of the application discloses a data matching device which, after acquiring the character string to be matched and the character string set to be matched against it, determines the target character string that matches the character string to be matched according to the matching parameters between each character string in the character string set and the character string to be matched. When this device is used for data matching, matched character strings need not be exactly equal; whether two character strings match is determined from the matching parameters, which improves the matching rate.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A method of data matching, comprising:
acquiring a character string to be matched and a character string set needing to be matched;
respectively calculating matching parameters of each character string in the character string set and the character string to be matched, wherein the matching parameters comprise: similarity;
determining a target character string matched with the character string to be matched in the character string set according to the matching parameters;
if the matching parameters are similarity, the calculating the matching parameters of each character string in the character string set and the character string to be matched respectively comprises:
21) selecting any character string from the character string set as a calculation character string, and acquiring the length str1_1 of the character string to be matched;
22) acquiring the length str2_1 of the calculation character string;
23) acquiring the maximum common substring of the character string to be matched and the calculation character string, calculating the length of the maximum common substring, and respectively acquiring the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched;
24) removing the maximum common substring contained in the calculation character string to obtain a new calculation character string, removing the maximum common substring contained in the character string to be matched to obtain a new matching character string, and returning to execute step 23) until the character string to be matched and the calculation character string do not contain a maximum common substring, and then executing the operation of step 25);
25) calculating the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
26) selecting another character string from the character string set as a calculation character string, and returning to execute the operation of the step 22) until the similarity between all the character strings contained in the character string set and the character string to be matched is obtained.
2. The method according to claim 1, wherein the similarity between the calculation character string and the character string to be matched is calculated by adopting the following formula:
Figure FDA0002520469380000011
wherein str1_1 represents the length of the character string to be matched before any maximum common substring is removed; str2_1 represents the length of the calculation character string before any maximum common substring is removed; m_i represents the length of the i-th maximum common substring obtained when calculating the matching parameter of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; a is the number of maximum common substrings obtained when calculating the matching parameter of the character string to be matched and the calculation character string, and n represents any value not less than a; and L represents the similarity between the character string to be matched and the calculation character string.
3. A method of data matching, comprising:
acquiring a character string to be matched and a character string set needing to be matched;
respectively calculating matching parameters of each character string in the character string set and the character string to be matched, wherein the matching parameters comprise: matching amount;
determining a target character string matched with the character string to be matched in the character string set according to the matching parameters;
if the matching parameters are matching quantities, the step of respectively calculating the matching parameters of each character string in the character string set and the character string to be matched comprises the following steps:
41) selecting one character string from the character string set as a calculation character string;
42) acquiring the maximum common substrings of the character strings to be matched and the calculation character strings, calculating the length of the maximum common substring, and respectively acquiring the number of the maximum common substrings in the calculation character strings and the character strings to be matched;
43) removing the maximum common substring contained in the calculation character string, obtaining a new calculation character string, removing the maximum common substring contained in the character string to be matched, obtaining a new matching character string, and returning to execute the step 42) until the character string to be matched and the calculation character string do not contain the maximum common substring, and then executing the operation of the step 44);
44) calculating the matching amount of the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of the maximum common substrings in the calculation character string and the character string to be matched;
45) selecting another character string from the character string set as a calculation character string, and returning to execute the operation of the step 42) until obtaining the matching amount of all the character strings contained in the character string set and the character string to be matched.
4. The method according to claim 3, wherein the matching amount of the calculation character string and the character string to be matched is calculated by adopting the following formula:
Figure FDA0002520469380000021
wherein m_i represents the length of the i-th maximum common substring obtained when calculating the matching parameter of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; a is the number of maximum common substrings obtained when calculating the matching parameter of the character string to be matched and the calculation character string, and n represents any value not less than a; and E represents the matching amount of the character string to be matched and the calculation character string.
5. A data matching apparatus, comprising:
the acquisition module is used for acquiring a character string to be matched and a character string set to be matched;
a calculating module, configured to calculate matching parameters between each character string in the character string set and the character string to be matched, where the matching parameters include: similarity;
the determining module is used for determining a target character string matched with the character string to be matched in the character string set according to the matching parameters;
if the matching parameter is similarity, the calculating module includes:
a first length obtaining unit, configured to select any character string from the character string set as a calculation character string, and obtain the length str1_1 of the character string to be matched;
a second length obtaining unit, configured to obtain the length str2_1 of the calculation character string;
a first maximum common substring obtaining unit, configured to obtain the maximum common substring of the character string to be matched and the calculation character string, calculate the length of the maximum common substring, and respectively obtain the number of occurrences of the maximum common substring in the calculation character string and in the character string to be matched;
the first removal unit is used for removing the maximum common substring contained in the calculation character string, acquiring a new calculation character string, removing the maximum common substring contained in the character string to be matched, acquiring a new matching character string, triggering the first maximum common substring acquisition unit to execute operation until the character string to be matched and the calculation character string do not contain the maximum common substring, and triggering the similarity calculation unit to execute operation;
a similarity calculation unit, configured to calculate the similarity between the calculation character string and the character string to be matched according to the length str1_1 of the character string to be matched, the length str2_1 of the calculation character string, the length of each maximum common substring, and the number of occurrences of each maximum common substring in the calculation character string and in the character string to be matched;
and the first selection unit is used for selecting another character string from the character string set as a calculation character string, and triggering the second length acquisition unit to execute corresponding operation until the similarity between all the character strings contained in the character string set and the character string to be matched is acquired.
6. The apparatus according to claim 5, wherein the similarity calculation unit calculates the similarity between the calculation string and the string to be matched using the following formula:
Figure FDA0002520469380000041
wherein str1_1 represents the length of the character string to be matched before any maximum common substring is removed; str2_1 represents the length of the calculation character string before any maximum common substring is removed; m_i represents the length of the i-th maximum common substring obtained when calculating the matching parameter of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; a is the number of maximum common substrings obtained when calculating the matching parameter of the character string to be matched and the calculation character string, and n represents any value not less than a; and L represents the similarity between the character string to be matched and the calculation character string.
7. A data matching apparatus, comprising:
the acquisition module is used for acquiring a character string to be matched and a character string set to be matched;
a calculating module, configured to calculate matching parameters between each character string in the character string set and the character string to be matched, where the matching parameters include: matching amount;
the determining module is used for determining a target character string matched with the character string to be matched in the character string set according to the matching parameters;
if the matching parameter is a matching quantity, the calculation module comprises:
a calculation character string obtaining unit, configured to select any one character string from the character string set as a calculation character string;
the second maximum common substring obtaining unit is used for obtaining the maximum common substring of the character string to be matched and the calculation character string, calculating the length of the maximum common substring, and respectively obtaining the number of the maximum common substring in the calculation character string and the character string to be matched;
the second removal unit is used for removing the maximum common substring contained in the calculation string, acquiring a new calculation string, removing the maximum common substring contained in the to-be-matched string, acquiring a new matched string, returning to execute corresponding operations by the second maximum common substring acquisition unit until the to-be-matched string and the calculation string do not contain the maximum common substring, and triggering the matching amount calculation unit to execute the operations;
the matching amount calculation unit is used for calculating the matching amount of the calculation character string and the character string to be matched according to the length of each maximum common substring and the number of the maximum common substrings in the calculation character string and the character string to be matched;
and the second selection unit is used for selecting another character string from the character string set as a calculation character string and triggering the second maximum common sub-string acquisition unit to execute operation until the matching amount of all the character strings contained in the character string set and the character string to be matched is acquired.
8. The apparatus according to claim 7, wherein the matching amount calculating unit calculates the matching amount of the calculation character string and the character string to be matched using the following formula:
Figure FDA0002520469380000051
wherein m_i represents the length of the i-th maximum common substring obtained when calculating the matching parameter of the character string to be matched and the calculation character string; k_i1 represents the number of occurrences of the i-th maximum common substring in the character string to be matched; k_i2 represents the number of occurrences of the i-th maximum common substring in the calculation character string; a is the number of maximum common substrings obtained when calculating the matching parameter of the character string to be matched and the calculation character string, and n represents any value not less than a; and E represents the matching amount of the character string to be matched and the calculation character string.
CN201511018347.XA 2015-12-29 2015-12-29 Data matching method and device Active CN106933834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511018347.XA CN106933834B (en) 2015-12-29 2015-12-29 Data matching method and device

Publications (2)

Publication Number    Publication Date
CN106933834A (en)     2017-07-07
CN106933834B (en)     2020-09-08

Family

ID=59441580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511018347.XA Active CN106933834B (en) 2015-12-29 2015-12-29 Data matching method and device

Country Status (1)

Country Link
CN (1) CN106933834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191087B (en) * 2019-12-31 2023-11-07 歌尔股份有限公司 Character matching method, terminal device and computer readable storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN104239565A (en) * 2014-09-28 2014-12-24 陆嘉恒 Name automatic prompting method based on academic research

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9230041B2 (en) * 2013-12-02 2016-01-05 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching

Also Published As

Publication number Publication date
CN106933834A (en) 2017-07-07

Similar Documents

Publication Publication Date Title
TWI794157B (en) Automatic multi-threshold feature filtering method and device
CN107943874B (en) Knowledge mapping processing method, device, computer equipment and storage medium
US10796244B2 (en) Method and apparatus for labeling training samples
US9471964B2 (en) Non-local mean-based video denoising method and apparatus
TWI670623B (en) Method and device for acquiring device fingerprint
WO2016184163A1 (en) Method and device for generating a dpi rules
CN109002784B (en) Street view identification method and system
WO2020134010A1 (en) Training of image key point extraction model and image key point extraction
CN109685805B (en) Image segmentation method and device
CN113377964B (en) Knowledge graph link prediction method, device, equipment and storage medium
US20180247152A1 (en) Method and apparatus for distance measurement
CN109272044A (en) A kind of image similarity determines method, apparatus, equipment and storage medium
CN111241217A (en) Data processing method, device and system
KR20220004692A (en) Data augmentation policy update methods, devices, devices and storage media
CN108491715A (en) Generation method, device and the server in Terminal fingerprints library
CN111149101B (en) Target pattern searching method and computer readable storage medium
CN106933834B (en) Data matching method and device
CN105302715B (en) The acquisition methods and device of application program user interface
CN110876072B (en) Batch registered user identification method, storage medium, electronic device and system
CN107077617B (en) Fingerprint extraction method and device
CN105848155B (en) Terminal illegal flashing recognition method and device
CN105786789B (en) A kind of calculation method and device of text similarity
CN110210522A (en) The training method and device of picture quality Fraction Model
CN106934409B (en) Data matching method and device
CN110414845B (en) Risk assessment method and device for target transaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20170707

Assignee: Beijing May 8th clapping Information Technology Co.,Ltd.

Assignor: YOUXINPAI (BEIJING) INFORMATION TECHNOLOGY Co.,Ltd.

Contract record no.: X2020990000158

Denomination of invention: Data matching method and device thereof

License type: Common License

Record date: 20200402

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230913

Address after: 230012 In the factory building of Anhui Guogou Energy Co., Ltd., 100 meters east of the intersection of Guanjing Road and Luban Road in Xinzhan District, Hefei City, Anhui Province

Patentee after: Hefei Youquan Information Technology Co.,Ltd.

Address before: 100020 2507, 21 / F, building 10, No. 93, Jianguo Road, Chaoyang District, Beijing

Patentee before: YOUXINPAI (BEIJING) INFORMATION TECHNOLOGY Co.,Ltd.