CN113887223A - Character string matching method and related device - Google Patents

Character string matching method and related device

Publication number
CN113887223A
Authority
CN
China
Prior art keywords
character
matched
character string
matching
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111155436.4A
Other languages
Chinese (zh)
Other versions
CN113887223B (en)
Inventor
唐超
甄鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111155436.4A
Publication of CN113887223A
Application granted
Publication of CN113887223B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a character string matching method and a related device. In the first round of character matching, following the principle of middle first, then the two sides, the middle character of a character string to be matched is matched against characters in a plurality of texts. If the middle character is found in a first text, the first character and the tail character of the character string to be matched are each matched against the first text using their distances from the middle character, and a second text that includes a similar character string of the character string to be matched is obtained by screening. Following the same principle, the next round of character matching is performed between the remaining characters of the character string to be matched and the similar character string in the second text, and a target text is screened from the second text, until a character matching end condition is met. If the target text includes a target character string whose characters all match those of the character string to be matched, the target character string is taken as the matching character string of the character string to be matched.

Description

Character string matching method and related device
Technical Field
The present application relates to the field of data processing, and in particular, to a string matching method and related apparatus.
Background
String matching is an important component of computer science. It embodies numerous scientific theories, algorithmic ideas and algorithmic techniques, and is applied in various fields of information technology. With the rapid development and popularization of networked information, users place higher demands on information retrieval: functionally, they require better recall, better precision and accurate positioning, and they expect operation to be simple, flexible and fast. As the basis of information retrieval, string matching has become increasingly important and is now a bottleneck technology that directly influences the retrieval modes, retrieval functions, retrieval effects, user interfaces and other aspects of information retrieval.
Widely known string matching algorithms currently include the Brute-Force algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer-Moore algorithm, the Rabin-Karp algorithm and the like. These known algorithms either have high algorithmic complexity or are complex to implement.
Disclosure of Invention
To solve this technical problem, the application provides a character string matching method and a related device that reduce the complexity of character string matching, are simple to implement, and improve character string matching efficiency.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a character string matching method, where the method includes:
acquiring a character string to be matched;
in the first round of character matching process, matching the middle character of the character string to be matched with the characters in the plurality of texts according to the principle of first middle and then two sides;
if a first text in the plurality of texts comprises a middle character of the character string to be matched, matching the first character of the character string to be matched with the first text by using the distance between the middle character and the first character of the character string to be matched, matching the tail character of the character string to be matched with the first text by using the distance between the middle character and the tail character of the character string to be matched, and screening the first text to obtain a second text, wherein the second text comprises a similar character string of the character string to be matched;
according to the principle of middle first, then the two sides, performing the next round of character matching between the remaining characters in the character string to be matched and the similar character string in the second text, and screening a target text from the second text until a character matching end condition is met;
and when a character matching end condition is met, if the target text comprises a target character string which is matched with all characters of the character string to be matched, taking the target character string as a matching character string of the character string to be matched.
Optionally, before the matching of the middle character of the character string to be matched with the characters in the plurality of texts according to the principle of first middle and then two sides in the first round of character matching process, the method further includes:
according to the principle of middle first, then the two sides, identifying the characters in the character string to be matched to obtain an identification characteristic value of each character in the character string to be matched, wherein the identification characteristic value is used for identifying the position, in the character string to be matched, of a target character used by each round of character matching, the target character corresponding to each round of character matching comprises at least one of a middle character, a first character and a tail character of the character string corresponding to that round of character matching, and the character string corresponding to that round of character matching is all or part of the character string to be matched.
Optionally, the matching the middle character of the character string to be matched with the characters in the plurality of texts includes:
acquiring a middle character of the character string to be matched corresponding to the first round of character matching in the character string to be matched according to the identification characteristic value;
and matching the acquired middle characters of the character string to be matched with characters in a plurality of texts.
Optionally, the target character corresponding to each round of character matching further includes an additional character, where the additional character is a character in the character string corresponding to that round of character matching that is the same as the middle character, the first character, or the tail character of that character string.
Optionally, the method further includes:
determining a target text comprising the matching character string as a retrieval result of the character string to be matched;
and returning the retrieval result.
Optionally, the method further includes:
and marking the matching character strings in the retrieval result.
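The optional marking step can be illustrated with a minimal sketch. The function name `mark_matches` and the bracket delimiters are assumptions made for illustration, and Python's built-in `str.find` stands in for the matching method itself, which is not implemented here.

```python
def mark_matches(text, pattern, left="[", right="]"):
    """Wrap every non-overlapping occurrence of pattern in text with markers.

    Hypothetical helper: the delimiters and the use of str.find are
    illustrative assumptions, not part of the described method.
    """
    pieces = []
    i = 0
    while True:
        j = text.find(pattern, i)
        if j == -1 or not pattern:
            pieces.append(text[i:])  # no further match: keep the rest as-is
            break
        pieces.append(text[i:j])                 # unmarked text before the match
        pieces.append(left + pattern + right)    # the marked match
        i = j + len(pattern)
    return "".join(pieces)


print(mark_matches("xxABCyyABC", "ABC"))
```

A retrieval front end could apply such marking to each target text before returning it, so that the matching character string is easy to locate in the result.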
In a second aspect, an embodiment of the present application provides a character string matching apparatus, where the apparatus includes:
the acquiring unit is used for acquiring a character string to be matched;
the matching unit is used for matching the middle character of the character string to be matched with the characters in the plurality of texts according to the principle of first middle and second sides in the first round of character matching process;
the matching unit is further configured to, if a first text in the plurality of texts includes a middle character of the character string to be matched, match the first character of the character string to be matched with the first text by using a distance between the middle character and the first character of the character string to be matched, match the last character of the character string to be matched with the first text by using a distance between the middle character and the last character of the character string to be matched, and screen a second text from the first text, where the second text includes a similar character string of the character string to be matched;
the matching unit is further configured to perform a next round of character matching on the remaining characters in the character string to be matched and the similar character string in the second text according to the principle of first middle and second sides, and screen a target text from the second text until a character matching end condition is met;
and the determining unit is used for taking the target character string as the matching character string of the character string to be matched if the target text comprises the target character string which is completely matched with the characters of the character string to be matched when the character matching end condition is met.
Optionally, the apparatus further includes an identification unit:
the identification unit is used for identifying the characters in the character string to be matched according to the principle of middle first, then the two sides, to obtain an identification characteristic value of each character in the character string to be matched, wherein the identification characteristic value is used for identifying the position, in the character string to be matched, of a target character used by each round of character matching, the target character corresponding to each round of character matching comprises at least one of a middle character, a first character and a tail character of the character string corresponding to that round of character matching, and the character string corresponding to that round of character matching is all or part of the character string to be matched.
Optionally, the matching unit is configured to:
acquiring a middle character of the character string to be matched corresponding to the first round of character matching in the character string to be matched according to the identification characteristic value;
and matching the acquired middle characters of the character string to be matched with characters in a plurality of texts.
Optionally, the target character corresponding to each round of character matching further includes an additional character, where the additional character is a character in the character string corresponding to that round of character matching that is the same as the middle character, the first character, or the tail character of that character string.
Optionally, the determining unit is configured to determine a target text including the matching character string as a retrieval result of the character string to be matched;
the apparatus further comprises a return unit:
and the return unit is used for returning the retrieval result.
Optionally, the apparatus further comprises a marking unit:
and the marking unit is used for marking the matching character strings in the retrieval result.
In a third aspect, an embodiment of the present application provides an apparatus for string matching, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of the preceding first aspects according to instructions in the program code.
According to the technical scheme, the embodiment of the application has the following advantages:
when a matching character string of a character string to be matched is searched for in a plurality of texts, in the first round of character matching, following the principle of middle first, then the two sides, the middle character of the character string to be matched is matched against characters in the plurality of texts. If the middle character is found in a first text among the plurality of texts, the first character of the character string to be matched is matched against the first text using the distance between the middle character and the first character, the tail character is matched against the first text using the distance between the middle character and the tail character, and a second text is obtained by screening the first text, the second text including a similar character string of the character string to be matched. Then, following the same principle, the next round of character matching is performed between the remaining characters in the character string to be matched and the similar character string in the second text, and a target text is screened from the second text until the character matching end condition is met. When the character matching end condition is met, if the target text includes a target character string whose characters all match those of the character string to be matched, the target character string is taken as the matching character string of the character string to be matched.
With this scheme, a small number of second texts that include similar character strings can be screened out of a large number of texts by the principle of middle first, then the two sides, so that the next round of character matching is performed only on this small number of second texts, which reduces the processing amount of that round. Because every round of character matching follows the same principle, the processing amount of each subsequent round keeps decreasing. The complexity of character string matching is thereby reduced, the implementation is simple, and character string matching efficiency is improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a character string matching method according to an embodiment of the present application;
Fig. 2 is a flowchart of a preprocessing process provided by an embodiment of the present application;
Fig. 3 is a flowchart of a matching process provided in an embodiment of the present application;
Fig. 4 is a structural diagram of a character string matching apparatus according to an embodiment of the present application.
Detailed Description
To make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
The prior art has the following technical problem: when character string matching is performed with the Brute-Force algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer-Moore algorithm, the Rabin-Karp algorithm and the like, these known algorithms have high algorithmic complexity or are complex to implement, and character string matching efficiency is low. To solve this problem, the embodiment of the application provides a character string matching method which, as shown in Fig. 1, comprises the following steps:
S101, obtaining a character string to be matched.
It should be noted that the method provided by the embodiment of the present application may be applied to a scene of retrieval, query, search, and the like, and in this case, the character string to be matched may be a character string input by a user. The character strings to be matched may be character strings corresponding to different languages, such as chinese, korean, japanese, english, and the like.
S102, in the first round of character matching, matching the middle character of the character string to be matched with the characters in the plurality of texts according to the principle of middle first, then the two sides.
Before the middle character of the character string to be matched is matched with the characters in the plurality of texts in the first round, the characters in the character string to be matched can be identified according to the principle of middle first, then the two sides, so as to obtain the identification feature value of each character in the character string to be matched. The identification feature value identifies the position, in the character string to be matched, of a target character used by each round of character matching. The target character corresponding to each round of character matching comprises at least one of the middle character, the first character and the tail character of the character string corresponding to that round, and the character string corresponding to that round is all or part of the character string to be matched.
It should be noted that, the identification feature values corresponding to the target characters used in each round of character matching may be identified sequentially. In general, the identification feature value corresponding to the target character used in the first round of character matching may be identified, then the identification feature value corresponding to the target character used in the second round of character matching may be identified, and so on, until each character has a corresponding identification feature value. The process of identifying and obtaining the identification characteristic value of each character in the character string to be matched may be referred to as a process of preprocessing the character string to be matched.
Specifically, the identification feature values corresponding to the target characters used in the first round of character matching are identified first. The character string corresponding to the first round of character matching is the character string to be matched itself: the identification feature value of its first character is F1, the identification feature value of its middle character is M1 (when the number of characters in the character string to be matched is even, the middle character consists of 2 characters; otherwise it is 1 character), and the identification feature value of its tail character is L1.
Next, the identification feature values corresponding to the target characters used by the second round of character matching are identified from the remaining character strings. After the first character, the middle character and the tail character (F1, M1 and L1) are removed, the character string to be matched still has a first half and a second half. According to the principle of middle first, then the two sides, the first character, the middle character and the tail character of the first half are identified in the same way to obtain the identification feature values F2, M2 and L2, and the first character, the middle character and the tail character of the second half are identified to obtain the identification feature values F21, M21 and L21. The identification feature values F2, M2, L2, F21, M21 and L21 obtained this time identify the positions, in the character string to be matched, of the target characters used in the second round of character matching.
Finally, the remaining character strings are recursively processed in a similar manner until each character has a corresponding identifying characteristic value. And finishing preprocessing the character string to be matched.
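The recursive preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `label_rounds`, the (index, role) output format, and the handling of one- and two-character segments (a single character is labelled "M"; two characters are labelled "F" and "M", the combination the text calls the normal case) are all assumptions.

```python
def label_rounds(pattern):
    """Group pattern positions by matching round.

    Returns a list of rounds; each round is a sorted list of
    (index, role) pairs with role in {"F", "M", "L"}.
    Hypothetical helper for illustration only.
    """
    rounds = []
    segments = [(0, len(pattern) - 1)] if pattern else []
    while segments:
        current, next_segments = [], []
        for lo, hi in segments:
            n = hi - lo + 1
            if n == 1:                      # assumption: lone character is a middle
                current.append((lo, "M"))
                continue
            if n == 2:                      # assumption: two characters -> F and M
                current.extend([(lo, "F"), (hi, "M")])
                continue
            mid_lo = lo + (n - 1) // 2      # one middle character if n is odd,
            mid_hi = lo + n // 2            # two middle characters if n is even
            current.append((lo, "F"))
            for m in range(mid_lo, mid_hi + 1):
                current.append((m, "M"))
            current.append((hi, "L"))
            # The halves left between first/middle and middle/tail are
            # processed in the next round.
            if lo + 1 <= mid_lo - 1:
                next_segments.append((lo + 1, mid_lo - 1))
            if mid_hi + 1 <= hi - 1:
                next_segments.append((mid_hi + 1, hi - 1))
        rounds.append(sorted(current))
        segments = next_segments
    return rounds


for number, targets in enumerate(label_rounds("ABCDADE"), start=1):
    print(number, targets)
```

For the seven-character string ABCDADE this yields positions 0, 3 and 6 in the first round and positions 1, 2, 4 and 5 in the second, which corresponds to F1/M1/L1 and F2/M2/F21/M21 in the labelling scheme described here.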
It should be noted that, in some possible implementations, the target character corresponding to each round of character matching includes an additional character, where the additional character is a character in the character string corresponding to that round that is the same as its middle character, first character, or tail character. For example, the character string to be matched is first identified with the identification feature values F1, M1 and L1 according to the principle of middle first, then the two sides; the character string to be matched is then searched for characters that are the same as the character identified by M1, F1 or L1, and these characters are identified with the identification feature values M11, M12, … and so on. A character that is the same as the character identified by M1, F1 or L1 may be referred to as an additional character. In this way, the similarity of the similar character strings screened by the first round of character matching is greatly increased.
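A short sketch of how additional characters might be located is given here. The function name `extra_positions` and its dictionary output are assumptions for illustration; the sketch follows the general statement that characters equal to the first, middle or tail character qualify, whereas the ABCDADE example in this description marks only the repeated D.

```python
def extra_positions(pattern, anchors):
    """Map each anchor index to the other indices holding the same character.

    anchors: indices of the first, middle and tail characters of the round.
    Hypothetical helper; positions that are themselves anchors are skipped.
    """
    return {
        a: [i for i, ch in enumerate(pattern)
            if ch == pattern[a] and i not in anchors]
        for a in anchors
    }


# Anchors of ABCDADE in the first round: A (index 0), D (index 3), E (index 6).
print(extra_positions("ABCDADE", [0, 3, 6]))
```

Checking these extra positions during the first round costs only a few more comparisons per candidate but rejects more non-matching candidates early.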
Assume that the character string to be matched is ABCDADE and that its characters are preprocessed. First, the identification feature values corresponding to the target characters used in the first round of character matching are identified: the first A in the character string to be matched is the first character, the D in the middle is the middle character, and the final E is the tail character, so these characters are identified with the identification feature values F1, M1 and L1 respectively. The character string is then searched for characters that are the same as the character identified by M1, and the identification feature value of the other character D in the character string to be matched is determined to be M12. Next, the identification feature values corresponding to the target characters used by the second round of character matching are identified from the remaining character strings. After the first character A, the middle character D and the tail character E are removed, the first half (B and C) and the second half (A and D) of the character string to be matched remain, and the first, middle and tail characters of each half are identified in the same way according to the principle of middle first, then the two sides. Because each half contains only two characters, only two of the identification feature values F2, M2 and L2 can be obtained from the first half; normally these are F2 and M2, but other combinations are also possible. Similarly, the identification feature values obtained from the second half are any two of F21, M21 and L21, for example F21 and M21.
In summary, after the preprocessing, the identification feature value of each character in the character string to be matched is obtained as shown in table 1:
TABLE 1
Character string to be matched     A    B    C    D    A     D         E
Identification feature value       F1   F2   M2   M1   F21   M21/M12   L1
S103, if a first text in the plurality of texts is found to include the middle character of the character string to be matched, matching the first character of the character string to be matched with the first text by using the distance between the middle character and the first character of the character string to be matched, matching the tail character of the character string to be matched with the first text by using the distance between the middle character and the tail character of the character string to be matched, and screening the first text to obtain a second text, wherein the second text includes a similar character string of the character string to be matched.
When the character string to be matched has been preprocessed, the middle character can be matched with the characters in the texts as follows: the middle character corresponding to the first round of character matching is obtained from the character string to be matched according to the identification feature value, and the obtained middle character is matched with the characters in the texts.
Specifically, the character D identified by M1 is searched for in the text. Once it is found, the distances from the character D identified by M1 to the character A identified by F1 and to the character E identified by L1 are used to quickly match the character A identified by F1 and the character E identified by L1. When these are matched, the additional characters identified by M11, M12, … are matched. If they also match, the character string found in the text, with first character A, middle character D and tail character E, conforms to the characteristics of a similar character string and can be used as the similar character string of the character string to be matched. Otherwise it is not a similar character string, and steps S102-S103 are repeated until a similar character string is found or the end of the text is reached. In this way, a small number of second texts that include similar character strings can be found, and a large number of texts that do not include similar character strings are filtered out, which reduces the data processing amount of subsequent matching.
Assume that the text bbcabaabcdccdadeldfe exists. The character D identified by M1 is searched for in the text, and the first match is the 7th character "D". The character E identified by L1 is matched next. In the character string to be matched, the character E identified by L1 is two characters away from, and after, the character D identified by M1; in the text, the character that is two characters away from and after the 7th character "D" is the character A, which does not match the character E identified by L1. The search then continues for the second character in the text that matches the character D identified by M1, which is the 13th character "D", and the character E identified by L1 is matched again. The character two characters away from and after the 13th character "D" is the character C, which does not match the character E identified by L1. The search continues for the third character in the text that matches the character D identified by M1, which is the 17th character "D". The character two characters away from and after the 17th character "D" is the character E, which matches the character E identified by L1. The character A identified by F1 is matched next: in the character string to be matched, the character A identified by F1 is two characters away from, and before, the character D identified by M1, and in the text the character two characters away from and before the 17th character "D" is the character A, which matches the character A identified by F1.
The additional character D identified by M12 is then matched. In the character string to be matched, the additional character D identified by M12 is 1 character away from, and after, the character D identified by M1, and in the text the character that is 1 character away from and after the 17th character "D" is the character D, which matches the additional character D identified by M12. At the end of the first round of character matching, the character string between the 14th character and the 20th character is therefore a similar character string, and the text is a second text.
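The first-round screening walked through above can be sketched as follows. The function name `first_round` and the sample text `BBCABADBCAABDABCDADELFE` are hypothetical: the text was constructed only so that D occurs as its 7th, 13th and 17th characters as in the walk-through, and it is not claimed to be the text used in this description. For patterns of even length the sketch uses a single (lower) middle character, which is a simplifying assumption.

```python
def first_round(text, pattern):
    """Return 0-based start offsets of similar character strings in text.

    A candidate is kept when its middle, tail and first characters, plus any
    extra characters equal to the middle character, all line up with pattern.
    Hypothetical helper for illustration only.
    """
    n = len(pattern)
    if n == 0:
        return []
    mid = (n - 1) // 2            # assumption: one middle character
    checks = [n - 1, 0]           # tail first, then first character
    # Extra characters: other positions holding the middle character.
    checks += [i for i, ch in enumerate(pattern)
               if ch == pattern[mid] and i != mid]
    starts = []
    pos = text.find(pattern[mid])
    while pos != -1:
        start = pos - mid         # where the pattern would begin
        fits = start >= 0 and start + n <= len(text)
        if fits and all(text[start + i] == pattern[i] for i in checks):
            starts.append(start)
        pos = text.find(pattern[mid], pos + 1)
    return starts


sample = "BBCABADBCAABDABCDADELFE"   # hypothetical text
print(first_round(sample, "ABCDADE"))
```

A text for which this function returns a non-empty list would count as a second text; texts with an empty result are filtered out before any further round is run.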
S104, performing the next round of character matching between the remaining characters in the character string to be matched and the similar character string in the second text according to the principle of middle first, then the two sides, and screening a target text from the second text until a character matching end condition is met.
After the processing of S102-S103, some characters in the character string to be matched have not yet been matched. Therefore, according to the principle of middle first, then the two sides, the next round of character matching is performed between these remaining characters and the similar character string in the second text, and the target text is screened from the second text until the character matching end condition is satisfied.
Specifically, in the next round of character matching, for example the second round, the characters identified by F2, M2, L2, F21, M21, L21 and so on are matched within the similar character string of the second text. If all of them match, a further similar character string is obtained and the next round, for example the third round, continues by matching the characters identified by F3, M3, L3, F31, M31, L31 and so on, and so forth.
In this process, when a character identified by F2/F3, M2/M3 or L2/L3 coincides with a character identified by M11, M12, M13, …, it is not matched again and is directly treated as matched.
Continuing to use the text bbcababbcdabcddadap element fe as the second text and the character string to be matched is ABCDADE as an example, the first round of character matching determines that the character string between the 14 th character and the 20 th character (ABCDADE) in the second text is a similar character string, and then the character string can be continuously matched with the characters identified by F2, M2, F21 and M21 in the character string to be matched ABCDADE, and since the character identified by M21 is repeated with the character identified by M12, the character is no longer matched, that is, the characters identified by F2, M2 and F21 only need to be matched, and the character string between the 14 th character and the 20 th character (i.e., ABCDADE) in the second text is found to be completely matched with the character string to be matched.
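A minimal Python sketch of the later rounds, showing how characters already confirmed in the first round are skipped. The function name `later_rounds`, the 0-based indices and the example text are assumptions made for illustration; the index list `[2, 1, 5, 4]` is one reading of M2, F2, M21 and F21 for ABCDADE that is consistent with M21 coinciding with M12.

```python
def later_rounds(text, start, pattern, later, confirmed):
    """Check the remaining pattern characters at text[start:], round by round.

    later     -- list of rounds, each a list of 0-based pattern indices
    confirmed -- pattern indices already matched in the first round
    """
    confirmed = set(confirmed)
    for indices in later:
        for i in indices:
            if i in confirmed:
                continue              # e.g. M21 coincides with M12: not matched again
            if text[start + i] != pattern[i]:
                return False          # this candidate is not a match
            confirmed.add(i)
    # A full match requires every pattern position to have been confirmed.
    return len(confirmed) == len(pattern)


# Hypothetical example text (assumption), with the similar string at offset 13.
text = "BBCABADBCAABDABCDADELFE"
first_round_indices = {3, 0, 6, 5}    # M1, F1, L1 and the additional character M12
second_round = [[2, 1, 5, 4]]         # M2, F2, M21, F21 (assumed labelling)
print(later_rounds(text, 13, "ABCDADE", second_round, first_round_indices))
```

Only three comparisons are made in the second round here, because index 5 was already confirmed as an additional character in the first round.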
S105: When the character matching end condition is met, if the target text includes a target character string in which all characters match the character string to be matched, take the target character string as the matching character string of the character string to be matched.
In the example, once this match is complete, what remains of the second text is the character string "LFE" that follows the 20th character. It is shorter than the character string to be matched, so matching ends, and a matching character string that completely matches the character string to be matched has been found in the second text.
It is understood that the process of S102-S105 may be referred to as a matching process.
In the embodiment of the application, a target text that includes the matching character string can be determined as a search result for the character string to be matched, and the search result is returned to the user.
Depending on the application scenario, the way the matching character string is determined, and the way it is displayed, may differ slightly. If the goal is only to return search results for the character string to be matched, it is enough to determine that a text includes the character string to be matched, that is, that the target text obtained includes it; not every occurrence in the target text needs to be found. If the user wants to see where the character string to be matched occurs in the target text, all occurrences in the target text can be found and the matching character strings marked in the search result, for example with highlighting, bold or underlining.
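The marking step can be illustrated with a short Python sketch. It is an assumption-laden illustration: the function name `mark_matches` and the bracket markers are hypothetical, and Python's built-in `str.find` stands in for the matching process described in this application, since only the marking is being shown.

```python
def mark_matches(text, pattern, left="[", right="]"):
    """Wrap every non-overlapping occurrence of pattern in text with markers."""
    if not pattern:
        return text
    out, i = [], 0
    while True:
        j = text.find(pattern, i)     # stand-in for the matcher (assumption)
        if j == -1:
            out.append(text[i:])      # no further occurrence: keep the tail
            return "".join(out)
        out.extend([text[i:j], left, pattern, right])
        i = j + len(pattern)          # continue after this occurrence


# When only inclusion matters, a single membership test is enough.
print("ABCDADE" in "BBCABADBCAABDABCDADELFE")
# When positions matter, mark every occurrence, e.g. with bold tags.
print(mark_matches("BBCABADBCAABDABCDADELFE", "ABCDADE", "<b>", "</b>"))
```

The markers are parameters so that the same routine can produce highlight, bold or underline markup depending on how the search result is displayed.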
According to the above technical scheme, the embodiment of the application has the following advantages:
When a matching character string for a character string to be matched is searched for in a plurality of texts, the first round of character matching follows the principle of middle first, then both sides: the middle character of the character string to be matched is matched against the characters in the plurality of texts. If a first text among the plurality of texts is found to include the middle character, the first character of the character string to be matched is matched against the first text using the distance between the middle character and the first character, and the tail character is matched against the first text using the distance between the middle character and the tail character. A second text, which includes a similar character string of the character string to be matched, is thereby screened from the first text. The next round of character matching is then performed, on the same principle, between the remaining characters of the character string to be matched and the similar character string in the second text, and a target text is screened from the second text, until the character matching end condition is met. When that condition is met, if the target text includes a target character string that completely matches the characters of the character string to be matched, the target character string is taken as the matching character string of the character string to be matched.
In this scheme, the principle of middle first, then both sides screens a small number of second texts that include similar character strings out of a large number of texts, so the next round of character matching only has to process that small number of second texts. Because every round follows the same principle, the amount of processing keeps decreasing from round to round. This reduces the complexity of character string matching, keeps the implementation simple, and improves character string matching efficiency.
As described above, the method provided in the embodiment of the present application mainly comprises a preprocessing process for the character string to be matched and a matching process. The two processes are described below in turn.
First, the preprocessing process for the character string to be matched is described. Referring to fig. 2, the preprocessing process includes the following steps:
S201: Obtain the character string to be matched.
S202: According to the principle of middle first, then both sides, obtain the identification characteristic values F1, M1 and L1, which correspond to the first character, the middle character and the tail character of the character string to be matched and are used in the first round of character matching. Search the character string to be matched for characters that are the same as the character identified by M1, and identify them to obtain the identification characteristic values M11, M12, M13, ….
S203: According to the same principle, identify the remaining first-half characters and the remaining second-half characters to obtain the identification characteristic values F2, M2, L2, F21, M21 and L21.
S204: Process the remaining characters recursively, identifying them in turn to obtain the identification characteristic values F3, M3, L3, F31, M31, L31, F4, M4, L4, F41, M41, L41, …, until every character in the character string to be matched has a corresponding identification characteristic value.
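Steps S201-S204 can be sketched in Python as below. The labelling rules are assumptions chosen to agree with the ABCDADE example (F2, M2, F21 and M21 in the second round, with M21 falling on the same character as M12): the upper middle is used for even-length segments, a label is dropped when it would coincide with the middle of its own segment, and every occurrence of the middle character is numbered M11, M12, … by position. The source does not spell these rules out, and the function name `identify` is hypothetical.

```python
def identify(pattern):
    """Map identification characteristic value labels to 0-based pattern indices."""
    labels = {}
    n = len(pattern)
    segments = [(0, n - 1)] if n else []
    rnd = 1
    while segments:
        remainders = []
        for k, (lo, hi) in enumerate(segments):
            # The first segment of round r is labelled "r", later ones "r1", "r2", ...
            suffix = str(rnd) if k == 0 else f"{rnd}{k}"
            mid = (lo + hi + 1) // 2          # upper middle for even lengths (assumption)
            labels["M" + suffix] = mid
            if lo != mid:
                labels["F" + suffix] = lo
            if hi != mid:
                labels["L" + suffix] = hi
            # Whatever is left on each side is handled in the next round.
            if lo + 1 <= mid - 1:
                remainders.append((lo + 1, mid - 1))
            if mid + 1 <= hi - 1:
                remainders.append((mid + 1, hi - 1))
        if rnd == 1:
            # Additional characters: occurrences of the middle character,
            # numbered in order of position (assumed numbering).
            m_char = pattern[labels["M1"]]
            occurrences = [i for i, c in enumerate(pattern) if c == m_char]
            for k, i in enumerate(occurrences, start=1):
                labels[f"M1{k}"] = i
        segments, rnd = remainders, rnd + 1
    return labels


print(identify("ABCDADE"))
```

For ABCDADE this yields M1, F1 and L1 at indices 3, 0 and 6, M12 at index 5, and second-round labels F2, M2, F21 and M21 at indices 1, 2, 4 and 5, so every character has at least one identification characteristic value.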
Next, the matching process is described. Referring to fig. 3, the matching process includes the following steps:
S301: Enter the matching process for the character string to be matched.
S302: Determine whether the text contains a similar character string that matches the characters identified by F1, M1, L1, M11, M12, M13, … in the character string to be matched. If yes, execute S303; if not, execute S306.
S303: Determine whether the similar character string matches, in turn, the characters identified by F2, M2, L2, F21, M21 and L21 in the character string to be matched. If yes, execute S304; if not, execute S306.
S304: Recursively match, in turn, the characters identified by the remaining identification characteristic values F3, M3, L3, …. If all characters match, execute S305; if not, execute S306.
S305: Determine the matching character string of the character string to be matched.
S306: Determine whether the remaining length of the text is greater than the length of the character string to be matched. If yes, execute S302; if not, the matching ends.
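The flow S301-S306 can be sketched as a single loop in Python. This is a minimal sketch under assumptions: the helper names `rounds` and `find_match` are hypothetical, the upper middle is used for even-length segments, and the end condition is written as "a full-length window still fits in the text", which also admits a remainder exactly as long as the character string to be matched.

```python
def rounds(n):
    """Group pattern indices 0..n-1 into rounds: middle first, then both ends."""
    segments = [(0, n - 1)] if n else []
    out = []
    while segments:
        current, nxt = [], []
        for lo, hi in segments:
            mid = (lo + hi + 1) // 2          # upper middle (assumption)
            for i in (mid, lo, hi):
                if i not in current:
                    current.append(i)
            if lo + 1 <= mid - 1:
                nxt.append((lo + 1, mid - 1))
            if mid + 1 <= hi - 1:
                nxt.append((mid + 1, hi - 1))
        out.append(current)
        segments = nxt
    return out


def find_match(text, pattern):
    """Return the 0-based start of the first full match, or -1 if there is none."""
    n = len(pattern)
    if n == 0 or n > len(text):
        return -1
    order = rounds(n)
    mid = order[0][0]
    pos = text.find(pattern[mid], mid)        # S302: look for the middle character
    # S306: continue only while a full-length window still fits in the text.
    while pos != -1 and pos - mid + n <= len(text):
        start = pos - mid
        # S302-S304: compare round by round; all() stops at the first mismatch.
        if all(text[start + i] == pattern[i] for rnd in order for i in rnd):
            return start                      # S305: matching character string found
        pos = text.find(pattern[mid], pos + 1)
    return -1


print(find_match("BBCABADBCAABDABCDADELFE", "ABCDADE"))
```

Because the comparison order puts the middle and the two ends first, most non-matching candidates are rejected after one to three comparisons, which is the filtering effect the rounds are meant to provide.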
Based on the character string matching method provided by the embodiment corresponding to fig. 1, the embodiment of the present application provides a character string matching apparatus. Referring to fig. 4, the apparatus includes:
an obtaining unit 401, configured to obtain a character string to be matched;
a matching unit 402, configured to, in a first round of character matching, match a middle character of the character string to be matched against characters in a plurality of texts according to the principle of middle first, then both sides;
the matching unit 402 being further configured to, if a first text among the plurality of texts is found to include the middle character of the character string to be matched, match the first character of the character string to be matched against the first text by using the distance between the middle character and the first character of the character string to be matched, match the tail character of the character string to be matched against the first text by using the distance between the middle character and the tail character of the character string to be matched, and screen a second text from the first text, where the second text includes a similar character string of the character string to be matched;
the matching unit 402 being further configured to perform, according to the principle of middle first, then both sides, a next round of character matching between the remaining characters in the character string to be matched and the similar character string in the second text, and to screen a target text from the second text, until a character matching end condition is met;
a determining unit 403, configured to, when the character matching end condition is met, take a target character string as the matching character string of the character string to be matched if the target text includes the target character string and the target character string completely matches the characters of the character string to be matched.
Optionally, the apparatus further includes an identification unit:
the identification unit is configured to identify the characters in the character string to be matched according to the principle of middle first, then both sides, to obtain an identification characteristic value for each character in the character string to be matched. The identification characteristic value identifies the position, in the character string to be matched, of a target character used in each round of character matching. The target character for each round of character matching includes at least one of the middle character, the first character and the tail character of the character string corresponding to that round, and the character string corresponding to that round is all or part of the character string to be matched.
Optionally, the matching unit is configured to:
acquire, according to the identification characteristic value, the middle character used in the first round of character matching from the character string to be matched;
and match the acquired middle character against characters in the plurality of texts.
Optionally, the target character for each round of character matching further includes an additional character, where the additional character is a character, in the character string corresponding to that round of character matching, that is the same as the middle character, the first character or the tail character of that character string.
Optionally, the determining unit is configured to determine a target text that includes the matching character string as a retrieval result for the character string to be matched;
the apparatus further comprises a return unit:
the return unit is configured to return the retrieval result.
Optionally, the apparatus further comprises a marking unit:
the marking unit is configured to mark the matching character string in the retrieval result.
According to the above technical scheme, the apparatus embodiment has the same advantages as the method embodiment: the principle of middle first, then both sides screens a small number of second texts that include similar character strings out of a large number of texts, each subsequent round of character matching processes less data than the one before, and the complexity of character string matching is reduced while its efficiency is improved.
An embodiment of the present application further provides an apparatus for string matching, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the string matching method according to any one of the preceding embodiments according to instructions in the program code.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium may be at least one of the various media that can store program code, such as a read-only memory (ROM), a RAM, a magnetic disk or an optical disk.
It should be noted that the embodiments in this specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, which may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of string matching, the method comprising:
acquiring a character string to be matched;
in a first round of character matching, matching a middle character of the character string to be matched against characters in a plurality of texts according to the principle of middle first, then both sides;
if a first text among the plurality of texts includes the middle character of the character string to be matched, matching a first character of the character string to be matched against the first text by using the distance between the middle character and the first character of the character string to be matched, matching a tail character of the character string to be matched against the first text by using the distance between the middle character and the tail character of the character string to be matched, and screening the first text to obtain a second text, wherein the second text includes a similar character string of the character string to be matched;
performing, according to the principle of middle first, then both sides, a next round of character matching between the remaining characters in the character string to be matched and the similar character string in the second text, and screening a target text from the second text, until a character matching end condition is met;
and when the character matching end condition is met, if the target text includes a target character string in which all characters match the character string to be matched, taking the target character string as a matching character string of the character string to be matched.
2. The method of claim 1, wherein before the matching, in the first round of character matching, of the middle character of the character string to be matched against the characters in the plurality of texts according to the principle of middle first, then both sides, the method further comprises:
identifying the characters in the character string to be matched according to the principle of middle first, then both sides, to obtain an identification characteristic value of each character in the character string to be matched, wherein the identification characteristic value is used for identifying the position, in the character string to be matched, of a target character used in each round of character matching, the target character corresponding to each round of character matching comprises at least one of a middle character, a first character and a tail character of the character string corresponding to that round of character matching, and the character string corresponding to that round of character matching is all or part of the character string to be matched.
3. The method of claim 2, wherein matching the middle character of the character string to be matched against the characters in the plurality of texts comprises:
acquiring, according to the identification characteristic value, the middle character used in the first round of character matching from the character string to be matched;
and matching the acquired middle character against the characters in the plurality of texts.
4. The method of claim 2, wherein the target character corresponding to each round of character matching further comprises an additional character, the additional character being a character, in the character string corresponding to that round of character matching, that is the same as the middle character, the first character or the tail character of that character string.
5. The method of claim 1, further comprising:
determining a target text comprising the matching character string as a retrieval result of the character string to be matched;
and returning the retrieval result.
6. The method of claim 5, further comprising:
and marking the matching character strings in the retrieval result.
7. An apparatus for matching character strings, the apparatus comprising:
an acquiring unit, configured to acquire a character string to be matched;
a matching unit, configured to, in a first round of character matching, match a middle character of the character string to be matched against characters in a plurality of texts according to the principle of middle first, then both sides;
the matching unit being further configured to, if a first text among the plurality of texts includes the middle character of the character string to be matched, match a first character of the character string to be matched against the first text by using the distance between the middle character and the first character of the character string to be matched, match a tail character of the character string to be matched against the first text by using the distance between the middle character and the tail character of the character string to be matched, and screen a second text from the first text, wherein the second text includes a similar character string of the character string to be matched;
the matching unit being further configured to perform, according to the principle of middle first, then both sides, a next round of character matching between the remaining characters in the character string to be matched and the similar character string in the second text, and to screen a target text from the second text, until a character matching end condition is met;
and a determining unit, configured to, when the character matching end condition is met, take a target character string as the matching character string of the character string to be matched if the target text includes the target character string and the target character string completely matches the characters of the character string to be matched.
8. The apparatus of claim 7, wherein the apparatus further comprises an identification unit:
the identification unit is configured to identify the characters in the character string to be matched according to the principle of middle first, then both sides, to obtain an identification characteristic value of each character in the character string to be matched, wherein the identification characteristic value is used for identifying the position, in the character string to be matched, of a target character used in each round of character matching, the target character corresponding to each round of character matching comprises at least one of a middle character, a first character and a tail character of the character string corresponding to that round of character matching, and the character string corresponding to that round of character matching is all or part of the character string to be matched.
9. The apparatus of claim 8, wherein the matching unit is configured to:
acquire, according to the identification characteristic value, the middle character used in the first round of character matching from the character string to be matched;
and match the acquired middle character against characters in the plurality of texts.
10. An apparatus for string matching, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
CN202111155436.4A 2021-09-29 2021-09-29 Character string matching method and related device Active CN113887223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111155436.4A CN113887223B (en) 2021-09-29 2021-09-29 Character string matching method and related device

Publications (2)

Publication Number Publication Date
CN113887223A (en) 2022-01-04
CN113887223B (en) 2023-08-29

Family

ID=79004451

Country Status (1)

Country Link
CN (1) CN113887223B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013106989A1 (en) * 2012-01-16 2013-07-25 中国科学院北京基因组研究所 Method and device for matching character strings
CN103425739A (en) * 2013-07-09 2013-12-04 国云科技股份有限公司 Character string matching algorithm
CN105095369A (en) * 2015-06-29 2015-11-25 北京金山安全软件有限公司 Website matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
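张建莉 (Zhang Jianli): "Research on single-pattern string matching algorithms" (字符串单模式匹配算法研究), 农业网络信息 (Agricultural Network Information), no. 04 *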

Also Published As

Publication number Publication date
CN113887223B (en) 2023-08-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant