CN109977276B - Sunday algorithm-based improved single-mode matching method - Google Patents


Info

Publication number
CN109977276B
Authority
CN
China
Prior art keywords
string
matching
character
pattern
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910221407.XA
Other languages
Chinese (zh)
Other versions
CN109977276A (en)
Inventor
陆以勤
胡凡
覃健诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910221407.XA priority Critical patent/CN109977276B/en
Publication of CN109977276A publication Critical patent/CN109977276A/en
Application granted granted Critical
Publication of CN109977276B publication Critical patent/CN109977276B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an improved single-pattern matching method based on the Sunday algorithm. The pattern string is preprocessed in advance, and the pattern string and the text string are then compared in one of two orders (left to right or right to left), chosen according to the characteristics of the pattern string. If the comparison succeeds, string matching ends. If it fails, the pattern string is slid by a distance determined by whether the character immediately following the last text character that participated in the comparison appears in the pattern string, and the procedure repeats until the match succeeds or the pattern string slides past the end of the text string. The method effectively reduces the total number of character comparisons relative to the original algorithm and thereby improves text matching efficiency.

Description

Sunday algorithm-based improved single-mode matching method
Technical Field
The invention relates to the technical field of single-pattern string matching, and in particular to an improved single-pattern matching method based on the Sunday algorithm.
Background
String pattern matching is used in a growing range of applications, such as spell checking, language translation, data compression, search engines, and network intrusion detection, and faster matching has long been a goal of researchers. As the volume of information on the Internet grows, quickly finding the information a user needs in massive data has become a focus of network search research, and an efficient string matching algorithm can greatly improve search efficiency and quality.
Among single-pattern matching algorithms, the BM (Boyer-Moore) algorithm and the KMP (Knuth-Morris-Pratt) algorithm are the two best known. Under ideal conditions both run in linear time, but in practical applications the BM algorithm is 3 to 5 times faster than the KMP algorithm. In 1990, Daniel M. Sunday proposed the Sunday algorithm, which is faster and easier to understand than the BM algorithm and further improved string matching efficiency.
The Sunday algorithm does not prescribe whether the pattern string is compared from left to right or from right to left during matching. Consequently, if the comparison direction is chosen poorly, many wasted comparisons are performed and the efficiency of the Sunday algorithm drops significantly.
Disclosure of Invention
The invention aims to solve the technical problem that, in the prior art, the original Sunday algorithm does not specify the order in which the characters of the pattern string are compared, and provides an improved single-pattern matching method based on the Sunday algorithm.
The purpose of the invention can be achieved by adopting the following technical scheme:
An improved single-pattern matching method based on the Sunday algorithm comprises the following steps:
S1, calculating the occurrence probabilities of the characters at the head and at the tail of the pattern string by using the characteristic information of the pattern string, wherein the pattern string is the character string to be matched;
S2, calculating the probability sum of the first n characters and the probability sum of the last n characters of the pattern string, wherein n is 1, 2 or 3; if the probability sum of the first n characters is less than or equal to that of the last n characters, jumping to step S3; otherwise, jumping to step S4;
S3, left-aligning the pattern string with the text string, wherein the text string is the text to be searched, and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from left to right; if all characters match, matching succeeds and string matching ends; if not, jumping to step S5;
S4, right-aligning the pattern string with the text string and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from right to left; if all characters match, matching succeeds and string matching ends; if not, jumping to step S5;
S5, if the character immediately following the last text character that participated in the comparison does not appear in the pattern string, sliding the pattern string from left to right by a step equal to the length of the pattern string plus 1; otherwise, sliding the pattern string from left to right by a step equal to the distance from the rightmost occurrence of that character in the pattern string to the end of the pattern string, plus 1; then, if step S5 was entered from step S3, repeating step S3, and if step S5 was entered from step S4, repeating step S4.
Further, in step S1, the character probabilities at the head and tail of the pattern string are calculated according to the specific application scenario of pattern matching. When searching for a character string in an English article, the probability of each character is taken from the statistics of character occurrence probabilities in English text given in the existing literature. When searching for a pattern string in a random, unstructured text string, a sample of the text string is taken at a user-defined sampling ratio, and the occurrence probability of each character is calculated from the sample.
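The sampling branch of step S1 can be illustrated with a minimal Python sketch. The function name `estimate_char_probabilities`, the parameter `sample_ratio`, and the choice of taking the sample from the start of the text are assumptions made for illustration; the method itself only states that a sample is taken at a user-defined ratio.

```python
from collections import Counter


def estimate_char_probabilities(text, sample_ratio=0.1):
    """Estimate per-character occurrence probabilities from a sample of `text`.

    Assumption: the sample is simply the leading `sample_ratio` fraction of
    the text; the source does not say how the sample is drawn.
    """
    if not 0 < sample_ratio <= 1:
        raise ValueError("sample_ratio must be in (0, 1]")
    sample_len = max(1, int(len(text) * sample_ratio))
    sample = text[:sample_len]
    counts = Counter(sample)
    # Relative frequency of each character within the sample.
    return {ch: count / len(sample) for ch, count in counts.items()}


probabilities = estimate_char_probabilities("aabbabab", sample_ratio=0.5)
print(probabilities)  # frequencies of 'a' and 'b' in the sample "aabb"
```

Characters that never occur in the sample get no entry, so a caller would need to decide how to treat them (for example, as probability 0).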
Further, the value of n is determined as follows:
if the pattern string length m satisfies m ≥ 1 and m < 3, then n = 1; if m ≥ 3 and m < 5, then n = 2; if m ≥ 6, then n = 3.
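The rule for n and the direction decision of step S2 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the names `choose_n`, `choose_direction` and `ENGLISH_CHAR_PROB` are hypothetical, the thresholds are read as m ≥ 1, m ≥ 3 and m ≥ 6, and the case m = 5, which the stated rule does not cover, is assumed here to use n = 2. The probability values are a subset of the English statistics quoted in the embodiment.

```python
# Subset of the English character statistics quoted in the embodiment.
ENGLISH_CHAR_PROB = {
    "e": 0.105, "t": 0.072, "o": 0.0654, "a": 0.063, "n": 0.059, "i": 0.055,
    "r": 0.054, "s": 0.052, "h": 0.047, "d": 0.035, "l": 0.029, "c": 0.023,
}


def choose_n(m):
    """Number of head/tail characters to sum for a pattern of length m."""
    if m < 1:
        raise ValueError("pattern must not be empty")
    if m < 3:
        return 1
    if m < 6:  # assumption: m = 5 is treated like m = 3 or 4
        return 2
    return 3


def choose_direction(pattern, prob):
    """Step S2: compare starting from the end with the lower probability sum."""
    n = choose_n(len(pattern))
    head = sum(prob.get(ch, 0.0) for ch in pattern[:n])
    tail = sum(prob.get(ch, 0.0) for ch in pattern[-n:])
    # head <= tail -> step S3 (left to right); otherwise step S4 (right to left)
    return "left_to_right" if head <= tail else "right_to_left"


print(choose_direction("search", ENGLISH_CHAR_PROB))
```

For "search", the head sum is 0.052 + 0.105 + 0.063 and the tail sum is 0.054 + 0.023 + 0.047, so the rarer characters sit at the tail and the sketch selects right-to-left comparison.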
Compared with the prior art, the invention has the following advantages and effects:
1. The probability of each character in the pattern string is calculated for the specific service scenario, and low-probability characters are compared first, which reduces the total number of comparisons and improves on the efficiency of the original Sunday algorithm.
2. Once the pattern string is determined, the occurrence probability of each character is obtained through preprocessing and the comparison direction (left to right or right to left) is then fixed, which improves matching efficiency in practical scenarios such as malicious-code detection in network intrusion detection, large-text search in paper retrieval, and multi-signature virus scanning.
Drawings
FIG. 1 is a flow chart of the improved single-pattern matching method based on the Sunday algorithm disclosed in the present invention;
FIG. 2 is a flow chart of the original Sunday algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
This embodiment discloses an improved single-pattern matching method based on the Sunday algorithm, which is described in detail below with reference to FIG. 1.
An improved single-pattern matching method based on the Sunday algorithm comprises the following steps:
Step S1, calculating the occurrence probabilities of the characters at the head and at the tail of the pattern string by using the characteristic information of the pattern string, wherein the pattern string is the character string to be matched. The probabilities depend on the specific application scenario: when searching for a character string in an English article, the probability of each character is taken from the statistics of character occurrence probabilities in English text given in the existing literature; when searching for a pattern string in a random, unstructured text string, a sample of the text string is taken at a user-defined sampling ratio, and the occurrence probability of each character is calculated from the sample. For example, the statistics of character occurrence probabilities in English articles are:
space 0.2; e 0.105; t 0.072; o 0.0654; a 0.063; n 0.059; i 0.055; r 0.054; s 0.052; h 0.047; d 0.035; l 0.029; c 0.023; f/u 0.0225; m 0.021; p 0.0175; y 0.012; w 0.012; g 0.011; b 0.0105; v 0.008; k 0.003; x 0.002; j 0.001; q 0.001; z 0.001;
Step S2, calculating the probability sum of the first n characters and the probability sum of the last n characters of the pattern string; if the probability sum of the first n characters is less than or equal to that of the last n characters, jumping to step S3; otherwise, jumping to step S4. For example, for the pattern string "search", the probabilities of the first three characters sum to 0.052 + 0.105 + 0.063 = 0.22, and those of the last three characters sum to 0.054 + 0.023 + 0.047 = 0.124; since 0.22 is greater than 0.124, the method jumps to step S4.
Step S3, left-aligning the pattern string with the text string, wherein the text string is the text to be searched, and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from left to right; if all characters match, matching succeeds and the procedure ends; if not, jumping to step S5;
Step S4, left-aligning the pattern string with the text string, wherein the text string is the text to be searched, and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from right to left; if all characters match, matching succeeds and the procedure ends; if not, jumping to step S5;
Step S5, if the character immediately following the last text character that participated in the comparison does not appear in the pattern string, sliding the pattern string from left to right by a step equal to the length of the pattern string plus 1; otherwise, sliding the pattern string from left to right by a step equal to the distance from the rightmost occurrence of that character in the pattern string to the end of the pattern string, plus 1; then, if step S5 was entered from step S3, repeating step S3, and if step S5 was entered from step S4, repeating step S4.
The dotted line in FIG. 1 indicates the return to the previous step (S3 or S4) after the pattern string has been slid to the right following a failed comparison.
The present invention is described in further detail below with a specific example: finding a pattern string in the text string of an English article.
The text string is: s u b s t r i n g s e a r c h i n g
The pattern string P is: s e a r c h
Let P be the pattern string, m its length, and X the character immediately following the last text character that participated in the comparison. The shift distance is then given by:
shift[X] = m - max(position of X in P), if X appears in P (positions counted from 0)
shift[X] = m + 1, otherwise
The shift table is calculated in advance.
In this example, the pattern string P is "search" and its length is m = 6:
shift[s] = 6 - max(position of s) = 6 - 0 = 6
shift[e] = 6 - max(position of e) = 6 - 1 = 5
shift[a] = 6 - max(position of a) = 6 - 2 = 4
shift[r] = 6 - max(position of r) = 6 - 3 = 3
shift[c] = 6 - max(position of c) = 6 - 4 = 2
shift[h] = 6 - max(position of h) = 6 - 5 = 1
shift[other] = m + 1 = 6 + 1 = 7
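The precomputed shift table can be reproduced with a short Python sketch. The helper names `build_shift_table` and `shift_for` are hypothetical and chosen only for illustration; the table follows the rule m minus the rightmost 0-based position of each character, with m + 1 for characters absent from the pattern.

```python
def build_shift_table(pattern):
    """Map each pattern character to m - (its rightmost 0-based position)."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the rightmost position wins.
    return {ch: m - i for i, ch in enumerate(pattern)}


def shift_for(table, ch, m):
    """Shift for character ch; characters not in the pattern shift by m + 1."""
    return table.get(ch, m + 1)


table = build_shift_table("search")
print(table)                     # {'s': 6, 'e': 5, 'a': 4, 'r': 3, 'c': 2, 'h': 1}
print(shift_for(table, "i", 6))  # 7
```

A dictionary keeps the sketch independent of the alphabet size; a fixed-size array indexed by character code would be the usual choice for byte-oriented text.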
Step S1 obtains the probability of each character in the pattern string from the statistics in the existing literature, and the method then goes to step S2.
In step S2, the pattern string length is 6, so n = 3, and the probability sums of the first 3 and the last 3 characters of the pattern string are calculated; the sum for the first 3 characters (0.22) is greater than the sum for the last 3 characters (0.124), so the method goes to step S4 and compares from right to left.
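First matching: the pattern string is aligned with the text characters s u b s t r. Comparing from right to left, r does not match h. Number of comparisons: 1. The next text character, i, does not appear in the pattern string, so the pattern string slides by 7.
Second matching: the pattern string is aligned with the text characters n g s e a r. Comparing from right to left, r does not match h. Number of comparisons: 1. The next text character is c, so the pattern string slides by shift[c] = 2.
Third matching (successful):
Main string: s u b s t r i n g s e a r c h i n g
Pattern string: s e a r c h (aligned with the text characters s e a r c h)
Number of comparisons: 6
The total number of comparisons is 1 + 1 + 6 = 8.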
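FIG. 2 is a flow chart of the original Sunday algorithm. The original algorithm does not prescribe whether the pattern string is compared from left to right or from right to left, so a poorly chosen comparison direction adds many wasted comparisons. If left-to-right comparison is adopted:
First matching: the pattern string is aligned with the text characters s u b s t r. Comparing from left to right, s matches s, but u does not match e. Number of comparisons: 2. The next text character, i, does not appear in the pattern string, so the pattern string slides by 7.
Second matching: the pattern string is aligned with the text characters n g s e a r. Comparing from left to right, n does not match s. Number of comparisons: 1. The next text character is c, so the pattern string slides by shift[c] = 2.
Third matching (successful):
Main string: s u b s t r i n g s e a r c h i n g
Pattern string: s e a r c h (aligned with the text characters s e a r c h)
Number of comparisons: 6
The total number of comparisons is 2 + 1 + 6 = 9.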
Therefore, in this example the improved Sunday algorithm reduces the total number of comparisons by 1 (from 9 to 8); when the data volume of a specific service scenario is large, the gain in matching efficiency becomes significant.
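The two comparison counts in this example can be checked with a small Python sketch of Sunday matching that counts character comparisons. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `sunday_search` and the boolean parameter `left_to_right` are hypothetical, and the comparison direction is passed in directly rather than derived from character probabilities.

```python
def sunday_search(text, pattern, left_to_right):
    """Return (match_index, comparisons); match_index is -1 if not found."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # Shift table: m minus the rightmost 0-based position of each character.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    # Order in which pattern positions are compared (steps S3 / S4).
    order = range(m) if left_to_right else range(m - 1, -1, -1)
    comparisons = 0
    pos = 0
    while pos + m <= n:
        matched = True
        for j in order:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                matched = False
                break
        if matched:
            return pos, comparisons
        nxt = pos + m  # index of the character just past the current window
        if nxt >= n:
            break
        # Step S5: slide by the table value, or by m + 1 if the character
        # does not appear in the pattern.
        pos += shift.get(text[nxt], m + 1)
    return -1, comparisons


text = "substringsearching"
print(sunday_search(text, "search", left_to_right=True))   # (9, 9)
print(sunday_search(text, "search", left_to_right=False))  # (9, 8)
```

Both runs find the pattern at position 9; only the number of comparisons differs, because the shifts in step S5 do not depend on the comparison direction.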
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (3)

1. An improved single-pattern matching method based on the Sunday algorithm, characterized by comprising the following steps:
S1, calculating the occurrence probabilities of the characters at the head and at the tail of the pattern string by using the characteristic information of the pattern string, wherein the pattern string is the character string to be matched;
S2, calculating the probability sum of the first n characters and the probability sum of the last n characters of the pattern string, wherein n is 1, 2 or 3; if the probability sum of the first n characters is less than or equal to that of the last n characters, jumping to step S3; otherwise, jumping to step S4;
S3, left-aligning the pattern string with the text string, wherein the text string is the text to be searched, and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from left to right; if all characters match, matching succeeds and string matching ends; if not, jumping to step S5;
S4, right-aligning the pattern string with the text string and sliding the pattern string along the text string; if the pattern string slides past the end of the text string, matching fails; otherwise, comparing the characters at corresponding positions of the pattern string and the text string from right to left; if all characters match, matching succeeds and string matching ends; if not, jumping to step S5;
S5, if the character immediately following the last text character that participated in the comparison does not appear in the pattern string, sliding the pattern string from left to right by a step equal to the length of the pattern string plus 1; otherwise, sliding the pattern string from left to right by a step equal to the distance from the rightmost occurrence of that character in the pattern string to the end of the pattern string, plus 1; then, if step S5 was entered from step S3, repeating step S3, and if step S5 was entered from step S4, repeating step S4.
2. The improved single-pattern matching method based on the Sunday algorithm as claimed in claim 1, wherein in step S1 the character probabilities at the head and tail of the pattern string are calculated according to the pattern matching application scenario: when the application scenario is searching for a character string in an English article, the probability of each character is taken from the statistics of character occurrence probabilities in English articles given in the existing literature; when the application scenario is searching for a pattern string in a random, unstructured text string, a sample of the text string is taken at a user-defined sampling ratio, and the occurrence probability of each character is calculated from the sample.
3. The improved single-pattern matching method based on the Sunday algorithm as claimed in claim 1, wherein the value of n is determined as follows:
if the pattern string length m satisfies m ≥ 1 and m < 3, then n = 1; if m ≥ 3 and m < 5, then n = 2; if m ≥ 6, then n = 3.
CN201910221407.XA 2019-03-22 2019-03-22 Sunday algorithm-based improved single-mode matching method Active CN109977276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910221407.XA CN109977276B (en) 2019-03-22 2019-03-22 Sunday algorithm-based improved single-mode matching method

Publications (2)

Publication Number Publication Date
CN109977276A CN109977276A (en) 2019-07-05
CN109977276B true CN109977276B (en) 2020-12-22

Family

ID=67080016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910221407.XA Active CN109977276B (en) 2019-03-22 2019-03-22 Sunday algorithm-based improved single-mode matching method

Country Status (1)

Country Link
CN (1) CN109977276B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489997A (en) * 2019-08-16 2019-11-22 北京计算机技术及应用研究所 A kind of sensitive information desensitization method based on pattern matching algorithm
CN110674364B (en) * 2019-08-30 2021-11-23 北京浩瀚深度信息技术股份有限公司 Method for realizing sliding character string matching by utilizing FPGA (field programmable Gate array)
CN111159490B (en) * 2019-12-13 2023-05-26 杭州迪普科技股份有限公司 Method, device and equipment for processing pattern character strings
CN111125459A (en) * 2019-12-25 2020-05-08 中消云(北京)物联网科技研究院有限公司 Character string processing method and device
CN111814009B (en) * 2020-06-28 2022-03-01 四川长虹电器股份有限公司 Mode matching method based on search engine retrieval information
CN112069303B (en) * 2020-09-17 2022-08-16 四川长虹电器股份有限公司 Matching search method and device for character strings and terminal
CN113010882B (en) * 2021-03-18 2022-08-30 哈尔滨工业大学 Custom position sequence pattern matching method suitable for cache loss attack
CN113672779B (en) * 2021-08-11 2023-07-14 国网浙江省电力有限公司绍兴供电公司 Character string matching method, equipment and medium for transformer substation message sequence detection
CN114461865A (en) * 2022-03-14 2022-05-10 深圳希施玛数据科技有限公司 Character string matching method, device and storage medium
CN115065496B (en) * 2022-04-13 2024-05-07 山石网科通信技术股份有限公司 Authentication user role mapping information generation method and device on network security equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271468A (en) * 2008-05-05 2008-09-24 哈尔滨工程大学 Method for accelerating character string matching by trans-border protection mechanism
JP5391583B2 (en) * 2008-05-29 2014-01-15 富士通株式会社 SEARCH DEVICE, GENERATION DEVICE, PROGRAM, SEARCH METHOD, AND GENERATION METHOD
CN101609455A (en) * 2009-07-07 2009-12-23 哈尔滨工程大学 A kind of method of high-speed accurate single-pattern character string coupling
CN103425739B (en) * 2013-07-09 2016-09-14 国云科技股份有限公司 A kind of character string matching method
CN104850241A (en) * 2015-05-28 2015-08-19 北京奇点机智信息技术有限公司 Mobile terminal and text input method thereof
CN107220333B (en) * 2017-05-24 2020-01-31 电子科技大学 character search method based on Sunday algorithm

Also Published As

Publication number Publication date
CN109977276A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared