CN111159490B - Method, device and equipment for processing pattern character strings - Google Patents
- Publication number
- CN111159490B (application CN201911280495.7A)
- Authority
- CN
- China
- Prior art keywords
- character
- character string
- string
- mode
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a method, a device and equipment for processing pattern strings. The method comprises: determining any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of the pattern string to be processed; determining a score for the substring of the pattern string to be processed that lies in the sliding window, based on the character distribution of that substring; moving the sliding window toward the tail/head end of the pattern string to be processed by a preset step until the window can no longer move; and determining the filtering string of the pattern string to be processed based on the scores of the substrings that fell in the sliding window, until a filtering string has been determined for every pattern string. Because the filtering strings are determined from the character distribution of the pattern strings, multi-pattern string matching based on these filtering strings can avoid a sharp drop in performance, and the matching efficiency is improved.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, and a device for processing a pattern string.
Background
Multi-pattern string matching means that a plurality of strings are given as pattern strings, and all occurrences of these pattern strings, together with their positions, are then found in a piece of data to be matched.
Multi-pattern string matching is applied in more and more business scenarios, such as feature-code-based protocol identification and attack detection. As applications grow more complex, the number of pattern strings keeps increasing and the performance requirements on matching keep rising, which challenges existing multi-pattern matching algorithms. For example, the classical AC automaton algorithm becomes unusable in some cases because of its huge memory consumption, while filtering-based multi-pattern matching algorithms are increasingly applied because they occupy less memory and, in most cases, match faster.
However, filtering-based multi-pattern matching algorithms also suffer a sharp drop in performance in some scenarios (the specific scenarios are described later). Research shows that in most cases the cause is that the filtering string of each pattern string is fixed to its first K or last K characters. How to determine the filtering string within a pattern string is therefore the key to solving this performance problem.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, and a device for processing a pattern string, which can determine a filtering string of the pattern string based on a character distribution condition of the pattern string, so that when matching multiple strings is performed based on the filtering string determined in the present application, a problem that performance is greatly reduced in some scenes can be avoided.
In order to achieve the above object, the present application provides a method for processing a pattern string, where the method includes:
S1, determining any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of the pattern string to be processed;
S2, determining the score of the substring of the pattern string to be processed that lies in the sliding window, based on the character distribution of that substring;
S3, moving the sliding window toward the tail/head end of the pattern string to be processed by a preset step, and continuing to execute S2 until the gap between the tail/head end of the sliding window and the tail/head end of the pattern string to be processed is smaller than the preset step;
S4, determining the filtering string of the pattern string to be processed based on the scores of the substrings that fell in the sliding window, and continuing to execute S1 to process the next pattern string until a filtering string has been determined for every pattern string.
In an optional implementation manner, after determining the filtering string of the pending mode string in S4, the method further includes:
counting the occurrence times of each character in the filtering character string;
updating a character occurrence frequency table by utilizing each character and the corresponding occurrence times;
correspondingly, determining the score of the substring in the sliding window based on its character distribution comprises:
determining the score of the substring in the sliding window based on the number of occurrences, in the character occurrence frequency table, of each character of that substring; wherein the score is inversely related to the number of occurrences.
In an optional implementation manner, determining the score based on the number of occurrences in the character occurrence frequency table comprises:
looking up, in the character occurrence frequency table, the number of occurrences of each character of the substring in the sliding window, and calculating the average of these numbers;
if the average is greater than a preset threshold, determining the score of the substring based on the average; wherein the score is inversely related to the average.
In an alternative embodiment, the method further comprises:
if the average is not greater than the preset threshold, determining the score of the substring based on the number of consecutively repeated characters in the substring in the sliding window; wherein the score is inversely related to the number of repetitions.
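The scoring rule in the optional implementations above can be sketched as follows. This is a minimal illustration, not the formula of the application: the text only requires that the score fall as the average occurrence count or the repetition count rises, so the reciprocal formulas, the default threshold, the reading of "number of repetitions" as the longest run of one repeated character, and the names `longest_run`, `window_score` and `update_freq` are all assumptions.

```python
from collections import Counter


def longest_run(s: str) -> int:
    """Length of the longest run of one consecutively repeated character."""
    best = run = 0
    prev = None
    for ch in s:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


def window_score(window: str, freq: Counter, threshold: float = 1.0) -> float:
    """Score a window; higher is better (assumed reciprocal formulas)."""
    # Average occurrence count of the window's characters in the
    # character occurrence frequency table (missing characters count as 0).
    avg = sum(freq[ch] for ch in window) / len(window)
    if avg > threshold:
        # Score inversely related to the average.
        return 1.0 / (1.0 + avg)
    # Otherwise: score inversely related to the number of repetitions.
    return 1.0 / longest_run(window)


def update_freq(freq: Counter, filter_string: str) -> None:
    """Add the characters of a chosen filtering string to the table."""
    freq.update(filter_string)
```

How the two branches should be scaled relative to each other is not stated in the text, so the relative magnitudes produced here carry no meaning beyond the two inverse relations.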
In an optional implementation manner, the determining the score of the character string based on the character distribution situation of the character string of the to-be-processed mode character string in the sliding window includes:
and determining the score of the character string based on the repeated number of the continuous repeated characters in the character string of the mode to be processed in the sliding window and the occurrence frequency of each character in the character string in the character occurrence frequency table.
In an optional implementation manner, after determining the filtering string of the pending mode string in S4, the method further includes:
and recording the position information of the filtering character string in the character string of the mode to be processed.
In a second aspect, the present application further provides a processing apparatus for a pattern string, where the apparatus includes:
a first determining module, configured to determine any one of the mode strings as a mode string to be processed, and align a head/tail end of a sliding window with a preset length with a head/tail end of the mode string to be processed;
the second determining module is used for determining the score of the character string in the sliding window based on the character distribution condition of the character string of the to-be-processed mode in the sliding window;
the moving module is used for moving the sliding window to the tail/head end of the character string of the mode to be processed in a preset step length, and continuing to trigger the second determining module until the length of the interval between the tail/head end of the sliding window and the tail/head end of the character string of the mode to be processed is smaller than the preset step length;
and the third determining module is used for determining the filtering character string of the to-be-processed mode character string based on the score of each character string of the to-be-processed mode character string in the sliding window, and continuously triggering the first determining module to process the next mode character string until each mode character string completes the determination of the filtering character string.
In an alternative embodiment, the apparatus further comprises:
the statistics module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence times;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence times of each character in the character string in the sliding window of the character string of the mode to be processed in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
In a third aspect, the present application also provides a computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform the method according to any one of the implementations described above.
In a fourth aspect, the present application further provides a processing device for a pattern string, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of the implementations described above when executing the computer program.
In the processing method of the mode character strings, before matching of the multimode character strings, filtering character strings are dynamically determined for the mode character strings based on character distribution conditions of the mode character strings, so that the filtering character strings are not characters in fixed positions. The filtering character strings dynamically determined based on the embodiment of the application are used for matching the multi-mode character strings, so that the condition that the performance is greatly reduced caused by frequent entering of a confirmation stage can be avoided, and the matching efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
Fig. 1 is a flowchart of a method for processing a pattern string according to an embodiment of the present application;
Fig. 2 is a flowchart of another method for processing a pattern string according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a processing device for a pattern string according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a processing device for a pattern string according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the filtering strings formed by characters of the three pattern strings "search", "filter" and "algorithm" in the embodiment of the present application;
Fig. 6 is a schematic diagram of the filtering strings formed by characters of the three pattern strings "seaaaa", "filter" and "algorithm" in the embodiment of the present application;
Fig. 7 is a schematic diagram of the filtering strings formed by characters of the three pattern strings "baidu.com", "sina.com" and "alibaba.com" in the embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Currently, the multi-pattern string matching methods in mainstream use fall into two types: automaton-based methods and filtering-based methods. The former are represented by the AC algorithm, and the latter by the SOG (Shift-Or with q-Gram) algorithm. The AC algorithm completes multi-pattern matching in linear time, but its memory footprint grows sharply as the number of pattern strings increases, so a large number of cache misses occur when memory is accessed in the matching stage, which reduces matching efficiency. Filtering-based methods rest on an assumption that holds in most application scenarios: pattern strings rarely appear in the data to be matched, so finding one is like looking for a needle in a haystack. A filtering-based method therefore first filters the data to be matched and compares pattern strings against the data only when necessary, which greatly reduces memory usage and improves matching efficiency.
Before describing the processing method of the pattern character string provided by the application, a simple description is first provided for a matching method of the multimode character string based on filtering so as to facilitate understanding of the processing method of the pattern character string provided by the application as a whole.
Currently, in a matching method of multi-mode strings based on filtering, substrings at fixed positions in the mode strings are generally selected as filtering strings for filtering. Assuming that consecutive characters in the pattern string are selected for filtering as filtering strings, the first k or the last k characters in the pattern string are generally selected.
Taking the SOG algorithm as an example, the following describes a filtering-based multi-pattern matching method and the problems it has in some scenarios. Let k=4 and let there be three pattern strings, "search", "filter" and "algorithm", where the filtering string of each pattern string consists of its last 4 characters. Specifically, in the SOG algorithm:
First, the three pattern strings "search", "filter" and "algorithm" are tail-aligned, as shown in fig. 5.
The characters in the black boxes of the three pattern strings form the filtering string of the corresponding pattern string. Each character that appears in a filtering string corresponds to a bit array of length k, which may be called the state of the character, and the states of all such characters form a state table representing the distribution of characters across the filtering strings. Specifically, the state of a character may be determined as follows: if the character appears at the i-th position of the filtering string of any pattern string, the i-th bit of its state is set to 0; all other bits are set to 1. The state table determined for the three pattern strings in this way is as follows:
a:0111
c:1101
e:1101
h:1100
i:0111
l:0111
m:1110
r:1010
t:1011
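The state determining method described above can be sketched in a few lines. The function names `build_state_table` and `fmt` are hypothetical, the sketch assumes every pattern string has at least k characters, and it stores each state as an integer whose leftmost (most significant) of k bits is position 1, matching the way the table above is written.

```python
def build_state_table(patterns, k):
    """Map each character of the last-k filtering strings to its state."""
    full = (1 << k) - 1                      # k ones: character appears nowhere
    table = {}
    for p in patterns:
        tail = p[-k:]                        # fixed filtering string: last k chars
        for i, ch in enumerate(tail):        # i = 0 corresponds to position 1
            # Clear the bit for this position; positions set by other
            # pattern strings stay cleared.
            table[ch] = table.get(ch, full) & ~(1 << (k - 1 - i))
    return table


def fmt(state, k):
    """Render a state as a k-character bit string, position 1 first."""
    return format(state, f"0{k}b")


table = build_state_table(["search", "filter", "algorithm"], 4)
for ch in sorted(table):
    print(f"{ch}:{fmt(table[ch], 4)}")
```

Characters that occur in no filtering string are simply absent from the dictionary; a lookup would treat them as the all-ones state.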
In practical application, when multi-pattern matching is performed, a state S is first initialized to 0^k, that is, k consecutive 0s. The characters of the data to be matched are then read in one at a time; for each character, its state Sc is looked up in the predetermined state table and ORed with S. Once k characters of the data to be matched have been read, the k-th bit of the operation result is checked: if it is 0, the k most recently read characters may be the filtering string of at least one pattern string, and a confirmation stage must be entered for further confirmation. After each character has been processed, the OR result is shifted right by 1 bit to update S. In the confirmation stage, all candidate pattern strings are first determined from the k most recently read characters; the tail of each candidate pattern string is then aligned with the most recently read character of the data to be matched, and the characters at corresponding positions are compared from back to front. If all of them are identical, the data to be matched matches the candidate pattern string; otherwise the match fails.
The matching method is described below with a concrete example. Assuming that the data to be matched is "architecture", multi-pattern matching proceeds as follows:
1) Setting the initial state S as 0000;
2) Reading in the first character a of the data to be matched "architecture", and ORing the state 0111 corresponding to the character a in the predetermined state table with S, which gives 0000|0111=0111; the operation result 0111 is then shifted right by 1 bit to obtain 0011, which updates S.
3) Reading in the second character r of the data to be matched "architecture", and ORing the state 1010 corresponding to the character r in the state table with the updated S (0011), which gives 0011|1010=1011; the operation result 1011 is then shifted right by 1 bit to obtain 0101, which updates S.
4) Reading a third character c in the data to be matched "architecture", and using a state 1101 corresponding to the character c in the state table and the updated S:0101 performs an or operation, and the obtained operation result is: 0101|1101=1101, and then shifting the operation result 1101 by 1 bit to the right to obtain 0110 for updating S.
5) Reading a fourth character h in the data to be matched 'architecture', and using a state 1100 corresponding to the character h in the state table and the updated S:0110 performs OR operation, and the obtained operation result is: 0110|1100=1110.
Since 4 characters have been read at this point, whether the 4th bit of the operation result is 0 must be checked. The 4th bit of the operation result 1110 is 0, which indicates that the last 4 characters of some pattern string may have been matched, so the confirmation stage is entered. In the confirmation stage, the candidate pattern string determined from the 4 most recently read characters "arch" is "search"; the tail of "search" is then aligned with the tail of the read portion "arch", and the characters are confirmed from back to front. Clearly, "architecture" and "search" do not match, because no characters precede "arch" in the data to correspond to "se". The operation result 1110 is then shifted right by 1 bit to obtain 0111, which updates S.
6) Reading in the remaining characters of the data to be matched "architecture" one by one in the manner of steps 2)-5) until the matching process is complete.
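The filtering stage walked through in the steps above can be sketched as follows. It is a minimal illustration of the OR-and-shift loop as this description presents it, not a complete SOG implementation: the confirmation stage is omitted, the names `build_state_table` and `sog_filter` are hypothetical, and states are integers whose leftmost of k bits is position 1, so the k-th bit is the least significant one.

```python
def build_state_table(patterns, k):
    """State table for filtering strings made of the last k characters."""
    full = (1 << k) - 1
    table = {}
    for p in patterns:
        for i, ch in enumerate(p[-k:]):
            table[ch] = table.get(ch, full) & ~(1 << (k - 1 - i))
    return table


def sog_filter(data, table, k):
    """Return the indices in data at which the confirmation stage is entered."""
    full = (1 << k) - 1
    s = 0                                    # initial state: k zeros
    hits = []
    for n, ch in enumerate(data):
        s |= table.get(ch, full)             # OR in the state of the character
        # Once k characters have been read, a 0 in the k-th bit means the
        # last k characters may be the filtering string of some pattern.
        if n + 1 >= k and (s & 1) == 0:
            hits.append(n)
        s >>= 1                              # shift right by 1 bit to update S
    return hits


table = build_state_table(["search", "filter", "algorithm"], 4)
print(sog_filter("architecture", table, 4))
```

For "architecture" the only index reported is that of the character h, which is the single point at which the steps above enter the confirmation stage.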
In general, most of the characters read from the data to be matched are filtered out in the filtering stage of the SOG algorithm, and only a few enter the confirmation stage. Combined with the very small memory footprint of the SOG algorithm, this usually makes matching very efficient. The inventors have found, however, that the efficiency of the algorithm drops significantly in some special scenarios.
The first scenario is: the filtering string determined for some pattern string contains consecutively repeated characters, and the data to be matched also contains the same character repeated consecutively.
This scenario is explained with the following specific example:
Assume that the pattern strings are "seaaaa", "filter" and "algorithm", and that each filtering string consists of the last four characters. As shown in fig. 6, the characters in the black boxes form the filtering string of the corresponding pattern string.
Based on the state determining method, the state table determined for the three pattern strings is as follows:
a:0000
e:1101
h:1101
i:0111
l:0111
m:1110
r:1110
t:1011
*:1111
Assume that the data to be matched in this scenario is "xxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". After "xxxaaa" has been read in, S is 0000. From then on, each time a character a is read in, its state 0000 is ORed with S (0000) to give 0000, which is shifted right by 1 bit to update S, so the updated S is again 0000. The confirmation stage therefore has to be entered after every character a that is read. Because the confirmation stage of the SOG algorithm is very time-consuming, entering it this frequently causes a sharp drop in performance, and the matching efficiency of the algorithm falls accordingly.
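The effect of the first scenario can be reproduced with a short sketch that counts how often the filtering stage hands over to the confirmation stage. The function name `confirmation_hits` is hypothetical, the input is a shortened version of the data above (three x followed by ten a), and the code models only the filtering stage as described in this section.

```python
def confirmation_hits(patterns, data, k):
    """Indices in data at which the confirmation stage would be entered."""
    full = (1 << k) - 1
    table = {}
    for p in patterns:                       # last-k filtering strings
        for i, ch in enumerate(p[-k:]):
            table[ch] = table.get(ch, full) & ~(1 << (k - 1 - i))
    s, hits = 0, []
    for n, ch in enumerate(data):
        s |= table.get(ch, full)
        if n + 1 >= k and (s & 1) == 0:      # k-th bit is 0: candidate found
            hits.append(n)
        s >>= 1
    return hits


patterns = ["seaaaa", "filter", "algorithm"]
data = "xxx" + "a" * 10                      # shortened stand-in for the data above
hits = confirmation_hits(patterns, data, 4)
print(f"{len(hits)} of {len(data)} characters enter the confirmation stage")
```

From the fourth a onward every character is a hit, so the share of characters that reach the expensive confirmation stage approaches 100% as the run of a grows.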
The second scene is: the filter strings determined for the plurality of pattern strings are identical, and the filter strings are included in the data to be matched.
For the above scenario, the following specific examples are used for specific explanation:
Assume that the pattern strings are "baidu.com", "sina.com" and "alibaba.com", and that each filtering string consists of the last four characters. Obviously, the filtering strings of the three pattern strings are identical. As shown in fig. 7, the characters in the black boxes form the filtering string of the corresponding pattern string.
Based on the state determining method, the state table determined for the three pattern strings is as follows:
.:0111
c:1011
o:1101
m:1110
*:1111
Assume that the data to be matched is "xx.com yyyy.com", which contains the shared filtering string ".com". In this scenario, because the filtering strings of all three pattern strings are ".com", every time ".com" is read in, all three candidate pattern strings must enter the confirmation stage. Because the confirmation stage of the SOG algorithm is very time-consuming, entering it this frequently causes a sharp drop in performance, and the matching efficiency of the algorithm falls accordingly.
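The cost in the second scenario can be illustrated by counting candidate confirmations. This is a simplified sketch: it looks up each k-character window of the data directly in a dictionary instead of using the bit-parallel state table, which is an assumption made for brevity, and the names `group_by_tail` and `confirmations_needed` are hypothetical.

```python
from collections import defaultdict


def group_by_tail(patterns, k):
    """Group pattern strings by their fixed filtering string (last k chars)."""
    groups = defaultdict(list)
    for p in patterns:
        groups[p[-k:]].append(p)
    return dict(groups)


def confirmations_needed(patterns, data, k):
    """Total number of candidate pattern strings sent to confirmation."""
    groups = group_by_tail(patterns, k)
    total = 0
    for end in range(k, len(data) + 1):
        # Every pattern string sharing this filtering string is a candidate.
        total += len(groups.get(data[end - k:end], []))
    return total


patterns = ["baidu.com", "sina.com", "alibaba.com"]
print(group_by_tail(patterns, 4))
print(confirmations_needed(patterns, "xx.com yyyy.com", 4))
```

With one shared filtering string, each occurrence of ".com" in the data costs three confirmations, none of which succeeds for this input.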
Therefore, the present application provides a method for processing a pattern string for the above two scenarios, but not limited to the above two scenarios, and before matching a multimode string, the method dynamically determines a filtering string for each pattern string based on a character distribution condition of the pattern string, so that the filtering string is no longer a character in a fixed position. The multi-mode character string is matched based on the dynamically determined filtering character string, so that the condition that the performance is greatly reduced caused by frequent entering a confirmation stage can be avoided, and the matching efficiency is improved.
Based on this, the present application provides a method for processing a pattern string, and referring to fig. 1, a flowchart of a method for processing a pattern string provided in an embodiment of the present application is provided. Specifically, the processing method of the pattern character string comprises the following steps:
S1, determining any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of the pattern string to be processed.
The pattern strings are strings predetermined according to user requirements, and in a typical application scenario there are a plurality of them. In the embodiment of the application, a filtering string is determined dynamically for each pattern string and is used by the subsequent filtering-based multi-pattern matching method.
In the embodiment of the application, any one of the predetermined pattern strings is determined as the pattern string to be processed. Before the filtering string is determined, its length is fixed first; for example, the filtering string may be 4 characters long. A sliding window whose preset length equals the filtering string length is then used to determine the filtering string.
In practical application, the head end or tail end of the sliding window is aligned with the head end or tail end of the pattern string to be processed. In an alternative embodiment, the tail end of the sliding window is aligned with the tail end of the pattern string to be processed, and the sliding window is subsequently moved from back to front to determine the filtering string. In another alternative embodiment, the head end of the sliding window is aligned with the head end of the pattern string to be processed, and the sliding window is subsequently moved from front to back to determine the filtering string.
S2, determining the score of the character string in the sliding window based on the character distribution condition of the character string in the sliding window of the character string of the mode to be processed.
In this embodiment of the present application, for a string in a sliding window of a to-be-processed mode string, a corresponding score may be determined according to a character distribution condition of the string. The character distribution of the character string may represent the distribution of each character in the character string, the distribution of each character in the filtering character string of each mode character string, and the like.
The embodiment of the application combines the local distribution condition and the global distribution condition of each character of the character strings in the sliding window to determine the filtering character strings, so that continuous repeated characters in the determined filtering character strings and the condition that the filtering character strings of a plurality of mode character strings are identical can be avoided to a large extent, and the condition that the performance is greatly reduced when the matching of the multimode character strings is carried out is avoided.
Specifically, the manner in which the score of a character string is determined based on the character distribution is described in the following embodiments.
S3, moving the sliding window toward the tail/head end of the to-be-processed pattern string by a preset step length, and continuing to execute S2 until the distance between the tail/head end of the sliding window and the tail/head end of the to-be-processed pattern string is smaller than the preset step length.
In the embodiment of the application, after the score of the character string in the sliding window has been determined, the sliding window is moved by the preset step length so that it covers a new character string. Specifically, if the tail ends were aligned in S1, the sliding window moves from back to front; conversely, if the head ends were aligned in S1, the sliding window moves from front to back.
In practice, after each move of the sliding window, the score of the character string now in the window is determined from its character distribution, until the window can slide no further. In this way a score is obtained for every character string that the sliding window covers within the to-be-processed pattern string.
S4, determining the filter string of the to-be-processed pattern string based on the scores of the character strings that the sliding window covered within it, and continuing to execute S1 to process the next pattern string until a filter string has been determined for every pattern string.
In the embodiment of the application, once the score of each character string covered by the sliding window has been determined, these scores decide which character string is used as the filter string. The filter string of every pattern string is determined in this manner, for use by the subsequent filter-based multi-pattern string matching method.
In the pattern string processing method provided by the embodiment of the application, a filter string is determined dynamically for each pattern string, based on its character distribution, before multi-pattern matching is performed, so that the filter string is no longer taken from a fixed position. Using the dynamically determined filter strings for multi-pattern string matching avoids the sharp performance drop caused by frequently entering the confirmation stage, and thus improves matching efficiency.
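Steps S1 to S3 for the tail-aligned variant, followed by the selection in S4, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `windows_from_tail` and `pick_filter` are hypothetical, `score_fn` stands in for whichever character-distribution score is used, and the rule that the highest score wins is an assumption of this sketch.

```python
from typing import Callable, Iterator, Tuple


def windows_from_tail(pattern: str, k: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, substring) for a window of length k that begins aligned
    with the tail of `pattern` and moves toward the head by `step`."""
    start = len(pattern) - k            # S1: window tail aligned with pattern tail
    while start >= 0:
        yield start, pattern[start:start + k]   # S2 would score this substring
        start -= step                   # S3: stops once fewer than `step` characters remain


def pick_filter(pattern: str, k: int, step: int,
                score_fn: Callable[[str], float]) -> Tuple[int, str]:
    """S4 (assumed rule): return the (start, substring) with the highest score."""
    candidates = list(windows_from_tail(pattern, k, step))
    if not candidates:
        raise ValueError("pattern is shorter than the sliding window")
    return max(candidates, key=lambda c: score_fn(c[1]))


if __name__ == "__main__":
    for start, sub in windows_from_tail("sina.com", 4, 2):
        print(start, sub)
```

The loop condition `start >= 0` is equivalent to stopping when the gap between the window head and the pattern head is smaller than the step, because the next move would push the window past the head.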
The following embodiment of the present application provides a specific method for processing a pattern string. Referring to Fig. 2, a flowchart of another method for processing a pattern string is provided in the present application. The method comprises the following steps:
S201: pre-establishing a character occurrence frequency table, and initializing the elements in the table to 0; the character occurrence frequency table records the number of occurrences of each character in the filter strings.
S202: determining any pattern string as the pattern string to be processed, and aligning the tail end of a sliding window of preset length with the tail end of the pattern string to be processed.
S203: determining a first sub-score based on the number of occurrences, in the character occurrence frequency table, of each character of the character string that lies in the sliding window of the to-be-processed pattern string.
In the embodiment of the application, after the character string in the sliding window is determined, the number of occurrences of each of its characters is looked up in the character occurrence frequency table. The more often the characters of a string already occur in the table, the less suitable the string is as a filter string, so the first sub-score is inversely related to the number of occurrences.
In an alternative embodiment, the average of the numbers of occurrences of the characters in the string is calculated, and the first sub-score of the string is determined based on this average.
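The average-based first sub-score can be illustrated with a short sketch. The text only requires that the sub-score be inversely related to the number of occurrences; the concrete mapping `1 / (1 + average)` and the function names below are assumptions chosen for illustration.

```python
from collections import Counter


def average_occurrences(window: str, foc_table: Counter) -> float:
    """Average, over the characters of `window`, of their counts in the
    character occurrence frequency table (unseen characters count as 0)."""
    return sum(foc_table[ch] for ch in window) / len(window)


def first_sub_score(window: str, foc_table: Counter) -> float:
    """Assumed mapping: the score falls as the average occurrence count rises."""
    return 1.0 / (1.0 + average_occurrences(window, foc_table))


if __name__ == "__main__":
    foc_table = Counter(".com")          # table after ".com" was chosen as a filter string
    for window in (".com", "na.c", "sina"):
        print(window, average_occurrences(window, foc_table), first_sub_score(window, foc_table))
```

A `Counter` returns 0 for characters it has not seen, which matches a table whose elements are initialized to 0.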
S204: determining a second sub-score based on the repeat count of consecutive repeated characters in the character string that lies in the sliding window of the to-be-processed pattern string.
In the embodiment of the application, after the character string in the sliding window is determined, the repeat count of the most-repeated character in the character string is determined, that is, the length of its longest run of consecutive repeated characters. The more repeated characters a string contains, the less suitable it is as a filter string, so the second sub-score is inversely related to the repeat count.
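A small sketch of the second sub-score follows. It assumes that the repeat count is the length of the longest run of one character repeated consecutively, and that the sub-score is the reciprocal of that length; both the reciprocal mapping and the function names are illustrative assumptions.

```python
from itertools import groupby


def longest_run(window: str) -> int:
    """Length of the longest run of one character repeated consecutively."""
    return max((len(list(group)) for _, group in groupby(window)), default=0)


def second_sub_score(window: str) -> float:
    """Assumed mapping: the score falls as the longest run gets longer.
    `window` is expected to be non-empty."""
    return 1.0 / longest_run(window)


if __name__ == "__main__":
    for window in ("liba", "baaa", "aaaa"):
        print(window, longest_run(window), second_sub_score(window))
```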
S205: a score of the string is determined based on the first sub-score and the second sub-score.
In this embodiment of the present application, after the first sub-score and the second sub-score are determined, their product may be used as the score of the string, which serves as the basis for determining the filter string.
Multiplying the first sub-score by the second sub-score gives a score that reflects the character distribution of the string to a certain extent. However, the two sub-scores differ in magnitude and some of them are 0, so in many cases the product does not yield a usable score for the string. An alternative embodiment therefore determines the score as follows.
Matching performance suffers noticeably once the number of pattern strings that share the same filter string reaches a certain threshold. Therefore, when the character occurrence frequency table shows that the average number of occurrences of the characters in the string is greater than a preset threshold, the score of the string may be determined from the first sub-score, which is based on that average; otherwise, the score of the string may be determined from the repeat count of consecutive repeated characters in the character string that lies in the sliding window of the to-be-processed pattern string.
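The threshold rule can be sketched as a single function that switches between the two sub-scores. The reciprocal mappings used for each branch, and the name `window_score`, are assumptions of this sketch; the text only states which quantity each branch is based on and that the score is inversely related to it.

```python
from collections import Counter
from itertools import groupby


def longest_run(window: str) -> int:
    """Length of the longest run of one character repeated consecutively."""
    return max(len(list(group)) for _, group in groupby(window))


def window_score(window: str, foc_table: Counter, threshold: float) -> float:
    """Score a non-empty window: use the global average when it exceeds the
    threshold, otherwise use the local repeat count (assumed reciprocal forms)."""
    average = sum(foc_table[ch] for ch in window) / len(window)
    if average > threshold:
        return 1.0 / (1.0 + average)      # branch based on the first sub-score
    return 1.0 / longest_run(window)      # branch based on consecutive repeats


if __name__ == "__main__":
    foc_table = Counter({"a": 4})
    for window in ("aaaa", "baaa", "liba"):
        print(window, window_score(window, foc_table, threshold=2))
```

Because the two branches are on different scales, a practical implementation would probably need to normalise them before comparing windows that fall into different branches; the sketch leaves that open.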
S206: moving the sliding window toward the head end of the to-be-processed pattern string by the preset step length, and continuing to execute S203 until the distance between the head end of the sliding window and the head end of the to-be-processed pattern string is smaller than the preset step length.
S207: determining the filter string of the to-be-processed pattern string based on the scores of the character strings that the sliding window covered within it, and recording the position information of the filter string in the to-be-processed pattern string.
In order to facilitate matching of subsequent multimode strings, the embodiment of the present application further needs to record location information of the filtering strings in the pending mode strings, so as to perform subsequent matching based on the location information.
S208: and counting the occurrence times of each character in the filtering character string, and updating a character occurrence frequency table by utilizing each character and the corresponding occurrence times.
In the embodiment of the application, after determining the filtering character string of any mode character string, the occurrence times of each character in the filtering character string are counted, and then the character occurrence frequency table is updated.
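Recording the position of the chosen filter string (S207) and updating the character occurrence frequency table (S208) might look like the sketch below. The function name `record_filter` and the use of a start offset as the position information are assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple


def record_filter(foc_table: Counter, pattern: str, start: int, k: int) -> Tuple[str, int]:
    """Take the window of length k at `start` as the filter string, add its
    characters to the frequency table, and return (filter string, position)."""
    filter_string = pattern[start:start + k]
    foc_table.update(filter_string)       # one increment per character occurrence
    return filter_string, start


if __name__ == "__main__":
    foc_table = Counter()                 # S201: every element starts at 0
    print(record_filter(foc_table, "baidu.com", 5, 4))
    print(record_filter(foc_table, "sina.com", 0, 4))
    print(dict(foc_table))
```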
S209: returning to S202 to process the next pattern string, until a filter string has been determined for every pattern string.
The following describes the processing method of the pattern character string provided in the above embodiment with a specific example, wherein it is assumed that the following three pattern character strings are predetermined:
baidu.com
sina.com
alibaaaaaaa
Here the sliding window length is set to k=4, the sliding step of the window to t=2, and the preset threshold for the first sub-score global_score to 2. The elements of the pre-created character occurrence frequency table FOC_TABLE are initialized to 0.
Specifically, the sliding window for the pattern string baidu.com and the corresponding first sub-score global_score and second sub-score local_score are shown in table 1:
TABLE 1
In this case, the character string of any one of the sliding windows may be selected as the filter string of the pattern string baidu.com. Suppose that the sliding window in the second row of Table 1 is selected as the optimal sliding window, so that .com is the filter string of the pattern string baidu.com; the number of occurrences of each character in .com is counted and used to update FOC_TABLE. The updated FOC_TABLE is shown in Table 2:
character(s) | foc (frequency of occurrence) |
---|---|
. | 1 |
c | 1 |
o | 1 |
m | 1 |
Others | 0 |
TABLE 2
Then the pattern string sina.com is processed. Its sliding windows and the corresponding local_score and global_score are shown in Table 3:
TABLE 3
Since the score determined from the local_score and global_score in the last row of Table 3 is the smallest, the corresponding string sina can be determined as the filter string of the pattern string sina.com. The updated FOC_TABLE is shown in Table 4:
character(s) | foc (frequency of occurrence) |
---|---|
. | 1 |
c | 1 |
o | 1 |
m | 1 |
s | 1 |
i | 1 |
n | 1 |
a | 1 |
TABLE 4
Finally, the pattern string alibaaaaaaa is processed. Its sliding windows and the corresponding local_score and global_score are shown in Table 5:
TABLE 5
As can be seen from Table 5, the score determined from the local_score and global_score in the last row of Table 5 is the smallest, so the corresponding string liba can be determined as the filter string of the pattern string alibaaaaaaa, and the number of occurrences of each character in liba is counted to update FOC_TABLE. The updated FOC_TABLE is shown in Table 6:
TABLE 6
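The worked example can be replayed end to end with a short sketch. Since the per-window values of Tables 1, 3 and 5 are not reproduced in this text, the numbers the sketch computes are its own. It assumes that global_score is the average table count of the window's characters, that local_score is the longest run of a consecutively repeated character, that the two are simply added into a penalty, and that the window with the smallest penalty wins, with ties going to the window nearest the tail. The preset threshold of 2 is not used in this simplified combination.

```python
from collections import Counter
from itertools import groupby
from typing import List, Tuple


def longest_run(window: str) -> int:
    """Length of the longest run of one character repeated consecutively."""
    return max(len(list(group)) for _, group in groupby(window))


def choose_filters(patterns: List[str], k: int = 4, step: int = 2
                   ) -> Tuple[List[Tuple[str, int]], Counter]:
    """Return [(filter string, start position)] per pattern and the final table."""
    foc_table: Counter = Counter()                 # S201
    chosen = []
    for pattern in patterns:                       # S202
        best = None
        start = len(pattern) - k                   # tail-aligned window
        while start >= 0:
            window = pattern[start:start + k]
            global_score = sum(foc_table[ch] for ch in window) / k   # S203
            local_score = longest_run(window)                        # S204
            penalty = global_score + local_score   # S205, assumed combination
            if best is None or penalty < best[0]:
                best = (penalty, start, window)
            start -= step                          # S206
        if best is None:
            raise ValueError(f"pattern {pattern!r} is shorter than the window")
        _, position, filter_string = best          # S207
        foc_table.update(filter_string)            # S208
        chosen.append((filter_string, position))
    return chosen, foc_table


if __name__ == "__main__":
    filters, table = choose_filters(["baidu.com", "sina.com", "alibaaaaaaa"])
    print(filters)
```

Under these assumptions the sketch selects .com, sina and liba, the same filter strings as the example above.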
Then, a state table is created from the filter strings ".com", "sina" and "liba" determined for the pattern strings. The creation process of the state table is not described in detail here; the resulting table is as follows:
a:1110
b:1101
c:1011
i:1011
l:0111
m:1110
n:1101
s:0111
*:1111
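The listing above is consistent with a bit mask per character in which a 0 at position j (counted from the left) means that the character occurs at position j of at least one filter string, and `*` is the all-ones default for every other character. That reading is an inference from the listed values, not something the text states, and the sketch below, with the hypothetical name `build_state_table`, simply applies it. It also produces entries for `.` and `o`, which the listing above does not show.

```python
from typing import Dict, Iterable


def build_state_table(filters: Iterable[str], k: int) -> Dict[str, str]:
    """Map each character to a k-bit string; bit j (from the left) is 0 when
    the character appears at position j of some filter string."""
    full = (1 << k) - 1                       # all ones: character matches nowhere
    masks: Dict[str, int] = {}
    for filter_string in filters:
        for position, ch in enumerate(filter_string):
            mask = masks.get(ch, full)
            masks[ch] = mask & ~(1 << (k - 1 - position))   # clear bit for this position
    table = {ch: format(mask, f"0{k}b") for ch, mask in sorted(masks.items())}
    table["*"] = format(full, f"0{k}b")       # default entry for all other characters
    return table


if __name__ == "__main__":
    for ch, bits in build_state_table([".com", "sina", "liba"], 4).items():
        print(f"{ch}:{bits}")
```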
In practice, filter strings determined in the above manner guard against two scenarios. Scenario one: the filter string determined for a pattern string contains consecutive repeated characters, and the data to be matched also contains the same character repeated consecutively. Scenario two: the filter strings determined for several pattern strings are identical, and the data to be matched contains that filter string. In both scenarios the confirmation stage would be entered frequently; avoiding them prevents the resulting sharp performance drop and improves matching efficiency.
Based on the above method embodiments, the present application further provides a processing device for a pattern string, and referring to fig. 3, fig. 3 is a schematic structural diagram of a processing device for a pattern string according to an embodiment of the present application, where the device includes:
a first determining module 301, configured to determine any one of the mode strings as a mode string to be processed, and align a head/tail end of a sliding window with a preset length with the head/tail end of the mode string to be processed;
a second determining module 302, configured to determine a score of the character string based on a character distribution situation of the character string of the to-be-processed mode character string in the sliding window;
a moving module 303, configured to move the sliding window toward the tail/head end of the to-be-processed mode string with a preset step length, and continue to trigger the second determining module until the interval length between the tail/head end of the sliding window and the tail/head end of the to-be-processed mode string is smaller than the preset step length;
and a third determining module 304, configured to determine a filtering string of the to-be-processed mode string based on a score of each string of the to-be-processed mode string in the sliding window, and continuously trigger the first determining module to process the next mode string until each mode string completes determining the filtering string.
In an alternative embodiment, the apparatus further comprises:
the statistics module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence times;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence times of each character in the character string in the sliding window of the character string of the mode to be processed in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
In the processing device for the pattern character strings provided by the embodiment of the application, before matching of the multimode character strings, the filtering character strings are dynamically determined for each pattern character string based on the character distribution condition of the pattern character strings, so that the filtering character strings are not characters in fixed positions. The filtering character strings dynamically determined based on the embodiment of the application are used for matching the multi-mode character strings, so that the condition that the performance is greatly reduced caused by frequent entering of a confirmation stage can be avoided, and the matching efficiency is improved.
In addition, the embodiment of the application further provides a processing device for a pattern string, as shown in fig. 4, which may include:
a processor 401, a memory 402, an input device 403 and an output device 404. The number of processors 401 in the processing device of the pattern string may be one or more, one processor being exemplified in fig. 4. In some embodiments of the invention, the processor 401, memory 402, input device 403, and output device 404 may be connected by a bus or other means, with the bus connection being exemplified in FIG. 4.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications of the processing device of the pattern string and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area that may store an operating system, application programs required for at least one function, and the like, and a storage data area. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The input means 403 may be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the processing device of the mode string.
In particular, in this embodiment, the processor 401 loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions in the above-mentioned processing method of the pattern string.
In addition, the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, and when the instructions run on the terminal equipment, the terminal equipment is caused to execute the method for processing the pattern character string.
It is to be understood that, because the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes in detail the method, apparatus, and device for processing a pattern string provided in the embodiments of the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific implementations and the scope of application. In view of the above, the contents of this description should not be construed as limiting the present application.
Claims (8)
1. A method for processing a pattern string, the method comprising:
S1, determining any mode character string as a mode character string to be processed, and aligning the head/tail end of a sliding window with a preset length with the head/tail end of the mode character string to be processed;
S2, determining the score of the character string in the sliding window based on the character distribution condition of the character string of the mode to be processed in the sliding window;
S3, moving the sliding window to the tail/head end of the character string of the mode to be processed in a preset step length, and continuing to execute the S2 until the interval length between the tail/head end of the sliding window and the tail/head end of the character string of the mode to be processed is smaller than the preset step length;
S4, determining a filtering character string of the to-be-processed mode character string based on the score of each character string of the to-be-processed mode character string in the sliding window, and continuing to execute S1 to process the next mode character string until a filtering character string has been determined for each mode character string;
wherein after determining the filtering string of the pending mode string in S4, the method further includes:
counting the occurrence times of each character in the filtering character string;
updating a character occurrence frequency table by utilizing each character and the corresponding occurrence times;
correspondingly, the determining the score of the character string based on the character distribution condition of the character string of the to-be-processed mode character string in the sliding window comprises the following steps:
determining the score of the character string based on the occurrence times of each character in the character string in the sliding window of the character string of the mode to be processed in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
2. The method according to claim 1, wherein the determining the score of the character string based on the number of occurrences of each character in the character string in the sliding window in which the character string of the mode to be processed is in the character occurrence frequency table includes:
determining the occurrence times of each character in the character string in the sliding window of the character string in the mode to be processed in the character occurrence frequency table, and calculating the average value of the occurrence times of each character;
if the average value is larger than a preset threshold value, determining the score of the character string based on the average value; wherein the score is inversely related to the average value.
3. The method according to claim 2, wherein the method further comprises:
if the average value is not greater than the preset threshold value, determining the score of the character string based on the repeated number of continuous repeated characters in the character string of the mode to be processed in the sliding window; wherein the score is inversely related to the number of repetitions.
4. The method of claim 1, wherein the determining the score of the string based on the character distribution of the strings in the sliding window for which the pending pattern string is located comprises:
and determining the score of the character string based on the repeated number of the continuous repeated characters in the character string of the mode to be processed in the sliding window and the occurrence frequency of each character in the character string in the character occurrence frequency table.
5. The method according to claim 1, wherein after determining the filter string of the pending pattern string in S4, further comprising:
and recording the position information of the filtering character string in the character string of the mode to be processed.
6. A processing apparatus for a pattern string, the apparatus comprising:
a first determining module, configured to determine any one of the mode strings as a mode string to be processed, and align a head/tail end of a sliding window with a preset length with a head/tail end of the mode string to be processed;
the second determining module is used for determining the score of the character string in the sliding window based on the character distribution condition of the character string of the to-be-processed mode in the sliding window;
the moving module is used for moving the sliding window to the tail/head end of the character string of the mode to be processed in a preset step length, and continuing to trigger the second determining module until the length of the interval between the tail/head end of the sliding window and the tail/head end of the character string of the mode to be processed is smaller than the preset step length;
a third determining module, configured to determine a filtering string of the to-be-processed mode string based on a score of each string of the to-be-processed mode string in the sliding window, and continuously trigger the first determining module to process a next mode string until each mode string completes determination of the filtering string;
the statistics module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence times;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence times of each character in the character string in the sliding window of the character string of the mode to be processed in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the method according to any of claims 1-5.
8. A processing apparatus for a pattern character string, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-5 when the computer program is executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911280495.7A CN111159490B (en) | 2019-12-13 | 2019-12-13 | Method, device and equipment for processing pattern character strings |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911280495.7A CN111159490B (en) | 2019-12-13 | 2019-12-13 | Method, device and equipment for processing pattern character strings |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111159490A (en) | 2020-05-15 |
CN111159490B (en) | 2023-05-26 |
Family
ID=70557050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911280495.7A Active CN111159490B (en) | 2019-12-13 | 2019-12-13 | Method, device and equipment for processing pattern character strings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111159490B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113836367B (en) * | 2021-09-26 | 2023-04-28 | 杭州迪普科技股份有限公司 | Method and device for character reverse matching |
CN116388767B (en) * | 2023-04-11 | 2023-10-13 | 河北湛泸软件开发有限公司 | Security management method for software development data |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0677927A2 (en) * | 1994-04-15 | 1995-10-18 | International Business Machines Corporation | Character string pattern matching for compression and the like using minimal cycles per character |
CN101609455A (en) * | 2009-07-07 | 2009-12-23 | 哈尔滨工程大学 | A kind of method of high-speed accurate single-pattern character string coupling |
CN101876986A (en) * | 2009-11-27 | 2010-11-03 | 福建星网锐捷网络有限公司 | Character string matching method based on finite state automation and content filtering equipment |
CN102063482A (en) * | 2010-12-27 | 2011-05-18 | 北京友录在线科技发展有限公司 | High-efficiency contact searching method of handheld device |
CN102750379A (en) * | 2012-06-25 | 2012-10-24 | 华南理工大学 | Fast character string matching method based on filtering type |
CN103559018A (en) * | 2013-10-23 | 2014-02-05 | 东软集团股份有限公司 | String matching method and system based on graphics processing unit (GPU) calculation |
CN106599097A (en) * | 2016-11-24 | 2017-04-26 | 东软集团股份有限公司 | Massive feature string sets matching method and apparatus |
CN109977276A (en) * | 2019-03-22 | 2019-07-05 | 华南理工大学 | A kind of single pattern matching method based on Sunday algorithm improvement |
Also Published As
Publication number | Publication date |
---|---|
CN111159490A (en) | 2020-05-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3432157B1 (en) | Data table joining mode processing method and apparatus | |
US4916655A (en) | Method and apparatus for retrieval of a search string | |
CN111159490B (en) | Method, device and equipment for processing pattern character strings | |
CN111971931B (en) | Method for verifying transactions in a blockchain network and nodes constituting the network | |
CN108985934B (en) | Block chain modification method and device | |
US20220005546A1 (en) | Non-redundant gene set clustering method and system, and electronic device | |
CN116450656B (en) | Data processing method, device, equipment and storage medium | |
CN108681490B (en) | Vector processing method, device and equipment for RPC information | |
CN117349235A (en) | LSM-Tree-based KV storage system, electronic equipment and medium | |
CN111175810B (en) | Microseismic signal arrival time picking method, device, equipment and storage medium | |
US8407187B2 (en) | Validating files using a sliding window to access and correlate records in an arbitrarily large dataset | |
KR102085132B1 (en) | Efficient cuckoo hashing using hash function categorization in inside of bucket | |
CN116775695A (en) | Dynamic combination query optimization method and device based on index and storage medium | |
CN116610636A (en) | Data processing method and device of file system, electronic equipment and storage medium | |
CN115982426A (en) | Retrieval method, device, storage medium and terminal based on improved MinHash algorithm | |
CN112579839B (en) | Multi-mode matching method and device for large-scale features and storage medium | |
CN113010882B (en) | Custom position sequence pattern matching method suitable for cache loss attack | |
CN113887223B (en) | Character string matching method and related device | |
KR20030032499A (en) | A method for matching subsequence based on time-warping in sequence databases | |
US9864765B2 (en) | Entry insertion apparatus, method, and program | |
AU2021390717B2 (en) | Batch job performance improvement in active-active architecture | |
US8364675B1 (en) | Recursive algorithm for in-place search for an n:th element in an unsorted array | |
CN110427391B (en) | Method, apparatus and computer program product for determining duplicate data | |
WO2001091132A2 (en) | The implementation of a content addressable memory using a ram-cell structure | |
CN114461865A (en) | Character string matching method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||