CN111159490A - Method, device and equipment for processing mode character string - Google Patents

Method, device and equipment for processing mode character string

Info

Publication number
CN111159490A
Authority
CN
China
Prior art keywords
character string
character
mode
string
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911280495.7A
Other languages
Chinese (zh)
Other versions
CN111159490B (en)
Inventor
谭天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou DPTech Technologies Co Ltd
Original Assignee
Hangzhou DPTech Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou DPTech Technologies Co Ltd
Priority to CN201911280495.7A
Publication of CN111159490A
Application granted
Publication of CN111159490B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, an apparatus and a device for processing pattern strings. The method comprises: taking any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of that pattern string; determining a score for the substring currently inside the sliding window based on its character distribution; moving the sliding window toward the tail/head end of the pattern string by a preset step until it can no longer be moved; and determining the filter string of the pattern string based on the scores of the substrings that fell inside the sliding window, repeating until a filter string has been determined for every pattern string. Because the filter string is determined from the character distribution of the pattern string, multi-pattern string matching based on the determined filter strings avoids a sharp performance drop and improves matching efficiency.

Description

Method, device and equipment for processing mode character string
Technical Field
The present application relates to the field of data processing, and in particular, to a method, an apparatus, and a device for processing a pattern string.
Background
Multi-pattern string matching means that, given a set of strings as pattern strings, all occurrences of the pattern strings and their positions are found in a piece of data to be matched.
Multi-pattern string matching is used in more and more service scenarios, such as signature-based protocol identification and attack detection. As applications grow more complex, the number of pattern strings keeps increasing and the performance requirements for matching keep rising, which challenges existing multi-pattern matching algorithms. For example, the classic AC automaton algorithm becomes unusable in some cases because of its large memory consumption, whereas filtering-based multi-pattern matching algorithms are increasingly used because they occupy less memory and, in most cases, match faster.
However, the performance of filtering-based multi-pattern matching algorithms may drop sharply in some scenarios (described in detail later). Research shows that in most cases the cause is that the filter string of a pattern string is fixed as its first K or last K characters. How to determine the filter string of a pattern string is therefore the key to solving this performance problem.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, and a device for processing pattern strings, which determine the filter string of a pattern string based on the character distribution of that pattern string, so that multi-pattern string matching performed with the filter strings determined by the present application avoids the sharp performance drop seen in some scenarios.
In a first aspect, to achieve the above object, the present application provides a method for processing a pattern string, where the method includes:
S1, determining any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of the pattern string to be processed;
S2, determining the score of the substring of the pattern string to be processed that lies inside the sliding window, based on the character distribution of that substring;
S3, moving the sliding window toward the tail/head end of the pattern string to be processed by a preset step, and continuing to execute S2 until the distance between the tail/head end of the sliding window and the tail/head end of the pattern string to be processed is smaller than the preset step;
S4, determining the filter string of the pattern string to be processed based on the scores of the substrings of the pattern string to be processed that lay inside the sliding window, and continuing to execute S1 to process the next pattern string until a filter string has been determined for every pattern string.
In an optional implementation, after the filter string of the pattern string to be processed is determined in S4, the method further includes:
counting the number of occurrences of each character in the filter string;
updating a character occurrence frequency table with each character and its number of occurrences;
correspondingly, determining the score of the substring based on the character distribution of the substring of the pattern string to be processed inside the sliding window includes:
determining the score of the substring based on the number of occurrences, in the character occurrence frequency table, of each character of the substring inside the sliding window; wherein the score is inversely related to the number of occurrences.
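The frequency-table variant above can be sketched in Python as follows. This is only an illustration: the names `freq_table`, `update_freq_table` and `occurrence_score`, and the use of the negated sum of occurrences as the score, are assumptions made for the example. The application itself only requires that the score be inversely related to the number of occurrences.

```python
from collections import Counter

# Hypothetical global table: how often each character already appears
# in the filter strings chosen so far.
freq_table = Counter()

def update_freq_table(filter_string):
    """Add the characters of a newly chosen filter string to the table."""
    freq_table.update(filter_string)

def occurrence_score(window):
    """Assumed scoring rule: the more often the window's characters are
    already used in other filter strings, the lower the score."""
    return -sum(freq_table[ch] for ch in window)

update_freq_table(".com")                 # ".com" was chosen for an earlier pattern
score_reused = occurrence_score(".com")   # every character is already in the table
score_fresh = occurrence_score("aidu")    # no character is in the table yet
print(score_reused, score_fresh)
```

Under this rule a window that reuses the characters of an earlier filter string scores lower than a window made of unused characters, which steers later pattern strings away from identical filter strings.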
In an optional embodiment, determining the score of the substring based on the number of occurrences, in the character occurrence frequency table, of each character of the substring inside the sliding window includes:
determining the number of occurrences, in the character occurrence frequency table, of each character of the substring inside the sliding window, and calculating the average of these numbers of occurrences;
if the average is greater than a preset threshold, determining the score of the substring based on the average; wherein the score is inversely related to the average.
In an optional embodiment, the method further comprises:
if the average is not greater than the preset threshold, determining the score of the substring based on the number of consecutively repeated characters in the substring inside the sliding window; wherein the score is inversely related to the number of repetitions.
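A minimal Python sketch of this two-branch scoring is given below. The function names, the threshold value, the way repetitions are counted (characters equal to their immediate predecessor) and the negation used to make the score "inversely related" are all assumptions for illustration; the application does not fix concrete formulas.

```python
from collections import Counter

def adjacent_repeats(window):
    """Count characters that repeat the character immediately before them."""
    return sum(1 for a, b in zip(window, window[1:]) if a == b)

def window_score(window, freq_table, threshold):
    """Score a window; a higher score is better (assumed convention)."""
    average = sum(freq_table[ch] for ch in window) / len(window)
    if average > threshold:
        return -average                 # frequently used characters lower the score
    return -adjacent_repeats(window)    # otherwise penalise runs such as "aaaa"

# Hypothetical table: three earlier patterns already use ".com" as filter string.
freq_table = Counter(".com" * 3)
s_com = window_score(".com", freq_table, threshold=1)   # average branch
s_aaaa = window_score("aaaa", freq_table, threshold=1)  # repetition branch
s_seaa = window_score("seaa", freq_table, threshold=1)  # repetition branch
print(s_com, s_aaaa, s_seaa)
```

How scores from the two branches should be compared with each other is not specified in the text; this sketch simply puts both on the same "higher is better" scale.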
In an optional embodiment, determining the score of the substring based on the character distribution of the substring of the pattern string to be processed inside the sliding window includes:
determining the score of the substring based on both the number of consecutively repeated characters in the substring inside the sliding window and the number of occurrences of each of its characters in the character occurrence frequency table.
In an optional implementation manner, after determining the filter string of the to-be-processed mode string in S4, the method further includes:
and recording the position information of the filtering character string in the to-be-processed mode character string.
In a second aspect, the present application further provides an apparatus for processing a pattern string, where the apparatus includes:
the first determining module is used for determining any mode character string as a mode character string to be processed and aligning the head/tail end of a sliding window with a preset length with the head/tail end of the mode character string to be processed;
the second determination module is used for determining the scores of the character strings in the sliding window based on the character distribution condition of the character strings of the mode to be processed in the sliding window;
the moving module is used for moving the sliding window to the tail/head end of the mode character string to be processed by a preset step length and continuously triggering the second determining module until the interval length between the tail/head end of the sliding window and the tail/head end of the mode character string to be processed is smaller than the preset step length;
and the third determining module is used for determining the filter string of the pattern string to be processed based on the score of each substring of the pattern string to be processed that lay inside the sliding window, and for continuing to trigger the first determining module to process the next pattern string until a filter string has been determined for every pattern string.
In an alternative embodiment, the apparatus further comprises:
the counting module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence frequency;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence frequency of each character in the character string of the mode to be processed in the sliding window in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
In a third aspect, the present application also provides a computer-readable storage medium having stored therein instructions that, when run on a terminal device, cause the terminal device to perform the method according to any one of the above.
In a fourth aspect, the present application further provides a device for processing pattern strings, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of the above.
In the method for processing pattern strings provided in the embodiments of the present application, before multi-pattern string matching is performed, a filter string is dynamically determined for each pattern string based on the character distribution of that pattern string, so that the filter string is no longer the characters at a fixed position. Performing multi-pattern string matching with the dynamically determined filter strings avoids the sharp performance drop caused by frequently entering the confirmation stage and thereby improves matching efficiency.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for processing pattern strings according to an embodiment of the present application;
Fig. 2 is a flowchart of another method for processing pattern strings according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an apparatus for processing pattern strings according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a device for processing pattern strings according to an embodiment of the present application;
Fig. 5 is a diagram of the filter strings formed by characters in the three pattern strings "search", "filter" and "algorithm" in an embodiment of the present application;
Fig. 6 is a diagram of the filter strings formed by characters in the three pattern strings "seaaaa", "filter" and "algorithm" in an embodiment of the present application;
Fig. 7 is a diagram of the filter strings formed by characters in the three pattern strings "baidu.com", "sina.com" and "alibaba.com" in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, mainstream multi-pattern string matching methods fall into two categories: automaton-based methods and filtering-based methods. The former is represented by the AC algorithm, the latter by the SOG (Shift-Or with q-Gram) algorithm. The AC algorithm completes multi-pattern matching in linear time, but its memory footprint grows sharply as the number of pattern strings increases, which causes many cache misses during memory access in the matching stage and reduces matching efficiency. Filtering-based methods rest on an assumption that holds in most application scenarios: pattern strings rarely occur in the data to be matched, so searching for them is like looking for a needle in a haystack. Such methods therefore first filter the data to be matched and compare it against the pattern strings only when necessary, which greatly reduces memory usage and improves matching efficiency.
Before the pattern string processing method provided by the present application is introduced, filtering-based multi-pattern string matching is briefly described so that the method can be understood as a whole.
Currently, filtering-based multi-pattern matching methods usually select a substring at a fixed position in each pattern string as its filter string. When several consecutive characters of the pattern string are used as the filter string, the first k or the last k characters are generally chosen.
The SOG algorithm is taken as an example below to describe filtering-based multi-pattern string matching and its problems in some scenarios. Let k be 4, let the three pattern strings be "search", "filter" and "algorithm", and let the filter string of each pattern string be its last 4 characters. In the SOG algorithm:
First, the three pattern strings "search", "filter" and "algorithm" are tail-aligned, as shown in fig. 5.
The characters in the black boxes form the filter string of each pattern string. Each character of a filter string corresponds to a bit array of length k, called the state of the character, and the states of all characters in the filter strings form a state table that represents the distribution of characters across the filter strings. The state of a character is determined as follows: if the character appears at the ith position of the filter string of any pattern string, the ith bit of its state is set to 0; all other bits are set to 1. The state table determined in this way for the three pattern strings is as follows:
a:0111
c:1101
e:1101
h:1100
i:0111
l:0111
m:1110
r:1010
t:1011
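The state table above can be reproduced with a short Python sketch. The function name `build_state_table` and the choice to store each state as an integer whose most significant of k bits is position 1 are assumptions of this example, not part of the application.

```python
def build_state_table(filter_strings, k):
    """Map each character to a k-bit state. Bit i (1-based, leftmost first)
    is 0 if the character occurs at position i of any filter string."""
    all_ones = (1 << k) - 1
    table = {}
    for fs in filter_strings:
        assert len(fs) == k, "every filter string must have length k"
        for i, ch in enumerate(fs):        # i is the 0-based position
            mask = 1 << (k - 1 - i)        # leftmost bit stands for position 1
            table[ch] = table.get(ch, all_ones) & ~mask
    return table

patterns = ["search", "filter", "algorithm"]
k = 4
table = build_state_table([p[-k:] for p in patterns], k)
for ch in sorted(table):
    print(f"{ch}:{table[ch]:0{k}b}")
```

Characters that occur in no filter string are simply absent from the dictionary; a matcher would treat them as the all-ones state 1111.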
In practical application, when matching multi-pattern strings, a state S is initialized in advance to 0^k, where 0^k denotes k consecutive 0s. The characters of the data to be matched are then read in one by one; for each character, its state Sc is looked up in the predetermined state table and an OR operation is performed on Sc and S. Once at least k characters of the data to be matched have been read, the kth bit of the operation result is checked. If it is 0, at least one candidate pattern string may occur in the data to be matched, and the confirmation stage must be entered for further confirmation. After each character has been processed, the OR result for that character is shifted right by 1 bit and used as the updated state S. In the confirmation stage, all candidate pattern strings are first determined from the k most recently read characters; the tail of each candidate pattern string is then aligned with the most recently read character of the data to be matched, and the characters at corresponding positions of the candidate pattern string and the data to be matched are compared from back to front. If all of them are the same, the data to be matched matches the candidate pattern string; otherwise, the match fails.
A concrete example illustrates the matching method. If the data to be matched is "architecture", multi-pattern string matching proceeds as follows:
1) Set the initial state S to 0000.
2) Read the first character a of the data to be matched "architecture", and OR the state 0111 corresponding to a in the predetermined state table with S: 0000 | 0111 = 0111. Then shift the result 0111 right by 1 bit to obtain 0011, which updates S.
3) Read the second character r of the data to be matched "architecture", and OR the state 1010 corresponding to r in the state table with the updated S (0011): 0011 | 1010 = 1011. Then shift the result 1011 right by 1 bit to obtain 0101, which updates S.
4) Read the third character c of the data to be matched "architecture", and OR the state 1101 corresponding to c in the state table with the updated S (0101): 0101 | 1101 = 1101. Then shift the result 1101 right by 1 bit to obtain 0110, which updates S.
5) Read the fourth character h of the data to be matched "architecture", and OR the state 1100 corresponding to h in the state table with the updated S (0110): 0110 | 1100 = 1110.
Since 4 characters have been read at this point, the 4th bit of the operation result must be checked. The 4th bit of the result 1110 is 0, which means the last 4 characters of some pattern string may have been matched, so the confirmation stage is entered. In the confirmation stage, the candidate pattern string determined from the 4 most recently read characters "arch" is "search"; the tail of "search" is aligned with the tail of "arch" and the characters are compared from back to front. Clearly, "arch" does not match "search". The operation result 1110 is then shifted right by 1 bit to obtain 0111, which updates S.
6) Read the remaining characters of the data to be matched "architecture" in turn in the manner of 1) to 5) to complete the matching process.
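The filtering and confirmation stages walked through above can be sketched in Python as follows. This is a simplified illustration rather than the SOG algorithm as implemented in any particular product: `build_state_table` and `sog_scan` are hypothetical names, states are stored as integers with position 1 as the most significant of k bits, and the filter string is assumed to be the last k characters of each pattern.

```python
def build_state_table(filter_strings, k):
    """Bit i (leftmost first) of a character's state is 0 if the character
    occurs at position i of any filter string."""
    all_ones = (1 << k) - 1
    table = {}
    for fs in filter_strings:
        for i, ch in enumerate(fs):
            table[ch] = table.get(ch, all_ones) & ~(1 << (k - 1 - i))
    return table

def sog_scan(text, patterns, k):
    """Return (matches, confirmations): matches are (start, pattern) pairs,
    confirmations counts how often the confirmation stage was entered."""
    table = build_state_table([p[-k:] for p in patterns], k)
    all_ones = (1 << k) - 1
    state = 0                                   # S = 0^k
    matches, confirmations = [], 0
    for pos, ch in enumerate(text):
        result = state | table.get(ch, all_ones)
        # Check the kth (rightmost) bit once at least k characters were read.
        if pos + 1 >= k and (result & 1) == 0:
            confirmations += 1
            last_k = text[pos + 1 - k:pos + 1]
            for p in patterns:                  # candidates share the last k chars
                if p[-k:] == last_k and text.endswith(p, 0, pos + 1):
                    matches.append((pos + 1 - len(p), p))
        state = result >> 1                     # shift right by 1 bit to update S
    return matches, confirmations

patterns = ["search", "filter", "algorithm"]
print(sog_scan("architecture", patterns, 4))
print(sog_scan("research", patterns, 4))
```

For "architecture" the confirmation stage is entered once, after "arch", and no pattern matches; for "research" the same single confirmation finds "search" starting at index 2.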
In general, most characters read from the data to be matched are filtered out in the filtering stage of the SOG algorithm, and only a few enter the confirmation stage for further confirmation. The inventor found, however, that the efficiency of the algorithm drops sharply in some special scenarios.
The first scenario: the filter string determined for some pattern string contains consecutively repeated characters, and the data to be matched also contains the same consecutively repeated characters.
This scenario is explained with the following example:
Assume that the pattern strings are "seaaaa", "filter" and "algorithm", and that the filter string consists of the last four characters. As shown in fig. 6, the characters in the black boxes form the filter strings of the corresponding pattern strings.
Based on the state determination method above, the state table determined for the three pattern strings is as follows:
a:0000
e:1101
h:1101
i:0111
l:0111
m:1110
r:1110
t:1011
*:1111
Assume that the data to be matched is "xxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". In this scenario, after "xxxaaa" has been read, S is 0000. Thereafter, each time another character a is read, the OR of its state 0000 with S (0000) is 0000, and shifting that result right by 1 bit leaves the updated S at 0000 as well. The algorithm therefore enters the confirmation stage after every further character a that is read. Because the confirmation stage of the SOG algorithm is very time-consuming, entering it this frequently causes a sharp performance drop and greatly reduces the matching efficiency of the algorithm.
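The effect can be made visible by counting how often the confirmation stage is entered. The Python sketch below is an illustration under assumptions: `count_confirmations` is a hypothetical helper, the input uses 30 repetitions of a for convenience, and "seaa" is just one possible non-tail filter string for "seaaaa", chosen here by hand.

```python
def count_confirmations(text, filter_strings, k):
    """Count how often the kth bit of the OR result is 0, i.e. how often
    an SOG-style scan would enter the confirmation stage."""
    all_ones = (1 << k) - 1
    table = {}
    for fs in filter_strings:
        for i, ch in enumerate(fs):
            table[ch] = table.get(ch, all_ones) & ~(1 << (k - 1 - i))
    state, count = 0, 0
    for pos, ch in enumerate(text):
        result = state | table.get(ch, all_ones)
        if pos + 1 >= k and (result & 1) == 0:
            count += 1
        state = result >> 1
    return count

text = "xxx" + "a" * 30
fixed = count_confirmations(text, ["aaaa", "lter", "ithm"], 4)  # last 4 characters
moved = count_confirmations(text, ["seaa", "lter", "ithm"], 4)  # hand-picked window
print(fixed, moved)
```

With the fixed tail filter string "aaaa", every character a from the fourth one onward triggers a confirmation; with "seaa" none does on this input.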
The second scenario: the filter strings determined for several pattern strings are identical, and the data to be matched contains that filter string.
This scenario is explained with the following example:
Assume that the pattern strings are "baidu.com", "sina.com" and "alibaba.com", and that the filter string consists of the last four characters. The filter strings of the three pattern strings are then obviously identical. As shown in fig. 7, the characters in the black boxes form the filter strings of the corresponding pattern strings.
Based on the state determination method above, the state table determined for the three pattern strings is as follows:
.:0111
c:1011
o:1101
m:1110
*:1111
Let the data to be matched be "xxxx.com yyy.com", which contains the shared filter string ".com". In this scenario, because the filter strings of the three pattern strings are all ".com", three candidate pattern strings enter the confirmation stage for further confirmation every time ".com" is read. Because the confirmation stage of the SOG algorithm is very time-consuming, entering it this frequently causes a sharp performance drop and greatly reduces the matching efficiency of the algorithm.
Therefore, for the above two scenarios (but not limited to them), the present application provides a method for processing pattern strings: before multi-pattern string matching is performed, a filter string is dynamically determined for each pattern string based on the character distribution of that pattern string, so that the filter string is no longer the characters at a fixed position. Performing multi-pattern string matching with the dynamically determined filter strings avoids the sharp performance drop caused by frequently entering the confirmation stage and thereby improves matching efficiency.
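A small Python sketch makes the cost concrete by grouping the pattern strings by their fixed last-k filter string and counting how many candidate comparisons the example input causes. The variable names are illustrative assumptions.

```python
from collections import defaultdict

patterns = ["baidu.com", "sina.com", "alibaba.com"]
k = 4

# Group the patterns by their fixed filter string (the last k characters).
by_filter = defaultdict(list)
for p in patterns:
    by_filter[p[-k:]].append(p)

text = "xxxx.com yyy.com"
hits = text.count(".com")                         # occurrences of the filter string
candidate_checks = hits * len(by_filter[".com"])  # every hit checks all candidates
print(dict(by_filter))
print(hits, candidate_checks)
```

Each of the two occurrences of ".com" forces all three pattern strings through the confirmation stage, although none of them occurs in the data.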
Based on this, the present application provides a method for processing pattern strings. Fig. 1 is a flowchart of a method for processing pattern strings provided in an embodiment of the present application. Specifically, the method includes:
S1, determining any pattern string as the pattern string to be processed, and aligning the head/tail end of a sliding window of preset length with the head/tail end of the pattern string to be processed.
A pattern string is a string predetermined according to user requirements, and a typical application scenario has multiple pattern strings. The embodiment of the present application dynamically determines a filter string for each pattern string, for use in subsequent filtering-based multi-pattern string matching.
In the embodiment of the present application, any one of the predetermined pattern strings is determined as the pattern string to be processed. Before the filter string is determined, its length is determined first; for example, the filter string may be 4 characters long. The embodiment of the present application therefore defines a sliding window whose preset length equals the length of the filter string and uses it to determine the filter string.
In practical applications, the head end or the tail end of the sliding window is first aligned with the head end or the tail end of the pattern string to be processed. In one alternative embodiment, the tail end of the sliding window is aligned with the tail end of the pattern string to be processed, and the sliding window is subsequently moved from back to front to determine the filter string. In another alternative embodiment, the head end of the sliding window is aligned with the head end of the pattern string to be processed, and the sliding window is subsequently moved from front to back to determine the filter string.
S2, determining the score of the character string in the sliding window based on the character distribution condition of the character string of the mode character string to be processed in the sliding window.
In the embodiment of the present application, a score is determined for the substring of the pattern string to be processed that lies inside the sliding window, according to the character distribution of that substring. The character distribution may cover both the distribution of characters within the substring itself and the distribution of those characters across the filter strings already determined for other pattern strings.
By combining the local and the global distribution of the characters of the substring inside the sliding window, the embodiment of the present application largely avoids consecutively repeated characters in the determined filter string and avoids identical filter strings for several pattern strings, and thus avoids the sharp performance drop during multi-pattern string matching.
The specific manner of determining the score of the substring based on the character distribution is described in the following embodiments.
S3, moving the sliding window to the tail/head end of the character string of the mode to be processed by a preset step length, and continuing to execute the step S2 until the interval length between the tail/head end of the sliding window and the tail/head end of the character string of the mode to be processed is smaller than the preset step length.
In the embodiment of the application, after the score of the character string in the sliding window is determined, the sliding window is moved by a preset step length so as to update the character string in the sliding window. Specifically, if the trailing end is aligned in S1, the sliding window is moved from the rear to the front, whereas if the leading end is aligned in S1, the sliding window is moved from the front to the rear.
In practical application, after moving the sliding window every time, the score of the character string in the sliding window is determined based on the character distribution of the character string until the sliding window cannot slide continuously. Therefore, scores corresponding to a plurality of character strings in the sliding window in the character strings of the mode to be processed can be determined.
S4, determining a filter character string of the mode character string to be processed based on the score of each character string of the mode character string to be processed in the sliding window, and continuing to execute S1 to process the next mode character string until each mode character string completes the determination of the filter character string.
In the embodiment of the application, after the scores corresponding to the character strings in the sliding window in the character strings of the mode to be processed are determined, which character string is used as the filtering character string is determined based on the scores. The filter strings for each modal string are determined in the manner described above for use in subsequent filtering-based multimodal string matching methods.
In the method for processing the modal character string provided in the embodiment of the present application, before matching the multimode character string, a filtering character string is dynamically determined for each modal character string based on a character distribution condition of the modal character string, so that the filtering character string is no longer a character at a fixed position. The multimode character string is matched based on the dynamically determined filtering character string, so that the condition that the performance is greatly reduced due to frequent entering of a confirmation stage can be avoided, and the matching efficiency is improved.
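The window positions visited in S1 and S3 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `window_starts` and its parameters `n` (length of the mode character string), `k` (window length), `t` (step length) and `from_tail` are hypothetical names introduced here. It assumes, as stated in S3, that the window stops moving once the remaining interval is smaller than the step length.

```python
def window_starts(n: int, k: int, t: int, from_tail: bool = True) -> list[int]:
    """Return the start offset of every sliding-window position.

    n: length of the pattern string, k: window length, t: step length.
    An empty list is returned when the pattern is shorter than the window.
    """
    if from_tail:
        # Align the window tail with the pattern tail, then move toward the head.
        return list(range(n - k, -1, -t))
    # Align the window head with the pattern head, then move toward the tail.
    return list(range(0, n - k + 1, t))


pattern = "baidu.com"
windows = [pattern[s:s + 4] for s in window_starts(len(pattern), k=4, t=2)]
print(windows)
```

With a window length of 4 and a step length of 2, the tail-aligned windows of baidu.com are .com, du.c and aidu.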
The following embodiment of the present application provides a specific processing method for a pattern string. Referring to fig. 2, which is a flowchart of another processing method for a pattern string provided by the present application, the method comprises the following steps:
S201: establishing a character occurrence frequency table in advance, and initializing elements in the character occurrence frequency table to 0; the character occurrence frequency table is used for recording the occurrence frequency of each character in the filter character strings.
S202: and determining any mode character string as a mode character string to be processed, and aligning the tail end of a sliding window with a preset length with the tail end of the mode character string to be processed.
S203: determining a first sub-score based on the number of occurrences, in the character occurrence frequency table, of each character in the character string of the to-be-processed mode character string in the sliding window.
In the embodiment of the application, after the character string in the sliding window is determined, the character occurrence frequency table is queried for each character in the character string. Since a character that occurs more often is less suitable for a filter character string, the first sub-score is inversely related to the number of occurrences.
In an alternative embodiment, an average of the number of occurrences of each character in the string is calculated, and the first sub-score of the string is determined based on the average.
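The average described above can be computed as in the following sketch. The function name `average_occurrences` is hypothetical, and a `collections.Counter` stands in for the character occurrence frequency table because it returns 0 for characters that have not been recorded. The text only states that the first sub-score is inversely related to the number of occurrences, so the sketch returns the raw average and leaves the mapping to a sub-score open.

```python
from collections import Counter


def average_occurrences(window: str, foc_table: Counter) -> float:
    """Average recorded occurrence count of the characters in the window."""
    # Counter yields 0 for characters absent from the table.
    return sum(foc_table[ch] for ch in window) / len(window)


# Table state after ".com" has been recorded once.
foc_table = Counter(".com")
for window in (".com", "na.c", "sina"):
    print(window, average_occurrences(window, foc_table))
```

A lower average indicates that the window shares fewer characters with the filter character strings chosen so far.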
S204: and determining a second sub-score based on the repeated number of continuous repeated characters in the character string of the mode character string to be processed in the sliding window.
In the embodiment of the application, after the character string in the sliding window is determined, the repetition count of the character that repeats the most in the character string is determined. Since a character string with more repeated characters is less suitable as a filter character string, the second sub-score is inversely related to the number of repetitions.
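One way to obtain the number of repetitions of consecutive repeated characters named in S204 is to take the longest run of a single character in the window. The sketch below assumes that reading; the function name `longest_run` is hypothetical, and the mapping from this count to the second sub-score is not specified beyond being inversely related.

```python
from itertools import groupby


def longest_run(window: str) -> int:
    """Length of the longest run of one consecutively repeated character."""
    # groupby splits the window into maximal runs of equal characters.
    return max((len(list(run)) for _, run in groupby(window)), default=0)


for window in ("liba", "baaa", "aaaa"):
    print(window, longest_run(window))
```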
S205: determining a score for the string based on the first sub-score and the second sub-score.
In the embodiment of the present application, after the first sub-score and the second sub-score are determined, the product of the first sub-score and the second sub-score may be used as the score of the character string, and is used as a basis for determining the filtering character string.
Multiplying the first sub-score by the second sub-score yields a score that reflects the character distribution of the character string to some extent. However, because the two sub-scores differ in magnitude and some values are 0, a score determined by multiplication in many cases does not yield the optimal filtering character string.
Because the performance of multi-mode character string matching is greatly affected once the number of identical filtering character strings among the mode character strings reaches a certain threshold, in the embodiment of the application, when the average number of occurrences of the characters in the character string, as determined from the character occurrence frequency table, is greater than a preset threshold, the score of the character string may be determined based on the first sub-score determined from that average; otherwise, the score of the character string is determined based on the number of repetitions of consecutive repeated characters in the character string of the to-be-processed mode character string in the sliding window.
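The threshold rule can be sketched as follows. This is an illustration under assumptions, not the implementation of the application: the function name `window_score` is hypothetical, and both quantities are treated as raw penalties for which a lower value marks a better candidate, a reading drawn from the worked example in this description, which picks the minimum score.

```python
from collections import Counter
from itertools import groupby


def window_score(window: str, foc_table: Counter, threshold: float) -> float:
    """Score a window by the threshold rule (assumed: lower is better)."""
    average = sum(foc_table[ch] for ch in window) / len(window)
    if average > threshold:
        # The window's characters are already common in earlier filter strings.
        return average
    # Otherwise fall back to the longest run of one repeated character.
    return float(max(len(list(run)) for _, run in groupby(window)))


foc_table = Counter(".comsina")  # every recorded character counted once
print(window_score("liba", foc_table, threshold=2))
print(window_score("aaaa", foc_table, threshold=2))
```

Because the two branches return values on different scales, a real implementation would have to decide how windows from different branches are compared; the text does not specify this.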
S206: moving the sliding window to the head end of the to-be-processed mode character string by a preset step length, and continuing to execute the step S203 until the interval length between the head end of the sliding window and the head end of the to-be-processed mode character string is smaller than the preset step length;
S207: determining a filtering character string of the mode character string to be processed based on the score of each character string of the mode character string to be processed in the sliding window, and recording the position information of the filtering character string in the mode character string to be processed.
In order to facilitate the matching of the subsequent multi-mode character string, the embodiment of the present application further needs to record the position information of the filtering character string in the to-be-processed mode character string, so as to perform the subsequent matching based on the position information.
S208: counting the occurrence times of each character in the filtering character string, and updating the character occurrence frequency table by using each character and the corresponding occurrence times.
In the embodiment of the application, after the filtering character string of any mode character string is determined, the occurrence frequency of each character in the filtering character string is counted, and then the character occurrence frequency table is updated.
S209: continuing to execute S201 to process the next pattern string until the determination of the filter string has been completed for each pattern string.
The following describes a method for processing a pattern string provided in the above embodiment with a specific example, where the following three pattern strings are assumed to be predetermined:
baidu.com
sina.com
alibaaaaaaa
Here, the length k of the sliding window is set to 4, the sliding step t of the sliding window is set to 2, and the preset threshold of the first sub-score global_score is set to 2. The elements in the pre-created character occurrence frequency table FOC_TABLE are initialized to 0.
For the pattern string baidu.com, its sliding windows and the corresponding first sub-score global_score and second sub-score local_score are shown in Table 1:
Figure BDA0002316614360000121
TABLE 1
In this case, the character string corresponding to any one of the sliding windows may be selected as the filter string of the pattern string baidu.com. Suppose that .com is selected this time as the filter string of the pattern string baidu.com; the number of occurrences of each character in .com is then counted and used to update FOC_TABLE. The updated FOC_TABLE is shown in Table 2:
Character  foc (frequency of occurrence)
. 1
c 1
o 1
m 1
Others 0
TABLE 2
Next, for the pattern string sina.com, its sliding windows and the corresponding local_score and global_score are shown in Table 3:
Figure BDA0002316614360000123
Figure BDA0002316614360000131
TABLE 3
Based on the scores in Table 3, sina is determined as the filter string of the pattern string sina.com, and the number of occurrences of each character in sina is counted and used to update FOC_TABLE. The updated FOC_TABLE is shown in Table 4:
Character  foc (frequency of occurrence)
. 1
c 1
o 1
m 1
s 1
i 1
n 1
a 1
TABLE 4
Finally, for the pattern string alibaaaaaaa, its sliding windows and the corresponding local_score and global_score are shown in Table 5:
Figure BDA0002316614360000133
TABLE 5
As can be seen from Table 5, the score determined from the local_score and global_score of the window containing liba is the minimum, so liba can be determined as the filter string of the pattern string alibaaaaaaa, and the number of occurrences of each character in liba is counted and used to update FOC_TABLE. The updated FOC_TABLE is shown in Table 6:
Figure BDA0002316614360000135
Figure BDA0002316614360000141
TABLE 6
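The three rounds of the example can be replayed with a short script. This is a sketch under stated assumptions, because the score tables are figures whose values are not available in this text: global_score is taken to be the average FOC_TABLE count of the characters in the window, local_score the longest run of one repeated character, the score their product, and the window with the minimum score is selected, with ties going to the window visited first (from tail to head). The names `score` and `select_filter_strings` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def score(window: str, foc_table: Counter) -> float:
    """Assumed score: global_score * local_score (lower is better)."""
    global_score = sum(foc_table[ch] for ch in window) / len(window)
    local_score = max(len(list(run)) for _, run in groupby(window))
    return global_score * local_score


def select_filter_strings(patterns, k, t):
    foc_table = Counter()  # character occurrence frequency table, all zeros
    chosen = []            # (filter string, position in pattern) per pattern
    for pattern in patterns:
        # Tail-aligned windows, moved toward the head by step t.
        starts = range(len(pattern) - k, -1, -t)
        if not starts:
            raise ValueError(f"pattern {pattern!r} is shorter than the window")
        # min() keeps the first minimal window, i.e. the one nearest the tail.
        best = min(starts, key=lambda s: score(pattern[s:s + k], foc_table))
        chosen.append((pattern[best:best + k], best))
        foc_table.update(pattern[best:best + k])
    return chosen, foc_table


chosen, foc_table = select_filter_strings(
    ["baidu.com", "sina.com", "alibaaaaaaa"], k=4, t=2
)
print(chosen)
print(dict(foc_table))
```

Under these assumptions the script selects .com, sina and liba for the three pattern strings, and also records the position of each filter string as required by S207.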
Then, a state table is created using the filter strings ".com", "sina", and "liba" determined for the respective pattern strings. The specific creation process of the state table is not described here; the resulting state table is as follows:
a:1110
b:1101
c:1011
i:1011
l:0111
m:1110
n:1101
s:0111
*:1111
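The listed values are consistent with a per-character bit mask in which the bit at position j is 0 when the character occurs at position j of at least one filter string, and 1 otherwise. The sketch below builds the table under that inferred convention; it is an assumption, since the creation process is not described, and the function name `build_state_table` is hypothetical.

```python
def build_state_table(filter_strings):
    """Map each character to a bit string; '0' marks a position it occupies."""
    width = len(filter_strings[0])
    table = {}
    for filter_string in filter_strings:
        for position, ch in enumerate(filter_string):
            bits = table.setdefault(ch, ["1"] * width)
            bits[position] = "0"
    state_table = {ch: "".join(bits) for ch, bits in table.items()}
    state_table["*"] = "1" * width  # default entry for all other characters
    return state_table


state_table = build_state_table([".com", "sina", "liba"])
for ch in sorted(state_table):
    print(f"{ch}:{state_table[ch]}")
```

Under this convention the characters "." and "o" would also receive their own entries (0111 and 1101), which the listing above does not show.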
In practical applications, filtering character strings determined by the above method are applied to the following two scenarios, so that the sharp drop in performance caused by frequently entering the confirmation stage can be avoided and the matching efficiency is improved. Scenario one: consecutive repeated characters exist in the filtering character string determined for a mode character string, and the same consecutive repeated characters also exist in the data to be matched. Scenario two: the filtering character strings determined for a plurality of mode character strings are the same, and the data to be matched contains that filtering character string.
Based on the foregoing method embodiment, the present application further provides a device for processing a pattern string. Referring to fig. 3, which is a schematic structural diagram of the device for processing a pattern string provided in the present application, the device includes:
a first determining module 301, configured to determine any mode character string as a to-be-processed mode character string, and align a head/tail end of a sliding window with a preset length with a head/tail end of the to-be-processed mode character string;
a second determining module 302, configured to determine a score of the character string in the sliding window based on a character distribution of the character string of the to-be-processed mode character string that is located in the sliding window;
a moving module 303, configured to move the sliding window to the tail/head end of the to-be-processed mode character string by a preset step length, and continue to trigger the second determining module until a length of an interval between the tail/head end of the sliding window and the tail/head end of the to-be-processed mode character string is smaller than the preset step length;
a third determining module 304, configured to determine a filtering character string of the to-be-processed mode character string based on a score of each character string of the to-be-processed mode character string in the sliding window, and continue to trigger the first determining module to process a next mode character string until each mode character string completes determination of the filtering character string.
In an alternative embodiment, the apparatus further comprises:
the counting module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence frequency;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence frequency of each character in the character string of the mode to be processed in the sliding window in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
In the processing apparatus for a pattern character string provided in the embodiment of the present application, before matching a multi-pattern character string, a filter character string is dynamically determined for each pattern character string based on a character distribution condition of the pattern character string, so that the filter character string is no longer a character at a fixed position. The multimode character string is matched based on the dynamically determined filtering character string, so that the condition that the performance is greatly reduced due to frequent entering of a confirmation stage can be avoided, and the matching efficiency is improved.
In addition, an embodiment of the present application further provides a processing device for a pattern character string, as shown in fig. 4, the processing device may include:
a processor 401, a memory 402, an input device 403, and an output device 404. The number of processors 401 in the processing device of the pattern string may be one or more, and one processor is taken as an example in fig. 4. In some embodiments of the present invention, the processor 401, the memory 402, the input device 403, and the output device 404 may be connected by a bus or other means, wherein the connection by the bus is illustrated in fig. 4.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing of the processing device of the pattern string by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. The input means 403 may be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the processing device for the pattern string.
Specifically, in this embodiment, the processor 401 loads an executable file corresponding to a process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions in the method for processing the mode character string.
In addition, the present application also provides a computer-readable storage medium, where instructions are stored, and when the instructions are executed on a terminal device, the terminal device is caused to execute the processing method of the mode character string.
It is understood that for the apparatus embodiments, since they correspond substantially to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method, the apparatus, and the device for processing a pattern string provided in the embodiments of the present application are described in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for processing a pattern string, the method comprising:
S1, determining any mode character string as a mode character string to be processed, and aligning the head/tail end of a sliding window with a preset length with the head/tail end of the mode character string to be processed;
S2, determining the score of the character string in the sliding window based on the character distribution condition of the character string of the mode character string to be processed in the sliding window;
S3, moving the sliding window to the tail/head end of the to-be-processed mode character string by a preset step length, and continuing to execute the S2 until the interval length between the tail/head end of the sliding window and the tail/head end of the to-be-processed mode character string is smaller than the preset step length;
S4, determining a filter character string of the mode character string to be processed based on the score of each character string of the mode character string to be processed in the sliding window, and continuing to execute S1 to process the next mode character string until each mode character string completes the determination of the filter character string.
2. The method according to claim 1, wherein after determining the filter string of the pending mode string in S4, the method further comprises:
counting the occurrence times of each character in the filtering character string;
updating a character occurrence frequency table by using each character and the corresponding occurrence frequency;
correspondingly, the determining the score of the character string based on the character distribution of the character string of the mode to be processed in the sliding window includes:
determining the score of the character string based on the occurrence frequency of each character in the character string of the mode to be processed in the sliding window in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
3. The method according to claim 1, wherein the determining the score of the character string based on the number of occurrences in the character occurrence frequency table of each character in the character string of which the character string of the mode to be processed is in the sliding window comprises:
determining the occurrence frequency of each character in the character string of the mode character string to be processed in the sliding window in the character occurrence frequency table, and calculating the average value of the occurrence frequency of each character;
determining a score of the character string based on the average value if the average value is greater than a preset threshold value; wherein the score is inversely related to the average.
4. The method of claim 3, further comprising:
if the average value is not larger than the preset threshold value, determining the score of the character string based on the repeated number of continuous repeated characters in the character string of the mode character string to be processed in the sliding window; wherein the score is inversely related to the number of repetitions.
5. The method according to claim 2, wherein the determining the score of the character string based on the character distribution of the character string of the pending mode in the sliding window comprises:
and determining the score of the character string based on the repeated number of the continuous repeated characters in the character string of the mode to be processed in the sliding window and the occurrence number of each character in the character string in the character occurrence frequency table.
6. The method according to claim 1, wherein after determining the filter string of the pending mode string in S4, the method further comprises:
and recording the position information of the filtering character string in the to-be-processed mode character string.
7. An apparatus for processing a pattern string, the apparatus comprising:
the first determining module is used for determining any mode character string as a mode character string to be processed and aligning the head/tail end of a sliding window with a preset length with the head/tail end of the mode character string to be processed;
the second determination module is used for determining the scores of the character strings in the sliding window based on the character distribution condition of the character strings of the mode to be processed in the sliding window;
the moving module is used for moving the sliding window to the tail/head end of the mode character string to be processed by a preset step length and continuously triggering the second determining module until the interval length between the tail/head end of the sliding window and the tail/head end of the mode character string to be processed is smaller than the preset step length;
and the third determining module is used for determining the filtering character string of the mode character string to be processed based on the score of each character string of the mode character string to be processed in the sliding window, and continuously triggering the first determining module to process the next mode character string until each mode character string finishes the determination of the filtering character string.
8. The apparatus of claim 7, further comprising:
the counting module is used for counting the occurrence times of each character in the filtering character string;
the updating module is used for updating the character occurrence frequency table by utilizing each character and the corresponding occurrence frequency;
correspondingly, the second determining module is specifically configured to:
determining the score of the character string based on the occurrence frequency of each character in the character string of the mode to be processed in the sliding window in the character occurrence frequency table; wherein the score is inversely related to the number of occurrences.
9. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-6.
10. A device for processing a pattern string, comprising: memory, a processor, and a computer program stored on the memory and executable on the processor, when executing the computer program, implementing the method of any of claims 1-6.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant