CN110083746B - Quick matching identification method and device based on character strings - Google Patents

Publication number
CN110083746B
Authority
CN
China
Prior art keywords
character
character string
array
bits
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910339599.4A
Other languages
Chinese (zh)
Other versions
CN110083746A (en
Inventor
李小坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd filed Critical Wuhan Greenet Information Service Co Ltd
Priority to CN201910339599.4A priority Critical patent/CN110083746B/en
Publication of CN110083746A publication Critical patent/CN110083746A/en
Application granted granted Critical
Publication of CN110083746B publication Critical patent/CN110083746B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques

Abstract

The invention relates to the technical field of computers and provides a method and a device for fast matching identification based on character strings. The method comprises: determining one or more dynamically changing character bits in a character string and the static character bits in that character string; and updating a character string mapping library according to the content of the static character bits and the one or more dynamic character bits. The invention marks the dynamically changing character bits; for example, a 257th unit can be added to each array of a conventional dictionary tree to store the link to the next-level array corresponding to a dynamically changing character bit, which greatly reduces the redundancy of the dictionary tree.

Description

Quick matching identification method and device based on character strings
[ technical field ]
The invention relates to the technical field of computers, in particular to a method and a device for fast matching and identifying based on character strings.
[ background of the invention ]
The Deep Packet Inspection (DPI) technology is an application-layer-based traffic inspection and control technology, and when an IP Packet, a TCP or a UDP data stream passes through a DPI-based bandwidth management system, the system reassembles application-layer information in an OSI seven-layer protocol by deeply reading the content of the IP Packet payload to obtain the content of the entire application program, and then performs a shaping operation on the traffic according to a management policy defined by the system.
In DPI, when application identification and malicious traffic analysis are performed on network data packets, features are usually collected from some of the first n bytes of the packet payload. For example, in the network data packets of Tencent QQ, two bytes characteristic of QQ appear at a specified position in the message. A specific rule base is then generated, and finally a matching engine matches the rules against the data packets. In practice, however, some bytes among the n bytes may be uncertain, so a state machine built with the Aho-Corasick automaton (AC) algorithm cannot be used for matching, and the rules are generally traversed one by one to check whether any is hit. Traversing rules is feasible when the number of rules is small, but once the number of rules grows by orders of magnitude, matching performance becomes very low and matching becomes very slow. This wastes a great deal of computing resources, yet the prior art offers no simple and efficient solution for this situation.
Patent application No. CN201210132834.9 discloses a multi-mode string matching method and apparatus. The method comprises the following steps: writing each character into a node, in the order of the characters of each of the plurality of pattern strings and downwards from the root node of a tree structure, to generate a decision tree; and matching the main string to be matched downwards along the decision tree. That scheme achieves exact matching of multi-mode character strings. Because child nodes are looked up by their hash values, changes in the width of the decision tree do not affect the CPU time of character string matching, which depends only on the average depth of the decision tree and is independent of the number of pattern strings. For character string matching with many pattern strings, the algorithm can greatly reduce CPU time and improve application response speed. However, that patent does not support matching character strings that contain undetermined characters.
Patent application No. CN201310744154.7 discloses a character string searching method based on a non-deterministic finite automaton (NFA), which includes: constructing an NFA and setting state variables for it; loading a matching expression into the NFA and converting it into a directed graph according to directed-graph operator conversion rules; starting to match the characters of the character string according to the state position in the state variable; if a character matches, updating the state variable according to the final position pointed to in the directed graph and matching the next character from the updated position, until a character string conforming to the matching expression is obtained or a character fails to match, at which point matching is complete; and, when matching is complete, resetting the state variable to the starting position. That patent performs character string matching through logical operators, as in expressions similar to (A. B | AC) D. Its NFA algorithm supports patterns such as abc. cd, in which the number of uncertain characters between abc and cd is not limited. For the application scenario of the present invention, the technical problem can therefore be solved with the NFA algorithm just as with the ordinary AC algorithm; however, the AC algorithm is too rigid and the NFA algorithm is too flexible, so neither makes effective use of resources or improves computing performance in this scenario.
[ summary of the invention ]
The technical problem to be solved by the present invention is as follows. When matching and identifying character strings of fixed length n, a character string may contain one or more uncertain bytes while still corresponding to the same identification result. The existing AC algorithm still sets up matching entries for the values 0-255 of each uncertain byte, all of which lead to the same identification result, and this wastes computation. The existing NFA algorithm uses transfer functions to move between states according to the current input until it reaches a termination (accepting) state, so corresponding transfer functions must likewise be set up for the one or more uncertain bytes. Compared with AC, the NFA algorithm therefore saves no computing resources and brings no performance improvement for the problem in this application scenario.
The invention adopts the following technical scheme:
In a first aspect, the present invention provides a fast matching identification method based on character strings, including:
determining one or more character bits with dynamic changes in the character string and static character bits in the corresponding character string;
updating a character string mapping library according to the content information of the static character bits in the character string and the one or more dynamic character bits;
and correspondingly calibrating the one or more dynamic character bits in the character string mapping library by using preset additional character bits.
Preferably, the character string mapping library includes one or more array arrays, each formed by one or more arrays arranged level by level, one level per character; the number of levels of an array array corresponds to the number of characters in the character string; each array contains as many array units as there are complete character values, and the preset additional character bit is appended after the last unit of each array; each array unit stores the address of the next-level array associated with it.
Preferably, the array units matching the number of complete characters are the 256 array units corresponding to 0x00-0xFF, and the additional character bit is correspondingly the 257th array unit of the array; each array unit stores either the address information of its next-level array or the information for jumping out of the current array to obtain a matching result.
Preferably, storing the information for jumping out of the current array to obtain the matching result specifically includes:
storing a jump address link in the last-level array of the array array corresponding to each character string, the jump address link being used to obtain the analysis result matched with the character string; or,
storing the analysis result matched with the character string directly in the last-level array of the array array corresponding to each character string.
Preferably, where the character string mapping library already stores a first character string, importing a newly added second character string into the character string mapping library specifically includes:
for a first character string and a second character string with the same initial character, reusing the first-level array of the first character string for the second character string;
for the i-th character bit at which the first character string and the second character string differ, adding, on the link of the first character string's array array at the i-th level, a new array corresponding to the content of the i-th character bit of the second character string, thus forming two lower-level links from the (i-1)-th level array to i-th level arrays.
Preferably, when a third character string is acquired and the information represented by the third character string needs to be analyzed through the character string mapping library, the method further includes:
according to the content of the first character of the third character string, one or more candidate array arrays with the record information consistent with the content of the first character of the third character string are matched in the array arrays of the character string mapping library;
and screening the one or more candidate array arrays in sequence according to the content of the subsequent character bit of the third character string to obtain an analysis result corresponding to the third character string.
Preferably, the sequentially screening the one or more candidate array arrays according to the content of the subsequent character bit of the third character string to obtain an analysis result corresponding to the third character string specifically includes:
first matching each subsequent character bit as a static character bit; if this fails to yield a unique result, selectively treating the subsequent character bit as a dynamic character bit and matching again with the adjusted bit, until a unique result is matched or, once the loop termination condition is met, a message that matching was unsuccessful is fed back to the operator.
Preferably, the selectively setting the subsequent character bits as the dynamic character bits specifically includes:
in the previous round of matching, the last mismatched character bit is adjusted to be a dynamic character bit, and the current round of matching starts from the array corresponding to the character bit preceding the one newly adjusted to be dynamic;
if a mismatch occurs at a later character bit, the adjusting process is repeated until the matching of the whole character string is completed;
if a character bit still fails to match after being adjusted to be a dynamic character bit, the loop termination condition is deemed to have been reached, and a message that matching was unsuccessful is fed back to the operator.
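The static-first matching with fallback to the dynamic character bit can be sketched in Python as a depth-first search that backtracks to the additional array unit. This is only one possible reading of the procedure above, not the patented implementation: the names `WILDCARD`, `new_array`, `insert` and `lookup` are hypothetical, all rules and inputs are assumed to share one fixed length n, and results are assumed to be stored directly in the last-level array.

```python
WILDCARD = 256  # assumed index of the additional (257th) array unit

def new_array():
    return [None] * 257  # units 0x00-0xFF plus the additional unit

def insert(root, pattern, result):
    """Store one fixed-length rule; WILDCARD marks a dynamic character bit."""
    array = root
    for unit in pattern[:-1]:
        if array[unit] is None:
            array[unit] = new_array()
        array = array[unit]
    array[pattern[-1]] = result  # last-level unit holds the result itself

def lookup(array, data, pos=0):
    """Return the result matched by data, or None when matching fails."""
    if pos >= len(data):
        return None
    # Try the character as a static bit first, then as a dynamic bit.
    for unit in (data[pos], WILDCARD):
        nxt = array[unit]
        if nxt is None:
            continue
        is_array = isinstance(nxt, list)
        if pos == len(data) - 1:
            if not is_array:
                return nxt
        elif is_array:
            found = lookup(nxt, data, pos + 1)
            if found is not None:
                return found
    return None

root = new_array()
insert(root, [0x04, 0x04, 0x01, 0xAA], "R1")
insert(root, [0x04, 0x04, WILDCARD, 0xFD], "R2")

print(lookup(root, [0x04, 0x04, 0x01, 0xFD]))  # R2, found after backtracking
print(lookup(root, [0x04, 0x04, 0x01, 0xAA]))  # R1
print(lookup(root, [0x04, 0x05, 0x01, 0xFD]))  # None
```

In the first query the static unit 0x01 exists at the third level but leads to a mismatch at the fourth, so the search returns to the third level and retries through the additional unit, which corresponds to adjusting the mismatched bit to a dynamic character bit.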
Preferably, the determining one or more dynamically changing character bits in the character string and the static character bits in the corresponding character string specifically include:
comparing a fourth character string and a fifth character string obtained from a data packet in a preset time period, and if the number ratio result between the number of similar character bits and the number of different character bits between the fourth character string and the fifth character string is greater than a preset threshold value, marking the fourth character string and the fifth character string;
upon a confirmation message, fed back through the input end, that the fourth character string and the fifth character string belong to the same analysis result, determining the character bits that differ between the fourth character string and the fifth character string as the dynamically changing character bits, and determining the character bits with the same content as the static character bits.
Preferably, the determining one or more dynamically changing character bits in the character string and the static character bits in the corresponding character string specifically include:
analyzing, at a preset time interval, whether a plurality of last-level arrays in the character string mapping library match the same target matching result;
if it is confirmed that a plurality of last-level arrays match the same target matching result, obtaining the at least two character strings to be integrated that correspond to the respective array arrays, according to the link relation between the last-level arrays and their preceding-level arrays;
comparing the at least two character strings to be integrated to obtain the dynamically changing character bits and the static character bits among them;
and adjusting the corresponding arrays in the character string mapping library according to the dynamically changing character bits and the static character bits.
In a second aspect, the present invention further provides a device for fast matching and recognizing based on character strings, which is used for implementing the method for fast matching and recognizing based on character strings in the first aspect, and the device includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor and programmed to perform the string-based quick match identification method of the first aspect.
In a third aspect, the present invention further provides a non-transitory computer storage medium storing computer-executable instructions for execution by one or more processors to perform the method for identifying a quick match based on a character string according to the first aspect.
The invention marks the dynamically changing character bits; for example, a 257th unit can be added to each array of a conventional dictionary tree to store the link to the next-level array corresponding to a dynamically changing character bit, which greatly reduces the redundancy of the dictionary tree.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for identifying a quick match based on a character string according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an array structure after an array unit is added in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of an exemplary array according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining dynamic byte bits and static byte bits according to an embodiment of the present invention;
FIG. 5 is a flowchart of an array method for adding new character strings to an existing array according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a process for adding an array of strings to an existing array according to an embodiment of the present invention;
FIG. 7 is a second schematic diagram of a process of adding an array of character strings to an existing array according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the effect of a prior art array according to an embodiment of the present invention;
FIG. 9 is a schematic flowchart of a process of searching for a target matching result of a third string by using the string mapping library according to the present invention;
FIG. 10 is a flowchart of an implementation method provided by an embodiment of the present invention and corresponding to step 402 in FIG. 9;
FIG. 11 is a flowchart of another implementation method corresponding to step 402 in FIG. 9 according to an embodiment of the present invention;
FIG. 12 is a flowchart of a method for updating a string map library according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a device for fast matching and recognizing based on character strings according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the embodiments of the present invention, terms such as "first" and "second" are used only to distinguish objects of the same name that appear together in the technical solution; they do not denote a specific object, and in a specific scenario may refer to any two objects that satisfy the corresponding functional description.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
Embodiment 1 of the present invention provides a fast matching and identifying method based on character strings. It is applicable to any occasion in which a target matching result is obtained by matching a character string, and in particular to cases where the character string contains one or more dynamically changing character bits, for example when a specific character bit changes dynamically during application identification and malicious traffic analysis of network data packets in DPI. As shown in fig. 1, the method comprises the following steps:
In step 201, one or more dynamically changing character bits in a character string are determined, together with the static character bits in that character string.
In a specific implementation, determining the one or more dynamically changing character bits in the character string may be performed at several stages of generating or updating the character string mapping library. Depending on when and how the determination is made, the following modes can be distinguished:
In the first mode, the dynamic character bits in a character string are determined from the statistics or observational experience of an operator when the initial array is generated for the character string in the character string mapping library. For example, when the operator creates the associated array for the character string "Hello4world" in the character string mapping library, the operator has already confirmed that the 6th character bit is a dynamic byte bit (that is, the parameter value it contains may change dynamically: the currently displayed value is "4" and the value in the next data packet may be "5", but both character strings correspond to the same matching result, for example both are data packets issued by the "HelloWorld application").
The second mode is a more automated, comparison-based mode in which the identification of character strings containing random bits is carried out by a computer. Specifically:
After a computer (which can be understood as a server in the embodiment of the present invention) has created the mapping array arrays corresponding to a plurality of character strings, it traverses each character string in the character string mapping library together with its matching result; if two different character strings are found to have the same matching result, the character string mapping library is updated by fitting (merging) the array arrays of the two character strings. The creation of the mapping array arrays for the plurality of character strings may itself be confirmed and carried out by an operator. Compared with the first mode, the second mode uses the computer's later analysis and reduces the complexity of the operator's confirmation and creation work in the early stage.
Fitting the two array arrays is implemented as follows: in the array at the level corresponding to the dynamic character bit, the array unit that stores the link to the next-level array is changed to the array unit specially provided for dynamic character bits (for example, the unit at index 256 of the array; in a specific implementation it may instead be placed before index 0 or at any other position in the array, although the head or the tail of the array is preferred), and the branches after that level are reduced to a single branch (only the case in which exactly one character bit differs between the two character strings is considered here).
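The fitting step of the second mode can be illustrated with a small Python sketch that works on the stored character strings rather than on the arrays themselves. It is a simplification under stated assumptions: `merge_by_result` and `WILDCARD` are hypothetical names, the library is modelled as a dictionary from equal-length byte strings to results, and every position that differs within a group is generalized to a dynamic character bit without operator confirmation.

```python
from collections import defaultdict

WILDCARD = 256  # assumed marker for a dynamic character bit

def merge_by_result(library):
    """library: dict mapping equal-length byte strings to a matching result."""
    groups = defaultdict(list)
    for string, result in library.items():
        groups[result].append(string)
    merged = {}
    for result, strings in groups.items():
        # zip(*strings) yields one tuple of byte values per character position
        pattern = tuple(
            column[0] if len(set(column)) == 1 else WILDCARD
            for column in zip(*strings)
        )
        merged[pattern] = result
    return merged

library = {
    bytes([0x04, 0x04, 0x01, 0xFD]): "B",
    bytes([0x04, 0x04, 0x02, 0xFD]): "B",
    bytes([0x04, 0x05, 0xFB, 0x02]): "A",
}
merged = merge_by_result(library)
print(merged[(0x04, 0x04, WILDCARD, 0xFD)])  # B
```

The two strings that map to "B" collapse into one pattern whose third position is dynamic, so only one branch needs to be kept after that level. A production version would also have to guard against two merged patterns colliding and against over-generalizing from too few samples.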
In step 202, a string mapping library is updated according to the content information of the static character bits and the one or more dynamic character bits in the string.
The one or more dynamic character bits are marked in the character string mapping library by a preset additional character bit. As described in step 201, the additional character bit is preferably placed before the start or after the end of the array. As shown in fig. 2, one array unit is appended to the end of a standard 0-255 array as the additional character bit, and this unit stores the link address of the next-level array.
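A minimal Python sketch of the array of fig. 2, assuming 0-based indexing so that index 256 is the 257th unit. The names `ALPHABET_SIZE`, `WILDCARD` and `new_array` are hypothetical and only serve the illustration.

```python
ALPHABET_SIZE = 256        # units 0x00-0xFF, one per byte value
WILDCARD = ALPHABET_SIZE   # index 256: the additional (257th) unit

def new_array():
    """Create one level array: 256 byte units plus the additional unit."""
    return [None] * (ALPHABET_SIZE + 1)

root = new_array()
child = new_array()
# A dynamic character bit links to the next level through the additional
# unit only, instead of repeating the same link in all 256 byte units.
root[WILDCARD] = child

print(len(root))                # 257
print(root[WILDCARD] is child)  # True
```

Placing the additional unit at the tail keeps byte values usable directly as indices; placing it at the head would require shifting every index by one.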
The embodiment of the invention is the first to propose building a character string mapping library that accommodates dynamically changing character bits. In the prior art, a dictionary tree must be built for every character bit of a character string containing random bits according to the parameter values that have appeared, and in the extreme case a single random character bit has 256 possible parameter values (0-255). This greatly increases the size of the dictionary tree, and because the dictionary tree must be loaded into memory during processing, memory is wasted unnecessarily. In the embodiment of the invention, the character strings that the prior art would use to build the dictionary tree are preprocessed by computer statistics and/or operator identification, that is, the dynamically changing character bits are marked; for example, a 257th unit is added to each array of the conventional dictionary tree to store the link to the next-level array corresponding to a dynamically changing character bit, which greatly reduces the redundancy of the dictionary tree.
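The size difference can be made concrete with a rough count of level arrays for a single rule. The sketch assumes the worst case described above, in which a conventional dictionary tree expands every one of the 256 values of a dynamic character bit into its own subtree; `arrays_needed` is a hypothetical helper, and the 11-character rule with one dynamic bit mirrors the "Hello4world" example.

```python
def arrays_needed(is_dynamic, expand):
    """Count level arrays for one rule.

    is_dynamic[i] is True when character bit i is dynamic; expand=True models
    a conventional tree, expand=False a tree with the additional unit.
    """
    total, branches = 0, 1
    for dynamic in is_dynamic:
        total += branches       # arrays required at this level
        if dynamic and expand:
            branches *= 256     # each byte value gets its own subtree
    return total

flags = [False] * 11
flags[5] = True  # the 6th character bit is dynamic

print(arrays_needed(flags, expand=True))   # 1286
print(arrays_needed(flags, expand=False))  # 11
```

The first six levels need one array each in both cases; after the dynamic bit, the expanded tree needs 256 arrays per level for the remaining five levels (6 + 5 x 256 = 1286), whereas the tree with the additional unit keeps a single branch.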
To illustrate the processing procedures and operations of the embodiments more clearly, the relationship between arrays and character strings is described next. As shown in fig. 3, the character string mapping library includes one or more array arrays, each formed by one or more arrays arranged level by level, one level per character; the number of levels of an array array corresponds to the number of characters in the character string; each array contains as many array units as there are complete character values, and the preset additional character bit is appended after the last unit of each array; each array unit stores the address of the next-level array associated with it.
Fig. 3 shows a typical array array mapping two character strings, drawn in the array display manner of fig. 2. The two character strings are character string A and character string B. As fig. 3 shows, the first character bits of character string A and character string B have the same parameter value, 0x04, so the two character strings share the first-level array and diverge from the second character bit; the thick and thin arrow lines in fig. 3 represent the link relationships between the levels of the array arrays of character string A and character string B respectively. From this analysis and the content of fig. 3, character string A is "0x04, 0x05, 0xFB, 0x02" and character string B is "0x04, 0x04, X, 0xFD", where X indicates a dynamically changing parameter value, so that the third character bit of character string B is a dynamic character bit.
Taking fig. 3 as an example, the array units matching the number of complete characters are the 256 array units corresponding to 0x00-0xFF, and the additional character bit is the 257th array unit of the array. Each array unit stores either the address of its next-level array or the information for jumping out of the current array to obtain a matching result. The arrows in fig. 3 represent array units that store the address of the next-level array. The information for jumping out of the current array is usually stored in an array unit of the array at the level of the last character bit of the character string. For character string A in fig. 3, for example, the array unit marked with grid shading in the fourth-level array stores this information: either the target matching result itself, or identification information for addressing the target matching result, which may be a numerical value used to look up the target matching result in a designated table.
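The layout of fig. 3 can be reproduced with a short Python sketch. The helper names and the result labels "result A" and "result B" are hypothetical; the sketch assumes that results are stored directly in the last-level array unit and that all strings have the same length.

```python
WILDCARD = 256  # assumed index of the additional (257th) array unit

def new_array():
    return [None] * 257

def insert(root, pattern, result):
    """pattern: byte values, with WILDCARD standing for the dynamic bit X."""
    array = root
    for unit in pattern[:-1]:
        if array[unit] is None:
            array[unit] = new_array()  # allocate the next-level array once
        array = array[unit]
    array[pattern[-1]] = result        # last-level unit stores the result

root = new_array()                                       # first-level array
insert(root, [0x04, 0x05, 0xFB, 0x02], "result A")       # character string A
insert(root, [0x04, 0x04, WILDCARD, 0xFD], "result B")   # character string B

level2 = root[0x04]                  # reached by both strings via unit 0x04
level4_a = level2[0x05][0xFB]        # fourth-level array of string A
level4_b = level2[0x04][WILDCARD]    # fourth-level array of string B
print(level4_a[0x02], level4_b[0xFD])  # result A result B
```

Both strings follow the same link out of the first-level array and only diverge at the second character bit, where units 0x05 and 0x04 point to different next-level arrays.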
With respect to determining, in step 201, the one or more dynamically changing character bits in the character string and the static character bits in that character string, the embodiment of the present invention further provides a specific implementation (which may be regarded as one specific implementation of the modes introduced in step 201). As shown in fig. 4, it includes the following steps:
In step 2011, a fourth character string and a fifth character string obtained from data packets within a preset time period are compared, and if the ratio between the number of similar character bits and the number of different character bits between the fourth character string and the fifth character string is greater than a preset threshold, the fourth character string and the fifth character string are marked.
Marking the fourth character string and the fifth character string may take the form of feeding back a confirmation request message to an operator. The message carries the contents of the fourth character string and the fifth character string, so that the operator can confirm whether they map to the same target matching result and thereby confirm the dynamically changing character bits in the character string.
The preset time period may be a time parameter set by the operator on the system side, that is, the basis on which the system (or server) decides whether a round of dynamic character bit identification needs to be performed. The preset threshold may be a fixed value or a value calculated from a relation; the latter is preferred in the embodiment of the present invention. For example, the preset threshold may be 60%-80% of the length of the character string.
In step 2012, a confirmation message that the fourth character string and the fifth character string fed back by the input end belong to the same parsing result is obtained; and determining character bits with difference between the fourth character string and the fifth character string as the dynamically changed character bits, and determining character bits with same content between the fourth character string and the fifth character string as the static character bits.
The input end specifically refers to an input end on the side of an operator, and can be represented as a keyboard, a touch pad, a handheld intelligent terminal and the like.
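Steps 2011 and 2012 can be sketched as follows. This is only one possible reading: the threshold is interpreted here as a fraction of the string length that the number of identical character bits must exceed, and the operator confirmation is modelled as a boolean argument. The function names and the default value 0.7 are assumptions, not taken from the original text.

```python
from typing import List, Optional, Tuple


def should_mark(s4: bytes, s5: bytes, threshold_fraction: float = 0.7) -> bool:
    """Step 2011 (one reading): mark the pair when enough character bits coincide."""
    if len(s4) != len(s5) or not s4:
        return False
    same = sum(1 for a, b in zip(s4, s5) if a == b)
    return same > threshold_fraction * len(s4)


def split_bits(s4: bytes, s5: bytes,
               operator_confirmed: bool) -> Optional[Tuple[List[int], List[int]]]:
    """Step 2012: after confirmation, return (dynamic positions, static positions)."""
    if not operator_confirmed:
        return None
    dynamic = [i for i, (a, b) in enumerate(zip(s4, s5)) if a != b]
    static = [i for i, (a, b) in enumerate(zip(s4, s5)) if a == b]
    return dynamic, static


# Two sample strings that differ only in the third character bit (0-based index 2)
fourth = bytes([0x04, 0x04, 0x04, 0xFD])
fifth = bytes([0x04, 0x04, 0xFC, 0xFD])
marked = should_mark(fourth, fifth)
bits = split_bits(fourth, fifth, operator_confirmed=True)
```

Positions in the sketch are 0-based, so index 2 corresponds to the third character bit.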
Example 2:
Embodiment 1 of the present invention has shown the more complete structural features of the character string mapping library. This embodiment describes how a new character string and its array arrays are added to an existing character string mapping library. It is assumed that the character string mapping library already stores a first character string and that a newly added second character string is now imported into the library. As shown in fig. 5, the method specifically includes:
In step 301, for a first character string and a second character string having the same initial character, the first-level array of the first character string is multiplexed to the second character string.
In this embodiment, character string A and character string B introduced in embodiment 1 are used as the first character string and the second character string, respectively. Fig. 6 illustrates the effect after step 301 is executed and character string B has multiplexed the first-level array of character string A. Compared with the initial state, the array unit marked with a solid black circle in the first-level array of fig. 6 changes from storing only the link address information of the second-level array of character string A to storing the link address information of the second-level arrays of both character string A and character string B. In a specific implementation, if the first-level array is multiplexed by N character strings, the corresponding array unit in the first-level array (for example, the array unit marked with a solid black circle in fig. 6) stores the link address information of N second-level arrays.
In step 302, for the i-th character bit that differs between the first character string and the second character string, a new array is added, in the array arrays of the first character string, on the link where the corresponding i-th-level array is located, to correspond to the content of the i-th character bit in the second character string. Two i-th-level arrays are thus linked below the (i-1)-th-level array. Here i is a natural number greater than or equal to 2.
Between character string A and character string B, the i-th character bit is the second character bit. Fig. 7 shows the effect of establishing, after the first-level array of character string A has been multiplexed, a link relationship between the multiplexed first-level array and the newly generated second-level array corresponding to the second character bit of character string B. Corresponding link relationships are then established between the arrays of the different levels corresponding to the subsequent character bits of character string B, which yields the array arrays shown in fig. 3.
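Steps 301 and 302 can be illustrated with the following sketch for the case i = 2. It assumes, for brevity, that each array keeps its units in a dictionary from unit index to a list of links, so that one unit of the multiplexed first-level array can hold the link addresses of several second-level arrays. The names `LevelArray`, `add_string`, and `WILDCARD_INDEX` are hypothetical, and only the first-level array is reused; sharing of deeper levels is not modelled.

```python
from typing import Dict, List

WILDCARD_INDEX = 256  # assumed index of the additional (257th) unit


class LevelArray:
    """An array of a given level; units map a unit index to a list of links."""

    def __init__(self, level: int) -> None:
        self.level = level
        self.units: Dict[int, list] = {}


def add_string(first_level: LevelArray, values: List[int], result: str) -> None:
    """Reuse the first-level array and append a new chain for the remaining levels."""
    current = first_level
    for level, value in enumerate(values[:-1], start=2):
        child = LevelArray(level)
        # The unit keeps every link it has received, one per character string.
        current.units.setdefault(value, []).append(child)
        current = child
    current.units.setdefault(values[-1], []).append(result)


first_level = LevelArray(1)
add_string(first_level, [0x04, 0x05, 0xFB, 0x02], "result-A")            # string A
add_string(first_level, [0x04, 0x04, WILDCARD_INDEX, 0xFD], "result-B")  # string B
```

After both calls, unit 0x04 of the first-level array holds two links, one to the second-level array of character string A and one to that of character string B.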
From the process of creating the array arrays of the second character string in steps 301 and 302, it is not difficult to see that, compared with the existing dictionary tree (trie) technology, adding a new array unit in embodiment 1 to identify dynamically changing character bits significantly reduces memory space occupation. Still taking character string A and character string B of fig. 3 as an example, fig. 8 shows the prior-art case in which the third character bit of character string B takes three dynamic parameter values. The arrays marked with thin lines in the third-level array of fig. 8 show that the three dynamic parameter values are 0x04, 0xFC, and 0xFD. In the prior art, the array corresponding to the same static character bit in the fourth-level array must be copied once for each dynamic parameter value in the third-level array, that is, three times, so that each copy can establish its own link relationship with the third-level array for the dynamic parameter values 0x04, 0xFC, and 0xFD, respectively. For the same pair of character strings A and B, the array arrays obtained by the solution of the embodiment of the present invention (fig. 3) can be compared directly with those obtained by the prior-art method of generating the character mapping library (fig. 8), which shows that the embodiment of the present invention markedly reduces the storage volume. It should be emphasized that fig. 8 only shows the case in which the third character bit of character string B has three dynamic parameter values. In an extreme case, if the third character bit of character string B has 256 dynamic parameter values, then, starting from the third-level array, the arrays corresponding to the subsequent character bits of character string B all need to be copied 256 times, which causes a great loss of memory resources during matching analysis.
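The storage difference can be made concrete by counting arrays in a small sketch. The count below is for a textbook dictionary tree built from nested dictionaries, which is an assumption and may not coincide exactly with the arrays drawn in fig. 8; `count_arrays` and `WILDCARD` are hypothetical names.

```python
from typing import List

WILDCARD = 256  # assumed index of the additional unit


def count_arrays(patterns: List[List[int]]) -> int:
    """Count the level arrays of a plain trie that stores the given patterns."""
    root: dict = {}
    count = 1  # the first-level array
    for pattern in patterns:
        node = root
        for value in pattern[:-1]:
            if value not in node:
                node[value] = {}   # a new next-level array is needed
                count += 1
            node = node[value]
        node[pattern[-1]] = "result"
    return count


# Prior-art style: one pattern per observed value of the third character bit
enumerated = [[0x04, 0x04, v, 0xFD] for v in (0x04, 0xFC, 0xFD)]
# Additional-unit style: a single pattern with the third character bit marked dynamic
with_wildcard = [[0x04, 0x04, WILDCARD, 0xFD]]

arrays_enumerated = count_arrays(enumerated)
arrays_wildcard = count_arrays(with_wildcard)
arrays_extreme = count_arrays([[0x04, 0x04, v, 0xFD] for v in range(256)])
```

Under these assumptions the enumerated form needs one fourth-level array per dynamic value, while the form with the additional unit needs only one, and the gap widens with every further level that follows the dynamic character bit.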
Example 3:
Embodiment 1 introduced the character mapping library architecture proposed by the present invention, and embodiment 2 showed how the array arrays are generated in that architecture for a newly added character string. Embodiment 3 of the present invention further describes how, when a third character string is obtained, the character mapping library is used to obtain the target matching result. As shown in fig. 9, the process includes the following steps:
In step 401, according to the content of the first character of the third character string, one or more candidate array arrays whose record information in the first-level array is consistent with the content of the first character of the third character string are matched among the array arrays of the character string mapping library.
Regarding the one or more candidate array arrays: when only one link leads from the first-level array to a final target matching result, there is one candidate array array; when several links lead from the first-level array to several target matching results, they are called a plurality of candidate array arrays.
In step 402, the one or more candidate array arrays are sequentially screened according to the content of the subsequent character bits of the third character string, so as to obtain an analysis result corresponding to the third character string.
In the embodiment of the present invention, step 402, in which the one or more candidate array arrays are sequentially screened according to the content of the subsequent character bits of the third character string to obtain the analysis result corresponding to the third character string, specifically includes the following, as shown in fig. 10:
In step 4021, the subsequent character bits are set as static character bits for matching.
In step 4022, if no unique result is obtained by matching, the subsequent character bits are selectively set as dynamic character bits and the adjusted subsequent character bits are matched again, until a unique result is matched or, once the condition for exiting the matching loop is reached, a message that matching was unsuccessful is fed back to the operator.
Selectively setting the subsequent character bits as dynamic character bits specifically includes: the character bit that mismatched last in the previous round of matching is adjusted to be a dynamic character bit, and the current round of matching starts from the array corresponding to the character bit immediately preceding the newly adjusted dynamic character bit. If a later character bit mismatches, the adjustment is repeated until the matching of the whole character string is completed. If a character bit still fails to match after it has been adjusted to be a dynamic character bit, the condition for exiting the matching loop is deemed to be reached, and a message that matching was unsuccessful is fed back to the operator.
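The single-thread procedure of steps 4021 and 4022 can be approximated by a small backtracking sketch: each character bit is first tried as a static character bit and, on mismatch, retried as a dynamic character bit through the additional unit. For brevity the library is modelled as nested dictionaries in which each unit holds at most one link; this is an assumption, as are the names `match`, `WILDCARD`, and `library`.

```python
from typing import Optional, Sequence

WILDCARD = 256  # assumed index of the additional unit

# Nested dictionaries stand in for the level arrays; a str is a matching result.
library = {
    0x04: {
        0x05: {0xFB: {0x02: "result-A"}},
        0x04: {WILDCARD: {0xFD: "result-B"}},
    }
}


def match(node, values: Sequence[int], pos: int = 0) -> Optional[str]:
    """Return the matching result for values[pos:], or None if matching fails."""
    if pos == len(values):
        return node if isinstance(node, str) else None
    if not isinstance(node, dict):
        return None
    # First treat the character bit as static; if that fails, treat it as dynamic.
    for key in (values[pos], WILDCARD):
        if key in node:
            found = match(node[key], values, pos + 1)
            if found is not None:
                return found
    # Neither attempt succeeded: this corresponds to exiting the matching loop.
    return None


result_b = match(library, [0x04, 0x04, 0x77, 0xFD])
result_a = match(library, [0x04, 0x05, 0xFB, 0x02])
no_match = match(library, [0x04, 0x05, 0xFB, 0x03])
```

A `None` return value stands for the case in which the message of unsuccessful matching would be fed back to the operator.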
In the embodiment of the present invention, in addition to the single-thread tentative matching method described in the above steps 4021 and 4022, a multi-thread parallel tentative method may also be used, as shown in fig. 11, which is specifically set forth as follows:
In step 4021', the character bit information of a preset length is read from the subsequent-level arrays of each candidate array array and matched against the parameter values of the corresponding character bits in the third character string.
In step 4022', step 4021' yields one round of screening results, which eliminates some of the candidate array arrays. An operation similar to step 4021' is then repeated for the remaining candidate array arrays until a unique result (i.e., the target matching result) is matched or, once the condition for exiting the matching loop is reached, a message that matching was unsuccessful is fed back to the operator.
Compared with the single-thread matching process of steps 4021-4022, which may fold back, the method of steps 4021'-4022' is more efficient. However, multi-thread execution, or directly extracting character bit information of a preset length from the subsequent-level arrays of the candidate array arrays, also occupies more CPU resources, so each of the two processing flows has its own advantages.
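The round-by-round screening of steps 4021' and 4022' can be sketched as follows. Threads are omitted: every candidate is simply checked in each round, which is where parallel execution could be applied. Candidates are represented as flat patterns rather than linked arrays, and the names `screen`, `chunk`, and `WILDCARD` are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

WILDCARD = 256  # assumed index of the additional unit
Candidate = Tuple[Sequence[int], str]  # (pattern, matching result)


def screen(candidates: List[Candidate], values: Sequence[int],
           chunk: int = 2) -> Optional[str]:
    """Filter candidates chunk by chunk; return the unique result or None."""
    if not values:
        return None
    # Step 401: keep candidates whose first character bit agrees with the input.
    remaining = [c for c in candidates
                 if len(c[0]) == len(values) and c[0][0] == values[0]]
    pos = 1
    while pos < len(values) and remaining:
        end = min(pos + chunk, len(values))
        # One screening round: compare a chunk of preset length for every candidate.
        remaining = [
            (pattern, result)
            for pattern, result in remaining
            if all(pattern[i] in (values[i], WILDCARD) for i in range(pos, end))
        ]
        pos = end
    if len(remaining) == 1:
        return remaining[0][1]
    return None


candidates = [
    ([0x04, 0x05, 0xFB, 0x02], "result-A"),
    ([0x04, 0x04, WILDCARD, 0xFD], "result-B"),
]
screened_b = screen(candidates, [0x04, 0x04, 0x10, 0xFD])
screened_a = screen(candidates, [0x04, 0x05, 0xFB, 0x02])
```

A larger `chunk` reads more character bits per round, which reduces the number of rounds at the cost of more work per candidate in each round.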
Example 4:
Embodiment 1 gave a specific method for determining the one or more dynamically changing character bits in a character string and the static character bits in the corresponding character string. That method, however, requires operator intervention and applies to the early stage in which no array arrays have yet been generated for any of the character strings being analyzed. For the character mapping library architecture provided in the embodiment of the present invention, in addition to identifying dynamically changing and static character bits at the initial stage as described in embodiment 1, this embodiment provides a method by which the server (system), after the array arrays corresponding to the character strings have been generated, automatically identifies the dynamically changing character bits between character strings without excessive operator involvement. In this method the array arrays are first established coarsely and the character bits are consolidated gradually at a later stage (this may be regarded as a specific implementation of the second manner introduced in step 201 in embodiment 1). As shown in fig. 12, the method specifically includes the following steps:
In step 501, at intervals of a preset time period, it is analyzed whether a plurality of last-level arrays in the character mapping library can be matched to the same target matching result.
The preset time period may be 1 day, 1 week or 1 month, and is specifically set according to an actual situation, which is not particularly limited herein.
The last-level array is the array of the last level corresponding to a complete character string in the character mapping library. In fig. 3, for example, the fourth-level array is the last-level array in the array arrays corresponding to character string A and character string B.
In the first mode, if the target matching result is stored directly in the last-level array, the target matching results stored in the last-level arrays of the character mapping library may be retrieved directly and compared; if they are the same, it is determined that the plurality of last-level arrays can be matched to the same target matching result. In the second mode, if the last-level array stores the identification information of the corresponding target matching result, the identification information stored in the last-level arrays may likewise be retrieved directly and compared; if it is the same, it is determined that the plurality of last-level arrays can be matched to the same target matching result.
In step 502, if it is confirmed that a plurality of last-level arrays can be matched to the same target matching result, at least two character strings to be integrated, corresponding to the respective array arrays, are obtained according to the link relationship between each last-level array and its preceding-level arrays.
The character strings to be integrated are derived in reverse from the link relationships between the arrays in each array array. Taking the array arrays shown in fig. 6 as an example, the parameter value of the character bit corresponding to the fourth-level array is determined from the serial number of the array unit that stores content in the fourth-level array; in fig. 6 this value is 0x02. Following the link relationship between the arrays, the serial number of the array unit that stores content in the third-level array is 0xFB, so the parameter value of the character bit corresponding to the third-level array is 0xFB. By analogy, the complete character string A to be integrated is obtained as 0x04, 0x05, 0xFB, 0x02.
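The reverse derivation can be sketched as below. The text only states that the derivation follows the link relationships between the arrays; the sketch assumes, hypothetically, that each array records its preceding-level array and the serial number of the unit that links to it (`parent` and `parent_unit`).

```python
from typing import Dict, List, Optional


class LevelArray:
    """An array that remembers which unit of which preceding-level array links to it."""

    def __init__(self, parent: Optional["LevelArray"] = None,
                 parent_unit: Optional[int] = None) -> None:
        self.parent = parent
        self.parent_unit = parent_unit
        self.units: Dict[int, object] = {}


def link(parent: LevelArray, unit: int) -> LevelArray:
    """Create a next-level array and store its link in the given unit of the parent."""
    child = LevelArray(parent, unit)
    parent.units[unit] = child
    return child


def reconstruct(last_level: LevelArray, last_unit: int) -> List[int]:
    """Derive the character string in reverse from a unit of the last-level array."""
    values = [last_unit]
    node = last_level
    while node.parent is not None:
        values.append(node.parent_unit)
        node = node.parent
    values.reverse()
    return values


# Array arrays of character string A: 0x04, 0x05, 0xFB, 0x02
level1 = LevelArray()
level2 = link(level1, 0x04)
level3 = link(level2, 0x05)
level4 = link(level3, 0xFB)
level4.units[0x02] = "result-A"

string_a = reconstruct(level4, 0x02)
```

The serial number of the occupied unit at each level is read off in reverse order, last level first, and the list is then reversed to give the character string.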
In step 503, the at least two strings to be integrated are compared to obtain dynamically changing character bits and static character bits between the strings to be integrated.
The array arrays shown in fig. 8 are now reused as an application scenario. They contain character string A, character string B1, character string B2, and character string B3 (character strings B1, B2, and B3 correspond to the arrays marked by thin-line arrows in fig. 8, ordered from top to bottom within each level). Step 501 has confirmed, from the content of the fourth-level arrays, that the target matching results of character strings B1, B2, and B3 are the same, and step 502 has derived in reverse that character strings B1, B2, and B3 are "0x04, 0x04, 0x04, 0xFD", "0x04, 0x04, 0xFC, 0xFD", and "0x04, 0x04, 0xFD, 0xFD", respectively. It can therefore be further confirmed that the static character bits are the first character bit "0x04", the second character bit "0x04", and the fourth character bit "0xFD", and that the dynamically changing character bit is the third character bit.
In step 504, the corresponding array in the character mapping library is adjusted according to the dynamically changing character bits and the static character bits.
The state shown in fig. 8 is again taken as the array arrays before step 504 is executed; after step 504 is executed, the corresponding array arrays in the character mapping library are updated to the state shown in fig. 3.
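Steps 501 to 504 can be summarised in a short sketch that works on the reverse-derived character strings instead of the linked arrays, which is a simplification. Strings with the same target matching result and the same length are grouped, and every character bit whose value varies within a group is replaced by the additional unit. The names `consolidate` and `WILDCARD` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WILDCARD = 256  # assumed index of the additional unit


def consolidate(entries: List[Tuple[List[int], str]]) -> List[Tuple[List[int], str]]:
    """Merge strings that share a result, marking differing character bits as dynamic."""
    groups: Dict[Tuple[str, int], List[List[int]]] = defaultdict(list)
    for values, result in entries:
        # Step 501: strings with the same target matching result belong together.
        groups[(result, len(values))].append(values)
    merged = []
    for (result, _length), group in groups.items():
        # Step 503: a character bit is static if all strings agree on it.
        pattern = [column[0] if len(set(column)) == 1 else WILDCARD
                   for column in zip(*group)]
        # Step 504: the group is replaced by a single pattern.
        merged.append((pattern, result))
    return merged


entries = [
    ([0x04, 0x05, 0xFB, 0x02], "result-A"),   # character string A
    ([0x04, 0x04, 0x04, 0xFD], "result-B"),   # character string B1
    ([0x04, 0x04, 0xFC, 0xFD], "result-B"),   # character string B2
    ([0x04, 0x04, 0xFD, 0xFD], "result-B"),   # character string B3
]
merged = consolidate(entries)
```

In this sketch the three B strings collapse into one pattern whose third character bit is the additional unit, while character string A is left unchanged.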
Example 5:
After the character mapping library architecture proposed by the present invention and the methods for using each part of it have been provided, the embodiment of the present invention further proposes a character-string-based fast matching and recognition device for executing the methods of the above embodiments. As shown in fig. 13, the device of this embodiment includes one or more processors 21 and a memory 22. One processor 21 is taken as an example in fig. 13.
The processor 21 and the memory 22 may be connected by a bus or other means, and the bus connection is exemplified in fig. 13.
The memory 22, as a non-volatile computer-readable storage medium for a method and apparatus for string-based fast match recognition, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the method for string-based fast match recognition in embodiment 1 (e.g., the flowcharts shown in fig. 1, 4, 5, 9-12). The processor 21 executes various functional applications and data processing of the character string-based quick match recognition apparatus by executing nonvolatile software programs, instructions, and modules stored in the memory 22, that is, implements the character string-based quick match recognition method of embodiments 1 to 4.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules and units in the device are based on the same concept as the processing method embodiment of the present invention, specific contents may refer to the description in the method embodiment of the present invention, and are not described herein again.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A quick matching identification method based on character strings is characterized by comprising the following steps:
determining one or more character bits with dynamic changes in the character string and static character bits in the corresponding character string;
updating a character string mapping library according to the content information of the static character bits in the character string and the one or more dynamic character bits;
correspondingly calibrating the one or more dynamic character bits in the character string mapping library by using preset additional character bits;
the character string mapping library comprises one or more array arrays, and each array array is formed by arranging one or more arrays level by level according to the characters to which they correspond; wherein the number of levels of the array array corresponds to the number of characters in the character string; each array comprises array units whose number is consistent with the number of complete characters, and the preset additional character bit is additionally arranged after the last character bit of each array; the array unit is used for storing the address of the next-level array associated with the array unit;
the method includes that the character string mapping library already stores a first character string, and at this time, a newly added second character string is imported into the character string mapping library, and specifically includes:
multiplexing a first-level array of the first character string to a second character string for the first character string and the second character string with the same initial character;
for the i-th character bit that differs between the first character string and the second character string, adding, in the array arrays of the first character string, an array on the link where the corresponding i-th-level array is located, the added array corresponding to the content of the i-th character bit in the second character string; two i-th-level arrays are thus linked below the (i-1)-th-level array;
when a third character string is acquired and the information represented by the third character string needs to be analyzed through the character string mapping library, the method further comprises the following steps:
according to the content of the first character of the third character string, one or more candidate array arrays with the record information consistent with the content of the first character of the third character string are matched in the array arrays of the character string mapping library;
sequentially screening the one or more candidate array arrays according to the content of the subsequent character bit of the third character string to obtain an analysis result corresponding to the third character string;
the array units of the number of the complete characters specifically include 256 array units in total corresponding to 0x00-0xFF, and the additional characters are correspondingly set as 257 th array units in the array, wherein each array unit is used for storing address information of the next-level array or storing corresponding information for jumping out of the current array to obtain a matching result.
2. The method for rapid match recognition based on character strings according to claim 1, wherein the method for storing the corresponding information of the match result obtained by skipping out of the current array specifically comprises:
jump address links are stored in the last-level array of the array corresponding to each character string, and the jump address links are used for acquiring analysis results matched with the character strings; or,
and the last-level array of the array corresponding to each character string stores the analysis result matched with the character string.
3. The method according to claim 1, wherein the screening the one or more candidate array arrays according to the content of the subsequent character bits of the third character string in sequence to obtain the analysis result corresponding to the third character string specifically includes:
setting the subsequent character bits as static character bits for matching; if no unique result is obtained by matching, selectively setting the subsequent character bits as dynamic character bits and matching the adjusted subsequent character bits until a unique result is matched, or feeding back a message that matching was unsuccessful to an operator once the condition for exiting the matching loop is reached.
4. The method according to claim 3, wherein the selectively setting the subsequent character bits as dynamic character bits specifically comprises:
adjusting the character bit that mismatched last in the previous round of matching to be a dynamic character bit, and performing the current round of matching starting from the array corresponding to the character bit immediately preceding the newly adjusted dynamic character bit;
if a later character bit mismatches, repeating the adjustment until the matching of the whole character string is completed;
if a character bit still fails to match after being adjusted to be a dynamic character bit, confirming that the condition for exiting the matching loop is reached, and feeding back a message that matching was unsuccessful to an operator.
5. The method according to any one of claims 1 to 4, wherein the determining that one or more dynamically changing character bits exist in the character string and static character bits in the corresponding character string specifically includes:
comparing a fourth character string and a fifth character string obtained from a data packet in a preset time period, and if the number ratio result between the number of similar character bits and the number of different character bits between the fourth character string and the fifth character string is greater than a preset threshold value, marking the fourth character string and the fifth character string;
according to a confirmation message, fed back by the input end, that the fourth character string and the fifth character string belong to the same analysis result, determining the character bits that differ between the fourth character string and the fifth character string as the dynamically changing character bits, and determining the character bits with the same content between the fourth character string and the fifth character string as the static character bits.
6. The method according to any one of claims 1 to 4, wherein the determining that one or more dynamically changing character bits exist in the character string and static character bits in the corresponding character string specifically includes:
analyzing whether a plurality of final-stage arrays exist in the character mapping library and can be matched with the same target matching result or not according to a preset time period;
if the condition that a plurality of last-stage arrays can be matched with the same target matching result is confirmed, acquiring at least two character strings to be integrated corresponding to the respective array arrays according to the link relation between the last-stage arrays and the preceding-stage arrays;
comparing the at least two character strings to be integrated to obtain the dynamically changing character bits and the static character bits between the character strings to be integrated;
and adjusting the corresponding array in the character mapping library according to the dynamically changed character bit and the static character bit.
7. An apparatus for string-based fast match recognition, comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor and programmed to perform the string-based quick match identification method of any of claims 1-6.
CN201910339599.4A 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings Active CN110083746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910339599.4A CN110083746B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910339599.4A CN110083746B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings
CN201810362354.9A CN108628966B (en) 2018-04-20 2018-04-20 A kind of quick matching and recognition method and device based on character string

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810362354.9A Division CN108628966B (en) 2018-04-20 2018-04-20 A kind of quick matching and recognition method and device based on character string

Publications (2)

Publication Number Publication Date
CN110083746A CN110083746A (en) 2019-08-02
CN110083746B true CN110083746B (en) 2021-01-22

Family

ID=63694204

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201910339586.7A Active CN110096628B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings
CN201910339599.4A Active CN110083746B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings
CN201910339570.6A Active CN110008385B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings
CN201810362354.9A Active CN108628966B (en) 2018-04-20 2018-04-20 A kind of quick matching and recognition method and device based on character string

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910339586.7A Active CN110096628B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201910339570.6A Active CN110008385B (en) 2018-04-20 2018-04-20 Quick matching identification method and device based on character strings
CN201810362354.9A Active CN108628966B (en) 2018-04-20 2018-04-20 A kind of quick matching and recognition method and device based on character string

Country Status (1)

Country Link
CN (4) CN110096628B (en)

CN107193843B (en) * 2016-03-15 2020-08-28 阿里巴巴集团控股有限公司 Character string screening method and device based on AC automaton and suffix expression
CN107545071B (en) * 2017-09-21 2020-02-07 北京神州泰岳智能数据技术有限公司 Method and device for matching character strings

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
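Research on Acceleration Techniques for Pattern Matching Algorithms Based on Many-Core Hardware; Liu Xudong (刘旭东); China Master's Theses Full-text Database, Information Science and Technology; 20150815 (No. 08); see Sections 3.1.2.2, 3.3.2, and 3.4.1 of the main text *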

Also Published As

Publication number Publication date
CN110008385A (en) 2019-07-12
CN110096628B (en) 2021-01-22
CN110096628A (en) 2019-08-06
CN110083746A (en) 2019-08-02
CN110008385B (en) 2020-12-22
CN108628966B (en) 2019-06-14
CN108628966A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
CN110083746B (en) Quick matching identification method and device based on character strings
CN111008201B (en) Method and apparatus for parallel modification and reading of state trees
US8914320B2 (en) Graph generation method for graph-based search
CN104579941A (en) Message classification method in OpenFlow switch
CN105794172A (en) Packet parsing and key generation in a network device
US10164884B2 (en) Search apparatus, search configuration method, and search method
JP2003196295A (en) Method for improving lookup performance of tree-type knowledge base search
CN112468410B (en) Method and device for enhancing accuracy of network traffic characteristics
CN113946546B (en) Abnormality detection method, computer storage medium, and program product
CN111935081B (en) Data packet desensitization method and device
CN110071871A (en) A kind of large model pool ip address matching process
WO2015192742A1 (en) Lookup device, lookup method and configuration method
CN114422620B (en) Data packet classification method and related device based on knowledge distillation
CN108304467B (en) Method for matching between texts
CN112866229B (en) High-speed network traffic identification method and system based on state diagram
CN112887280B (en) Network protocol metadata extraction system and method based on automaton
KR100662254B1 (en) Apparatus and Method for Packet Classification in Router
CN114490861A (en) Telemetry data analysis method, device, equipment and medium
CN110581823B (en) Method for analyzing non-public database protocol request data packet
CN115801020B (en) Definite finite state automaton compression method, matching method, device and medium
CN113778893B (en) Method, device, equipment and storage medium for generating test case of dialogue robot
US11184282B1 (en) Packet forwarding in a network device
CN112769813B (en) Matching method of multi-prefix mask quintuple
CN108228660A (en) The method and its device of the corresponding parameter of generation SQL templates and acquisition based on automatic machine
CN117336242A (en) Exit flow control method based on XDP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant