WO2008141519A1 - Method and chip structure for matching multi-character string - Google Patents

Method and chip structure for matching multi-character string

Info

Publication number
WO2008141519A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
post
current
input
input character
Prior art date
Application number
PCT/CN2008/000293
Other languages
French (fr)
Chinese (zh)
Inventor
Tian Song
Original Assignee
Beijing Zhean Technology Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhean Technology Corporation
Publication of WO2008141519A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • The present invention relates to a method and a chip structure for information processing, and in particular to a multi-character string matching method and chip structure.
  • Background art
  • Multi-string matching technology, also known as multi-keyword matching technology, has matured and is widely used in many fields such as text processing and content filtering.
  • The technology can find one or more of a predefined set of strings in the one-dimensional content to be matched. During matching it fully exploits the features of the string set: the set is pre-processed, and content is matched against the intermediate data structure produced by pre-processing, so that the whole set of predefined strings is matched in parallel.
  • Many applications use multi-string matching techniques, such as network intrusion detection and prevention systems, spam filtering, virus scanning and filtering, malicious code scanning and filtering, and content filtering.
  • The typical use of multi-string matching in this type of application is to capture packets from the network, restore them to data of a specific network layer, and match that data against pre-defined rule sets (e.g., intrusion rules, virus rules, spam rules). In most cases, this matching uses multi-string matching techniques.
  • In practical applications of multi-string matching, one kind of scheme (hereinafter referred to as scheme A) is favored because of the following characteristics:
  • The matching performance is independent of the size of the rule base, independent of the minimum length of the rules in the rule base, and independent of the relationship between the rule base and the text to be matched.
  • Scheme A preprocesses P and constructs a finite state automaton (DFA), as shown in Figure 1 (where a circle indicates a state and a line indicates a conversion rule).
  • DFA finite state automaton
  • One character is read at a time and, following the conversion relationships in the above structure, the automaton advances one position each time; when the S3 or S5 position is reached, a valid match is found.
  • The scheme of paper 1 adopts scheme A and proposes a priority-based conversion rule storage method, which can merge all the failure conversion rules and all the restart conversion rules in Fig. 1 into at most 256 rules. In practical applications, the number of conversion rules can be greatly reduced.
  • However, paper 1 does not completely solve the problem of storage space growing with the number of rules: matching large-scale feature sets still requires a great deal of space.
  • The state machine contains states and conversion rules. Implementing the state machine with a chip structure means that the conversion rules of the state machine are stored in a specific memory and accessed as needed.
  • The information contained in each conversion rule includes: the pre-state, the input character, and the post-state.
  • The pre-state refers to the current state of the state machine.
  • A conversion rule describes the jump from the pre-state to a certain post-state on receiving a character. For each (pre-state, input character) pair, the state machine has a unique conversion rule that corresponds to it.
  • TCAM (ternary content-addressable memory) device
  • The main object of the present invention is to provide a multi-string matching method and chip structure. The technical problem to be solved is to achieve high matching speed while supporting large-scale rule sets, which makes the invention well suited to practical use.
  • The cache state machine includes: a status register, for registering the current state; a cache status register, for registering the cache state; and a conversion rule module, for storing and accessing the state conversion rule base. According to the character received by the interface module, the current state held in the status register, and the cache state held in the cache status register, the conversion rule module looks up the next state and outputs it to the status register; it also assigns the cache status register according to a specific cache rule.
  • A multi-string matching method comprising the steps of: sequentially taking characters as input characters from a received input character stream; and, for each input character, performing the following steps: a post state is looked up in the state transition rule base according to the current input character, the current state, and the cache state; the machine jumps to the post state; state caching is performed according to a specific cache rule; the post state is taken as the current state, the cached state as the cache state, and the next input character as the current input character; and the steps performed for each input character are repeated until all the characters in the character stream have been judged.
  • The step of looking up the post state includes: first determining whether the basic conversion rules and the n-step cross-conversion rules contain a rule under which the current state receives the current input character; if one exists, its post state is used as the lookup result; if not, determining whether they contain a rule under which the cache state receives the current input character; if one exists, its post state is used as the lookup result; if not, determining whether they contain a rule under which the initial state receives the current input character; if one exists, its post state is used as the lookup result; otherwise, the initial state is used as the lookup result.
  • The step of performing state caching according to a specific cache rule is: if the basic conversion rules contain a post state for the initial state receiving the current input character, that post state is cached; otherwise, the initial state is cached.
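The lookup order and cache rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented chip logic: the rule base is a plain dictionary, the state numbers, `INITIAL`, and the helper names `find_post_state`, `next_cache` and `run` are hypothetical, and the toy rule set contains only basic conversion rules for the patterns "ab" and "bc" (no n-step cross-conversion rules).

```python
INITIAL = 0  # assumed number of the initial state

def find_post_state(rules, current, cache, ch):
    """Try the current state, then the cache state, then the initial state."""
    for state in (current, cache, INITIAL):
        post = rules.get((state, ch))
        if post is not None:
            return post
    return INITIAL  # no rule receives ch: fall back to the initial state

def next_cache(rules, ch):
    """Cache the post state of (initial state, ch) if it exists, else the initial state."""
    return rules.get((INITIAL, ch), INITIAL)

def run(rules, accepting, text):
    current = cache = INITIAL
    hits = []
    for i, ch in enumerate(text):
        post = find_post_state(rules, current, cache, ch)  # uses the old cache state
        cache = next_cache(rules, ch)                      # then refresh the cache
        current = post
        if current in accepting:
            hits.append((i, accepting[current]))
    return hits

# Toy rule base (hypothetical): basic conversion rules for "ab" and "bc".
RULES = {(0, "a"): 1, (1, "b"): 2, (0, "b"): 3, (3, "c"): 4}
ACCEPTING = {2: "ab", 4: "bc"}
```

On the text "abc", state 2 has no rule for "c", but the cache state 3 (stored when "b" was read) does, so "bc" is still reported without storing a cross-conversion rule from state 2 to state 4.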
  • The step of looking up the post state includes: determining the type of the current state; if it is a converged state or a general state, the post state is looked up in the state transition rule set according to the current input character and the current state; if it is a detached state, the post state is looked up in the detached state transition rule set according to the current input character, the current state, and the cache state.
  • The detached state transition rule set receives three inputs (the current input character, the current state, and the cache state) and provides one output accordingly: the post state.
  • The step of caching according to a specific cache rule is: if the current state is a converged state, the current state is cached.
  • The present invention also provides a computer readable storage medium storing a plurality of instructions which, when executed by a processor, cause the processor to: receive input characters; and, for each input character, perform the following steps: look up a post state in the state transition rule base according to the current input character, the current state, and the cache state; jump to the post state; perform state caching according to a specific cache rule; take the post state as the current state, the cached state as the cache state, and the next input character as the current input character; and repeat the steps performed for each input character until all the characters in the character stream have been judged.
  • The present invention also provides a system comprising: a processor; a bus coupled to the processor for transferring data between portions of the system; a communication interface coupled to the bus for receiving a stream of character data; and a main memory coupled to the bus, in which a number of instructions are stored which, when executed by the processor, cause the processor to perform the following steps: sequentially extract characters from the received character data stream as input characters; and, for each input character, perform the following steps: look up the post state in the state transition rule base according to the current input character, the current state, and the cache state; jump to the post state; perform state caching according to a specific cache rule; take the post state as the current state, the cached state as the cache state, and the next input character as the current input character; and repeat the steps performed for each input character until all the characters in the character stream have been judged.
  • The post-state search method includes: calculating a possible post state according to the current state and the input character in conjunction with the input translation table; searching the rule storage table according to the possible post state to obtain the corresponding input character; comparing whether the actual input character is consistent with the character obtained by searching the rule storage table; if they are consistent, switching the state to the possible post state; if they are inconsistent, resetting the state to zero.
  • The numbering rule of the states includes: if the current state has only one corresponding output conversion rule, the number of the post state of that output conversion rule is the number of the current state plus one.
  • The step of calculating a possible post state includes: according to a certain rule set, if the current state has only one corresponding output conversion rule, the number of the current state is incremented by one to obtain the number of the possible post state; if the current state has a plurality of corresponding output conversion rules, the color of the current state and the input character are taken as inputs, the input translation table is searched to obtain the difference between the number of the possible post state and that of the current state, and the difference is added to the number of the current state to obtain the number of the possible post state.
  • the rule storage table is configured to: the input is a post state, and the corresponding output is a color of the post state and an input character corresponding to the post state.
  • the input translation table is configured to: the input is the color of the current state and the input character, and the corresponding output is the difference between the possible post state and the current state.
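The post-state lookup described above (compute a candidate, then verify it against the rule storage table) can be sketched as follows. This is a sketch under assumptions, not the patent's circuit: the tables are Python dictionaries, the names `SINGLE`, `ITT`, `RST`, `step` and `run` are hypothetical, color `0` is assumed to mark states with exactly one output rule, and a missing input translation table entry is assumed to reset the state to zero. The toy tables encode the patterns "ab" and "ac".

```python
SINGLE = 0  # assumed color meaning "exactly one output conversion rule"

def step(state, color, ch, itt, rst):
    """Return (next_state, next_color) for one input character."""
    if color == SINGLE:
        delta = 1                       # numbering rule: post state = current + 1
    else:
        delta = itt.get((color, ch))    # (color, character) -> state-number difference
        if delta is None:               # assumption: no entry means reset to zero
            return 0, rst[0][0]
    candidate = state + delta           # the possible post state
    entry = rst.get(candidate)          # post state -> (color, expected character)
    if entry is not None and entry[1] == ch:
        return candidate, entry[0]      # characters agree: take the transition
    return 0, rst[0][0]                 # characters differ: state is zeroed

def run(text, itt, rst):
    state, color = 0, rst[0][0]
    for ch in text:
        state, color = step(state, color, ch, itt, rst)
    return state

# Toy tables (hypothetical). State 0 -a-> 1, 1 -b-> 2, 1 -c-> 3.
# Color 1 is the branching state's row; color 2 is an empty row used for leaves.
RST = {0: (SINGLE, None), 1: (1, "a"), 2: (2, "b"), 3: (2, "c")}
ITT = {(1, "b"): 1, (1, "c"): 2}
```

The verification step is what lets several states share one color (one table row): a shared entry may propose a wrong candidate for some state, and the character comparison rejects it.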
  • The foregoing post-state search method further includes performing entry merging on the input translation table (ITT), where each row of the table corresponds to a current state and each column corresponds to one input character. Entry merging includes judging whether a resource conflict or an overlay conflict exists. Each column of the two rows to be merged is judged, and the judgment of the k-th column is as follows: if exactly one of the two entries is empty, judge whether the state that the empty entry's state would reach through the non-empty entry after the merge receives a character equal to k; if yes, it is an overlay conflict, the two rows cannot be merged, and the procedure exits; if not, the next column is judged. If both entries are empty or both are non-empty, judge whether their values are the same; if not, it is a resource conflict, the two rows cannot be merged, and the procedure exits; if so, the next column is judged.
  • A resource conflict means that the values of the corresponding column in the two ITT entries are both non-empty and different. An overlay conflict means that a non-null value in one ITT entry covers a null value in the other, which is equivalent to giving the original state an additional conversion rule, and the additional conversion rule conflicts with an original conversion rule.
  • Once it is determined that no column of the two rows to be merged has a resource conflict or an overlay conflict, the rows are merged, with non-null values covering null values.
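One reading of the merge test above can be sketched in Python. The source text is partly garbled, so this is an interpretation rather than the patent's exact procedure: rows are dictionaries from input character to state-number difference (a missing key is an empty entry), `received_char` maps a state number to the character that state receives, and the names `can_merge` and `merge_rows` are hypothetical.

```python
def can_merge(row_a, state_a, row_b, state_b, received_char):
    """Check two ITT rows for resource conflicts and overlay conflicts."""
    for k in set(row_a) | set(row_b):          # columns where at least one row is non-empty
        da, db = row_a.get(k), row_b.get(k)
        if da is not None and db is not None:
            if da != db:
                return False                   # resource conflict: both set, values differ
        elif da is None:
            # state_a would inherit row_b's entry and could reach state_a + db on k
            if received_char.get(state_a + db) == k:
                return False                   # overlay conflict
        else:
            # state_b would inherit row_a's entry and could reach state_b + da on k
            if received_char.get(state_b + da) == k:
                return False                   # overlay conflict
    return True

def merge_rows(row_a, row_b):
    """Merge two compatible rows; non-empty entries cover empty ones."""
    merged = dict(row_a)
    merged.update(row_b)
    return merged

# Toy data (hypothetical): state 1 branches on "b"/"c", state 4 only on "b".
ROW_1 = {"b": 1, "c": 2}
ROW_4 = {"b": 1}
RECEIVED = {2: "b", 3: "c", 5: "b", 6: "d"}
```

With these toy tables the rows merge cleanly, because state 6 (the state that state 4 would wrongly reach on "c") receives "d", so the later character comparison rejects it. If state 6 received "c" instead, the merge would create a spurious transition and is refused.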
  • The foregoing post-state search method further includes performing group associative optimization on the input translation table, which includes the following steps of determining whether there is a resource conflict: for N-way group association, the rows of the ITT table are divided into 256/N groups; for each group, the number of valid values contained in the two rows is determined; if the number is greater than N, there is a resource conflict in the group; otherwise, the next group is judged; once it is determined that none of the 256/N groups has a resource conflict, the two rows are merged.
  • The present invention also provides a computer readable storage medium storing a plurality of instructions which, when executed by a processor, cause the processor to perform the following steps: calculating a possible post state according to the current state and the input character in conjunction with the input translation table; searching the rule storage table according to the possible post state to obtain the corresponding input character; comparing whether the actual input character is consistent with the character obtained by searching the rule storage table; if they are consistent, converting the state to the possible post state; if they are inconsistent, resetting the state to zero.
  • The numbering rule of the states includes: if the current state has only one corresponding output conversion rule, the number of the post state of that output conversion rule is the number of the current state plus one. The step of calculating the possible post state includes: according to a certain rule set, if the current state has only one corresponding output conversion rule, add one to the number of the current state to obtain the number of the possible post state; if the current state has a plurality of corresponding output conversion rules, take the color of the current state and the input character as inputs, search the input translation table to obtain the difference between the number of the possible post state and that of the current state, and add the difference to the number of the current state to obtain the number of the possible post state.
  • the rule storage table is configured to: the input is a post state, and the corresponding output is a color of the post state and an input character corresponding to the post state.
  • the input translation table is configured to: the input is the color of the current state and the input character, and the corresponding output is the difference between the possible post state and the current state.
  • Each row of the input translation table corresponds to a current state and each column corresponds to one input character; the input translation table undergoes entry merging, which is performed as follows:
  • Each column of the two rows is judged, and the judgment of the k-th column is as follows: if exactly one of the two entries is empty, judge whether the state that the empty entry's state would reach through the non-empty entry after the merge receives a character equal to k; if yes, it is an overlay conflict, the two rows cannot be merged, and the procedure exits.
  • If both entries are empty or both are non-empty, judge whether their values are the same; if not, it is a resource conflict, the two rows cannot be merged, and the procedure exits; if so, the next column is judged. Once it is determined that no column of the two rows to be merged has a resource conflict or an overlay conflict, the rows are merged, with non-null values covering null values.
  • The input translation table is optimized by group association, and the group association optimization includes the following steps of determining whether there is a resource conflict: for N-way group association, the ITT table is divided into 256/N groups; for each group, the number of valid values contained in the two rows is determined; if the number is greater than N, there is a resource conflict in the group; otherwise, the next group is judged; once it is determined that none of the 256/N groups has a resource conflict, the two rows are merged.
  • The present invention also provides a system comprising: a main processor, which organizes the input data stream; and a coprocessor unit connected to the main processor. The coprocessor unit performs the following operations: calculating a possible post state according to the current state and the input character in conjunction with the input translation table; searching the rule storage table according to the possible post state to obtain the corresponding input character; comparing whether the actual input character is consistent with the character obtained by searching the rule storage table; if they are consistent, transitioning the state to the possible post state; if they are inconsistent, resetting the state to zero.
  • The numbering rule of the states includes: if the current state has only one corresponding output conversion rule, the number of the post state of that output conversion rule is the number of the current state plus one. The step of calculating the possible post state includes: according to a certain rule set, if the current state has only one corresponding output conversion rule, add one to the number of the current state to obtain the number of the possible post state; if the current state has a plurality of corresponding output conversion rules, take the color of the current state and the input character as inputs, search the input translation table to obtain the difference between the number of the possible post state and that of the current state, and add the difference to the number of the current state to obtain the number of the possible post state.
  • the rule storage table is configured to: the input is a post state, and the corresponding output is the color of the post state and the input character corresponding to the post state.
  • the input translation table is configured to: the input is the color of the current state and the input character, and the corresponding output is the difference between the possible post state and the current state.
  • Each row of the input translation table corresponds to a current state and each column corresponds to one input character; the input translation table undergoes entry merging, which is performed as follows:
  • Each column of the two rows is judged, and the judgment of the k-th column is as follows: if exactly one of the two entries is empty, judge whether the state that the empty entry's state would reach through the non-empty entry after the merge receives a character equal to k; if yes, it is an overlay conflict, the two rows cannot be merged, and the procedure exits.
  • If both entries are empty or both are non-empty, judge whether their values are the same; if not, it is a resource conflict, the two rows cannot be merged, and the procedure exits; if so, the next column is judged. Once it is determined that no column of the two rows to be merged has a resource conflict or an overlay conflict, the rows are merged, with non-null values covering null values.
  • The input translation table is optimized by group association, and the group association optimization includes the following steps of determining whether there is a resource conflict: for N-way group association, the ITT table is divided into 256/N groups; for each group, the number of valid values contained in the two rows is determined; if the number is greater than N, there is a resource conflict in the group; otherwise, the next group is judged; once it is determined that none of the 256/N groups has a resource conflict, the two rows are merged.
  • the object of the present invention and solving the technical problems thereof are additionally achieved by the following technical solutions.
  • A post-state lookup structure includes: a main memory, which stores the basic conversion rules and the cross-conversion rules, whose input is the possible post state calculated according to the current state and the input character in conjunction with the input translation table, and which outputs, according to the stored conversion rules, the color of the possible post state and the input character corresponding to the possible post state; a secondary memory, which stores the failure conversion rules and the restart conversion rules, whose input is the actual input character, and which outputs, according to the stored conversion rules, the post state corresponding to the actual input character and its color; an input translation table, whose input is the color of the current state and the actual input character, and whose corresponding output is the difference between the number of the possible post state and that of the current state; and a two-state gate, which acts on the result of comparing the character output by the main memory with the actual input character: if they are equal, the current state is converted to the calculated possible post state, and the color of the current state is converted to the color of the possible post state output by the main memory; otherwise, the current state and
  • The post-state lookup structure further includes a comparator for comparing the character output by the main memory with the actual input character.
  • The post-state lookup structure further includes: a status register, configured to store the current state; and a color register, configured to store the color of the current state.
  • The post-state lookup structure further includes a gate, configured to select and output either the output value of the input translation table or the value 1 according to the value of the color register.
  • The post-state lookup structure further includes an adder, configured to add the number of the current state to the output value of the gate to calculate the possible post state.
  • A multi-string matching structure comprising: a status register, for storing the current state; a color register, for storing the color of the current state; a state buffer, for storing the cache state; a color buffer, for storing the color of the cache state; and a main memory, which stores the basic conversion rules and the n-step cross-conversion rules, whose first input is the first possible post state calculated according to the current state and the input character combined with the input translation table, and whose corresponding first output is the color of the first possible post state obtained according to the stored conversion rules and the input character corresponding to the first possible post state; whose second input is the second possible post state calculated according to the cache state and the input character combined with the input translation table, and whose corresponding second output is the color of the second
  • If the first path character is the same as the actual input character, the status register is overwritten with the first possible post state, and the color register is overwritten with the color of the first possible post state; if the first path character differs from the actual input character but the second path character is the same as the actual input character, the status register is overwritten with the second possible post state, and the color register is overwritten with the color of the second possible post state; otherwise, the status register and the color register are overwritten with the post state output and its color, respectively.
  • The multi-string matching structure further includes: a first comparator, configured to compare the first path character output by the main memory with the actual input character; and a second comparator, configured to compare the second path character output by the main memory with the actual input character.
  • The multi-string matching structure further includes: a first gate, configured to select and output either the output value of the input translation table or the value 1 according to the value of the color register; and a second gate, configured to select and output either the output value of the input translation table or the value 1 according to the value of the color buffer.
  • The multi-string matching structure further includes: a first adder, configured to add the number of the current state to the output value of the first gate to calculate the first possible post state; and a second adder, configured to add the number of the cache state to the output value of the second gate to calculate the second possible post state.
  • A multi-regular expression matching method comprising the steps of: sequentially taking characters as input characters from a received input character stream; and, for each input character, performing the following steps: a post state is looked up in the state transition rule base according to the current input character, the current state, and the cache state; the machine jumps to the post state; state caching is performed according to a specific cache rule; the post state is taken as the current state, the cached state as the cache state, and the next input character as the current input character; and the steps performed for each input character are repeated until all the characters in the character stream have been judged.
  • The step of looking up the post state includes: first determining whether the basic conversion rules and the n-step cross-conversion rules contain a rule under which the current state receives the current input character; if one exists, its post state is used as the lookup result; if not, determining whether they contain a rule under which the cache state receives the current input character; if one exists, its post state is used as the lookup result; if not, determining whether they contain a rule under which the initial state receives the current input character; if one exists, its post state is used as the lookup result; otherwise, the initial state is used as the lookup result.
  • The step of performing state caching according to a specific cache rule is: if the basic conversion rules contain a post state for the initial state receiving the current input character, that post state is cached; otherwise, the initial state is cached.
  • The step of looking up the post state includes: determining the type of the current state; if it is a converged state or a general state, the post state is looked up in the state transition rule set according to the current input character and the current state; if it is a detached state, the post state is looked up in the detached state transition rule set according to the current input character, the current state, and the cache state. The detached state transition rule set receives three inputs (the current input character, the current state, and the cache state) and provides one output accordingly: the post state. The step of caching according to a specific cache rule is: if the current state is a converged state, the current state is cached.
  • the present invention has significant advantages and advantageous effects over the prior art.
  • the multi-string matching method based on the cache state machine and the chip structure based on the "post-state lookup" have at least the following advantages and beneficial effects:
  • the performance of the matching is independent of the size of the rule base.
  • the performance of the matching is independent of the minimum length of the rule base.
  • The performance of the matching is independent of the relationship between the rule base and the text to be matched.
  • Large-scale rule sets are supported: storage space grows sublinearly with the number of rules, which effectively reduces space requirements and allows the conversion rules of the state machine to be stored and accessed efficiently.
  • Figure 1 The finite state automaton constructed in the existing multi-string matching scheme A.
  • Figure 2 A finite state automaton constructed according to the scheme of the prior art 1, in which different priorities are set for the conversion rules.
  • Figure 3 State machine model.
  • Figure 4 Cache state machine model.
  • Figure 5 A finite state automaton constructed according to scenario A, where the restart conversion rules and the failed conversion rules have been removed.
  • Figure 6 Cache state machine constructed to implement dynamic cross-conversion loading.
  • Figure 7 Flow chart of the dynamic cross-conversion loading method.
  • Figure 8 A finite state automaton constructed according to scheme A, with isomorphic paths.
  • Figure 9 The ideal framework for feature set ⁇ betters, pattern ⁇ optimization.
  • Figure 10 Isomorphic path merging based on the cache state machine.
  • Figure 11 Three states in the isomorphic path merge method.
  • Figure 12 Conversion function for three states in the isomorphic path merge method.
  • Figure 13 Two observations based on the post-state lookup structure.
  • Figure 14 Post-state lookup framework.
  • Figure 15 Detailed structure of the post-state lookup structure.
  • Figure 16 Input translation table (ITT) structure.
  • Figure 17 Schematic diagram of the consolidation of ITT entries.
  • Figure 19 One of the ITT table optimizations: Table item consolidation method.
  • ITT Table Optimization 2 2-way set associative ITT table structure.
  • ITT table optimization 2 N-way group association ITT table optimization method.
  • Figure 22 Chip structure ACC-NSA structure for implementing multi-string matching technology based on cache state machine.
  • Figure 24 Applying a dynamic cross-conversion loading method to eliminate the effect of a cross-conversion rule (Snort rule).
  • Figure 25 Effect diagram of applying the merged isomorphic path method to reduce the basic conversion rules.
  • DFA (Deterministic Finite Automaton): a deterministic finite state automaton. A representation of a DFA is shown in Figure 3.
  • Each DFA has a current state (held in the status register). Based on the input character and the current state, it selects the conversion rule that accepts that character and proceeds to the next state. When the next character arrives, the "next state" becomes the "current state".
  • Driven by input characters, a DFA performs state transitions based on the internal data structure shown in Figure 1. The main feature of a DFA is that its next state is determined only by the current state and the current input character.
  • The DFA and the NFA are simplified forms of the Turing machine model. For both the deterministic finite state automaton (DFA) and the nondeterministic finite state automaton (NFA), the next state is determined only by the current state and the current input, as shown in Figure 3. An NFA can be converted to an equivalent DFA.
  • a finite state set, denoted K, which is the set of all states;
  • an alphabet, denoted Σ, which is the set of characters received by the state machine;
  • an accepting state set, denoted F;
  • the accepting state set is a subset of the finite state set;
  • a state transition function δ, a binary function that determines the next state from the current state of the state machine and the received character.
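The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `DFA`, the field names, the choice of state 0 as the initial state, and the fallback to the initial state when no rule exists are assumptions made for the example, not part of the source text.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: set      # K: finite state set
    alphabet: set    # Σ: characters the machine receives
    start: int       # initial state (assumed to be 0 in the toy below)
    accepting: set   # F, a subset of K
    delta: dict      # transition function stored as {(state, char): next_state}

    def run(self, text):
        """Return the positions at which an accepting state is reached."""
        state = self.start
        hits = []
        for pos, ch in enumerate(text):
            # The next state depends only on the current state and the current character.
            state = self.delta.get((state, ch), self.start)
            if state in self.accepting:
                hits.append(pos)
        return hits


# Toy machine that recognises the string "ab".
toy = DFA(
    states={0, 1, 2},
    alphabet={"a", "b"},
    start=0,
    accepting={2},
    delta={(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "a"): 1},
)
```

Calling `toy.run("abab")` walks the table once per character and records the two positions where "ab" ends.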
  • The CDFA (Cached Deterministic Finite Automata) is proposed by the present invention; one representation of it is shown in the drawings. Compared with a DFA, a CDFA holds a cached state (in the state buffer) in addition to the current state. In the cache state machine, the next state is determined by the current state, the current input character, and the cached state. The next cached state is determined by an internal mechanism of the cache state machine; no external input is required, and the mechanism can be customized flexibly to specific needs.
  • The cache state machine breaks the traditional state machine's constraint that "the next state is determined only by the current state and the current input". By recording history information, it enriches the operations available to the state machine when determining the post state.
  • The cache state machine achieves this design goal by adding a state cache function to the state machine. Seen from the external interface, the cache state machine, like a traditional state machine, only receives input characters and outputs the state machine's judgment result. The difference is that a state buffer (cache) is added internally so that states can be cached according to a given policy.
  • The cache state machine can be defined as a seven-tuple, whose elements include:
  • a finite state set, denoted K, that is, the set of all states;
  • an alphabet, denoted Σ, that is, the set of characters received by the state machine;
  • N, the number of caches contained in the state machine;
  • a cache policy function, which determines the state to be cached according to the current state and the current input;
  • a state transition function δ, which determines the next state according to the current state, the cached state, and the input character.
  • The new state machine model is named the cache state machine model.
  • The cache policy function can remember history information that the state machine has experienced, and can also "remember" other state information in a specified way.
  • The structure of the cache state machine includes:
  • a status register, used to hold the current state;
  • a cache status register, used to hold the cached state;
  • the number of states that the cache status register can hold is N, N ≥ 1;
  • a conversion rule module, used to store the state conversion rule base and, according to the character received by the interface module, the current state held in the status register, and the cached state held in the cache status register, to look up the next state;
  • an interface module, used to receive input characters;
  • a control module, used to control the interface module so that it receives input characters normally, to control the status register to update the current state, to control the cache status register to update the cached state, and to control the conversion rule module to look up the next state.
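A minimal sketch of how these modules fit together, assuming a single cache (N = 1). The class and parameter names are hypothetical, and the toy machine at the end is invented purely to show that the next state can depend on cached history.

```python
class CacheStateMachine:
    """Status register, cache status register, and rule lookup, with N = 1 cache."""

    def __init__(self, start, transition, cache_policy):
        self.state = start                # status register: current state
        self.cache = start                # cache status register: cached state
        self.transition = transition      # (state, cached, char) -> next state
        self.cache_policy = cache_policy  # (state, char) -> state to cache

    def step(self, ch):
        # Control module: look up the next state, then update both registers.
        nxt = self.transition(self.state, self.cache, ch)
        self.cache = self.cache_policy(self.state, ch)
        self.state = nxt
        return nxt


# Toy machine: "+" moves forward, any other character returns to the cached state.
machine = CacheStateMachine(
    start=0,
    transition=lambda s, k, ch: s + 1 if ch == "+" else k,
    cache_policy=lambda s, ch: s,  # remember the state being left
)
trace = [machine.step(ch) for ch in "++u"]
```

The transition is evaluated with the old cached state before the cache is refreshed, which mirrors the description that the next state depends on the current state, the input character, and the cached state.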
  • The prioritized approach used in paper 1 keeps the number of restart conversion rules and failure conversion rules within 256. In the present invention, both types of conversion rules can be handled in this way.
  • The invention uses the principle of the cache state machine mainly to eliminate nearly all cross-conversion rules, thereby solving the space explosion problem.
  • Using the principle of the cache state machine, the present invention can also reduce the number of basic conversion rules, so that storage space grows sub-linearly with the number of rules.
  • the implementation is as follows.
  • Method 1, "dynamic cross-conversion loading", applies the principle of the cache state machine to eliminate more than 95%, or even all, of the cross-conversion rules. This method is named ACC.
  • The cross-conversion rules are eliminated and replaced by one cache space.
  • Any cross-conversion rule arises as follows: in the current state S3, the received character causes a switch to state S4 and, at the same time, opens another path from the initial state S0 (the path on which S6 lies).
  • If the next input character does not satisfy the basic conversion rule of the current path (the rule that converts state S4 to state S5) but does satisfy the basic conversion rule of the other path (in state S6, the input character i causes a transition to state S7), a cross-conversion rule is generated: if the next input character is i, the state jumps from S4 to S7.
  • The cache state machine operates as follows. If the current state is S3 and the current character is received, then, by the principle of cross-conversion rule generation, S6 is cached; at the same time, the state machine enters the next state S4. In state S4, if the received character is i, the next state is determined by the current state (S4), the input character (i), and the cached state (S6): because S4 does not accept the character i on its basic conversion rule path while S6 does, S7 is determined to be the next state.
  • Dynamic cross-conversion loading uses the CDFA principle to generate dynamically the cross-conversion rules that a DFA would store explicitly, thereby greatly reducing the number of stored conversion rules.
  • The state transition function δ can be divided into the following two categories:
  • δ_ncross is the state transition function of the n-step cross-conversion rules, δ_ncross: K × Σ → K; the definition of δ_ncross in the ACC method is the same as in scheme A.
  • The state transition function δ is defined as
  • where priority is the priority identifier,
  • A is the highest priority, and
  • D is the lowest priority. If the higher-priority result is valid (not empty), that result is taken first; if the higher-priority result is invalid, the lower-priority result is adopted.
  • An invalid result means that, for a state Si, neither the δ_basic nor the δ_ncross conversion function contains a rule that accepts the character c.
  • The meaning of the state transition function δ is as follows. For a state transition of the CDFA in ACC, first determine whether the current state Si has, among the basic conversion rules and the n-step cross-conversion rules, a rule that accepts the current character c. If so, apply that rule and jump to the next state. If not, take out the cached state Sk and, treating Sk as the current state, search the basic conversion rules and the n-step cross-conversion rules for a rule that accepts c; if one exists, jump to the corresponding next state. If there is still no matching rule, determine whether the initial state S0 accepts the character c: if it does, jump to the corresponding state; otherwise jump to the initial state S0.
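Read from the prose description alone, the prioritized definition could be written as follows. This is a reconstruction rather than the source's own formula: the assignment of the labels A to D to the four cases and the notation $\delta_{basic}$ for the basic conversion rules are assumptions.

```latex
\delta(S_i, c, S_k) =
\begin{cases}
\delta_{basic}(S_i, c)\ \text{or}\ \delta_{ncross}(S_i, c) & \text{priority A: a rule of the current state accepts } c \\
\delta_{basic}(S_k, c)\ \text{or}\ \delta_{ncross}(S_k, c) & \text{priority B: a rule of the cached state accepts } c \\
\delta_{basic}(S_0, c) & \text{priority C: the initial state accepts } c \\
S_0 & \text{priority D: otherwise}
\end{cases}
```

Each case is used only when every case above it yields no valid result.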
  • the cache policy function is defined as
  • The meaning of the cache policy function is as follows. The CDFA in ACC has a single cache space, which is updated in every cycle. The cached content is the next state that the initial state S0 reaches on the current input character c; if the conversion rules contain no such rule, the initial state S0 itself is cached. As can be seen, the cache policy function is independent of the current state Si.
  • the ACC method is based on the above cache state machine.
  • the method mainly consists of two steps: preprocessing and matching.
  • the work in the preprocessing stage is to read in the feature set and construct the cache state machine;
  • the job of the matching phase is to read in the text to be matched, perform state machine conversion, and report the match in a specific state.
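The two phases can be sketched as follows. This is a simplified illustration, not the patented implementation: the function names are hypothetical, only basic conversion rules are stored (stored n-step cross-conversion rules for n > 1 are left out), and a pattern that ends in the middle of a longer path is not reported.

```python
def build_basic_rules(patterns):
    """Preprocessing: build the trie of basic conversion rules for the feature set."""
    rules = {}       # (state, char) -> next state
    accepting = {}   # state -> pattern matched in that state
    next_id = 1      # state 0 is the initial state S0
    for pat in patterns:
        state = 0
        for ch in pat:
            if (state, ch) not in rules:
                rules[(state, ch)] = next_id
                next_id += 1
            state = rules[(state, ch)]
        accepting[state] = pat
    return rules, accepting


def acc_match(patterns, text):
    """Matching: run the cache state machine, return (end_position, pattern) pairs."""
    rules, accepting = build_basic_rules(patterns)
    state, cached = 0, 0
    matches = []
    for pos, ch in enumerate(text):
        if (state, ch) in rules:        # a rule of the current state accepts ch
            nxt = rules[(state, ch)]
        elif (cached, ch) in rules:     # a rule of the cached state accepts ch
            nxt = rules[(cached, ch)]
        else:                           # restart from S0, or fall back to S0
            nxt = rules.get((0, ch), 0)
        cached = rules.get((0, ch), 0)  # cache policy: next state of S0 on ch
        state = nxt
        if state in accepting:
            matches.append((pos, accepting[state]))
    return matches
```

With the feature set {"abc", "bd"} and the text "abd", the machine is in the state for "ab" when "d" arrives. That state has no rule for "d", but the cached state (the one reached from S0 on "b") does, so "bd" is reported without any stored cross-conversion rule.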
  • The idea of the ACS method is to merge the isomorphic paths in the state machine in order to reduce the number of states and basic conversion rules.
  • The ACS method uses the cache state machine model.
  • Because the cache state machine can remember state transition history, isomorphic paths can be merged while the correctness of matching is preserved. Taking the feature set {pattern, betters} as an example, the isomorphic path merged using the idea of the cache state machine is shown in Figure 10.
  • The idea of merging isomorphic paths based on the cache state machine is to store the source state of the path dynamically when entering the merged path (in Figure 10, S8 or S1 is stored in the cache of the cache state machine). When the received characters drive the state machine to the position where the isomorphic path separates (state S6), the cached state is taken out and used to decide, according to the source of the path, which state to jump to. For example, if the input text is "patters", S1 is cached at the beginning of the isomorphic path; when it is taken out in state S6, the path is found not to originate from S8, so even though the input character is "s" the machine does not jump to state S9.
  • Each state in the CDFA corresponds to one color, and the CDFA contains three colors. The color is used to distinguish three different states in the merge process of the isomorphic path, as shown in Figure 11.
  • Converging states (yellow, meshed): a converging state is the last state before entering the isomorphic path and represents the history of the state machine before the isomorphic path. This state triggers the caching of itself. Together these states form the converging state set.
  • The state transition function δ of the cache state machine CDFA can be divided into the following two categories. For converging states and general states, δ is a binary function, δ: K × Σ → K, and its definition in the ACS method is the same as in scheme A.
  • For separation states, δ is a ternary function
  • whose inputs in the ACS method are the current state, the cached state, and the current character.
  • The state transition function δ is defined as
  • A transition rule of this state transition function differs from a traditional transition rule: it has three inputs and one output, and the three inputs include the converging state that was the source before the isomorphic paths were merged, as shown in Figure 12.
  • For converging states and general states, the conversion rule set (two inputs) is searched according to the current input and the current state to obtain the next state.
  • For separation states, in addition to the current input and the current state, the state held in the cache is also needed: the separation state transition rule set (three inputs, distinct from the conversion rule set) is searched to obtain the next state.
  • The cache policy function is defined as
  • The meaning of the cache policy function is as follows. The CDFA in ACS has a single cache space. When the current state is a converging state, that state is written to the cache space; in all other cases the cache space is left unchanged.
  • In each step, the type of the current state is determined first, and the corresponding action is then performed. If it is a converging state, the next state is obtained by searching the conversion rule set according to the current input and the current state, and the current state is written to the cache space. If it is a general state, the next state is obtained by searching the conversion rule set according to the current input and the current state. If it is a separation state, the next state is obtained by searching the separation state transition rule set according to the current input, the current state, and the cached state.
  • The merged CDFA removes 5 states and 4 basic conversion rules, so space is saved further.
  • The overhead is the storage space for one state, used as the cache.
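The three kinds of state can be illustrated with hand-built tables for the feature set {pattern, betters}, whose shared substring "tter" forms the merged path. This is a sketch under assumptions: the state numbers are chosen for the example and do not match Figure 10, the table and function names are hypothetical, and a failed lookup simply restarts from the initial state.

```python
# Two-input conversion rules: (state, char) -> next state.
BASIC_RULES = {
    (0, "p"): 1, (1, "a"): 2,                            # "pa", state 2 is converging
    (0, "b"): 3, (3, "e"): 4,                            # "be", state 4 is converging
    (2, "t"): 5, (4, "t"): 5,                            # both sources enter the merged path
    (5, "t"): 6, (6, "e"): 7, (7, "r"): 8,               # merged path "tter"
}
# Three-input rules for the separation state: (state, cached, char) -> next state.
SEPARATION_RULES = {(8, 2, "n"): 9, (8, 4, "s"): 10}
CONVERGING = {2, 4}
SEPARATION = {8}
ACCEPTING = {9: "pattern", 10: "betters"}


def acs_step(state, cached, ch):
    """One transition; returns (next_state, new_cached_state)."""
    if state in SEPARATION:
        nxt = SEPARATION_RULES.get((state, cached, ch))
    else:
        nxt = BASIC_RULES.get((state, ch))
        if state in CONVERGING:
            cached = state                     # cache policy: remember the path source
    if nxt is None:
        nxt = BASIC_RULES.get((0, ch), 0)      # simplified fallback to the initial state
    return nxt, cached


def acs_match(text):
    state, cached, found = 0, 0, []
    for pos, ch in enumerate(text):
        state, cached = acs_step(state, cached, ch)
        if state in ACCEPTING:
            found.append((pos, ACCEPTING[state]))
    return found
```

For the input "patters" the cached source is the "pa" state, so the separation state finds no rule for "s" and nothing is reported, which is the behaviour the text describes.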
  • a regular expression is a string consisting of a series of special characters.
  • For details of regular expressions, refer to related materials.
  • The traditional AC algorithm can solve the multi-regular-expression matching problem by converting the regular expressions into a DFA and matching with the DFA.
  • Similarly, the regular expressions can be converted into a CDFA, and the CDFA used to receive input characters for matching.
  • The specific matching methods include eliminating 1-step cross-conversion rules, merging isomorphic paths, and so on.
  • The technical difficulty of the hardware implementation is how to store the conversion rule base in memory efficiently and how to locate a conversion rule Tr efficiently.
  • In a conversion rule Tr, Si is referred to as the "input state", c as the "input character", and the state the rule leads to as the "output state".
  • Linear trie structures: there are a large number of linear trie structures in the state machine, especially in the cache state machine generated by scheme A.
  • A "linear trie" means that each state in it has only one conversion rule pointing to a next state, so that the states form a linear one-dimensional structure. Because there are many linear tries, state numbers can be assigned incrementally along them. The number of the next state can therefore be calculated from the current state, that is, the post state can be predicted.
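One way to obtain such incremental numbering is to number the trie states in depth-first order, so that the first child of a state always receives the parent's number plus one. The sketch below assumes this numbering scheme; the source does not spell out how numbers are assigned, and the function name is hypothetical.

```python
def number_states(patterns):
    """Build a trie and number its states depth-first; returns {(state, char): next_state}."""
    trie = {}
    for pat in patterns:
        node = trie
        for ch in pat:
            node = node.setdefault(ch, {})

    rules = {}
    counter = 0

    def visit(node, state):
        nonlocal counter
        for ch, child in node.items():
            counter += 1                  # next unused state number
            child_state = counter
            rules[(state, ch)] = child_state
            visit(child, child_state)     # number the whole subtree before the next sibling

    visit(trie, 0)
    return rules


chain = number_states(["abcd"])
branch = number_states(["he", "hi"])
```

Along the single chain every post state equals the current state plus one, so no table lookup is needed. In the branching case only the second child of state 1 has a difference other than one.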
  • For any state, the character it accepts is fixed, regardless of the type of conversion rule that leads into it. For example, state S7 is the output state of both a basic conversion rule and a cross-conversion rule, and the character accepted by the state is "i" in either case. Therefore, once a post state (that is, an output state) has been obtained, the character it accepts is uniquely determined, and comparing that character with the actual input character verifies whether the calculated post state is the real post state.
  • The post-state lookup structure uses a "predict and verify" approach, as shown in Figure 14.
  • A possible post state is calculated through the Input Translation Table (ITT), or is calculated directly, and the post state is then used as an address to index the rule storage table and obtain the character accepted by that state.
  • The rule storage table can be held in inexpensive memory such as SRAM or DDR; the conversion rules are packed compactly in the memory, with no gaps.
  • The effectiveness of the post-state lookup comes from the use and optimization of the ITT table. By observation 1, since the state machine contains a large number of linear tries, the post state of each state in a linear trie can be obtained by simple incrementing, without looking up the ITT table. Only the small number of states with multiple output conversion rules need to access the ITT table to get the difference between state numbers. In addition, optimizing the ITT table further reduces the storage space used.
  • The detailed design of the NSA structure is divided into two parts. The first is how the conversion rules are stored in the input translation table (ITT) and the rule storage tables; the second is the design of the access path to the conversion rules.
  • The overall structure of the NSA is shown in Figure 15. It includes the main storage space for conversion rules, "TRM-1" (Transition Rule Memory-1), and the storage space "TRM-0" (Transition Rule Memory-0), which handles the failure conversion rules and the restart conversion rules.
  • A selector MUX chooses between the output value obtained by accessing the ITT table (that is, the difference between state numbers) and the constant 1, according to the value of the color register. If the color register value is 0, the current state is considered to have no color: the MUX outputs 1, the current state number is incremented by 1 to obtain the post state number, and that post state is used to access TRM-1 to obtain the corresponding value.
  • The corresponding value includes the color of the next state and a character.
  • If the color register value is not 0, the current state is considered to have a color: the color and the current input character are used together to access the ITT table, and the MUX selects the output value obtained from the ITT table. The post state number is obtained by adding this difference to the current state number, and the post state is used to access TRM-1 to obtain the corresponding value.
  • At the same time, the input character is used to access TRM-0 to obtain an output value, which includes a next state and a color value.
  • The character output by TRM-1 is compared with the current input character in a comparator CMP, and a two-way selector acts on the comparison result. If the characters are equal, the color register is overwritten with the color of the next state output by TRM-1, and the status register is overwritten with the calculated address used to access TRM-1 (that is, the post state), completing the state transition when verification succeeds. Otherwise, the status register is overwritten with the state output by TRM-0 and the color register with the color output by TRM-0, which resets the machine when verification fails.
  • With a priority policy, the failure conversion rules and restart conversion rules can be reduced to at most 256 rules.
  • TRM-0 is indexed with the input character as the address; that is, according to the input character it outputs either the initial state S0 or a post state of the initial state.
  • TRM-0 stores these two types of conversion rules using character addressing. For each input character, if the initial state has a conversion rule for that character, the post state of the initial state is stored at the corresponding position; if there is no such rule, the initial state is stored there. Since there are at most 256 input characters, TRM-0 contains 256 entries.
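A character-addressed table of this kind could be filled as in the sketch below. The function name, the dictionary inputs, and the choice to store a color next to each state (defaulting to 0) are assumptions made for illustration.

```python
def build_trm0(root_rules, colors=None):
    """
    root_rules: {byte value: post state of the initial state S0}.
    colors:     {state: color}; states not listed are treated as color 0.
    Returns a 256-entry list of (state, color), indexed by input byte.
    """
    colors = colors or {}
    table = []
    for byte in range(256):
        state = root_rules.get(byte, 0)   # no rule for this character: store S0
        table.append((state, colors.get(state, 0)))
    return table


# S0 has rules for "h" (to state 1, which has color 1) and "s" (to state 3).
trm0 = build_trm0({ord("h"): 1, ord("s"): 3}, colors={1: 1})
```

Because the table is addressed directly by the byte value, a lookup needs no search or comparison.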
  • The character accepted by each state is stored in the main conversion rule memory TRM-1, ordered by state number. This part of the space is compact.
  • Each state can be assigned any color.
  • The input translation table is accessed using the color as an index.
  • Suppose the current state Si in the state machine is the input state of k conversion rules, that is, there are k characters that cause it to jump to a new state (failure conversion rules and restart conversion rules are not considered here).
  • For example, suppose state Si and state Sk each have two output conversion rules. To make the next state predictable, each of the two states is associated with a new row of the ITT table, and different colors are used to index the ITT table.
  • Each color corresponds to 256 values; each value is the state number difference for the case where state Si receives the character of the corresponding column and jumps to a new state.
  • For example, if state Si receives the character 0x01 and jumps to state Sk, the 0x01 column of the ITT row for color 1 stores the difference between state Sk and state Si: k - i. A value of 0 represents null.
  • If the color of the current state is white (no color), the possible post state is S(i+1); if the color is not white, the ITT table is accessed with the color and the current input to obtain the state difference, from which the post state is calculated.
  • Although the calculation of the post state uses the current state and possibly the current input character, this is not sufficient to determine the post state. It is therefore necessary to compare the character read out for the calculated post state with the current input character c. If the two characters are the same, the calculated post state is the real post state and the machine jumps to it. If they differ, the machine jumps to the state obtained from the TRM-0 access, that is, it applies a failure conversion rule or a restart conversion rule.
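The predict-and-verify step can be sketched with small hand-built tables for the feature set {"he", "hi"}. Everything here is illustrative: the table contents and state numbers are invented, a null ITT value is represented by a missing key rather than 0, and leaf states are given a color with an empty ITT row so that the "+1" prediction is never applied to them. The source does not say how leaf states are handled, so that last point is purely a choice of this sketch.

```python
# TRM-1, indexed by state number: (character accepted on entry, color of the state).
TRM1 = {1: ("h", 1), 2: ("e", 2), 3: ("i", 2)}
# ITT, indexed by color and then by character: state-number difference.
ITT = {1: {"e": 1, "i": 2}, 2: {}}
# TRM-0, indexed by character: (state, color) from restart and failure rules.
TRM0 = {"h": (1, 1)}


def next_state(state, color, ch):
    """Return (next_state, next_color) for one input character."""
    # Predict: colorless states increment by 1, colored states consult the ITT.
    delta = 1 if color == 0 else ITT[color].get(ch)
    if delta is not None:
        candidate = state + delta
        entry = TRM1.get(candidate)
        # Verify: the candidate must accept exactly the character that was read.
        if entry is not None and entry[0] == ch:
            return candidate, entry[1]
    # Verification failed: use the character-addressed fallback table.
    return TRM0.get(ch, (0, 0))


def run(text):
    state, color, trace = 0, 0, []
    for ch in text:
        state, color = next_state(state, color, ch)
        trace.append(state)
    return trace
```

For the input "hi", the first step predicts state 1 by incrementing and verifies it against "h"; the second step reads the difference 2 from the ITT row of color 1 and verifies state 3 against "i".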
  • Each state with multiple output conversion rules is assigned a new color, that is, a part of the ITT table is allocated to it as the basis for calculating its post state. Note that most colors have only a few post states, so each row of the ITT table contains a large number of null values (0). To use the ITT table space effectively, an optimization method for the ITT table is given here: table entry merging.
  • Merging ITT table entries means combining multiple rows of the ITT table into one so as to use the space effectively. A further effect of merging is that states in the state machine come to share colors, so fewer colors are needed.
  • Figure 17 shows the merging of entries in the ITT table.
  • The state machine on the left uses 4 colors; after merging, the state machine on the right uses only 2 colors.
  • A resource conflict means that the values in the same column of two ITT table entries are both non-null and different, as with color 2 and color 4 in Figure 18.
  • A coverage conflict concerns the case where a non-null value in a column of one ITT table entry covers a null value in the other: this adds an extra (virtual) conversion rule to the original state. It must be ensured that the added rule does not conflict with the original conversion rules, that is, the post state obtained through the virtual conversion rule must not accept the character of that column.
  • ITT table entries can be merged only when neither kind of conflict occurs.
  • The related method is shown in Figure 19.
  • Figure 19 shows how to judge whether two ITT table entries can be merged.
  • The method examines each column of the two rows to be merged.
  • The kth column is judged as follows. If exactly one of the two values is null, check whether, after merging, the state reached through the non-null value from the state of the null entry accepts the character k. If it does, there is a coverage conflict: the two rows cannot be merged and the procedure exits; if not, the next column is judged. If both values are null or both are non-null, check whether the two values are the same. If they are not, there is a resource conflict: the two rows cannot be merged and the procedure exits; if they are, the next column is judged. Once every column of the two rows has been found free of resource conflicts and coverage conflicts, the rows are merged, with non-null values covering null values.
  • Rows are judged pairwise. As shown in Figure 18, color 2 is first checked for merging with color 1, then color 3 with color 1, and so on, until all colors that can be merged have been merged.
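The column-by-column check can be sketched as below. The data layout is an assumption: each ITT row is a dictionary from column character to state difference (a missing key stands for null), `states_x` lists the states that use a row's color, and `accepted_char` plays the role of TRM-1 by giving the character each state accepts. The function names are hypothetical.

```python
def can_merge(row_a, states_a, row_b, states_b, accepted_char):
    """Return True if the two ITT rows have no resource or coverage conflict."""
    for col in set(row_a) | set(row_b):
        a, b = row_a.get(col), row_b.get(col)
        if a is not None and b is not None:
            if a != b:
                return False              # resource conflict: two different values
            continue
        # Exactly one value is null: the other becomes a virtual rule
        # for every state that uses the row with the null value.
        value, affected = (b, states_a) if a is None else (a, states_b)
        for state in affected:
            if accepted_char.get(state + value) == col:
                return False              # coverage conflict: false verification
    return True


def merge_rows(row_a, row_b):
    """Merge two conflict-free rows; non-null values cover null values."""
    return {**row_a, **row_b}


accepted = {2: "e", 4: "x", 6: "y", 8: "i"}
ok = can_merge({"e": 1}, [1], {"i": 3}, [5], accepted)
resource = can_merge({"e": 1}, [1], {"e": 2}, [5], accepted)
coverage = can_merge({"e": 1}, [1], {"i": 3}, [5], {**accepted, 4: "i"})
```

In the last call, state 1 would inherit the virtual rule "i: +3" and reach state 4, which accepts "i"; verification would then succeed for a transition that does not exist, so the merge is rejected.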
  • The set-associative optimization strategy is similar to the set-associative strategy used by caches in computer storage systems.
  • The idea is to use set association to break the column boundaries of the ITT table, so that data belonging to one column can be stored in a different column.
  • The 2-way set-associative ITT table structure is shown in Figure 20. With this structure, color 4 in Figure 18 can be merged with color 2.
  • With N-way set association, the columns of the ITT table are divided into 256/N groups.
  • The method for judging whether two ITT table entries can be merged using the set-associative strategy is shown in Figure 21.
  • Two ITT table entries can be merged under set association if and only if no group has a conflict. The only conflict here is a resource conflict, that is, a group that would contain more than N non-null elements.
  • The method in Figure 21 determines whether two ITT table entries conflict. For a group p, count the valid values that the two rows contain in that group. If the count is greater than N, the group has a resource conflict, so the two rows cannot be merged. Otherwise the next group is judged; once all 256/N groups have been found free of resource conflicts, the two rows can be merged.
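The group-by-group count can be sketched as follows. It assumes that a group consists of N consecutive columns and that rows are dictionaries from column index to state difference with missing keys standing for null; neither detail is fixed by the text, and the function name is hypothetical.

```python
def group_conflict_free(row_a, row_b, n_ways, columns=256):
    """Return True if no group of n_ways columns holds more than n_ways values."""
    for group_start in range(0, columns, n_ways):
        group = range(group_start, group_start + n_ways)
        used = sum(1 for col in group if col in row_a)
        used += sum(1 for col in group if col in row_b)
        if used > n_ways:
            return False      # resource conflict in this group
    return True


# Both rows use column 0, which blocks a plain merge, but a 2-way group has room for both.
fits = group_conflict_free({0: 1}, {0: 2}, n_ways=2)
# Three values compete for the two slots of the first group.
overflow = group_conflict_free({0: 1, 1: 2}, {0: 3}, n_ways=2)
```

The first case shows why set association allows merges that column-aligned merging rejects: the second value can occupy the neighbouring slot of the same group.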
  • The set-associative strategy for the ITT table requires a tag to be added to each stored value so that the contents can be told apart.
  • The tag here needs two fields: an input tag field and a color tag field. These two fields distinguish the different rows and columns that the values occupied before the merge.
  • The NSA is an efficient hardware implementation of a state machine. Its efficiency comes from exact memory accesses, the absence of conflicts that must be resolved at lookup time, and the use of inexpensive memories such as SRAM and DDR. Although the ITT table of the NSA leaves some gaps in storage, these can be kept small by table entry merging and the set-associative optimization.
  • Corresponding chip structure: to implement multi-string matching based on the cache state machine at high speed, the present invention designs a corresponding chip structure, whose overall structure is shown in the drawings.
  • The structure is a feature matching structure that combines the ACC method and the NSA structure for string matching.
  • The ACC-NSA structure of Figure 22 includes a conversion rule module, a status register, and a cache status register module.
  • The NSA structure implements a state machine efficiently in hardware, and the ACC method is based on the principle of the cache state machine.
  • The main problem solved by the combined ACC-NSA structure is how to use the NSA structure to implement the cache state machine of the ACC method.
  • Figure 22 shows the framework of the ACC-NSA structure. It builds on the post-state lookup structure shown in Figure 15 and adds the paths for the "state buffer" and the "color buffer". The two sets of paths share one ITT table memory and one TRM-1 memory.
  • The ITT table and TRM-1 are implemented in dual-port memory, which supports parallel access from the registers and the cache. (If parallel access is not required, single-port memory can also be used.)
  • TriMUX: three-state gate (a three-way selector)
  • (state, color, "1") represents the post state value and its color calculated using the state in the register
  • (state, color, "2") represents the post state value and its color calculated using the state in the cache
  • middle_priority means: if the character obtained by accessing TRM-1 with the post state calculated from the cached state is consistent with the actual input character, and the condition for the TriMUX to output the post state value and color calculated from the register state is not satisfied, the output is the post state value and color calculated from the cached state. This input has medium priority. If neither condition is met, the third input is selected, that is, the post state value and color output by TRM-0, obtained by applying the failure and restart conversion rules.
  • The registers (status register and color register) and the cache (state buffer and color buffer) both access the ITT table, calculate their possible post state values, and access TRM-1 to read the character of the corresponding conversion rule.
  • At the same time, TRM-0 is accessed to obtain the state value given by the failure conversion rules and restart conversion rules.
  • The three results are fed into the TriMUX module.
  • In the TriMUX module, comparison with the input character determines which predicted transition really occurs, and the TriMUX module selects the correct result to be written into the registers.
  • The cache is updated with the result of TRM-0. This completes one state transition.
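The three-way selection can be sketched as a small function. The tuple layout and the function name are assumptions: each predicted result carries the candidate post state, its color, and the character read from TRM-1 for that candidate, while the TRM-0 result is just a state and a color.

```python
def trimux(register_result, cache_result, trm0_result, ch):
    """
    register_result, cache_result: (post_state, color, accepted_char) or None.
    trm0_result: (state, color) from the failure and restart rules.
    Returns the (state, color) pair to write into the registers.
    """
    # High priority: prediction from the register path; medium: from the cache path.
    for candidate in (register_result, cache_result):
        if candidate is not None and candidate[2] == ch:
            return candidate[0], candidate[1]
    # Low priority: neither prediction verified, use the TRM-0 output.
    return trm0_result


high = trimux((5, 0, "e"), (7, 1, "i"), (0, 0), "e")
medium = trimux((5, 0, "e"), (7, 1, "i"), (0, 0), "i")
low = trimux((5, 0, "e"), (7, 1, "i"), (0, 0), "x")
```

The ordering of the loop encodes the priorities: the cache path is consulted only when the register path fails verification, and TRM-0 only when both fail.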
  • The one-step cross-conversion rules can be eliminated by the method of the present invention. The multi-string matching method based on cache state machine "dynamic cross-conversion loading" reduces the space to 4.1% (ClamAV rules) and 20.8% (Snort rules) of the original.
  • The dotted line shows the data for the traditional method using a DFA.
  • The solid line shows the data for the method using the cache state machine CDFA. The CDFA-based method of the present invention reduces the number of basic conversion rules by up to 21.4% (Snort rules).
  • The ACC-NSA chip structure achieves a maximum matching speed of 11.7 Gbps (in a 0.18 micron process), which is faster than other methods.
  • the multi-string matching method based on the cache state machine and the chip structure based on the "post-state lookup" have at least the following advantages and beneficial effects:
  • the performance of the matching is independent of the size of the rule base.
  • the performance of the matching is independent of the minimum length of the rule base.
  • The performance of the matching is independent of the relationship between the rule base and the text to be matched. The method supports large-scale rule sets; storage space grows sub-linearly with the number of rules, which effectively reduces the space requirement; and the conversion rules in the state machine can be stored and accessed efficiently.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method for matching multi-character strings based on a cache state machine as well as a chip structure for matching multi-character strings are disclosed, wherein the chip structure is realized by a method and structure for searching a next state. The method for matching multi-character strings searches the next state in a state transformation rule database based on input characters, a current state and cached states then jumps, and caches the state by specific cache rules. In the chip structure for matching multi-character strings, a main memory which includes basic transformation rules and n step cross transformation rules and an input translation table are shared by two paths of a state register, a color register and a state cache, a color cache, so as to calculate the possible next state and acquire corresponding input characters. An auxiliary memory, which stores fail and restart transformation rules, is used to acquire the next state corresponding to the actual input and updates the state cache and the color cache. Tri-state selector implements multi-way selection of the next state based on the actually input character and the character corresponding to the possible next state, so as to update the state registers and color registers.

Description

Multi-string matching method and chip structure

Technical field

The present invention relates to a method and a chip structure for information processing, and in particular to a multi-string matching method and chip structure.

Background
Multi-string matching, also called multi-keyword matching, is a mature technology that is widely used in fields such as text processing and content filtering. It finds one or more strings from a predefined set within one-dimensional content to be matched. The string set is preprocessed to exploit its characteristics, and the content is then matched against the resulting intermediate data structure, so that all the predefined strings are matched in parallel.
In network security, a class of content-based security applications relies on multi-string matching; typical examples are network intrusion detection and prevention systems, spam filtering, virus scanning and filtering, malicious code scanning and filtering, and content filtering. Such applications typically capture packets from the network, restore them to data at a particular network layer, and match that data against a predefined rule set (for example intrusion rules, virus rules or spam rules). In the vast majority of cases this matching is done by multi-string matching.
Network bandwidth is growing rapidly, and high-performance multi-string matching is urgently needed to serve content security applications at gigabit and higher rates. Many improved software algorithms have been proposed to raise matching performance, but the gains remain limited, typically 20% to 40% over the traditional algorithms. Software implementations of these existing algorithms alone can no longer meet the performance that real systems require.
At the same time, as malicious code proliferates in network security applications, the number of rules in predefined rule sets is growing quickly. Intrusion detection rule bases currently contain more than 5,000 rules, and virus rule bases more than 200,000. Besides higher matching performance, a matching technique must therefore also handle large-scale rule bases (here, rule bases with more than 50,000 rules). Traditional algorithms can support multi-string matching against fairly large rule bases, but a large rule base degrades their matching performance so markedly that they are generally impractical.
Among practical multi-string matching techniques, one class of schemes (called Scheme A below) is favored for the following properties: matching performance is independent of the size of the rule base, independent of the minimum length of the rules in the rule base, and independent of the relationship between the rule base and the text to be matched.
For example, to match the string set P = {SEC, SSH}, Scheme A preprocesses P and constructs a finite state automaton (DFA) for it, as shown in Figure 1, where circles denote states and lines denote conversion rules.
With the finite state automaton as the intermediate structure, the one-dimensional text to be matched (for example SSSIG) is read one character at a time, and each character advances the automaton by one step according to the conversion rules. Reaching state S3 or S5 reports a valid match.
These properties of the DFA are what give Scheme A the advantages listed above. Scheme A nevertheless has an obvious drawback. Even for a rule set as simple as P = {SEC, SSH}, its intermediate structure needs 6 states and 16 conversion rules, and the structure grows rapidly as rules are added to the rule set. This space explosion severely limits Scheme A in practice.
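The one-character-per-step matching of Scheme A can be sketched as follows. This is a minimal illustration only: it names states by pattern prefixes rather than by the S0 to S5 labels of Figure 1, restricts the alphabet to the characters that occur in the patterns, and the function names `build_dfa` and `match` are assumptions made for the example.

```python
def build_dfa(patterns):
    """Build a DFA whose states are the prefixes of the patterns (assumed naming)."""
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    alphabet = {ch for p in patterns for ch in p}

    def step(state, ch):
        # Longest suffix of state+ch that is still a prefix of some pattern.
        candidate = state + ch
        while candidate not in prefixes:
            candidate = candidate[1:]
        return candidate

    delta = {(s, ch): step(s, ch) for s in prefixes for ch in alphabet}
    return prefixes, delta


def match(text, patterns):
    """Return (end_position, pattern) for every occurrence found in text."""
    _, delta = build_dfa(patterns)
    state, hits = "", []
    for pos, ch in enumerate(text):
        # Characters outside the pattern alphabet fall back to the initial state.
        state = delta.get((state, ch), "")
        hits.extend((pos, p) for p in patterns if state.endswith(p))
    return hits


if __name__ == "__main__":
    print(match("XSSHSEC", ["SEC", "SSH"]))
```

For P = {SEC, SSH} this construction yields six states, in line with the count given above; each input character costs exactly one table lookup regardless of the number of patterns.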
The conversion rules in Figure 1 (the lines with arrows) fall into four categories:
Basic conversion rules: numbers 1, 2, 3, 4 and 5; functionally, these are the paths that correctly accept the strings of the rule set.
Cross conversion rules: number 6; paths that switch between the paths of different rules.
Restart conversion rules: numbers 7, 8, 9 and 10; paths that return to the state immediately following the initial state.
Failure conversion rules: numbers 11, 12, 13, 14, 15 and 16; paths that return to the initial state.
Jan van Lunteren's paper "High-Performance Pattern-Matching for Intrusion Detection", presented at the 25th IEEE INFOCOM conference in April 2006 (cited below as "Paper 1"), proposes an implementation.
Paper 1 adopts Scheme A and proposes a prioritized storage method for conversion rules that merges all the failure conversion rules and all the restart conversion rules of Figure 1 into at most 256 rules. In practice this greatly reduces the number of conversion rules.
The scheme of Paper 1 rests on what the failure conversion rules of Figure 1 have in common (they all return to the initial state) and what the restart conversion rules have in common (they all return to the state following the initial state). The failure conversion rules can therefore be given the lowest priority and the restart conversion rules the second lowest. Figure 2 and Table 1 give an example.

Table 1
Conversion rule   Current state   Input character   Next state   Priority
1                 S2              1                 S3           2
2                 *               1                 S1           1
3                 S1              2                 S2           2
4                 S4              B                 S5           2
5                 *               A                 S4           1
6                 *               *                 S0           0

Although Figure 2 contains many conversion rules, with the priority description only 6 rules actually remain, as shown in Table 1.
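The prioritized lookup over Table 1 can be sketched as below. The sketch assumes that "*" is a wildcard matching any state or any character and that the matching rule with the highest priority value wins; the names `RULES` and `lookup` are hypothetical, and Paper 1 may realize the selection differently in hardware.

```python
WILDCARD = "*"

# (current state, input character, next state, priority), copied from Table 1.
RULES = [
    ("S2", "1", "S3", 2),
    ("*", "1", "S1", 1),
    ("S1", "2", "S2", 2),
    ("S4", "B", "S5", 2),
    ("*", "A", "S4", 1),
    ("*", "*", "S0", 0),
]


def lookup(state, ch):
    """Return the next state of the highest-priority rule that matches."""
    matching = [
        rule for rule in RULES
        if rule[0] in (state, WILDCARD) and rule[1] in (ch, WILDCARD)
    ]
    # Rule 6 matches everything, so `matching` is never empty.
    return max(matching, key=lambda rule: rule[3])[2]


if __name__ == "__main__":
    print(lookup("S2", "1"), lookup("S3", "1"), lookup("S4", "X"))
```

A state-specific rule (priority 2) overrides a restart rule (priority 1), which in turn overrides the single failure rule (priority 0).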
Paper 1 does not fully solve the problem of storage growing quickly with the number of rules; matching a large-scale feature set still carries a very high space cost.
A state machine consists of states and conversion rules. Implementing a state machine as a chip structure means storing its conversion rules in a particular memory and accessing them as needed. Each conversion rule holds a pre-state, an input character and a post-state. The pre-state is the current state of the state machine, and the rule describes the jump to the post-state when that character is received in the pre-state. For every (pre-state, input character) pair the state machine has exactly one corresponding conversion rule.
A state machine may have a great many conversion rules, and locating the required rule among them is a technical problem that any chip implementation of a state machine must face. The problem can be stated abstractly: given a known pre-state and input character, find the corresponding post-state.
Sensory's patent US 7,082,044 B2 proposes a method for this problem: all conversion rules are stored in a TCAM (ternary content-addressable memory) in the format [pre-state, input character, post-state]. Because the pre-state and the input character are known at lookup time, they are presented to the TCAM, whose parallel search finds the corresponding conversion rule.
This scheme is intuitive, but it requires a special memory device (TCAM) that is large in area, costly and power-hungry, and limited in capacity. A hardware state machine built this way cannot hold many conversion rules, so the state machine must be small and the feature sets it can match are very limited.
The existing multi-string matching methods and chip structures are thus still not practical enough in use and need further improvement. Vendors have long sought a solution, but no suitable design has yet been completed, so the problem remains one that the industry urgently wants solved. Creating a new multi-string matching method and chip structure is therefore an important current research and development topic.
In view of these defects of the existing multi-string matching methods and chip structures, the inventor drew on years of practical experience and expertise in designing and manufacturing such products, together with the relevant theory, to create a new multi-string matching method and chip structure that improves on the existing ones and is more practical. After continued research and design, and repeated trials and refinement, the present invention was created.

Summary of the invention
The main object of the present invention is to provide a multi-string matching method and chip structure that achieve high matching speed and can match against large-scale rule sets.
The object of the present invention is achieved by the following technical solutions. The cache state machine proposed by the present invention comprises: a state register, which holds the current state; a cached-state register, which holds the cached state; and a conversion rule module, which stores and accesses the state conversion rule base, looks up the next state from the character received by the interface module, the current state held in the state register and the cached state held in the cached-state register, outputs that next state to the state register, and assigns the cached-state register according to specific cache rules.
The object of the present invention is also achieved by the following technical solution. A multi-string matching method proposed by the present invention comprises the following steps: characters are taken in order from the received input character stream as input characters, and for each input character: the post-state is looked up in the state conversion rule base from the current input character, the current state and the cached state; the machine jumps to that post-state; a state is cached according to specific cache rules; then the post-state becomes the current state, the newly cached state becomes the cached state and the next input character becomes the current input character. These per-character steps repeat until every character in the stream has been processed.
Preferably, in the foregoing multi-string matching method, looking up the post-state comprises: first determining whether the current state, on receiving the current input character, has a post-state in the basic conversion rules and the n-step cross conversion rules, and if so taking that post-state as the result; if not, determining whether the cached state, on receiving the current input character, has a post-state in the basic conversion rules and the n-step cross conversion rules, and if so taking that post-state as the result; if not, determining whether the initial state, on receiving the current input character, has a post-state in the basic conversion rules and the n-step cross conversion rules, and if so taking that post-state as the result; otherwise taking the initial state as the result.
Caching a state according to the specific cache rules means: if the initial state, on receiving the current input character, has a corresponding post-state in the basic conversion rules, that post-state is cached; otherwise the initial state is cached.
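The lookup order (current state, then cached state, then initial state) and the cache rule can be sketched for P = {SEC, SSH} as follows. The rule table here is a hypothetical one holding only basic conversion rules, the state labels S0 to S5 are assumed, and no n-step cross conversion rules are included; it is an illustration of the steps above, not the patented implementation.

```python
INITIAL = "S0"

# Assumed basic conversion rules for {SEC, SSH}: (pre-state, character) -> post-state.
BASIC_RULES = {
    ("S0", "S"): "S1",
    ("S1", "E"): "S2",
    ("S2", "C"): "S3",
    ("S1", "S"): "S4",
    ("S4", "H"): "S5",
}
ACCEPTING = {"S3": "SEC", "S5": "SSH"}


def next_state(current, cached, ch):
    """Try the current state, then the cached state, then the initial state."""
    for source in (current, cached, INITIAL):
        post = BASIC_RULES.get((source, ch))
        if post is not None:
            return post
    return INITIAL


def update_cache(ch):
    """Cache the post-state of the initial state on ch, else the initial state."""
    return BASIC_RULES.get((INITIAL, ch), INITIAL)


def run(text):
    current = cached = INITIAL
    hits = []
    for pos, ch in enumerate(text):
        # The lookup uses the state cached on the previous character.
        current = next_state(current, cached, ch)
        cached = update_cache(ch)
        if current in ACCEPTING:
            hits.append((pos, ACCEPTING[current]))
    return hits


if __name__ == "__main__":
    print(run("SSSH"))
```

In this toy rule set the third "S" of "SSSH" has no rule in the current state, and the cached state supplies the transition instead, so that transition does not need to be stored as a separate rule.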
Preferably, in the foregoing multi-string matching method, looking up the post-state comprises: determining the type of the current state; if it is a converging state or a general state, looking up the post-state in the state conversion rule set from the current input character and the current state; if it is a separating state, looking up the post-state in the separating-state conversion rule set from the current input character, the current state and the cached state.
The separating-state conversion rule set takes three inputs (the current input character, the current state and the cached state) and provides one output, the post-state.
Caching according to the specific cache rules here means: if the current state is a converging state, the current state is cached.
The present invention also provides a computer-readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the following steps: receiving input characters, and for each input character: looking up the post-state in the state conversion rule base from the current input character, the current state and the cached state; jumping to that post-state; caching a state according to specific cache rules; then taking the post-state as the current state, the newly cached state as the cached state and the next input character as the current input character, and repeating these per-character steps until every character in the stream has been processed.
The present invention also provides a system comprising: a processor; a bus connected to the processor, for transferring data between the parts of the system; a communication interface connected to the bus, for receiving a character data stream; and a main memory connected to the bus and storing instructions which, when executed by the processor, cause the processor to perform the following steps: taking characters in order from the received character data stream as input characters, and for each input character: looking up the post-state in the state conversion rule base from the current input character, the current state and the cached state; jumping to that post-state; caching a state according to specific cache rules; then taking the post-state as the current state, the newly cached state as the cached state and the next input character as the current input character, and repeating these per-character steps until every character in the stream has been processed.
The object of the present invention is additionally achieved by the following technical solution. The post-state lookup method proposed by the present invention comprises: calculating a possible post-state from the current state and the input character, using the input translation table; looking up the rule storage table with that possible post-state to obtain its corresponding input character; comparing the actual input character with the character obtained from the rule storage table; if they match, moving to the possible post-state; if they do not match, resetting the state to zero.
The state numbering rule includes: if the current state has only one outgoing conversion rule, the post-state that rule points to is numbered one higher than the current state.
Preferably, in the foregoing post-state lookup method, calculating the possible post-state comprises, according to a given rule set: if the current state has only one outgoing conversion rule, adding one to the number of the current state to obtain the number of the possible post-state; if the current state has several outgoing conversion rules, looking up the input translation table with the color of the current state and the input character to obtain the difference between the numbers of the possible post-state and the current state, and adding that difference to the number of the current state to obtain the number of the possible post-state. The rule storage table takes a post-state as input and outputs the color of that post-state and the input character corresponding to it.
The input translation table takes the color of the current state and the input character as input and outputs the difference between the numbers of the possible post-state and the current state.
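The calculate-then-verify lookup described above can be sketched for P = {SEC, SSH} as follows. The state numbers, the color values and the convention that color 0 marks a state with a single outgoing rule are assumptions made for illustration, as are the names `RULE_TABLE`, `ITT` and `next_state`; states without outgoing rules are given a color that has no ITT entry.

```python
SINGLE = 0  # assumed color for states with exactly one outgoing rule

# Rule storage table: post-state -> (color of that state, character leading into it).
RULE_TABLE = {
    0: (SINGLE, None),  # initial state, entered by no character
    1: (1, "S"),        # two outgoing rules, so it uses ITT color 1
    2: (SINGLE, "E"),
    3: (2, "C"),        # no outgoing rule: color 2 has no ITT entries
    4: (SINGLE, "S"),
    5: (2, "H"),
}

# Input translation table: (color, character) -> difference in state numbers.
ITT = {(1, "E"): 1, (1, "S"): 3}


def possible_post_state(state, color, ch):
    if color == SINGLE:
        return state + 1               # numbering rule: successor is state + 1
    delta = ITT.get((color, ch))
    return None if delta is None else state + delta


def next_state(state, color, ch):
    """Return (state, color) after reading ch; reset to state 0 on mismatch."""
    candidate = possible_post_state(state, color, ch)
    if candidate in RULE_TABLE:
        cand_color, expected = RULE_TABLE[candidate]
        if expected == ch:             # verified against the actual input
            return candidate, cand_color
    return 0, RULE_TABLE[0][0]


if __name__ == "__main__":
    print(next_state(1, 1, "S"))
```

Only one small table entry per state is stored; the input character stored with the candidate state serves as the check that the guessed transition really exists.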
Preferably, the foregoing post-state lookup method further comprises merging entries of the input translation table (ITT), in which each row corresponds to one kind of current state and each column to one input character. Entry merging includes checking for resource conflicts and coverage conflicts in every column of the two rows to be merged. Column k is checked as follows: if exactly one of the two entries is empty, it is determined whether the character accepted by the state of the empty entry, when that state uses the non-empty entry's data after merging, equals k; if so, there is a coverage conflict, the two rows cannot be merged and the check ends; if not, the check continues. If both entries are empty or both are non-empty, it is determined whether their values are the same; if not, there is a resource conflict, the two rows cannot be merged and the check ends; if so, the next column is checked. A resource conflict means that the corresponding entries in the ITT rows are non-empty and different. A coverage conflict arises when a non-empty entry in an ITT row overwrites an empty one, which effectively adds an extra conversion rule for the original state; if that extra rule conflicts with the existing conversion rules, it is a coverage conflict. Once no column of the two rows shows a resource conflict or a coverage conflict, the two rows are merged, with non-empty values overwriting empty ones.
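One possible reading of the merge check is sketched below. It assumes each ITT row is a mapping from character to number difference (missing keys are empty entries), that `states_a` and `states_b` list the state numbers using each row, and that `entry_char` maps a state number to the character stored for it in the rule storage table; all of these names and the test data are hypothetical.

```python
def can_merge(row_a, row_b, states_a, states_b, entry_char):
    """Check two ITT rows for resource and coverage conflicts."""
    for k in set(row_a) | set(row_b):
        a, b = row_a.get(k), row_b.get(k)
        if a is not None and b is not None:
            if a != b:
                return False  # resource conflict: both set, values differ
            continue
        # Exactly one entry is empty: the other row's value would cover it.
        delta = a if a is not None else b
        affected = states_b if a is not None else states_a
        for state in affected:
            # The covered state would now compute state + delta on input k.
            # If that target is entered by k, a spurious rule appears.
            if entry_char.get(state + delta) == k:
                return False  # coverage conflict
    return True


def merge(row_a, row_b):
    """Merge two compatible rows; non-empty values overwrite empty ones."""
    return {**row_a, **row_b}


if __name__ == "__main__":
    chars = {2: "E", 4: "S", 11: "E", 13: "X"}
    print(can_merge({"E": 1, "S": 3}, {"E": 1}, [1], [10], chars))
```

Merged rows let several states share one color, which is what reduces the size of the input translation table.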
Preferably, the foregoing post-state lookup method further comprises applying set-associative optimization to the input translation table, which includes the following check for resource conflicts: for N-way set association, a row of the ITT is divided into 256/N groups; for each group, the number of valid values that the two rows contain in that group is counted; if that number is greater than N, the group has a resource conflict; otherwise the next group is checked. Once none of the 256/N groups shows a resource conflict, the two rows are merged.
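The group-wise check can be sketched as follows. The text does not say how columns are assigned to groups or whether identical values in both rows count once, so this sketch assumes contiguous groups of N columns (group index = character code // N) and counts every valid value; the function name is hypothetical.

```python
def can_merge_set_associative(row_a, row_b, n_ways):
    """Rows map byte values (0..255) to number differences; missing = empty."""
    groups = 256 // n_ways
    counts = [0] * groups
    for row in (row_a, row_b):
        for code in row:
            counts[code // n_ways] += 1   # assumed column-to-group mapping
    # A group can hold at most n_ways valid values after merging.
    return all(count <= n_ways for count in counts)


if __name__ == "__main__":
    print(can_merge_set_associative({0: 1, 1: 2, 2: 3}, {3: 1}, 4))
```

With this relaxation two rows may be merged even when they hold different values in nearby columns, as long as each group stays within its N slots.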
The present invention also provides a computer-readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the following steps: calculating a possible post-state from the current state and the input character, using the input translation table; looking up the rule storage table with that possible post-state to obtain its corresponding input character; comparing the actual input character with the character obtained from the rule storage table; if they match, moving to the possible post-state; if they do not match, resetting the state to zero.
The state numbering rule includes: if the current state has only one outgoing conversion rule, the post-state that rule points to is numbered one higher than the current state. Calculating the possible post-state comprises, according to a given rule set: if the current state has only one outgoing conversion rule, adding one to the number of the current state to obtain the number of the possible post-state; if the current state has several outgoing conversion rules, looking up the input translation table with the color of the current state and the input character to obtain the difference between the numbers of the possible post-state and the current state, and adding that difference to the number of the current state to obtain the number of the possible post-state. The rule storage table takes a post-state as input and outputs the color of that post-state and the input character corresponding to it.
The input translation table takes the color of the current state and the input character as input and outputs the difference between the numbers of the possible post-state and the current state.
Preferably, each row of the input translation table corresponds to one kind of current state and each column to one input character, and the input translation table has undergone entry merging, performed as follows. Every column of the two rows to be merged is checked; column k is checked as follows: if exactly one of the two entries is empty, it is determined whether the character accepted by the state of the empty entry, when that state uses the non-empty entry's data after merging, equals k; if so, there is a coverage conflict, the two rows cannot be merged and the check ends; if not, the check continues. If both entries are empty or both are non-empty, it is determined whether their values are the same; if not, there is a resource conflict, the two rows cannot be merged and the check ends; if so, the next column is checked. Once no column of the two rows shows a resource conflict or a coverage conflict, the two rows are merged, with non-empty values overwriting empty ones.
Preferably, the input translation table has undergone set-associative optimization, which includes the following check for resource conflicts: for N-way set association, a row of the ITT is divided into 256/N groups; for each group, the number of valid values that the two rows contain in that group is counted; if that number is greater than N, the group has a resource conflict; otherwise the next group is checked. Once none of the 256/N groups shows a resource conflict, the two rows are merged.
本发明还提供了一种系统, 包括: 主处理器, 组织输入数据流; 协处 理器单元, 与主处理器连接; 所述的协处理器单元内进行如下操作: 根据 当前状态和输入字符配合输入翻译表计算出可能的后状态; 根据所述可能 的后状态查找规则存储表以获得对应的输入字符; 比较所述的实际输入字 符和查找所述规则存储表所获得的字符是否一致; 如果结果一致, 则将状 态转换到所述的可能的后状态; 如果结果不一致, 则状态归零。  The present invention also provides a system, comprising: a main processor, an organization input data stream; a coprocessor unit, connected to the main processor; the coprocessor unit performs the following operations: according to the current state and the input characters Entering a translation table to calculate a possible post state; searching the rule storage table according to the possible post state to obtain a corresponding input character; comparing whether the actual input character and the character obtained by searching the rule storage table are consistent; The results are consistent, then the state is transitioned to the possible post state; if the results are inconsistent, the state is zeroed.
所述的状态的编号规则包括: 如果所述当前状态只有一条对应的输出 转换规则, 则该条输出转换规则所指向的后状态的编号为所述当前状态的 编号加一; 所述计算可能的后状态的步骤包括: 根据一定的规则集, 如果 所述当前状态只有一条对应的输出转换规则, 则用所述当前状态的编号加 一以获得可能的后状态的编号; 如果所述当前状态存在多条对应的输出转 换规则, 则将所述当前状态的颜色和所述输入字符作为输入, 查找所述输 入翻译表以获得所述可能的后状态与所述当前状态之间编号的差值, 并用 所述当前状态的编号加上所述的差值以获得可能的后状态的编号。  The numbering rule of the state includes: if the current state has only one corresponding output conversion rule, the number of the state after the output conversion rule is the number of the current state plus one; the calculation is possible The step of the post state includes: according to a certain rule set, if the current state has only one corresponding output conversion rule, add a number of the current state to obtain a number of possible post states; if the current state exists a plurality of corresponding output conversion rules, taking the color of the current state and the input character as inputs, and searching the input translation table to obtain a difference between the number of the possible post state and the current state, And adding the difference by the number of the current state to obtain the number of possible post-states.
The rule storage table is organized as follows: its input is a post-state, and the corresponding output is the color of that post-state and the input character corresponding to that post-state.
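The compute, look up, and compare sequence described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chip's implementation: the dictionary layouts, the names `next_state` and `run`, the convention that color 0 marks a state with a single outgoing rule, and the rule that a missing table entry counts as a mismatch are all choices made for this example. The toy tables encode the two strings "ab" and "ac".

```python
SINGLE = 0  # assumed color of a state with exactly one outgoing rule

# Rule storage table: post-state -> (color of post-state, character leading to it).
RULES = {1: (1, "a"), 2: (2, "b"), 3: (2, "c")}
# Input translation table: (color of current state, character) -> number difference.
ITT = {(1, "b"): 1, (1, "c"): 2}


def next_state(state, color, ch):
    """Compute the possible post-state, verify it, and return (state, color)."""
    # Single-exit states use "number plus one"; other states consult the ITT.
    delta = 1 if color == SINGLE else ITT.get((color, ch))
    if delta is not None:
        candidate = state + delta
        entry = RULES.get(candidate)
        # Accept the candidate only if the stored character equals the input.
        if entry is not None and entry[1] == ch:
            return candidate, entry[0]
    return 0, SINGLE  # mismatch: the state is reset to zero


def run(text):
    """Feed a whole string through the sketch and return the final (state, color)."""
    state, color = 0, SINGLE
    for ch in text:
        state, color = next_state(state, color, ch)
    return state, color
```

In this toy numbering, state 0 has one outgoing rule (on "a"), state 1 has two (on "b" and "c"), and states 2 and 3 have none, so they are given a color with no ITT entries.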
The input translation table is organized as follows: its input is the color of the current state and the input character, and the corresponding output is the difference between the numbers of the possible post-state and the current state.
Preferably, each row of the input translation table corresponds to one current state and each column to one input character, and the table has undergone entry merging, which proceeds as follows. For the two rows to be merged, every column is checked. The check for column k is: if exactly one of the two entries in column k is empty, determine whether the character accepted by the state corresponding to the empty entry, when that state uses the data of the non-empty entry after merging, equals k; if so, this is an overlay conflict, the rows cannot be merged, and the procedure exits; if not, the following check is made. If both entries are empty or both are non-empty, determine whether their values are the same; if not, this is a resource conflict, the rows cannot be merged, and the procedure exits; if so, the next column is checked. Once it is established that no column of the two rows has a resource conflict or an overlay conflict, the two rows are merged, with non-empty values overwriting empty ones.
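The entry-merging check can be sketched as follows. This is one possible reading of the text, not a confirmed implementation: it assumes each row belongs to a single state number, represents empty entries as `None`, and interprets the overlay conflict as "the state that borrows a difference from the other row would land on a post-state whose stored character is exactly k, so the later comparison would wrongly succeed". The names `can_merge`, `merge` and `rule_char` are hypothetical.

```python
def can_merge(state_a, row_a, state_b, row_b, rule_char):
    """Return True if two ITT rows can share one physical row.

    row_a / row_b: lists of number differences, None for an empty entry.
    rule_char: post-state number -> column index of the character stored
               for it in the rule storage table (assumed representation).
    """
    for k, (a, b) in enumerate(zip(row_a, row_b)):
        if (a is None) != (b is None):
            # Exactly one entry is empty: the state owning the empty entry
            # would borrow the other row's difference after merging.
            borrower = state_a if a is None else state_b
            delta = b if a is None else a
            if rule_char.get(borrower + delta) == k:
                return False  # overlay conflict
        elif a != b:
            return False  # resource conflict: two different non-empty values
    return True


def merge(row_a, row_b):
    """Merge two compatible rows; non-empty values overwrite empty ones."""
    return [a if a is not None else b for a, b in zip(row_a, row_b)]
```

A real table would have 256 columns, one per byte value; the check itself does not depend on the row width.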
Preferably, the input translation table has undergone set-associative optimization, which includes the following step of determining whether a resource conflict exists: for N-way set associativity, each row of the ITT is divided into 256/N groups; for a given group, the number of valid values contained in the two rows is counted; if that number is greater than N, the group has a resource conflict; otherwise the next group is checked; once it is established that none of the 256/N groups has a resource conflict, the two rows are merged.
The object of the present invention and the solution of its technical problem are additionally achieved by the following technical solution. A post-state lookup structure according to the present invention comprises: a main memory, which stores basic transition rules and cross-transition rules, takes as input the possible post-state computed from the current state and the input character in conjunction with the input translation table, and outputs, according to the stored transition rules, the color of the possible post-state and the input character corresponding to it; a secondary memory, which stores failure transition rules and restart transition rules, takes as input the actual input character, and outputs, according to the stored transition rules, the post-state corresponding to the actual input character and its color; an input translation table, whose input is the color of the current state and the actual input character, and whose output is the difference between the numbers of the possible post-state and the current state; and a two-state selector, which acts on the result of comparing the character output by the main memory with the actual input character: if they are equal, the current state transitions to the computed possible post-state and the current color changes to the color of that possible post-state as output by the main memory; otherwise, the current state and its color are set to the output of the secondary memory.
Preferably, the post-state lookup structure further comprises a comparator for comparing the character output by the main memory with the actual input character.
Preferably, the post-state lookup structure further comprises: a status register for storing the current state; and a color register for storing the color of the current state.
Preferably, the post-state lookup structure further comprises a selector which, according to the value of the color register, selects between the output value of the input translation table and the constant 1.
Preferably, the post-state lookup structure further comprises an adder for adding the number of the current state to the output value of the selector in order to compute the possible post-state.
The object of the present invention and the solution of its technical problem are further achieved by the following technical solution. A multi-string matching structure according to the present invention comprises: a status register for storing the current state; a color register for storing the color of the current state; a state buffer for storing the cached state; a color buffer for storing the color of the cached state; a main memory, which stores basic transition rules and n-step cross-transition rules, whose first input is the first possible post-state computed from the current state and the input character in conjunction with the input translation table, and whose corresponding first output is the color of the first possible post-state and the input character corresponding to it, obtained according to the stored transition rules, and whose second input is the second possible post-state computed from the cached state and the input character in conjunction with the input translation table, and whose corresponding second output is the color of the second possible post-state and the input character corresponding to it, obtained according to the stored transition rules; a secondary memory, which stores failure transition rules and restart transition rules, whose input is the actual input character and whose output is the post-state corresponding to the actual input character and its color, obtained according to the stored transition rules; in each transition cycle of the current state, […] performs a second overwrite; an input translation table, whose first input is the color of the current state and the actual input character, with the corresponding first output being the difference between the numbers of the first possible post-state and the current state, and whose second input is the color of the cached state and the actual input character, with the corresponding second output being the difference between the numbers of the second possible post-state and the cached state; […] if the first-path character is the same as the actual input character, the status register is overwritten with the first possible post-state and the color register with the color of the first possible post-state; if the first-path character differs from the actual input character but the second-path character is the same as the actual input character, the status register is overwritten with the second possible post-state and the color register with the color of the second possible post-state; otherwise, the status register and the color register are overwritten with the post-state and the color output by the secondary memory, respectively.
Preferably, the multi-string matching structure further comprises: a first comparator for comparing the first-path character output by the main memory with the actual input character; and a second comparator for comparing the second-path character output by the main memory with the actual input character.
Preferably, the multi-string matching structure further comprises: a first selector which, according to the value of the color register, selects between the output value of the input translation table and the constant 1; and a second selector which, according to the value of the color buffer, selects between the output value of the input translation table and the constant 1.
Preferably, the multi-string matching structure further comprises: a first adder for adding the number of the current state to the output value of the first selector in order to compute the first possible post-state; and a second adder for adding the number of the cached state to the output value of the second selector in order to compute the second possible post-state.
The object of the present invention and the solution of its technical problem are also achieved by the following technical solution. A multi-regular-expression matching method according to the present invention comprises the following steps: taking characters in order from the received input character stream as input characters; and, for each input character: looking up the post-state in the state transition rule base according to the current input character, the current state and the cached state; jumping to that post-state; caching a state according to a specific caching rule; and then, taking the post-state as the current state, the newly cached state as the cached state and the next input character as the current input character, repeating the per-character steps until every character in the stream has been processed.
Preferably, in the aforementioned multi-regular-expression matching method, the step of looking up the post-state comprises: first determining whether, under the basic transition rules and the n-step cross-transition rules, the current state has a post-state on the current input character; if so, taking that post-state as the lookup result; if not, determining whether the cached state has a post-state on the current input character under the basic transition rules and the n-step cross-transition rules; if so, taking that post-state as the lookup result; if not, determining whether the initial state has a post-state on the current input character under the basic transition rules and the n-step cross-transition rules; if so, taking that post-state as the lookup result; otherwise, taking the initial state as the lookup result. The step of caching a state according to a specific caching rule is: if the initial state has a corresponding post-state on the current input character under the basic transition rules, that post-state is cached; otherwise, the initial state is cached.
Preferably, in the aforementioned multi-regular-expression matching method, the step of looking up the post-state comprises: determining the type of the current state; if it is a converging state or a general state, looking up the post-state in the state transition rule set according to the current input character and the current state; if it is a separating state, looking up the post-state in the separating-state transition rule set according to the current input character, the current state and the cached state. The separating-state transition rule set takes three inputs (the current input character, the current state and the cached state) and provides one output (the post-state). The step of caching according to a specific caching rule is: if the current state is a converging state, the current state is cached.
Compared with the prior art, the present invention has clear advantages and beneficial effects. With the above technical solutions, the multi-string matching method based on the cached state machine and the chip structure based on "post-state lookup" have at least the following advantages and beneficial effects:
They can eliminate more than 95%, or even all, of the cross-transition rules; they can reduce the number of basic transition rules, and thus the number of states required; and they can achieve a higher matching speed than other methods. In short, they meet the demand for high-speed, large-scale multi-string matching. Matching performance is independent of the size of the rule base, of the minimum rule length in the rule base, and of the relationship between the rule base and the text to be matched; large-scale rule sets are supported; storage grows sublinearly with the number of rules; space requirements are effectively reduced; and the transition rules of the state machine can be stored and accessed efficiently.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Figure 1: Finite state automaton constructed by the existing multi-string matching scheme A.
Figure 2: Finite state automaton constructed according to the scheme of existing paper 1, in which the transition rules are assigned different priorities.
Figure 3: State machine model.
Figure 4: Cached state machine model.
Figure 5: Finite state automaton constructed according to scheme A, with the restart transition rules and failure transition rules removed.
Figure 6: Cached state machine constructed to implement dynamic cross-transition loading.
Figure 7: Flowchart of the dynamic cross-transition loading method.
Figure 8: Finite state automaton constructed according to scheme A, containing isomorphic paths.
Figure 9: Ideal framework after optimization for the pattern set {betters, pattern}.
Figure 10: Isomorphic path merging based on the cached state machine.
Figure 11: The three kinds of states in the isomorphic path merging method.
Figure 12: Transition functions of the three kinds of states in the isomorphic path merging method.
Figure 13: The two observations on which the post-state lookup structure is based.
Figure 14: Post-state lookup framework.
Figure 15: Detailed diagram of the post-state lookup structure.
Figure 16: Structure of the input translation table (ITT).
Figure 17: Schematic of ITT entry merging.
Figure 18: ITT optimization 1: the idea of entry merging.
Figure 19: ITT optimization 1: the entry merging method.
Figure 20: ITT optimization 2: structure of the 2-way set-associative ITT.
Figure 21: ITT optimization 2: the N-way set-associative ITT optimization method.
Figure 22: ACC-NSA, the chip structure that implements multi-string matching based on the cached state machine.
Figure 23: Effect of applying the dynamic cross-transition loading method to eliminate cross-transition rules (ClamAV rules).
Figure 24: Effect of applying the dynamic cross-transition loading method to eliminate cross-transition rules (Snort rules).
Figure 25: Effect of applying the isomorphic path merging method to reduce basic transition rules.
Best mode for carrying out the invention
The specific embodiments, steps, structures, features and effects of the multi-string matching method and chip structure proposed by the present invention are described in detail below with reference to the drawings and preferred embodiments.
To meet the demand for high-speed, large-scale multi-string matching, the present invention adopts a model called the "cached state machine", whose design derives from the deterministic finite automaton (DFA) mentioned earlier. One representation of a DFA is shown in Figure 3: each DFA has a current state (held in the status register) and, according to the input character and the transition rule with which the current state accepts that character, enters the next state. When the next character arrives, the "next state" becomes the "current state". Driven by the input characters, a DFA performs state transitions according to an internal data structure such as the one shown in Figure 1. The defining property of a DFA is that its next state is determined solely by the current state and the current input character.
The traditional state machine models (DFA and NFA) are simplified forms of the Turing machine model. Whether for a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA), the next state is determined only by the current state and the current input, as shown in Figure 3. An NFA can be converted into an equivalent DFA.
A deterministic finite automaton (DFA) is defined as a five-tuple $M = \{K, \Sigma, s_0, F, \delta\}$, comprising:
• a finite set of states, denoted $K$, i.e. the set of all states;
• an alphabet, denoted $\Sigma$, i.e. the set of characters the state machine accepts;
• the start state of the state machine, denoted $s_0$;
• a set of accepting states, denoted $F$, which is a subset of the finite state set, $F \subseteq K$;
• a state transition function, $\delta: K \times \Sigma \to K$.
Here the state transition function is a function of two arguments: it determines the next state from the current state of the state machine and the received character.
The cached state machine (CDFA, Cached Deterministic Finite Automata) is proposed by the present invention; one representation of it is shown in Figure 4. Compared with a DFA, a CDFA holds, in addition to the current state, a cached state (in the state buffer). In a cached state machine, the next state is determined by three parameters: the current state, the current input character and the cached state. The next cached state is determined by the internal mechanism of the cached state machine, requires no external input, and can be tailored flexibly to the particular needs of the machine.
The cached state machine (CDFA) breaks with the traditional state machine property that "the next state is determined only by the current state and the current input": by recording history information, it enriches the operations available to the state machine when deciding the post-state. The cached state machine achieves this design goal by adding a state-caching function to the state machine, as shown in Figure 4. Viewed from its external interface, a cached state machine is the same as a traditional one: it only receives input characters and outputs the state machine's decision. The difference is that a state buffer (cache) is added internally, which can cache states according to a given policy.
A cached state machine (CDFA) can be defined as a seven-tuple $\{K, \Sigma, s_0, F, N, \delta, \theta\}$, comprising:
• a finite set of states, denoted $K$, i.e. the set of all states;
• an alphabet, denoted $\Sigma$, i.e. the set of characters the state machine accepts;
• the start state of the state machine, denoted $s_0$;
• a set of accepting states, denoted $F$, $F \subseteq K$;
• the number of cached states held in the state machine, denoted $N$;
• a state transition function, $\delta: K \times K^N \times \Sigma \to K$;
• a cache policy function, $\theta: K \times \Sigma \to K^N$.
Here the cache policy function $\theta$ determines the state to be cached from the current state and the current input; the state transition function $\delta$ determines the next state from the current state, the cached state and the input character.
Because the newly added state buffer is controlled by the state machine according to the cache policy function and is neither visible nor operable from outside, much like the design principle of the cache in a computer memory system, this new state machine model is named the cached state machine model.
The cache policy function can remember history that the state machine has passed through, and can also "remember" other state information in some prescribed way.
From the above description, the structure of the cached state machine is as follows. It comprises:
a status register, for holding the current state;
a cached-state register, for holding cached states, where the number of states it can hold is N, N ≥ 1;
a transition rule module, for storing the state transition rule base and for looking up the next state according to the character received by the interface module, the current state held in the status register and the cached state held in the cached-state register.
In addition, an implementation of the cached state machine should also include the following:
an interface module, for receiving input characters;
a control module, for controlling the interface module so that it receives input characters correctly, controlling the status register to update the current state, controlling the cached-state register to update the cached state, and controlling the transition rule module to look up the next state.
As noted earlier, scheme A, which is based on a DFA, suffers from an explosion in storage space as rules are added when performing multi-string matching; the explosion comes from the number of transition rules growing exponentially. Study shows that the space explosion stems from three kinds of transition rules (cross-transition rules, restart transition rules and failure transition rules). To solve the space explosion problem, growth in the number of these three kinds of rules must be effectively controlled.
The prioritized method used in paper 1 can keep the number of restart transition rules and failure transition rules within 256. The present invention can adopt the same method to handle these two kinds of transition rules.
Using the cached state machine principle, the present invention can chiefly eliminate nearly all cross-transition rules, thereby fundamentally solving the space explosion problem. Using the same principle, it can also reduce the number of basic transition rules, so that storage space grows sublinearly with the number of rules. The implementation is described below.
Use of the cached state machine principle, method 1: "dynamic cross-transition loading", which can eliminate more than 95%, or even all, of the cross-transition rules. This method is named ACC.
Taking P = {slice, cross} as an example, the DFA constructed by scheme A is shown in Figure 5 (with the restart transition rules and failure transition rules removed). It contains three cross-transition rules.
If the text to be matched is croslice, the cached state machine is constructed as shown in Figure 6:
Here the cross-transition rules have been eliminated and replaced by a cache slot.
The principle by which a cross-transition rule arises is as follows. When the current state is $S_3$ and the received character is s, then, together with the transition to state $S_4$, another path is opened from $S_0$ (the path containing $S_6$). For the next input character, if it does not satisfy the basic transition rule of the current path (in state $S_4$, on input s, go to state $S_5$) but does satisfy the basic transition rule of the other path (in state $S_6$, on input l, go to state $S_7$), a cross-transition rule arises: if the next input character is l, the machine jumps from state $S_4$ to $S_7$.
The cached state machine operates as follows. If the current state is $S_3$ and the currently received character is s, then, following the principle by which cross-transition rules arise, state $S_6$ is cached, and at the same time the state machine enters the next state $S_4$. In state $S_4$ the received character is l, and the next state is determined jointly by the current state ($S_4$), the input character (l) and the cached state ($S_6$): because on the basic transition path $S_4$ does not accept the character l while $S_6$ does, $S_7$ is chosen as the next state.
Dynamic cross-transition loading uses the CDFA principle to generate dynamically the cross-transition rules that a DFA describes statically, thereby greatly reducing the number of stored transition rules.
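The croslice walk-through can be reproduced with a short Python sketch. It is an illustration under assumptions rather than the patented implementation: only the transitions among $S_3$ to $S_7$ are stated in the text, so the remaining state numbers are assumed to follow the same pattern (cross on states 1 to 5, slice on states 6 to 10); the names `BASIC`, `delta`, `theta` and `scan` are hypothetical; and only basic transition rules are stored, on the premise that every cross-transition rule for this pattern set is generated through the single cached state.

```python
# Basic transition rules only; no cross-transition rules are stored.
BASIC = {
    (0, "c"): 1, (1, "r"): 2, (2, "o"): 3, (3, "s"): 4, (4, "s"): 5,   # cross
    (0, "s"): 6, (6, "l"): 7, (7, "i"): 8, (8, "c"): 9, (9, "e"): 10,  # slice
}
ACCEPT = {5: "cross", 10: "slice"}


def delta(state, cached, ch):
    """Try the current state, then the cached state, then the initial state."""
    for s in (state, cached, 0):
        nxt = BASIC.get((s, ch))
        if nxt is not None:
            return nxt
    return 0  # no rule accepts ch: fall back to the initial state


def theta(ch):
    """Cache the state the initial state reaches on ch, else the initial state."""
    return BASIC.get((0, ch), 0)


def scan(text):
    """Return (end index, pattern) for every match found in text."""
    state = cached = 0
    hits = []
    for i, ch in enumerate(text):
        # The right-hand side is evaluated first, so delta sees the old cache.
        state, cached = delta(state, cached, ch), theta(ch)
        if state in ACCEPT:
            hits.append((i, ACCEPT[state]))
    return hits
```

On the input croslice, the fourth character s moves the machine from state 3 to state 4 and caches state 6; the following l is not accepted by state 4 but is accepted by the cached state 6, so the machine continues at state 7 and eventually reports slice.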
The cached state machine (CDFA) used in the ACC method is a seven-tuple $\{K, \Sigma, s_0, F, 1, \delta, \theta\}$; the number of cached states required is N = 1 (that is, one register is needed for state caching).
The state transition function $\delta$ can be divided into the following two classes:
• $\delta_{basic}$ denotes the state transition function for basic transition rules, $\delta_{basic}: K \times \Sigma \to K$; in the ACC method $\delta_{basic}$ is defined as in scheme A.
• $\delta_{ncross}$ denotes the state transition function for n-step cross-transition rules, $\delta_{ncross}: K \times \Sigma \to K$; in the ACC method $\delta_{ncross}$ is defined as in scheme A.
The state transition function $\delta$ is defined as:
[equation image not reproduced in this text]
Here, priority is the priority label: A is the highest priority and D is the lowest. If a higher-priority result is valid (non-empty), it is adopted first; if it is invalid, the lower-priority result is adopted. A result is invalid when a state Si has no rule in the δ_basic and δ_ncross transition functions that accepts character c.
As shown in the flowchart of Fig. 7, the state transition function δ works as follows for the CDFA in ACC. First, check whether the current state Si has a rule among the basic transition rules and n-step cross transition rules that accepts the current character c. If so, apply that rule and jump to the next state. If not, fetch the cached state Sk and, treating Sk as the current state, look for a rule among the basic and n-step cross transition rules that accepts c; if one exists, jump to the corresponding next state. If none exists, check whether the initial state S0 accepts c: if it does, jump to the corresponding state; otherwise jump to the initial state S0.
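The priority scheme just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the name `acc_next_state`, the dictionary encoding `(state, char) -> next_state`, and the choice to keep basic and n-step cross rules in one table are all hypothetical.

```python
def acc_next_state(rules, start, current, cached, c):
    """Return the next state for input character c.

    rules   : dict mapping (state, char) -> next state; assumed to hold both
              basic and n-step cross transition rules.
    start   : the initial state S0.
    current : the current state Si.
    cached  : the cached state Sk.
    """
    # Priority A: a rule of the current state accepts c.
    nxt = rules.get((current, c))
    if nxt is not None:
        return nxt
    # Priority B: a rule of the cached state accepts c.
    nxt = rules.get((cached, c))
    if nxt is not None:
        return nxt
    # Priority C: the initial state accepts c; Priority D: fall back to S0.
    return rules.get((start, c), start)


# Toy rules echoing the S3/S4/S6/S7 example in the text (numbering assumed).
rules = {(3, "3"): 4, (0, "3"): 6, (6, "1"): 7}
print(acc_next_state(rules, 0, 4, 6, "1"))  # -> 7
```

The sketch evaluates the four cases in sequence for readability; since the cases are independent lookups, hardware can evaluate them in parallel and pick the highest-priority valid result.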
The four priority operations above are independent of one another and can be performed in parallel without affecting final performance.
The cache policy function θ is defined as
Figure imgf000017_0002
Here, the null value means "empty", that is, no corresponding transition rule exists.
The cache policy function θ has the following meaning. The CDFA in ACC has a single cache slot, which is written every cycle. The cached content is the next state reached from the initial state S0 on the current input character c; if the basic transition rules contain no such rule, the initial state S0 itself is cached. Note that the cache policy function does not depend on the current state Si.
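As a small illustration of this cache policy, the sketch below computes the cache contents after each input character. The function names and the dictionary encoding of the basic rules are assumptions made for the example.

```python
def acc_cache_update(basic_rules, start, c):
    """Cache policy sketch: cache delta_basic(S0, c), or S0 if no such rule.

    basic_rules : dict mapping (state, char) -> next state (basic rules only).
    start       : the initial state S0.
    """
    return basic_rules.get((start, c), start)


def cache_trace(basic_rules, start, text):
    """Return the cached state after each character of text."""
    return [acc_cache_update(basic_rules, start, c) for c in text]


# Hypothetical basic rules: S0 accepts "3" (to state 6) and "1" (to state 1).
basic = {(0, "3"): 6, (0, "1"): 1, (6, "1"): 7}
print(cache_trace(basic, 0, "131x"))  # -> [1, 6, 1, 0]
```

The trace depends only on the input characters, never on the current state, which matches the remark that θ is independent of Si.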
The ACC method is built on the cached state machine described above and consists of two steps: preprocessing and matching. Preprocessing reads in the pattern set and constructs the cached state machine; matching reads in the text to be matched, performs state transitions, and reports a match in designated states.
The description above is for N = 1, that is, the CDFA has a single storage slot and can cache one state. The method also applies when N > 1.
The second application of the cached state machine principle is "isomorphic path merging", which reduces the number of basic transition rules and states. This method is named ACS.
The idea of ACS is to merge isomorphic paths in the state machine, reducing the number of states and basic transition rules.
Take P = {betters, pattern} as an example. The DFA constructed by scheme A is shown in Fig. 8; it needs 14 basic transition rules and 15 states in total. Analysis shows that S2 to S5 and S9 to S12 have the same property: both accept the string "tter". Such paths are called isomorphic paths.
Based on this analysis, the ideal optimized framework for the pattern set {betters, pattern} is shown in Fig. 9. It merges the states and transition rules generated by the substring "tter".
The inventor holds that isomorphic paths cannot be merged using conventional state machine theory (DFA or NFA). For example, the ideal framework in Fig. 9 has a serious flaw: in state S6, the input character "s" causes a jump to state S9, so the string "patters" is also matched successfully. In the state machine before optimization, however, state S9 is reached only when the pattern "betters" arrives. Merging isomorphic paths under DFA theory therefore discards the history information carried by the distinct states of the machine, which leads to incorrect matching results.
The ACS method adopts the cached state machine model. Because a cached state machine can remember state transition history, isomorphic paths can be merged while matching remains correct. Taking the pattern set {pattern, betters} as an example, isomorphic path merging based on the cached state machine is shown in Fig. 10.
The idea of isomorphic path merging based on the cached state machine is to store the state the path came from (S8 or S1 in Fig. 10) in the cache of the cached state machine at the moment the paths merge. When the received characters bring the machine to the point where the isomorphic paths separate (state S6), the cached state is fetched and the jump target is chosen according to the origin of the path. Thus, if the input text is "patters", state S1 is cached at the start of the isomorphic path and fetched in state S6; because the path originated from S1 rather than S8, the machine does not jump to state S9 even though the input character is "s".
The cached state machine (CDFA) used in the ACS method is a seven-tuple {K, Σ, s0, F, 1, δ, θ}, with N = 1 cached states required (that is, one register is needed to cache the history state). Each state in the CDFA has a color, and the CDFA uses three colors in total. The colors distinguish the three kinds of states involved in isomorphic path merging, as shown in Fig. 11.
The three kinds of states and their colors are as follows:
• Converging states: yellow, cross-hatched. A converging state is the last state before entering an isomorphic path; it represents the history of the state machine before the isomorphic path. Such a state triggers caching of itself. The set of these states is denoted K_Cov.
• Diverging states: pink, striped. A diverging state is the last state of an isomorphic path; it uses the history information (the cached state) to decide which state to jump to after receiving a character. Such a state triggers one cache read. The set of these states is denoted K_Div.
• Common states: white, blank. A common state is any state that is neither converging nor diverging. In these states the cache inside the CDFA is not touched. The set of these states is denoted K_Com.
The state transition function δ of the CDFA falls into the following two classes:
• For converging states and common states, δ is a two-argument function, δ: K × Σ → K. In the ACS method it is defined as in scheme A.
• For diverging states, δ is a three-argument function, δ: K × K × Σ → K. In the ACS method the transition function of a diverging state is a function of the current state, the cached state, and the current character.
The state transition function δ is defined as
$$
\delta = \begin{cases}
\delta(S_i, c), & S_i \in K_{Cov} \\
\delta(S_i, c), & S_i \in K_{Com} \\
\delta(S_i, S_k, c), & S_i \in K_{Div}
\end{cases}
$$
Here, Si is the current state, c is the current input, and Sk is the currently cached state.
The transition rules of diverging states differ from conventional transition rules: they have three inputs and one output. The three inputs include the converging state that the path came from before the isomorphic paths were merged, as shown in Fig. 12. In converging and common states, the next state is obtained by looking up the transition rule set (two inputs) with the current input and the current state. In a diverging state, the lookup uses the cached state in addition to the current input and the current state, and it consults the diverging-state transition rule set (distinct from the ordinary rule set, with three inputs) to obtain the next state.
The cache policy function θ is defined as
$$
\theta = \begin{cases}
S_i, & S_i \in K_{Cov} \\
\Phi, & S_i \in K_{Com} \cup K_{Div}
\end{cases}
$$
Here, Φ means "no operation": nothing is done to the cached state.
The cache policy function θ has the following meaning. The CDFA in ACS has a single cache slot. When the current state is a converging state, that state is written to the cache. In all other cases the cache is left untouched.
In this method, therefore, the type of the current state is determined first and the corresponding action is then taken. For a converging state, the next state is obtained by looking up the transition rule set with the current input and the current state, and the current state is written to the cache. For a common state, the next state is obtained by looking up the transition rule set with the current input and the current state. For a diverging state, the next state is obtained by looking up the diverging-state transition rule set with the current input, the current state, and the cached state.
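The three cases above can be sketched in Python for the pattern set {pattern, betters}. This is an illustration under assumptions: the state numbering (converging states 1 and 8, diverging state 6, accepting states 7 and 9) is a hypothetical layout modeled on the description, and failure handling is reduced to restarting from the initial state.

```python
START = 0
CONVERGING = {1, 8}          # last states before the shared path
DIVERGING = {6}              # last state of the shared path
ACCEPT = {7: "pattern", 9: "betters"}

# Two-input rules (current state, char) for converging and common states.
RULES = {
    (0, "p"): 1, (0, "b"): 8,
    (1, "a"): 2, (8, "e"): 2,            # both paths enter the shared path
    (2, "t"): 3, (3, "t"): 4, (4, "e"): 5, (5, "r"): 6,
}
# Three-input rules (current state, cached state, char) for diverging states.
DIV_RULES = {(6, 1, "n"): 7, (6, 8, "s"): 9}


def acs_step(state, cached, c):
    """One ACS transition; returns (next_state, new_cached)."""
    if state in DIVERGING:
        nxt = DIV_RULES.get((state, cached, c))   # cache read
    else:
        nxt = RULES.get((state, c))
        if state in CONVERGING:
            cached = state                        # cache write
    if nxt is None:
        # Simplified fallback (assumption): restart from the initial state.
        nxt = RULES.get((START, c), START)
    return nxt, cached


def acs_match(text):
    """Return the patterns reported while scanning text."""
    state, cached, found = START, START, []
    for c in text:
        state, cached = acs_step(state, cached, c)
        if state in ACCEPT:
            found.append(ACCEPT[state])
    return found


print(acs_match("patters"))   # -> []
print(acs_match("betters"))   # -> ['betters']
```

With the cached origin state as a third input, "patters" is rejected at the diverging state, whereas a plain merged DFA would have accepted it.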
The merged CDFA removes 5 states and 4 basic transition rules, saving further space. The additional cost is one state-sized storage slot used as the cache.
Those skilled in the art will appreciate that a regular expression is a string made up of a series of special characters; introductions to regular expressions can be found in the relevant literature. The conventional AC algorithm can solve the multi-regular-expression matching problem by converting the regular expressions into a DFA and using the DFA [...]
CDFA, and using the CDFA to receive input characters for matching. Specific matching methods include eliminating 1-step cross transition rules and merging isomorphic paths.
The essence of a state machine transition is to find, given the known current state Si and the current input character c, the matching transition rule Tr in the rule base, where Tr(Si, c) = Sj, and to jump to state Sj. The technical difficulty in a hardware implementation lies in how to store the rule base efficiently in memory and how to locate the rule Tr efficiently. For ease of description below, in a transition rule Tr, Si is called the "input state", c the "input character", and Sj the "output state".
The present application proposes
Figure imgf000020_0001
access.
The design of post-state lookup stems from two observations about state machines, as shown in Fig. 13.
First, state machines, and especially the cached state machine generated by scheme A, contain a large number of linear trie structures. A "linear trie" is a run of states in which each state has exactly one transition rule pointing to the next state, forming a linear one-dimensional structure. Because linear tries are so common, state numbers can be assigned in increasing order along them, so the number of the next state can be computed from the current state; that is, the post-state can be predicted.
Second, each state in the state machine is determined by the specific character it receives: the character a state accepts is fixed, regardless of the type of transition rule that leads into it. For example, state S7 is entered by both basic transition rules and cross transition rules, yet whichever rule applies, the character it receives is always "i". Therefore, once a post-state (the output state) is obtained, the character it accepts is uniquely determined, and comparing that character with the actual input character verifies whether the computed post-state is the true post-state.
Based on these two observations, the post-state lookup structure is proposed. It uses a "predict and verify" approach, as shown in Fig. 14. From the current state Si and the current input character c, a candidate post-state is computed, either through the Input Translation Table (ITT) or directly. The candidate post-state is then used as an address into the rule storage table to fetch the input character of that transition rule Tr, and this character is compared with the current input character c (the two dashed lines in Fig. 14). If they match, the state transition is performed. If they do not match, no transition rule corresponds to the current input, and the state is reset to zero.
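A minimal Python sketch of this predict-and-verify lookup follows. The names `nsa_lookup`, `accept_char`, and `itt`, and the encoding of the ITT as a dictionary keyed by (state, character), are assumptions made for illustration.

```python
def nsa_lookup(accept_char, itt, state, c):
    """Predict a post-state, then verify it against the input character.

    accept_char : list indexed by state number; accept_char[s] is the
                  character that state s accepts (the rule storage table).
    itt         : dict mapping (state, char) -> state-number difference for
                  states with several outgoing rules; other states use +1.
    """
    delta = itt.get((state, c), 1)          # predict: direct increment or ITT
    candidate = state + delta
    # Verify: the candidate must accept exactly the character just read.
    if candidate < len(accept_char) and accept_char[candidate] == c:
        return candidate
    return 0                                # no matching rule: reset to zero


# Hypothetical machine for patterns "ab" and "ac":
# state 1 is reached on "a", state 2 on "b", state 3 on "c".
accept_char = [None, "a", "b", "c"]
itt = {(1, "c"): 2}                         # from state 1, "c" leads to 1 + 2
print(nsa_lookup(accept_char, itt, 1, "c"))  # -> 3
```

Only the branching state needs an ITT entry; every other transition is found by incrementing the state number and checking one table entry.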
In the post-state lookup structure, the rule storage table can be held in inexpensive memory such as SRAM or DDR. Transition rules are packed densely in memory, with no "gaps".
Post-state lookup is effective because of how the ITT is used and optimized. By the first observation, the state machine contains many linear tries, so the post-state of each state in a linear trie can be obtained by a simple increment without consulting the ITT. Only the few states with multiple outgoing transition rules need to consult the ITT to obtain the difference between state numbers. In addition, optimizing the ITT further reduces the storage it requires.
The detailed design of the NSA structure has two parts: first, how transition rules are stored in the input translation table (ITT) and the rule storage table; second, the design of the access path for transition rules.
The overall NSA structure is shown in Fig. 15. It includes "TRM-1" (Transition Rule Memory-1), the main storage for transition rules, and "TRM-0" (Transition Rule Memory-0), the storage that handles failure transition rules and restart transition rules.
In Fig. 15, after a character is input, the operation is chosen according to the current state register and the color register. A multiplexer (MUX) selects, according to the value of the color register, between the output of the ITT lookup (the difference between state numbers) and the constant 1. If the color register is 0, the current state is considered uncolored: the MUX outputs 1, the post-state number is obtained by adding 1 to the current state number, and TRM-1 is accessed with that post-state to obtain the corresponding entry, which contains the color of the next state and a character. If the color register is not 0, the current state is considered colored: the current state and the current input character are fed together into the ITT to obtain an output value, the MUX selects that value, the post-state number is obtained by adding this difference to the current state number, and TRM-1 is accessed with that post-state to obtain the corresponding entry. Whatever the value of the color register, the input character is also fed into TRM-0 to obtain an output consisting of the next state and a color value.
The character output by TRM-1 is compared with the current input character in a comparator (CMP), and a two-way selector acts on the result. If they are equal, the color register is overwritten with the next-state color output by TRM-1 and the state register is overwritten with the computed TRM-1 address (the post-state), which completes the state transition for the verified case. Otherwise, the state register is overwritten with the state output by TRM-0 and the color register with the color output by TRM-0, which resets the state for the case where verification fails.
Using a priority policy, the failure transition rules and restart transition rules can be merged into at most 256 rules. Because the output state of these rules is either the initial state S0 or a post-state of the initial state, they are indexed by using the input character as the address; that is, the input character selects either S0 or the corresponding post-state of S0. A dedicated failure and restart transition rule memory, TRM-0, is built for these two kinds of rules. It is addressed by character and stores, for each input character, the output state to jump to: if a transition rule exists for the character, the post-state of the initial state is stored at that position; if not, the initial state is stored there. Since there are at most 256 input characters, TRM-0 contains 256 entries.
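The construction of such a character-addressed table can be sketched as follows. The function name `build_trm0`, the (state, color) tuple layout of each entry, and the parameter names are assumptions for illustration.

```python
def build_trm0(start_rules, start=0, start_color=0):
    """Build a 256-entry table addressed by input byte.

    start_rules : dict mapping a byte value (0..255) to the (state, color)
                  pair reached from the initial state on that byte.
    Bytes without a rule map to the initial state itself.
    """
    table = [(start, start_color)] * 256
    for byte, entry in start_rules.items():
        table[byte] = entry
    return table


# Hypothetical initial state with rules on "b" and "p".
trm0 = build_trm0({ord("b"): (8, 0), ord("p"): (1, 0)})
print(trm0[ord("p")])   # -> (1, 0)
print(trm0[ord("x")])   # -> (0, 0)
```

Because the table is indexed directly by the input byte, the fallback state is available in a single memory access regardless of the current state.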
The other types of transition rules determine the matching behavior of the state machine. Their content is decomposed and stored in two parts.
First, the character accepted by each state is stored in the main transition rule memory TRM-1 in order of state number. This part of the storage is densely packed.
Second, a new concept is defined: the color of a state. Each state can be assigned any color, and the input translation table is indexed by color.
For the current state Si in the state machine, suppose it is the input state of k transition rules, that is, k characters cause it to jump to a new state (failure transition rules and restart transition rules are not counted here).
If k is 1, the state lies in a linear trie and has exactly one transition rule Tr(Si, c) = Sj. The state is colored white (color = 0), and states are numbered so that j = i + 1; that is, state numbers along a linear trie increase by one at each step.
If k is not 1, the state has multiple outgoing transitions and is given a new color. As shown in Fig. 16, states Si and Sk each have two outgoing transition rules. So that the next state can be predicted, each of these states is associated with a new row of the ITT and given a distinct color that is used to index the ITT.
The internal layout of the ITT is shown in Fig. 16. Each color corresponds to 256 values; each value is the difference in state number between state Si and the new state it jumps to after receiving the character of that column. In Fig. 16, state Si jumps to state Sk on character 0x01, so in the ITT the 0x01 column of color 1 stores the difference between Sk and Si: k - i. A value of 0 denotes null.
Combining the ITT design with the concept of color: for the current state Si, if it is white (color = 0), the candidate post-state is Si+1; if it is not white, the ITT is accessed with the color and the current input to obtain the state difference, and the post-state is computed by adding that difference to i. The post-state is then used to access the main transition rule memory TRM-1 and fetch the character c' that this post-state accepts. The post-state is obtained by computation; although the computation uses the current state and possibly the current input character, that is not enough to establish the true post-state. The fetched character c' must therefore be compared with the current character c. If the two are the same, the computed post-state is the true post-state and the machine jumps to it. If they differ, the machine jumps to the state obtained from TRM-0, that is, a failure transition rule or restart transition rule is applied.
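The datapath described above can be sketched in Python for the byte patterns "ab" and "ac". Several points are assumptions rather than statements from the text: the function and table names, treating a null ITT value (0) as "no rule" and falling back to TRM-0, the range check on the candidate address, and giving leaf states a color whose ITT row is entirely null (the text does not say how states without outgoing rules are colored).

```python
def nsa_step(state, color, c, itt, trm1, trm0):
    """One transition; returns the new (state, color) register values.

    itt  : dict mapping color -> list of 256 state-number differences (0 = null)
    trm1 : list indexed by state -> (accepted byte, color of that state)
    trm0 : list indexed by input byte -> (state, color) fallback entry
    """
    delta = 1 if color == 0 else itt[color][c]      # MUX: constant 1 or ITT
    candidate = state + delta
    if delta != 0 and candidate < len(trm1):
        accepted, next_color = trm1[candidate]
        if accepted == c:                           # CMP: verification passed
            return candidate, next_color
    return trm0[c]                                  # failure or restart rule


def nsa_run(data, itt, trm1, trm0):
    """Return the sequence of states visited while scanning data (bytes)."""
    state, color, visited = 0, 0, []
    for c in data:
        state, color = nsa_step(state, color, c, itt, trm1, trm0)
        visited.append(state)
    return visited


# States: 0 start, 1 after "a" (color 1), 2 after "ab", 3 after "ac".
row1 = [0] * 256
row1[ord("b")], row1[ord("c")] = 1, 2               # differences from state 1
ITT = {1: row1, 2: [0] * 256}                       # color 2: all-null leaf row
TRM1 = [(None, 0), (ord("a"), 1), (ord("b"), 2), (ord("c"), 2)]
TRM0 = [(0, 0)] * 256
TRM0[ord("a")] = (1, 1)

print(nsa_run(b"xab", ITT, TRM1, TRM0))   # -> [0, 1, 2]
print(nsa_run(b"aac", ITT, TRM1, TRM0))   # -> [1, 1, 3]
```

Each step reads at most one ITT entry, one TRM-1 entry, and one TRM-0 entry, which mirrors the three memory accesses in the described structure.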
In the NSA design above, every state with multiple outgoing transition rules is given a new color, that is, it is allocated one row of the ITT as the basis for computing its post-state. For most colors, however, there are only a few post-states, so each ITT row contains many null values (0). To use the ITT space efficiently, an optimization is given here: entry merging.
The idea of ITT entry merging is to combine several ITT rows into one so that the space in each row is used effectively. Equivalently, merging rows means merging the colors of states in the state machine.
Fig. 17 illustrates ITT entry merging. The state machine on the left uses 4 colors; after merging, the state machine on the right uses only 2.
Two ITT rows can be merged if and only if they do not conflict. There are two kinds of conflict: resource conflicts and coverage conflicts.
(1) A resource conflict occurs when the values in the same column of two ITT rows are both non-null and different, as with color 2 and color 4 in Fig. 18.
(2) A coverage conflict concerns a non-null value overwriting a null value in some column. For the state that originally had the null value, this amounts to adding an extra (virtual) transition rule. The added rule must not conflict with the original transition rules; that is, the post-state obtained through the (virtual) rule must not accept the character of that rule.
An example of a coverage conflict follows. In Fig. 18, if color 3 and color 1 are merged, column 0x63 of color 3 overwrites that column of color 1: the 0x63 column of color 1 was originally null and becomes the non-null value 3 after merging. This can cause an error. When a state of color 1 receives input 0x63, its post-state is computed by adding 3 to the state number, yielding a new state. Nothing guarantees that the character this new state accepts is not 0x63. If it is 0x63, the post-state lookup structure of Fig. 15 performs the transition, so a color-1 state would also jump on 0x63, which is wrong. This is the conflict that arises when a non-null value overwrites a null value during merging.
Once both kinds of conflict are ruled out, ITT rows can be merged. The method is shown in Fig. 19.
Fig. 19 shows how to decide whether two ITT rows can be merged. Every column of the two rows is checked. The check for column k is as follows. If exactly one of the two values is null, determine whether the state corresponding to the null value, when it uses the non-null value after merging, would reach a state whose accepted character equals k; if so, there is a coverage conflict, the rows cannot be merged, and the procedure exits; if not, the next check proceeds. If both values are null or both are non-null, determine whether the two values are equal; if not, there is a resource conflict, the rows cannot be merged, and the procedure exits; if so, the next column is checked. Once every column of the two rows is found free of resource conflicts and coverage conflicts, the rows are merged, with non-null values overwriting null values.
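The column-by-column check can be sketched in Python as follows. The function names, the list-of-256 row encoding with 0 as null, and the explicit lists of states that use each row are assumptions for illustration.

```python
def can_merge(row_a, states_a, row_b, states_b, accept_char):
    """Decide whether two ITT rows can be merged.

    row_a, row_b       : lists of 256 state-number differences (0 = null)
    states_a, states_b : state numbers that currently use each row
    accept_char        : list indexed by state -> byte that state accepts
    """
    for k in range(256):
        a, b = row_a[k], row_b[k]
        if a == b:
            continue                      # both null or identical values
        if a != 0 and b != 0:
            return False                  # resource conflict
        # Exactly one value is null: the states using the null row would
        # gain a virtual rule on byte k with the other row's difference.
        delta, affected = (a, states_b) if b == 0 else (b, states_a)
        for s in affected:
            t = s + delta
            if 0 <= t < len(accept_char) and accept_char[t] == k:
                return False              # coverage conflict
    return True


def merge_rows(row_a, row_b):
    """Merge two compatible rows; non-null values overwrite null values."""
    return [a if a != 0 else b for a, b in zip(row_a, row_b)]


# Hypothetical rows: color 1 uses column 0x61, color 3 uses column 0x63.
r1 = [0] * 256; r1[0x61] = 1
r3 = [0] * 256; r3[0x63] = 3
accept = [None] * 10
accept[8] = 0x63          # state 5 + 3 = 8 accepts 0x63: coverage conflict
print(can_merge(r1, [5], r3, [2], accept))   # -> False
```

`merge_rows` should only be called after `can_merge` returns True, since it silently keeps the first row's value when both are non-null.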
When merging the ITT, rows are tested pairwise. As shown in Fig. 18, color 2 is first tested for merging with color 1, then color 3 with color 1, and so on, until all colors that can be merged have been merged.
In Fig. 18, color 4 uses only one position in its ITT row, yet it cannot be merged with color 2 because of a resource conflict. To optimize the use of ITT space further, a set-associative optimization of the ITT is proposed to address this problem.
The set-associative optimization resembles the set-associative policy of caches in computer memory systems. The idea is to break the column boundaries of the ITT by grouping, so that data belonging to the same column can be stored in different columns. The structure of a 2-way set-associative ITT is shown in Fig. 20. With this structure, color 4 in Fig. 18 can be merged with color 2.
For N-way set associativity, one row of the ITT table is divided into 256/N sets. Figure 21 shows how to decide whether two colors can be merged under the set-associative strategy. Two ITT table entries can be merged this way if and only if no set contains a conflict. The only conflict here is a resource conflict, which occurs when a set would contain more than N non-empty elements. The method of Figure 21 checks whether two ITT table entries conflict: for a set p, count the valid values that the two rows contain in that set. If the count exceeds N, the set has a resource conflict and the two rows cannot be merged. Otherwise the next set is checked; once all 256/N sets are found to be free of resource conflicts, the two rows can be merged.
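The set-by-set count described above can be sketched as follows. This is an illustration under stated assumptions: a row is a list of 256 entries with `None` for an empty entry, and each set is taken to be N contiguous columns; the text does not specify how columns map to sets, so that layout is hypothetical.

```python
from typing import List, Optional

COLUMNS = 256  # one column per possible input byte
Row = List[Optional[int]]  # None marks an empty entry (assumed encoding)


def can_merge_set_associative(row_a: Row, row_b: Row, n_ways: int) -> bool:
    """Return True if no set holds more than n_ways valid values in total."""
    if n_ways <= 0 or COLUMNS % n_ways != 0:
        raise ValueError("n_ways must be a positive divisor of 256")
    if len(row_a) != COLUMNS or len(row_b) != COLUMNS:
        raise ValueError("each row must have 256 columns")
    # Assumed layout: set p covers columns [p * n_ways, (p + 1) * n_ways).
    for start in range(0, COLUMNS, n_ways):
        valid = sum(
            1
            for row in (row_a, row_b)
            for value in row[start:start + n_ways]
            if value is not None
        )
        if valid > n_ways:
            return False  # resource conflict in this set
    return True
```

With direct mapping (N = 1) two rows that both use the same column always conflict, whereas with N = 2 they can coexist in one set, which mirrors the case of colors 2 and 4 in Figure 18.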
As with set-associative caches, the set-associative strategy for the ITT table requires a tag to identify each stored item. The tag here has two fields: an input tag field and a color tag field. Together these two fields distinguish the different rows and columns that existed before the merge.
Next-state lookup (NSA) is an efficient way to implement a state machine in hardware. Its efficiency comes from exact memory accesses, the absence of conflicting entries that would need to be resolved, and the ability to use inexpensive memories such as SRAM and DDR. Although the NSA's ITT table contains some unused storage, entry merging and the set-associative strategy keep these gaps under control.
Corresponding chip structure: to implement the multi-string matching technique based on the cached state machine at high speed, the present invention provides a corresponding chip structure, shown as a whole in Figure 22. The structure is a pattern-matching structure that combines the ACC method with the NSA structure and is used for string matching. Corresponding to the cached state machine structure of Figure 4, the ACC-NSA structure of Figure 22 contains a transition rule module, a state register and a cache state register module.
The NSA structure implements a state machine efficiently in hardware, while the ACC method is based on the cached state machine principle. The main problem solved by combining the two in the ACC-NSA structure is therefore to provide a cached state machine that realizes the ACC method using the NSA structure.
Figure 22 shows the framework of the ACC-NSA structure. The structure is based on the next-state lookup structure of Figure 15, with added paths for the "state cache" and the "color cache". The two sets of paths share one set of memories: the ITT table and the TRM-1 memory.
In this structure the ITT table and TRM-1 are implemented with dual-port memory, which supports parallel access by the registers and the cache. (If parallel access is not required, single-port memory can also be used.) The TriMUX (three-way selector) in the ACC-NSA structure is implemented with a priority policy.
The TriMUX has three inputs, one output and a 2-bit control signal. The control signals are the outputs of the comparison module CMP, numbered "1" and "2". The function of the TriMUX is

$$
\mathrm{TriMux}(state, color) =
\begin{cases}
(state, color, \text{"1"}) & \text{if "1"} = equal \quad (\text{high priority}) \\
(state, color, \text{"2"}) & \text{if "2"} = equal \quad (\text{middle priority}) \\
(state, color, \text{"3"}) & \text{otherwise} \quad (\text{low priority})
\end{cases}
\tag{5-1}
$$

Here (state, color, "1") denotes the next-state value and its color computed from the state in the register; (state, color, "2") denotes the next-state value and its color computed from the state in the cache; and (state, color, "3") denotes the next-state value and its color output by TRM-0. The condition "1" = equal with high priority means: if the input character obtained from TRM-1 for the next state computed from the register state matches the actual input character, the TriMUX selects the next-state value and color computed from the register state; this input has high priority. The condition "2" = equal with middle priority means: if the first condition is not met and the input character obtained from TRM-1 for the next state computed from the cached state matches the actual input character, the TriMUX selects the next-state value and color computed from the cached state; this input has middle priority. If neither condition is met, the third input is selected, namely the next-state value and color output by TRM-0 by applying the failure and restart transition rules.
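The priority selection of equation (5-1) can be sketched in software as follows. This is only an illustration of the selection logic; the function name, the argument names and the use of `None` for a missing candidate are assumptions, not part of the chip structure.

```python
from typing import Optional, Tuple

StateColor = Tuple[int, int]  # (next-state value, color)


def tri_mux(input_char: str,
            reg_result: Optional[StateColor], reg_char: Optional[str],
            cache_result: Optional[StateColor], cache_char: Optional[str],
            trm0_result: StateColor) -> StateColor:
    """Select one of three candidates by priority.

    reg_char and cache_char are the characters read from TRM-1 for the
    candidates computed from the register state and the cached state.
    """
    # Comparison "1": the register path has high priority.
    if reg_result is not None and reg_char == input_char:
        return reg_result
    # Comparison "2": the cache path has middle priority.
    if cache_result is not None and cache_char == input_char:
        return cache_result
    # Otherwise fall back to the TRM-0 output (low priority).
    return trm0_result
```

For example, if both the register path and the cache path produce a candidate whose TRM-1 character matches the input, the register candidate is selected.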
When an input character arrives, the registers (state register and color register) and the cache (state cache and color cache) both access the ITT table to compute the possible next-state values, and then access TRM-1 to fetch the characters associated with the corresponding transition rules. At the same time, TRM-0 is accessed with the input character to obtain the state value given by the failure and restart transition rules. The three results are fed to the TriMUX module. The two characters output by TRM-1 are compared with the input character to determine whether each transition really occurs, and the comparison results control the TriMUX module to select the correct result, which overwrites the registers. Meanwhile the TRM-0 result updates the cache. This completes one state transition.
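One state transition as described above can be sketched with small dictionaries standing in for the memories. This is a toy model under stated assumptions: the table contents are a hypothetical example for the single pattern "ab", the dictionary encodings of the ITT table, TRM-1 and TRM-0 are invented for illustration, and a character with no TRM-0 entry is assumed to lead to the initial state (0, color 0).

```python
from typing import Dict, Optional, Tuple

StateColor = Tuple[int, int]  # (state number, color)
INITIAL: StateColor = (0, 0)  # assumed initial state and color

# Hypothetical tables for the single pattern "ab".
ITT: Dict[Tuple[int, str], int] = {(0, "a"): 1, (1, "b"): 1}     # (color, char) -> number difference
TRM1: Dict[int, Tuple[int, str]] = {1: (1, "a"), 2: (2, "b")}   # next state -> (color, expected char)
TRM0: Dict[str, StateColor] = {"a": (1, 1)}                     # char -> restart/failure target


def candidate(current: StateColor, ch: str) -> Optional[StateColor]:
    """Compute a possible next state and confirm it against TRM-1."""
    state, color = current
    delta = ITT.get((color, ch))
    if delta is None:
        return None
    entry = TRM1.get(state + delta)
    if entry is None:
        return None
    next_color, expected = entry
    # The transition is real only if the stored character matches the input.
    return (state + delta, next_color) if expected == ch else None


def step(register: StateColor, cache: StateColor, ch: str) -> Tuple[StateColor, StateColor]:
    """Perform one transition; return the new (register, cache) pair."""
    fallback = TRM0.get(ch, INITIAL)
    chosen = candidate(register, ch)      # high priority
    if chosen is None:
        chosen = candidate(cache, ch)     # middle priority
    if chosen is None:
        chosen = fallback                 # low priority
    return chosen, fallback               # the TRM-0 result updates the cache


def run(text: str) -> StateColor:
    register, cache = INITIAL, INITIAL
    for ch in text:
        register, cache = step(register, cache, ch)
    return register
```

In this toy model, `run("aab")` ends in state 2 (the pattern "ab" has been seen), while `run("ax")` falls back to the initial state.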
The above are merely preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that yield equivalent embodiments. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

The industrial applicability of the invention is illustrated using the matching of 50,000 ClamAV virus rules and 1,785 Snort intrusion detection rules as examples.
Method one for applying the cached state machine principle is "dynamic cross-transition loading". Its effect in eliminating cross transition rules is shown in Figures 23 and 24.
The one-step cross transition rules shown there are those that the method of the invention can eliminate. With the cached-state-machine-based multi-string matching method "dynamic cross-transition loading", the space can be reduced to 4.1% of the original (ClamAV rules) and 20.8% of the original (Snort rules).
Method two for applying the cached state machine principle is merging isomorphic paths. Its effect in reducing basic transition rules is shown in Figure 25.
The dashed line shows data for the conventional DFA-based method and the solid line shows data for the method using the cached state machine (CDFA). With the CDFA-based method of the invention, the number of basic transition rules can be reduced to as little as 21.4% of the original (Snort rules).
The performance of the corresponding chip structure is shown in Table 2:
Table 2 [image: imgf000026_0001]
The ACC-NSA chip structure, designed on the basis of the cached state machine (CDFA) method, reaches a maximum matching speed of 11.7 Gbps (in a 0.18 µm process), which is faster than other methods.

Industrial applicability
The multi-string matching method based on the cached state machine and the chip structure based on "next-state lookup" of the present invention have at least the following advantages and beneficial effects:
They can eliminate more than 95% of the cross transition rules, or even all of them; they can reduce the number of basic transition rules and therefore the number of states required; and they can achieve a higher matching speed than other methods. In short, they meet the demand for high-speed, large-scale multi-string matching. The matching performance is independent of the size of the rule base, of the minimum length of the rules, and of the relationship between the rule base and the text to be matched. Large rule sets are supported, storage grows sublinearly with the number of rules, space requirements are effectively reduced, and the transition rules of the state machine can be stored and accessed efficiently.

Claims

1. A cached state machine, characterized in that it comprises:
a state register, for holding the current state;
a cache state register, for holding the cached state; and
a transition rule module, for storing and accessing a state transition rule base, for looking up the next state according to the character received by an interface module, the current state held in the state register and the cached state held in the cache state register and outputting it to the state register, and for assigning a value to the cache state register according to a specific caching rule.
2. A multi-string matching method, characterized in that it comprises the following steps:
taking characters in order from a received input character stream as input characters, and performing the following steps for each input character:
looking up the next state in a state transition rule base according to the current input character, the current state and the cached state;
transitioning to the next state;
caching a state according to a specific caching rule;
taking the next state as the current state, the cached state as the cached state, and the following input character as the current input character, and repeating the steps performed for each input character until all characters in the character stream have been processed.
3. The multi-string matching method according to claim 2, characterized in that the step of looking up the next state comprises: first determining whether a next state exists for the current state receiving the current input character in the basic transition rules and the n-step cross transition rules, and if so, taking that next state as the lookup result; if not, determining whether a next state exists for the cached state receiving the current input character in the basic transition rules and the n-step cross transition rules, and if so, taking that next state as the lookup result; if not, determining whether a next state exists for the initial state receiving the current input character in the basic transition rules and the n-step cross transition rules, and if so, taking that next state as the lookup result; otherwise taking the initial state as the lookup result.
4. The multi-string matching method according to claim 3, characterized in that the step of caching a state according to a specific caching rule is: if a corresponding next state exists in the basic transition rules for the initial state receiving the current input character, caching that next state; otherwise, caching the initial state.
5. The multi-string matching method according to claim 2, characterized in that the step of looking up the next state comprises: determining the type of the current state; if it is a convergence state or an ordinary state, looking up the next state in the state transition rule set according to the current input character and the current state; if it is a separation state, looking up the next state in the separation state transition rule set according to the current input character, the current state and the cached state.
6. The multi-string matching method according to claim 5, characterized in that the separation state transition rule set is arranged to receive three inputs, namely the current input character, the current state and the cached state, and to provide one corresponding output, namely the next state.
7. The multi-string matching method according to claim 5, characterized in that the step of caching according to a specific caching rule is: if the current state is a convergence state, caching the current state.
8. A computer-readable storage medium storing a number of instructions, characterized in that the instructions, when executed by a processor, cause the processor to perform the following steps:
receiving input characters, and performing the following steps for each input character:
looking up the next state in a state transition rule base according to the current input character, the current state and the cached state;
transitioning to the next state;
caching a state according to a specific caching rule;
taking the next state as the current state, the cached state as the cached state, and the following input character as the current input character, and repeating the steps performed for each input character until all characters in the character stream have been processed.
9. A system, characterized in that it comprises:
a processor;
a bus connected to the processor, for transferring data between the parts of the system;
a communication interface connected to the bus, for receiving a character data stream;
a main memory connected to the bus and storing a number of instructions which, when executed by the processor, cause the processor to perform the following steps:
taking characters in order from the received character data stream as input characters, and performing the following steps for each input character:
looking up the next state in a state transition rule base according to the current input character, the current state and the cached state;
transitioning to the next state;
caching a state according to a specific caching rule;
taking the next state as the current state, the cached state as the cached state, and the following input character as the current input character, and repeating the steps performed for each input character until all characters in the character stream have been processed.
10. A next-state lookup method, characterized in that it comprises: computing a possible next state from the current state and the input character using an input translation table; looking up a rule storage table with the possible next state to obtain the corresponding input character; comparing the actual input character with the character obtained from the rule storage table; if they match, transitioning to the possible next state; if they do not match, resetting the state to zero.
11. The next-state lookup method according to claim 10, characterized in that the state numbering rule comprises: if the current state has only one outgoing transition rule, the number of the next state pointed to by that rule is the number of the current state plus one.
12. The next-state lookup method according to claim 11, characterized in that the step of computing the possible next state comprises, according to a given rule set: if the current state has only one outgoing transition rule, adding one to the number of the current state to obtain the number of the possible next state; if the current state has multiple outgoing transition rules, looking up the input translation table with the color of the current state and the input character as inputs to obtain the difference between the numbers of the possible next state and the current state, and adding that difference to the number of the current state to obtain the number of the possible next state.
13. The next-state lookup method according to claim 10, characterized in that the rule storage table is constituted such that its input is a next state and the corresponding output is the color of that next state and the input character corresponding to that next state.
14. The next-state lookup method according to claim 10 or 12, characterized in that the input translation table is constituted such that its input is the color of the current state and the input character, and the corresponding output is the difference between the numbers of the possible next state and the current state.
15. The next-state lookup method according to claim 14, characterized in that it further comprises merging entries of the input translation table, each row of the input translation table corresponding to a current state and each column to an input character, wherein the entry merging checks every column of the two rows to be merged, column k being checked as follows: if exactly one of the two entries is empty, determining whether the state corresponding to the empty entry, when it uses the non-empty entry's data after the merge, would receive a character equal to k; if so, this is an overwrite conflict, the two rows cannot be merged, and the check exits; if not, the check continues; if both entries are empty or both are non-empty, determining whether the two values are identical; if not, this is a resource conflict, the two rows cannot be merged, and the check exits; if so, the next column is checked; once no column of the two rows to be merged shows a resource conflict or an overwrite conflict, the rows are merged, with non-empty values overwriting empty ones.
16. The next-state lookup method according to claim 14 or 15, characterized in that it further comprises applying set-associative optimization to the input translation table, which includes the following steps for determining whether a resource conflict exists: for N-way set associativity, dividing one row of the ITT table into 256/N sets; for one set, counting the valid values contained in the two rows; if the count is greater than N, the set has a resource conflict; otherwise, checking another set; once all 256/N sets are found to be free of resource conflicts, merging the two rows.
17. A computer-readable storage medium storing a number of instructions, characterized in that the instructions, when executed by a processor, cause the processor to perform the following steps: computing a possible next state from the current state and the input character using an input translation table; looking up a rule storage table with the possible next state to obtain the corresponding input character; comparing the actual input character with the character obtained from the rule storage table; if they match, transitioning to the possible next state; if they do not match, resetting the state to zero.
18. The computer-readable storage medium according to claim 17, characterized in that the state numbering rule comprises: if the current state has only one outgoing transition rule, the number of the next state pointed to by that rule is the number of the current state plus one; and the step of computing the possible next state comprises, according to a given rule set: if the current state has only one outgoing transition rule, adding one to the number of the current state to obtain the number of the possible next state; if the current state has multiple outgoing transition rules, looking up the input translation table with the color of the current state and the input character as inputs to obtain the difference between the numbers of the possible next state and the current state, and adding that difference to the number of the current state to obtain the number of the possible next state.
19. The computer-readable storage medium according to claim 17, characterized in that the rule storage table is constituted such that its input is a next state and the corresponding output is the color of that next state and the input character corresponding to that next state.
20. The computer-readable storage medium according to claim 17, characterized in that the input translation table is constituted such that its input is the color of the current state and the input character, and the corresponding output is the difference between the numbers of the possible next state and the current state.
21. The computer-readable storage medium according to claim 20, characterized in that each row of the input translation table corresponds to a current state and each column to an input character, and the input translation table has undergone entry merging performed as follows: every column of the two rows to be merged is checked, column k being checked as follows: if exactly one of the two entries is empty, determining whether the state corresponding to the empty entry, when it uses the non-empty entry's data after the merge, would receive a character equal to k; if so, this is an overwrite conflict, the two rows cannot be merged, and the check exits; if not, the check continues; if both entries are empty or both are non-empty, determining whether the two values are identical; if not, this is a resource conflict, the two rows cannot be merged, and the check exits; if so, the next column is checked; once no column of the two rows to be merged shows a resource conflict or an overwrite conflict, the rows are merged, with non-empty values overwriting empty ones.
22. The computer-readable storage medium according to claim 20 or 21, characterized in that the input translation table has undergone set-associative optimization, which includes the following steps for determining whether a resource conflict exists: for N-way set associativity, dividing one row of the ITT table into 256/N sets; for one set, counting the valid values contained in the two rows; if the count is greater than N, the set has a resource conflict; otherwise, checking another set; once all 256/N sets are found to be free of resource conflicts, merging the two rows.
23. A system, characterized in that it comprises:
a main processor, which organizes the input data stream;
a coprocessor unit, connected to the main processor;
wherein the following operations are performed in the coprocessor unit: computing a possible next state from the current state and the input character using an input translation table; looking up a rule storage table with the possible next state to obtain the corresponding input character; comparing the actual input character with the character obtained from the rule storage table; if they match, transitioning to the possible next state; if they do not match, resetting the state to zero.
24. The system according to claim 23, characterized in that the state numbering rule comprises: if the current state has only one outgoing transition rule, the number of the next state pointed to by that rule is the number of the current state plus one; and the step of computing the possible next state comprises, according to a given rule set: if the current state has only one outgoing transition rule, adding one to the number of the current state to obtain the number of the possible next state; if the current state has multiple outgoing transition rules, looking up the input translation table with the color of the current state and the input character as inputs to obtain the difference between the numbers of the possible next state and the current state, and adding that difference to the number of the current state to obtain the number of the possible next state.
25. The system according to claim 23, characterized in that the rule storage table is constituted such that its input is a next state and the corresponding output is the color of that next state and the input character corresponding to that next state.
26. The system according to claim 23, characterized in that the input translation table is constituted such that its input is the color of the current state and the input character, and the corresponding output is the difference between the numbers of the possible next state and the current state.
27. The system according to claim 26, characterized in that each row of the input translation table corresponds to a current state and each column to an input character, and the input translation table has undergone entry merging performed as follows: every column of the two rows to be merged is checked, column k being checked as follows: if exactly one of the two entries is empty, determining whether the state corresponding to the empty entry, when it uses the non-empty entry's data after the merge, would receive a character equal to k; if so, this is an overwrite conflict, the two rows cannot be merged, and the check exits; if not, the check continues; if both entries are empty or both are non-empty, determining whether the two values are identical; if not, this is a resource conflict, the two rows cannot be merged, and the check exits; if so, the next column is checked; once no column of the two rows to be merged shows a resource conflict or an overwrite conflict, the rows are merged, with non-empty values overwriting empty ones.
28. The system according to claim 26 or 27, wherein the input translation table has undergone set-associative optimization, the set-associative optimization including the following step of determining whether a resource conflict exists: for N-way set association, each row of the ITT is divided into 256/N groups; for one group, the number of valid values contained in the two rows is determined; if that number is greater than N, a resource conflict has occurred in that group; otherwise, the next group is checked; once it has been determined that none of the 256/N groups has a resource conflict, the two rows are merged.
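The group-wise test of claim 28 might look like the following sketch. It assumes, purely for illustration, that a group consists of N adjacent columns and that every non-empty entry of either row counts as one valid value. The claim does not state how columns are assigned to groups or whether identical entries are counted once, so both choices are assumptions, as is the function name.

```python
from typing import List, Optional

Row = List[Optional[int]]  # None marks an empty ITT entry


def can_merge_set_associative(row_a: Row, row_b: Row, n_ways: int) -> bool:
    """Return True if no group of the two rows holds more than n_ways values."""
    if len(row_a) != len(row_b) or len(row_a) % n_ways != 0:
        raise ValueError("rows must have equal length divisible by n_ways")
    # A 256-column row gives 256 / n_ways groups of n_ways adjacent columns.
    for start in range(0, len(row_a), n_ways):
        group_a = row_a[start:start + n_ways]
        group_b = row_b[start:start + n_ways]
        valid = sum(v is not None for v in group_a)
        valid += sum(v is not None for v in group_b)
        if valid > n_ways:
            return False  # resource conflict in this group
    return True
```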
29. A post-state lookup structure, characterized in that it comprises:
a main memory, which stores basic transition rules and cross transition rules; its input is the possible post state computed from the current state and the input character with the aid of the input translation table, and, according to the stored transition rules, it outputs the color of that possible post state and the input character corresponding to that possible post state;
a secondary memory, which stores failure transition rules and restart transition rules; its input is the actual input character, and, according to the stored transition rules, it outputs the post state corresponding to the actual input character together with its color;
an input translation table, whose input is the color of the current state and the actual input character, and whose corresponding output is the difference between the number of the possible post state and the number of the current state;
a two-state selector, which performs the following operation according to the result of comparing the character output by the main memory with the actual input character: if they are equal, the current state is changed to the computed possible post state, and at the same time the color of the current state is changed to the color of that possible post state as output by the main memory; otherwise, the current state and its color are changed to the output of the secondary memory.
30. The post-state lookup structure according to claim 29, further comprising a comparator for comparing the character output by the main memory with the actual input character.
31. The post-state lookup structure according to claim 29, further comprising:
a state register for storing the current state;
a color register for storing the color of the current state.
32. The post-state lookup structure according to claim 29, further comprising:
a selector for selecting, according to the value of the color register, between the output value of the input translation table and the value 1 as its output.
33. The post-state lookup structure according to claim 32, further comprising:
an adder for adding the number of the current state to the output value of the selector, so as to compute the possible post state.
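The data path described in claims 29 to 33 can be modelled in software roughly as follows. This is a sketch, not the chip design: the memories are modelled as dictionaries, the color value `LINEAR_COLOR` that makes the selector emit the constant 1 is an assumed encoding, and the `restart` default for characters missing from the secondary memory is a convenience of this model.

```python
from typing import Dict, Optional, Tuple

StateColor = Tuple[int, int]   # (state number, color)
LINEAR_COLOR = 0               # assumed color meaning "add the constant 1"


def next_state(
    state: int,
    color: int,
    ch: str,
    itt: Dict[Tuple[int, str], int],
    main_mem: Dict[int, Tuple[int, str]],
    secondary_mem: Dict[str, StateColor],
    restart: StateColor,
) -> StateColor:
    """One transition: try the main memory first, else use the secondary memory."""
    # Selector (claim 32): the constant 1 or the ITT output, chosen by the color.
    diff: Optional[int] = 1 if color == LINEAR_COLOR else itt.get((color, ch))
    if diff is not None:
        candidate = state + diff                    # adder (claim 33)
        entry = main_mem.get(candidate)             # (color, expected character)
        if entry is not None and entry[1] == ch:    # comparator (claim 30)
            return candidate, entry[0]
    # Mismatch: take the failure or restart transition from the secondary memory.
    return secondary_mem.get(ch, restart)
```

With a single pattern "ab" encoded as states 0, 1 and 2, feeding the characters "x", "a", "b" one at a time moves the model from state 0 to 0, then 1, then 2.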
34. A multi-string matching structure, characterized in that it comprises:
a state register for storing the current state;
a color register for storing the color of the current state;
a state buffer for storing the cached state;
a color buffer for storing the color of the cached state;
a main memory, which stores basic transition rules and n-step cross transition rules; its first input is the first possible post state computed from the current state and the input character with the aid of the input translation table, and the corresponding first output is the color of the first possible post state and the input character corresponding to the first possible post state, both obtained according to the stored transition rules; its second input is the second possible post state computed from the cached state and the input character with the aid of the input translation table, and the corresponding second output is the color of the second possible post state and the input character corresponding to the second possible post state, both obtained according to the stored transition rules;
a secondary memory, which stores failure transition rules and restart transition rules; its input is the actual input character, and its output is the post state corresponding to the actual input character and its color, obtained according to the stored transition rules; in every transition cycle of the current state, the post state and the color output by the secondary memory overwrite the state buffer and the color buffer respectively;
an input translation table, whose first input is the color of the current state and the actual input character, the corresponding first output being the difference between the number of the first possible post state and the number of the current state, and whose second input is the color of the cached state and the actual input character, the corresponding second output being the difference between the number of the second possible post state and the number of the cached state;
a three-state selector, which performs the following operation according to the results of comparing the first character and the second character output by the main memory with the actual input character: if the first character is the same as the actual input character, the state register is overwritten with the first possible post state and the color register is overwritten with the color of the first possible post state; if the first character differs from the actual input character but the second character is the same as the actual input character, the state register is overwritten with the second possible post state and the color register is overwritten with the color of the second possible post state; otherwise, the state register and the color register are overwritten with the post state and the color output by the secondary memory, respectively.
35. The multi-string matching structure according to claim 34, further comprising:
a first comparator for comparing the first character output by the main memory with the actual input character;
a second comparator for comparing the second character output by the main memory with the actual input character.
36. The multi-string matching structure according to claim 34, further comprising:
a first selector for selecting, according to the value of the color register, between the output value of the input translation table and the value 1 as its output;
a second selector for selecting, according to the value of the color buffer, between the output value of the input translation table and the value 1 as its output.
37. The multi-string matching structure according to claim 36, further comprising:
a first adder for adding the number of the current state to the output value of the first selector, so as to compute the first possible post state;
a second adder for adding the number of the cached state to the output value of the second selector, so as to compute the second possible post state.
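One cycle of the two-path structure in claims 34 to 37 could be simulated along these lines. As before, this is a hedged software model: dictionaries replace the memories, `LINEAR_COLOR` is an assumed encoding for the color that selects the constant 1, and `restart` is a hypothetical default for characters with no secondary-memory entry.

```python
from typing import Dict, Optional, Tuple

StateColor = Tuple[int, int]   # (state number, color)
LINEAR_COLOR = 0               # assumed color meaning "add the constant 1"


def two_path_step(
    current: StateColor,
    cached: StateColor,
    ch: str,
    itt: Dict[Tuple[int, str], int],
    main_mem: Dict[int, Tuple[int, str]],
    secondary_mem: Dict[str, StateColor],
    restart: StateColor,
) -> Tuple[StateColor, StateColor]:
    """Return (new current state and color, new cached state and color)."""

    def probe(state: int, color: int) -> Optional[StateColor]:
        # Selector plus adder: constant 1 or ITT difference added to the state.
        diff = 1 if color == LINEAR_COLOR else itt.get((color, ch))
        if diff is None:
            return None
        candidate = state + diff
        entry = main_mem.get(candidate)           # (color, expected character)
        if entry is not None and entry[1] == ch:  # comparator
            return candidate, entry[0]
        return None

    fallback = secondary_mem.get(ch, restart)     # secondary memory output
    first = probe(*current)                       # path from the current state
    second = probe(*cached)                       # path from the cached state
    if first is not None:
        chosen = first
    elif second is not None:
        chosen = second
    else:
        chosen = fallback
    # The buffers are overwritten by the secondary memory output every cycle.
    return chosen, fallback
```

The priority order in the final `if` chain mirrors the three-state selector: the first path wins, then the second path, then the secondary memory.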
38. A multi-regular-expression matching method, characterized in that it comprises the following steps:
taking characters in order from the received input character stream as input characters, and, for each input character, performing the following steps:
looking up the post state in the state transition rule base according to the current input character, the current state and the cached state;
jumping to the post state;
caching a state according to a specific caching rule;
taking the post state as the current state, the cached state as the cached state and the next input character as the current input character, and repeating the steps performed for each input character until all characters in the character stream have been processed.
39. The multi-regular-expression matching method according to claim 38, wherein the step of looking up the post state comprises: first determining whether a post state exists in the basic transition rules and the n-step cross transition rules for the current state receiving the current input character; if so, taking that post state as the lookup result; if not, determining whether a post state exists in the basic transition rules and the n-step cross transition rules for the cached state receiving the current input character; if so, taking that post state as the lookup result; if not, determining whether a post state exists in the basic transition rules and the n-step cross transition rules for the initial state receiving the current input character; if so, taking that post state as the lookup result; otherwise, taking the initial state as the lookup result;
the step of caching a state according to the specific caching rule is: if a corresponding post state exists in the basic transition rules for the initial state receiving the current input character, caching that post state; otherwise, caching the initial state.
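The lookup order and caching rule of claim 39 can be illustrated with the short sketch below. The rule sets are modelled as dictionaries keyed by `(state, character)`. That representation, the function names, and the use of state 0 as the initial state are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Rules = Dict[Tuple[int, str], int]  # (state, character) -> post state


def match_step(
    current: int, cached: int, ch: str,
    basic: Rules, cross: Rules, initial: int = 0,
) -> Tuple[int, int]:
    """Return (post state, newly cached state) for one input character."""

    def lookup(state: int) -> Optional[int]:
        # Basic rules first, then the n-step cross transition rules.
        if (state, ch) in basic:
            return basic[(state, ch)]
        return cross.get((state, ch))

    post = initial                       # default when nothing matches
    for state in (current, cached, initial):
        found = lookup(state)
        if found is not None:
            post = found
            break
    # Caching rule: the initial state's basic transition on ch, else initial.
    new_cached = basic.get((initial, ch), initial)
    return post, new_cached


def run(text: str, basic: Rules, cross: Rules, initial: int = 0) -> List[int]:
    """Feed text through the matcher and return the state after each character."""
    current = cached = initial
    visited = []
    for ch in text:
        current, cached = match_step(current, cached, ch, basic, cross, initial)
        visited.append(current)
    return visited
```

For the single pattern "ab" with `basic = {(0, "a"): 1, (1, "b"): 2}` and no cross rules, the input "aab" passes through states 1, 1 and 2.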
40. The multi-regular-expression matching method according to claim 38, wherein the step of looking up the post state comprises: determining the type of the current state; if it is a convergence state or a general state, looking up the post state in the state transition rule set according to the current input character and the current state; if it is a separation state, looking up the post state in the separation-state transition rule set according to the current input character, the current state and the cached state;
the separation-state transition rule set is arranged to receive three inputs, namely the current input character, the current state and the cached state, and to provide one corresponding output, namely the post state;
the step of caching according to the specific caching rule is: if the current state is a convergence state, caching the current state.
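A possible reading of claim 40 is sketched below. Several points are assumptions rather than statements from the claim: the string labels for state types, the fallback to the initial state when no rule matches, and the choice to apply the caching rule to the state in which the step started (the claim does not say whether "current state" means the state before or after the jump).

```python
from typing import Dict, Tuple

Kinds = Dict[int, str]                       # state -> "convergence" | "general" | "separation"
Rules = Dict[Tuple[int, str], int]           # (state, character) -> post state
SplitRules = Dict[Tuple[int, int, str], int]  # (state, cached state, character) -> post state


def typed_step(
    current: int, cached: int, ch: str,
    kinds: Kinds, rules: Rules, split_rules: SplitRules, initial: int = 0,
) -> Tuple[int, int]:
    """Return (post state, cached state) after consuming one character."""
    if kinds.get(current) == "separation":
        # Separation states consult three inputs: state, cached state, character.
        post = split_rules.get((current, cached, ch), initial)
    else:
        # Convergence and general states use the ordinary rule set.
        post = rules.get((current, ch), initial)
    if kinds.get(current) == "convergence":
        cached = current                     # remember which branch converged here
    return post, cached
```

Here the cached state records the most recent convergence state, and a later separation state uses it to pick the correct outgoing branch.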
PCT/CN2008/000293 2007-05-18 2008-02-03 Method and chip structure for matching multi-character string WO2008141519A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710099389.X 2007-05-18
CNB200710099389XA CN100495407C (en) 2007-05-18 2007-05-18 Multiple character string matching method and chip

Publications (1)

Publication Number Publication Date
WO2008141519A1 (en) 2008-11-27

Family

ID=38782733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/000293 WO2008141519A1 (en) 2007-05-18 2008-02-03 Method and chip structure for matching multi-character string

Country Status (2)

Country Link
CN (1) CN100495407C (en)
WO (1) WO2008141519A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445891A (en) * 2016-08-09 2017-02-22 中国科学院计算技术研究所 Method and device for accelerating string matching algorithm
CN111078963A (en) * 2019-12-31 2020-04-28 奇安信科技集团股份有限公司 NFA to DFA conversion method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100495407C (en) * 2007-05-18 2009-06-03 北京哲安科技有限公司 Multiple character string matching method and chip
CN101901257B (en) * 2010-07-21 2012-07-04 北京理工大学 Multi-string matching method in a search engine
CN104714951A (en) * 2013-12-13 2015-06-17 世纪禾光科技发展(北京)有限公司 Parallel multi-pattern matching method and system
CN104361097A (en) * 2014-11-21 2015-02-18 国家电网公司 Real-time detection method for electric power sensitive mail based on multimode matching
CN107967219B (en) * 2017-11-27 2021-08-06 北京理工大学 TCAM-based large-scale character string high-speed searching method
CN108133052A (en) * 2018-01-18 2018-06-08 广州汇智通信技术有限公司 A kind of searching method of multiple key, system, medium and equipment
CN110222143B (en) * 2019-05-31 2022-11-04 北京小米移动软件有限公司 Character string matching method, device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4241402A (en) * 1978-10-12 1980-12-23 Operating Systems, Inc. Finite state automaton with multiple state types
JP2002297681A (en) * 2001-03-29 2002-10-11 Kddi Corp Finite state automaton generating device
US6961693B2 (en) * 2000-04-03 2005-11-01 Xerox Corporation Method and apparatus for factoring ambiguous finite state transducers
CN1801152A (en) * 2006-01-13 2006-07-12 清华大学 Multi-keyword matching method for text or network content analysis
CN101051321A (en) * 2007-05-18 2007-10-10 北京哲安科技有限公司 Multiple character string matching method and chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AHO A.V. AND CORASICK M.J.: "Efficient String Matching: An Aid to Bibliographic Search", COMMUNICATIONS OF THE ACM, vol. 18, no. 6, June 1975 (1975-06-01), pages 333 - 340, XP001152117 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445891A (en) * 2016-08-09 2017-02-22 中国科学院计算技术研究所 Method and device for accelerating string matching algorithm
CN111078963A (en) * 2019-12-31 2020-04-28 奇安信科技集团股份有限公司 NFA to DFA conversion method and device
CN111078963B (en) * 2019-12-31 2023-08-15 奇安信科技集团股份有限公司 Method and device for converting NFA (nondeterministic finite automaton) into DFA (deterministic finite automaton)

Also Published As

Publication number Publication date
CN101051321A (en) 2007-10-10
CN100495407C (en) 2009-06-03

Similar Documents

Publication Publication Date Title
WO2008141519A1 (en) Method and chip structure for matching multi-character string
US7539032B2 (en) Regular expression searching of packet contents using dedicated search circuits
US7624105B2 (en) Search engine having multiple co-processors for performing inexact pattern search operations
US7644080B2 (en) Method and apparatus for managing multiple data flows in a content search system
US7529746B2 (en) Search circuit having individually selectable search engines
JP3935880B2 (en) Hybrid search memory for network processors and computer systems
CN105224692B (en) Support the system and method for the SDN multilevel flow table parallel searchs of multi-core processor
JP4091604B2 (en) Bit string matching method and apparatus
US20080071781A1 (en) Inexact pattern searching using bitmap contained in a bitcheck command
US20080192754A1 (en) Routing system and method for managing rule entries of ternary content addressable memory in the same
US10110492B2 (en) Exact match lookup with variable key sizes
EP2215563B1 (en) Method and apparatus for traversing a deterministic finite automata (dfa) graph compression
US8560475B2 (en) Content search mechanism that uses a deterministic finite automata (DFA) graph, a DFA state machine, and a walker process
US9871727B2 (en) Routing lookup method and device and method for constructing B-tree structure
Che et al. DRES: Dynamic range encoding scheme for TCAM coprocessors
US6957215B2 (en) Multi-dimensional associative search engine
CN101309216B (en) IP packet classification method and apparatus
US20070171911A1 (en) Routing system and method for managing rule entry thereof
JP2005513895A5 (en)
IL182820A (en) Double-hash lookup mechanism for searching addresses in a network device
US20120005234A1 (en) Storage medium, trie tree generation method, and trie tree generation device
EP1678619B1 (en) Associative memory with entry groups and skip operations
US6629195B2 (en) Implementing semaphores in a content addressable memory
US20140114995A1 (en) Scalable high speed relational processor for databases and networks
US8935270B1 (en) Content search system including multiple deterministic finite automaton engines having shared memory resources

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08714815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/03/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08714815

Country of ref document: EP

Kind code of ref document: A1