CN113505585B - High-speed character string feature matching method, device and equipment based on primitive state machine - Google Patents

Publication number
CN113505585B
Authority
CN
China
Prior art keywords
primitive
state machine
node
character
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110801808.XA
Other languages
Chinese (zh)
Other versions
CN113505585A (en)
Inventor
刘铮铮
周蓉蓉
姜武忠
莫晨宇
王瑞旋
陈妍红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangya Hospital of Central South University
Original Assignee
Xiangya Hospital of Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangya Hospital of Central South University
Priority to CN202110801808.XA
Publication of CN113505585A
Application granted
Publication of CN113505585B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to the technical field of gene sequencing and deep content matching of network messages, and in particular to a high-speed character string feature matching method, device and equipment based on a primitive state machine. The method comprises the following steps: receiving a keyword feature set input by a user, and storing each keyword feature as a feature serial number together with a character string sequence; compiling the keyword feature set, based on predefined basic operation primitives and according to a preset compiling method, to obtain a primitive state machine corresponding to the keyword feature set; and acquiring a target character string to be matched, executing a character string matching process based on the primitive state machine, obtaining a hit keyword, and outputting the feature serial number of the hit keyword and the end position of the currently matched characters. The matching method greatly reduces the number of state nodes and the number of state node transitions involved in matching, and helps raise the hit rate of the multi-level cache of a modern high-performance CPU, thereby improving the performance and the matching flexibility of the matching algorithm.

Description

High-speed character string feature matching method, device and equipment based on primitive state machine
Technical Field
The invention relates to the technical field of gene sequencing and deep content matching of network messages, in particular to a high-speed character string feature matching method, device and equipment based on a primitive state machine.
Background
Quickly determining whether, and at what position, any member of a specified keyword feature set appears in an input character sequence is a long-standing problem in computer science, with wide application in fields such as high-speed Internet message classification, Internet application protocol identification, and genome alignment and positioning.
Typical existing matching algorithms include the AC algorithm and the DFA and NFA algorithms. The AC algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick; it matches a finite set of string features against an input string, and suffers from low efficiency. DFA and NFA algorithms are typically compiled from regular expressions: a DFA does not provide match backtracking, while an NFA is slower than a DFA but does provide match backtracking. When the number of regular expressions is large, DFA and NFA algorithms suffer from severe state combination explosion, and their matching performance falls below that of the AC algorithm. Moreover, the basic matching process of all these algorithms is the same: read one character from the input character sequence, then advance the state machine to the next position according to its current position and the input character. This is inefficient because every advance of the state machine position involves at least one memory access operation. The performance of these algorithms is therefore bound by memory clock frequency and memory access latency, and is difficult to improve. Designing a new character feature matching algorithm that makes full use of the computing power of the CPU and avoids the performance constraints of memory is thus an important way to improve matching performance.
Disclosure of Invention
In view of the above, the technical problem to be solved by the present invention is to break through the performance bottleneck of existing character feature matching algorithms by providing a high-speed character feature matching method and device based on a primitive state machine.
An embodiment of the present invention provides a high-speed character string feature matching method based on a primitive state machine, which specifically includes:
receiving a keyword feature set input by a user, and storing each keyword feature as a feature serial number together with a character string sequence;
compiling the keyword feature set, based on predefined basic operation primitives and according to a preset keyword feature set compiling method, to obtain a primitive state machine corresponding to the keyword feature set;
and acquiring a target character string to be matched, executing a character string matching process based on the primitive state machine, obtaining a hit keyword, and outputting the feature serial number of the hit keyword and the end position of the currently matched characters.
Further, the predefined basic operation primitives include:
skip characters primitive: skipping N characters forward or backward from the current input character reading position;
match multiple strings at the current position primitive: matching a plurality of character string features starting at the current input character position;
search and match multiple strings primitive: searching for and matching a plurality of character string features from the current input character position onward;
jump to specified matching position primitive: pointing the current reading position to a specified position in the input character string;
successful hit primitive: ending the matching process and returning the feature serial number of the matched keyword and the end position of the match;
failure primitive: ending the matching process and returning failure information.
Further, the preset keyword feature set compiling method specifically includes:
creating a successful hit primitive node and a failure primitive node as basic primitive nodes, initializing the related variables, pointing the failure pointer to the failure primitive node, and pointing the current node pointer to a null node;
acquiring a keyword feature set to be compiled, reading the current character and advancing the reading pointer, performing syntax analysis according to the type of the current character and/or the next character, and compiling the different characters or character combinations according to a preset compiling method to construct a primitive state machine, thereby obtaining the primitive state machine corresponding to each keyword;
and merging primitive state machines of the same depth according to the depth of the root node of the primitive state machine corresponding to each keyword, and then aggregating them from shallow to deep to obtain the primitive state machine corresponding to the keyword feature set.
Further, the step of performing syntax analysis according to the type of the current character and/or the next character, and compiling different characters or character combinations according to a preset compiling method specifically includes:
when the current character is ".", reading the next character; when the next character is "*", setting the floating flag and advancing the reading pointer; otherwise, creating a "skip characters" primitive node as the current node, or adding 1 to the skip count of the current "skip characters" primitive;
when the current character is "*", stopping the compilation and reporting an error;
for other characters, performing the character string primitive node processing, which specifically includes:
when the current character is "\", reading the next character and advancing the reading pointer;
when the current node is not a primitive node of the "character string" type, creating a "search and match multiple strings" primitive node or a "match multiple strings at the current position" primitive node according to whether the floating flag is True, pointing the success jump state of the current node to the new node and the failure jump state of the new node to the failure pointer, and finally updating the current node to the newly created primitive node; adding the current character to the tail of the search string of the current primitive;
if the floating flag is True and the failure pointer points to the failure primitive node, creating a "jump to specified position" primitive, pointing both its success and failure jump states to the current node, setting its character reading position to the start reading position of the previous state, and pointing the failure pointer to the newly created primitive;
setting the floating flag to False.
Further, the step of combining primitive state machines with the same depth according to the depth of the primitive state machine root node corresponding to the keyword, and then performing aggregation according to the principle from shallow to deep to obtain the primitive state machine corresponding to the keyword feature set specifically includes:
creating an empty linked list ordered according to depth, and determining the depth value of each primitive state machine according to the depth rule of the primitive state machines;
reading a first primitive state machine corresponding to the keyword and a depth value thereof, and storing the first primitive state machine and the corresponding depth value into an empty linked list to obtain a depth linked list;
continuing to read the primitive state machine corresponding to the keyword and the corresponding depth value, and searching the primitive state machine with the same depth in the depth linked list to be used as a target state machine;
when the target state machine exists, the state machine merging processing is carried out according to the current state machine type and the primitive node type of the target state machine; when the target state machine does not exist, inserting the currently read state machine into the depth linked list according to the depth sequence until all the primitive state machines are read to obtain a final depth linked list;
taking the first primitive state machine from the final depth linked list and pointing the global root to its root node; for each primitive state machine after the first, creating a "jump to specified matching position" primitive node that replaces the failure primitive node of the previous state machine, with both its success and failure jump pointers pointing to the root node of the next primitive state machine;
and deleting the depth linked list and returning the global root pointer, to obtain the primitive state machine corresponding to the keyword feature set.
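Behaviourally, chaining the state machines in this way means that when one machine fails, the read position returns to the start of the input and the next, deeper machine is tried. The sketch below models only that behaviour and is not the patented node-level construction: it assumes each per-depth machine is available as a hypothetical Python callable that returns `(feature_id, end_position)` or `None`, and `chain_machines` is an illustrative name.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: a compiled per-depth machine is modelled as a plain callable.
Machine = Callable[[str], Optional[Tuple[int, int]]]


def chain_machines(machines: List[Machine]) -> Machine:
    """Combine machines (already ordered shallow to deep) into one matcher."""

    def run(text: str) -> Optional[Tuple[int, int]]:
        for machine in machines:
            # Each machine starts again from the beginning of the input,
            # mirroring the "jump to specified matching position" node that
            # replaces the failure node of the previous machine.
            result = machine(text)
            if result is not None:
                return result
        return None  # only the last machine's failure node remains

    return run


# Two toy machines standing in for compiled primitive state machines.
def starts_with_ab(text: str) -> Optional[Tuple[int, int]]:
    return (1, 2) if text.startswith("ab") else None


def contains_z(text: str) -> Optional[Tuple[int, int]]:
    at = text.find("z")
    return (2, at + 1) if at != -1 else None


matcher = chain_machines([starts_with_ab, contains_z])
```

In this toy setup `matcher("xxz")` falls through the first machine and is answered by the second.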
Further, the step of performing state machine merging processing according to the primitive node types of the current state machine and the target state machine specifically includes:
replacing the "failure" primitive node of the current state machine with the "failure" primitive node of the target state machine;
when the root nodes of both state machines are "skip characters" primitive nodes, advancing the source node pointer and the target node pointer to the success jump nodes of their respective "skip characters" primitive nodes, and deleting the "skip characters" node of the current state machine;
when the source node pointer and the target node pointer both point to string-type primitives, adding all character string features of the source node to the target node while keeping the success jump node of each feature unchanged, and adding all subsequent nodes reachable from those success jump nodes to the target state machine;
where the source node is the current node of the current state machine, and the target node is the current node of the target state machine.
Further, the specific step of executing the string matching process based on the primitive state machine includes:
reading the first primitive node into the current state node according to the global root pointer of the primitive state machine, and setting the next state to a null pointer; pointing the reading position of the character string to be matched at its first character;
executing the following steps in a loop until the current state is a successful hit primitive or a failure primitive:
executing the matching function corresponding to the primitive type of the current state, and advancing the reading position of the character string to be matched as that function requires;
selecting, based on the execution result of the current primitive, the address pointed to by its success pointer or its failure pointer as the address from which the next primitive node is read;
and reading the next primitive node into the current state node.
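The execution loop can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the `Node` class, its field names and the `run` function are hypothetical; the jump primitive is simplified to a fixed absolute position; a position equal to the input length is assumed to be valid; and when several strings of a search primitive occur in the input, the sketch assumes the earliest occurrence wins, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    kind: str  # "skip" | "match" | "search" | "jump" | "hit" | "fail"
    skip: int = 0                      # "skip": signed number of characters
    strings: Dict[str, "Node"] = field(default_factory=dict)  # feature -> next
    target: int = 0                    # "jump": absolute read position
    ok: Optional["Node"] = None        # success pointer ("skip" / "jump")
    bad: Optional["Node"] = None       # failure pointer
    feature_id: int = -1               # "hit": serial number of the keyword


def run(root: Node, text: str) -> Optional[Tuple[int, int]]:
    """Return (feature_id, end_position) on a hit, or None on failure."""
    node, pos = root, 0
    while node.kind not in ("hit", "fail"):
        if node.kind == "skip":
            new_pos = pos + node.skip
            if 0 <= new_pos <= len(text):
                pos, node = new_pos, node.ok
            else:
                node = node.bad
        elif node.kind == "match":
            for s, nxt in node.strings.items():
                if text.startswith(s, pos):
                    pos, node = pos + len(s), nxt
                    break
            else:  # no feature matched at the current position
                node = node.bad
        elif node.kind == "search":
            best = None
            for s, nxt in node.strings.items():
                at = text.find(s, pos)
                if at != -1 and (best is None or at < best[0]):
                    best = (at, s, nxt)
            if best is None:
                node = node.bad
            else:
                pos, node = best[0] + len(best[1]), best[2]
        elif node.kind == "jump":
            if 0 <= node.target <= len(text):
                pos, node = node.target, node.ok
            else:
                node = node.bad
        else:
            raise ValueError(f"unknown primitive kind: {node.kind}")
    return (node.feature_id, pos) if node.kind == "hit" else None


# Hand-built machine for a keyword of the form "he", any one character, "lo".
fail = Node("fail")
hit = Node("hit", feature_id=7)
tail = Node("match", strings={"lo": hit}, bad=fail)
gap = Node("skip", skip=1, ok=tail, bad=fail)
root = Node("match", strings={"he": gap}, bad=fail)
```

Each loop iteration consumes a whole primitive (a string comparison, a search or a skip) rather than a single character, which is the source of the reduction in state transitions that the method claims.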
Based on the same inventive concept, an embodiment of the present invention further provides a high-speed character string feature matching device based on a primitive state machine, which specifically includes:
a keyword feature set library, used for acquiring the keyword feature set and storing each keyword feature as a feature serial number together with a character string sequence;
a keyword feature compiler, used for compiling the keyword feature set, based on predefined basic operation primitives and according to a preset keyword feature set compiling method, to obtain a primitive state machine corresponding to the keyword feature set;
a primitive state machine information base, used for storing the primitive state machine corresponding to the keyword feature set compiled by the keyword feature compiler;
and a state machine execution engine module, used for executing the character string matching process based on the primitive state machine, obtaining a hit keyword, and outputting the feature serial number of the hit keyword and the end position of the currently matched characters.
Based on the same inventive concept, an embodiment of the present invention further provides equipment for high-speed character string feature matching based on a primitive state machine, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the high-speed character string feature matching method based on the primitive state machine when executing the computer program.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, where the computer program, when executed by a processor, implements the steps of the high-speed string feature matching method based on the primitive state machine.
Beneficial effects:
The invention provides a high-speed character feature matching method based on a primitive state machine. The simple state nodes of the state machine in a traditional matching algorithm are replaced by complex operation primitives, each with a fixed processing function, so that more complex character matching operations are computed on the CPU. This greatly reduces the number of state nodes and the number of state node transitions involved in matching, helps raise the multi-level cache hit rate of a modern high-performance CPU, and improves the performance and matching flexibility of the matching algorithm.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a high-speed character feature matching method based on a primitive state machine according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating a first-stage keyword compiling method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an aggregation flow of a second-stage feature set library multi-primitive state machine according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating a process for performing character matching according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an implementation of a high-speed character feature matching apparatus based on a primitive state machine according to an embodiment of the present invention;
Fig. 6 is a diagram of a primitive state machine constructed from two features, namely hello…
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a flow diagram of a high-speed character feature matching method based on a primitive state machine according to an embodiment of the present invention is provided.
Step S101: a user inputs a keyword feature set, and each keyword feature is stored as a feature serial number together with a character string sequence.
In the embodiment of the invention, a group of keywords input by a user is obtained; each keyword consists of standard ASCII (American Standard Code for Information Interchange) characters, and its length is not limited. The "." character matches any single character. The "*" character must be used together with the "." character, and the combination floating-matches zero or any number of characters. The "\" character is the escape character, which turns a special-meaning character into an ordinary character. All remaining ASCII characters serve as ordinary string feature characters.
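A minimal Python sketch of this keyword syntax follows. It assumes that the floating marker is written `.*`; the function name `tokenize_keyword` and the token labels `ANY`, `FLOAT` and `CHAR` are illustrative and do not come from the patent.

```python
from typing import List, Tuple


def tokenize_keyword(keyword: str) -> List[Tuple[str, str]]:
    """Split a keyword feature into (kind, text) tokens.

    ANY   : "."  matches any single character
    FLOAT : ".*" floating match of zero or more characters (assumed spelling)
    CHAR  : an ordinary character, possibly produced by a "\\" escape
    """
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(keyword):
        ch = keyword[i]
        i += 1
        if ch == ".":
            if i < len(keyword) and keyword[i] == "*":
                tokens.append(("FLOAT", ".*"))
                i += 1
            else:
                tokens.append(("ANY", "."))
        elif ch == "*":
            # "*" is only meaningful directly after "."
            raise ValueError("'*' must follow '.'")
        elif ch == "\\":
            if i >= len(keyword):
                raise ValueError("escape character at end of keyword")
            tokens.append(("CHAR", keyword[i]))  # escaped: taken literally
            i += 1
        else:
            tokens.append(("CHAR", ch))
    return tokens
```

For example, `tokenize_keyword(r"a\.b")` yields three `CHAR` tokens, because the escape turns the dot into an ordinary character.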
Step S102: the keyword feature set is compiled, based on predefined basic operation primitives and according to a preset keyword feature set compiling method, to obtain a primitive state machine corresponding to the keyword feature set.
In this embodiment of the present invention, the basic operation primitive specifically includes:
Skip characters primitive: skips N characters forward or backward from the current input character reading position. When N is negative the position moves toward the start of the input; when N is positive it moves toward the end; when N is 0 the position does not change. If the resulting position lies between the start position and the end position of the input character string, the state machine jumps to the state pointed to by the success pointer; otherwise it jumps to the state pointed to by the failure pointer.
Match multiple strings at the current position primitive: starts matching a plurality of character string features at the current input character position. When a match succeeds, the state machine jumps to the next state pointed to by the character string feature that was hit. If the match fails, it jumps to the state pointed to by the failure pointer.
Search and match multiple strings primitive: searches for and matches a plurality of character string features from the current input character position onward. When a match succeeds, the state machine jumps to the next state pointed to by the character string feature that was hit. If the input reaches the end position of the character string without a match, it jumps to the state pointed to by the failure pointer.
Jump to specified matching position primitive: points the current reading position to a specified position in the input character string. If the new position lies between the start position and the end position of the input character string, the state machine jumps to the state pointed to by the success pointer; otherwise it jumps to the state pointed to by the failure pointer.
Successful hit primitive: ends the matching process and returns the serial number of the keyword that was hit and the end position of the hit.
Failure primitive: ends the matching process and returns failure information.
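One possible in-memory representation of these six primitives is sketched below in Python. The names `Kind`, `Node` and `apply_skip` are hypothetical, and the sketch assumes that a read position equal to the input length still counts as inside the input, which the text leaves open.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, Optional, Tuple


class Kind(Enum):
    SKIP = auto()        # skip N characters
    MATCH_HERE = auto()  # match multiple strings at the current position
    SEARCH = auto()      # search and match multiple strings
    JUMP = auto()        # jump to a specified matching position
    HIT = auto()         # successful hit (terminal)
    FAIL = auto()        # failure (terminal)


@dataclass
class Node:
    kind: Kind
    skip: int = 0                                   # SKIP: signed count N
    # MATCH_HERE / SEARCH: each string feature carries its own next state
    strings: Dict[str, "Node"] = field(default_factory=dict)
    position: int = 0                               # JUMP: target position
    on_success: Optional["Node"] = None             # SKIP / JUMP success state
    on_failure: Optional["Node"] = None             # failure state
    feature_id: Optional[int] = None                # HIT: keyword serial number


def apply_skip(pos: int, n: int, length: int) -> Tuple[int, bool]:
    """Apply a SKIP primitive; return (new_position, succeeded)."""
    new_pos = pos + n  # negative n moves toward the start of the input
    if 0 <= new_pos <= length:
        return new_pos, True
    return pos, False  # out of range: caller follows the failure pointer
```

Keeping a per-string successor in `strings` is what lets one node stand for several keyword features at once.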
As shown in Figs. 2 and 3, the preset keyword feature set compiling method specifically includes:
Fig. 2 shows a flowchart of the first-stage keyword compiling method of the embodiment, which includes:
Create the successful hit primitive node and the failure primitive node.
Set the reading position to the start of the current keyword; point the current node pointer and the root node pointer to a null node; set the floating flag to False; point the failure jump pointer to the failure primitive node; point the failure jump state of the current node to the failure jump pointer.
Loop until the reading position reaches the end of the current keyword:
  Read the current character and advance the reading position by one character.
  If the current character is ".":
    If a next character exists and is "*": set the floating flag to True and advance the reading position by one more character.
    Otherwise: set the floating flag to False. If the current node is a "skip characters" primitive, add 1 to its skip count. Otherwise, create a new primitive node of type "skip characters" with a skip count of 1; set its failure jump state to the node pointed to by the failure jump pointer; if the current node is not null, point the success jump state of the current node to the new node; point the current node to the new node; if the root node pointer is null, point the root node pointer to the new node.
  If the current character is "*": report an error and stop the compiling process.
  If the current character is "\": if a next character exists, assign the character at the reading position to the current character and advance the reading position by one character; otherwise, report an error and stop the compiling process.
  For an escaped character or any other character:
    If the current node is a "character string" primitive and the floating flag is False: append the current character to the tail of the search string of the current primitive.
    Otherwise, if the floating flag is True: create a new primitive node of type "search and match multiple strings", point its failure jump state to the failure jump pointer, and append the current character to the tail of its search string. If the current node is not null, point the success jump state of the current node to the new node. Point the current node to the new node. If the primitive node pointed to by the failure jump pointer is of type "failure primitive": create a second new primitive node of type "jump to specified matching position" and set its position to "start position of the previous primitive"; point both the success jump state and the failure jump state of the second new node to the first new node; point the failure jump pointer to the second new node. Set the floating flag to False.
    Otherwise: create a new primitive node of type "match multiple strings at the current position", point its failure jump state to the failure jump pointer, and append the current character to the tail of its search string. If the current node is not null, point the success jump state of the current node to the new node. Point the current node to the new node.
After the loop ends: if the primitive state machine contains no string primitive, report an error and stop the compiling process. Otherwise, point the success jump state of the current node to the successful hit primitive node; the compiling of the current keyword is then complete.
In summary of the detailed steps above, the success and failure primitive nodes are created first and the related variables are initialized. The keyword string analysis stage then reads each character of the keyword in a loop and processes it according to its type:
When the character is ".": the floating flag is set according to whether the next character is "*", which decides whether the subsequent primitive is a search match. For the case of a "." character alone, if the current primitive is a "skip characters" primitive, its skip count is increased by one, which means that a run of consecutive "." characters (the "..." form) that skips several characters has occurred; otherwise a character-skipping process is starting, so a new "skip characters" primitive is created and added to the state machine.
When the character is "\": it is the escape character. The next character is read and the reading pointer is advanced, so that the three special-meaning characters ".", "*" and "\" are handled as ordinary characters.
Other characters: these are added to the feature string of a "character string" primitive, and are handled according to the current node type and the floating flag. When the current node is a "character string" primitive and the floating flag is False, the current character is appended to the tail of the feature string of that primitive node. In all other cases, a primitive node of type "search and match multiple strings" or "match multiple strings at the current position" is created according to the floating flag, and the current character becomes the first character of its feature string. For a "search and match multiple strings" primitive, when a downstream primitive fails, matching must return to the top-level "search and match multiple strings" primitive to continue searching. Therefore, when a "search and match multiple strings" primitive is created for the first time, the failure jump pointer used by subsequent nodes is pointed at a newly created "jump to specified position" primitive, through which the matching state is brought back to the top-level "search and match multiple strings" primitive.
Finally, the success state of the current node of the compiled state machine is pointed at the successful hit primitive created earlier. A validity check is also performed on the state machine: for a state machine that contains no string-type primitive node, compilation stops and an error is reported.
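The character-by-character analysis can be illustrated with a simplified Python sketch. It is not the patented compiler: it emits a flat list of primitives instead of linked nodes, omits the success, failure and "jump to specified position" wiring, and assumes the floating marker is written `.*`. The function name `compile_keyword` and the tuple layout are hypothetical.

```python
from typing import List, Tuple, Union

# ("skip", n) | ("match", s) | ("search", s); layout is illustrative only
Primitive = Tuple[str, Union[int, str]]


def compile_keyword(keyword: str) -> List[Primitive]:
    """Translate one keyword into an ordered list of primitives."""
    prims: List[Primitive] = []
    floating = False  # set by ".*", consumed by the next string primitive
    i = 0
    while i < len(keyword):
        ch = keyword[i]
        i += 1
        if ch == ".":
            if i < len(keyword) and keyword[i] == "*":
                floating = True
                i += 1
            else:
                floating = False
                if prims and prims[-1][0] == "skip":
                    # consecutive "." characters extend the same skip
                    prims[-1] = ("skip", prims[-1][1] + 1)
                else:
                    prims.append(("skip", 1))
            continue
        if ch == "*":
            raise ValueError("'*' is only valid directly after '.'")
        if ch == "\\":
            if i >= len(keyword):
                raise ValueError("escape character at end of keyword")
            ch = keyword[i]  # escaped character is treated as ordinary
            i += 1
        if prims and prims[-1][0] in ("match", "search") and not floating:
            kind, text = prims[-1]
            prims[-1] = (kind, text + ch)  # extend the current string
        else:
            prims.append(("search" if floating else "match", ch))
            floating = False
    if not any(kind in ("match", "search") for kind, _ in prims):
        raise ValueError("keyword contains no string primitive")
    return prims
```

With this sketch, `compile_keyword("he..lo.*world")` produces a match on "he", a skip of 2, a match on "lo", and a search for "world": four primitives for a thirteen-character keyword.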
Fig. 3 shows the aggregation flow of the second-stage feature set library multi-primitive state machine in this embodiment, which aggregates the primitive state machines corresponding to the individual keywords into the feature library primitive state machine corresponding to the keyword feature set library.
The processing steps of the second stage are as follows.
Create a state machine linked list ordered by depth.
For the primitive state machine of each keyword generated in stage one, execute the following steps in a loop:
  Set the depth value to 0.
  If the primitive node pointed to by the root pointer of the current primitive state machine is of type "skip characters", take the skip count of that primitive as the depth value.
  If the primitive node pointed to by the root pointer of the current primitive state machine is of type "search and match multiple strings", set the depth value to the maximum value.
  Look up the state machine linked list by depth value. If a primitive state machine of the same depth (the target state machine) exists:
    Point the target node pointer at the primitive node referenced by the root pointer of the target state machine.
    Point the source node pointer at the primitive node referenced by the root pointer of the current primitive state machine.
    Check that the primitive types of the target node and the source node are consistent; then delete the "failure primitive" node of the current primitive state machine and replace every reference to it with the "failure primitive" node of the target state machine.
    If the primitive type of the target node is "skip characters": assign the success jump state of the source node to a temporary node; delete the source node from the current primitive state machine and assign the temporary node to the source node; assign the success jump state of the target node to the target node.
    If the primitive type of the target node is neither "match multiple strings at the current position" nor "search and match multiple strings": report an error and terminate the compiling process.
    Add all character string features of the source node to the target node, keeping the success jump node of each feature unchanged, and add all subsequent nodes reachable from those success jump nodes to the target state machine.
    Delete the source node, and delete every primitive node of the current primitive state machine that the target state machine does not reference.
  If no primitive state machine of the same depth exists: using the depth value as the sort key, insert the current state machine into the state machine linked list in ascending order of depth.
When all state machines have been inserted into the linked list:
  Assign the first state machine in the linked list to the current state machine and delete it from the linked list.
  Set the global root pointer to the root pointer of the current state machine.
  For the remaining state machines in the linked list, execute the following loop:
    Assign the first state machine in the linked list to the next state machine and delete it from the linked list.
    Create a primitive node of type "jump to specified matching position" and set its jump position to the input "start position".
    Point both the success jump state and the failure jump state of the new primitive node at the root node of the next state machine.
    Replace the "failure primitive" node of the current state machine with the new primitive node.
    Delete the "failure primitive" of the current state machine, and assign the next state machine to the current state machine.
After the loop ends, delete the linked list and return the feature library primitive state machine pointed to by the global root pointer.
According to these steps, state machine aggregation follows the principle of first merging primitive state machines of the same depth and then connecting them from shallow to deep. An empty linked list is created first, and the depth of each primitive state machine is determined by the following depth rule: a state machine whose root node is a 'search and match multiple strings' primitive has the greatest depth; for a state machine whose root node is a 'skip characters' primitive, the depth value is the number of skipped characters, from 1 to N, ordered from shallow to deep; for a state machine whose root node is a 'match multiple strings at the current position' primitive, the depth value is 0. By the stage-one compiling method the root node cannot be of any other primitive type; if another primitive type does occur, an error is reported and compilation is terminated.
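The depth rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type tags `SKIP`, `MATCH_HERE` and `SEARCH`, the `Root` class and the use of `sys.maxsize` for "the maximum value" are hypothetical names chosen here, not part of the patent.

```python
import sys
from dataclasses import dataclass

# Hypothetical type tags for the three primitive types allowed at a root.
SKIP = "skip_characters"
MATCH_HERE = "match_multiple_strings_at_current_position"
SEARCH = "search_and_match_multiple_strings"

MAX_DEPTH = sys.maxsize  # stands in for "the maximum value" in the text


@dataclass
class Root:
    kind: str
    skip_count: int = 0  # only meaningful for SKIP roots


def root_depth(root: Root) -> int:
    """Return the depth value used to order a keyword's state machine."""
    if root.kind == MATCH_HERE:
        return 0
    if root.kind == SKIP:
        return root.skip_count
    if root.kind == SEARCH:
        return MAX_DEPTH
    # Any other root type is a compile error according to the description.
    raise ValueError(f"unexpected root primitive type: {root.kind}")


roots = [Root(SEARCH), Root(SKIP, 3), Root(MATCH_HERE), Root(SKIP, 1)]
ordered = sorted(roots, key=root_depth)  # shallow to deep
print([(r.kind, root_depth(r)) for r in ordered])
```

Sorting by `root_depth` places machines anchored at the current position first and free-floating searches last, which is the shallow-to-deep order the aggregation relies on.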
Then the first primitive state machine corresponding to a keyword and its depth value are read and stored in the empty linked list to obtain the depth linked list. Reading continues with the next primitive state machine and its depth value, and the depth linked list is searched for a primitive state machine of the same depth to serve as the target state machine; if none exists, the current state machine is inserted into the linked list. When a state machine of the same depth is found, first, the failure primitive of the current state machine is replaced with the failure primitive of the target state machine; second, if the root nodes of both state machines are 'skip characters' primitives, the current comparison node of each is advanced to its success node and the 'skip characters' primitive node of the current state machine is deleted; third, if the current comparison node is not a string-type primitive, an error is reported and compilation is terminated; otherwise the string features of the source node are added to the target node, and the subsequent primitive nodes pointed to by those string features are added to the target state machine; finally, all nodes of the current state machine that are not referenced by the target state machine are deleted.
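The insert-or-merge behaviour of the depth linked list can be sketched as follows. This is a simplified illustration, not the patented implementation: a Python list stands in for the linked list, the hypothetical `Machine` class keeps only the depth and the string features of the first string-type node, and successor nodes are represented by placeholder labels.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    depth: int
    # Each entry is (string feature, label of the node reached on success).
    # The labels are placeholders for real primitive nodes.
    features: list = field(default_factory=list)


def add_machine(depth_list: list, current: Machine) -> None:
    """Merge `current` into a same-depth machine, or insert it by depth."""
    for target in depth_list:
        if target.depth == current.depth:
            # Same depth: the source features join the target node and
            # keep their own success targets unchanged.
            target.features.extend(current.features)
            return
    # No machine of this depth yet: insert in ascending depth order.
    index = 0
    while index < len(depth_list) and depth_list[index].depth < current.depth:
        index += 1
    depth_list.insert(index, current)


depth_list = []
for machine in (
    Machine(10**9, [("good", "n1")]),  # large value stands in for max depth
    Machine(0, [("hello", "n2")]),
    Machine(0, [("test", "n3")]),
    Machine(2, [("moon", "n4")]),
):
    add_machine(depth_list, machine)
print([(m.depth, m.features) for m in depth_list])
```

The sketch leaves out the failure-node replacement and the deletion of unreferenced nodes, which depend on the concrete node representation.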
Next comes the phase that merges the state machines of different depths in the linked list. The failure node of the preceding state machine is replaced with a 'jump to the specified position' primitive node whose specified read position is the start of the message; the success and failure jump pointers of that node are then pointed to the root node of the next state machine, so that all state machines in the linked list are connected into the feature library primitive state machine for the whole keyword feature set library.
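A minimal sketch of this chaining step is shown below. It assumes a hypothetical `Node` class with a single success pointer and a single failure pointer per node (real string-type primitives carry one success pointer per string feature), and it assumes each machine is supplied together with its own failure node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    kind: str  # e.g. "search", "match_here", "skip", "jump", "hit", "fail"
    on_success: Optional["Node"] = None
    on_failure: Optional["Node"] = None
    position: int = 0  # read position, used only by "jump" nodes


def redirect(root: Node, old: Node, new: Node) -> None:
    """Repoint every edge reachable from `root` that targets `old` to `new`."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        for attr in ("on_success", "on_failure"):
            nxt = getattr(node, attr)
            if nxt is old:
                setattr(node, attr, new)
            else:
                stack.append(nxt)


def chain(machines: list, start_position: int = 0) -> Node:
    """machines: (root, failure node) pairs, already sorted shallow to deep."""
    for (root, fail), (next_root, _) in zip(machines, machines[1:]):
        # A failure in one machine rewinds the read position and
        # hands control to the root of the next machine.
        jump = Node("jump", on_success=next_root, on_failure=next_root,
                    position=start_position)
        redirect(root, fail, jump)
    return machines[0][0]  # global root


fail_a, fail_b = Node("fail"), Node("fail")
a = Node("match_here", on_success=Node("hit"), on_failure=fail_a)
b = Node("search", on_success=Node("hit"), on_failure=fail_b)
global_root = chain([(a, fail_a), (b, fail_b)])
print(global_root.kind, global_root.on_failure.kind)
```

Only the last machine in the chain keeps a real failure node, so the combined machine reports failure only after every depth level has been tried.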
Step S103: acquiring a target character string to be matched, executing the character string matching process based on the primitive state machine, obtaining the hit keyword, and outputting the feature serial number of the hit keyword and the end position of the currently matched character.
In the embodiment of the present invention, the target character string input by the user is obtained and, starting from its first character position, the character matching process is executed on the primitive state machine constructed in step S102. Fig. 4 provides a flow chart of the character matching process. First, the first primitive node is read into the current state node according to the global root pointer of the primitive state machine, and the next state is set to a null pointer; the read position of the character string to be matched is pointed at its first character. The loop then checks whether the current state node is a hit primitive or a failure primitive node, and ends if so. For any other node, the primitive node matching function corresponding to the primitive type of the current state is executed, and the read position of the character string to be matched is advanced accordingly; based on the execution result of the current primitive, the address pointed to by either the success pointer or the failure pointer is selected as the address of the next primitive node, and that node is read into the current state node.
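The matching loop can be sketched as a small interpreter over primitive nodes. The `Node` fields and the `run` function are hypothetical names chosen for this illustration. Several behaviours are assumptions rather than statements from the patent: a skip that would leave the string counts as a failure, a search stops at the leftmost occurrence of any of its strings, and the first listed string wins when several match at the same position.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    kind: str  # "skip", "match_here", "search", "jump", "hit" or "fail"
    count: int = 0  # skip: number of characters (may be negative)
    patterns: dict = field(default_factory=dict)  # string -> node on success
    position: int = 0  # jump: read position to move to
    feature_id: int = 0  # hit: feature serial number
    on_success: Optional["Node"] = None
    on_failure: Optional["Node"] = None


def run(root: Node, text: str):
    """Return (feature_id, end_position) on a hit, or None on failure."""
    node, pos = root, 0
    while node.kind not in ("hit", "fail"):
        nxt = node.on_failure
        if node.kind == "skip":
            if 0 <= pos + node.count <= len(text):
                pos, nxt = pos + node.count, node.on_success
        elif node.kind == "match_here":
            for pattern, target in node.patterns.items():
                if text.startswith(pattern, pos):
                    pos, nxt = pos + len(pattern), target
                    break
        elif node.kind == "search":
            best = None
            for pattern, target in node.patterns.items():
                start = text.find(pattern, pos)
                if start != -1 and (best is None or start < best[0]):
                    best = (start, pattern, target)
            if best is not None:
                pos, nxt = best[0] + len(best[1]), best[2]
        elif node.kind == "jump":
            pos, nxt = node.position, node.on_success
        else:
            raise ValueError(f"unknown primitive type: {node.kind}")
        node = nxt
    return (node.feature_id, pos) if node.kind == "hit" else None


fail = Node("fail")
hit = Node("hit", feature_id=7)
tail = Node("match_here", patterns={"moon": hit}, on_failure=fail)
skip = Node("skip", count=2, on_success=tail, on_failure=fail)
root = Node("search", patterns={"hello": skip}, on_failure=fail)
print(run(root, "say hello, moon"))
```

The example machine searches for `hello`, skips two characters and then requires `moon` at the current position; the feature number 7 is arbitrary.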
As shown in fig. 5, an implementation structure diagram of a high-speed character feature matching device based on a primitive state machine is provided in an embodiment of the present invention, where the matching device specifically includes:
The keyword feature set library is used for acquiring a keyword feature set input by a user and storing each keyword feature as a feature serial number together with a character string that holds the character sequence of the keyword.
In the embodiment of the present invention, each keyword feature comprises a feature serial number ID and a character string for storing the keyword character sequence.
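A keyword feature as described here is simply an ID paired with a string. The sketch below is one possible representation; the names `KeywordFeature` and `build_library`, and the choice to number features sequentially from 1, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordFeature:
    feature_id: int  # feature serial number (ID)
    pattern: str     # character sequence of the keyword


def build_library(patterns):
    """Assign serial numbers 1, 2, ... to the given keyword strings."""
    return [KeywordFeature(i, p) for i, p in enumerate(patterns, start=1)]


library = build_library(["hello", "test"])
print(library)
```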
The keyword feature compiler is used for compiling the keyword feature set, based on predefined basic operation primitives and according to a preset keyword feature set compiling method, to obtain the primitive state machine corresponding to the keyword feature set. In the embodiment of the invention, each keyword feature in the keyword feature set library is converted, in a layered manner and according to the compiling method, into a matching primitive state machine built from the previously defined operation primitives; the matching primitive state machines of the individual keyword features are then connected and combined into the matching primitive state machine for the whole set library.
The primitive state machine information base is used for storing the keyword feature set primitive state machine compiled by the keyword feature compiler. In the embodiment of the invention, the primitive state machine of the whole feature set library is stored as a graph made up of the matching primitives and the connections between them; the state machine execution engine performs state matching on the input string based on the primitive state machine information base.
The state machine execution engine module is used for acquiring a target character string to be matched, executing the character string matching process based on the primitive state machine, obtaining the hit keyword, and outputting the feature serial number of the hit keyword and the end position of the currently matched character. In the embodiment of the invention, the matching process shown in Fig. 4 is used: the state machine matching process is executed starting from state zero of the primitive state machine information base, and the hit result for the keyword features is output when the matching process finishes.
Fig. 6 shows an exemplary structure of a primitive state machine constructed from two keyword features. State 0 indicates the start of the matching process and does not correspond to any operation primitive of the state machine. State 1 is a 'search and match multiple strings' operation primitive: it keeps reading the next character of the input target string and matches against hello or test; if the input ends, it jumps to the failure primitive (state 8); if hello or test is matched, it jumps to state 2 or state 4 respectively. State 2 is a 'skip characters' operation primitive: it skips the next 5 characters after the current input position and, on success, transitions to state 3. State 3 is a 'match multiple strings at the current position' operation primitive: it matches the string moon exactly at the current position and, on success, jumps to state 5, which indicates that the match succeeded and ends the matching process. State 4 is a 'search and match multiple strings' operation primitive: it keeps reading the next character of the input target string and matches against the feature string good; if the input ends, it jumps to state 7; if good is matched, it jumps to state 6, which indicates that feature two has been matched and ends the matching process. State 7 is a 'jump to the specified matching position' operation primitive: it moves the current read position of the target string back to the position at which state 4 started matching.
For the target character string "aaatestbbbgoodcc", the matching process of this state machine is as follows: first, state 0 transitions to state 1, which starts searching for the keyword substrings hello and test and hits the substring test at the 4th character of the target string; the machine then jumps to state 4, continues by searching for the keyword substring good, and hits it at the 12th character of the target string; finally it jumps to state 6, indicating a hit on the second keyword feature.
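The walk through Fig. 6 can be reproduced with a small table-driven sketch. Several points are assumptions made for this illustration and are not stated in the text: the failure targets of states 2 and 3 (taken to be state 8), feature number 1 for state 5, and the omission of state 7, so that a failed search in state 4 goes straight to state 8. The input `aaatestbbbgoodcc` is likewise an assumed target string containing both test and good, and positions in the code are 0-based.

```python
# Fig. 6 as a table: state -> (operation, argument, on_success, on_failure).
# For string operations the argument maps each string to its success state.
FIG6 = {
    1: ("search", {"hello": 2, "test": 4}, None, 8),
    2: ("skip", 5, 3, 8),
    3: ("match_here", {"moon": 5}, None, 8),
    4: ("search", {"good": 6}, None, 8),  # simplification: state 7 omitted
    5: ("hit", 1, None, None),
    6: ("hit", 2, None, None),
    8: ("fail", None, None, None),
}


def run(table, text, state=1):
    """Return (feature number or None, end position, visited states)."""
    pos, trace = 0, [state]
    while table[state][0] not in ("hit", "fail"):
        op, arg, on_success, on_failure = table[state]
        nxt = on_failure
        if op == "skip":
            if pos + arg <= len(text):
                pos, nxt = pos + arg, on_success
        elif op == "match_here":
            for pattern, target in arg.items():
                if text.startswith(pattern, pos):
                    pos, nxt = pos + len(pattern), target
                    break
        elif op == "search":
            hits = [(text.find(p, pos), p, t) for p, t in arg.items()]
            hits = [h for h in hits if h[0] != -1]
            if hits:
                start, pattern, nxt = min(hits)  # leftmost occurrence wins
                pos = start + len(pattern)
        state = nxt
        trace.append(state)
    op, arg, _, _ = table[state]
    return (arg if op == "hit" else None), pos, trace


print(run(FIG6, "aaatestbbbgoodcc"))
print(run(FIG6, "hello12345moon"))
```

The first call visits states 1, 4 and 6 and reports feature two; the second follows the hello branch through states 1, 2, 3 and 5.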
In an embodiment of the present invention, there is also provided an extraction device for high-speed character string feature matching based on a primitive state machine, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the above high-speed character string feature matching method based on a primitive state machine.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the high-speed character string feature matching method based on the primitive state machine.
The above embodiments express only several implementations of the present invention; their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown and described, and they may be performed in other orders. Moreover, at least a portion of the steps in the various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

Claims (8)

1. A high-speed character string feature matching method based on a primitive state machine is characterized by specifically comprising the following steps:
inputting a keyword feature set by a user, and storing each keyword feature as a feature serial number together with a character string that holds the character sequence of the keyword;
compiling the keyword feature set according to a preset keyword feature set compiling method based on a predefined basic operation primitive to obtain a primitive state machine corresponding to the keyword feature set;
the basic operation primitive includes:
skip character primitive: skipping N characters forward or backward at the current input character reading position;
matching multiple string primitives at the current location: starting to match a plurality of character string features at a current input character position;
searching for matching multiple string primitives: searching and matching a plurality of character string characteristics from the current input character position;
jump to the specified match location primitive: pointing the current reading position to the designated position of the input character string;
successful hit primitive: after the matching process is finished, returning the matched hit keyword feature serial number and the hit end position;
a failure primitive: after the matching process is finished, returning failure information;
the preset keyword feature set compiling method specifically comprises the following steps:
creating a successful hit primitive node and a failure primitive node as basic primitive nodes, initializing relevant variables, pointing a failure pointer to the failure primitive node, and pointing a current node pointer to a null node;
acquiring a keyword feature set to be compiled, reading a current character and moving a reading pointer backwards, performing syntax analysis according to the type of the current character and/or a next character, and compiling different characters or character combinations according to a preset compiling method to construct a primitive state machine; obtaining a primitive state machine corresponding to each keyword;
according to the depth of the root node of the primitive state machine corresponding to the keyword, merging the primitive state machines with the same depth, and then performing aggregation according to the principle from shallow to deep to obtain the primitive state machine corresponding to the keyword feature set;
and acquiring a target character string to be matched, executing a character string matching process based on the primitive state machine, acquiring a hit keyword, and outputting a feature sequence number of the hit keyword and the end position of the current matched character.
2. The high-speed character string feature matching method based on the primitive state machine as claimed in claim 1, wherein the step of performing syntax analysis according to the type of the current character and/or the next character and compiling different characters or character combinations according to a preset compiling method specifically comprises:
when the current character is ' ″, reading the next character; when the next character is ' # ', setting the floating mark and moving the reading pointer backwards; otherwise, establishing a skip character primitive node as the current node, or adding 1 to the skip count of the current skip character primitive;
stopping compiling and reporting errors when the current character is 'x';
for other characters, the processing process of the primitive node of the character string is carried out, which specifically comprises the following steps:
the current character is \ ", the next character is read, and the reading pointer is moved backwards;
the current node is not a primitive node of a character string type, a primitive node of searching and matching a plurality of character strings or a primitive node of matching a plurality of character strings at the current position is established according to whether the floating mark is True, the successful skip state of the current node points to a new node, the failed skip state points to a failure pointer, and finally the current node is updated to be the newly established primitive node; adding the current character into the tail of the search character string of the current primitive;
if the floating mark is True and the failure pointer points to the failure primitive node, creating a jump designated position primitive, pointing the success and failure jump states of the jump designated position primitive to the current node, designating the character reading position as the initial reading position of the previous state, and pointing the failure pointer to the newly created jump designated position primitive;
the float flag is set to False.
3. The method for matching characteristics of a high-speed character string based on a primitive state machine according to claim 2, wherein the step of obtaining the primitive state machine corresponding to the keyword feature set by combining primitive state machines with the same depth according to the depth of the root node of the primitive state machine corresponding to the keyword and then performing aggregation according to the principle from shallow to deep specifically comprises the steps of:
creating an empty linked list ordered according to depth, and determining the depth value of each primitive state machine according to the depth rule of the primitive state machines;
reading a first primitive state machine corresponding to the keyword and a depth value thereof, and storing the first primitive state machine and the corresponding depth value into an empty linked list to obtain a depth linked list;
continuing to read the primitive state machine corresponding to the keyword and the corresponding depth value, and searching the primitive state machine with the same depth in the depth linked list to be used as a target state machine;
when the target state machine exists, the state machine merging processing is carried out according to the current state machine type and the primitive node type of the target state machine; when the target state machine does not exist, inserting the currently read state machine into the depth linked list according to the depth sequence until all the primitive state machines are read to obtain a final depth linked list;
taking the first primitive state machine from the final depth linked list and pointing the global root pointer to its root node; for each subsequent primitive state machine, creating a 'jump to the specified matching position' primitive node that replaces the failure primitive node of the preceding state machine and whose success and failure jump pointers point to the root node of that subsequent primitive state machine;
and deleting the depth linked list and returning the global root pointer to obtain the primitive state machine corresponding to the keyword feature set.
4. The primitive state machine based high-speed character string feature matching method according to claim 3, wherein the step of performing state machine merging processing according to the primitive node types of the current state machine and the target state machine specifically comprises:
replacing the 'failure' primitive node of the current state machine with the 'failure' node of the target state machine;
when the root nodes of the two state machines point to the primitive node of 'character skipping', respectively pointing pointers of a source node and a target node to the successfully skipped nodes of the primitive node of the character skipping, and deleting the 'character skipping nodes' of the current state machine;
when pointers of a source node and a target node point to a character string type primitive, adding all character string characteristics in the source node into the target node, ensuring that a successfully-skipped pointing node of the character string characteristics is unchanged, and adding all subsequent nodes starting from the successfully-skipped pointing node into a target state machine;
the source node is a current node of a current state machine; the target node is a target node of a target state machine.
5. The method for high-speed string feature matching based on primitive state machine as claimed in claim 1, wherein the specific steps of performing the string matching process based on primitive state machine comprises:
reading a first primitive node into a current state node according to a global root pointer of a primitive state machine, and setting a next state as a null pointer; pointing the reading position of the character string to be matched to the first character of the character string;
the following processes are executed in a loop until the current state is a hit primitive state or a failure primitive state:
executing a corresponding primitive node matching function according to the primitive type of the current state, and advancing the reading position of the character string to be matched according to the primitive node function;
selecting, based on the execution result of the current primitive, the address pointed to by the success pointer or the failure pointer as the address from which the next primitive node is read;
and reading the next primitive node into the current state node.
6. A high-speed character string feature matching device based on a primitive state machine is characterized in that the matching device specifically comprises:
the keyword feature set library is used for acquiring a keyword feature set input by a user and storing each keyword feature as a feature serial number together with a character string that holds the character sequence of the keyword;
the keyword feature compiler is used for compiling the keyword feature set based on a predefined basic operation primitive according to a preset keyword feature set compiling method to obtain a primitive state machine corresponding to the keyword feature set;
the basic operation primitive includes:
skip character primitives: skipping N characters forward or backward at the current input character reading position;
matching multiple string primitives at the current location: starting to match a plurality of character string features at a current input character position;
searching for matching multiple string primitives: searching and matching a plurality of character string characteristics from the current input character position;
jump to the specified match location primitive: pointing the current reading position to the designated position of the input character string;
successful hit primitive: after the matching process is finished, returning the feature sequence number of the matched keywords and the finishing position of the matching;
a failure primitive: after the matching process is finished, returning failure information;
the preset keyword feature set compiling method specifically comprises the following steps:
creating a successful hit primitive node and a failure primitive node as basic primitive nodes, initializing relevant variables, pointing a failure pointer to the failure primitive node, and pointing a current node pointer to a null node;
acquiring a keyword feature set to be compiled, reading a current character and moving a reading pointer backwards, performing syntactic analysis according to the type of the current character and/or the type of a next character, and compiling different characters or character combinations according to a preset compiling method to construct a primitive state machine; obtaining a primitive state machine corresponding to each keyword;
according to the depth of the root node of the primitive state machine corresponding to the keyword, merging the primitive state machines with the same depth, and then performing aggregation according to the principle from shallow to deep to obtain the primitive state machine corresponding to the keyword feature set;
the primitive state machine information base is used for storing the corresponding keyword feature set primitive state machine compiled by the keyword feature compiler;
and the state machine execution engine module is used for acquiring a target character string to be matched, executing a character string matching process based on the primitive state machine, acquiring a hit keyword, and outputting a characteristic serial number of the hit keyword and the end position of the current matched character.
7. An extraction device for high-speed string feature matching based on a primitive state machine, comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor, when executing the computer program, implements the steps of the high-speed string feature matching method based on the primitive state machine according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the steps of the method for high-speed string feature matching based on a primitive state machine according to any one of claims 1 to 5.
CN202110801808.XA 2021-07-15 2021-07-15 High-speed character string feature matching method, device and equipment based on primitive state machine Active CN113505585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110801808.XA CN113505585B (en) 2021-07-15 2021-07-15 High-speed character string feature matching method, device and equipment based on primitive state machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110801808.XA CN113505585B (en) 2021-07-15 2021-07-15 High-speed character string feature matching method, device and equipment based on primitive state machine

Publications (2)

Publication Number Publication Date
CN113505585A CN113505585A (en) 2021-10-15
CN113505585B true CN113505585B (en) 2023-03-21

Family

ID=78012981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110801808.XA Active CN113505585B (en) 2021-07-15 2021-07-15 High-speed character string feature matching method, device and equipment based on primitive state machine

Country Status (1)

Country Link
CN (1) CN113505585B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154228A (en) * 2006-09-27 2008-04-02 西门子公司 Partitioned pattern matching method and device thereof
CN102857493A (en) * 2012-06-30 2013-01-02 华为技术有限公司 Content filtering method and device
CN108170812A (en) * 2017-12-29 2018-06-15 迈普通信技术股份有限公司 A kind of data filtering method and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8775457B2 (en) * 2010-05-31 2014-07-08 Red Hat, Inc. Efficient string matching state machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
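A Finite-State-Machine based string matching system for Intrusion Detection on High-Speed Networks; Gerald Tripp et al.; The 14th EICAR annual conference; 20050503; entire document *
Implementation method of high-speed string matching co-processing in intrusion detection systems (入侵检测系统中高速字符串匹配协处理的实现方法); 张克农 et al.; 《微电子学与计算机》; 20061231; entire document *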

Also Published As

Publication number Publication date
CN113505585A (en) 2021-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant