CN112579839A - Multi-mode matching method and device for large-scale features and storage medium - Google Patents

Multi-mode matching method and device for large-scale features and storage medium

Info

Publication number
CN112579839A
Authority
CN
China
Prior art keywords
feature
features
text
prefix
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910945379.6A
Other languages
Chinese (zh)
Other versions
CN112579839B (en)
Inventor
李博
吕群
杨龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Qianxin Safety Technology Zhuhai Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Qianxin Safety Technology Zhuhai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd, Qianxin Safety Technology Zhuhai Co Ltd filed Critical Qianxin Technology Group Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The invention discloses a multi-mode matching method, device, storage medium and computer equipment for large-scale features. The method comprises: acquiring a feature set and dividing the features in the feature set into state machine features and moving table features; constructing a finite state automaton according to the state machine features; constructing a feature list, an infix jump table and a prefix hash table according to the moving table features; acquiring a text to be processed and scanning it with the finite state automaton to obtain a first feature matching result; scanning the text to be processed with the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result; and obtaining the feature matching result of the text to be processed according to the first and second feature matching results. The method exploits the high stability of the finite state automaton and uses the feature list to save a large amount of memory, so that the multi-mode matching algorithm achieves both high performance and low memory usage.

Description

Multi-mode matching method and device for large-scale features and storage medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a multi-mode matching method and device for large-scale features, a storage medium and computer equipment.
Background
Pattern matching is a basic string operation in data structures. According to the number of patterns matched, it is divided into single-pattern matching and multi-pattern matching (referred to herein as multi-mode matching), where multi-pattern matching means finding every occurrence of every pattern as a substring of the string being searched. Multi-mode matching algorithms are widely used in keyword filtering, intrusion detection, virus detection, word segmentation and similar fields.
In the prior art, most multi-mode matching methods have pronounced strengths and weaknesses. For example, methods with high stability tend to occupy a large amount of memory, which degrades matching performance, whereas methods that occupy little space tend to have unstable performance and a higher error rate. As the volume of data to be processed keeps growing, these shortcomings make such algorithms increasingly unsuitable, so there is a pressing need for a multi-mode matching method that balances matching performance against memory usage.
Disclosure of Invention
In view of this, the present application provides a multi-mode matching method, device, storage medium and computer equipment for large-scale features, with the main aim of solving the technical problem that existing multi-mode matching methods cannot balance performance and memory usage in large-scale feature scenarios.
According to a first aspect of the present invention, there is provided a multi-mode matching method for large-scale features, the method comprising:
acquiring a feature set, and dividing the features in the feature set into state machine features and moving table features;
constructing a finite state automaton according to the state machine features; constructing a feature list, an infix jump table and a prefix hash table according to the moving table features;
acquiring a text to be processed, and scanning the text to be processed by using the finite state automaton to obtain a first feature matching result; scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result;
and obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
In one embodiment, dividing the features in the feature set into state machine features and moving table features comprises: traversing the features in the feature set and judging whether the string length of a feature is smaller than a preset string length; if so, classifying the feature as a state machine feature; if not, judging whether the feature is a fuzzy matching feature, a jump matching feature, a regular-expression matching feature or a case-insensitive feature; if so, classifying the feature as a state machine feature; if not, judging whether the hit frequency of the feature is within a preset range; if so, classifying the feature as a state machine feature; and if not, classifying the feature as a moving table feature.
In one embodiment, constructing the finite state automaton according to the state machine features comprises: storing the state machine features in a file in the form of nodes; and performing serialized pattern compilation on each node stored in the file to obtain the finite state automaton.
In one embodiment, constructing the feature list, the infix jump table and the prefix hash table according to the moving table features comprises: storing the moving table features in a file in binary form to generate the feature list; constructing the infix jump table and the prefix hash table respectively according to the infixes and prefixes of the features in the feature list and their index information; and compressing and encrypting the feature list.
In one embodiment, scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result comprises: performing infix matching on the character strings of the text to be processed by using the infix jump table; when a character string of the text to be processed hits an infix in the infix jump table, performing prefix matching on that character string by using the prefix hash table; when the character string hits a prefix in the prefix hash table, searching, by binary search, the infixes corresponding to the hit prefix; traversing the feature list and searching it, according to the prefix and the infix hit by the character string, for features that match the character string; and generating the second feature matching result for the features that match.
In one embodiment, traversing the feature list and searching it for features that match the character string of the text to be processed comprises: determining a traversal region of the feature list according to the prefix and the infix hit by the character string of the text to be processed; mapping the traversal region of the feature list into memory and establishing a sliding window over the mapped region; and sequentially searching the sliding window for features that match the character string of the text to be processed.
In one embodiment, obtaining a feature matching result of a text to be processed according to the first feature matching result and the second feature matching result includes: merging the first feature matching result and the second feature matching result to generate a feature matching result of the text to be processed; the feature matching result of the text to be processed comprises the ID value set of the successfully matched features and the position information of each successfully matched feature.
According to a second aspect of the present invention, there is provided a multi-mode matching device for large-scale features, the device comprising:
a feature dividing module, configured to acquire a feature set and divide the features in the feature set into state machine features and moving table features;
a model construction module, configured to construct a finite state automaton according to the state machine features, and to construct a feature list, an infix jump table and a prefix hash table according to the moving table features;
a text scanning module, configured to acquire a text to be processed and scan the text to be processed by using the finite state automaton to obtain a first feature matching result, and to scan the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result;
and a result generation module, configured to obtain the feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
In one embodiment, the feature dividing module comprises:
a character string length judging unit for judging whether the character string length of the feature is smaller than a preset character string length;
the characteristic type judging unit is used for judging whether the characteristic is a fuzzy matching characteristic, a jump matching characteristic, a regular matching characteristic or a case insensitive characteristic;
a hit frequency judging unit for judging whether the hit frequency of the feature is within a preset range;
a dividing result generating unit, configured to classify as state machine features the features whose string length is smaller than the preset string length, the fuzzy matching features, the jump matching features, the regular-expression matching features, the case-insensitive features, and the features whose hit frequency is within the preset range; and to classify the remaining features in the feature set as moving table features.
In one embodiment, the model building module comprises a state machine building unit comprising:
the state machine characteristic storage subunit is used for storing the state machine characteristics in a file in a node form;
and the mode compiling subunit is used for carrying out serialized mode compiling on each node stored in the file to obtain the finite state automaton.
In one embodiment, the model construction module includes a moving table building unit comprising:
a moving table feature storage subunit, configured to store the moving table features in a file in binary form to generate the feature list;
an index table constructing subunit, configured to construct the infix jump table and the prefix hash table respectively according to the infixes and prefixes of the features in the feature list and their index information;
and a feature list encryption subunit, configured to compress and encrypt the feature list.
In one embodiment, the text scanning module includes a state machine scanning unit and a moving table scanning unit, wherein the moving table scanning unit includes:
an infix matching subunit, configured to perform infix matching on the character strings of the text to be processed by using the infix jump table;
a prefix matching subunit, configured to perform prefix matching on a character string of the text to be processed by using the prefix hash table when that character string hits an infix in the infix jump table;
the binary search subunit is used for searching the infixes corresponding to the prefixes by a binary search method according to the prefixes hit by the character strings of the texts to be processed when the character strings of the texts to be processed hit the prefixes of the prefix hash table;
the characteristic searching subunit is used for traversing the characteristic list and searching the characteristic matched with the character string of the text to be processed in the characteristic list according to the prefix and the infix hit by the character string of the text to be processed;
and the result generating subunit is used for generating a second feature matching result aiming at the feature matched with the character string of the text to be processed.
In one embodiment, the feature lookup subunit includes:
the traversal region determining subunit is used for determining a traversal region of the feature list according to the prefixes and the infixes hit by the character strings of the text to be processed;
the sliding window establishing subunit is used for mapping the traversal area of the feature list in the memory and establishing a sliding window for the traversal area of the feature list in the memory;
and the characteristic sequence searching subunit is used for sequentially searching the characteristics matched with the character strings of the text to be processed in the sliding window.
In one embodiment, the result generating module is further configured to merge the first feature matching result and the second feature matching result to generate a feature matching result of the text to be processed; the feature matching result of the text to be processed comprises the ID value set of the successfully matched features and the position information of each successfully matched feature.
According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described multimode matching method for large-scale features.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described method of multi-mode matching of large-scale features when executing the program.
The invention provides a multi-mode matching method, device, storage medium and computer equipment for large-scale features. The invention exploits the high stability of the finite state automaton and uses the feature list to save a large amount of memory, so that the multi-mode matching algorithm achieves both high performance and low memory usage, thereby improving feature matching efficiency and reducing memory consumption.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a multi-mode matching method for large-scale features according to an embodiment of the present invention;
FIG. 2 is a flow chart of another multi-mode matching method for large-scale features according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a multi-mode matching device with large scale features according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another multi-mode matching device for large-scale features according to an embodiment of the invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In one embodiment, as shown in FIG. 1, a multi-mode matching method for large-scale features is provided. The method is described here as applied to a server and includes the following steps:
101. Acquiring a feature set, and dividing the features in the feature set into state machine features and moving table features.
The feature set refers to a set of features to be matched extracted from a data source. For example, in the intrusion detection scenario, the feature set refers to a set of feature patterns extracted from one or more network packets, and since these feature patterns are usually abstractly summarized from a variety of known attack behaviors, the attack that may occur can be detected by matching the text to be processed with these feature patterns. Further, the features can be obtained through various channels, such as manual extraction, acquisition by a sensor, or acquisition by learning.
Specifically, the server may divide the features into state machine features and moving table features according to the string length, feature type and hit frequency of each feature in the feature set. In this embodiment, the server may classify features with high hit frequency, short string length or complex types as state machine features, and classify features with low hit frequency, long string length and ordinary types as moving table features. This division makes full use of the high stability of the state machine to match features accurately, and uses the low memory footprint of the feature list to reduce performance loss.
102. Constructing a finite state automaton according to the state machine features; and constructing a feature list, an infix jump table and a prefix hash table according to the moving table features.
Specifically, the server stores the state machine features in a file in the form of nodes and then performs serialized pattern compilation on each node stored in the file to obtain the finite state automaton. The server then stores the moving table features in a file in binary form to generate the feature list, constructs the infix jump table and the prefix hash table respectively according to the infixes and prefixes of the features in the feature list and their index information, and finally compresses and encrypts the feature list.
103. Acquiring a text to be processed, and scanning the text to be processed by using the finite state automaton to obtain a first feature matching result; and scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result.
Specifically, when scanning, the text to be processed is used as the input of the finite state automaton and drives its state transitions; a pattern match is reported whenever certain specific states are reached. Over the whole feature matching process, the automaton needs to scan the text only once to find the positions of all state machine features that occur in it, which yields the first feature matching result.
Further, after the first feature matching result is obtained, infix matching and prefix matching may be performed on the text to be processed by using the infix jump table and the prefix hash table. If both an infix and a prefix are hit, a traversal region is delimited in the feature list, and the features that match the text to be processed are then searched for within that memory-mapped traversal region, which yields the second feature matching result.
In this embodiment, scanning for the features with high hit frequency, short string length or complex types with the state machine yields matching results with high accuracy and good stability, while scanning for the features with low hit frequency, long string length and ordinary types with the infix jump table, the prefix hash table and the feature list effectively reduces the memory consumed by the feature library and speeds up feature matching.
104. Obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
Specifically, the server may obtain the feature matching result of the text to be processed by merging the first feature matching result and the second feature matching result. In this embodiment, the feature matching result includes the set of ID values of the successfully matched features and the position information of each successfully matched feature. For example, the feature matching result may be stored as a byte array; once the matching result has been obtained, the position information and the corresponding ID value of a successfully matched feature can be obtained simply by reading the value at the corresponding position.
In this embodiment, the acquired feature set is first divided into state machine features and moving table features. A finite state automaton is then constructed from the state machine features, and a feature list, an infix jump table and a prefix hash table are constructed from the moving table features. Next, the text to be processed is scanned first with the finite state automaton and then with the infix jump table, the prefix hash table and the feature list, producing two matching results, from which the final feature matching result of the text to be processed is obtained. This embodiment exploits the high stability of the finite state automaton and uses the feature list to save a large amount of memory, so that the multi-mode matching algorithm achieves both high performance and low memory usage, thereby improving feature matching efficiency and reducing memory consumption.
Further, as a refinement and an extension of the specific implementation of the above embodiment, in order to fully illustrate the implementation process of the present embodiment, a method for multi-mode matching of large-scale features is provided, as shown in fig. 2, the method includes the following steps:
201. Acquiring a feature set, and dividing the features in the feature set into state machine features and moving table features according to the string length, feature type and hit frequency of each feature.
Specifically, after acquiring the feature set, the server traverses its features. For each feature, the server first judges whether its string length is smaller than a preset string length; if so, the feature is classified as a state machine feature. If not, the server judges whether the feature is a fuzzy matching feature, a jump matching feature, a regular-expression matching feature or a case-insensitive feature; if so, the feature is classified as a state machine feature. If not, the server judges whether the hit frequency of the feature is within a preset range; if so, the feature is classified as a state machine feature, and otherwise it is classified as a moving table feature.
The preset string length may be a value between 6 and 10, or another value. The preset hit-frequency range may be, for example, the features whose hit rate ranks within the top N of the feature set, where N is a value between 10,000 and 100,000.
In this embodiment, the server may classify features with high hit frequency, short string length or complex types as state machine features, and classify features with low hit frequency, long string length and ordinary types as moving table features. This division makes full use of the high stability of the state machine to match features accurately, and uses the low memory footprint of the feature list to reduce performance loss.
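The three-stage decision in step 201 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Feature` record, the `kind` labels, the `hit_rank` field and both threshold values are assumptions chosen to fall inside the example ranges given above.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the text only gives example ranges.
PRESET_LENGTH = 8          # "preset string length" (6 to 10 in the text)
HOT_RANK_LIMIT = 10_000    # features ranked within the top N by hit rate

# Feature types that the text routes to the state machine.
COMPLEX_KINDS = {"fuzzy", "jump", "regex", "case_insensitive"}


@dataclass(frozen=True)
class Feature:
    feature_id: int
    pattern: bytes
    kind: str = "plain"       # "plain" or one of COMPLEX_KINDS (assumed labels)
    hit_rank: int = 10**9     # 1 = most frequently hit; large = rarely hit


def is_state_machine_feature(f: Feature) -> bool:
    """Apply the checks in the order given in step 201."""
    if len(f.pattern) < PRESET_LENGTH:      # short strings
        return True
    if f.kind in COMPLEX_KINDS:             # complex feature types
        return True
    return f.hit_rank <= HOT_RANK_LIMIT     # frequently hit features


def partition(features):
    """Split features into (state machine features, moving table features)."""
    state_machine, moving_table = [], []
    for f in features:
        target = state_machine if is_state_machine_feature(f) else moving_table
        target.append(f)
    return state_machine, moving_table
```

Only a long, plain, rarely hit pattern falls through all three checks and ends up in the moving table group.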
202. Constructing the finite state automaton according to the state machine features.
Specifically, the server may store the state machine features in a file in the form of nodes, and then perform serialized pattern compilation on each node stored in the file to obtain the finite state automaton.
203. Constructing a feature list, an infix jump table and a prefix hash table according to the moving table features.
Specifically, the server stores the moving table features in a file in binary form to generate the feature list, then constructs the infix jump table and the prefix hash table respectively according to the infixes and prefixes of the features in the feature list and their index information, and finally compresses and encrypts the feature list.
204. Acquiring a text to be processed, and scanning the text to be processed by using the finite state automaton to obtain a first feature matching result.
Specifically, when scanning, the text to be processed is used as the input of the finite state automaton and drives its state transitions; a pattern match is reported whenever certain specific states are reached. Over the whole feature matching process, the automaton needs to scan the text only once to find the positions of all state machine features that occur in it, which yields the first feature matching result.
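The single-pass behaviour described in step 204 can be illustrated with an Aho-Corasick automaton. The text does not name a specific automaton construction, so Aho-Corasick is an assumption here, and the sketch handles only plain byte patterns (not the fuzzy, jump or regular-expression features) and omits the node file and serialization steps.

```python
from collections import deque


def build_automaton(patterns):
    """patterns: dict mapping feature ID -> non-empty bytes pattern."""
    goto = [{}]    # goto[state][byte] -> next state
    fail = [0]     # failure link per state
    out = [[]]     # out[state] -> list of (feature_id, pattern_length)
    for fid, pat in patterns.items():
        state = 0
        for byte in pat:
            nxt = goto[state].get(byte)
            if nxt is None:
                nxt = len(goto)
                goto[state][byte] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append((fid, len(pat)))
    # Breadth-first pass: compute failure links and inherit outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, child in goto[state].items():
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(byte, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def scan(text, automaton):
    """Return (feature_id, start_position) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, byte in enumerate(text):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for fid, length in out[state]:
            hits.append((fid, i - length + 1))
    return hits
```

Each input byte advances the automaton exactly once, and overlapping features (such as `he` inside `she`) are reported through the inherited output lists.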
205. Scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result.
In this embodiment, the server may perform infix matching and prefix matching on the text to be processed by using the infix jump table and the prefix hash table. If both an infix and a prefix are hit, a traversal region is delimited in the feature list, and the features that match the text to be processed are then searched for within that memory-mapped traversal region, which yields the second feature matching result.
Specifically, infix matching may first be performed on the character strings of the text to be processed by using the infix jump table. When a character string hits an infix in the infix jump table, prefix matching is performed on it by using the prefix hash table. When the character string also hits a prefix in the prefix hash table, the infixes corresponding to the hit prefix are searched by binary search, and the feature list is then traversed to find the features that match the character string.
The process of traversing the feature list is as follows: a traversal region of the feature list is determined from the hit prefix and the hit infix; the traversal region of the feature list is then mapped into memory and a sliding window is established over the mapped region; finally, the sliding window is searched sequentially for features that match the character string of the text to be processed.
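The traversal just described can be sketched with Python's `mmap` module. The on-disk record layout (a 4-byte feature ID and a 2-byte length followed by the pattern bytes) and the function names are assumptions for illustration; the text does not specify the file format, and the compression and encryption steps are omitted. For simplicity the whole file is mapped and the window is the byte range `[start, end)` inside the mapping.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<IH")   # hypothetical record header: feature ID, pattern length


def write_feature_list(path, features):
    """Write (feature_id, pattern) records; return each record's byte offset plus the end offset."""
    offsets = []
    with open(path, "wb") as fh:
        for fid, pattern in features:
            offsets.append(fh.tell())
            fh.write(HEADER.pack(fid, len(pattern)))
            fh.write(pattern)
        offsets.append(fh.tell())
    return offsets


def search_region(path, start, end, text, pos):
    """Return IDs of features stored in file bytes [start, end) that occur in text at pos."""
    matched = []
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        cursor = start                       # the window slides record by record
        while cursor < end:
            fid, length = HEADER.unpack_from(mm, cursor)
            body = cursor + HEADER.size
            if text.startswith(mm[body:body + length], pos):
                matched.append(fid)
            cursor = body + length
    return matched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "features.bin")
        offs = write_feature_list(path, [(1, b"alpha-one"), (2, b"alpha-two"), (3, b"beta")])
        # Traverse only the region holding the two "alpha" records.
        print(search_region(path, offs[0], offs[2], b"xx alpha-two yy", 3))
```

Because only the mapped pages that the window touches are read, the rest of the feature list never has to be loaded.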
It should be noted that, in this embodiment, infix matching and prefix matching are performed on the text to be processed by using the infix jump table and the prefix hash table, which greatly reduces memory usage during matching. The feature list needs to be traversed only when the text to be processed hits both an infix and a prefix, and even then only the narrowed traversal region needs to be searched for the corresponding features, so memory usage remains very small.
In addition, the purpose of scanning the text to be processed with the infix jump table, the prefix hash table and the feature list is to match the moving table features in the text. Because the moving table features comprise a large number of uncommon features with low hit frequency, long string length and ordinary types, handling them in this way greatly improves the accuracy and efficiency of feature matching and saves memory.
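A minimal in-memory sketch of the filter order in step 205 (infix check, then prefix lookup, then binary search, then traversal) is given below. The prefix and infix widths, the choice of taking the infix immediately after the prefix, the bitmap form of the infix jump table and the sorted in-memory feature list are all assumptions; the text does not define these layouts.

```python
from bisect import bisect_left, bisect_right

PREFIX_LEN = 4                       # hypothetical prefix width
INFIX_LEN = 2                        # hypothetical infix width, taken right after the prefix
KEY_LEN = PREFIX_LEN + INFIX_LEN


def build_tables(features):
    """features: iterable of (feature_id, pattern) with len(pattern) >= KEY_LEN."""
    feature_list = sorted((pattern, fid) for fid, pattern in features)
    keys = [pattern[:KEY_LEN] for pattern, _ in feature_list]   # prefix + infix per entry
    infix_table = bytearray(256 ** INFIX_LEN)    # 1 = some feature carries this infix
    prefix_table = {}                            # prefix -> [lo, hi) range in feature_list
    for idx, (pattern, _) in enumerate(feature_list):
        infix_table[int.from_bytes(pattern[PREFIX_LEN:KEY_LEN], "big")] = 1
        lo, _ = prefix_table.get(pattern[:PREFIX_LEN], (idx, idx))
        prefix_table[pattern[:PREFIX_LEN]] = (lo, idx + 1)
    return feature_list, keys, infix_table, prefix_table


def scan(text, tables):
    """Return (feature_id, start_position) for every moving table feature found in text."""
    feature_list, keys, infix_table, prefix_table = tables
    hits = []
    for pos in range(len(text) - KEY_LEN + 1):
        infix = text[pos + PREFIX_LEN:pos + KEY_LEN]
        if not infix_table[int.from_bytes(infix, "big")]:
            continue                                 # infix miss: skip this position
        span = prefix_table.get(text[pos:pos + PREFIX_LEN])
        if span is None:
            continue                                 # prefix miss
        key = text[pos:pos + KEY_LEN]
        lo = bisect_left(keys, key, span[0], span[1])    # binary search inside the
        hi = bisect_right(keys, key, span[0], span[1])   # prefix's range for the infix
        for pattern, fid in feature_list[lo:hi]:         # narrowed traversal region
            if text.startswith(pattern, pos):
                hits.append((fid, pos))
    return hits
```

Most text positions are rejected by the one-byte bitmap lookup, so the dictionary lookup, the binary search and the full comparisons run only for the few positions that pass both filters.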
206. Obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
Specifically, the server may obtain the feature matching result of the text to be processed by merging the first feature matching result and the second feature matching result. In this embodiment, the feature matching result includes the set of ID values of the successfully matched features and the position information of each successfully matched feature. For example, the feature matching result may be stored as a byte array; once the matching result has been obtained, the position information and the corresponding ID value of a successfully matched feature can be obtained simply by reading the value at the corresponding position.
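The merge and the byte-array storage can be sketched as follows. The fixed 8-byte record layout (feature ID followed by start position) and the function names are assumptions; the text only states that the result holds the matched ID set and the positions and may be stored as a byte array.

```python
import struct

RECORD = struct.Struct("<II")   # hypothetical layout: feature ID, start position


def merge_results(first, second):
    """Merge two lists of (feature_id, position) hits, dropping duplicates."""
    merged = sorted(set(first) | set(second), key=lambda hit: (hit[1], hit[0]))
    matched_ids = {fid for fid, _ in merged}
    return matched_ids, merged


def pack_hits(hits):
    """Store the hits as one contiguous byte array of fixed-size records."""
    return b"".join(RECORD.pack(fid, pos) for fid, pos in hits)


def read_hit(buf, index):
    """Read the (feature_id, position) pair stored at record number index."""
    return RECORD.unpack_from(buf, index * RECORD.size)
```

With fixed-size records, the ID and position of the n-th hit are recovered by reading at offset `n * RECORD.size`, without parsing the rest of the array.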
According to the method, a first feature matching result with high matching precision can be obtained by scanning a state machine with high hit rate, short character string length and complex type feature utilization performance, then, a second feature matching result with high matching efficiency can be further obtained by scanning a large number of features with low hit frequency, long character string length and common type with a feature list, a prefix skip list and a prefix hash table which occupy small space, and finally, an accurate feature matching result can be obtained by merging two feature matching results.
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, the embodiment provides a multi-mode matching device for large-scale features. As shown in fig. 3, the device includes: a feature dividing module 31, a model building module 32, a text scanning module 33, and a result generating module 34, wherein,
the feature dividing module 31 is configured to acquire a feature set, and divide features in the feature set into state machine features and mobile table features;
the model building module 32 is configured to construct a finite state automaton according to the state machine features, and to construct a feature list, an infix jump table and a prefix hash table according to the mobile table features;
the text scanning module 33 is configured to acquire a text to be processed, and scan the text to be processed by using the finite state automaton to obtain a first feature matching result; and to scan the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result;
and the result generating module 34 is configured to obtain a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
In a specific application scenario, the feature dividing module 31 includes:
a character string length determining unit 311, configured to determine whether a character string length of the feature is smaller than a preset character string length;
a feature type determining unit 312, configured to determine whether a feature is a fuzzy matching feature, a jump matching feature, a regular matching feature, or a case insensitive feature;
a hit frequency determining unit 313 for determining whether the hit frequency of the feature is within a predetermined range;
the division result generating unit 314 is configured to divide the features whose character string length is smaller than the preset character string length, the fuzzy matching features, the jump matching features, the regular matching features, the case insensitive features, and the features whose hit frequency is within the preset range into state machine features; and to divide the features in the feature set other than the state machine features into mobile table features.
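The division rule applied by units 311 to 314 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the `Feature` record, the type labels, and the default thresholds `min_length` and `freq_range` are all hypothetical, since the text only says that the length threshold and the frequency range are preset.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Feature:
    feature_id: int
    pattern: str
    kind: str = "plain"          # assumed labels: plain, fuzzy, jump, regex, nocase
    hit_frequency: float = 0.0   # assumed to be a ratio between 0 and 1


# Feature types that are always routed to the state machine.
COMPLEX_KINDS = {"fuzzy", "jump", "regex", "nocase"}


def is_state_machine_feature(
    f: Feature, min_length: int = 8, freq_range: Tuple[float, float] = (0.5, 1.0)
) -> bool:
    """Apply the three checks in order: length, type, hit frequency."""
    if len(f.pattern) < min_length:
        return True
    if f.kind in COMPLEX_KINDS:
        return True
    low, high = freq_range
    return low <= f.hit_frequency <= high


def divide_features(features: Iterable[Feature], **thresholds) -> Tuple[List[Feature], List[Feature]]:
    """Return (state machine features, mobile table features)."""
    state_machine, mobile_table = [], []
    for f in features:
        target = state_machine if is_state_machine_feature(f, **thresholds) else mobile_table
        target.append(f)
    return state_machine, mobile_table
```

Only features that fail all three checks, that is, long, plain, rarely hit strings, end up in the mobile table set.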
In a specific application scenario, the model building module 32 includes a state machine building unit 321, and the state machine building unit 321 includes:
a state machine feature storage subunit 3211 configured to store the state machine features in a file in the form of nodes;
the pattern compiling subunit 3212 may be configured to perform serialized pattern compilation on each node stored in the file, so as to obtain a finite state automaton.
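The text does not say how the finite state automaton is compiled. As a stand-in, the following sketch builds a classic Aho-Corasick automaton over plain string features and scans a text with it. The function names and the table layout are assumptions, and fuzzy, jump, regular and case insensitive features as well as the file-based node storage are not modeled.

```python
from collections import deque
from typing import Dict, List, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[List[int]]]


def build_automaton(patterns: Dict[int, str]) -> Automaton:
    """Build goto, failure and output tables from {feature_id: pattern}."""
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[List[int]] = [[]]
    # Phase 1: insert every pattern into a trie.
    for feature_id, pattern in patterns.items():
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(feature_id)
    # Phase 2: breadth-first pass that fills in failure links and
    # inherits the outputs reachable through them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan(text: str, automaton: Automaton) -> List[Tuple[int, int]]:
    """Return (feature_id, end_index) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((feature_id, pos) for feature_id in out[state])
    return hits
```

The automaton reads each character of the text once, which is why it suits the short, frequently hit features; its tables grow with the total pattern length, which is the memory cost the mobile table path is meant to avoid.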
In a specific application scenario, the model building module 32 includes a mobile table building unit 322, where the mobile table building unit 322 includes:
a mobile table feature storage subunit 3221, configured to store the mobile table features in a file in a binary value form, and generate a feature list;
an index table constructing subunit 3222, configured to construct an infix jump table and a prefix hash table, respectively, according to the index information of the prefixes and the infixes of the features in the feature list;
the feature list encryption sub-unit 3223 may be configured to perform compression processing and encryption processing on the feature list.
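A minimal sketch of how the three structures might be built is given below. Everything about the layout is an assumption: the prefix is taken as the first `PREFIX_LEN` characters, the infix as the `INFIX_LEN` characters that follow it, the infix jump table is modeled as a plain set, and the feature list is serialized as length-prefixed binary records compressed with zlib. Encryption is omitted because the text does not name a cipher.

```python
import struct
import zlib
from typing import Dict, List, Set, Tuple

PREFIX_LEN = 2  # assumed prefix width
INFIX_LEN = 2   # assumed infix width; the infix starts right after the prefix
HEADER = struct.Struct("<IH")  # assumed record header: feature ID, pattern byte length

Entry = Tuple[str, int]  # (pattern, feature_id)


def build_tables(features: Dict[int, str]) -> Tuple[List[Entry], Set[str], Dict[str, Tuple[int, int]]]:
    """Return (sorted feature list, infix table, prefix -> index range)."""
    # Sorting by pattern keeps features that share a prefix contiguous.
    entries = sorted((pattern, fid) for fid, pattern in features.items())
    prefix_table: Dict[str, Tuple[int, int]] = {}
    infix_table: Set[str] = set()
    for index, (pattern, _fid) in enumerate(entries):
        prefix = pattern[:PREFIX_LEN]
        lo, _ = prefix_table.get(prefix, (index, index))
        prefix_table[prefix] = (lo, index + 1)  # half-open range [lo, hi)
        infix_table.add(pattern[PREFIX_LEN:PREFIX_LEN + INFIX_LEN])
    return entries, infix_table, prefix_table


def pack_feature_list(entries: List[Entry]) -> bytes:
    """Serialize the feature list as binary records and compress it."""
    parts = []
    for pattern, fid in entries:
        data = pattern.encode("utf-8")
        parts.append(HEADER.pack(fid, len(data)) + data)
    return zlib.compress(b"".join(parts))


def unpack_feature_list(blob: bytes) -> List[Entry]:
    """Inverse of pack_feature_list."""
    raw, pos, entries = zlib.decompress(blob), 0, []
    while pos < len(raw):
        fid, size = HEADER.unpack_from(raw, pos)
        pos += HEADER.size
        entries.append((raw[pos:pos + size].decode("utf-8"), fid))
        pos += size
    return entries
```

Because the list is sorted, the prefix hash table only needs to store one index range per prefix, which is what later allows the traversal region to be narrowed.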
In a specific application scenario, the text scanning module 33 includes a state machine scanning unit 331 and a mobile table scanning unit 332, where the mobile table scanning unit 332 includes:
the infix matching subunit 3321 is configured to perform infix matching on the character string of the text to be processed by using the infix jump table;
a prefix matching subunit 3322, configured to perform prefix matching on the character string of the text to be processed by using the prefix hash table when the character string of the text to be processed hits an infix of the infix jump table;
the binary search subunit 3323 is configured to, when the character string of the text to be processed hits a prefix of the prefix hash table, look up the infix corresponding to the prefix by binary search according to the prefix hit by the character string of the text to be processed;
the feature searching subunit 3324 is configured to traverse the feature list, and search the feature list for a feature matching the character string of the text to be processed according to the prefix and the infix hit by the character string of the text to be processed;
a result generating subunit 3325 operable to generate a second feature matching result for the features matching the character string of the text to be processed.
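The cascade of checks performed by these subunits (infix, then prefix, then binary search, then a narrowed traversal of the feature list) can be sketched as follows. The layout is assumed, not taken from the embodiment: `entries` is a feature list of `(pattern, feature_id)` pairs sorted by pattern, the infix jump table is modeled as a set of infixes, and the prefix hash table maps each prefix to a half-open index range in `entries`.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Set, Tuple

PREFIX_LEN = 2  # assumed prefix width
INFIX_LEN = 2   # assumed infix width; the infix starts right after the prefix
KEY_LEN = PREFIX_LEN + INFIX_LEN


def scan_mobile_table(
    text: str,
    entries: List[Tuple[str, int]],
    infix_table: Set[str],
    prefix_table: Dict[str, Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return (feature_id, start_index) for each mobile table feature found."""
    # Sorted prefix+infix keys; a real implementation would precompute these.
    keys = [pattern[:KEY_LEN] for pattern, _ in entries]
    hits = []
    for start in range(len(text) - KEY_LEN + 1):
        prefix = text[start:start + PREFIX_LEN]
        infix = text[start + PREFIX_LEN:start + KEY_LEN]
        if infix not in infix_table:       # infix matching: cheapest filter first
            continue
        span = prefix_table.get(prefix)    # prefix matching via the hash table
        if span is None:
            continue
        lo, hi = span
        # Binary search inside the prefix range for entries sharing the infix.
        first = bisect_left(keys, prefix + infix, lo, hi)
        last = bisect_right(keys, prefix + infix, lo, hi)
        # Traverse only the narrowed region of the feature list.
        for pattern, feature_id in entries[first:last]:
            if text.startswith(pattern, start):
                hits.append((feature_id, start))
    return hits
```

Most text positions are rejected by the first set lookup, so the feature list itself is touched only for the few positions that pass both the infix and the prefix check.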
In a specific application scenario, the feature searching subunit 3324 includes:
a traversal region determining subunit 3326 configured to determine a traversal region of the feature list according to the prefix and the infix hit by the character string of the text to be processed;
a sliding window setting subunit 3327, configured to map the traversal region of the feature list into memory, and set a sliding window for the traversal region of the feature list in memory;
a feature sequence lookup subunit 3328, which may be configured to sequentially lookup the feature matching the character string of the text to be processed in the sliding window.
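The sliding-window lookup can be sketched with Python's `mmap` module. The fixed-width record layout, the window size, and the function name are assumptions; for simplicity the sketch maps the whole feature list file and then reads only the traversal region window by window, whereas the embodiment describes mapping the traversal region itself.

```python
import mmap
import struct
from typing import List, Tuple

# Assumed fixed-width record: 4-byte feature ID + 28-byte NUL-padded pattern.
RECORD = struct.Struct("<I28s")
WINDOW_RECORDS = 2  # assumed number of records held in the window at once


def find_in_region(path: str, first: int, last: int, text: bytes, start: int) -> List[Tuple[int, int]]:
    """Check records [first, last) of the feature list file against text at start."""
    hits = []
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slide the window across the traversal region, a few records at a time.
        for win_lo in range(first, last, WINDOW_RECORDS):
            win_hi = min(win_lo + WINDOW_RECORDS, last)
            window = mm[win_lo * RECORD.size:win_hi * RECORD.size]
            # Sequentially compare each record in the window.
            for feature_id, raw in RECORD.iter_unpack(window):
                pattern = raw.rstrip(b"\0")
                if text.startswith(pattern, start):
                    hits.append((feature_id, start))
    return hits
```

Only the records inside the current window are copied out of the mapping at any moment, so peak memory use is bounded by the window size rather than by the size of the feature list.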
In a specific application scenario, the result generating module 34 may be further configured to merge the first feature matching result and the second feature matching result to generate a feature matching result of the text to be processed; the feature matching result of the text to be processed comprises the ID value set of the successfully matched features and the position information of each successfully matched feature.
It should be noted that, for other descriptions of the functional units of the multi-mode matching device for large-scale features provided in this embodiment, reference may be made to the corresponding descriptions of fig. 1 and fig. 2, which are not repeated herein.
Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the multi-mode matching method for large-scale features shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the present application.
Based on the foregoing methods shown in fig. 1 and fig. 2 and the embodiments of the multi-mode matching device for large-scale features shown in fig. 3 and fig. 4, in order to achieve the foregoing object, the present embodiment further provides a physical device for multi-mode matching of large-scale features, which may specifically be a personal computer, a server, a smart phone, a tablet computer, a smart watch, or another network device. The physical device includes a storage medium and a processor: the storage medium is used for storing a computer program, and the processor is used for executing the computer program to implement the methods shown in fig. 1 and fig. 2.
Optionally, the physical device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
Those skilled in the art will appreciate that the structure of the physical device for multi-mode matching of large-scale features provided in this embodiment does not constitute a limitation of the physical device, which may include more or fewer components, combine some components, or use a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device and supports the operation of the information processing program and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the physical device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical scheme of the application, the high stability of the finite state automaton is fully utilized, meanwhile, the feature list is effectively utilized, and a large amount of memory space is saved.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A multi-mode matching method for large-scale features, the method comprising:
acquiring a feature set, and dividing features in the feature set into state machine features and mobile table features;
constructing a finite state automaton according to the state machine features; constructing a feature list, an infix jump table and a prefix hash table according to the mobile table features;
acquiring a text to be processed, and scanning the text to be processed by using the finite state automaton to obtain a first feature matching result; scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result;
and obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
2. The method of claim 1, wherein said dividing the features in the feature set into state machine features and mobile table features comprises:
traversing the features in the feature set, and judging whether the character string length of a feature is smaller than a preset character string length;
if the character string length of the feature is smaller than the preset character string length, dividing the feature into the state machine features; if not, judging whether the feature is a fuzzy matching feature, a jump matching feature, a regular matching feature or a case insensitive feature;
if the feature is a fuzzy matching feature, a jump matching feature, a regular matching feature or a case insensitive feature, dividing the feature into the state machine features; if not, judging whether the hit frequency of the feature is within a preset range;
if the hit frequency of the feature is within the preset range, dividing the feature into the state machine features; if not, dividing the feature into the mobile table features.
3. The method of claim 1, wherein constructing a finite state automaton from the state machine features comprises:
storing the state machine features in a file in the form of nodes;
and performing serialized pattern compilation on each node stored in the file to obtain the finite state automaton.
4. The method of claim 1, wherein constructing a feature list, an infix jump table and a prefix hash table according to the mobile table features comprises:
storing the mobile table features in a file in a binary value form to generate the feature list;
respectively constructing the infix jump table and the prefix hash table according to the index information of the prefixes and the infixes of the features in the feature list;
and performing compression processing and encryption processing on the feature list.
5. The method of claim 1, wherein scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result comprises:
carrying out infix matching on the character strings of the text to be processed by using the infix jump table;
when the character string of the text to be processed hits the infix of the infix jump table, prefix matching is carried out on the character string of the text to be processed by utilizing a prefix hash table;
when the character string of the text to be processed hits a prefix of the prefix hash table, looking up the infix corresponding to the prefix by binary search according to the prefix hit by the character string of the text to be processed;
traversing the feature list, and searching the feature matched with the character string of the text to be processed in the feature list according to the prefix and the infix hit by the character string of the text to be processed;
and generating a second feature matching result aiming at the feature matched with the character string of the text to be processed.
6. The method of claim 5, wherein traversing the feature list and searching the feature list for a feature matching the character string of the text to be processed according to the prefix and the infix hit by the character string of the text to be processed comprises:
determining a traversal region of the feature list according to the prefix and the infix hit by the character string of the text to be processed;
mapping the traversal region of the feature list into memory, and setting a sliding window for the traversal region of the feature list in memory;
and sequentially searching, in the sliding window, for the feature matching the character string of the text to be processed.
7. The method according to claim 1, wherein obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result comprises:
merging the first feature matching result and the second feature matching result to generate a feature matching result of the text to be processed;
and the feature matching result of the text to be processed comprises the ID value set of the successfully matched features and the position information of each successfully matched feature.
8. A multi-mode matching device for large-scale features, the device comprising:
the feature dividing module is used for acquiring a feature set and dividing the features in the feature set into state machine features and mobile table features;
the model building module is used for constructing a finite state automaton according to the state machine features; constructing a feature list, an infix jump table and a prefix hash table according to the mobile table features;
the text scanning module is used for acquiring a text to be processed and scanning the text to be processed by using the finite state automaton to obtain a first feature matching result; scanning the text to be processed by using the infix jump table, the prefix hash table and the feature list to obtain a second feature matching result;
and the result generating module is used for obtaining a feature matching result of the text to be processed according to the first feature matching result and the second feature matching result.
9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.
CN201910945379.6A 2019-09-30 2019-09-30 Multi-mode matching method and device for large-scale features and storage medium Active CN112579839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910945379.6A CN112579839B (en) 2019-09-30 2019-09-30 Multi-mode matching method and device for large-scale features and storage medium

Publications (2)

Publication Number Publication Date
CN112579839A true CN112579839A (en) 2021-03-30
CN112579839B CN112579839B (en) 2022-07-01

Family

ID=75117054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910945379.6A Active CN112579839B (en) 2019-09-30 2019-09-30 Multi-mode matching method and device for large-scale features and storage medium

Country Status (1)

Country Link
CN (1) CN112579839B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935961A (en) * 2022-10-27 2023-04-07 安芯网盾(北京)科技有限公司 Multi-mode matching high-performance algorithm and device for realizing multi-stage matching

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881439A (en) * 2015-05-11 2015-09-02 中国科学院信息工程研究所 Method and system for space-efficient multi-pattern matching
CN105468588A (en) * 2014-05-30 2016-04-06 华为技术有限公司 Character string matching method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨波: "基于有限状态自动机的中文多模式匹配算法研究", 《万方数据知识服务平台》 *
范宇健: "大流量网络下串匹配算法的优化研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *

Also Published As

Publication number Publication date
CN112579839B (en) 2022-07-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant