US20200304414A9 - Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms - Google Patents

Info

Publication number
US20200304414A9
Authority
US
United States
Prior art keywords
pattern matching
patterns
communication traffic
input communication
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/411,369
Other versions
US20170195234A1 (en)
Inventor
Yitshak Yishay
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cognyte Technologies Israel Ltd
Original Assignee
Verint Systems Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Verint Systems Ltd
Priority to US15/411,369 priority Critical patent/US20200304414A9/en
Assigned to VERINT SYSTEMS LTD. reassignment VERINT SYSTEMS LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: YISHAY, YITSHAK
Publication of US20170195234A1 publication Critical patent/US20170195234A1/en
Publication of US20200304414A9 publication Critical patent/US20200304414A9/en
Assigned to Cognyte Technologies Israel Ltd reassignment Cognyte Technologies Israel Ltd CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: VERINT SYSTEMS LTD.
Current legal status: Abandoned

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/20 Traffic policing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F17/30864
    • G06F17/30985
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems
    • G06N5/047 Pattern matching networks; Rete networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Definitions

  • The present disclosure relates generally to data processing, and particularly to methods and systems for detecting strings in data.
  • Keyword searching techniques are used in a wide variety of applications. For example, in some applications, communication traffic is analyzed in an attempt to detect keywords that indicate traffic of interest. Some data security systems attempt to detect information that leaks from an organization network by detecting keywords in outgoing traffic. Intrusion detection systems sometimes identify illegitimate intrusion attempts by detecting keywords in traffic.
  • Various keyword searching techniques are known in the art. For example, Aho and Corasick describe an algorithm for locating occurrences of a finite number of keywords in a string of text, in “Efficient String Matching: An Aid to Bibliographic Search,” Communications of the ACM, volume 18, no. 6, June, 1975, pages 333-340, which is incorporated herein by reference. This technique is commonly known as the Aho-Corasick algorithm.
  • Yu et al. describe a multiple-pattern matching scheme, which uses Ternary Content-Addressable Memory (TCAM), in “Gigabit Rate Packet Pattern-Matching using TCAM,” Proceedings of the 12th IEEE International Conference on Network Protocols (ICNP), Berlin, Germany, Oct. 5-8, 2004, pages 174-183, which is incorporated herein by reference.
  • An embodiment that is described herein provides a method, including receiving input data to be searched for occurrences of a set of patterns, assigning the input data and the patterns to multiple different pattern matching algorithms, searching the input data using the pattern matching algorithms, evaluating a predefined metric, and reassigning the input data and the patterns to the pattern matching algorithms based on the evaluated metric.
  • In some embodiments, evaluating the predefined metric includes assessing a performance measure of the pattern matching algorithms. In other embodiments, evaluating the predefined metric includes assessing a characteristic of the input data. In yet other embodiments, assigning the input data and the patterns includes applying each of the pattern matching algorithms to search a respective subset of the input data for the occurrences of all the patterns.
  • reassigning the input data and the patterns includes reassigning a portion of the input data from a first pattern matching algorithm to a second pattern matching algorithm.
  • assigning the input data and the patterns includes defining one of the pattern matching algorithms as a primary algorithm and assigning a majority of the input data to the primary algorithm, and reassigning the input data and the patterns includes redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input data to the redefined primary algorithm.
  • assigning the input data and the patterns includes applying each of the pattern matching algorithms to search all the input data for the occurrences of a respective subset of the patterns.
  • evaluating the metric includes evaluating at least one metric type selected from a group of types consisting of: a volume of the input data processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input data.
  • an apparatus including an input circuit and a processor.
  • the input circuit is configured to receive input data to be searched for occurrences of a set of patterns.
  • the processor is configured to assign the input data and the patterns to multiple different pattern matching algorithms, to search the input data for the occurrences using the multiple algorithms, to evaluate a predefined metric, and to reassign the input data and the patterns to the pattern matching algorithms based on the evaluated metric.
  • FIG. 1 is a block diagram that schematically illustrates a system for keyword searching, in accordance with an embodiment of the present disclosure.
  • FIGS. 2 and 3 are block diagrams that schematically illustrate configurations of a processor in a system for keyword searching, in accordance with embodiments of the present disclosure.
  • FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure.
  • Embodiments that are described herein provide improved methods and systems for keyword spotting, i.e., for identifying textual phrases of interest in input data.
  • the input data comprises communication packets exchanged in a communication network.
  • the disclosed keyword spotting techniques can be used, for example, in applications such as Data Leakage Prevention (DLP), Intrusion Detection Systems (IDS) or Intrusion Prevention Systems (IPS), and spam e-mail detection.
  • a keyword spotting system holds a dictionary (or dictionaries) of textual phrases for searching input data.
  • the dictionary defines textual phrases to be located in communication packets—such as e-mail addresses or Uniform Resource Locators (URLs).
  • the dictionary comprises a large number of textual phrases, e.g., on the order of thousands or more, which may differ in size from one another.
  • Each textual phrase in the dictionary typically comprises a string of characters, and in some embodiments may comprise various wildcard characters.
  • the dictionary may change over time, e.g., textual phrases may be added, deleted or modified.
  • the textual phrases are also referred to as keywords or patterns.
  • pattern matching algorithms may be affected by many factors.
  • Example factors include the dictionary size, the alphabet size (i.e., the number of different characters in the data), the sizes (or the minimal size) of the searched patterns, and the characteristics of the input data.
  • an algorithm may suffer an attack (sometimes referred to as a “pattern matching algorithmic complexity attack” or “payload attack”) that may considerably reduce its efficiency.
  • the keyword spotting system assigns the input data and the patterns to multiple different pattern matching algorithms.
  • the system splits the input data traffic between two or more matching algorithms.
  • a dominant share of the traffic is handled by one algorithm and smaller traffic shares by the others.
  • The system monitors the algorithms' performance (by evaluating a respective metric) as they process the data to search for a match.
  • the ratio of traffic splitting among the algorithms is dynamically reassigned or adjusted to maximize the overall performance.
  • Two or more pattern matching algorithms, each assigned to a distinct dictionary, process the input data in parallel.
  • the patterns are split among the matching algorithms.
  • the input traffic is not split but is rather directed in full to each of the matching algorithms.
  • The dictionaries together include all the patterns to be searched. Again, the algorithms' performance is monitored and a respective metric is evaluated as they process the data. In response to changes in the data characteristics over time, patterns may be dynamically reassigned among the different dictionaries so that the corresponding algorithms achieve maximal overall performance.
  • the disclosed techniques enable the system to exploit the advantages and avoid the disadvantages of each pattern matching algorithm.
  • The presented embodiments make it possible to handle high-bandwidth traffic with time-varying characteristics, and to search for a large number of patterns that otherwise would not be feasible with limited computing resources.
  • the methods and systems described herein are insensitive to pattern matching algorithmic complexity attacks.
  • FIG. 1 is a block diagram that schematically illustrates a system 20 for keyword spotting, in accordance with an embodiment that is described herein.
  • System 20 receives communication traffic from a communication network 24 , and attempts to detect in the traffic predefined textual phrases, also referred to herein as keywords or patterns. When one or more keywords are detected, the system reports the detection to a user 28 using an operator terminal 32 .
  • System 20 can be used, for example, in an application that detects data leakage from a communication network. In applications of this sort, the presence of one or more keywords in a data item indicates that this data item should not be allowed to exit the network.
  • system 20 can be used in any other suitable application in which input data is searched for occurrences of keywords, such as in intrusion detection and prevention systems, detection of spam in electronic mail (e-mail) systems, or detection of inappropriate content using a dictionary of inappropriate words or phrases.
  • system 20 can be used for locating data of interest on storage devices, such as in forensic disk scanning applications.
  • Certain additional aspects of keyword spotting are addressed, for example, in U.S. patent application Ser. No. 12/792,796, entitled “Systems and methods for efficient keyword spotting in communication traffic,” which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference.
  • Other applications may comprise, for example, pattern matching in gene sequences in biology.
  • Network 24 may comprise any suitable public or private, wireless or wire-line communication network, e.g., a Wide-Area Network (WAN) such as the Internet, a Local-Area Network (LAN), a Metropolitan-Area Network (MAN), or a combination of network types.
  • The communication traffic, to be used as input data by system 20 , may be provided to the system using any suitable means.
  • the traffic may be forwarded to the system from a network element (e.g., router) in network 24 , such as by port tapping or port mirroring.
  • system 20 may be placed in-line in the traffic path.
  • Network 24 comprises an Internet Protocol (IP) network.
  • TCP/IP: Transmission Control Protocol/Internet Protocol
  • UDP: User Datagram Protocol
  • the packets searched by system 20 are referred to herein generally as input data.
  • system 20 comprises a Network Interface Card (NIC) 36 , which receives TCP packets from network 24 .
  • NIC 36 thus serves as an input circuit that receives the input data to be searched.
  • NIC 36 stores the incoming TCP packets in a memory 40 , typically comprising a Random Access Memory (RAM).
  • a processor 44 searches the TCP packets stored in memory 40 and attempts to identify occurrences of predefined keywords in the packets.
  • dictionary 48 may be stored on any suitable storage device.
  • dictionary 48 or part of it, may be stored in a cache memory (not shown) of processor 44 to increase the access speed by the processor.
  • The dictionary may comprise multiple physically or logically distinct dictionaries.
  • When processor 44 detects a given keyword in a given packet, it reports the detection to user 28 using an output device of terminal 32 , such as a display 56 . For example, the processor may issue an alert to the user and/or present the data item (e.g., packet or session) in which the keyword was detected. In some embodiments, processor 44 may take various kinds of actions in response to detecting a keyword. For example, in a data leakage or intrusion prevention application, processor 44 may block some or all of the traffic upon detecting a keyword. User 28 may interact with system 20 using an input device of terminal 32 , e.g., a keyboard 60 .
  • processor 44 comprises a general-purpose computer, which is programmed in software to carry out the functions described herein.
  • the software may be downloaded to the computer in optical or electronic form, over a network, for example, or it may, additionally or alternatively, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • The run time as a function of the input length may be linear in the worst case, as in the Aho-Corasick (AC) algorithm, or sub-linear on average, as in the Wu-Manber (WM) and Set Backward Oracle Matching (SBOM) algorithms. AC performs better with a small dictionary of short patterns, and its run time is not sensitive to the pattern length. WM and SBOM support a large keyword set as well as a large alphabet.
  • The AC and SBOM algorithms are additionally relatively simple to implement. Some algorithms, such as WM, perform better when searching for a set of only long patterns, since short patterns degrade their performance significantly.
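As an illustration of the worst-case linear scan attributed to AC above, the following is a minimal Python sketch of an Aho-Corasick automaton. It is not taken from the disclosure: the function names `build_automaton` and `search` and the list-of-dicts state layout are assumptions made for this example only.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: failure links of depth-1 states stay at the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Scan text once and return (start_index, pattern) for every occurrence."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Each input character advances the automaton by one step plus an amortized-constant number of failure transitions, which is why the scan time does not depend on the pattern lengths.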
  • the disclosed techniques incorporate more than just a single algorithm in a system for keyword searching.
  • the system may dynamically divert the traffic to the most suitable algorithm so as to maximize the overall performance.
  • FIG. 2 is a block diagram that schematically illustrates an example configuration of processor 44 , in accordance with an embodiment of the present disclosure.
  • Input traffic data enters a data splitter 100 .
  • the splitter has one input port and two output ports.
  • Processor 44 configures the splitter to direct a share of the traffic to each output port.
  • The input traffic comprises multiple flows of packets, and the splitter directs certain segments of the flows (or flow parts) to one port and other, or partially common, segments to the other port.
  • the splitter configuration is also referred to herein as a splitting policy.
  • Processor 44 may use any suitable method to guarantee a smooth transition of traffic with no loss of pattern detection. For example, when changing the splitting policy, the processor may direct to each algorithm a sufficient lag of past characters. As another example, the processor may split the traffic on a flow basis, i.e., aggregate and direct all the data of a flow to one algorithm. As yet another example, a respective data segment around the flow cut point may be handled by a third (not shown) algorithm.
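The "lag of past characters" option can be sketched as follows. This is a hedged illustration, not the disclosed implementation: it assumes that a lag of one character less than the longest pattern is sufficient, and the helper names `carry_over` and `find_all` are hypothetical.

```python
def carry_over(previous_chunk: bytes, max_pattern_len: int) -> bytes:
    """Return the tail of the previous chunk that the newly selected
    algorithm should re-read so that no straddling match is lost."""
    lag = max(max_pattern_len - 1, 0)
    return previous_chunk[-lag:] if lag else b""


def find_all(data: bytes, patterns):
    """Naive stand-in for a matching algorithm: all (offset, pattern) hits."""
    hits = set()
    for p in patterns:
        start = data.find(p)
        while start != -1:
            hits.add((start, p))
            start = data.find(p, start + 1)
    return hits
```

With the pattern `b"secret"` and a flow cut between `b"top sec"` and `b"ret plan"`, searching the second chunk alone finds nothing, whereas prepending `carry_over(b"top sec", 6)` recovers the match. Patterns shorter than the lag may be reported twice, so a real system would need to deduplicate.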
  • processor 44 configures the splitter to an initial splitting policy.
  • The processor may select any suitable initial policy. For example, if the initial data characteristics are not available to system 20 , the processor may configure the splitter to initially split the data evenly.
  • one of the algorithms may be a-priori assumed to be the most efficient for the expected input data.
  • the URL of the data source (if available) may indicate the data characteristics.
  • the processor may initially configure the splitter to direct a dominant share or even all the traffic to the most efficient algorithm (referred to as the primary algorithm). Additionally or alternatively, the processor may get an initial splitting policy from user 28 via terminal 32 .
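A minimal sketch of the two initial policies mentioned above (an even split, or a dominant share for a primary algorithm) might look like this. The function name `initial_policy`, the dict-of-shares representation, and the 0.9 default share are assumptions for illustration only.

```python
def initial_policy(algorithms, primary=None, primary_share=0.9):
    """Return a dict mapping algorithm name -> share of the traffic.

    With no primary algorithm the traffic is split evenly; otherwise the
    primary gets primary_share and the rest is divided among the others.
    """
    if primary is None or len(algorithms) == 1:
        return {name: 1.0 / len(algorithms) for name in algorithms}
    rest = (1.0 - primary_share) / (len(algorithms) - 1)
    return {name: (primary_share if name == primary else rest)
            for name in algorithms}
```

Setting `primary_share=1.0` corresponds to directing all the traffic to the primary algorithm.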
  • ALGORITHM 1 and ALGORITHM 2 are configured to search the data accepted from the splitter for occurrences of patterns stored in a pattern dictionary 112 .
  • processor 44 reports the matching event as described in FIG. 1 above.
  • the performance or efficiency of a matching algorithm may change over time. For example, modifying/adding/deleting patterns in the dictionary (e.g., by user 28 ) may reduce the processing complexity of one algorithm and increase the complexity of another algorithm at the same time. As another example, as the characteristics of the input data change over time, the complexity burden on two different algorithms may change in opposite directions.
  • a performance analyzer 116 monitors the performance, e.g., the efficiency, of each matching algorithm.
  • the efficiency of a matching algorithm can be estimated, for example, by evaluating a respective metric, such as the amount of input data that the algorithm can process per unit time, e.g., the number of processed input bytes per second.
  • Other example performance metrics include the memory size of the dictionaries, and the amount of memory needed for the flow state machines, i.e., for storing the internal state of the algorithm for each flow that is being analyzed.
  • each algorithm estimates its own performance and sends it to analyzer 116 for monitoring.
  • the analyzer calculates the performance metric internally.
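The bytes-per-second metric could be collected with a small wrapper such as the one below. This is only a sketch: the class name `ThroughputMeter` and the injectable `clock` parameter are assumptions, and the wrapper measures time spent inside the search call rather than wall-clock time of the whole system.

```python
import time


class ThroughputMeter:
    """Accumulates bytes processed and busy time for one matching algorithm."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.bytes_seen = 0
        self.busy_seconds = 0.0

    def run(self, search_fn, data):
        """Run search_fn on data and record how long it took."""
        start = self._clock()
        result = search_fn(data)
        self.busy_seconds += self._clock() - start
        self.bytes_seen += len(data)
        return result

    def bytes_per_second(self):
        """Processed input bytes per second of search time (0.0 if idle)."""
        if self.busy_seconds == 0:
            return 0.0
        return self.bytes_seen / self.busy_seconds
```

Either the algorithm or the analyzer could own such a meter, matching the two reporting options described above.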
  • The analyzer may use any suitable method to decide at what points in time to monitor the performance. For example, the analyzer may monitor the performance periodically. The time period may be on the order of a few seconds, or any other suitable time duration.
  • The analyzer may continuously measure the algorithms' performance. Further additionally or alternatively, the analyzer may monitor the performance in response to a change in the dictionary content by the user.
  • The analyzer uses the monitored performance to decide on an updated splitting policy for splitter 100 .
  • The analyzer may derive a proportional splitting policy, i.e., the more efficient an algorithm is relative to the others, the higher the share of the traffic that is reassigned to that algorithm.
  • The analyzer may derive an absolute splitting policy. For example, the analyzer may compare the performance of each algorithm to a predefined threshold, and direct most of the traffic to the algorithm whose performance relative to the respective threshold is the highest.
  • The analyzer can instruct the splitter to provide an algorithm with another input data segment, such as a packet, as the algorithm concludes processing a previous input data segment.
  • The processor may use any other suitable method to determine the splitting policy in response to the monitored performance.
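The proportional and threshold-based (absolute) policies can be sketched as below. The function names, the dict-based inputs, and the 0.9 "most of the traffic" share are assumptions chosen for this example, not values given in the disclosure.

```python
def proportional_policy(throughput):
    """Share of traffic proportional to each algorithm's measured throughput."""
    total = sum(throughput.values())
    if total <= 0:
        # No usable measurements: fall back to an even split.
        return {name: 1.0 / len(throughput) for name in throughput}
    return {name: value / total for name, value in throughput.items()}


def threshold_policy(throughput, thresholds, major_share=0.9):
    """Direct most traffic to the algorithm that is highest relative to
    its own predefined threshold; split the remainder among the others."""
    best = max(throughput, key=lambda name: throughput[name] / thresholds[name])
    others = len(throughput) - 1
    if others == 0:
        return {best: 1.0}
    minor = (1.0 - major_share) / others
    return {name: (major_share if name == best else minor)
            for name in throughput}
```

For example, measured throughputs of 300 and 100 bytes per second yield a 75/25 split under the proportional policy.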
  • the analyzer diverts some of the traffic to each algorithm in order to keep monitoring the performance of all the algorithms.
  • the analyzer may configure the splitter to direct a suitable data segment at the beginning of a certain flow to both algorithms. The rest of the flow will be directed to the algorithm that performed better on that data segment.
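The per-flow probing idea (send the first segment of a flow to every algorithm, then route the rest to the faster one) might be sketched like this. The name `choose_by_probe`, the 1024-byte default probe length, and the use of elapsed time as the comparison metric are all assumptions.

```python
import time


def choose_by_probe(flow: bytes, algorithms, probe_len=1024,
                    clock=time.perf_counter):
    """Run every algorithm on the first probe_len bytes of the flow and
    return the name of the one that finished fastest.

    algorithms maps a name to a callable search_fn(data).
    """
    probe = flow[:probe_len]
    timings = {}
    for name, search_fn in algorithms.items():
        start = clock()
        search_fn(probe)
        timings[name] = clock() - start
    return min(timings, key=timings.get)
```

The caller would then direct `flow[probe_len:]` to the returned algorithm, subject to the lag handling needed at the cut point.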
  • analyzer 116 analyzes the characteristics of the input traffic.
  • the analyzer accepts the traffic output from the splitter for analysis. Since the data characteristics may change over time, and since each algorithm may be better tuned to some characteristics, the analyzer may change the splitting policy accordingly.
  • the analyzer may use any suitable method to analyze the input data.
  • the analyzer may calculate statistical attributes of the data characters.
  • the analyzer can calculate a histogram that counts the number of each alphabet symbol in a data segment.
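A symbol histogram of a data segment is straightforward to compute. The sketch below uses Python's `collections.Counter`; the helper names `symbol_histogram` and `alphabet_size` are assumptions for illustration.

```python
from collections import Counter


def symbol_histogram(segment: bytes) -> Counter:
    """Count how many times each byte value occurs in the segment."""
    return Counter(segment)


def alphabet_size(segment: bytes) -> int:
    """Number of distinct byte values that actually appear in the segment."""
    return len(symbol_histogram(segment))
```

The observed alphabet size is one of the factors that the description lists as affecting algorithm efficiency, so it is a plausible input to the splitting decision.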
  • Some metadata may accompany the data flow, indicating the flow content and therefore the data characteristics. For example, video, text, and image content may differ considerably in their data characteristics.
  • the analyzer may configure the splitter to direct a flow to the most suitable algorithm according to the accompanying metadata.
  • The analyzer may analyze the input data at any suitable points in time. For example, the analyzer may periodically or continuously perform the analysis. Additionally or alternatively, the analyzer may perform the analysis when a new data source joins the traffic.
  • Analyzer 116 may additionally consider the inherent complexity of the algorithms. For example, the processor may utilize optimization techniques to select a splitting policy that would maximize the overall efficiency (i.e., the total traffic the system can handle per unit time), under overall constrained computation resources. As an example, the analyzer may trade computation time versus memory access time and optimize splitting the traffic among the algorithms accordingly.
  • A complexity attack is typically designed to push a specific algorithm to its worst-case behavior, by planting carefully selected data patterns in the traffic. Therefore, the performance of a matching algorithm that suffers an attack degrades significantly. Since an attack is designed for a specific algorithm, other algorithms may be much less sensitive to that attack, and would typically maintain high performance.
  • When one algorithm is attacked, analyzer 116 would sense a significant performance reduction, and the processor may configure the splitter to stop directing any data to that algorithm. Alternatively, the processor maintains a small share of the traffic directed to the algorithm under attack and keeps monitoring its performance. When the attack stops, the processor may again direct a significant share of the traffic to that algorithm.
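One way to express the reaction to a suspected complexity attack is shown below. It is a sketch under stated assumptions: an attack is inferred when throughput falls below `drop_ratio` times a baseline, and the degraded algorithm keeps a small `probe_share` so that its recovery can be observed. Both parameters and the function name `react_to_attack` are hypothetical.

```python
def react_to_attack(policy, baseline, current, drop_ratio=0.5, probe_share=0.05):
    """Return a new policy in which degraded algorithms keep only probe_share.

    policy, baseline and current are dicts keyed by algorithm name.
    """
    degraded = {n for n in policy if current[n] < drop_ratio * baseline[n]}
    healthy = [n for n in policy if n not in degraded]
    if not degraded or not healthy:
        # Nothing to do, or nowhere to move the traffic.
        return dict(policy)
    new_policy = {n: probe_share for n in degraded}
    remaining = 1.0 - probe_share * len(degraded)
    healthy_total = sum(policy[n] for n in healthy)
    for n in healthy:
        # Keep the healthy algorithms' relative proportions.
        weight = policy[n] / healthy_total if healthy_total else 1.0 / len(healthy)
        new_policy[n] = remaining * weight
    return new_policy
```

Setting `probe_share=0.0` corresponds to the option of directing no data at all to the attacked algorithm, at the cost of no longer seeing when the attack ends.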
  • FIG. 2 uses two matching algorithms and a splitter with two output ports, directing data to each algorithm.
  • Other embodiments may use any number of different matching algorithms and a corresponding suitable data splitter.
  • an embodiment may use three different matching algorithms and a splitter with three output ports.
  • FIG. 3 is a block diagram that schematically illustrates another example configuration of processor 44 , in accordance with another embodiment of the present disclosure.
  • In the configuration of FIG. 3, the full input traffic is assigned to both matching algorithms, ALGORITHM 1 104 and ALGORITHM 2 108 .
  • Performance analyzer 116 monitors the algorithms' performance and analyzes the input data characteristics similarly to the methods described in FIG. 2 above.
  • ALGORITHM 1 and ALGORITHM 2 are configured to search for occurrences of patterns stored in respective dictionaries DICTIONARY 1 120 and DICTIONARY 2 124 . Both dictionaries together hold all the patterns that system 20 is configured to search.
  • the sets of patterns in DICTIONARY 1 and DICTIONARY 2 are disjoint.
  • System 20 can use any suitable method to decide what patterns to initially put in each dictionary. For example, it may be a-priori assumed that each algorithm performs more efficiently given a specific set of patterns. As an example, system 20 may assign patterns to algorithms based on the pattern length. For example, in a system that uses the AC and the WM algorithms, the system may assign a relatively small dictionary (preferably residing in a cache memory) with short patterns to the AC algorithm, and a dictionary of only long patterns to the WM algorithm.
  • the internal hash function in the WM algorithm may experience a larger false positive probability due to collisions.
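The length-based assignment can be sketched as a simple partition. The cutoff `min_long_len=8` and the name `partition_by_length` are illustrative assumptions; the disclosure does not state a specific length.

```python
def partition_by_length(patterns, min_long_len=8):
    """Split patterns into a short-pattern dictionary (e.g., for AC) and a
    long-pattern dictionary (e.g., for WM). The two sets are disjoint and
    together contain every input pattern."""
    short = {p for p in patterns if len(p) < min_long_len}
    long_ = set(patterns) - short
    return short, long_
```

Keeping short patterns out of the second dictionary reflects the observation above that short patterns degrade WM's performance.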
  • a certain matching algorithm may perform better than others when the patterns for search contain wildcard expressions, i.e., a pattern may not be fully defined.
  • a dictionary with wildcard patterns may be assigned to that superior algorithm.
  • User 28 may configure each dictionary with selected patterns via terminal 32 .
  • System 20 automatically adjusts the dictionaries' content on the fly, to maximize the system performance for varying input traffic.
  • one or more of the algorithms may suffer performance degradation when the dictionary changes on the fly.
  • New patterns inserted by the user, or patterns moved from another dictionary, may be assigned to a temporary dictionary and an algorithm (not shown). Under suitable conditions, patterns from the temporary dictionary may be merged into the algorithm's dictionary.
  • Analyzer 116 monitors the algorithms' performance and the characteristics of the input data similarly to the description in FIG. 2 above, by evaluating a respective metric.
  • the analyzer may reassign patterns to the dictionaries to adjust and increase the overall performance.
  • the analyzer may move or swap patterns between the dictionaries.
  • Analyzer 116 may move the dictionary patterns that are more likely to trigger the attack when they are searched to the dictionary of the other algorithm.
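Moving patterns between dictionaries while keeping them disjoint and jointly complete can be sketched as follows. The function name `move_patterns` and the use of Python sets as dictionaries are assumptions; how the analyzer chooses which patterns to move is outside the scope of this sketch.

```python
def move_patterns(source: set, target: set, to_move) -> set:
    """Move the requested patterns from source to target in place.

    Only patterns actually present in source are moved, so the union of the
    two dictionaries is unchanged and they remain disjoint.
    """
    moved = set(to_move) & source
    source.difference_update(moved)
    target.update(moved)
    return moved
```

A swap can be expressed as two such moves in opposite directions.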
  • FIG. 3 uses two matching algorithms and two respective dictionaries. Other embodiments, however, may comprise any suitable number of matching algorithms and respective dictionaries. Moreover, in some embodiments a system may be configured to use a smaller number of dictionaries than algorithms. In such embodiments, multiple algorithms may be configured to search for patterns that are stored in one dictionary. For example, in a system that comprises three algorithms and two dictionaries, the first two algorithms may be attached to one dictionary and the third algorithm to the other dictionary.
  • FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure.
  • The method begins with system 20 receiving pattern dictionaries at a patterns input step 200 .
  • System 20 receives packets (referred to as input data) from network 24 via NIC 36 , and stores the packets in RAM 40 , at a data input step 204 .
  • Processor 44 searches the packets using algorithms 104 and 108 (using dictionary 112 or dictionaries 120 and 124 ) at a searching step 208 .
  • Processor 44 checks whether a match is found between a portion of the input data and any of the textual phrases (patterns) of the dictionaries, at a matching step 212 . If a match with a respective pattern is found, processor 44 reports the match event to operator 28 using operator terminal 32 , at an output step 216 .
  • At step 220 , the processor monitors and analyzes the performance of the matching algorithms ALGORITHM 1 104 and ALGORITHM 2 108 . Still at step 220 , the processor additionally analyzes the characteristics of the input data.
  • The processor checks if the traffic splitting policy should be changed, at a check analysis step 224 . If the analysis of the algorithms' performance and/or traffic characteristics indicates that changing the splitting policy will increase the overall performance, the processor sets an updated splitting policy to data splitter 100 at adjusting step 228 . Otherwise, the splitting policy is maintained and the processor loops back to step 204 above, in which system 20 receives subsequent input data.
  • The processor checks if the analysis of the algorithms' performance and/or data characteristics indicates that the overall performance may increase by moving or swapping patterns between DICTIONARY 1 120 and DICTIONARY 2 124 . If the check result is positive, processor 44 adjusts the dictionaries' content by moving or swapping patterns. After adjusting the dictionaries, or if there is no need for such adjustment, the processor loops back to step 204 .
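The flow of FIG. 4 for the traffic-splitting configuration can be summarized in a short loop. This is a simplified sketch under several assumptions: packets are routed whole to the algorithm that is furthest below its target share, `measure` and `update_policy` are caller-supplied callables standing in for analyzer 116, and `naive_search` is a placeholder for a real matching algorithm. None of these names come from the disclosure.

```python
def naive_search(data, patterns):
    """Placeholder matcher: patterns that occur anywhere in data."""
    return [p for p in patterns if p in data]


def run(packets, patterns, algorithms, policy, measure, update_policy):
    """Process packets and return (reports, final_policy).

    algorithms: name -> search_fn(data, patterns)
    policy:     name -> target share of the traffic
    """
    sent = {name: 0 for name in algorithms}
    reports = []
    for packet in packets:                                  # data input step 204
        # Splitter: pick the algorithm furthest below its target share.
        candidates = [n for n in algorithms if policy[n] > 0]
        name = min(candidates, key=lambda n: sent[n] / policy[n])
        sent[name] += len(packet)
        matches = algorithms[name](packet, patterns)        # searching step 208
        if matches:                                         # matching step 212
            reports.append((name, matches))                 # output step 216
        metrics = measure()                                 # step 220
        new_policy = update_policy(policy, metrics)         # check step 224
        if new_policy != policy:
            policy = new_policy                             # adjusting step 228
    return reports, policy
```

The dictionary-based configuration would follow the same loop, with the policy update replaced by moving or swapping patterns between the dictionaries.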

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and systems for keyword spotting, i.e., for identifying textual phrases of interest in input data. The input data may be communication packets exchanged in a communication network. A keyword spotting system holds a dictionary (or dictionaries) of textual phrases for searching input data. The input data and the patterns are assigned to multiple different pattern matching algorithms. For example, a share of the traffic is handled by one algorithm and smaller traffic shares may be handled by the others. The system monitors the algorithms performance as they process the data to search for a match. The ratio of traffic splitting among the algorithms is dynamically reassigned or adjusted to maximize the overall performance.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present application is a continuation of, and claims the benefit of priority to, U.S. patent application Ser. No. 14/263,108, entitled “SYSTEMS AND METHODS FOR KEYWORD SPOTTING USING ADAPTIVE MANAGEMENT OF MULTIPLE PATTERN MATCHING ALGORITHMS,” filed Apr. 28, 2014, whose disclosure is incorporated by reference herein.
  • FIELD OF THE DISCLOSURE
  • The present disclosure relates generally to data processing, and particularly to methods and systems for detecting strings in data.
  • BACKGROUND OF THE DISCLOSURE
  • Keyword searching techniques are used in a wide variety of applications. For example, in some applications, communication traffic is analyzed in an attempt to detect keywords that indicate traffic of interest. Some data security systems attempt to detect information that leaks from an organization network by detecting keywords in outgoing traffic. Intrusion detection systems sometimes identify illegitimate intrusion attempts by detecting keywords in traffic. Various keyword searching techniques are known in the art. For example, Aho and Corasick describe an algorithm for locating occurrences of a finite number of keywords in a string of text, in “Efficient String Matching: An Aid to Bibliographic Search,” Communications of the ACM, volume 18, no. 6, June, 1975, pages 333-340, which is incorporated herein by reference. This technique is commonly known as the Aho-Corasick algorithm. As another example, Yu et al. describe a multiple-pattern matching scheme, which uses Ternary Content-Addressable Memory (TCAM), in “Gigabit Rate Packet Pattern-Matching using TCAM,” Proceedings of the 12th IEEE International Conference on Network Protocols (ICNP), Berlin, Germany, Oct. 5-8, 2004, pages 174-183, which is incorporated herein by reference.
  • Other string matching algorithms are described, for example, by Navarro and Raffinot, in “Flexible Pattern Matching in Strings—Practical On-Line Search Algorithms for Texts and Biological Sequences,” Cambridge University Press, 2002, which is incorporated herein by reference. Chapter 3 of this book reviews multiple string matching algorithms such as the Wu-Manber (WM) and the Set Backward Oracle Matching (SBOM) algorithms.
  • SUMMARY OF THE DISCLOSURE
  • An embodiment that is described herein provides a method, including receiving input data to be searched for occurrences of a set of patterns, assigning the input data and the patterns to multiple different pattern matching algorithms, searching the input data using the pattern matching algorithms, evaluating a predefined metric, and reassigning the input data and the patterns to the pattern matching algorithms based on the evaluated metric.
  • In some embodiments, evaluating the predefined metric includes assessing a performance measure of the pattern matching algorithms. In other embodiments, evaluating the predefined metric includes assessing a characteristic of the input data. In yet other embodiments assigning the input data and the patterns includes applying each of the pattern matching algorithms to search a respective subset of the input data for the occurrences of all the patterns.
  • In some embodiments, reassigning the input data and the patterns includes reassigning a portion of the input data from a first pattern matching algorithm to a second pattern matching algorithm.
  • In other embodiments, assigning the input data and the patterns includes defining one of the pattern matching algorithms as a primary algorithm and assigning a majority of the input data to the primary algorithm, and reassigning the input data and the patterns includes redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input data to the redefined primary algorithm.
  • In yet other embodiments, assigning the input data and the patterns includes applying each of the pattern matching algorithms to search all the input data for the occurrences of a respective subset of the patterns.
  • In an embodiment, evaluating the metric includes evaluating at least one metric type selected from a group of types consisting of: a volume of the input data processed by a given pattern matching algorithm per unit time; a memory size occupied by the assigned patterns; and the memory size used for maintaining state machines of respective flows of the input data.
  • There is also provided, in accordance with an embodiment that is described herein, an apparatus including an input circuit and a processor. The input circuit is configured to receive input data to be searched for occurrences of a set of patterns. The processor is configured to assign the input data and the patterns to multiple different pattern matching algorithms, to search the input data for the occurrences using the multiple algorithms, to evaluate a predefined metric, and to reassign the input data and the patterns to the pattern matching algorithms based on the evaluated metric.
  • The present disclosure will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram that schematically illustrates a system for keyword searching, in accordance with an embodiment of the present disclosure;
  • FIGS. 2 and 3 are block diagrams that schematically illustrate configurations of a processor in a system for keyword searching, in accordance with embodiments of the present disclosure; and
  • FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure.
  • DETAILED DESCRIPTION OF EMBODIMENTS
  • Overview
  • Embodiments that are described herein provide improved methods and systems for keyword spotting, i.e., for identifying textual phrases of interest in input data. In the embodiments described herein, the input data comprises communication packets exchanged in a communication network. The disclosed keyword spotting techniques can be used, for example, in applications such as Data Leakage Prevention (DLP), Intrusion Detection Systems (IDS) or Intrusion Prevention Systems (IPS), and spam e-mail detection.
  • In the disclosed embodiments, a keyword spotting system holds a dictionary (or dictionaries) of textual phrases for searching input data. In a communication analytics system, for example, the dictionary defines textual phrases to be located in communication packets—such as e-mail addresses or Uniform Resource Locators (URLs).
  • In some applications, the dictionary comprises a large number of textual phrases, e.g., on the order of thousands or more, which may differ in size from one another. Each textual phrase in the dictionary typically comprises a string of characters, and in some embodiments may comprise various wildcard characters. Moreover, the dictionary may change over time, e.g., textual phrases may be added, deleted or modified. In the description that follows, the textual phrases are also referred to as keywords or patterns.
  • The performance of algorithms for keyword searching (also referred to as pattern matching algorithms) may be affected by many factors. Example factors include the dictionary size, the alphabet size (i.e., the number of different characters in the data), the sizes (or the minimal size) of the searched patterns, and the characteristics of the input data. In addition, an algorithm may suffer an attack (sometimes referred to as a “pattern matching algorithmic complexity attack” or “payload attack”) that may considerably reduce its efficiency.
  • In embodiments of the present invention, the keyword spotting system assigns the input data and the patterns to multiple different pattern matching algorithms. In one embodiment, the system splits the input data traffic between two or more matching algorithms. In one embodiment a dominant share of the traffic is handled by one algorithm and smaller traffic shares by the others. The system monitors the algorithms performance (by evaluating a respective metric) as they process the data to search for a match. The ratio of traffic splitting among the algorithms is dynamically reassigned or adjusted to maximize the overall performance.
  • In another embodiment, two or more pattern matching algorithms, each assigned to a distinct dictionary, process the input data in parallel. In other words, the patterns are split among the matching algorithms. The input traffic is not split but is rather directed in full to each of the matching algorithms. The dictionaries together include all the patterns to be searched. Again, the algorithms' performance is monitored and a respective metric is evaluated as they process the data. In response to changes in the data characteristics over time, patterns may be dynamically reassigned among the different dictionaries so that the corresponding algorithms achieve maximal overall performance.
  • The disclosed techniques enable the system to exploit the advantages and avoid the disadvantages of each pattern matching algorithm. The presented embodiments make it possible to handle high-bandwidth traffic with time-varying characteristics, and to search for a large number of patterns, which would otherwise not be feasible with limited computing resources. Moreover, the methods and systems described herein are insensitive to pattern matching algorithmic complexity attacks.
  • System Description
  • FIG. 1 is a block diagram that schematically illustrates a system 20 for keyword spotting, in accordance with an embodiment that is described herein. System 20 receives communication traffic from a communication network 24, and attempts to detect in the traffic predefined textual phrases, also referred to herein as keywords or patterns. When one or more keywords are detected, the system reports the detection to a user 28 using an operator terminal 32.
  • System 20 can be used, for example, in an application that detects data leakage from a communication network. In applications of this sort, the presence of one or more keywords in a data item indicates that this data item should not be allowed to exit the network. Alternatively, system 20 can be used in any other suitable application in which input data is searched for occurrences of keywords, such as in intrusion detection and prevention systems, detection of spam in electronic mail (e-mail) systems, or detection of inappropriate content using a dictionary of inappropriate words or phrases.
  • Although the embodiments described herein refer mainly to processing of communication traffic, the disclosed techniques can also be used in other domains. For example, system 20 can be used for locating data of interest on storage devices, such as in forensic disk scanning applications. Certain additional aspects of keyword spotting are addressed, for example, in U.S. patent application Ser. No. 12/792,796, entitled “Systems and methods for efficient keyword spotting in communication traffic,” which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference. Other applications may comprise, for example, pattern matching in gene sequences in biology.
  • Network 24 may comprise any suitable public or private, wireless or wire-line communication network, e.g., a Wide-Area Network (WAN) such as the Internet, a Local-Area Network (LAN), a Metropolitan-Area Network (MAN), or a combination of network types. The communication traffic, to be used as input data by system 20, may be provided to the system using any suitable means. For example, the traffic may be forwarded to the system from a network element (e.g., router) in network 24, such as by port tapping or port mirroring. In alternative embodiments, system 20 may be placed in-line in the traffic path. These embodiments are suitable, for example, for data leakage prevention applications, but can also be used in other applications.
  • Typically, network 24 comprises an Internet Protocol (IP) network, and the communication traffic comprises IP packets. The description that follows focuses on Transmission Control Protocol/Internet Protocol (TCP/IP) networks and TCP packets. Alternatively, however, the methods and systems described herein can be used with other packet types, such as User Datagram Protocol (UDP) packets. Regardless of protocol, the packets searched by system 20 are referred to herein generally as input data.
  • In the example of FIG. 1, system 20 comprises a Network Interface Card (NIC) 36, which receives TCP packets from network 24. NIC 36 thus serves as an input circuit that receives the input data to be searched. NIC 36 stores the incoming TCP packets in a memory 40, typically comprising a Random Access Memory (RAM). A processor 44 searches the TCP packets stored in memory 40 and attempts to identify occurrences of predefined keywords in the packets.
  • The predefined keywords or patterns are stored in a patterns dictionary 48. Dictionary 48 may be stored on any suitable storage device. In some embodiments, dictionary 48, or part of it, may be stored in a cache memory (not shown) of processor 44 to increase the access speed by the processor. In some embodiments, dictionary 48 may comprise multiple physically or logically distinct dictionaries.
  • When processor 44 detects a given keyword in a given packet, it reports the detection to user 28 using an output device of terminal 32, such as a display 56. For example, the processor may issue an alert to the user and/or present the data item (e.g., packet or session) in which the keyword was detected. In some embodiments, processor 44 may take various kinds of actions in response to detecting a keyword. For example, in a data leakage or intrusion prevention application, processor 44 may block some or all of the traffic upon detecting a keyword. User 28 may interact with system 20 using an input device of terminal 32, e.g., a keyboard 60.
  • The system configuration shown in FIG. 1 is an example configuration, which is chosen purely for the sake of conceptual clarity. Alternatively, any other suitable system configuration can be used. Generally, the different elements of system 20 may be implemented using software, hardware or a combination of hardware and software elements. In some embodiments, processor 44 comprises a general-purpose computer, which is programmed in software to carry out the functions described herein. The software may be downloaded to the computer in optical or electronic form, over a network, for example, or it may, additionally or alternatively, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • Maximizing Performance by Adaptive Splitting of Traffic
  • Many algorithms for keyword searching are known in the art. The algorithms may differ in several attributes such as run-time, implementation complexity and average or worst-case behavior. Moreover, their performance may be affected by several factors such as the size of the dictionary and the alphabet, as well as the length of the keywords.
  • For example, the run time as a function of the input length may be linear in the worst case, as in the Aho-Corasick (AC) algorithm, or sub-linear on average, as in the Wu-Manber (WM) and Set Backward Oracle Matching (SBOM) algorithms. The AC algorithm performs better with a small dictionary of short patterns, and its run time is not sensitive to the pattern length. The WM and SBOM algorithms support a large keyword set as well as a large alphabet. The AC and SBOM algorithms are additionally relatively simple to implement. Some algorithms, such as WM, perform better when searching for a set of only long patterns, since short patterns degrade their performance significantly.
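  • By way of illustration only, the linear-time scanning attributed to the AC algorithm can be sketched in Python as follows. This is a minimal sketch and not an implementation taken from the present disclosure; the function names `build_automaton` and `search` are hypothetical. The text is scanned once, and failure links replace backtracking.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for a set of patterns (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Scan text once and return (start_index, pattern) for every occurrence."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

  • For instance, `search("ushers", build_automaton(["he", "she", "his", "hers"]))` reports the overlapping occurrences of "she", "he" and "hers" in a single pass over the text.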
  • Since under different and changing conditions, different algorithms may perform more efficiently than others, the disclosed techniques incorporate more than just a single algorithm in a system for keyword searching. Thus, with limited computation resources, the system may dynamically divert the traffic to the most suitable algorithm so as to maximize the overall performance.
  • FIG. 2 is a block diagram that schematically illustrates an example configuration of processor 44, in accordance with an embodiment of the present disclosure. Input traffic data enters a data splitter 100. In the example embodiment of FIG. 2, the splitter has one input port and two output ports. Processor 44 configures the splitter to extract part or a share of the traffic to each output port. Typically, the input traffic comprises multiple flows of packets, and the splitter directs certain segments of the flows (or flows parts) to one port and other or partially common segments to the other port. The splitter configuration is also referred to herein as a splitting policy.
  • When dynamically changing the splitting policy, the processor should avoid missing any patterns as a result of the policy change. Processor 44 may use any suitable method to guarantee a smooth transition of traffic with no loss of pattern detection. For example, when changing the splitting policy, the processor may direct to each algorithm a sufficient lag of past characters. As another example, the processor may split the traffic on a flow basis, i.e., aggregate and direct all the data of a flow to one algorithm. As yet another example, a respective data segment around the flow cut point may be handled by a third (not shown) algorithm.
  • Two different matching algorithms, denoted ALGORITHM1 104 and ALGORITHM2 108, are assigned input data from the respective output ports of splitter 100. When system 20 starts to receive communication traffic, processor 44 configures the splitter to an initial splitting policy. The processor may select any suitable initial policy. For example, if the initial data characteristics are not available to system 20, the processor may configure the splitter to initially split the data evenly.
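  • As one possible reading of the "lag of past characters" approach, the following Python sketch hands the second algorithm the last L-1 characters before the cut point, where L is the longest pattern length, so that a pattern straddling the cut is still found. The choice of L-1 as the lag, the naive matcher `find_all`, and the name `search_across_cut` are assumptions made for illustration and are not specified in the disclosure.

```python
def find_all(data, patterns, offset=0):
    """Naive multi-pattern search; a stand-in for any matching algorithm."""
    hits = set()
    for pat in patterns:
        pos = data.find(pat)
        while pos != -1:
            hits.add((offset + pos, pat))
            pos = data.find(pat, pos + 1)
    return hits

def search_across_cut(flow, cut, patterns):
    """Search a flow whose data is switched from one algorithm to another at `cut`."""
    lag = max(len(p) for p in patterns) - 1   # assumed lag: longest pattern minus one
    start = max(0, cut - lag)
    first = find_all(flow[:cut], patterns)             # handled by the old algorithm
    second = find_all(flow[start:], patterns, start)   # new algorithm, fed with the lag
    # Keep only matches that end after the cut, so that nothing is reported twice.
    second = {(pos, pat) for pos, pat in second if pos + len(pat) > cut}
    return first | second
```

  • With this lag, every occurrence either ends at or before the cut (and is seen by the first algorithm) or starts no earlier than `cut - lag` (and is seen by the second), so the result equals that of searching the unsplit flow.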
  • In some embodiments, one of the algorithms may be a-priori assumed to be the most efficient for the expected input data. For example, the URL of the data source (if available) may indicate the data characteristics. In such embodiments the processor may initially configure the splitter to direct a dominant share or even all the traffic to the most efficient algorithm (referred to as the primary algorithm). Additionally or alternatively, the processor may get an initial splitting policy from user 28 via terminal 32.
  • ALGORITHM1 and ALGORITHM2 are configured to search the data accepted from the splitter for occurrences of patterns stored in a pattern dictionary 112. When either algorithm locates a pattern in the data, processor 44 reports the matching event as described in FIG. 1 above.
  • The performance or efficiency of a matching algorithm may change over time. For example, modifying/adding/deleting patterns in the dictionary (e.g., by user 28) may reduce the processing complexity of one algorithm and increase the complexity of another algorithm at the same time. As another example, as the characteristics of the input data change over time, the complexity burden on two different algorithms may change in opposite directions.
  • A performance analyzer 116 monitors the performance, e.g., the efficiency, of each matching algorithm. The efficiency of a matching algorithm can be estimated, for example, by evaluating a respective metric, such as the amount of input data that the algorithm can process per unit time, e.g., the number of processed input bytes per second. Other example performance metrics include the memory size of the dictionaries, and the amount of memory needed for the flow state machines, i.e., for storing the internal state of the algorithm for each flow that is being analyzed.
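  • The bytes-per-second metric mentioned above could, for example, be measured as in the following Python sketch. The helper name `throughput_bytes_per_sec` and the convention that a matcher is a callable taking `(data, patterns)` are assumptions for illustration only.

```python
import time

def throughput_bytes_per_sec(match_fn, data, patterns):
    """Time one call of match_fn and return the processed input bytes per second."""
    start = time.perf_counter()
    match_fn(data, patterns)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs or coarse timers.
    return len(data) / elapsed if elapsed > 0 else float("inf")
```

  • In practice a single timing is noisy; an analyzer would more plausibly average such readings over a monitoring period.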
  • In some embodiments each algorithm estimates its own performance and sends it to analyzer 116 for monitoring. Alternatively, the analyzer calculates the performance metric internally. The analyzer may use any suitable method to decide at what points in time to monitor the performance. For example the analyzer may monitor the performance periodically. The time period may be on the order of a few seconds, or any other suitable time duration. Alternatively, the analyzer may continuously measure the algorithms performance. Further additionally or alternatively, the analyzer may monitor the performance in response to a change in the dictionary content by the user.
  • The analyzer uses the monitored performance to decide on an updated splitting policy for splitter 100. For example, the analyzer may derive a proportional splitting policy, i.e., the more efficient an algorithm is relative to the others, the higher the share of the traffic that is reassigned to that algorithm. As another example, the analyzer may derive an absolute splitting policy. For example, the analyzer may compare the performance of each algorithm to a predefined threshold, and direct most of the traffic to the algorithm whose performance relative to the respective threshold is the highest.
  • As yet another example, the analyzer can instruct the splitter to provide an algorithm with another input data segment, such as a packet, as soon as the algorithm concludes processing a previous input data segment. Alternatively, the processor may use any other suitable method to determine the splitting policy in response to the monitored performance. Typically, the analyzer diverts some of the traffic to each algorithm in order to keep monitoring the performance of all the algorithms.
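  • The proportional and threshold-based policies described above can be sketched in Python as follows. The function names, the `min_share` reserve that keeps every algorithm monitored, and the `dominant_share` value are illustrative assumptions; the disclosure does not fix any particular formula.

```python
def proportional_policy(throughput, min_share=0.05):
    """Map measured throughput per algorithm to traffic shares that sum to 1."""
    n = len(throughput)
    total = sum(throughput.values())
    if total == 0:
        return {name: 1.0 / n for name in throughput}
    # Reserve min_share for every algorithm, split the remainder proportionally.
    spare = 1.0 - min_share * n
    return {name: min_share + spare * t / total for name, t in throughput.items()}

def threshold_policy(throughput, thresholds, dominant_share=0.9):
    """Give most traffic to the algorithm with the best performance/threshold ratio."""
    best = max(throughput, key=lambda name: throughput[name] / thresholds[name])
    others = [name for name in throughput if name != best]
    if not others:
        return {best: 1.0}
    rest = (1.0 - dominant_share) / len(others)
    shares = {name: rest for name in others}
    shares[best] = dominant_share
    return shares
```

  • For example, with measured throughputs of 300 and 100 units and `min_share=0.1`, the proportional policy yields shares of about 0.7 and 0.3.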
  • As yet another example, the analyzer may configure the splitter to direct a suitable data segment at the beginning of a certain flow to both algorithms. The rest of the flow will be directed to the algorithm that performed better on that data segment.
  • In addition to monitoring the algorithms performance, analyzer 116 analyzes the characteristics of the input traffic. The analyzer accepts the traffic output from the splitter for analysis. Since the data characteristics may change over time, and since each algorithm may be better tuned to some characteristics, the analyzer may change the splitting policy accordingly. The analyzer may use any suitable method to analyze the input data.
  • For example, the analyzer may calculate statistical attributes of the data characters, such as a histogram that counts the occurrences of each alphabet symbol in a data segment. In some embodiments, metadata may accompany the data flow, indicating the flow content and therefore the data characteristics. For example, video, text, and image content may differ considerably in their data characteristics. In such embodiments, the analyzer may configure the splitter to direct a flow to the most suitable algorithm according to the accompanying metadata.
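  • A symbol histogram of the kind described above can be computed, for instance, with the Python standard library. The helper names are hypothetical, and using the number of distinct symbols as an estimate of the effective alphabet size is an assumption made for illustration.

```python
from collections import Counter

def symbol_histogram(segment: bytes) -> Counter:
    """Count how many times each byte value occurs in a data segment."""
    return Counter(segment)

def alphabet_size(hist: Counter) -> int:
    """Number of distinct symbols actually observed in the segment."""
    return len(hist)
```

  • Since the alphabet size is one of the factors said to affect matching performance, such a count could feed the decision of which algorithm should receive a flow.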
  • The analyzer may analyze the input data at any suitable points in time. For example the analyzer may periodically or continuously perform the analysis. Additionally or alternatively, the analyzer may perform the analysis when a new data source joins the traffic.
  • When deciding on an updated splitting policy as described above, analyzer 116 may additionally consider the inherent complexity of the algorithms. For example, the processor may utilize optimization techniques to select a splitting policy that would maximize the overall efficiency (i.e., the total traffic the system can handle per unit time), under overall constrained computation resources. As an example, the analyzer may trade computation time against memory access time and optimize the splitting of the traffic among the algorithms accordingly.
  • Another example of an event that may trigger the processor to change the splitting policy is an algorithmic complexity attack. A complexity attack is typically designed to push a specific algorithm to its worst-case behavior, by planting carefully selected data patterns in the traffic. The performance of a matching algorithm that suffers such an attack is therefore reduced significantly. Since an attack is designed for a specific algorithm, other algorithms may be much less sensitive to that attack, and would typically maintain high performance.
  • When one algorithm is attacked, analyzer 116 senses a significant performance reduction, and the processor may configure the splitter to stop directing any data to that algorithm. Alternatively, the processor maintains a small share of the traffic directed to the algorithm under attack and keeps monitoring its performance. When the attack stops, the processor may again direct a significant share of the traffic to that algorithm.
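  • The alternative in which a small monitoring share is kept on the attacked algorithm might be sketched as follows. The detection rule (throughput falling below a fraction `drop_ratio` of a baseline), the `probe_share` value, and the function name are all assumptions for illustration; the disclosure does not specify how the performance reduction is detected.

```python
def adjust_for_attack(shares, baseline, current, drop_ratio=0.5, probe_share=0.02):
    """Cut the share of any algorithm whose throughput fell below drop_ratio * baseline."""
    attacked = [n for n in shares if current[n] < drop_ratio * baseline[n]]
    healthy = [n for n in shares if n not in attacked]
    if not attacked or not healthy:
        return dict(shares)          # nothing to do, or nowhere to move the traffic
    new_shares = {n: probe_share for n in attacked}
    freed = sum(shares[n] for n in attacked) - probe_share * len(attacked)
    healthy_total = sum(shares[n] for n in healthy)
    for n in healthy:
        # Hand the freed traffic to healthy algorithms in proportion to their shares.
        weight = shares[n] / healthy_total if healthy_total else 1.0 / len(healthy)
        new_shares[n] = shares[n] + freed * weight
    return new_shares
```

  • Once the monitored throughput of the attacked algorithm recovers, an ordinary policy update can restore its share.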
  • The embodiments in FIG. 2 use two matching algorithms and a splitter with two output ports, directing data to each algorithm. Other embodiments, however, may use any number of different matching algorithms and a corresponding suitable data splitter. For example an embodiment may use three different matching algorithms and a splitter with three output ports.
  • Maximizing Performance by Splitting Patterns Among Multiple Pattern Matching Algorithms
  • FIG. 3 is a block diagram that schematically illustrates another example configuration of processor 44, in accordance with another embodiment of the present disclosure. Unlike the configuration of FIG. 2, in FIG. 3 the full input traffic is assigned to both matching algorithms, ALGORITHM1 104 and ALGORITHM2 108. Performance analyzer 116 monitors the algorithms' performance and analyzes the input data characteristics similarly to the methods described in FIG. 2 above. In FIG. 3, ALGORITHM1 and ALGORITHM2 are configured to search for occurrences of patterns stored in respective dictionaries DICTIONARY1 120 and DICTIONARY2 124. Both dictionaries together hold all the patterns that system 20 is configured to search. Typically, although not necessarily, the sets of patterns in DICTIONARY1 and DICTIONARY2 are disjoint.
  • System 20 can use any suitable method to decide what patterns to initially put in each dictionary. For example, it may be a-priori assumed that each algorithm performs more efficiently given a specific set of patterns. As an example, system 20 may assign patterns to algorithms based on the patterns length. For example, in a system that uses the AC and the WM algorithms, the system may assign a relatively small dictionary (preferably residing in a cache memory) with short length patterns to the AC algorithm, and a dictionary of only long patterns to the WM algorithm.
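  • The length-based initial assignment described above can be illustrated with a short Python sketch. The cut-off `min_long_len` and the function name `split_by_length` are hypothetical; a real system would choose the boundary according to the algorithms in use.

```python
def split_by_length(patterns, min_long_len=8):
    """Partition patterns into a short-pattern and a long-pattern dictionary."""
    short = {p for p in patterns if len(p) < min_long_len}
    long_ = set(patterns) - short
    return short, long_
```

  • The two resulting sets are disjoint and together hold all the patterns, so searching the full traffic against each set with its own algorithm covers the whole dictionary.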
  • Additionally, when using a large dictionary, the internal hash function in the WM algorithm may experience a larger false positive probability due to collisions.
  • In some embodiments, a certain matching algorithm may perform better than others when the patterns for search contain wildcard expressions, i.e., a pattern may not be fully defined. In such embodiments, a dictionary with wildcard patterns may be assigned to that superior algorithm.
  • Additionally or alternatively, user 28 may configure each dictionary with selected patterns via terminal 32. As described below, system 20 automatically adjusts the dictionaries' content on the fly, to maximize the system performance for varying input traffic.
  • In yet other embodiments, one or more of the algorithms may suffer performance degradation when the dictionary changes on the fly. In such embodiments, new patterns inserted by the user, or patterns moved from another dictionary, may be assigned to a temporary dictionary and a corresponding algorithm (not shown). Under suitable conditions, patterns from the temporary dictionary may be merged into the algorithm's dictionary.
  • As described in FIG. 2 above, the characteristics of the data may change over time, and as a result affect the performance of the matching algorithms. Analyzer 116 monitors the algorithms' performance and the characteristics of the input data similarly to the description in FIG. 2 above, by evaluating a respective metric. When the analyzer detects a change in the algorithms' performance and/or in the input data characteristics, it may reassign patterns to the dictionaries so as to increase the overall performance. To reassign patterns, the analyzer may move or swap patterns between the dictionaries. As another example, if one algorithm suffers an algorithmic complexity attack, analyzer 116 may move to the dictionary of the other algorithm those patterns whose search makes the attacked algorithm most susceptible to the attack.
  • The embodiments in FIG. 3 use two matching algorithms and two respective dictionaries. Other embodiments however may comprise any suitable number of matching algorithms and respective dictionaries. Moreover, in some embodiments a system may be configured to use a smaller number of dictionaries than algorithms. In such embodiments, multiple algorithms may be configured to search for patterns that are stored in one dictionary. For example, in a system that comprises three algorithms and two dictionaries, the first two algorithms may be attached to one dictionary and the third algorithm to the other dictionary.
  • FIG. 4 is a flow chart that schematically illustrates a method for efficient keyword searching, in accordance with an embodiment of the present disclosure. The method begins with system 20 receiving patterns dictionaries at a patterns input step 200. System 20 receives packets (referred to as input data) from network 24 via NIC 36, and stores the packets in RAM 40, at a data input step 204.
  • Processor 44 searches the packets using algorithms 104 and 108 (using dictionary 112 or dictionaries 120 and 124) at a searching step 208. Processor 44 checks whether a match is found between a portion of the input data and any of the textual phrases (patterns) of the dictionaries, at a matching step 212. If a match with a respective pattern is found, processor 44 reports the match event to operator 28 using operator terminal 32, at an output step 216.
  • If no match is found, or following a match reporting, the method proceeds to an analyzing step 220. At step 220 the processor monitors and analyzes the performance of the matching algorithms ALGORITHM1 104 and ALGORITHM2 108. Still at step 220, the processor additionally analyzes the characteristics of the input data.
  • The processor checks if the traffic splitting policy should be changed, at a check analysis step 224. If the analysis of the algorithms performance and/or traffic characteristics indicates that by changing the splitting policy the overall performance will increase, the processor sets an updated splitting policy to data splitter 100 at adjusting step 228. Otherwise, the splitting policy is maintained and the processor loops back to step 204 above, in which system 20 receives subsequent input data.
  • Additionally or alternatively, at step 224 above, the processor checks if the analysis of the algorithms' performance and/or data characteristics indicates that the overall performance may increase by moving or swapping patterns between DICTIONARY1 120 and DICTIONARY2 124. If the check result is positive, processor 44 adjusts the dictionaries' content by moving or swapping patterns. After adjusting the dictionaries, or if there is no need for such adjustment, the processor loops back to step 204.
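  • The flow of FIG. 4 for the traffic-splitting embodiment can be summarized in the following Python sketch. It is a simplified illustration under stated assumptions: whole packets are routed at random according to the policy weights, and `analyze` and `report` are hypothetical callables standing in for the analyzer and the operator terminal. The step numbers in the comments refer to FIG. 4.

```python
import random

def keyword_spotting_loop(packets, patterns, algorithms, analyze, report, rng=None):
    """Route each packet to one algorithm, report matches, and update the policy."""
    rng = rng or random.Random(0)
    names = list(algorithms)
    policy = {n: 1.0 / len(names) for n in names}      # initial even split
    for packet in packets:                              # step 204: receive input data
        weights = [policy[n] for n in names]
        name = rng.choices(names, weights=weights)[0]   # splitter picks an algorithm
        matches = algorithms[name](packet, patterns)    # step 208: search
        if matches:                                     # step 212: match found?
            report(packet, matches)                     # step 216: report the match
        # Steps 220 and 224: analyze performance and data, decide whether to change.
        new_policy = analyze(name, packet, matches, policy)
        if new_policy is not None:
            policy = new_policy                         # step 228: update the splitter
    return policy
```

  • Here `patterns` plays the role of the dictionaries received at step 200, and `analyze` returns `None` when the current splitting policy should be maintained.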
  • It will be appreciated that the embodiments described above are cited by way of example, and that the present disclosure is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present disclosure includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
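  • By way of illustration only, the flow of FIG. 4 may be sketched in Python as shown below. The sketch is not taken from the disclosure: the two matchers are simple stand-ins for ALGORITHM1 104 and ALGORITHM2 108, and the names DataSplitter, percent_to_alg1, updated_policy, and run, as well as the 10%-90% clamp and the 5-point change threshold, are hypothetical choices made for the example. It assumes the mode in which each algorithm searches its own subset of the traffic for all the patterns, and uses processed volume per unit time as the monitored metric.

```python
import re
import time
from dataclasses import dataclass


def algorithm1(packet, patterns):
    """Stand-in for ALGORITHM1: plain substring search."""
    return [p for p in patterns if p in packet]


def algorithm2(packet, patterns):
    """Stand-in for ALGORITHM2: search with escaped regular expressions."""
    return [p for p in patterns if re.search(re.escape(p), packet)]


@dataclass
class DataSplitter:
    """Routes each packet to one algorithm according to the splitting policy."""
    percent_to_alg1: int = 50  # hypothetical policy: traffic share of algorithm 1
    _credit: int = 0

    def route(self):
        # An integer credit counter spreads packets at the configured ratio.
        self._credit += self.percent_to_alg1
        if self._credit >= 100:
            self._credit -= 100
            return 1
        return 2


def updated_policy(current, rate1, rate2, min_change=5):
    """Steps 224/228: propose a share proportional to measured throughput."""
    if rate1 + rate2 <= 0:
        return current  # nothing measured, keep the policy
    target = round(100 * rate1 / (rate1 + rate2))
    target = max(10, min(90, target))  # assumed bounds: never starve an algorithm
    # Change the policy only when the shift is large enough to matter.
    return target if abs(target - current) >= min_change else current


def run(packets, patterns, splitter):
    """One pass over a batch of packets, followed by a policy update."""
    matches = []
    volume = {1: 0, 2: 0}    # characters processed per algorithm
    busy = {1: 0.0, 2: 0.0}  # seconds spent searching per algorithm
    for packet in packets:                           # step 204
        alg = splitter.route()
        search = algorithm1 if alg == 1 else algorithm2
        start = time.perf_counter()
        found = search(packet, patterns)             # step 208
        busy[alg] += time.perf_counter() - start
        volume[alg] += len(packet)
        matches.extend((packet, p) for p in found)   # steps 212 and 216
    # Step 220: volume processed per unit time for each algorithm.
    rates = {a: volume[a] / busy[a] if busy[a] > 0 else 0.0 for a in (1, 2)}
    splitter.percent_to_alg1 = updated_policy(
        splitter.percent_to_alg1, rates[1], rates[2])
    return matches
```

  • In this sketch, calling run(["hello bomb", "nothing here"], ["bomb"], DataSplitter()) reports the pair ("hello bomb", "bomb") and then lets the measured throughput adjust the share of traffic routed to each algorithm. Moving or swapping patterns between two dictionaries could be handled analogously by giving each matcher its own pattern list.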

Claims (20)

1. A method for identifying textual phrases of interest in input data, the method being performed by an apparatus comprising a network interface card (NIC) and a processor, the method comprising:
receiving, by the NIC, input communication traffic to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases;
configuring, by the processor, a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm, wherein at least a first pattern of the set of patterns is assigned to the first pattern matching algorithm and wherein at least a second pattern of the set of patterns is assigned to the second pattern matching algorithm;
executing, by the processor, the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern;
monitoring, by the processor, performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms;
generating, by the processor, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and
configuring, by the processor, the data splitter in accordance with the updated splitting policy configuration.
2. The method according to claim 1, wherein evaluating the predetermined metric comprises assessing a performance measure of the pattern matching algorithms.
3. The method according to claim 1, wherein evaluating the predetermined metric comprises assessing a characteristic of the input communication traffic.
4. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
5. The method according to claim 4, wherein reassigning the input communication traffic and the patterns comprises reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
6. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises defining one of the pattern matching algorithms as a primary algorithm and assigning a majority of the input communication traffic to the primary algorithm, and wherein reassigning the input data and the patterns comprises redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input communication traffic to the redefined primary algorithm.
7. The method according to claim 1, wherein assigning the input communication traffic and the patterns comprises applying each of the pattern matching algorithms to search all the input communication traffic for the occurrences of a respective subset of the patterns.
8. The method according to claim 1, wherein evaluating the metric comprises evaluating at least one metric type selected from a group of types consisting of:
a volume of the input communication traffic processed by a given pattern matching algorithm per unit time;
a memory size occupied by the assigned patterns; and
the memory size used for maintaining state machines of respective flows of the input communication traffic.
9. An apparatus for identifying textual phrases of interest in input data, comprising:
a network interface card (NIC) configured to receive input communication traffic that is to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases; and
a processor configured to:
configure a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm;
execute the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern;
monitor performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms;
generate, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and
configure the data splitter in accordance with the updated splitting policy configuration.
10. The apparatus according to claim 9, wherein the processor is configured to evaluate the predetermined metric by assessing a performance measure of the pattern matching algorithms.
11. The apparatus according to claim 9, wherein the processor is configured to evaluate the predetermined metric by assessing a characteristic of the input communication traffic.
12. The apparatus according to claim 9, wherein the processor is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
13. The apparatus according to claim 12, wherein the processor is configured to reassign the input communication traffic and the patterns by reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
14. The apparatus according to claim 9, wherein the processor is configured to define one of the pattern matching algorithms as a primary algorithm and to assign a majority of the input communication traffic to the primary algorithm, and to reassign the input communication traffic and the patterns by redefining another of the pattern matching algorithms to serve as the primary algorithm and shifting the majority of the input communication traffic to the redefined primary algorithm.
15. The apparatus according to claim 9, wherein the processor is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search all the input communication traffic for the occurrences of a respective subset of the patterns.
16. The apparatus according to claim 9, wherein the processor is configured to evaluate the metric by evaluating at least one metric type selected from a group of types consisting of:
a volume of the input communication traffic processed by a given pattern matching algorithm per unit time;
a memory size occupied by the assigned patterns; and
the memory size used for maintaining state machines of respective flows of the input communication traffic.
17. A non-transitory computer readable media having instructions stored thereon for identifying textual phrases of interest in input data that, when executed by a computing device, cause the computing device to at least:
receive input communication traffic that is to be searched for occurrences of a set of patterns, wherein each pattern of the set of patterns comprises one or more textual phrases;
configure a data splitter in accordance with an initial splitting configuration policy to assign the input communication traffic and the patterns to multiple different pattern matching algorithms, in which certain segments of the input communication traffic are assigned to a first pattern matching algorithm and certain segments are assigned to a second pattern matching algorithm;
execute the first and second pattern matching algorithms to identify occurrences of textual phrases from the respective first and second patterns in the communication traffic, wherein the execution comprises the first pattern matching algorithm searching within certain segments for the one or more textual phrases of the first assigned pattern and comprises the second pattern matching algorithm searching within certain segments for the one or more textual phrases of the second assigned pattern;
monitor performance of the first and second pattern matching algorithms by evaluating a predetermined metric for each of the first and second pattern matching algorithms;
generate, for the data splitter, based on the monitored performance, an updated splitting policy configuration that reassigns which segments of the input communication traffic are assigned to which pattern matching algorithm and/or which pattern of the set of patterns is assigned to which pattern matching algorithm; and
configure the data splitter in accordance with the updated splitting policy configuration.
18. The non-transitory computer readable media according to claim 17, wherein the computing device is configured to assign the input communication traffic and the patterns by applying each of the pattern matching algorithms to search a respective subset of the input communication traffic for the occurrences of all the patterns.
19. The non-transitory computer readable media according to claim 18, wherein the computing device is configured to reassign the input communication traffic and the patterns by reassigning a portion of the input communication traffic from the first pattern matching algorithm to the second pattern matching algorithm.
20. The non-transitory computer readable media according to claim 17, wherein the computing device is configured to evaluate the metric by evaluating at least one metric type selected from a group of types consisting of:
a volume of the input communication traffic processed by a given pattern matching algorithm per unit time;
a memory size occupied by the assigned patterns; and
the memory size used for maintaining state machines of respective flows of the input communication traffic.
US15/411,369 2013-04-28 2017-01-20 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms Abandoned US20200304414A9 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/411,369 US20200304414A9 (en) 2013-04-28 2017-01-20 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IL226056A IL226056A (en) 2013-04-28 2013-04-28 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms
US14/263,108 US9589073B2 (en) 2013-04-28 2014-04-28 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms
IL226056 2014-04-28
US15/411,369 US20200304414A9 (en) 2013-04-28 2017-01-20 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/263,108 Continuation US9589073B2 (en) 2013-04-28 2014-04-28 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

Publications (2)

Publication Number Publication Date
US20170195234A1 US20170195234A1 (en) 2017-07-06
US20200304414A9 true US20200304414A9 (en) 2020-09-24

Family

ID=54334957

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/263,108 Active 2034-12-03 US9589073B2 (en) 2013-04-28 2014-04-28 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms
US15/411,369 Abandoned US20200304414A9 (en) 2013-04-28 2017-01-20 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/263,108 Active 2034-12-03 US9589073B2 (en) 2013-04-28 2014-04-28 Systems and methods for keyword spotting using adaptive management of multiple pattern matching algorithms

Country Status (2)

Country Link
US (2) US9589073B2 (en)
IL (1) IL226056A (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9336331B2 (en) * 2010-04-26 2016-05-10 Ca, Inc. Detecting, using, and sharing it design patterns and anti-patterns
EP3198475A1 (en) 2014-09-26 2017-08-02 British Telecommunications Public Limited Company Efficient conditional state mapping in a pattern matching automaton
EP3198476A1 (en) * 2014-09-26 2017-08-02 British Telecommunications Public Limited Company Efficient pattern matching
US10846598B2 (en) 2014-09-26 2020-11-24 British Telecommunications Public Limited Company Pattern matching
US9875081B2 (en) * 2015-09-21 2018-01-23 Amazon Technologies, Inc. Device selection for providing a response
CN106934409B (en) * 2015-12-29 2021-04-20 优信拍(北京)信息科技有限公司 Data matching method and device
US10642889B2 (en) 2017-02-20 2020-05-05 Gong I.O Ltd. Unsupervised automated topic detection, segmentation and labeling of conversations
CN109145283B (en) * 2017-06-17 2022-03-15 黄冈 Artificial intelligent sensitive information detection method
US10482904B1 (en) 2017-08-15 2019-11-19 Amazon Technologies, Inc. Context driven device arbitration
US11276407B2 (en) 2018-04-17 2022-03-15 Gong.Io Ltd. Metadata-based diarization of teleconferences
US11663105B2 (en) * 2019-09-12 2023-05-30 Vmware, Inc. String pattern matching for multi-string pattern rules in intrusion detection
CN112330379B (en) * 2020-11-25 2023-10-31 税友软件集团股份有限公司 Invoice content generation method, invoice content generation system, electronic equipment and storage medium
US12010126B2 (en) 2021-07-13 2024-06-11 VMware LLC Method and system for automatically curating intrusion detection signatures for workloads based on contextual attributes in an SDDC
CN117668527B (en) * 2024-01-31 2024-04-26 国网湖北省电力有限公司信息通信公司 Multi-feature recognition method and system under large-flow model

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5014327A (en) * 1987-06-15 1991-05-07 Digital Equipment Corporation Parallel associative memory having improved selection and decision mechanisms for recognizing and sorting relevant patterns
TW274135B (en) * 1994-09-14 1996-04-11 Hitachi Seisakusyo Kk
US5689442A (en) 1995-03-22 1997-11-18 Witness Systems, Inc. Event surveillance system
GB9620082D0 (en) 1996-09-26 1996-11-13 Eyretel Ltd Signal monitoring apparatus
US6963871B1 (en) * 1998-03-25 2005-11-08 Language Analysis Systems, Inc. System and method for adaptive multi-cultural searching and matching of personal names
IL130893A (en) 1999-07-12 2003-12-10 Ectel Ltd Method and system for creating integrated call detail records (cdr) databases in management systems of telecommunications networks
US7574000B2 (en) 2000-01-13 2009-08-11 Verint Americas Inc. System and method for analysing communications streams
GB0000735D0 (en) 2000-01-13 2000-03-08 Eyretel Ltd System and method for analysing communication streams
IL136324A0 (en) 2000-05-24 2001-05-20 Softcom Computers Ltd Method of surveilling internet communication
US6785416B1 (en) * 2000-10-17 2004-08-31 Oak Technology, Inc. System and method for the processing of scanned image data using a pixel window
US6675164B2 (en) * 2001-06-08 2004-01-06 The Regents Of The University Of California Parallel object-oriented data mining system
CN1535433A (en) * 2001-07-04 2004-10-06 库吉萨姆媒介公司 Category based, extensible and interactive system for document retrieval
US7257576B2 (en) * 2002-06-21 2007-08-14 Microsoft Corporation Method and system for a pattern matching engine
US20060036561A1 (en) * 2002-09-27 2006-02-16 Carnegie Mellon University Pattern search algorithm for component layout
US7453439B1 (en) * 2003-01-16 2008-11-18 Forward Input Inc. System and method for continuous stroke word-based text input
US8429174B2 (en) * 2003-01-25 2013-04-23 Purdue Research Foundation Methods, systems, and data structures for performing searches on three dimensional objects
US7783657B2 (en) * 2005-06-03 2010-08-24 Microsoft Corporation Search authoring metrics and debugging
US20070129976A1 (en) * 2005-08-04 2007-06-07 Prolify Ltd. Apparatus and methods for process and project management and control
US7856411B2 (en) * 2006-03-21 2010-12-21 21St Century Technologies, Inc. Social network aware pattern detection
CA2648354A1 (en) * 2006-04-04 2007-10-11 Zota Limited Targeted advertising system
US7877401B1 (en) * 2006-05-24 2011-01-25 Tilera Corporation Pattern matching
US20080014873A1 (en) 2006-07-12 2008-01-17 Krayer Yvonne L Methods and apparatus for adaptive local oscillator nulling
AU2006249239B2 (en) * 2006-12-07 2010-02-18 Canon Kabushiki Kaisha A method of ordering and presenting images with smooth metadata transitions
US8113844B2 (en) 2006-12-15 2012-02-14 Atellis, Inc. Method, system, and computer-readable recording medium for synchronous multi-media recording and playback with end user control of time, data, and event visualization for playback control over a network
US7882217B2 (en) 2007-05-17 2011-02-01 Verint Systems Inc. Network identity clustering
US7720786B2 (en) * 2007-05-25 2010-05-18 Samsung Electronics Co., Ltd. Method of pattern identification using retain operators on multisets
US8428310B2 (en) * 2008-02-28 2013-04-23 Adt Services Gmbh Pattern classification system and method for collective learning
GB0816556D0 (en) * 2008-09-10 2008-10-15 Univ Napier Improvements in or relating to digital forensics
EP2466499A4 (en) * 2010-02-26 2016-10-26 Rakuten Inc Information processing device, information processing method, program for information processing device, and recording medium
US8478736B2 (en) * 2011-02-08 2013-07-02 International Business Machines Corporation Pattern matching accelerator
US20140099623A1 (en) * 2012-10-04 2014-04-10 Karmarkar V. Amit Social graphs based on user bioresponse data
US9674360B2 (en) * 2012-04-19 2017-06-06 Avaya Inc. Management of contacts at contact centers
EP2747078A1 (en) * 2012-12-18 2014-06-25 Telefónica, S.A. Method and system for improved pattern matching
US8938460B2 (en) * 2013-03-04 2015-01-20 Tracfone Wireless, Inc. Automated highest priority ordering of content items stored on a device
US20150066963A1 (en) * 2013-08-29 2015-03-05 Honeywell International Inc. Structured event log data entry from operator reviewed proposed text patterns

Also Published As

Publication number Publication date
IL226056A (en) 2017-06-29
US9589073B2 (en) 2017-03-07
US20170195234A1 (en) 2017-07-06
US20150310014A1 (en) 2015-10-29

Legal Events

Date Code Title Description
AS Assignment

Owner name: VERINT SYSTEMS LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YISHAY, YITSHAK;REEL/FRAME:041052/0230

Effective date: 20140519

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: COGNYTE TECHNOLOGIES ISRAEL LTD, ISRAEL

Free format text: CHANGE OF NAME;ASSIGNOR:VERINT SYSTEMS LTD.;REEL/FRAME:060751/0532

Effective date: 20201116

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: COGNYTE TECHNOLOGIES ISRAEL LTD, ISRAEL

Free format text: CHANGE OF NAME;ASSIGNOR:VERINT SYSTEMS LTD.;REEL/FRAME:059710/0753

Effective date: 20201116

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION