WO2020167586A1 - Automated data discovery for cybersecurity - Google Patents

Automated data discovery for cybersecurity

Info

Publication number
WO2020167586A1
WO2020167586A1 (PCT/US2020/017064)
Authority
WO
WIPO (PCT)
Prior art keywords
entry
matched
score
lexical
contained
Prior art date
Application number
PCT/US2020/017064
Other languages
English (en)
Inventor
David A. ROSENBERG
Vincent ENG
Original Assignee
Db Cybertech, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Db Cybertech, Inc. filed Critical Db Cybertech, Inc.
Publication of WO2020167586A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/237: Lexical tools
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30: Semantic analysis

Definitions

  • the embodiments described herein are generally directed to automated data discovery for cybersecurity, and, more particularly, to automated discovery and classification of data within a network to improve cybersecurity within the network.
  • Data may exist in several forms and in many places within an entity’s infrastructure.
  • structured data primarily resides in databases, and frequently in a relational database management system (RDBMS).
  • Unstructured data may exist across a number of files (e.g., documents, spreadsheets, etc.) spread across numerous systems.
  • the term “structured data,” when used herein, should be construed to include both structured and semi-structured data.
  • a method comprises using at least one hardware processor to: receive a plurality of semantic objects, representing textual components of one or more statements in a data-access language, captured from network traffic within a context; parse the textual components into one or more groups of one or more words; match the one or more groups to lemmas in a semantic dictionary to produce one or more matched lemmas representing the one or more groups; match the one or more matched lemmas to lexical entries in the semantic dictionary to produce one or more matched lexical entries; match the one or more matched lexical entries to one or more classifiers in a classifier database to produce one or more classifications, wherein each of the one or more classifications represents a possibility that at least one of the one or more statements accesses data within that classification; and store the one or more classifications in association with the context in an analytic database.
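  The claimed pipeline above can be sketched end to end as a toy in-memory program. All names, data structures, and the dictionary/classifier contents are hypothetical illustrations, not the patented implementation:

```python
def classify(components, context, lemma_dict, lexicon, classifiers):
    """Toy pipeline: word groups -> lemmas -> lexical entries -> classifications."""
    # 1. Parse textual components (e.g., SQL identifiers) into word groups
    #    with a very naive tokenizer.
    groups = [w for c in components for w in c.lower().replace("_", " ").split()]
    # 2. Match word groups to lemmas in the semantic dictionary; unknown
    #    groups are skipped.
    lemmas = [lemma_dict[g] for g in groups if g in lemma_dict]
    # 3. Match lemmas to lexical entries.
    entries = {lexicon[l] for l in lemmas if l in lexicon}
    # 4. Match lexical entries against classifiers; a classification records
    #    that the statement possibly accesses data in that class, stored with
    #    its context (here, just an in-memory dict standing in for the
    #    analytic database).
    results = {}
    for name, entry_set in classifiers.items():
        hits = entries & entry_set
        if hits:
            results[name] = {"context": context, "matched": sorted(hits)}
    return results

# Hypothetical example: a statement touching "cust_ssn" and "dob" columns.
lemma_dict = {"ssn": "ssn", "dob": "dob", "cust": "customer"}
lexicon = {"ssn": "social-security-number", "dob": "date-of-birth",
           "customer": "customer"}
classifiers = {"PII": {"social-security-number", "date-of-birth"}}
out = classify(["cust_ssn", "dob"], "orders-db", lemma_dict, lexicon, classifiers)
```

  Each stage mirrors one clause of the claim; real embodiments would of course operate on the semantic objects produced from captured network traffic rather than on bare strings.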
  • the plurality of semantic objects may represent textual components of one or more statements in the data-access language and one or more comments in the data-access language.
  • the textual components may comprise one or more identifiers utilized by one or more statements in the data-access language. Parsing the one or more identifiers into the one or more groups of one or more words may comprise expanding one non-standard identifier into two or more standard words.
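  Expanding a non-standard identifier into standard words, as described above, can be done with a small dictionary-driven segmentation. The vocabulary and abbreviation expansions below are hypothetical:

```python
def expand_identifier(ident, vocab):
    """Split a non-standard identifier (e.g., "custaddr") into known
    abbreviations, then expand each to its standard word. `vocab` maps
    abbreviations to standard words; an unsegmentable identifier is
    returned unchanged."""
    ident = ident.lower()
    n = len(ident)
    best = [None] * (n + 1)  # best[i]: token list covering ident[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and ident[j:i] in vocab:
                best[i] = best[j] + [ident[j:i]]
                break
    return [vocab[t] for t in best[n]] if best[n] else [ident]

# Hypothetical abbreviation vocabulary.
vocab = {"cust": "customer", "addr": "address", "name": "name"}
expand_identifier("custaddr", vocab)  # -> ["customer", "address"]
```

  The same mechanism also handles identifiers that are already standard words, and falls through gracefully for identifiers it cannot segment.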
  • the textual components may be parsed into the one or more groups of one or more words according to the context.
  • Matching the one or more matched lemmas to lexical entries in the semantic dictionary may comprise constraining the matching according to one or more of the following constraints: each of the one or more matched lemmas is matched, at most, once to the lexical entries in the semantic dictionary; a total number of characters of the one or more lemmas, matched to the lexical entries in the semantic dictionary, is maximized; and any of the one or more matched lemmas that are not matched to the lexical entries in the semantic dictionary are skipped.
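  The three constraints above (each lemma matched at most once, total matched characters maximized, unmatched lemmas skipped) describe an optimization that can be sketched as a dynamic program over the lemma sequence. The entry inventory here is hypothetical:

```python
def match_entries(lemmas, entries):
    """Match a lemma sequence against multi-lemma lexical entries:
    each lemma is used at most once, unmatched lemmas are skipped, and
    the total number of matched characters is maximized."""
    n = len(lemmas)
    best = [(0, ())] * (n + 1)  # best[i]: (chars matched, entries) over lemmas[i:]
    for i in range(n - 1, -1, -1):
        best[i] = best[i + 1]  # option: skip lemma i
        for entry in entries:
            k = len(entry)
            if tuple(lemmas[i:i + k]) == entry:
                cand = sum(map(len, entry)) + best[i + k][0]
                if cand > best[i][0]:
                    best[i] = (cand, (entry,) + best[i + k][1])
    return best[0]

lemmas = ["social", "security", "number", "of", "customer"]
entries = {("social", "security", "number"), ("customer",), ("security",)}
match_entries(lemmas, entries)
```

  Note how the character-maximization constraint prefers the three-lemma entry ("social", "security", "number") over matching "security" alone, while "of" is simply skipped.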
  • the total number of characters of the one or more lemmas may be determined based on a written representation associated with each of the one or more matched lexical entries in the semantic dictionary, wherein each written representation is in a single human language regardless of a human language of the groups which were matched to the one or more lemmas.
  • Each of the one or more classifiers may comprise a containment hierarchy that comprises one or more categories contained within the classifier, one or more entry groups contained within each of the one or more categories, one or more entry sets contained within each of the one or more entry groups, and one or more lexical entries contained within each of the one or more entry sets. Any of the one or more entry groups that contains a plurality of lexical entries may be associated with a match window, such that the entry group is matched only if the one or more matched lexical entries comprise all of the contained plurality of lexical entries within the match window.
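  The match-window check for a multi-entry group can be illustrated with token positions of matched lexical entries. The positions and entry names below are hypothetical:

```python
def group_matches(matched_positions, group_entries, window):
    """An entry group containing several lexical entries matches only if
    every entry was seen and all occurrences fall within `window`
    positions of one another; a single-entry group needs no window."""
    if any(e not in matched_positions for e in group_entries):
        return False
    if len(group_entries) == 1:
        return True
    positions = [matched_positions[e] for e in group_entries]
    return max(positions) - min(positions) <= window

# Hypothetical matched entries with their token positions in a statement.
matched = {"first": 3, "name": 4, "ssn": 40}
group_matches(matched, ["first", "name"], 5)  # within the window
group_matches(matched, ["first", "ssn"], 5)   # too far apart
```

  The window prevents spurious matches where the constituent entries occur in unrelated parts of a long statement.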
  • the method may further comprise generating a quantitative score for each of the one or more classifications, wherein the quantitative score indicates a likelihood that the one or more statements implicate the respective classification within the context.
  • Each of the one or more classifications may comprise a containment hierarchy that comprises one or more categories contained within the classification, one or more entry groups contained within each of the one or more categories, one or more entry sets contained within each of the one or more entry groups, and one or more lexical entries contained within each of the one or more entry sets.
  • generating the quantitative score for each of the one or more classifications may comprise: generating a local score for each of the one or more lexical entries; generating an entry-set score for each of the one or more entry sets based on the local scores for the one or more lexical entries contained within that entry set; generating an entry-group score for each of the one or more entry groups based on the entry-set scores for the one or more entry sets contained within that entry group; generating a category score for each of the one or more categories based on the entry-group scores for the one or more entry groups contained within that category; and generating a classification score for each of the one or more classifications based on the category scores for the one or more categories contained within that classification.
  • the local score for each of the one or more lexical entries may be calculated based on an actual number of times that the lexical entry occurs, an expected number of times that the lexical entry is expected to occur, and a weight.
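  The exact local-score formula is not stated here; one plausible reading, shown purely as an illustrative sketch, compares the actual occurrence count to the expected count, caps the ratio at 1, and scales by the weight:

```python
def local_score(actual, expected, weight):
    """Illustrative local score for a lexical entry: the observed count
    relative to the expected count, capped at 1.0, scaled by the entry's
    weight. The actual formula used by the embodiments may differ."""
    if expected <= 0:
        return 0.0
    return weight * min(actual / expected, 1.0)
```

  Under this choice, an entry that occurs at least as often as expected contributes its full weight, and rarer occurrences contribute proportionally less.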
  • the entry-set score for each of the one or more entry sets may be calculated based on a maximum local score for the one or more lexical entries contained within that entry set, and a weight.
  • the entry-group score for each of the one or more entry groups may be calculated based on: if the entry group contains only a single entry set, the entry-set score for the single entry set, and a weight; and, if the entry group contains a plurality of entry sets, an accumulated probability calculated based on the local scores of the one or more lexical entries contained in each of the plurality of entry sets, and a weight.
  • the category score for each of the one or more categories may be calculated based on a maximum entry-group score for the one or more entry groups contained within that category, and a weight.
  • the classification score for each of the one or more classifications may be calculated based on a maximum category score for the one or more categories contained within that classification.
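  The score roll-up described in the preceding paragraphs (local → entry set → entry group → category → classification) can be sketched as follows. The noisy-OR interpretation of the "accumulated probability" for multi-set entry groups is an assumption, since the document does not give the formula:

```python
def entry_set_score(local_scores, weight):
    # Entry set: weighted maximum of its entries' local scores.
    return weight * max(local_scores, default=0.0)

def entry_group_score(local_scores_per_set, weight, set_weight=1.0):
    if len(local_scores_per_set) == 1:
        # Single entry set: its weighted entry-set score.
        return weight * entry_set_score(local_scores_per_set[0], set_weight)
    # Plurality of entry sets: accumulate each set's best local score as an
    # independent match probability (noisy-OR; an illustrative choice).
    p_none = 1.0
    for scores in local_scores_per_set:
        p_none *= 1.0 - max(scores, default=0.0)
    return weight * (1.0 - p_none)

def category_score(group_scores, weight):
    # Category: weighted maximum of its entry-group scores.
    return weight * max(group_scores, default=0.0)

def classification_score(cat_scores):
    # Classification: maximum of its category scores.
    return max(cat_scores, default=0.0)
```

  Each level takes a maximum (or an accumulation) over its children and applies a per-level weight, so a classification's score is driven by its strongest supporting evidence.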
  • the method may further comprise filtering out at least one of the one or more categories contained within one or more classifications based on the category score calculated for the at least one category.
  • the method may further comprise filtering out at least one of the one or more classifications by abbreviating at least one of the one or more classifiers to elide matches to the at least one classifier that produce the at least one classification.
  • the method may be embodied in executable software modules of a processor- based system, such as a server, and/or in executable instructions stored in a non-transitory computer-readable medium.
  • FIG. 1 illustrates an example infrastructure, in which one or more of the processes described herein may be implemented, according to an embodiment.
  • FIG. 2 illustrates an example processing system, by which one or more of the processes described herein may be executed, according to an embodiment.
  • FIG. 3 illustrates an example process for data discovery, according to an embodiment.
  • FIG. 4 illustrates an example process for data classification, according to an embodiment.
  • FIG. 5 illustrates an example classifier, according to an embodiment.
  • FIG. 6 illustrates an example algorithm for quantitative scoring, according to an embodiment.
  • FIG. 7 illustrates an example algorithm for qualitative filtering, according to an embodiment.
  • systems, methods, and non-transitory computer-readable media are disclosed to prevent, detect, and/or mitigate cyberattacks and/or other forms of data loss through an improved means of discovering structured data.
  • the disclosed embodiments may enable entities that control data to continuously monitor access, modification, and/or movement of that data, including when such activity falls outside of behavioral norms.
  • FIG. 1 illustrates an example infrastructure for capturing and analyzing interactions between two or more network agents, according to an embodiment.
  • the infrastructure comprises a platform 110 that comprises a capture/analysis system 112 and a detection system 114.
  • Capture/analysis system 112 may be communicatively connected to one or more network taps 120 on network paths 140 between two or more network agents 130.
  • the network comprising network agents 130 may comprise any type of network, including an intranet and/or the Internet, and network agents 130 may communicate using any known communication protocols, such as Transmission Control Protocol (TCP), Internet Protocol (IP), Hypertext Transfer Protocol (HTTP), Secure HTTP (HTTPS), File Transfer Protocol (FTP), and/or the like.
  • the infrastructure may comprise any number and mixture of platforms 110, capture/analysis systems 112, detection systems 114, and network taps 120, and any plurality of network agents 130.
  • Each network tap 120 is positioned within a network so as to capture data packets that are sent within network traffic between two or more network agents 130.
  • network tap 120 may be positioned on a network switch in network path 140 between network agents 130A and 130B.
  • Network taps may be implemented in software, hardware, or a combination of hardware and software.
  • Network agents 130 may comprise any type and mixture of device(s) that are capable of network communication, including servers, desktop computers, mobile devices, Internet of Things (IoT) devices, electronic kiosks, and/or the like.
  • the network switch is configured to relay raw data packets sent from network agent 130A to network agent 130B, and to relay raw data packets sent from network agent 130B to network agent 130A.
  • the network tap 120 on the network switch is configured to copy all of these raw data packets, and transmit the copies of these raw data packets to capture/analysis system 112 via a communication path 150. While network tap 120 may occasionally miss data packets, network tap 120 will capture substantially all network traffic and transmit that network traffic as raw data packets to capture/analysis system 112.
  • Capture/analysis system 112 receives the raw data packets, copied by network tap 120 from network traffic between a plurality of network agents 130, and assembles and analyzes the raw data packets to produce a semantic model of the network traffic.
  • This capture-and-analysis process is described in detail in U.S. Patent Nos. 9,100,291, issued on August 4, 2015, 9,185,125, issued on November 10, 2015, and 9,525,642, issued on December 20, 2016, which are all hereby incorporated herein by reference as if set forth in full, and collectively referred to herein as “the prior patents.” It should be understood that, in these prior patents, the capture-and-analysis device corresponds to capture/analysis system 112 described herein, and that the detector corresponds to detection system 114 described herein.
  • Detection system 114 consumes the semantic model(s) of network traffic that are produced by capture/analysis system 112. Embodiments of this consumption by detection system 114 are the primary focus of the present disclosure. Thus, it should be understood that processes disclosed herein may be implemented by detection system 114. Detection system 114 may be a software module that communicatively interfaces with capture/analysis system 112, via any known means (e.g., an application programming interface (API), shared memory, etc.), to receive the semantic models of network traffic. Detection system 114 may be executed by the same processing device as capture/analysis system 112 or a different processing device than capture/analysis system 112.
  • While capture/analysis system 112 and detection system 114 are shown as being comprised within a single platform 110, they may instead be comprised in separate platforms and/or may be controlled by different operators. In other words, while detection system 114 utilizes the output of capture/analysis system 112, these two systems are not required to be under control of the same entity. However, for purposes of security, it may be beneficial for these two systems 112 and 114 to be under control of the same entity.
  • FIG. 2 is a block diagram illustrating an example wired or wireless system 200 that may be used in connection with various embodiments described herein.
  • system 200 may be used as or in conjunction with one or more of the processes described herein (e.g., to store and/or execute software implementing the disclosed processes), and may represent components of platform 110, including capture/analysis system 112 and detection system 114, network tap 120, network agents 130, and/or any other processing devices described herein.
  • capture/analysis system 112 and detection system 114 are implemented as software modules
  • one or more systems 200 may be used to store and execute these software modules.
  • System 200 can be a server, a conventional personal computer, or any other processor-enabled device that is capable of wired or wireless data communication. Other computer systems and/or architectures may be also used, as will be clear to those skilled in the art.
  • System 200 preferably includes one or more processors, such as processor 210. Additional processors may be provided, such as an auxiliary processor to manage input/output, an auxiliary processor to perform floating-point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal processing algorithms (e.g., digital-signal processor), a slave processor subordinate to the main processing system (e.g., back-end processor), an additional microprocessor or controller for dual or multiple processor systems, and/or a coprocessor.
  • auxiliary processors may be discrete processors or may be integrated with processor 210. Examples of processors which may be used with system 200 include, without limitation, the Pentium® processor, Core i7® processor, and Xeon® processor, all of which are available from Intel Corporation of Santa Clara, California.
  • Processor 210 is preferably connected to a communication bus 205.
  • Communication bus 205 may include a data channel for facilitating information transfer between storage and other peripheral components of system 200.
  • communication bus 205 may provide a set of signals used for communication with processor 210, including a data bus, address bus, and/or control bus (not shown).
  • Communication bus 205 may comprise any standard or non-standard bus architecture such as, for example, bus architectures compliant with industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and/or the like.
  • System 200 preferably includes a main memory 215 and may also include a secondary memory 220.
  • Main memory 215 provides storage of instructions and data for programs executing on processor 210, such as one or more of the functions and/or modules discussed herein. It should be understood that programs stored in the memory and executed by processor 210 may be written and/or compiled according to any suitable language, including without limitation C/C++, Java, JavaScript, Perl, Visual Basic, .NET, and the like.
  • Main memory 215 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM). Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, including read only memory (ROM).
  • Secondary memory 220 may optionally include an internal medium 225 and/or a removable medium 230.
  • Removable medium 230 is read from and/or written to in any well-known manner.
  • Removable storage medium 230 may be, for example, a magnetic tape drive, a compact disc (CD) drive, a digital versatile disc (DVD) drive, other optical drive, a flash memory drive, and/or the like.
  • Secondary memory 220 is a non-transitory computer-readable medium having computer-executable code (e.g., disclosed software representing capture/analysis system 112 and/or detection system 114) and/or other data stored thereon.
  • the computer software or data stored on secondary memory 220 is read into main memory 215 for execution by processor 210
  • secondary memory 220 may include other similar means for allowing computer programs or other data or instructions to be loaded into system 200. Such means may include, for example, a communication interface 240, which allows software and data to be transferred from external storage medium 245 to system 200. Examples of external storage medium 245 may include an external hard disk drive, an external optical drive, an external magneto-optical drive, and/or the like. Other examples of secondary memory 220 may include semiconductor-based memory, such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable read-only memory (EEPROM), and flash memory (block-oriented memory similar to EEPROM).
  • system 200 may include a communication interface 240.
  • Communication interface 240 allows software and data to be transferred between system 200 and external devices (e.g. printers), networks, or other information sources.
  • computer software or executable code may be transferred to system 200 from a network server (e.g., platform 110) via communication interface 240.
  • Examples of communication interface 240 include a built-in network adapter, network interface card (NIC), Personal Computer Memory Card International Association (PCMCIA) network card, card bus network adapter, wireless network adapter, Universal Serial Bus (USB) network adapter, modem, a wireless data card, a communications port, an infrared interface, an IEEE 1394 (FireWire) interface, and any other device capable of interfacing system 200 with a network (e.g., network(s) 120) or another computing device.
  • Communication interface 240 preferably implements industry-promulgated protocol standards, such as Ethernet IEEE 802 standards, Fibre Channel, digital subscriber line (DSL), asymmetric digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated services digital network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on, but may also implement customized or non-standard interface protocols as well.
  • Software and data transferred via communication interface 240 are generally in the form of electrical communication signals 255. These signals 255 may be provided to communication interface 240 via a communication channel 250.
  • communication channel 250 may be a wired or wireless network, or any variety of other communication links.
  • Communication channel 250 carries signals 255 and can be implemented using a variety of wired or wireless communication means including wire or cable, fiber optics, conventional phone line, cellular phone link, wireless data communication link, radio frequency (“RF”) link, or infrared link, just to name a few.
  • Computer-executable code (e.g., computer programs, such as implementations of the disclosed software processes) is stored in main memory 215 and/or secondary memory 220.
  • Computer programs can also be received via communication interface 240 and stored in main memory 215 and/or secondary memory 220.
  • Such computer programs, when executed, enable system 200 to perform the various functions of the disclosed embodiments as described elsewhere herein.
  • computer-readable medium is used to refer to any non-transitory computer-readable storage media used to provide computer-executable code and/or other data to or within system 200.
  • Examples of such media include main memory 215, secondary memory 220 (including internal medium 225, removable medium 230, and external storage medium 245), and any peripheral device communicatively coupled with communication interface 240 (including a network information server or other network device).
  • These non-transitory computer-readable media are means for providing executable code, programming instructions, software, and/or other data to system 200.
  • the software may be stored on a computer-readable medium and loaded into system 200 by way of removable medium 230, I/O interface 235, or communication interface 240.
  • the software is loaded into system 200 in the form of electrical communication signals 255.
  • the software when executed by processor 210, preferably causes processor 210 to perform one or more of the processes described herein.
  • I/O interface 235 provides an interface between one or more components of system 200 and one or more input and/or output devices.
  • Example input devices include, without limitation, sensors, keyboards, touch screens or other touch-sensitive devices, biometric sensing devices, computer mice, trackballs, pen-based pointing devices, and/or the like.
  • Examples of output devices include, without limitation, other processing devices, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), printers, vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), and/or the like.
  • System 200 may also include optional wireless communication components that facilitate wireless communication over a voice network and/or a data network (e.g., in the case of user system 130).
  • the wireless communication components comprise an antenna system 270, a radio system 265, and a baseband system 260.
  • antenna system 270 may comprise one or more antennae and one or more multiplexors (not shown) that perform a switching function to provide antenna system 270 with transmit and receive signal paths.
  • received RF signals can be coupled from a multiplexor to a low noise amplifier (not shown) that amplifies the received RF signal and sends the amplified signal to radio system 265.
  • radio system 265 may comprise one or more radios that are configured to communicate over various frequencies.
  • radio system 265 may combine a demodulator (not shown) and modulator (not shown) in one integrated circuit (IC). The demodulator and modulator can also be separate components. In the incoming path, the demodulator strips away the RF carrier signal leaving a baseband receive audio signal, which is sent from radio system 265 to baseband system 260.
  • baseband system 260 decodes the signal and converts it to an analog signal. Then the signal is amplified and sent to a speaker. Baseband system 260 also receives analog audio signals from a microphone. These analog audio signals are converted to digital signals and encoded by baseband system 260. Baseband system 260 also encodes the digital signals for transmission and generates a baseband transmit audio signal that is routed to the modulator portion of radio system 265. The modulator mixes the baseband transmit audio signal with an RF carrier signal, generating an RF transmit signal that is routed to antenna system 270 and may pass through a power amplifier (not shown). The power amplifier amplifies the RF transmit signal and routes it to antenna system 270, where the signal is switched to the antenna port for transmission.
  • Baseband system 260 is also communicatively coupled with processor 210, which may be a central processing unit (CPU).
  • Processor 210 has access to data storage areas 215 and 220.
  • Processor 210 is preferably configured to execute instructions (i.e., computer programs, such as the disclosed software) that can be stored in main memory 215 or secondary memory 220.
  • Computer programs can also be received from baseband system 260 and stored in main memory 215 or in secondary memory 220, or executed upon receipt. Such computer programs, when executed, enable system 200 to perform the various processes described herein.
  • the described processes may be embodied in one or more software modules that are executed by one or more hardware processors (e.g., processor 210), for example, as capture/analysis system 112 and/or detection system 114.
  • the described processes may be implemented as instructions represented in source code, object code, and/or machine code. These instructions may be executed directly by the hardware processor(s), or alternatively, may be executed by a virtual machine operating between the object code and the hardware processors.
  • the described processes may be implemented as a hardware component (e.g., general-purpose processor, integrated circuit (IC), application-specific integrated circuit (ASIC), digital signal processor (DSP), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, etc.), combination of hardware components, or combination of hardware and software components.
  • FIG. 3 illustrates the data flow in a process 300 for automated data discovery, according to an embodiment.
  • Process 300 may be executed by at least one hardware processor (e.g., processor 210) of detection system 114.
  • In sub-process 310, in-context statements are received and analyzed to produce semantic objects.
  • Sub-process 310 is described in greater detail in the prior patents.
  • As described in the prior patents, network traffic (e.g., TCP/IP traffic representing client-server database information) is captured and decoded (e.g., at both the level of the communication protocol used and the level of the data-access language used) to produce a semantic model. This semantic model may comprise or be used to produce a plurality of semantic objects within context.
  • the inputs to sub-process 310 are statements within their contexts.
  • the statements may comprise any statement in any programming language for accessing data (e.g., within a relational database).
  • Examples of data-access languages include, without limitation, Oracle™, MySQL™, SQL Server™, MS Access™, IBM Db2™, dBase™, FoxPro™, and the like.
  • the data-access language will primarily be illustrated herein as a form of Structured Query Language (SQL).
  • the outputs from sub-process 310 may be a set of semantic objects representing one or more statements that have been extracted from network traffic (e.g., between network agents 130).
  • Each semantic object may represent one or more textual components (e.g., an identifier used in a SQL statement) of the statement, from which it was derived via sub-process 310, as well as an indication of the semantic context in which each of these textual component(s) appeared.
  • the semantic objects may be logically arranged in a hierarchical parse tree, such that the semantic context of each textual component is indicated or implied by its position within a structured descent into the hierarchical parse tree (e.g., its ancestral and/or descendant nodes).
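  The idea that a textual component's semantic context is implied by its position in the hierarchical parse tree can be sketched with a toy tree; the tree shape and node names here are hypothetical, not the actual parse representation:

```python
def contexts(node, path=()):
    """Yield (textual_component, ancestral_path) pairs from a toy parse
    tree represented as nested dicts/lists, where the path of ancestor
    keys stands in for the semantic context."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from contexts(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            yield from contexts(child, path)
    else:
        yield node, path

# Hypothetical parse tree for a SQL-like statement.
tree = {"statement": {"select": {"column": ["ssn", "dob"]},
                      "from": {"table": "customers"}}}
list(contexts(tree))
# e.g. ("ssn", ("statement", "select", "column"))
```

  A structured descent like this lets the classifier treat "ssn" as a selected column and "customers" as a table name, even though both are plain identifiers.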
  • In sub-process 320, the semantic objects output by sub-process 310 are received and analyzed by a data classifier to produce classifications, which may also be referred to as "semantic matches."
  • the classifications may be represented and stored within an analytic database 330, which is accessible to sub-processes 340 and/or 350.
  • Sub-processes 340 and 350 represent two beneficial uses of the classifications in analytic database 330. Any detection system 114 may implement one or both of sub-processes 340 and 350, may utilize the classifications in some other way, and/or may simply provide access to the classifications in analytic database 330 (e.g., for use by another system).
  • detection system 114 may consist of the entirety of process 300 or any subset of the sub-processes in process 300 (e.g., only sub-process 320, only sub-process 340, only sub-process 350, a combination of sub-processes 320, 340, and 350, a combination of sub-processes 320 and 340, a combination of sub-processes 320 and 350, a combination of sub-processes 340 and 350, etc.), including subsets that comprise only portions of one or more of the sub-processes in process 300.
  • In sub-process 340, each classification may be scored with a real number, where each real number represents a classification score that indicates the relative importance of a given classification, matched in sub-process 320 and stored in analytic database 330.
  • These classifications may be filtered (e.g., by their respective classification scores, for example, using one or more score thresholds) to narrow the universe of classifications, available in analytic database 330, to only those classifications that are associated with certain statements and/or contexts that, for example, are associated with activity that falls outside of behavioral norms, and therefore, may represent a cyberattack or other form of data loss.
  • In sub-process 350, the classifications output by sub-process 320 may be processed using relatively general classification taxonomies, specialized to local environments, to yield selective classifications. This qualitative approach is on par with the quantitative approach in sub-process 340, but with no implicit filtering required. Thus, the qualitative approach in sub-process 350 can be more suitable for compliance-oriented environments than the quantitative approach in sub-process 340.
  • FIG. 4 illustrates the data flow in data-classification sub-process 320 within process 300, according to an embodiment.
  • Sub-process 320 may be executed by at least one hardware processor (e.g., processor 210) of detection system 114.
  • each semantic object may represent an identifier or comment in a data-access language, within a data-access language statement, executing within the context of client-server communication discovered within network traffic.
  • Each semantic object may be analyzed in sub-process 410 to determine one or more word sequences, within the represented statement or comment, that are likely to convey the intent of the statement or comment, which will be useful for classification.
  • an identifier e.g., the name or alias of a database, table, column, etc.
  • each word sequence may be referred to herein as a "group." However, it should be understood that each word sequence or group may comprise only a single word or any plurality of words.
  • sub-process 410 may select the identifiers "Soc Sec No" (i.e., a column identifier) and "User Data" (i.e., a relation identifier) for subsequent processing.
  • In sub-process 420, a non-standard word analysis is applied to each group produced by sub-process 410.
  • the non-standard word analysis for a group determines the best match between the group (i.e., sequence of one or more words) and a sequence of one or more lemmas in semantic dictionary 450.
  • semantic dictionary 450 may comprise WordNet™, produced by Princeton University, for the English language.
  • a lemma is a standard word that is a known key to a set of one or more entries within semantic dictionary 450.
  • sub-process 420 receives groups of words and standardizes them into sequences of lemmas that can be found in semantic dictionary 450.
  • sub-process 420 may expand the identifiers "Soc Sec No" and "User Data" into the following groups: "Soc"; "Sec"; "No"; "User"; and "Data". Then, the group "Soc" may be matched to the lemma "social", the group "Sec" may be matched to the lemma "security", and the group "No" may be matched to the lemma "number". In addition, the group "User" may be matched to the lemma "user", and the group "Data" may be matched to the lemma "data". It should be understood that these are all examples of single-word groups. In other examples, a single group may comprise two or more words. For example, the SQL comment "/* social security number */" would form the group consisting of "social", "security", and "number".
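  • The expansion and lemma matching of the "Soc Sec No" example above can be sketched as follows; the ABBREVIATIONS table and helper names are illustrative stand-ins for the non-standard word analysis performed against semantic dictionary 450:

```python
# Illustrative only: a toy expansion table mapping non-standard tokens to
# lemmas, mimicking the "Soc Sec No" example. The actual sub-process 420
# would consult semantic dictionary 450 (e.g., WordNet).
ABBREVIATIONS = {"soc": "social", "sec": "security", "no": "number"}

def split_identifier(identifier):
    """Split a compound identifier like "Soc Sec No" into word groups."""
    return identifier.replace("_", " ").split()

def to_lemmas(groups):
    """Standardize each group into a lemma; unknown groups pass through lowercased."""
    return [ABBREVIATIONS.get(g.lower(), g.lower()) for g in groups]
```

  • For example, `to_lemmas(split_identifier("Soc Sec No"))` would yield the lemma sequence for "social security number".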
  • In sub-process 430, sequences of one or more of the lemmas produced by sub-process 420 are matched to lexical entries in semantic dictionary 450 to produce lexical-entry matches, which could also be referred to as "lexemes." Each lexical-entry match represents a unit of meaning represented by a set of one or more of the lemmas output by sub-process 420.
  • sub-process 430 maps sequences of lemma(s) to their potential meanings. Continuing the simple example from above, sub-process 430 may match the sequence of lemmas "social", "security", and "number" (derived from the groups "Soc", "Sec", and "No") to a lexical entry for "social security number".
  • each lexical-entry match may comprise a sequence of one or more tuples.
  • each tuple may comprise a two-tuple with at least the following components: (1) a subsequence of one or more lemmas, produced by sub-process 420, that matched one or more lexical entries in semantic dictionary 450; and (2) a set of one or more references to the one or more lexical entries in semantic dictionary 450 that were matched.
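  • The two-tuple described above can be sketched as follows; the field names and the dictionary-key format are hypothetical:

```python
from collections import namedtuple

# Hedged sketch of a lexical-entry match two-tuple: (1) a matched lemma
# subsequence and (2) a set of references to the matched dictionary entries.
LexicalEntryMatch = namedtuple("LexicalEntryMatch", ["lemmas", "entry_refs"])

match = LexicalEntryMatch(
    lemmas=("social", "security", "number"),
    entry_refs={"wn:social_security_number"},  # hypothetical reference key
)
```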
  • lexical-entry matching in sub-process 430 may be limited by one or more logical constraints.
  • the constraints may include one or more, including potentially all, of the following:
  • Each lexical-entry match comprises a subsequence of the input lemmas, with each input lemma matched, at most, once across all matched subsequences;
  • sub-process 430 favors lexical-entry matches (e.g., to either written representations for English-language lemmas or translations for foreign-language lemmas) that contain multiple lemmas and represent the most specific possible lexical entry in semantic dictionary 450.
  • sub-process 430 skips any non-matches as unimportant.
  • constraint (3) leverages existing English translations of lexical entries in semantic dictionary 450 (e.g., WordNet™), so that the subsequent semantic entry matching in sub-process 440 can apply the same classifiers regardless of the human language of the groups input to sub-process 410.
  • In sub-process 440, the lexical-entry matches output by sub-process 430 are matched against specific classifiers to produce classifications.
  • the classifiers may be defined as records within an extensible classifier database 460.
  • the resulting classifications from sub-process 440 may be stored in analytic database 330.
  • the input to sub-process 440 is a sequence of one or more of the tuples (e.g., two-tuples) that were described above as the output of sub-process 430.
  • the classifications, output by sub-process 440, may comprise a sequence of one or more references to lexical entries in semantic dictionary 450, within a data structure representing each classification.
  • each classifier in classifier database 460 may be defined as a containment hierarchy, in which each child in the hierarchy is "contained" by, and hence implicitly associated with, exactly one parent in the hierarchy, defined in a top-down fashion.
  • FIG. 5 illustrates example classifiers, according to an embodiment.
  • classifier database 460 may comprise one or more classifiers 510.
  • Each classifier 510 represents one entire classification taxonomy, and may have one or more categories 520 as children.
  • Each category 520 may be recursively defined to represent a specific classification asserted by its associated classifier 510, and may have one or more other categories 520 and/or one or more entry groups 530 as children.
  • Each entry group 530 represents a specific way that a category 520 can be inferred from lexical-entry matches, and may have one or more entry sets 540 as children.
  • an entry group 530 that is able to match to a plurality of lexical entries may have an associated match window (e.g., defined by a specified window size).
  • each entry set 540 may specify one or more lexical entries 550 that are associated with one aspect of the entry group 530 that is the parent of the entry set 540. While the illustrated hierarchy will be used throughout the present description for consistency and ease of understanding, the hierarchy may comprise fewer or more levels than illustrated and described.
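  • The containment hierarchy of FIG. 5 can be sketched as follows; the class and field names are illustrative, and only the levels described above are modeled:

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal sketch of the containment hierarchy: classifier 510 -> category 520
# -> entry group 530 -> entry set 540 -> lexical entry 550. Each child is
# contained by exactly one parent.
@dataclass
class LexicalEntry:
    entry_ref: str            # reference into semantic dictionary 450

@dataclass
class EntrySet:
    entries: List[LexicalEntry]

@dataclass
class EntryGroup:
    entry_sets: List[EntrySet]
    window: Optional[int] = None   # optional match window size

@dataclass
class Category:
    children: list            # nested Category and/or EntryGroup children

@dataclass
class Classifier:
    categories: List[Category]

clf = Classifier([Category([EntryGroup([EntrySet([LexicalEntry("wn:ssn")])],
                                       window=5)])])
```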
  • the classification process in sub-process 440 may be limited by one or more logical constraints.
  • the constraints may include one or more, including potentially all, of the following, for each statement or comment that is input into process 300:
  • Each of the lexical entries 550, output by sub-process 440, is associated with exactly one lexical-entry match, input by sub-process 430, such that an ordinal can be determined that represents the position of each lexical entry 550 within the sequence of lexical-entry matches; and/or
  • sub-process 410 analyzes the semantic model output by the processes described in the prior patents.
  • the semantic model may comprise elements of statements, found in network traffic (e.g., in a data access language), arranged as semantic objects in parse trees within their contexts.
  • sub-process 410 may consider one or more, including potentially all, of the following context indications, if present in or otherwise known for the given statement:
  • server: the server being accessed by the statement (e.g., as defined by an optional first part of a four-part relation identifier);
  • database: the database being accessed by the statement (e.g., as defined by an optional second part of the four-part relation identifier);
  • schema: the schema being used by the statement (e.g., as defined by an optional third part of the four-part relation identifier);
  • relation: the type of relationship (e.g., as defined by a required fourth part of the four-part relation identifier);
  • alias: an identifier optionally assigned as an alias (e.g., of the result column of a query) in the statement;
  • sub-process 410 may comprise a plurality of discrete passes on one or more parse trees of semantic objects in the semantic model. Each pass may comprise a semantic analysis of one or more textual components within a window that moves or changes between passes.
  • the textual components will generally comprise individual identifiers (e.g., user-defined identifiers of tables, columns, etc., within statements) or natural-language word sequences (e.g., comments found with statements).
  • Word sequences in statements, formed according to whichever data-access language is used, are frequently encoded in a manner that renders simple matching to standard word dictionaries ineffective. Examples of such issues include compound words (e.g., identifiers of columns, databases, etc.), syntax driven by the rules of the particular data-access language that was used, abbreviations (e.g., employed within identifiers and/or natural-language comments), and/or the like.
  • human programmers often employ non-standard words (i.e., words not found in standard dictionaries) for reasons including syntax rules of the chosen data-access language, brevity, and/or the like.
  • One purpose of non-standard word analysis is to expand these non-standard words to standard words that can be matched to standard dictionaries.
  • the state of the art in non-standard word analysis is represented by "A Text Normalisation System for Non-Standard English Words," by Flint et al., Dept. of Theoretical and Applied Linguistics, Computer Laboratory, Univ. of Cambridge, Cambridge, U.K. ("Flint et al."), which is hereby incorporated herein by reference as if set forth in full.
  • the non-standard word analysis in Flint et al. is modified to tailor the analysis to the particular data-classification requirements unique to data-access languages.
  • the non-standard word analysis may be tailored in one or more, including potentially all, of the following manners:
  • Non-standard word analysis is highly dependent upon the groups that are input.
  • the groups are defined by context analysis in sub-process 410. This affects various heuristics employed by the non-standard word analysis in sub-process 420, and materially changes the lemma sequences output by sub-process 420.
  • semantic dictionary 450 is used in place of standard word dictionaries.
  • lemmas are derived by sub-process 420 from semantic dictionary 450, instead of a standard word dictionary, thereby altering the results of the non-standard word analysis.
  • Non-standard word analysis is generally language-dependent. In other words, the process of decoding text sequences to standard words involves language-specific rules. However, in an embodiment, sub-process 420 receives the human language as a parameter and applies rules, specific to the human language specified by that parameter, to produce the lemmas. In this way, sub-process 420 recognizes textual references to non-English words that correspond to the meaning of the lemmas in semantic dictionary 450.
  • sub-process 430 for lexical-entry matching will now be described in further detail.
  • these lemmas are matched to lexical entries in semantic dictionary 450, in order to efficiently represent the relationship between statements found in network traffic.
  • the context of lexical-entry matches may be retained via representation in a database (e.g., in analytic database 330, which may itself be a relational database).
  • the lexical-entry matches, output by sub-process 430, may be represented using two tables (e.g., in PostgreSQL):
  • the above tables are a very efficient representation, in terms of both time and memory, of all matched lemmas, within individual statements and their contexts.
  • the stmt_lex_entry table represents the result of lexical-entry matching in sub-process 430, and each row in the stmt_lex_entry table represents the result of a lexical analysis on a group output by sub-process 410.
  • the tables may also encode a direct reference to the matched lexical entry in semantic dictionary 450 (e.g., via the wn_ids column in the ident_mapping table).
  • the lexical-entry matches represent all matches, within semantic dictionary 450, to a statement, independent of the details of the user-extensible classifiers discussed elsewhere herein.
  • the lexical-entry matches can be very quickly and efficiently analyzed for many purposes within the semantic-entry matching of sub-process 440, without recourse to the statement text or lemmas.
  • each lexical-entry match records the human source language of the matched text, for example, via the language column in the stmt_lex_entry table.
  • the value of this language column may be set to a parameterized language value from the non-standard word analysis in sub-process 420 and/or semantic dictionary 450.
  • the corresponding ident_mapping(s) in the mappings column of the stmt_lex_entry table may include references to both language-dependent and language-independent identifiers in semantic dictionary 450.
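  • As a hedged reconstruction (the full table definitions are not reproduced here), a row of the two tables might be modeled as follows; only the columns actually named in the text (language, mappings, wn_ids) are shown, and the value formats are assumptions:

```python
# Hedged reconstruction: rows of the stmt_lex_entry / ident_mapping tables
# as plain Python dicts. Column names come from the text; everything else
# (key formats, nesting) is an assumption for illustration only.
ident_mapping = {"wn_ids": ["wn:social_security_number"]}  # refs into dictionary 450
stmt_lex_entry_row = {
    "language": "en",            # human source language of the matched text
    "mappings": [ident_mapping], # language-dependent and -independent identifiers
}
```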
  • one objective of the data classifier is to allow for the separate representation of completely extensible classification taxonomies, which can be easily defined and extended as needed.
  • semantic dictionary 450 comprises at least a portion (e.g., a large subset of English-language words) of Princeton University's WordNet™, as represented in the Resource Description Framework (RDF). Then, the classifier definitions may reference not just words, but the words' meanings within a large universe of terms, leveraging many years of linguistic research.
  • foreign-language representations can be mapped to their corresponding lexical entries in semantic dictionary 450, thereby facilitating multi-language support.
  • the WordNet™ RDF ontology includes the translation of senses.
  • many people have contributed to creating translations linked to WordNet™.
  • semantic dictionary 450 enables different source languages to be matched to the same classifications in sub-process 440.
  • the "ident_mapping" entries, in the lexical-entry matches output by sub-process 430, may reference language-independent WordNet™ identifiers to be used in the semantic-entry matching of sub-process 440.
  • classifiers 510 in classifier database 460 may define arbitrary classification taxonomies in terms of sets of lexical entries in WordNet™, defined via the Web Ontology Language (OWL) of the semantic web.
  • Advantages to such a definition may include one or more of the following:
  • the classification taxonomies are completely arbitrary trees, which define semantic categories 520 ultimately in terms of arbitrary references to lexical entries in semantic dictionary 450.
  • WordNet™ synsets, which define particular meanings (e.g., complete with glosses and examples) via implication over a plurality of lexical entries.
  • These mechanisms include, without limitation, discrete lexical entries, complete synsets, hyponym (more specific) synsets that are transitively closed, meronym (part-whole hierarchies) synsets that are transitively closed, and/or arbitrary SPARQL Protocol and RDF Query Language (SPARQL) queries.
  • An entry group 530 can specify an optional match window with a set of entry sets 540, thereby requiring that each entry set 540 (perhaps providing only weak evidence for category membership) be separately satisfied within this match window within a particular match context. In practice, this makes it easier to express category membership with fairly general phrases, without resort to sequential notions or complex syntax.
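  • Under one plausible reading of the match-window constraint above (an assumption, not the embodiment's definitive semantics), an entry group is satisfied when every one of its entry sets matches at some ordinal position and all of those positions fit within a window of the given size:

```python
# Sketch (assumed semantics): each entry set must be separately satisfied,
# and all entry sets must be satisfiable within one window of size `window`
# over the ordinal positions of lexical-entry matches.
def group_satisfied(entry_set_positions, window):
    """entry_set_positions: one set of matched ordinals per entry set."""
    if any(not positions for positions in entry_set_positions):
        return False  # every entry set must be separately satisfied
    # Try anchoring the window at every matched position.
    candidates = sorted(set().union(*entry_set_positions))
    for start in candidates:
        if all(any(start <= p < start + window for p in positions)
               for positions in entry_set_positions):
            return True
    return False
```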
  • Since RDF/OWL is completely declarative, the implementation may be defined directly as follows. Notably, functional hyperlinks have been removed from the following definitions (by removing the "http://" prefix) to comply with the Manual of Patent Examining Procedure (MPEP) for the U.S. Patent and Trademark Office. The following definitions set constants to be used throughout the definition, according to an embodiment:
  • @prefix rdfs: <www.w3.org/2000/01/rdf-schema#> .
  • @prefix xsd: <www.w3.org/2001/XMLSchema#> .
  • classifiers 510 as well as name and description properties to be used, according to an embodiment:
  • rdfs:comment "Represents a category within a classification taxonomy"@en ; rdfs:subClassOf [
  • entry groups 530 are defined, according to an embodiment:
  • entry sets 540 and lexical entries 550 are defined, according to an embodiment:
  • rdfs:comment "Abstract, potentially heritable attribute of a category" ; rdfs:subClassOf [
  • sub-process 440 for semantic entry matching will now be described in further detail. Given the lexical-entry matches and classifier definitions described above, classifier matching in sub-process 440 can be done at high speed by incrementally determining all matches to categories 520, constrained by window requirements, requirements for multiple entry sets 540 to be collectively satisfied within an entry group 530 for a match to the entry group 530 (when applicable), and/or the like.
  • each row in the stmt_lex_entry table, described above, is analyzed by sub-process 440.
  • Each row represents a lexical-entry match, and may give rise, in sub-process 440, to zero, one, or a plurality of classifications.
  • each classification is evidence that the invocation of the statement (e.g., an SQL statement) accesses data within a category 520 of a classifier 510.
  • classifications, output by sub-process 440, may be represented as follows for fast and efficient scoring and reporting in sub-processes 340 and/or 350:
  • each row in the stmt_entry_match table represents the satisfaction of a discrete category 520 defined by a classifier 510;
  • classifiers 510 are expanded from RDF/OWL definitions into a concrete representation;
  • this representation allows fast and flexible scoring and reporting (e.g., in sub-processes 340 and/or 350) of all classifier implications (e.g., represented as classifications in analytic database 330), even in very large environments with many thousands of databases and clients, without loss of detail.
  • Quantitative scoring is appealing in security-oriented environments.
  • the quantitative scoring of sub-process 340 may account for both the selectivity of classifications, which compare local match frequencies to broad contexts, and various weightings (e.g., user-defined weightings).
  • Quantitative scoring appeals primarily to analytic applications or users, which are typically interested in data security.
  • One goal of quantitative scoring is to generate a score that indicates the relative likelihood that statements, within a particular context that defines a subset of all observed statements that have been classified in sub-process 320, implicate a particular category 520 within a taxonomy of one or more classifiers 510.
  • the quantitative scoring of sub-process 340 may utilize the following data types and relations (e.g., defined in PostgreSQL):
  • FIG. 6 illustrates an example algorithm 600 for quantitative scoring, according to an embodiment.
  • algorithm 600 may be defined as a PostgreSQL user-defined aggregate.
  • This scoring aggregate may be applied to any set of one or more rows in the stmt_entry_match table, to produce a scalar score for that set of row(s).
  • the scoring aggregate scans over records of the stmt_entry_match_score_t and stmt_lex_entry_stats_t data types.
  • the stmt_entry_match_score_t records may be associated in a one-to-one relation with rows in the stmt_entry_match table.
  • Functions of the scoring aggregate may be implemented as three user-defined functions: state initialization (step 605), transition (step 610), and finalization (step 615). These functions may be programmed in C or any other suitable programming language, and operate on records of the score_state_t data type.
  • values in the wn_prob column of the stmt_entry_match_score_t records are computed via a procedure derived from Zipf's Law.
  • Zipf's Law is a heuristic that relates word sequences, sorted by frequency of occurrence, to a probability of occurrence, without undue dependence on a specific application corpus.
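  • A standard Zipfian estimate, shown here only as an illustration of the heuristic (the embodiment's exact procedure is not specified), assigns the k-th most frequent entry a probability proportional to 1/k:

```python
# Illustrative Zipf's-Law estimate: the probability of the entry at frequency
# rank `rank` in a vocabulary of `vocab_size` entries is (1/rank), normalized
# by the harmonic number so that all probabilities sum to 1.
def zipf_probability(rank, vocab_size):
    harmonic = sum(1.0 / k for k in range(1, vocab_size + 1))
    return (1.0 / rank) / harmonic
```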
  • Each stmt_lex_entry_stats_t record may be created by a subsidiary function of the scoring aggregate, which sums values across all of the rows in the stmt_lex_entry table that are associated with the given context of interest (e.g., all of the rows in the stmt_lex_entry table that are associated with a particular database service).
  • the values in the score_vf column of the stmt_lex_entry_stats_t records represent the scale of the given context of interest, compared to the scale of all lexical entries in semantic dictionary 450.
  • Step 605 represents the initialization function of the scoring aggregate of algorithm 600, in which data structures are initialized.
  • algorithm 600 initializes a record of the match_score_t data type and creates a score_state_t record (e.g., implemented as a C-language struct type). For example, all values may be initialized to zero, except for the stmt_stats column, which may be initialized to the record of the stmt_lex_entry_stats_t data type, representing the summary of statistics, which may be used and remain constant across all iterations of the subsequent steps.
  • a classifiers linked list refers to records associated with classifiers 510.
  • Each classifier record contains the associated classifier state, derived from a stmt_entry_match_score_t record, which includes a categories linked list of records associated with categories 520 in that classifier 510.
  • each record in the categories linked list refers to an entry-groups linked list of records associated with entry groups 530 in that category 520.
  • each record in the entry-groups linked list refers to an entry-sets linked list of records associated with entry sets 540 in that entry group 530.
  • each record in the entry-sets linked list refers to a lexical-entries linked list of records associated with lexical entries 550 in that entry set 540.
  • each record in the lexical-entries linked list refers to a matches linked list of records associated with the matches between that lexical entry 550 and each distinct statement (e.g., SQL statement), entry_context_aux_t record, and entry_context_t record.
  • each record in the matches linked list refers to another match record (e.g., a C-language struct type) that represents the next statement, entry_context_aux_t record, or entry_context_t record.
  • a "current" classifier 510, category 520, entry group 530, entry set 540, lexical entry 550, and match is defined by transitive closure over the linked lists embedded in these records (e.g., C-language struct type records).
  • Each of the records in the various nested linked lists may be implemented as a C-language struct type, and may also comprise a score record, which represents the number of matches accumulated during the aggregation.
  • Step 610 represents the transition function of the scoring aggregate of algorithm 600, in which a classification record is created.
  • algorithm 600 sorts all stmt_entry_match_score_t records of interest by their c_id, cat_id, g_id, ces_id, ent_id, sql_id, sle_context_aux, and sle_context columns, in that order, and then applies the transition function to each of these stmt_entry_match_score_t records of interest.
  • the transition function may comprise the following steps to aggregate data in the score_state_t record that was created in step 605: (1) If the c_id represents the first row seen for the identified classifier 510 (e.g., the first iteration of the transition function, or a newly seen identifier indicating the start of the next classifier 510, etc.), create a new record for this current classifier 510, and add this new classifier record to the classifiers linked list;
  • the cat_id represents the first row seen for the identified category 520 for the current classifier 510 (e.g., the categories linked list for the current classifier 510 is empty, or the cat_id is a newly seen identifier indicating the start of the next category 520, etc.), create a new record for this current category 520, and add this new category record to the categories linked list for the current classifier record;
  • the g_id represents the first row seen for the identified entry group 530 for the current category 520 of the current classifier 510 (e.g., the entry-groups linked list for the current category 520 is empty, or the g_id is a newly seen identifier indicating the start of the next entry group 530, etc.), create a new record for this current entry group 530, and add this new entry-group record to the entry-groups linked list for the current category record;
  • the ces_id represents the first row seen for the identified entry set 540 for the current entry group 530 of the current category 520 of the current classifier 510 (e.g., the entry-sets linked list for the current entry group 530 is empty, or the ces_id is a newly seen identifier indicating the start of the next entry set 540, etc.), create a new record for this current entry set 540, and add this new entry-set record to the entry-sets linked list for the current entry-group record;
  • the lexical-entries linked list for the current entry set 540 is empty, or the ent_id is a newly seen identifier indicating the start of the next lexical entry 550, etc.
  • create a new record for this current lexical entry 550 and add this new lexical-entry record to the lexical-entries linked list for the current entry-set record;
  • the sql_id represents the first row seen for the identified statement (e.g., the statements associated with the current matches linked list for the current lexical entry 550 is empty, or the sql_id is a newly seen identifier indicating the start of the next statement, etc.), create a new match record for the current statement, and add this new match record to the matches linked list for the current lexical-entry record;
  • the sle_context_aux represents the first row seen for the current sle_context_aux value (e.g., the sle_context_aux value represents the next auxiliary context in the aggregation sequence), create a new match record for this sle_context_aux value, and add this new match record to the matches linked list for the current lexical-entry record.
  • the sle_context represents the first row seen for the current sle_context value (e.g., the sle_context value represents the next context in the aggregation sequence), create a new match record for this sle_context value, and add this new match record to the matches linked list for the current lexical-entry record.
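  • The transition steps above amount to grouping pre-sorted rows into a nested structure, one level per identifier column. As a sketch only, plain dicts and itertools.groupby stand in for the embodiment's C-language linked lists (the row dicts below are hypothetical):

```python
from itertools import groupby

# Sketch: nest rows pre-sorted by (c_id, cat_id, g_id, ces_id, ent_id, ...)
# into a classifier -> category -> entry-group -> entry-set -> lexical-entry
# tree, mirroring the linked-list construction of the transition function.
def build_state(rows):
    def nest(rows, keys):
        if not keys:
            return list(rows)  # leaves: the match records themselves
        key = keys[0]
        return {k: nest(list(g), keys[1:])
                for k, g in groupby(rows, key=lambda r: r[key])}
    return nest(rows, ["c_id", "cat_id", "g_id", "ces_id", "ent_id"])

rows = [
    {"c_id": 1, "cat_id": 1, "g_id": 1, "ces_id": 1, "ent_id": 7, "sql_id": 42},
    {"c_id": 1, "cat_id": 1, "g_id": 1, "ces_id": 1, "ent_id": 7, "sql_id": 43},
    {"c_id": 1, "cat_id": 2, "g_id": 3, "ces_id": 4, "ent_id": 9, "sql_id": 42},
]
state = build_state(rows)
```

  • Because the rows are sorted, a newly seen identifier at any level starts a new group, just as the "first row seen" tests above start a new record.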
  • Step 615 represents the finalization function of the scoring aggregate of algorithm 600.
  • algorithm 600 converts the score_state_t record, which has accumulated information over the aggregation of all of the stmt_entry_match_score_t records of interest, into a final score represented by a classifiers_score_t record.
  • the finalization function of algorithm 600 performs a depth-first walk of the tree (e.g., created in step 610), rooted in the classifiers linked list, and descends into the categories linked list, entry-groups linked list, entry-sets linked list, and lexical-entries linked list, in turn, to produce a classifiers_score_t record for each classifier 510.
  • step 615 may comprise calculating a local score for each matched lexical entry 550, calculating an entry-set score for each matched entry set 540 based on the local scores of the matched lexical entries 550 contained within the matched entry set 540, calculating an entry-group score for each matched entry group 530 based on the entry-set scores of the matched entry sets 540 contained within the matched entry group 530, calculating a category score for each matched category 520 based on the entry-group scores of the matched entry groups 530 contained within the matched category 520, and calculating a classifier score for each matched classifier 510 based on the category scores of the matched categories 520 contained within the matched classifier 510.
  • the local score for each matched lexical entry 550 is computed by first determining the number of lexical entries 550 expected from a randomly distributed population of references to the lexical entry, in semantic dictionary 450, that is associated with the classifier lexical entry 550. This number may be computed by multiplying the Zipf's-Law estimated probability of the lexical entry in semantic dictionary 450 by the total number of lexical entries (e.g., wn_entries) from the associated stmt_lex_entry_stats_t record.
  • the actual number of occurrences is determined by multiplying the number of matched lexical entries 550 (i.e., associated with the lexical entry in semantic dictionary 450) by the certainty associated with the matched lexical entry.
  • the certainty represents the strength of a non-standard word match and, by default, may be set to 1.0 (e.g., if no specific certainty is available or specified).
  • an intermediate score for a given matched lexical entry 550 is calculated by subtracting the expected number of matches from the actual number of matches, and dividing the difference by the expected number of matches. A raw score can then be calculated by multiplying this intermediate score by a user-defined weighting for the matched lexical entry 550, to produce the local score for the matched lexical entry 550.
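The expected-versus-actual computation in the three bullets above can be sketched as follows. This is an illustrative reading of the description, not the patent's actual implementation; the function name and parameter names are assumptions.

```python
def local_score(zipf_probability, total_entries, match_count,
                certainty=1.0, weight=1.0):
    """Sketch of the local score for one matched lexical entry.

    zipf_probability: Zipf's Law estimated probability of the lexical
        entry in the semantic dictionary.
    total_entries: total number of lexical entries observed (e.g., the
        wn_entries value from the stmt_lex_entry_stats_t record).
    match_count: number of matched lexical entries associated with this
        dictionary entry.
    certainty: strength of a non-standard word match (defaults to 1.0).
    weight: user-defined weighting for the matched lexical entry.
    """
    # Expected occurrences under a randomly distributed population.
    expected = zipf_probability * total_entries
    # Actual occurrences, scaled by the match certainty.
    actual = match_count * certainty
    # Intermediate score: relative excess of actual over expected.
    intermediate = (actual - expected) / expected
    # Local (raw) score: intermediate score times the user-defined weight.
    return intermediate * weight

# Example: an entry expected about 2 times but matched 6 times, weight 1.5.
score = local_score(zipf_probability=0.002, total_entries=1000,
                    match_count=6, certainty=1.0, weight=1.5)
```

Note that a match count at or below the expected count yields a zero or negative local score, so only over-represented entries contribute positively.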
  • the entry-set score for each matched entry set 540 is calculated by multiplying the maximum local score for the matched lexical entries 550 contained by the given entry set 540 by the weighting value in the score_vf column of the stmt_lex_entry_stats_t record.
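The entry-set roll-up just described is a simple max-and-weight step; as a sketch (the function name is an illustrative assumption, and score_vf stands for the weighting column named above):

```python
def entry_set_score(local_scores, score_vf):
    """Sketch: entry-set score from the local scores of its matched entries.

    local_scores: local scores of the matched lexical entries contained
        by the entry set.
    score_vf: the weighting value from the score_vf column.
    """
    # Maximum local score within the set, times the score_vf weighting.
    return max(local_scores) * score_vf
```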
  • the entry-group score for each matched entry group 530 is computed based on the entry-set scores of the entry sets 540 contained by the given entry group 530. If the given entry group 530 contains a single entry set 540, the entry-group score for that given entry group 530 is calculated by multiplying the entry-set score for the single entry set 540 by a weight (e.g., a user-defined weight for the entry group 530).
  • if the given entry group 530 contains a plurality of entry sets 540 (i.e., the given entry group 530 is satisfied via the satisfaction of a plurality of entry sets 540), multiple lexical entries in semantic dictionary 450 have been satisfied, thereby implying additional selectivity for which algorithm 600 should preferably account.
  • the entry-group score is computed by first computing a probability estimate that is accumulated by a series of computations that are based on the maximum local scores of the lexical entries 550 that are associated with each of the entry sets 540 in the given entry group 530.
  • This accumulated probability estimate may be initiated by, for the first entry set 540 in the given entry group 530, computing the probability of the lexical entry in semantic dictionary 450 that is associated with the matched lexical entry 550 having the maximum local score from all matched lexical entries 550 contained within the first entry set 540.
  • This probability may be calculated by, first, subtracting the probability associated with that lexical entry in semantic dictionary 450 from 1.0, and then raising the difference to the power given by the total number of lexical entries in semantic dictionary 450 (e.g., as represented by the value in the wn_entries column in the stmt_lex_entry_stats_t record), to produce an intermediate quantity.
  • a probability may then be calculated by subtracting the intermediate quantity from 1.0, and dividing the difference by the weight (e.g., user-defined weight for the lexical entry 550) associated with the matched lexical entry 550 having the maximum local score.
  • the intermediate quantity is calculated by subtracting the probability associated with the lexical entry in semantic dictionary 450, corresponding to the matched lexical entry 550 within the given entry set 540 having the maximum local score, from 1.0, and then raising the difference to the power given by the window size of the given entry group 530.
  • the probability is calculated in the same manner as for the first entry set 540 by subtracting the intermediate quantity from 1.0, and dividing the difference by the weight (e.g., user-defined weight) associated with the matched lexical entry 550 having the maximum local score.
  • the probability for the given entry group 530 is accumulated by multiplying the probability calculated for each subsequent entry set 540 by the accumulated probability calculated in the previous step (i.e., after accounting for the preceding entry set 540). After the accumulated probability has taken into account every entry set 540 contained within the given entry group 530, the entry-group score for the given entry group 530 may then be calculated by dividing 2.0 by this final accumulated probability, and then multiplying the quotient by the weight associated with the given entry group 530 (e.g., a user-defined weight represented by the value in the score_vf column of the associated stmt_lex_entry_stats_t record).
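The multi-entry-set accumulation described in the preceding bullets can be sketched as follows. The tuple representation of entry sets, the function name, and the parameter names are illustrative assumptions; only the arithmetic follows the description above.

```python
def entry_group_score(entry_sets, total_entries, window_size, group_weight):
    """Sketch of the entry-group score for a group satisfied by
    multiple entry sets.

    entry_sets: list of (probability, weight) pairs, one per entry set,
        where probability is the semantic-dictionary probability of the
        lexical entry having the maximum local score in that entry set,
        and weight is that entry's user-defined weight.
    total_entries: total lexical entries (e.g., wn_entries), used as the
        exponent for the first entry set.
    window_size: the entry group's window size, used as the exponent for
        every subsequent entry set.
    group_weight: weight for the entry group (e.g., score_vf).
    """
    accumulated = 1.0
    for i, (p, weight) in enumerate(entry_sets):
        # First entry set uses the dictionary size; the rest use the
        # group's window size.
        exponent = total_entries if i == 0 else window_size
        # Intermediate quantity: (1 - p) raised to the exponent.
        intermediate = (1.0 - p) ** exponent
        # Probability contribution, divided by the entry's weight.
        probability = (1.0 - intermediate) / weight
        accumulated *= probability
    # Smaller accumulated probability implies greater selectivity, hence
    # a larger score: 2.0 / accumulated, times the group weight.
    return (2.0 / accumulated) * group_weight
```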
  • the category score for each matched category 520 is computed from the entry-group scores of the entry groups 530 contained within that category 520.
  • the category score for a given category 520 may be calculated by multiplying the maximum entry-group score from all of the entry-group scores of the entry groups 530, contained within the given category 520, by a weight (e.g., a user-defined weight) associated with the given category 520.
  • the classifier score for each matched classifier 510 is computed from the category scores of the categories 520 contained within that classifier 510.
  • the classifier score for a given classifier 510 may be the maximum category score from all of the category scores of the categories 520 contained within the given classifier 510.
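The top two levels of the roll-up, as just described, reduce to maxima (the function names are illustrative assumptions):

```python
def category_score(entry_group_scores, category_weight):
    """Sketch: category score from its entry groups' scores."""
    # Maximum entry-group score within the category, times the
    # (e.g., user-defined) category weight.
    return max(entry_group_scores) * category_weight

def classifier_score(category_scores):
    """Sketch: classifier score from its categories' scores."""
    # Simply the maximum category score within the classifier.
    return max(category_scores)
```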
  • the ranges of classifier scores can be stratified as follows:
  • sub-process 340 may also comprise quantitative filtering in step 620. However, in an alternative embodiment, step 620 may be omitted. For example, when depicting the categories 520 of classifications implicated by a given context (e.g., in response to a user or system query), sub-process 340 may execute a meta-classification algorithm in step 620 that heuristically identifies the "knee" in the curve of the category scores for the given context.
  • This heuristic essentially estimates the slope of the graph of category scores, and filters out category scores beyond the point where this slope reaches some minimum (e.g., thereby representing insignificant category scores), subject to one or more parameters that prevent the algorithm from ignoring significant category scores that fall above a first threshold value and/or that enable the algorithm to ignore insignificant category scores that fall below a second threshold value.
  • the algorithm may employ a coordinate transform that is intended to reflect the scale of all data as square.
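One way such a knee-finding heuristic could look is sketched below. This sketch approximates the knee as the largest drop in the sorted score curve; the approximation, the function name, and the two guard-threshold parameters (keep_above, drop_below) are assumptions standing in for the patent's unspecified parameters.

```python
def filter_scores(scores, keep_above=None, drop_below=None):
    """Heuristic 'knee' filter over a context's category scores (sketch).

    Keeps scores up to the knee of the descending score curve, subject
    to two guards: scores >= keep_above are always kept, and scores
    < drop_below are always dropped.
    """
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return ordered
    # Approximate the knee as the largest drop between adjacent scores.
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    knee = gaps.index(max(gaps))  # keep indices 0..knee
    kept = []
    for i, score in enumerate(ordered):
        if drop_below is not None and score < drop_below:
            continue  # insignificant regardless of position
        if i <= knee or (keep_above is not None and score >= keep_above):
            kept.append(score)
    return kept
```

For example, for the scores [10, 9, 1, 0.9, 0.8] the largest drop lies between 9 and 1, so only the two scores before that drop survive unless keep_above rescues the rest.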
  • the quantitative filtering of step 620 minimizes false positives, thereby making it easier for operators to pinpoint exactly what and where their classified data are, across very large enterprises.

2.1.3. Qualitative Filtering
  • sub-process 350 may comprise an algorithm that makes recommendations for highly tailored negations, designed to reduce the largest number (i.e., benefit) of implicated contexts (e.g., typically discrete database services) with the lowest classifier category group significance (i.e., risk).
  • the algorithm may exhaustively discover discrete sets of category satisfactions at specific levels of significance, and present these sets in order of increasing risk.
  • a user may specify one or more elements (e.g., via a graphical user interface), which precisely define a modified or abbreviated classifier.
  • the semantic entry matching in sub-process 440 may be extended to elide matches negated by the abbreviated classifier.
  • the net effect of sub-process 350 is to allow the depiction of all classifications found in any context, while using local and explicit decision-making to suppress noise.
  • FIG. 7 illustrates an example algorithm 700 for qualitative filtering, according to an embodiment.
  • Algorithm 700 may be implemented as a single PostgreSQL stored procedure.
  • in step 705, algorithm 700 computes and stores a first relation between each entry in the stmt_lex_entry table and the associated context (e.g., database service identifier).
  • in step 710, algorithm 700 projects a second relation between each context (e.g., database service identifier) and category scores (e.g., the category scores produced by algorithm 600) for all associated classifications. This projection may be staged through a number of intermediate relations.
  • in step 715, algorithm 700 projects a third relation giving the maximum category score by distinct entry sets 540 that satisfy one or more classifications. This relation may be given a primary key.
  • in step 720, algorithm 700 projects a fourth relation between each context (e.g., database service identifier) and the primary key of the third relation projected in step 715.
  • in step 725, algorithm 700 performs a loop for as long as the number of distinct contexts (e.g., database service identifiers) within the fourth relation from step 720 is greater than zero. Specifically, if the number of distinct contexts within the fourth relation is greater than zero (i.e., "Yes" in step 725), algorithm 700 proceeds to step 730. Otherwise, if the number of distinct contexts within the fourth relation is zero (i.e., "No" in step 725), algorithm 700 ends.
  • in step 730, algorithm 700 determines the minimum category score in the fourth relation and determines a context (e.g., database service identifier) associated with this minimum score. It should be understood that the determined context may be one of a plurality of contexts associated with the minimum score.
  • in step 735, algorithm 700 computes the set of entry sets 540 required to negate all of the classifications associated with this context (e.g., database service identifier).
  • in step 740, algorithm 700 deletes all of the rows from the fourth relation that match any of the entry sets 540 within the set computed in step 735. This deletion of rows represents the effect of negating all of the contexts (e.g., services) implicated by just these entry sets 540.
  • in step 745, algorithm 700 computes the new number of distinct contexts (e.g., database service identifiers) in the fourth relation following the deletion in step 740.
  • in step 750, algorithm 700 returns a row comprising the set of entry sets 540 deleted in step 740, the maximum quantitative score associated with the proposed negation (e.g., representing the associated risk), and the number of released contexts (e.g., database service identifiers), which represents the difference between the number of distinct contexts at entry to the loop of step 725 and the number computed in step 745.
  • This result row serves several useful purposes. First, it recommends a specific set of classifier entry sets that, if negated, would reduce the number of implicated contexts (e.g., service identifiers), with a strictly maximum score/minimum risk. Second, it characterizes this level of risk.
  • sub-process 350 will effectively negate the associated set of entry sets 540, thereby pruning the associated classifier 510.
  • after step 750, algorithm 700 continues the loop, represented by step 725, with this new number of distinct contexts, unless this new number of distinct contexts is zero. Returning multiple "frontier" rows, via a plurality of iterations of step 750 within these loops, allows the end-user to make multiple, inherently conditional decisions at one time.
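The greedy loop of steps 725-750 can be sketched in simplified form as follows. The flat tuple representation of the fourth relation, the function name, and the output shape are illustrative assumptions; the patent describes this as a PostgreSQL stored procedure over several relations.

```python
def recommend_negations(rows):
    """Sketch of the qualitative-filtering loop (steps 725-750).

    rows: list of (context_id, entry_set_id, category_score) tuples, a
        simplified stand-in for the fourth relation of algorithm 700.
    Returns recommendation rows, each giving the entry sets to negate,
    the maximum score among the negated rows (the risk), and the number
    of contexts released, in order of increasing risk.
    """
    rows = list(rows)
    recommendations = []
    while rows:  # step 725: loop while distinct contexts remain
        contexts_before = len({ctx for ctx, _, _ in rows})
        # Step 730: a context holding the minimum category score.
        min_ctx = min(rows, key=lambda r: r[2])[0]
        # Step 735: entry sets required to negate that context's
        # classifications.
        negated_sets = {es for ctx, es, _ in rows if ctx == min_ctx}
        # Risk: maximum score among the rows this negation removes.
        risk = max(score for _, es, score in rows if es in negated_sets)
        # Step 740: delete every row matching any negated entry set.
        rows = [r for r in rows if r[1] not in negated_sets]
        contexts_after = len({ctx for ctx, _, _ in rows})
        # Step 750: emit the recommendation ("frontier") row.
        recommendations.append(
            (negated_sets, risk, contexts_before - contexts_after))
    return recommendations
```

For example, if two low-scoring services share entry set "A" while a third, high-scoring service relies on entry set "B", the first recommendation negates "A" (low risk, two contexts released) and only a later, riskier recommendation touches "B".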
  • a user may view the recommendations of classifier entry sets 540 (e.g., via a graphical user interface), produced by iterations of steps 725-750 of algorithm 700 (e.g., in increasing level of significance and decreasing level of triviality).
  • the user may select one or more of the recommendations to thereby abbreviate the classifier 510 by negating the recommended entry sets 540.
  • classifications which rely on the negated entry sets 540 may be filtered out from classifications shown to the user, so that the user can concentrate on only those classifications which are important to the user.
  • process 300 automates data discovery and monitoring in a manner that was not previously possible.
  • the semantic analysis of sub-process 310 automatically captures and produces a semantic analysis of statements and comments in network traffic between network agents 130.
  • the data classification of sub-process 320 classifies the captured statements and comments in their contexts.
  • sub-processes 310 and 320 infer what and where data is stored based on the data-access-language statements used to access that data.
  • the result is a representation, stored in analytic database 330, of the type of data (i.e., classifications of data) being stored in the network and the locations (i.e., contexts) of each type of data. Because this representation of types and locations of data is produced automatically, entities no longer have to go through the costly and time-consuming process of manually and explicitly identifying, accessing, and probing their structured data.
  • the representations of types and locations of data in the network can be quickly scored and filtered via sub-processes 340 and 350. This enables the operator of the network to quickly identify relevant types of data and their locations.
  • classifications can be reported to end users in any chosen language (e.g., by leveraging the translations of lexical entries present in semantic dictionary 450).
  • Combinations, described herein, such as "at least one of A, B, or C," "one or more of A, B, or C," "at least one of A, B, and C," "one or more of A, B, and C," and "A, B, C, or any combination thereof" include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C.
  • combinations such as "at least one of A, B, or C," "one or more of A, B, or C," "at least one of A, B, and C," "one or more of A, B, and C," and "A, B, C, or any combination thereof" may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, and any such combination may contain one or more members of its constituents A, B, and/or C.
  • a combination of A and B may comprise one A and multiple B's, multiple A's and one B, or multiple A's and multiple B's.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Automated data discovery for cybersecurity is disclosed. In an embodiment, a plurality of semantic objects, representing textual components, in context, of one or more statements in a data-access language, captured from network traffic, are received. The textual components are decomposed into one or more groups of one or more words. The group(s) are then matched to lemmas in a semantic dictionary to produce one or more matched lemmas, and the matched lemma(s) are matched to one or more lexical entries in the semantic dictionary to produce one or more matched lexical entries. The matched lexical entry or entries are matched to one or more classifiers in a classifier database to produce one or more classifications. Each classification represents a possibility that the statement(s) access data in that classification. The classification(s) may be stored in association with their context for quantitative and/or qualitative analysis of the types and locations of data in a network.
PCT/US2020/017064 2019-02-11 2020-02-06 Automated data discovery for cybersecurity WO2020167586A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962804083P 2019-02-11 2019-02-11
US62/804,083 2019-02-11

Publications (1)

Publication Number Publication Date
WO2020167586A1 true WO2020167586A1 (fr) 2020-08-20

Family

ID=72045281

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/017064 WO2020167586A1 (fr) 2020-08-20 Automated data discovery for cybersecurity

Country Status (1)

Country Link
WO (1) WO2020167586A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024604A1 (en) * 2007-07-19 2009-01-22 Microsoft Corporation Dynamic metadata filtering for classifier prediction
US8214199B2 (en) * 2006-10-10 2012-07-03 Abbyy Software, Ltd. Systems for translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions
US20140201838A1 (en) * 2012-01-31 2014-07-17 Db Networks, Inc. Systems and methods for detecting and mitigating threats to a structured data storage system
US20160132492A1 (en) * 2010-11-22 2016-05-12 Alibaba Group Holding Limited Text segmentation with multiple granularity levels
US20160239480A1 (en) * 2010-03-26 2016-08-18 Virtuoz Sa Performing Linguistic Analysis by Scoring Syntactic Graphs
US20170010829A1 (en) * 2015-07-06 2017-01-12 Wistron Corporation Method, system and apparatus for predicting abnormality
WO2017127850A1 (fr) * 2016-01-24 2017-07-27 Hasan Syed Kamran Sécurité informatique basée sur l'intelligence artificielle

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214199B2 (en) * 2006-10-10 2012-07-03 Abbyy Software, Ltd. Systems for translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions
US20090024604A1 (en) * 2007-07-19 2009-01-22 Microsoft Corporation Dynamic metadata filtering for classifier prediction
US20160239480A1 (en) * 2010-03-26 2016-08-18 Virtuoz Sa Performing Linguistic Analysis by Scoring Syntactic Graphs
US20160132492A1 (en) * 2010-11-22 2016-05-12 Alibaba Group Holding Limited Text segmentation with multiple granularity levels
US20140201838A1 (en) * 2012-01-31 2014-07-17 Db Networks, Inc. Systems and methods for detecting and mitigating threats to a structured data storage system
US20170010829A1 (en) * 2015-07-06 2017-01-12 Wistron Corporation Method, system and apparatus for predicting abnormality
WO2017127850A1 (fr) * 2016-01-24 2017-07-27 Hasan Syed Kamran Sécurité informatique basée sur l'intelligence artificielle

Similar Documents

Publication Publication Date Title
US20240070204A1 (en) Natural Language Question Answering Systems
US20220382752A1 (en) Mapping Natural Language To Queries Using A Query Grammar
Lin et al. KBPearl: a knowledge base population system supported by joint entity and relation linking
US8630989B2 (en) Systems and methods for information extraction using contextual pattern discovery
US20180232443A1 (en) Intelligent matching system with ontology-aided relation extraction
US10229200B2 (en) Linking data elements based on similarity data values and semantic annotations
US9037615B2 (en) Querying and integrating structured and unstructured data
US10339453B2 (en) Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation
Ferrari et al. Pragmatic ambiguity detection in natural language requirements
US11816156B2 (en) Ontology index for content mapping
Matentzoglu et al. A Snapshot of the OWL Web
Dinh et al. Identifying relevant concept attributes to support mapping maintenance under ontology evolution
US10628749B2 (en) Automatically assessing question answering system performance across possible confidence values
US10282678B2 (en) Automated similarity comparison of model answers versus question answering system output
US20230325368A1 (en) Automatic data store architecture detection
Zhang et al. An unsupervised data-driven method to discover equivalent relations in large Linked Datasets
KR101375221B1 (ko) 의료 프로세스 모델링 및 검증 방법
WO2020167586A1 (fr) Automated data discovery for cybersecurity
Arnold et al. Resolving common analytical tasks in text databases
WO2021034663A1 (fr) Procédé de catégorisation dynamique par traitement de langage naturel
Ciszak Knowledge Discovery Approach to Repairs of Textual Attributes in Data Warehouses
Knoop Making company policies accessible through Information Retrieval
Song Towards a linked semantic web: Precisely, comprehensively and scalably linking heterogeneous data in the semantic web
SCHEMA DASKALAKI EVANGELIA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20756222

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 20756222

Country of ref document: EP

Kind code of ref document: A1