EP3991076A1 - Threat detection using machine learning query analysis - Google Patents
Info
- Publication number
- EP3991076A1 (application EP20830727.2A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- queries
- data
- query
- sample database
- database query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/554—Detecting local intrusion or implementing counter-measures involving event detection and direct action
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/034—Test or assess a computer or a system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Definitions
- This disclosure relates to improvements in computer security and threat detection that are also associated with machine learning and artificial intelligence technology, in various embodiments.
- FIG. 1 illustrates a block diagram of a system including user systems, a front end server, backend server, AI system, and database, according to some embodiments.
- FIG. 2 illustrates a diagram showing some examples of database queries, according to some embodiments.
- FIG. 3A illustrates a representation of a vector space resulting from training a machine learning classifier on query data, according to some embodiments.
- FIG. 3B illustrates a further representation of a vector space resulting from training a machine learning classifier on query data, with an additional query shown, according to some embodiments.
- FIG. 4 illustrates a flowchart of a method relating to determining if a database query represents a data access anomaly, according to some embodiments.
- FIG. 5 is a diagram of a computer readable medium, according to some embodiments.
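- FIG. 6 illustrates one embodiment of a computer system, according to some embodiments.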
- Different users who have database access may read, modify, delete, or add data, among other things. These operations can be performed via a database query (e.g. a sequence of one or more commands for the database to execute). Such queries may vary in length, content, and formatting/styling. Numerous different features (e.g. characteristics) of these queries can be extracted by a feature extraction algorithm. The extracted features can then be used to essentially profile different users, through the targeted application of machine learning and artificial intelligence technology.
- Queries for each of a number of users can be mapped to a vector space such that each user’s queries are clustered relatively close together, but remain at some distance from queries by other users.
- When a new sample query is received, it can be assessed (through a trained machine learning classifier) to determine its level of similarity with previous queries by that user, as well as one or more other users who may behave similarly to the querying user (e.g. nearby neighboring users within a vector space) and one or more other users who do not write similar queries to the querying user (e.g. one or more distant users within the vector space).
- A determination can then be made as to whether this new query represents a data access anomaly, e.g. a particularly unusual query for a user, given his or her past, that may indicate user security credentials have been compromised, or that some other kind of data breach may be occurring.
- If an anomaly is indicated, remedial actions may also be executed. These remedial actions can include increasing logging or locking a user account for further investigation, among other options.
- The techniques outlined below thus provide for increased computer and data security, in various embodiments.
- Various components may be described or claimed as “configured to” perform a task or tasks.
- “Configured to” is used to connote structure by indicating that the components include structure (e.g., stored logic) that performs the task or tasks during operation.
- A component can thus be said to be configured to perform the task even when the component is not currently operational (e.g., is not on). Reciting that a component is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that component.
- System 100 includes user systems 105A, 105B, and 105C.
- System 100 also includes front end server 120, backend server 160, database 165, AI system 170 (artificial intelligence system 170), and network 150.
- The techniques described herein can be utilized in the environment of system 100, as well as numerous other types of environments.
- Note that many other permutations of Fig. 1 are contemplated (as with all figures). While certain connections are shown (e.g. data link connections) between different components, in various embodiments, additional connections and/or components may exist that are not depicted.
- Routers, switches, load balancers, computing clusters, additional databases, servers, firewalls, etc., may all be present and utilized.
- Components may be combined with one another and/or separated into one or more systems in this figure, as in other figures.
- User systems 105A, 105B, and 105C (“user systems 105”) may be any user computer system that can potentially interact with front end server 120, according to various embodiments.
- Front end server 120 may send communications to users, such as emails, text messages, etc. These communications may contain personalized content created based on an association of a particular country with a particular person, in some embodiments.
- Front end server 120 may also provide web pages that facilitate one or more services, such as account access and electronic payment transactions (as may be provided by PayPal.com™). Front end server 120 may thus facilitate access to various electronic resources, which can include an account, data, and various software programs/functionality, etc.
- A user of user system 105A may receive communications from front end server 120.
- Front end server 120 may be any computer system configured to provide access to electronic resources. This can include providing communications to users and/or web content, in various embodiments, as well as access to functionality provided via a web client (or via other protocols, including but not limited to SSH, FTP, database and/or API connections, etc.). Services provided may include serving web pages.
- Database 165 may include various data, such as user account data, system data, and any other information. Multiple such databases may exist, of course, in various embodiments, and can be spread across one or more data centers, cloud computing services, etc.
- Front end server 120 may comprise one or more computing devices each having a processor and a memory.
- Network 150 may comprise all or a portion of the Internet.
- Front end server 120 may correspond to an electronic payment transaction service such as that provided by PayPal™ in some embodiments, though in other embodiments, front end server 120 may correspond to different services and functionality.
- Front end server 120 and/or backend server 160 may have a variety of associated user accounts allowing users to make payments electronically and to receive payments electronically.
- A user account may have a variety of associated funding mechanisms (e.g. a linked bank account, a credit card, etc.) and may also maintain a currency balance in the electronic payment account.
- A number of possible different funding sources can be used to provide a source of funds (credit, checking, balance, etc.).
- User devices (smart phones, laptops, desktops, embedded systems, wearable devices, etc.) can be used to access electronic payment accounts such as those provided by PayPal™.
- Front end server 120 may also correspond to a system providing functionalities such as API access, a file server, or another type of service with user accounts in some embodiments (and such services can also be provided via front end server 120 in various embodiments).
- Database 165 can include a transaction database having records related to various transactions taken by users of a transaction system in the embodiment shown. These records can include any number of details, such as any information related to a transaction or to an action taken by a user on a web page or an application installed on a computing device (e.g., the PayPal™ app on a smartphone). Many or all of the records in database 165 are transaction records including details of a user sending or receiving currency (or some other quantity, such as credit card award points, cryptocurrency, etc.). The database information may include two or more parties involved in an electronic payment transaction, date and time of transaction, amount of currency, whether the transaction is a recurring transaction, source of funds/type of funding instrument, and any other details.
- Such information may be used for bookkeeping purposes as well as for risk assessment (e.g. fraud and risk determinations can be made using historical data; such determinations may be made using systems and risk models not depicted in Fig. 1 for purposes of simplicity).
- Query log 168 may contain a record of some or all database queries executed on database 165.
- Backend server 160 may be one or more computing devices each having a memory and processor that enable a variety of services. Backend server 160 may be deployed in various configurations. In some instances, all or a portion of the functionality for web services that is enabled by backend server 160 is accessible only via front end server 120 (e.g. some of the functionality provided by backend server 160 may not be publicly accessible via the Internet unless a user goes through front end server 120 or some other type of gateway system). Backend server 160 may perform operations such as risk assessment, checking funds availability, among other operations.
- AI system 170 likewise may be one or more computing devices each having a memory and processor. In various embodiments, AI system 170 performs operations related to training a machine learning classifier and/or using the trained classifier to identify anomalous database queries. AI system 170 may transmit information to and/or receive information from a number of systems, including database 165, front end server 120, and back end server 160, as well as other systems, in various embodiments.
- In Fig. 2, a diagram 200 is shown of some example database queries. Queries 205, 210, 215, 220, and 225 are each different queries in the SQL language.
- The present techniques can be used with multiple different types of database languages, however, and are not limited to SQL.
- Database queries can vary in their content.
- Each particular query may include one or more different commands executable by a database. These commands may cause data to be modified, deleted, or added to a database.
- The commands may also allow retrieval of various information, and modification to particular structures within the database (e.g. joining of tables, creation or deletion of existing tables, etc.).
- The queries shown in Fig. 2 are not especially lengthy, but database queries may span multiple lines and contain a number of different commands, in various embodiments.
- A database query, as discussed herein, is thus not limited to a single command.
- A database query may be contained in a script file and/or macro, and can be executed this way, or can be executed manually via a command line interface, for example. Queries submitted to a database can be logged and stored in query log 168. These logs can be accessed in order to perform analysis and determine if any data access anomalies are present.
- In Fig. 3A, a diagram 300 is shown of a representation of a vector space resulting from training a machine learning classifier, according to some embodiments.
- A first axis 302 and a second axis 304 are shown on the graph.
- This graph is a simplified two-dimensional representation of a vector space; note that in many instances, however, the vector space may have many more dimensions (e.g. 3-dimensional, 10-dimensional, 40-dimensional, or some other number based on the features consumed by the ML classifier).
- Centroid 310 is surrounded by nearby queries A, B, D, and E, all of which are associated with a first user (e.g. that user has executed those queries against one or more databases).
- Centroid 320 is likewise near queries C, S, and T, which are queries associated with a second user.
- Centroid 330 is near queries F, H, X, and Y, which are associated with a third user.
- In Fig. 3B, a diagram 350 is shown of a further representation of a vector space resulting from training a machine learning classifier, with an additional query shown, according to some embodiments.
- A new query (Query Z) is shown in the top right of the vector space. This query is associated with the same user (a first user) who is already associated with queries A, B, D, and E. The centroid for this user is centroid 310.
- Query Z is a significant distance from centroid 310 in the vector space.
- The features of this new query are different enough from other queries associated with the first user that the new query maps to a distant location within the vector space. This distance is indicative that the new query may represent a data access anomaly; this query could have been generated as a result of an account security breach, for example. Being able to identify these anomalies means that one or more remedial actions can be taken to prevent and/or mitigate the effects of a possible security breach.
- In Fig. 4, a flowchart is shown of a method 400 relating to determining if a database query represents a data access anomaly, according to some embodiments.
- Operations described relative to Fig. 4 may be performed, in various embodiments, by any suitable computer system and/or combination of computer systems, including AI system 170.
- For convenience and ease of explanation, operations described below will simply be discussed relative to AI system 170 rather than any other system. Further, various operations and elements of operations discussed below may be modified, omitted, and/or used in a different manner or different order than that indicated. Thus, in some embodiments, AI system 170 may perform one or more operations while another system might perform one or more other operations.
- AI system 170 accesses a first plurality of database queries executed by a plurality of different users on one or more databases, according to various embodiments. These queries can be used as a basis for the machine learning techniques described further below.
- The queries can be in one or more different query languages.
- Some query languages that may be present in the queries include SQL variants and/or NoSQL variants, Contextual Query Language (CQL), Java Persistence Query Language (JPQL), embedded SQL, access query language, Cypher Query Language, Hyper Text Structured Query Language (HTSQL), Object Query Language (OQL), Molecular Query Language (MQL), Lightweight Directory Access Protocol (LDAP), RDF query language (RDQL), ISBL (Information Systems Base Language), Multidimensional Expressions (MDX), Object Constraint Language (OCL), Data Mining Extensions (DMX), or others.
- The queries may relate to a single database, or a number of different databases. These databases may be relational databases, and may be within a single organization (i.e. company) or be spread across two or more different organizations.
- The database queries may be accessed by retrieving log information from query log 168, in various embodiments. Each of the queries may be associated with a particular user, e.g. a user who has executed a query against one or more databases.
- AI system 170 creates a set of artificial intelligence (AI) training data by extracting a plurality of features from each of the first plurality of database queries, where each of the plurality of features relates to a different attribute of the queries, according to various embodiments.
- Feature extraction can be performed using an algorithm that processes text contained in a database query. Different users may write queries in different styles, and thus the queries can be processed to determine some of the features that differentiate queries associated with different users. Feature extraction can include running one or more scripts on the queries to process their text.
- Some features that are extracted, in various embodiments, include: number of lines in a query; number of tabs (e.g. at the start of a line); number of semicolons; number of lines with a capital letter in them and/or number of lines beginning with a capital letter; length of query (e.g. in characters); total number of capital letters; total number of lowercase letters; total number of uppercase words; total number of lowercase words; number of total comments in the query; number of total comments of a first style in the query (some query languages allow for different styles of comments); number of total comments of a second style in the query; and total number of each different type of command statement in the query.
- Additional features may also be extracted in various embodiments. Each of the features may thus relate to a different attribute of a query, as seen above. These attributes can relate to one another, however (e.g. there may be correlation between total number of capital letters and total number of uppercase words).
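As a concrete illustration of this kind of feature extraction, a script along these lines might be sketched as follows (the feature names and patterns are illustrative, not the patent's exact set):

```python
import re

def extract_features(query: str) -> dict:
    """Extract stylistic features from the raw text of a database query."""
    lines = query.splitlines()
    words = query.split()
    return {
        "num_lines": len(lines),
        "num_leading_tabs": sum(1 for l in lines if l.startswith("\t")),
        "num_semicolons": query.count(";"),
        "num_lines_starting_capital": sum(1 for l in lines if l[:1].isupper()),
        "query_length_chars": len(query),
        "num_capital_letters": sum(c.isupper() for c in query),
        "num_lowercase_letters": sum(c.islower() for c in query),
        "num_uppercase_words": sum(1 for w in words if w.isupper()),
        "num_lowercase_words": sum(1 for w in words if w.islower()),
        "num_line_comments": query.count("--"),                 # one comment style
        "num_block_comments": len(re.findall(r"/\*", query)),   # another comment style
        "num_select_statements": len(re.findall(r"\bselect\b", query, re.IGNORECASE)),
    }
```

The resulting dictionary can then be turned into a numeric feature vector for training.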
- The extracted features form a vector space that can be used for machine learning (e.g. a first feature may be a first dimension of the vector space, a second feature can be a second dimension of the vector space, etc.).
- The set of artificial intelligence (AI) training data may include only query data for users that have a minimum threshold number of previous database queries associated with them. That is, in various embodiments, a user has to have a minimum number of queries (e.g. 5 queries, 20 queries, 60 queries, or some other number) for the queries associated with that user to be included in the training set. If only a small number of queries were used (e.g. a user has only two previous queries), the results might prove unreliable, as a sufficient past history of the user’s habits and tendencies in writing queries has not yet been established.
- AI system 170 trains a machine learning classifier using the set of AI training data and using corresponding labels for each of the first plurality of database queries, according to various embodiments.
- Each of these labels may indicate an identity of one of the plurality of different users that is associated with a particular query, and the trained ML classifier is configured to produce vector outputs in a vector space in response to receiving database queries, according to various embodiments.
- An artificial neural network (ANN) is trained using the set of AI training data.
- Many different queries can be input into an ANN.
- The ANN will seek to place queries by a same user close together in the vector space, but simultaneously place queries by other users at a distance, in various embodiments. This allows for good differentiation between different users in the vector space.
- Different neuron values within different layers of the ANN may be adjusted to achieve these goals (e.g. changing a neuron value up or down, running the training data through the ANN to see what the resulting vector space looks like, then comparing that to other results to see if it was an improvement or a deterioration in performance).
- The resulting ANN after training forms a trained machine learning classifier.
- The training process may also include data dropout between different layers of the ANN to avoid the classifier becoming overtrained and to instead retain better general performance.
- A centroid for a user refers to an average vector defined by various queries associated with that user. (The average vector can be calculated using weighting for component vectors, or as a straight average, in various embodiments.) Each user’s queries may thus have a particular centroid within the vector space.
- The training process seeks to ensure that these centroids for different users have a threshold distance between one another (e.g. a minimum average threshold distance between centroids may be used, where the distances between all centroids are measured and averaged; and/or a minimum absolute distance between any two centroids may be used; and/or other particular criteria may be used in training the classifier).
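As a small illustration of computing a centroid (a NumPy sketch, not code from the patent):

```python
import numpy as np

# Each row is the classifier's output vector for one of a user's past queries.
user_query_vectors = np.array([[0.10, 0.90],
                               [0.20, 0.80],
                               [0.15, 0.85]])

straight_centroid = np.average(user_query_vectors, axis=0)
# np.average also accepts per-query weights for a weighted centroid.
weighted_centroid = np.average(user_query_vectors, axis=0, weights=[1.0, 2.0, 1.0])
```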
- The training process affects how a particular set of features (for a database query) is mapped to a particular location/vector within a vector space.
- In some embodiments, a loss formula is used in training the classifier.
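One plausible margin-based form for such a loss, consistent with the term definitions that follow (an assumed reconstruction for illustration, since the rendered formula does not survive in this text), is:

$$\mathcal{L} \;=\; \mathrm{Distance}(E,\,N) \;+\; \beta \cdot \max\bigl(0,\ \alpha - \mathrm{Distance}(E,\,M)\bigr) \;+\; \max\bigl(0,\ \alpha - \mathrm{Distance}(E,\,D)\bigr)$$

Minimizing such a loss pulls a user’s new queries toward that user’s own centroid, while pushing the neighboring centroid M and the distant centroid D out past the margin alpha, with beta weighting the neighboring term.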
- E is the user embedding (e.g. the centroid for a user’s previous queries).
- N is a new query for that user, i.e. a query that is being evaluated for a potential data access anomaly.
- The Distance function provides a distance measurement in the vector space.
- Distance(E, N) represents a distance between the user’s centroid and the new query.
- M is a neighboring user within the vector space (this could be someone on the user’s team within the organization who executes similar types of queries, for example).
- Distance(E, M) represents a distance between the user’s centroid and a neighboring centroid of another user.
- The neighboring centroid may be chosen with partial or complete randomness in some circumstances, or may be specifically selected in other circumstances.
- The beta parameter may be a value ranging from 0 to 1, in some embodiments, that provides a weighting as to the importance of a neighboring user.
- D is the centroid for a different user whose centroid may be some minimum threshold distance away from the user.
- The distant centroid D may likewise be chosen with partial or complete randomness in some circumstances, or may be specifically selected in other circumstances.
- Distance(E, D) may represent the distance between a querying user’s centroid and the centroid of a distant other user.
- The alpha parameter may be a constant value to provide regularization in the above formula, with a large alpha requiring a larger distance between the querying user’s centroid and a neighboring user centroid.
- This loss function can be used (along with the alpha and beta parameters) to tune the training of the classifier.
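As a minimal sketch of how such a classifier could be trained, assuming a small feed-forward embedding network and the margin-based loss form above (the layer sizes, dropout rate, and exact loss form are illustrative assumptions, since the text here does not fix them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEmbedder(nn.Module):
    """Maps an extracted query-feature vector to a point in the embedding space."""
    def __init__(self, n_features: int, embed_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(p=0.2),  # dropout between layers to avoid overtraining
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def loss_fn(E, N, M, D, alpha=1.0, beta=0.5):
    """Pull the new query N toward the user's centroid E; push the neighboring
    centroid M and a distant centroid D out past the margin alpha."""
    dist = F.pairwise_distance  # batched Euclidean distance
    return (dist(E, N)
            + beta * torch.clamp(alpha - dist(E, M), min=0)
            + torch.clamp(alpha - dist(E, D), min=0)).mean()
```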
- The classifier is configured to receive a particular database query and to output a vector that places the query within a vector space. This output vector can be used to determine if data access anomalies exist, as explained further below.
- AI system 170 provides a sample database query from a first particular user to the trained machine learning classifier and in response, receives a first output vector in the vector space, according to various embodiments.
- The classifier may essentially learn a mapping, where a particular input of features (extracted from a database query) will produce an output vector within a vector space.
- Operation 440 produces a particular output vector for a query that may represent a potential data access anomaly, i.e. a potential security threat.
- Additional analysis can then be performed to see if this query appears distinctly different enough from other previous queries (e.g. by the user and/or by neighboring users) to warrant a possible remedial action designed to prevent or mitigate a data breach.
- AI system 170 determines, based on the first output vector, if the sample database query represents a data access anomaly, according to various embodiments. This determination may include, in various forms, measuring distances between the first output vector and other locations within the vector space (e.g. a centroid for the querying user and/or centroids for other querying users).
- The nearest N user centroids are determined based on a distance between those centroids and the first output vector, where N is an integer one or greater. A determination is then made as to whether the querying user’s centroid appears in the list of closest centroids.
- The value for N may be predetermined in various embodiments, but in some instances may be set to 3, 5, or some other number as desired.
- A distance formula may be used in order to determine whether a sample query represents a data access anomaly.
- The distance between the querying user’s particular centroid and the new query may be determined. Based on the magnitude of this distance, a probability can then be assessed regarding whether the query looks anomalous. For example, the average distance from the user’s centroid to other queries of the user may be a distance of 0.2, with a standard deviation of 0.1. If the distance between the new query and the existing centroid is then found to be 2.8, for example, then the new query is highly likely to be anomalous.
- Statistical analysis regarding the distribution of queries by the querying user, as well as other users, can be used in determining how anomalous a particular query appears to be. If a certain threshold value (e.g. 99.99%, or a distance magnitude of 1.0, or some other threshold value) is reached, then the query is deemed to represent a data access anomaly.
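Putting these checks together, the anomaly decision might be sketched as follows (the nearest-centroid count and the z-score threshold are illustrative choices):

```python
import numpy as np

def is_anomalous(query_vec, user_idx, user_query_vecs, all_centroids,
                 n_nearest=3, z_threshold=3.0):
    """Flag a query if the user's own centroid is not among the nearest centroids,
    or if the query is a statistical outlier versus the user's past queries."""
    # Check 1: does the querying user's centroid appear among the N nearest?
    dists_to_centroids = np.linalg.norm(all_centroids - query_vec, axis=1)
    nearest = np.argsort(dists_to_centroids)[:n_nearest]
    in_nearest = user_idx in nearest

    # Check 2: distance from the user's centroid, relative to past queries.
    centroid = user_query_vecs.mean(axis=0)
    past_dists = np.linalg.norm(user_query_vecs - centroid, axis=1)
    d_new = np.linalg.norm(query_vec - centroid)
    z_score = (d_new - past_dists.mean()) / (past_dists.std() + 1e-9)

    return (not in_nearest) or (z_score > z_threshold)
```

With the numbers from the example above (mean distance 0.2, standard deviation 0.1, new distance 2.8), the z-score is 26, far past any plausible threshold.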
- The classifier does not have to be re-trained even if new users and/or new user queries come up after the initial training.
- The initial training can cause the ML classifier to learn a particular transformation that embeds the queries in particular locations within the vector space. This transformation can also apply to new users.
- Once a new user has a sufficient threshold number of database queries (e.g. 20 queries, 60 queries, or some other number), those queries can be used to generate a query centroid for that user. That query centroid can then be used to determine if future queries for that user are anomalous.
- Remedial actions may be taken if a data access anomaly is determined to exist for a particular database query. Some actions may be taken automatically (e.g. without human initiation) and some may be taken based on human input (e.g. based on analysis from a system administrator and/or security analyst).
- One remedial action is to transmit an alert indicating the data access anomaly.
- An alert could be sent in the form of an email, text message, chat instant message, or some other form to a supervisor of an employee whose account is associated with the sample database query, for example, or such an alert could be sent to a system administrator and/or security specialist.
- Another remedial action would be to disallow the query.
- Yet another remedial action would be to place the query in a hold queue and require approval of at least one other authorized person (e.g. within the organization) before allowing its execution.
- Another remedial action would be to add additional logging for the sample query and/or add additional logging for the associated user account (e.g. more details on all queries going forward that are associated with that account for at least some period of time).
- Remedial actions may also be taken based on the severity of the data access anomaly and/or the data being accessed in the sample query.
- A data access anomaly that is only slightly anomalous (e.g. past a threshold for anomaly, but perhaps only within 20% of that threshold) may warrant a less severe remedial action than a data access anomaly that is highly anomalous (e.g. 99.999% unlikely based on past queries for that user).
- Remedial action may also be taken based on a number of anomalous queries within a particular time period.
- A first data access anomaly might simply result in an alert being sent, for example, while five data access anomalies within an hour, day, or some other time period might result in the associated user account being automatically locked due to the higher level of suspicious activity.
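An escalation policy of this kind might be sketched as follows (the 20% margin, the five-anomaly count, and the one-hour window mirror the examples above; the exact mapping to actions is an assumed policy):

```python
from enum import Enum, auto

class Action(Enum):
    ALERT = auto()
    EXTRA_LOGGING = auto()
    HOLD_FOR_APPROVAL = auto()
    LOCK_ACCOUNT = auto()

def choose_action(severity_ratio: float, anomalies_last_hour: int) -> Action:
    """Pick a remedial action from anomaly severity and recent anomaly count."""
    if anomalies_last_hour >= 5:
        return Action.LOCK_ACCOUNT        # repeated anomalies: lock automatically
    if severity_ratio <= 1.2:             # only slightly past the anomaly threshold
        return Action.ALERT
    if severity_ratio <= 2.0:
        return Action.EXTRA_LOGGING
    return Action.HOLD_FOR_APPROVAL       # highly anomalous: require human approval
```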
- Program instructions may be stored on a non-volatile medium such as a hard disk or FLASH drive, or may be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as a compact disk (CD) medium, DVD medium, holographic storage, networked storage, etc.
- Program code may be transmitted and downloaded from a software source, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known.
- Computer code for implementing aspects of the present invention can be implemented in any programming language that can be executed on a server or server system such as, for example, C, C++, HTML, Java, JavaScript, or any other scripting language, such as Perl.
- The term “computer-readable medium” refers to a non-transitory computer-readable medium.
- In Fig. 6, one embodiment of a computer system 600 is illustrated. Various embodiments of this system may be included in front end server 120, backend server 160, AI system 170, or any other computer system.
- System 600 includes at least one instance of an integrated circuit (processor) 610 coupled to an external memory 615.
- The external memory 615 may form a main memory subsystem in one embodiment.
- The integrated circuit 610 is coupled to one or more peripherals 620 and the external memory 615.
- A power supply 605 is also provided which supplies one or more supply voltages to the integrated circuit 610 as well as one or more supply voltages to the memory 615 and/or the peripherals 620.
- More than one instance of the integrated circuit 610 may be included (and more than one external memory 615 may be included as well).
- The memory 615 may be any type of memory, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR6, etc.) SDRAM (including mobile versions of the SDRAMs such as mDDR6, etc., and/or low power versions of the SDRAMs such as LPDDR2, etc.), RAMBUS DRAM (RDRAM), static RAM (SRAM), etc.
- One or more memory devices may be coupled onto a circuit board to form memory modules such as single inline memory modules (SIMMs), dual inline memory modules (DIMMs), etc.; alternatively, the devices may be mounted in other configurations.
- The peripherals 620 may include any desired circuitry, depending on the type of system 600.
- The system 600 may be a mobile device (e.g. personal digital assistant (PDA), smart phone, etc.) and the peripherals 620 may include devices for various types of wireless communication, such as Wi-Fi, Bluetooth, cellular, global positioning system, etc.
- Peripherals 620 may include one or more network access cards.
- The peripherals 620 may also include additional storage, including RAM storage, solid state storage, or disk storage.
- The peripherals 620 may include user interface devices such as a display screen, including touch display screens or multitouch display screens, keyboard or other input devices, microphones, speakers, etc.
- The system 600 may be any type of computing system.
- Peripherals 620 may thus include any networking or communication devices.
- System 600 may include multiple computers or computing nodes that are configured to communicate together (e.g. computing cluster, server pool, cloud computing system, etc.).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/451,170 US12001548B2 (en) | 2019-06-25 | 2019-06-25 | Threat detection using machine learning query analysis |
PCT/US2020/039561 WO2020264118A1 (en) | 2019-06-25 | 2020-06-25 | Threat detection using machine learning query analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3991076A1 (en) | 2022-05-04
EP3991076A4 (en) | 2023-07-12
Family
ID=74044095
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20830727.2A Pending EP3991076A4 (en) | 2019-06-25 | 2020-06-25 | Threat detection using machine learning query analysis |
Country Status (4)
Country | Link |
---|---|
US (1) | US12001548B2 (en) |
EP (1) | EP3991076A4 (en) |
AU (1) | AU2020303560B2 (en) |
WO (1) | WO2020264118A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3879421A1 (en) * | 2020-03-11 | 2021-09-15 | ABB Schweiz AG | Method and system for enhancing data privacy of an industrial system or electric power system |
US11531933B2 (en) * | 2020-03-23 | 2022-12-20 | Mcafee, Llc | Explainability of an unsupervised learning algorithm outcome |
US11356480B2 (en) * | 2020-08-26 | 2022-06-07 | KnowBe4, Inc. | Systems and methods of simulated phishing campaign contextualization |
US20220222570A1 (en) * | 2021-01-12 | 2022-07-14 | Optum Technology, Inc. | Column classification machine learning models |
US20230153404A1 (en) * | 2021-11-18 | 2023-05-18 | Imperva, Inc. | Determining the technical maturity of a system user to use as a risk indicator when auditing system activity |
WO2024180689A1 (en) * | 2023-02-28 | 2024-09-06 | 富士通株式会社 | Training program, inference program, training method, inference method, and information processing device |
CN116108072B (en) * | 2023-04-04 | 2023-09-19 | 阿里巴巴(中国)有限公司 | Data query method and query prediction model training method |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707129B2 (en) | 2006-03-20 | 2010-04-27 | Microsoft Corporation | Text classification by weighted proximal support vector machine based on positive and negative sample sizes and weights |
US20080086432A1 (en) | 2006-07-12 | 2008-04-10 | Schmidtler Mauritius A R | Data classification methods using machine learning techniques |
US7958067B2 (en) | 2006-07-12 | 2011-06-07 | Kofax, Inc. | Data classification methods using machine learning techniques |
US7937345B2 (en) | 2006-07-12 | 2011-05-03 | Kofax, Inc. | Data classification methods using machine learning techniques |
US20080215607A1 (en) | 2007-03-02 | 2008-09-04 | Umbria, Inc. | Tribe or group-based analysis of social media including generating intelligence from a tribe's weblogs or blogs |
US9672355B2 (en) | 2011-09-16 | 2017-06-06 | Veracode, Inc. | Automated behavioral and static analysis using an instrumented sandbox and machine learning classification for mobile security |
US10311096B2 (en) * | 2012-03-08 | 2019-06-04 | Google Llc | Online image analysis |
US9832211B2 (en) * | 2012-03-19 | 2017-11-28 | Qualcomm, Incorporated | Computing device to detect malware |
US9594907B2 (en) | 2013-03-14 | 2017-03-14 | Sas Institute Inc. | Unauthorized activity detection and classification |
US20150379092A1 (en) | 2014-06-26 | 2015-12-31 | Hapara Inc. | Recommending literacy activities in view of document revisions |
US20180089271A1 (en) | 2015-04-15 | 2018-03-29 | Hewlett Packard Enterprise Development Lp | Database query classification |
US9699205B2 (en) | 2015-08-31 | 2017-07-04 | Splunk Inc. | Network security system |
US10332116B2 (en) | 2015-10-06 | 2019-06-25 | Netflix, Inc. | Systems and methods for fraudulent account detection and management |
US9792534B2 (en) | 2016-01-13 | 2017-10-17 | Adobe Systems Incorporated | Semantic natural language vector space |
US11443224B2 (en) | 2016-08-10 | 2022-09-13 | Paypal, Inc. | Automated machine learning feature processing |
US11687769B2 (en) | 2017-06-30 | 2023-06-27 | Paypal, Inc. | Advanced techniques for machine learning using sample comparisons |
US10181032B1 (en) | 2017-07-17 | 2019-01-15 | Sift Science, Inc. | System and methods for digital account threat detection |
US9978067B1 (en) | 2017-07-17 | 2018-05-22 | Sift Science, Inc. | System and methods for dynamic digital threat mitigation |
US11055407B2 (en) | 2017-09-30 | 2021-07-06 | Oracle International Corporation | Distribution-based analysis of queries for anomaly detection with adaptive thresholding |
US10860654B2 (en) * | 2019-03-28 | 2020-12-08 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for generating an answer based on clustering and sentence similarity |
US11586659B2 (en) * | 2019-05-03 | 2023-02-21 | Servicenow, Inc. | Clustering and dynamic re-clustering of similar textual documents |
US11574209B2 (en) * | 2019-05-30 | 2023-02-07 | International Business Machines Corporation | Device for hyper-dimensional computing tasks |
EP3987420A4 (en) * | 2019-06-21 | 2023-10-25 | Cyemptive Technologies, Inc. | Method to prevent root level access attack and measurable sla security and compliance platform |
- 2019
  - 2019-06-25: US US16/451,170 patent/US12001548B2/en active Active
- 2020
  - 2020-06-25: WO PCT/US2020/039561 patent/WO2020264118A1/en unknown
  - 2020-06-25: AU AU2020303560A patent/AU2020303560B2/en active Active
  - 2020-06-25: EP EP20830727.2A patent/EP3991076A4/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP3991076A4 (en) | 2023-07-12 |
AU2020303560B2 (en) | 2023-01-19 |
AU2020303560A1 (en) | 2022-02-03 |
US20200410091A1 (en) | 2020-12-31 |
WO2020264118A1 (en) | 2020-12-30 |
US12001548B2 (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020303560B2 (en) | Threat detection using machine learning query analysis | |
US11900271B2 (en) | Self learning data loading optimization for a rule engine | |
US10726156B1 (en) | Method and system for protecting user information in an overlay management system | |
US11507631B2 (en) | Rapid online clustering | |
US10902429B2 (en) | Large dataset structuring techniques for real time fraud detection | |
US11681719B2 (en) | Efficient access of chainable records | |
US11403641B2 (en) | Transactional probability analysis on radial time representation | |
US11249965B2 (en) | Efficient random string processing | |
US11455364B2 (en) | Clustering web page addresses for website analysis | |
US20220027428A1 (en) | Security system for adaptive targeted multi-attribute based identification of online malicious electronic content | |
CA3163551A1 (en) | Real-time names entity based transaction approval | |
US11188647B2 (en) | Security via web browser tampering detection | |
US11687601B2 (en) | Dynamic user interface for navigating user account data | |
US11704747B1 (en) | Determining base limit values for contacts based on inter-network user interactions | |
US11777959B2 (en) | Digital security violation system | |
US20200394525A1 (en) | Country Identification Using Unsupervised Machine Learning on Names | |
US20240037279A1 (en) | Super-cookie identification for stolen cookie detection | |
US20240135364A1 (en) | Method for transferring data over a blockchain network for digital transactions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20211224 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20230609 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06N 3/084 20230101ALI20230602BHEP Ipc: G06N 3/082 20230101ALI20230602BHEP Ipc: G06F 21/55 20130101ALI20230602BHEP Ipc: G06N 99/00 20190101ALI20230602BHEP Ipc: G06F 21/56 20130101AFI20230602BHEP |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230718 |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06N 3/084 20230101ALI20240730BHEP Ipc: G06N 3/082 20230101ALI20240730BHEP Ipc: G06F 21/55 20130101ALI20240730BHEP Ipc: G06N 99/00 20190101ALI20240730BHEP Ipc: G06F 21/56 20130101AFI20240730BHEP |
|
INTG | Intention to grant announced |
Effective date: 20240809 |