CN110945538A - Automatic rule recommendation engine - Google Patents

Automatic rule recommendation engine

Info

Publication number
CN110945538A
CN110945538A
Authority
CN
China
Prior art keywords
rules
rule
candidate
labeled
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880029314.0A
Other languages
Chinese (zh)
Other versions
CN110945538B (en)
Inventor
Catherine Lu
Patrick Glen Murray
Ming Qi
Shuo Shan
Yinglian Xie
Fang Yu
Yuhao Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vistor Technology
Original Assignee
Vistor Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vistor Technology
Publication of CN110945538A
Application granted
Publication of CN110945538B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/604 Tools and structures for managing or administering access control systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for rule generation and interaction. A rules engine is provided that does not require manual work to generate or maintain high-quality rules for detecting malicious accounts or events. There is no longer a need to manually add, adjust, or delete rules from the system. The system can determine the health of each rule and automatically add, adjust, and delete rules to maintain a consistent, valid rule set.

Description

Automatic rule recommendation engine
Technical Field
This document relates to a rules engine.
Background
Rules engines are common in enterprise settings and have a wide range of functionality, including detecting specific types of entity behavior. An entity may be a user, a user account, a service, and so on. Entity behavior may include fraud, money laundering, or other forms of entity behavior.
Conventional rules engines may be used for a variety of applications including, but not limited to, fraud prevention, anti-money laundering work, and enforcement of business policies.
Typically, rules engines are largely manual: rules must be manually added, adjusted, and deleted by human analysts, who generate the rules based on domain expertise.
Disclosure of Invention
Techniques are described herein for generating and maintaining a rules engine that eliminate the need for manual intervention when adding new rules, adjusting existing rules, and deleting rules that are no longer relevant. The system may receive a labeled data set and output many (e.g., up to thousands of) generalized rules that model the labeled data set, allowing the rules to be generated automatically. Rules are then maintained, deleted, and created based on changing data fed into the system.
Further described herein is a user interface that provides graphics and metrics for displaying the overall health of all rules in the system, as well as the health of each individual rule. The graphs include the effectiveness over time and the number of rules deployed in production. The per-rule health metrics include accuracy, coverage, and false positive rate.
The use of the system to support the addition of manual rules to the rules engine is also described herein. In particular, the system supports backtesting of manually generated rules against one or more labels, which may be created in a variety of ways, including but not limited to unsupervised machine learning, supervised machine learning, and manual inspection. A backtest of the manually created rules may then be run against a historical labeled data set, for example, in response to user input.
In general, one innovative aspect of the subject matter herein can be embodied in a method that includes the following operations: obtaining input data points associated with a plurality of users; determining whether the input data points are labeled or unlabeled; in response to determining that the data points are labeled, determining a set of features from the input data points using a supervised machine learning technique; generating a set of candidate univariate rules using the determined feature set, wherein each rule specifies a matching condition based on the corresponding feature dimension; generating a set of candidate multivariate rules from the univariate rules; filtering the candidate univariate rules and the candidate multivariate rules using the labeled input data points to generate a final valid rule set; and outputting the final valid rule set.
Various aspects of the subject matter described herein may be embodied in methods, computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of the methods. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination thereof that in operation causes the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
The above-described embodiments and other embodiments can each optionally include one or more of the following features, alone or in combination. In particular, one embodiment includes all of the following features in combination. In response to determining that the data points are not labeled: generating labels using unsupervised machine learning; generating clusters of positively labeled data points; and determining a set of features for each cluster. The rules are periodically updated based on the most recent data points. Each data point corresponds to a user-generated event and contains a set of attributes that describe the event. Filtering the candidate univariate rules and the candidate multivariate rules comprises evaluating the candidate rules against the labeled data points based on accuracy and validity metrics. The method can further include maintaining metrics on each rule in the final valid rule set, the metrics including one or more of rule validity, false positive rate, and recency. Rules that do not meet a metric threshold are removed. The method can further include providing a user interface configured to selectively present the rules and data regarding the validity of each rule. The user interface may also be configured to receive manually generated rules from the user, wherein the manually generated rules are backtested against historical data to verify them.
The subject matter described herein may be implemented in particular embodiments to realize one or more of the following advantages over manually created rules.
First, automatic rules may be updated frequently to ensure that they remain valid over time. Manually created rules may quickly become stale as attackers modify their strategies. For example, a manually formulated rule targeting a particular domain (e.g., an address ending with the particular domain, such as a domain associated with an email provider) will become invalid when an attacker switches to another domain. Second, automatic rules are less likely to trigger false positives in detection, as each rule is subjected to rigorous system testing. For example, the example manual rule above may falsely flag legitimate users of the email provider at that domain. In contrast, an automatic rule may be defined with one or more sub-rules that limit false positives. For example, in addition to the condition targeting the current domain, a sub-rule may require other criteria to be satisfied (e.g., a specified transaction range) that limit false positives. Finally, generating and adjusting manually formulated rules is time-consuming, while automatically generated rules can eliminate such manual tuning work entirely.
The details of one or more embodiments of the subject matter herein are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the present subject matter will become apparent from the following description, the accompanying drawings, and the claims.
Drawings
FIG. 1 is a flow diagram of an example process for automatically generating rules.
FIG. 2 is an example user interface showing an overview of an automatic rules engine for detecting fraudulent users.
FIG. 3 is an example user interface showing an automatically generated rule set in the system.
FIG. 4 is an example user interface showing details of a particular rule.
FIG. 5 is an example user interface showing a manual rule set and a backtest result.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Generating automatic rules
FIG. 1 is a flow diagram of an example process 100 for automatically generating rules. The rules engine may be used to generate rules based on the input data. The rules engine may be part of a system, for example, a system for detecting malicious accounts or events across one or more networks. For convenience, the process 100 will be described with respect to such a system performing the process 100. The system takes the input data points 102 and outputs a rule set 118. The input data points may or may not carry predefined labels. If the input data points do not have predefined labels, the rules engine may first generate labels using an unsupervised machine learning algorithm, and then further group the labeled data into clusters. In some embodiments, unsupervised machine learning may be performed using the systems described in one or more of the following co-pending U.S. patent applications: 14/620,028, 14/620,048, 14/620,062, and 14/620,029, each filed on February 11, 2015, the contents of which are incorporated herein by reference.
Referring to FIG. 1, the system determines 104 whether the obtained input data is labeled. If the obtained input data points are labeled, the system identifies features 106, for example, using supervised machine learning techniques. In some embodiments, the dominant feature set may be determined based on a particular scoring metric.
The system may use supervised machine learning techniques as a guide to select the relevant features that best distinguish positively labeled data from negatively labeled data. For example, a machine learning algorithm such as a decision tree or random forest provides information about the dominant features that best separate the positively labeled input data from the rest. Such algorithms can guide the selection of the dominant features.
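The following is a minimal sketch of this feature-selection step, assuming pandas and scikit-learn with one-hot-encoded attributes; the patent does not prescribe a specific library, encoding, or scoring metric.

```python
# A sketch of dominant-feature selection with a random forest (an assumption;
# the text names decision trees and random forests only as examples).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_dominant_features(df: pd.DataFrame, labels: pd.Series, top_k: int = 5):
    """Rank feature columns by how well they separate positive from negative labels."""
    X = pd.get_dummies(df)            # one-hot encode categorical attributes
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)                # labels: 1 = positive (e.g., fraudulent), 0 = negative
    ranked = sorted(zip(X.columns, clf.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]
```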
The dominant features are then used to generate candidate rules 114. For example, the system may use the features in various combinations to generate candidate rules. The system then verifies 116 the candidate rules. For example, the system may evaluate the results of a rule on the labeled data according to validity and accuracy metrics (e.g., whether the rule captures the targeted labeled data without false positives). The system determines 118 a final rule set based on the verification results.
Rules Engine input
The input data to the rules engine may include a list of input rows, where each row contains a data point described by a list of feature attributes. The data points generally describe the entity of interest for detection. For example, a data point may correspond to a user-generated event, such as a login event, with a list of attributes describing the particular event, including event time, user ID, event type, associated device type, user agent, IP address, and the like. A data point may also correspond to a user account with a list of attributes describing the user, including user demographic information such as age and location, user profile information (e.g., email address, nickname), or user behavior patterns such as historical events and their attributes.
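For concreteness, a single input row for a login event might look like the following; the field names and values are hypothetical, not mandated by this document.

```python
# One hypothetical input row: a user-generated login event described by a
# flat set of feature attributes, optionally carrying a label.
login_event = {
    "event_time": "2018-04-03T12:31:07Z",
    "user_id": "u-10482",
    "event_type": "login",
    "device_type": "mobile",
    "user_agent": "Mozilla/5.0 (Linux; Android 8.0)",
    "ip_address": "203.0.113.42",
    "email_domain": "example-mail.test",
    "label": 1,   # 1 = positive (e.g., fraudulent), 0 = negative; absent if unlabeled
}
```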
The purpose of rule generation is to automatically derive and maintain a set of rules that maximally match the positively labeled data points while matching as few negatively labeled data points as possible.
In some cases, each data point in the rules engine input has a positive or negative label associated with it (the "yes" branch in FIG. 1). A positively labeled data point may indicate that the corresponding entity was detected as a fraudulent or abusive entity. More generally, labeled data points may also cover other use cases, where the meaning of a label is "a marketing target candidate", "a recommended item", or "a user targeted for promotion".
In some embodiments, the input data points are unlabeled (the "no" branch of FIG. 1). The system generates labels for the obtained input data. In the example process 100 of FIG. 1, the system generates labels 108 from the obtained input data using unsupervised machine learning. For example, unsupervised machine learning may generate a label for each input row in the input data. The system may further cluster 110 the labeled input data points by their feature attributes and tag them with cluster identifiers. Data points sharing the same cluster ID or group ID are more similar to one another, based on a distance measure (e.g., Euclidean distance or cosine distance) in the corresponding feature space.
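As a minimal sketch of the clustering step, the following assumes scikit-learn k-means over numeric feature vectors; the patent leaves the clustering algorithm and distance measure open.

```python
# A sketch of assigning cluster IDs to labeled data points (an assumption:
# k-means with Euclidean distance; cosine distance would require a different
# clusterer or normalized feature vectors).
import numpy as np
from sklearn.cluster import KMeans

def assign_cluster_ids(features: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Tag each data point with a cluster identifier; similar points share an ID."""
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    return km.fit_predict(features)
```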
Generating rules
The system may automatically generate rules for each cluster or group using the following steps. First, the rules engine may rank all features by their coverage of the positively labeled data within the cluster and select the dominant features 112. The dominant features may be used to generate univariate (i.e., single-feature) rules during rule generation 114. Each univariate rule specifies a matching condition based on the corresponding feature dimension (e.g., the IP address equals a certain value, or the user agent string contains a particular substring). The selection of the dominant features may be guided by a threshold on the matching range. Second, the rules engine may create multivariate candidate rules in rule generation 114 by combining the univariate rules using logical expressions (e.g., "AND" and "OR" conditions). The goal is to generate finer-grained rules that match as many positively labeled data points as possible while reducing false positives. The rules engine may match all candidate multivariate rules against the negatively labeled input data during rule verification 116 and filter out those rules with high coverage of negatively labeled data based on a preset threshold (e.g., a 1% false positive rate). Finally, the rules engine may collect and output all valid rules for the cluster as part of generating the output rule set 118. After generating rules for each cluster, the rules engine may merge all rules originating from the clusters and remove duplicate rules to produce the final output rules 118.
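The following sketch illustrates the univariate-to-multivariate generation and the false-positive filtering described above; the rule representation, the equality-only conditions, and the exact thresholds are simplifying assumptions.

```python
# A simplified sketch of candidate rule generation and verification for one
# cluster. A rule is a set of (feature, value) equality conditions; real rules
# may also use substring or range conditions.
from itertools import combinations

def univariate_rules(dominant_features, positives):
    """One candidate rule per distinct (feature, observed value) pair."""
    seen = {(feat, p[feat]) for feat in dominant_features
            for p in positives if feat in p}
    return [{cond} for cond in seen]

def multivariate_rules(uni_rules):
    """AND-combine pairs of univariate rules into finer-grained candidates."""
    return [a | b for a, b in combinations(uni_rules, 2)]

def matches(rule, point):
    return all(point.get(feat) == value for feat, value in rule)

def filter_rules(candidates, positives, negatives, max_fp_rate=0.01):
    """Keep rules that cover positives while staying under the false-positive cap."""
    valid = []
    for rule in candidates:
        fp_rate = sum(matches(rule, n) for n in negatives) / max(len(negatives), 1)
        if fp_rate <= max_fp_rate and any(matches(rule, p) for p in positives):
            valid.append(rule)
    return valid
```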
In some alternative embodiments, the rules engine may use input data whose data points have been pre-labeled but not clustered or grouped. In this case, the rules engine may treat all labeled data points as belonging to the same group or cluster and generate rules using the methods described above.
Rule update and expiration
Rule generation may be performed periodically, for example once per day. The rules generated over time may be further merged to remove duplicates, ensuring low complexity when matching the rules against new data sets.
Rules derived by the rules engine are also automatically updated over time to ensure that they remain effective against changing input data patterns (e.g., evolving technology or adversary attack patterns), to limit the number of rules, and to ensure low runtime complexity of rule matching. For rules that consistently no longer match the positively labeled data well (e.g., based on a coverage threshold) over a certain period of time (e.g., 3 days), the rules engine may simply delete those rules to ensure that only valid rules exist in the system.
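A minimal sketch of this expiration policy follows, assuming each rule keeps a short history of its daily coverage of the positively labeled data; the 3-day window mirrors the example figure above, while the 5% coverage threshold is an illustrative assumption.

```python
# A sketch of rule expiration: delete rules whose coverage of positively
# labeled data stayed below a threshold for several consecutive days.
def prune_stale_rules(rule_ids, coverage_history, min_coverage=0.05, window_days=3):
    """coverage_history maps rule_id -> list of daily coverage fractions."""
    kept = []
    for rule_id in rule_ids:
        recent = coverage_history.get(rule_id, [])[-window_days:]
        if len(recent) == window_days and all(c < min_coverage for c in recent):
            continue                      # consistently stale: drop the rule
        kept.append(rule_id)
    return kept
```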
In some implementations, additional input data over a subsequent period of time can be used to generate rules, such that updated rules can be generated, which allows rules that are no longer valid to be removed. In some other embodiments, the updated input data may be used to re-validate existing rules, so that rules that are no longer valid can be quickly identified.
Rules Engine interface
Overview
FIG. 2 is an example user interface 200 showing an overview of an automated rules engine for detecting fraudulent users. At the top of the user interface 200, a summary tab 202 is selected that displays, at a high level, a daily overview of all automatic and manual rules currently deployed in the system. The first section 204 displays the overall effectiveness of all rules as the number of detections made by the rules deployed in the system, the false positive rate, and the total number of rules. Under the high-level overview, the second section 206 displays a chart illustrating the effectiveness of the rules over time, which is a stacked chart with three series: a detection rate 208 for users detected by both the rules and the labeled data set, a detection rate 210 for users detected only by the labeled data, and a detection rate 212 for users detected only by the rules. Below the second section 206, a third section 214 shows the number of rules over time, including the number of newly created rules 216 for a particular date and the total number of deployed rules 218 over time.
The detection rate 208 by both the rules and the labeled data set identifies the number of users detected by both the labeled data set and the created rules. The detection rate 210 of only the labeled data reaches users that are not covered by the rules but are detected by the unsupervised detection algorithm, and can illustrate regions where the rule set needs further refinement. The detection rate 212 of only the rules indicates users detected by the rules alone. It is typically small and may indicate correct detections that the labeled data set missed. However, in some cases it may indicate false positives, so this portion may need further investigation to determine whether the detections are erroneous.
The rule engine outputs: automatically generated rules
FIG. 3 is a user interface 300 showing an automatically generated rule set in the system. In particular, in FIG. 3, the auto tab 302 is selected. The user interface 300 shows a plurality of rows of automatically generated rules 304.
There are also two toggle buttons: a hierarchy toggle button 306 and a definition toggle button 308. When the hierarchy toggle button 306 is activated, rules are organized into general rules and sub-rules. For example, if rule 1 contains the logical expression A & B, and another rule 2 contains the logical expression A only, then rule 1 is a sub-rule of rule 2. With the hierarchy enabled, only rule 2 is displayed. However, a visual boundary indicator 310 indicates the presence of sub-rules, and user interaction with the boundary displays the sub-rules within it. Rules may be nested in multiple layers.
A first class of rules with sub-rules is the "in list" rule. For example, the rule "email_domain in list [gmail.com, outlook.com]" has two sub-rules: "email_domain = gmail.com" and "email_domain = outlook.com". The second class of rules with sub-rules does not involve strict list containment; rather, the sub-rules add more detail. For example, the rule "email_domain = gmail.com" may have a sub-rule "email_domain = gmail.com AND registration_state = California".
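The containment relationship behind this hierarchy can be stated compactly; the following sketch treats a rule as a set of condition strings, which is an illustrative representation only.

```python
# A sketch of the sub-rule relation: a rule whose conditions strictly contain
# another rule's conditions is a sub-rule (a more specific version) of it.
def is_sub_rule(rule_a: set, rule_b: set) -> bool:
    """True if rule_a is a sub-rule of rule_b."""
    return rule_b < rule_a    # rule_a has every condition of rule_b, plus more

parent = {"email_domain = gmail.com"}
child = {"email_domain = gmail.com", "registration_state = California"}
assert is_sub_rule(child, parent)
```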
Turning on the definition toggle button 308 displays the definition of each rule, so the actual expression of each rule can be viewed.
For each rule, high-level metrics are displayed: the rule ID, an effectiveness measure, and a false positive rate measure. Each rule also has actions. For example, the user may pause or launch a rule to remove it from, or add it to, the live detection logic. The user may also view details of a particular rule by selecting it (see "Rule details" below).
In addition to the above, there is the ability to change the date 312, which presents the rules as they existed in the system at the given point in time. The user interface 300 also includes a page selector 314 that allows the user to quickly jump to other pages of rules. This also prevents too many rules from being loaded for the user at once.
Rule details
FIG. 4 is an example user interface 400 showing details of particular rules under the auto tab 302. For example, the detailed information may be presented in response to a user selection of a detailed information button for a particular rule in the user interface 300.
The user interface 400 presents the definition of the rule and its ID 402. The user interface 400 also includes information such as the creation time 404 of the rule. The user interface 400 further includes high-level statistical information about the health of the rule: the effectiveness 406 and the false positive rate 408, obtained, for example, from the detection information and labeled data, and an at-a-glance icon 410 that shows the health status of the rule. Below that, a graph 412 shows the effectiveness of the rule over time as compared to the false positive rate. Finally, the user interface 400 includes the latest accuracy 414 of the rule, which is derived from backtesting against historical results (e.g., against input labels in the input data or labels derived from unsupervised learning algorithms). Rule information, including rule definitions and metadata (such as creation date), may be obtained from a rule repository storing the rule sets.
Manual rule entry and retest
The automatic rules engine also allows manually generated rules to be added and automatically backtested. FIG. 5 is an example user interface 500 illustrating a manually edited rule set and backtest results. In particular, in FIG. 5, the manual tab 502 is highlighted as selected. The user interface 500 allows a user to create rules 504 in the system to enhance the detection of the current system. Manual rules may also be used to whitelist users rather than detect them. The manual rule creation process allows complex Boolean expressions using AND and OR statements and various operands. These operands include standard numeric operands (e.g., equal to, less than, greater than, less than or equal to, greater than or equal to) and string operands (e.g., equals, starts with, ends with, regular-expression match, substring match).
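A minimal sketch of evaluating such a manual rule follows; the nested-tuple rule format and operator names are illustrative assumptions, not the system's actual syntax.

```python
# A sketch of manual-rule evaluation with the operand families named above
# (numeric comparisons, string predicates, AND/OR composition).
import re

OPS = {
    "==": lambda a, b: a == b,
    "<":  lambda a, b: a < b,
    ">":  lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
    "starts_with": lambda a, b: str(a).startswith(b),
    "ends_with":   lambda a, b: str(a).endswith(b),
    "contains":    lambda a, b: b in str(a),
    "regex":       lambda a, b: re.search(b, str(a)) is not None,
}

def eval_rule(rule, event) -> bool:
    """Evaluate a nested (AND/OR/condition) rule tree against one event."""
    kind = rule[0]
    if kind == "AND":
        return all(eval_rule(r, event) for r in rule[1:])
    if kind == "OR":
        return any(eval_rule(r, event) for r in rule[1:])
    _, field, op, value = rule      # ("cond", field, op, value)
    return OPS[op](event.get(field), value)

# Example: flag logins from a domain only when the transaction amount is large.
rule = ("AND", ("cond", "email_domain", "==", "example-mail.test"),
               ("cond", "amount", ">=", 1000))
```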
Once a rule is created, it can be backtested in the system against input labels in the input data or labels derived from unsupervised learning algorithms. The backtest 506 highlights the effectiveness and potential false positives of the rule, and from there the user can inspect the actual users matched by the rule. These users are subdivided into those the rule likely labeled correctly and those it likely labeled incorrectly.
Backtesting is done automatically. It provides a measure of the effectiveness and false positive rate of each rule by testing the rule against one or more days of labeled data.
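As a sketch, a backtest over labeled historical events could compute the two reported metrics as follows (reusing eval_rule from the previous sketch); reading "effectiveness" as coverage of positive labels, and the false positive rate as the share of detections that are negatively labeled, are both assumptions about the metric definitions.

```python
# A sketch of automatic backtesting: run one rule over one or more days of
# labeled events and report effectiveness and false positive rate.
def backtest(rule, labeled_events):
    positives = [e for e in labeled_events if e.get("label") == 1]
    negatives = [e for e in labeled_events if e.get("label") == 0]
    hits_pos = sum(eval_rule(rule, e) for e in positives)
    hits_neg = sum(eval_rule(rule, e) for e in negatives)
    effectiveness = hits_pos / max(len(positives), 1)
    false_positive_rate = hits_neg / max(hits_pos + hits_neg, 1)
    return effectiveness, false_positive_rate
```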
Searching
The user can search for a particular rule in the system from any page. Further, the user can view the particular rules that detected a given user from that user's details page.
In this document, the term "engine" will be used broadly to refer to a software-based system or subsystem that may perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine. In other cases, multiple engines may be installed and run on the same computer.
Embodiments of the subject matter and the functional operations described herein can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed herein and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described herein can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and includes all types of devices, apparatuses and machines for processing data, including by way of example a programmable processor, a computer or multiple processors or computers. The apparatus can also be or further comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for the computer program, e.g., in the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of the foregoing.
A computer program, which can also be referred to as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
The processes and logic flows described herein can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and in combination with, special purpose logic circuitry, e.g., an FPGA or an ASIC.
A computer suitable for the execution of a computer program may be based on a general purpose microprocessor, a special purpose microprocessor, both general and special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Further, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, etc.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device, such as a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user. For example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example, a web page is sent to a web browser on a user device in response to a request received from the web browser.
Embodiments of the subject matter described herein can be implemented in a computing system that includes a back-end component, e.g., as a data server; or that includes a middleware component, e.g., an application server; or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described herein; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data, e.g., an HTML page, to a user device acting as a client, for the purposes of displaying data to, and receiving user input from, a user interacting with the device. Data generated at the user device, e.g., results of the user interaction, can be received at the server from the device.
In addition to the embodiments of the appended claims and the embodiments described above, the following embodiments are also inventive:
embodiment 1 is a system configured to provide a user interface that displays an overview of the effectiveness of all rules deployed in the system.
Embodiment 2 is the system of embodiment 1, wherein the user interface provides, for presentation, high-level metrics for the plurality of rules within the system, including overall effectiveness, overall false positive rate, and total number of currently deployed rules.
Embodiment 3 is the system of any of embodiments 1-2, wherein the user interface provides a chart showing effectiveness of the rules over time, depicting rule coverage, missing coverage, and false positives.
Embodiment 4 is the system of any of embodiments 1-3, wherein the user interface provides a chart showing a plurality of rules created on a specified date and a plurality of rules deployed.
Embodiment 5 is the system of any of embodiments 1-4, wherein the user interface provides a drop-down menu with the ability to toggle between panels of all rules, automatic-only rules, and manual-only rules.
Embodiment 6 is the system of any of embodiments 1-5, wherein the user interface provides an input box with the ability to change a date on which to view a snapshot of a panel showing rules.
Embodiment 7 is a system configured to provide a user interface displaying a list of automatically generated rules, the list allowing easy management, comprising: automatic ordering of rules, organization of rules into general rules and sub-rules, and a search function.
Embodiment 8 is the system of embodiment 7, wherein the user interface automatically orders the rules based on their validity. For example, the rules that capture the most entities can be listed at the top.
Embodiment 9 is the system of any of embodiments 7-8, wherein the user interface organizes the rules into groups for viewing. The user interface includes one or more selectable elements for toggling on/off the display of sub-rules within a more general rule. There may be sub-rules within a sub-rule.
Embodiment 10 is the system of any one of embodiments 7 to 9, wherein the user interface includes one or more selectable elements for toggling on/off of the rule definition, thereby making the interface more compact or detailed depending on the user's preferences.
Embodiment 11 is the system of any of embodiments 7-10, wherein the user interface comprises a tabular view of automatic rules with per-column metrics showing one or more metrics selected from the set comprising: total detection rate, percentage detection rate relative to the total entity set to be detected, total false positive rate, or percentage false positive rate.
Embodiment 12 is the system of any of embodiments 7-11, wherein the user interface comprises a selectable view details button that, when selected, will show details of a particular rule in a separate display.
Embodiment 13 is the system of any of embodiments 7-12, wherein the user interface includes an input box, wherein a date to view the snapshot of the automatic rules is specified in response to user interaction with the input box.
Embodiment 14 is a system configured to provide a user interface that allows a user to control rules, including copying, pausing, and deploying rules.
Embodiment 15 is the system of embodiment 14, wherein the user interface includes a selectable copy button associated with each automatic rule, wherein in response to selection of the copy button, the system presents a rule editor interface pre-populated with the associated automatic rule.
Embodiment 16 is the system of any of embodiments 14-15, wherein the user interface includes a selectable pause button associated with each automatic rule, wherein selection of the pause button causes the associated rule to be paused so that it no longer actively detects users.
Embodiment 17 is the system of any of embodiments 14-16, wherein the user interface includes a selectable deploy button associated with each automatic rule, wherein selection of the deploy button transitions the associated automatic rule from paused to actively detecting users.
Embodiment 18 is a system configured to provide a user interface providing details of a particular one of a plurality of automatic rules.
Embodiment 19 is the system of embodiment 18, wherein the user interface includes a plurality of metrics in a portion of the user interface, the plurality of metrics including one or more of: the validity of the particular automatic rule based on the number of detected entities, the false positive rate, and the time at which the rule was first created.
Embodiment 20 is the system of any of embodiments 18-19, wherein the user interface comprises a chart showing effectiveness of a particular rule over time, depicting rule coverage and false positives.
Embodiment 21 is the system of any of embodiments 18-20, wherein the user interface comprises: a presented metric showing the accuracy of a particular rule; and a link to a page showing detailed information about accurately detected entities versus falsely detected entities.
Embodiment 22 is the system of any one of embodiments 18 to 21, wherein the user interface includes a presented metric showing redundancy for a particular rule, the redundancy indicating the number of entities detected only by the particular rule relative to the number of entities also captured by at least one other rule.
Embodiment 23 is a system configured to provide a user interface that supports the creation, modification, deletion, and backtesting (testing against historical data) of manual rules.
Embodiment 24 is the system of embodiment 23, wherein the rules may be edited by user interaction with a user interface.
Embodiment 25 is the system of any of embodiments 23-24, wherein a rule may be deleted through user interaction with a user interface.
Embodiment 26 is the system of any of embodiments 23-25, wherein the rules are selectively backtested against historical data, wherein the historical data time range is specified by the user in the user interface.
Embodiment 27 is the system of any of embodiments 23 to 26, wherein for rule creation, the user interface supports complex Boolean logic (AND and OR), basic numeric operands (e.g., equal to, less than, greater than, less than or equal to, greater than or equal to), and string operands (e.g., equals, starts with, ends with, regular-expression match, substring match) to manually generate rules.
Embodiment 28 is the system of any of embodiments 23-27, wherein the system returns links to entities that are correctly detected and to entities that are detected but may be false positives.
While this document contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described herein in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the foregoing implementations should not be understood as requiring such separation in all instances, and it should be understood that the described program components and systems can generally be integrated in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the operations recited in the claims can be performed in a different order and still achieve desirable results. For example, the processes illustrated in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (19)

1. A method for generating rules for identifying malicious accounts or events, the method comprising:
obtaining input data points associated with a plurality of users;
determining whether the input data point is labeled or unlabeled;
in response to determining that the data point is labeled, determining a set of features from the input data point using a supervised machine learning technique;
generating a set of candidate univariate rules using the determined feature set, wherein each rule specifies a matching condition based on the corresponding feature dimension;
generating a set of candidate multivariate rules from the univariate rules;
filtering the candidate univariate rules and the candidate multivariate rules using the labeled input data points to generate a final valid rule set; and
outputting the final valid rule set.
2. The method of claim 1, wherein, in response to determining that the data point is not labeled:
generating labels using unsupervised machine learning;
generating clusters of positively labeled data points; and
determining a feature set for each cluster.
3. The method of claim 1, wherein the rule is periodically updated based on recent data points.
4. The method of claim 1, wherein each data point corresponds to a user-generated event and includes a set of attributes describing the event.
5. The method of claim 1, wherein filtering the candidate univariate rules and the candidate multivariate rules comprises: evaluating the candidate rules against the labeled data points based on accuracy and validity metrics.
6. The method of claim 1, further comprising:
maintaining metrics on each rule in the final valid rule set, the metrics including one or more of: rule validity, false positive rate, and recency.
7. The method of claim 6, wherein rules that do not meet a metric threshold are deleted.
8. The method of claim 1, further comprising:
providing a user interface configured to selectively present the rules and data regarding the validity of each rule.
9. The method of claim 8, wherein the user interface is further configured to receive a manually generated rule from a user, wherein the manually generated rule is backtested against historical data to verify the manually generated rule.
10. A system, comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
obtaining input data points associated with a plurality of users;
determining whether the input data point is labeled or unlabeled;
in response to determining that the data point is labeled, determining a set of features from the input data point using a supervised machine learning technique;
generating a set of candidate univariate rules using the determined feature set, wherein each rule specifies a matching condition based on the corresponding feature dimension;
generating a set of candidate multivariate rules from the univariate rules;
filtering the candidate univariate rules and the candidate multivariate rules using the labeled input data points to generate a final valid rule set; and
outputting the final valid rule set.
11. The system of claim 10, wherein, in response to determining that the data point is not labeled:
generating labels using unsupervised machine learning;
generating clusters of positively labeled data points; and
determining a feature set for each cluster.
12. The system of claim 10, wherein the rule is periodically updated based on recent data points.
13. The system of claim 10, wherein each data point corresponds to a user-generated event and includes a set of attributes describing the event.
14. The system of claim 10, wherein filtering the candidate univariate rules and the candidate multivariate rules comprises: evaluating the candidate rules against the labeled data points based on accuracy and validity metrics.
15. The system of claim 10, further operable to cause the one or more computers to:
maintaining metrics on each rule in the final valid rule set, the metrics including one or more of: rule validity, false positive rate, and recency.
16. The system of claim 15, wherein rules that do not meet a metric threshold are deleted.
17. The system of claim 10, further operable to cause the one or more computers to:
providing a user interface configured to selectively present the rules and data regarding the validity of each rule.
18. The system of claim 17, wherein the user interface is further configured to receive a manually generated rule from a user, wherein the manually generated rule is backtested against historical data to verify the manually generated rule.
19. One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
obtaining input data points associated with a plurality of users;
determining whether the input data point is labeled or unlabeled;
in response to determining that the data point is labeled, determining a set of features from the input data point using a supervised machine learning technique;
generating a set of candidate univariate rules using the determined feature set, wherein each rule specifies a matching condition based on the corresponding feature dimension;
generating a set of candidate multivariate rules from the univariate rules;
filtering the candidate univariate rules and the candidate multivariate rules using the labeled input data points to generate a final valid rule set; and
outputting the final valid rule set.
CN201880029314.0A 2017-04-03 2018-04-03 Automatic rule recommendation engine Active CN110945538B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762481063P 2017-04-03 2017-04-03
US62/481,063 2017-04-03
PCT/US2018/025938 WO2018187361A1 (en) 2017-04-03 2018-04-03 Automated rule recommendation engine

Publications (2)

Publication Number Publication Date
CN110945538A (en) 2020-03-31
CN110945538B (en) 2023-08-11

Family

ID=63670794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880029314.0A Active CN110945538B (en) 2017-04-03 2018-04-03 Automatic rule recommendation engine

Country Status (3)

Country Link
US (1) US11232364B2 (en)
CN (1) CN110945538B (en)
WO (1) WO2018187361A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347235A (en) * 2020-11-05 2021-02-09 北京羽扇智信息科技有限公司 Rule base generation method and device
CN113641708A (en) * 2021-08-11 2021-11-12 华院计算技术(上海)股份有限公司 Rule engine optimization method, data matching method and device, storage medium and terminal

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110945538B (en) * 2017-04-03 2023-08-11 Vistor Technology Automatic rule recommendation engine
EP3474095A1 (en) * 2017-10-23 2019-04-24 Mastercard International Incorporated System and method for specifying rules for operational systems
US11429725B1 (en) * 2018-04-26 2022-08-30 Citicorp Credit Services, Inc. (Usa) Automated security risk assessment systems and methods
US11297079B2 (en) * 2019-05-06 2022-04-05 Cisco Technology, Inc. Continuous validation of active labeling for device type classification
US11657415B2 (en) 2021-05-10 2023-05-23 Microsoft Technology Licensing, Llc Net promoter score uplift for specific verbatim topic derived from user feedback
US11914630B2 (en) * 2021-09-30 2024-02-27 Paypal, Inc. Classifier determination through label function creation and unsupervised learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070289013A1 (en) * 2006-06-08 2007-12-13 Keng Leng Albert Lim Method and system for anomaly detection using a collective set of unsupervised machine-learning algorithms
CA2779325A1 (en) * 2011-06-06 2012-12-06 Radicalogic Technologies, Inc. Health care incident prediction
US20150261955A1 (en) * 2014-03-17 2015-09-17 Proofpoint, Inc. Behavior profiling for malware detection
US20150339477A1 (en) * 2014-05-21 2015-11-26 Microsoft Corporation Risk assessment modeling
US20170011029A1 (en) * 2013-05-09 2017-01-12 Moodwire, Inc. Hybrid human machine learning system and method

Family Cites Families (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5712793A (en) 1995-11-20 1998-01-27 Lsi Logic Corporation Physical design automation system and process for designing integrated circuit chips using fuzzy cell clusterization
US7117188B2 (en) * 1998-05-01 2006-10-03 Health Discovery Corporation Methods of identifying patterns in biological systems and uses thereof
US7373666B2 (en) 2002-07-01 2008-05-13 Microsoft Corporation Distributed threat management
US6742128B1 (en) 2002-08-28 2004-05-25 Networks Associates Technology Threat assessment orchestrator system and method
US20040133536A1 (en) 2002-12-23 2004-07-08 International Business Machines Corporation Method and structure for template-based data retrieval for hypergraph entity-relation information structures
US10999298B2 (en) 2004-03-02 2021-05-04 The 41St Parameter, Inc. Method and system for identifying users and detecting fraud by use of the internet
US7668957B2 (en) 2004-06-30 2010-02-23 Microsoft Corporation Partitioning social networks
US7885915B2 (en) * 2005-06-20 2011-02-08 Future Route Limited Analytical system for discovery and generation of rules to predict and detect anomalies in data and financial fraud
US8312545B2 (en) 2006-04-06 2012-11-13 Juniper Networks, Inc. Non-signature malware detection system and method for mobile platforms
WO2007149850A2 (en) 2006-06-22 2007-12-27 Koninklijke Philips Electronics, N.V. Hierarchical deterministic pairwise key predistribution scheme
US8849921B2 (en) 2007-06-28 2014-09-30 Symantec Corporation Method and apparatus for creating predictive filters for messages
US20090228296A1 (en) 2008-03-04 2009-09-10 Collarity, Inc. Optimization of social distribution networks
US8789171B2 (en) 2008-03-26 2014-07-22 Microsoft Corporation Mining user behavior data for IP address space intelligence
US8312540B1 (en) 2008-06-13 2012-11-13 Juniper Networks, Inc. System for slowing password attacks
US8069210B2 (en) 2008-10-10 2011-11-29 Microsoft Corporation Graph based bot-user detection
US8850571B2 (en) 2008-11-03 2014-09-30 Fireeye, Inc. Systems and methods for detecting malicious network content
US8387145B2 (en) 2009-06-08 2013-02-26 Microsoft Corporation Blocking malicious activity using blacklist
US8285536B1 (en) 2009-07-31 2012-10-09 Google Inc. Optimizing parameters for machine translation
US20120137367A1 (en) 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis
US8719930B2 (en) 2010-10-12 2014-05-06 Sonus Networks, Inc. Real-time network attack detection and mitigation infrastructure
KR101519623B1 (en) 2010-12-13 2015-05-12 한국전자통신연구원 DDoS detection apparatus and method, DDoS detection and prevention apparatus for reducing positive false
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US8418249B1 (en) 2011-11-10 2013-04-09 Narus, Inc. Class discovery for automated discovery, attribution, analysis, and risk assessment of security threats
US9189746B2 (en) 2012-01-12 2015-11-17 Microsoft Technology Licensing, Llc Machine-learning based classification of user accounts based on email addresses and other account information
US9148434B2 (en) 2012-06-21 2015-09-29 Microsoft Technology Licensing, Llc Determining populated IP addresses
US9069963B2 (en) 2012-07-05 2015-06-30 Raytheon Bbn Technologies Corp. Statistical inspection systems and methods for components and component relationships
US9367879B2 (en) 2012-09-28 2016-06-14 Microsoft Corporation Determining influence in a network
US9077744B2 (en) 2013-03-06 2015-07-07 Facebook, Inc. Detection of lockstep behavior
US9467464B2 (en) 2013-03-15 2016-10-11 Tenable Network Security, Inc. System and method for correlating log data to discover network vulnerabilities and assets
US8955129B2 (en) 2013-04-23 2015-02-10 Duke University Method and system for detecting fake accounts in online social networks
US9183387B1 (en) 2013-06-05 2015-11-10 Google Inc. Systems and methods for detecting online attacks
US9300682B2 (en) 2013-08-09 2016-03-29 Lockheed Martin Corporation Composite analysis of executable content across enterprise network
WO2015054597A2 (en) * 2013-10-12 2015-04-16 H. Lee Moffitt Cancer Center And Research Institute, Inc. Systems and methods for diagnosing tumors in a subject by performing a quantitative analysis of texture-based features of a tumor object in a radiological image
US20150121461A1 (en) 2013-10-24 2015-04-30 Cyber-Ark Software Ltd. Method and system for detecting unauthorized access to and use of network resources with targeted analytics
US9361463B2 (en) 2013-12-11 2016-06-07 Ut-Batelle, Llc Detection of anomalous events
US9148441B1 (en) 2013-12-23 2015-09-29 Symantec Corporation Systems and methods for adjusting suspiciousness scores in event-correlation graphs
US10009358B1 (en) 2014-02-11 2018-06-26 DataVisor Inc. Graph based framework for detecting malicious or compromised accounts
US9787640B1 (en) 2014-02-11 2017-10-10 DataVisor Inc. Using hypergraphs to determine suspicious user activities
US10110616B1 (en) 2014-02-11 2018-10-23 DataVisor Inc. Using group analysis to determine suspicious accounts or activities
US10129288B1 (en) 2014-02-11 2018-11-13 DataVisor Inc. Using IP address data to detect malicious activities
WO2015134665A1 (en) * 2014-03-04 2015-09-11 SignalSense, Inc. Classifying data with deep learning neural records incrementally refined through expert input
US10832280B2 (en) 2015-06-12 2020-11-10 Comscore, Inc. Fraudulent traffic detection and estimation
US20210264458A1 (en) * 2016-03-25 2021-08-26 State Farm Mutual Automobile Insurance Company Preempting or resolving fraud disputes relating to introductory offer expirations
US10846308B2 (en) * 2016-07-27 2020-11-24 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN110945538B (en) 2017-04-03 2023-08-11 Vistor Technology Automatic rule recommendation engine

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070289013A1 (en) * 2006-06-08 2007-12-13 Keng Leng Albert Lim Method and system for anomaly detection using a collective set of unsupervised machine-learning algorithms
CA2779325A1 (en) * 2011-06-06 2012-12-06 Radicalogic Technologies, Inc. Health care incident prediction
US20170011029A1 (en) * 2013-05-09 2017-01-12 Moodwire, Inc. Hybrid human machine learning system and method
US20150261955A1 (en) * 2014-03-17 2015-09-17 Proofpoint, Inc. Behavior profiling for malware detection
US20150339477A1 (en) * 2014-05-21 2015-11-26 Microsoft Corporation Risk assessment modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI, D. M.: "Comparison of feature selection and classification algorithms in identifying malicious executables", vol. 51, no. 51, pages 3156-3172, XP005871954, DOI: 10.1016/j.csda.2006.09.005 *
GRZYMALA-BUSSE: "Rule Induction from Rough Approximations", pages 371-385 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347235A (en) * 2020-11-05 2021-02-09 北京羽扇智信息科技有限公司 Rule base generation method and device
CN113641708A (en) * 2021-08-11 2021-11-12 华院计算技术(上海)股份有限公司 Rule engine optimization method, data matching method and device, storage medium and terminal
CN113641708B (en) * 2021-08-11 2022-07-26 华院计算技术(上海)股份有限公司 Rule engine optimization method, data matching method and device, storage medium and terminal

Also Published As

Publication number Publication date
US20180285745A1 (en) 2018-10-04
WO2018187361A1 (en) 2018-10-11
US11232364B2 (en) 2022-01-25
CN110945538B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN110945538B (en) Automatic rule recommendation engine
US11848760B2 (en) Malware data clustering
Mathur et al. Characterizing the Use of Browser-Based Blocking Extensions To Prevent Online Tracking
US9705902B1 (en) Detection of client-side malware activity
US10264009B2 (en) Automated machine learning scheme for software exploit prediction
US9311599B1 (en) Methods, systems, and media for identifying errors in predictive models using annotators
CN114207648A (en) Techniques to automatically update payment information in a computing environment
US11232369B1 (en) Training data quality for spam classification
US10789357B2 (en) System and method for detecting fraudulent software installation activity
US9225738B1 (en) Markov behavior scoring
KR101706136B1 (en) Abnormal pattern analysis method, abnormal pattern analysis apparatus performing the same and storage media storing the same
US20190132352A1 (en) Nearline clustering and propagation of entity attributes in anti-abuse infrastructures
US20170244741A1 (en) Malware Identification Using Qualitative Data
CN111213349A (en) System and method for detecting fraud on a client device
CN109478219B (en) User interface for displaying network analytics
US10510014B2 (en) Escalation-compatible processing flows for anti-abuse infrastructures
Tax et al. Machine learning for fraud detection in e-Commerce: A research agenda
AU2016246074A1 (en) Message report processing and threat prioritization
JP6872853B2 (en) Detection device, detection method and detection program
US20230229785A1 (en) Systems and methods for analyzing cybersecurity threat severity using machine learning
US20230205871A1 (en) Spammy app detection systems and methods
JP2020113216A (en) Analyzer and method for analysis
Sun et al. Padetective: A systematic approach to automate detection of promotional attackers in mobile app store
CN114787807A (en) Information processing apparatus, information processing method, and information processing program
Lavalle et al. A methodology, based on user requirements and visualization techniques, to evaluate the impact of bias in data analytics and artificial intelligence algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant