US20210218760A1 - Fraud detection using graph databases - Google Patents
- Publication number
- US20210218760A1 (application US 16/944,932)
- Authority
- US
- United States
- Prior art keywords
- node
- account
- attribute
- revised
- nodes
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/382—Payment protocols; Details thereof insuring higher security of transaction
- G06Q20/3827—Use of message hashing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4016—Transaction verification involving fraud or risk level assessment in transaction processing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Definitions
- aspects of the disclosure relate generally to data storage and more specifically to graph databases.
- Fraud detection is a set of activities undertaken to identify attempts to gain unauthorized access to one or more accounts.
- One common type of fraud in banking is customer account takeover, where someone illegally gains access to a victim's bank account.
- Other examples of fraud include the use of false identities and money laundering.
- aspects described herein may address these and other problems, and generally improve the quality, efficiency, and speed of fraud detection by improving the ability of a system to store and process data.
- Fraud detection systems may use graph databases to store data, allowing for querying the graph database to obtain data using a variety of graph semantics such as nodes, edges, and properties.
- Graph databases in accordance with embodiments of the invention may include account nodes and attribute nodes, where nodes of the same type are not directly linked to each other. That is, account nodes are not linked to other account nodes and attribute nodes are not linked to other attribute nodes.
- an updated node may be created with a higher version number than the existing node. The updated node may then be linked while preserving the previous version(s) of the node.
- Each node may include an indication of the node being associated with fraudulent activity.
- Fraud proximity scores (and other fraud indicators) may be calculated based on the relationships between the attribute nodes, address nodes, and fraud indicators within the graph database.
- Fraud may be detected by identifying situations where a fraudster is reusing permutations of (possibly stolen) credentials to open new accounts or to perform account takeovers.
- a fraudster may use the same mailing address to open multiple accounts and/or take over an existing account by changing the mailing address on file to a fraudulent address in order to receive a new card in the mail.
- the existing address node may be replicated and the new version of the address node may be created with the fraudulent mailing address.
- the account node may be connected to previous versions of the address nodes by an immutable linking feature, such as account number, such that the account node is associated with each version of the address nodes.
- Particular versions of the address node, such as the updated version inserted by the fraudster in this example, may be marked as fraudulent. In this way, accounts associated with the fraudulent versions of the address node may be identified. Additionally, when the account is recovered and a non-fraudulent address is associated with the account, the previously fraudulent address attribute node may be maintained as a historical record of the fraudulent activity.
- FIG. 1 illustrates an example of a fraud detection system in which one or more aspects described herein may be implemented.
- FIG. 2 illustrates an example computing device in accordance with one or more aspects described herein.
- FIG. 3 illustrates an example graph database in accordance with one or more aspects described herein.
- FIG. 4 depicts a flow chart for inserting data into a graph database according to one or more aspects of the disclosure.
- FIG. 5 depicts a flow chart for calculating a fraud proximity score according to one or more aspects of the disclosure.
- FIG. 6 depicts a flow chart for preprocessing data according to one or more aspects of the disclosure.
- Fraud detection systems may use graph databases to store data, allowing for querying the graph database to obtain data using a variety of graph semantics such as nodes, edges, and properties.
- Graph databases in accordance with embodiments of the invention may include account nodes and attribute nodes, where nodes of the same type are not directly linked to each other. That is, account nodes are not linked to other account nodes and attribute nodes are not linked to other attribute nodes.
- an updated node may be created. The updated node may then be linked to a copy of an account node with a higher version number, allowing the preservation of previous states of the account.
- Each node may include an indication of the node being associated with fraudulent activity.
- Fraud may be detected by identifying situations where a fraudster is reusing permutations of (possibly stolen) credentials to open new accounts or to perform account takeovers.
- a fraudster may use the same mailing address to open multiple accounts and/or take over an existing account by changing the mailing address on file to a fraudulent address in order to receive a new card in the mail.
- the account node may be replicated and the new address node may be created with the fraudulent mailing address.
- the address node is connected to previous versions of the address nodes through a connection to an account node by an immutable linking feature, such as account number.
- Particular versions of the account node, such as the updated version created as a result of the fraudster updating an address in this example, may be marked as fraudulent. In this way, addresses associated with the fraudulent versions of the account node may be identified. Additionally, when the account is recovered and a non-fraudulent address is associated with the account, the previously fraudulent address attribute node may be maintained as a historical record of the fraudulent activity.
- any of a variety of data such as customer name, phone numbers, date of birth, etc. may be utilized as described in more detail herein.
- Fraud proximity scores may be calculated based on the relationships between the attribute nodes, account nodes, and fraud indicators within the graph database. In existing fraud systems, accounts are forever penalized for being associated with fraudulent activity. Fraud detection systems in accordance with embodiments of the invention allow for a separation between the fraudulent activity and legitimate accounts, thereby improving the ability of the fraud detection systems to store data, particularly historical changes to particular nodes within the graph database, and accurately determine current and historical fraudulent activity. In several embodiments, fraudulent activity may be identified based on a fraud proximity score. The fraud proximity score may be used to dynamically rate particular nodes for fraudulent activity from the perspective of a particular node within the graph database.
- data being stored in the graph database may be preprocessed into a common format, thereby facilitating easy comparison and scoring of the data.
- whitelists may be used to drop common/not useful attributes from the data, thereby improving the efficiency of the graph database to store data and eliminating potential false indicators of fraudulent activity.
- FIG. 1 illustrates a fraud detection system 100 in accordance with an embodiment of the invention.
- the fraud detection system 100 includes at least one client device 110 and/or at least one fraud detection server system 120 in communication via a network 130 .
- network connections shown are illustrative and any means of establishing a communications link between the computers may be used.
- the existence of any of various network protocols such as TCP/IP, Ethernet, FTP, HTTP and the like, and of various wireless communication technologies such as GSM, CDMA, WiFi, and LTE, is presumed, and the various computing devices described herein may be configured to communicate using any of these network protocols or technologies. Any of the devices and systems described herein may be implemented, in whole or in part, using one or more computing devices described with respect to FIG. 2 .
- Client devices 110 may provide data to and/or obtain data from the at least one fraud detection server system 120 as described herein.
- Fraud detection server systems 120 may store and process a variety of data as described herein.
- the network 130 may include a local area network (LAN), a wide area network (WAN), a wireless telecommunications network, and/or any other communication network or combination thereof.
- the data transferred to and from various computing devices in the fraud detection system 100 may include secure and sensitive data, such as confidential documents, customer personally identifiable information, and account data. It may be desirable to protect transmissions of such data using secure network protocols and encryption and/or to protect the integrity of the data when stored on the various computing devices. For example, a file-based integration scheme or a service-based integration scheme may be utilized for transmitting data between the various computing devices.
- Data may be transmitted using various network communication protocols. Secure data transmission protocols and/or encryption may be used in file transfers to protect the integrity of the data, for example, File Transfer Protocol (FTP), Secure File Transfer Protocol (SFTP), and/or Pretty Good Privacy (PGP) encryption.
- one or more web services may be implemented within the various computing devices. Web services may be accessed by authorized external devices and users to support input, extraction, and manipulation of data between the various computing devices in the fraud detection system 100 . Web services built to support a personalized display system may be cross-domain and/or cross-platform, and may be built for enterprise use. Data may be transmitted using the Secure Sockets Layer (SSL) or Transport Layer Security (TLS) protocol to provide secure connections between the computing devices.
- Web services may be implemented using the WS-Security standard, providing for secure SOAP messages using XML encryption.
- Specialized hardware may be used to provide secure web services.
- secure network appliances may include built-in features such as hardware-accelerated SSL and HTTPS, WS-Security, and/or firewalls.
- Such specialized hardware may be installed and configured in the fraud detection system 100 in front of one or more computing devices such that any external devices may communicate directly with the specialized hardware.
- the computing device 200 may include a processor 203 for controlling overall operation of the computing device 200 and its associated components, including RAM 205 , ROM 207 , input/output device 209 , communication interface 211 , and/or memory 215 .
- a data bus may interconnect processor(s) 203 , RAM 205 , ROM 207 , memory 215 , I/O device 209 , and/or communication interface 211 .
- computing device 200 may represent, be incorporated in, and/or include various devices such as a desktop computer, a computer server, a mobile device, such as a laptop computer, a tablet computer, a smart phone, any other types of mobile computing devices, and the like, and/or any other type of data processing device.
- I/O device 209 may include a microphone, keypad, touch screen, and/or stylus through which a user of the computing device 200 may provide input, and may also include one or more of a speaker for providing audio output and a video display device for providing textual, audiovisual, and/or graphical output.
- Communication interface 211 may include one or more transceivers, digital signal processors, and/or additional circuitry and software for communicating via any network, wired or wireless, using any protocol as described herein.
- Software may be stored within memory 215 to provide instructions to processor 203 allowing computing device 200 to perform various actions.
- memory 215 may store software used by the computing device 200 , such as an operating system 217 , application programs 219 , and/or an associated internal database 221 .
- the various hardware memory units in memory 215 may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
- Memory 215 may include one or more physical persistent memory devices and/or one or more non-persistent memory devices.
- Memory 215 may include, but is not limited to, random access memory (RAM) 205 , read only memory (ROM) 207 , electronically erasable programmable read only memory (EEPROM), flash memory or other memory technology, optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store the desired information and that may be accessed by processor 203 .
- Processor 203 may include a single central processing unit (CPU), which may be a single-core or multi-core processor, or may include multiple CPUs. Processor(s) 203 and associated components may allow the computing device 200 to execute a series of computer-readable instructions to perform some or all of the processes described herein.
- various elements within memory 215 or other components in computing device 200 may include one or more caches, for example, CPU caches used by the processor 203 , page caches used by the operating system 217 , disk caches of a hard drive, and/or database caches used to cache content from database 221 .
- the CPU cache may be used by one or more processors 203 to reduce memory latency and access time.
- a processor 203 may retrieve data from or write data to the CPU cache rather than reading/writing to memory 215 , which may improve the speed of these operations.
- a database cache may be created in which certain data from a database 221 is cached in a separate smaller database in a memory separate from the database, such as in RAM 205 or on a separate computing device.
- a database cache on an application server may reduce data retrieval and data manipulation time by not needing to communicate over a network with a back-end database server.
- Although various components of computing device 200 are described separately, functionality of the various components may be combined and/or performed by a single component and/or multiple computing devices in communication without departing from the invention.
- FIG. 3 depicts an example graph database in accordance with one or more aspects described herein.
- the graph database 300 includes account node 310 , labeled Tom with account number 123 and version 1, account node 312 , labeled Sally with account number 158 and version 1, account node 314 , labeled Dave with account number 8560 and version 1, account node 316 , labeled Jane with account number 2587 and version 1, and account node 318 , labeled Bill with account number 9874 and version 1.
- Graph database 300 also includes attribute node 320, having two versions: version 1, having value “123 Q Avenue” and a fraud indicator of 0, and version 2, having value “456 Z Ave” and a fraud indicator of 1.
- Graph database 300 further includes attribute node 322 , version 1 with value “123-45-6789” and a fraud indicator of 0, attribute node 324 , version 1 with value “(123) 456-7890” and a fraud indicator of 0, and attribute node 326 , version 1 with value “jedi@usa.com” and a fraud indicator of 0.
- Graph database 300 also includes query node 330 , having identifier 123 and associated with account node 310 and query node 332 , having identifier 9874 and associated with account node 318 . It should be noted that, in a variety of embodiments, each account node includes an associated query node that may be used to query the graph database as described herein.
- the account nodes and/or attribute nodes may or may not be versioned.
- some embodiments of the invention may have versioned account nodes but not versioned attribute nodes.
- a variety of embodiments of the invention may employ versioned attribute nodes but not versioned account nodes.
- Several embodiments of the invention may include versioning for both account nodes and attribute nodes.
- Account nodes may include a unique identifier and static information for an account, such as customer name, date of birth, a fraudulent flag, a version number, account status, account number, account open date, account closed date, and the like.
- the unique identifier may be used as a primary key for the account node.
- An account status may indicate the current status of an account such as, but not limited to, open, closed, and voluntarily closed.
- Attribute nodes may include a value, a version number, and a fraudulent flag. The value may be used as the primary key of the attribute node.
- Attribute nodes may be associated with dynamic information of an account such as, but not limited to, mailing address, email address, social security number, phone number, etc.
- an attribute node includes a label indicating the class of data stored in the attribute node.
- the data stored in an attribute node may be disambiguated to differentiate different data types that may be confusingly similar, such as social security numbers and phone numbers.
- Attribute nodes may be linked to account nodes via an edge having a label and a weight.
- the label of the edge may indicate the class of data stored using the attribute node. For example, if an attribute node stores a social security number, the corresponding edges may have the label “SSN.” However, the label on an edge need not correspond to the data type of the corresponding attribute node.
- the weight of an edge may be pre-determined and/or determined dynamically based on its associated account nodes and/or attribute nodes. An edge weight may be any value, including positive and negative values. In a variety of embodiments, the weight of an edge is modified based on a fraud proximity score and/or whitelist as described in more detail herein.
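- As a rough illustration of the node and edge structure described above, the following Python sketch models account nodes, versioned attribute nodes, and labeled, weighted edges. The class, field, and function names (`AccountNode`, `AttributeNode`, `Edge`, `link`) are hypothetical and do not come from the disclosure. The sample values are the ones given for account nodes 310 and 312 and attribute node 320, and the default edge weight of 1.0 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountNode:
    account_number: str      # unique identifier, used here as the primary key
    customer_name: str
    version: int = 1
    fraudulent: bool = False


@dataclass(frozen=True)
class AttributeNode:
    label: str               # class of data, e.g. "Address" or "SSN"
    value: str
    version: int = 1
    fraudulent: bool = False


@dataclass(frozen=True)
class Edge:
    account: AccountNode
    attribute: AttributeNode
    label: str
    weight: float = 1.0      # assumed default; weights may be any value


def link(account, attribute, label, weight=1.0):
    """Create an edge, rejecting account-account and attribute-attribute links."""
    if not isinstance(account, AccountNode) or not isinstance(attribute, AttributeNode):
        raise TypeError("edges must join an account node to an attribute node")
    return Edge(account, attribute, label, weight)


tom = AccountNode("123", "Tom")
sally = AccountNode("158", "Sally")
addr_v1 = AttributeNode("Address", "123 Q Avenue", version=1)
addr_v2 = AttributeNode("Address", "456 Z Ave", version=2, fraudulent=True)

edges = [link(sally, addr_v1, "Address"), link(tom, addr_v2, "Address")]
```

Checking the endpoint types in `link` is one simple way to keep nodes of the same type from being linked directly; a production graph database would more likely enforce this through its schema or constraints.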
- Because account nodes may only be connected to attribute nodes within graph database 300, account-to-account connections are always an even number of levels deep.
- This structure of the graph database may be used to discover connections to known fraud accounts when querying the graph database 300 . For example, Bill's two-level deep connections will return Dave (account node 314 ) and Jane (account node 316 ), while Bill's four-level deep connections will return Sally (account node 312 ) and Jane (account node 316 ).
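- The even-depth property can be sketched with a breadth-first search over a small, made-up graph. The adjacency map below is hypothetical and does not reproduce the edges of FIG. 3; the `acct:` and `attr:` prefixes and the function name `accounts_at_depth` are likewise assumptions. The sketch measures depth as shortest distance in edges, which is only one possible reading of "levels deep".

```python
from collections import deque


def accounts_at_depth(adjacency, source, depth):
    """Return account nodes whose shortest distance from `source` is `depth` edges."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # no need to expand past the requested depth
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {n for n, d in distance.items() if d == depth and n.startswith("acct:")}


# Hypothetical bipartite graph: accounts only ever touch attributes.
adjacency = {
    "acct:A": ["attr:addr1"],
    "attr:addr1": ["acct:A", "acct:B"],
    "acct:B": ["attr:addr1", "attr:phone1"],
    "attr:phone1": ["acct:B", "acct:C"],
    "acct:C": ["attr:phone1"],
}

two_levels = accounts_at_depth(adjacency, "acct:A", 2)
three_levels = accounts_at_depth(adjacency, "acct:A", 3)
four_levels = accounts_at_depth(adjacency, "acct:A", 4)
```

In this sample graph, accounts appear only at depths 2 and 4, and the odd depth returns an empty set because every odd-depth node is an attribute node.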
- Multiple account nodes may link to the same attribute node and/or various versions thereof. For example, multiple people may live at the same address.
- Nodes are linked using edges indicating a relationship between the linked nodes and having a label indicating a particular attribute described by the relationship. For example, account node 310 is linked to version 2 of attribute node 320 via an edge having a label of “Address,” thereby indicating that account 123 has an address of 456 Z Ave. Similarly, account node 312 is linked to version 1 of attribute node 320 via an edge having a label of “Address,” thereby indicating that account 158 has an address of 123 Q Ave.
- a fraud indicator of zero may be used to indicate a non-fraudulent account and a fraud indicator of one may be used to indicate a fraudulent account.
- any fraud indicator, including fraud indicators that utilize more than two values to indicate a degree of fraudulence associated with a node, may be used.
- FIG. 4 depicts a flow chart for inserting data into a graph database according to one or more aspects of the disclosure. Some or all of the steps of process 400 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- updated data may be obtained.
- the updated data is obtained from a computing device associated with a particular account.
- the updated data may be regarding an account and/or a particular property of the account.
- the updated data may update data already associated with an account and/or add new data to an account. Any property of an account, including those described herein, may be added and/or updated.
- the updated data is preprocessed using one or more of a variety of techniques, such as those described in more detail with respect to FIG. 6 .
- the updated data indicates that a particular attribute of an account has been flagged as fraudulent or non-fraudulent.
- a node may be determined.
- the node may be determined based on the updated data.
- the updated data includes an indication of the class of data being updated and the class indication may be used to determine an appropriate node.
- the updated data includes an account number that may be used to identify a particular node, such as an account node and/or attribute node, to be updated based on the updated data.
- the updated data is associated with a particular account node and adds a new attribute to the associated account.
- the updated data may not be associated with any account nodes currently stored using a graph database.
- an updated node may be generated.
- the updated data contains updated data for an existing node.
- An updated node may be generated based on an existing node and the updated data, where the updated node has a higher version number than the existing node. In this way, the updated node is associated with the corresponding existing node. For example, if the updated data includes a new address for an account, an account attribute node may be created and linked to the address with the updated data and a version indicator determined based on the version of the previous account attribute node.
- the updated data may include a new node to be inserted into the graph database.
- edge data may be generated.
- the generated edge data may indicate the relationship between the account node indicated in the updated data and the updated node.
- the generated edge data may have a label corresponding to the class of data indicated in the updated data.
- the generated edge data may have a weight determined based on the label of the edge data and/or any other criteria, such as the difference in time between when the previous node was created and the updated data was received. For example, a recent change to a particular attribute of an account may be indicative of fraud, and more recently created edges may be given a greater weight in determining a fraud proximity score for an account.
- the updated node includes an account node and the generated edge data may link the updated account node to a query node associated with the account node. In this way, a query node may link to every version of an account node, thereby facilitating the querying of different versions of an account stored within a graph database.
- the graph database may be updated.
- the graph database may be updated to store the updated node and edge data.
- updating the graph database includes adding a new version to an existing node and updated edge data to link particular account nodes to the new version of the existing node.
- updating the graph database includes adding newly created nodes to the graph database and associating existing nodes to the newly created nodes using the generated edge data.
- a new query node may be generated based on the newly created account node and the new query node may also be added to the graph database.
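- The versioned update described for process 400 might look like the following in-memory sketch. It is not the disclosed implementation: `GraphStore`, `upsert_attribute`, and the node identifier `address-320` are hypothetical names, and a real system would use a graph database rather than Python dictionaries. The point illustrated is that an update appends a higher-numbered version and a new edge while earlier versions stay in place.

```python
from dataclasses import dataclass, field


@dataclass
class GraphStore:
    # node id -> list of versions, oldest first; earlier versions are never overwritten
    versions: dict = field(default_factory=dict)
    # (account_number, node_id, version, label) tuples
    edges: list = field(default_factory=list)

    def upsert_attribute(self, account_number, node_id, label, value):
        """Append a new version of an attribute node and link it to the account."""
        history = self.versions.setdefault(node_id, [])
        version = len(history) + 1  # higher than every existing version
        history.append({"version": version, "value": value, "fraudulent": False})
        self.edges.append((account_number, node_id, version, label))
        return version


store = GraphStore()
v1 = store.upsert_attribute("123", "address-320", "Address", "123 Q Avenue")
v2 = store.upsert_attribute("123", "address-320", "Address", "456 Z Ave")

# Flag only the version inserted by the fraudster; version 1 is left as is.
store.versions["address-320"][v2 - 1]["fraudulent"] = True
```

Keeping the history as an append-only list means the fraudulent version remains available as a historical record after the account is recovered and a third, legitimate version is added.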
- FIG. 5 depicts a flow chart for calculating a fraud proximity score according to one or more aspects of the disclosure. Some or all of the steps of process 500 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- a graph database may be obtained.
- the graph database may be obtained from any of a variety of computing devices as described herein.
- the graph database may contain data for a variety of accounts, stored using a set of account nodes and attribute nodes as described herein.
- the graph database may be queried to determine features within the graph database. For example, a graph database may be queried to determine a number of unique account attributes (e.g. account numbers, social security numbers, etc.) stored in the graph database, the size of the graph database, and any of a variety of other queries.
- a graph database may be queried to calculate a fraud proximity score for a particular account.
- a source account node may be determined.
- the source account node may be the account node for which a fraud proximity score is to be calculated.
- the source account node is determined by querying the graph database to determine a query node linking to one or more versions of the account node.
- the query node may be used to determine a particular version of the associated account node to be used as the source account node.
- one or more paths to suspicious account nodes may be determined.
- An account node may be indicated as suspicious when one or more attribute nodes associated with the account node have a fraud indicator set.
- an account node includes a suspicious account indicator that identifies a particular account node as being suspicious.
- the paths to suspicious account nodes are determined using the source account node as the root node of a graph traversal algorithm. Any of a variety of graph traversal algorithms, such as depth-first search and breadth-first search, may be used to determine a path from the source account node to a suspicious account node.
- a path to a suspicious account node may be determined by finding a path from the source account node to an attribute node with a fraud indicator set, then determining the set of account nodes linked to the attribute node by at least one edge.
- the fraud indicator may be set for a particular version of an attribute node linked to the account node and/or to a different version of the account node that is not directly linked to the attribute node.
- the linked account nodes may be indicated as suspicious account nodes based on their relationship to the fraudulent attribute node. For example, a large number of accounts linked on a common IP address may have many paths to suspicious account nodes even if the majority of accounts are not fraudulent.
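- One way to picture the path search is a depth-first enumeration of simple paths from the source account node to accounts that touch a flagged attribute node. The sketch below is an assumption-laden illustration: the graph, the node names, the `max_depth` cutoff, and the function `paths_to_suspicious` are all hypothetical, and the disclosure allows any traversal algorithm.

```python
def paths_to_suspicious(adjacency, fraudulent_attributes, source, max_depth=4):
    """Enumerate simple paths from `source` to accounts linked to a flagged attribute."""
    # Every account directly linked to a flagged attribute is treated as suspicious.
    suspicious = {
        account
        for attribute in fraudulent_attributes
        for account in adjacency[attribute]
    } - {source}
    found = []

    def walk(node, path):
        if node in suspicious:
            found.append(list(path))
        if len(path) - 1 >= max_depth:
            return  # path already uses max_depth edges
        for neighbor in adjacency[node]:
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                path.append(neighbor)
                walk(neighbor, path)
                path.pop()

    walk(source, [source])
    return found


# Hypothetical graph; attr:addr carries a fraud indicator.
adjacency = {
    "acct:A": ["attr:email"],
    "attr:email": ["acct:A", "acct:B"],
    "acct:B": ["attr:email", "attr:addr"],
    "attr:addr": ["acct:B", "acct:C"],
    "acct:C": ["attr:addr"],
}

paths = paths_to_suspicious(adjacency, {"attr:addr"}, "acct:A")
```

Here `acct:B` and `acct:C` are both suspicious because each links to the flagged address, and the search reports one path to each from `acct:A`.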
- a fraud proximity score may be calculated.
- the fraud proximity score may be calculated based on the number of paths to suspicious account nodes and/or the magnitude of the suspiciousness of a particular account node. For example, account nodes that are indicated as suspicious based on a relationship to a fraudulent social security number attribute node may be more likely to be fraudulent than account nodes that are indicated as suspicious based on a relationship to a fraudulent address attribute node.
- the degree to which a particular node is indicated as suspicious may be determined based on the edge weight for the edge connecting the account node to the attribute node.
- the degree to which a particular node is indicated as suspicious may be determined based on how far removed the version of the attribute node linked to the account node is from the version of the attribute node indicated as fraudulent. For example, an account node linked to a fraudulent account through an intermediary account is less suspicious than one that is directly linked to the fraudulent account. In a variety of embodiments, the contribution of edge weights decays as the depth increases.
- the fraud proximity score may be normalized based on a percentage of all paths within the graph database that end at a fraudulent node and/or a suspicious account node.
- a fraud proximity score for a source account node is given using the following equation:
- n_{p_A} is the number of paths in P_A
- n_{P_A} is the number of paths connecting the source node to non-fraudulent accounts
- n_R is the length of path R
- r is each relationship in path R
- ⁇ is a decay factor
- e_r is the edge weight of relationship r in path R.
- the depth of r_R may be determined based on the distance from the source node to the relationship r.
- the decay factor is a number between zero and one with a default value of 0.5.
- particular relationships between nodes may be excluded from the calculation of a fraud proximity score by setting the edge weight of the corresponding edge to zero.
- the fraud proximity scores can be associated with and/or stored using the source node.
- when the fraud proximity score for the source node exceeds a threshold value the source node can be indicated as fraudulent.
- the threshold value can be pre-determined and/or determined dynamically based on the fraud proximity scores for multiple nodes within the graph database.
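As an illustration of how the quantities defined above might combine, the following sketch assumes one plausible form of the score: each edge weight on a path to a suspicious account is multiplied by the decay factor raised to the depth of that edge, the results are summed over all such paths, and the sum is normalized by the total number of paths. This form is an assumption made for the example and is not the equation of the disclosure; the function names, the convention that the first edge has depth zero, and the threshold comparison are likewise hypothetical.

```python
def fraud_proximity_score(fraud_paths, n_non_fraud_paths, decay=0.5):
    """Assumed scoring form, for illustration only.

    fraud_paths: list of paths to suspicious accounts, each a list of edge
        weights ordered from the source node outward.
    n_non_fraud_paths: number of paths from the source to non-fraudulent accounts.
    decay: decay factor between zero and one (default 0.5, as in the text).
    """
    total_paths = len(fraud_paths) + n_non_fraud_paths
    if total_paths == 0:
        return 0.0
    raw = 0.0
    for path in fraud_paths:
        for depth, weight in enumerate(path):
            # Deeper relationships contribute less; a zero weight excludes an edge.
            raw += (decay ** depth) * weight
    return raw / total_paths


def exceeds_threshold(score, threshold):
    """Flag the source node when its score exceeds the threshold."""
    return score > threshold
```

With one two-edge path of unit weights and one non-fraudulent path, the raw sum is 1 + 0.5 = 1.5 and the normalized score is 0.75. Setting every edge weight on a path to zero removes that path's contribution entirely.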
- FIG. 6 depicts a flow chart for preprocessing data according to one or more aspects of the disclosure. Some or all of the steps of process 600 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- whitelist data may be generated.
- Whitelist data may be used to filter common, default, and/or invalid values from data to be stored and/or stored using a graph database. For example, a default invalid email address may be used during an account creation and any account node linking to an email attribute node having the default email address may be considered to not have an associated email address.
- the whitelist data may be pre-determined and/or determined dynamically based on the data stored in the graph database. Pre-determined whitelists may include known and/or default values, such as default phone numbers, default email addresses, and/or any other default value for any attribute associated with an account. Dynamic whitelists may be automatically generated based on the frequency with which a piece of data occurs within the graph database and the fraud rate associated with the piece of data.
- the attribute may be excluded from processing (e.g. whitelisted), such as when processing fraud proximity scores.
- the threshold for fraud rate and/or frequency may be determined dynamically and the dynamic whitelist may be regenerated periodically (and/or on demand) to ensure that the values stored using the dynamic whitelist are accurate. If there are some fraudulent accounts associated with a particular attribute, a threshold value of fraudulent activity associated with the attribute may need to be reached before an attribute is removed from the dynamic whitelist.
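A dynamic whitelist of this kind can be sketched as follows. The example assumes that an attribute value is whitelisted when it is shared by at least `min_frequency` accounts and the fraction of those accounts marked fraudulent does not exceed `max_fraud_rate`; both parameters, their default values, and the input layout are hypothetical and would be tuned or determined dynamically in practice.

```python
from collections import defaultdict


def build_dynamic_whitelist(links, min_frequency=3, max_fraud_rate=0.1):
    """links: iterable of (attribute_value, account_id, account_is_fraudulent).

    Returns the set of attribute values that are common and rarely tied to fraud.
    """
    accounts = defaultdict(set)
    fraud_accounts = defaultdict(set)
    for value, account_id, is_fraud in links:
        accounts[value].add(account_id)
        if is_fraud:
            fraud_accounts[value].add(account_id)

    whitelist = set()
    for value, linked in accounts.items():
        fraud_rate = len(fraud_accounts[value]) / len(linked)
        if len(linked) >= min_frequency and fraud_rate <= max_fraud_rate:
            whitelist.add(value)
    return whitelist


# Hypothetical data: a default phone number shared by four clean accounts,
# a phone number shared by three accounts of which two are fraudulent,
# and an email address used by a single account.
example_links = [
    ("000-000-0000", "a1", False),
    ("000-000-0000", "a2", False),
    ("000-000-0000", "a3", False),
    ("000-000-0000", "a4", False),
    ("555-1234", "a5", True),
    ("555-1234", "a6", True),
    ("555-1234", "a7", False),
    ("x@example.com", "a8", False),
]
example_whitelist = build_dynamic_whitelist(example_links)
```

Regenerating the whitelist periodically amounts to re-running the function on the current contents of the graph database.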
- edge weights may be updated.
- Edge weights may be updated based on the generated whitelist data. For example, certain types of relationships may be excluded by setting the weight of edges corresponding to the relationship to zero. These whitelisted attribute nodes are shared by many accounts and do not generate useful networks, such as those for determining fraud proximity scores. By removing these relationships, the ability of fraud detection systems to identify fraud within graph databases is improved by reducing the amount of data to be processed and removing noise from the generated fraud proximity scores.
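The edge weight update can be illustrated with a short sketch. It assumes, purely for the example, that edges are held in a mapping keyed by (account identifier, attribute value) and that the whitelist is a set of attribute values; neither layout is prescribed by the disclosure.

```python
def apply_whitelist(edge_weights, whitelist):
    """Return a new mapping in which edges to whitelisted attributes weigh zero.

    edge_weights: {(account_id, attribute_value): weight} (hypothetical layout).
    """
    return {
        (account_id, value): (0.0 if value in whitelist else weight)
        for (account_id, value), weight in edge_weights.items()
    }


example_edges = {("a1", "000-000-0000"): 1.0, ("a1", "555-1234"): 0.8}
example_updated = apply_whitelist(example_edges, {"000-000-0000"})
```

Returning a new mapping rather than modifying the input keeps the original weights available if the whitelist is later regenerated.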
- data may be normalized.
- Data may be normalized by converting the data to a common format and/or removing special characters from the data. For example, account numbers, telephone numbers, addresses, and the like may be provided in a variety of formats that do not allow for easy matching between different pieces of data of the same class. For instance, the following addresses may refer to the same geographic location but do not match: 123 Main Street Suite 200b and Unit 200 B 123 Main St.
- Data may be normalized based on rules for particular classes of data, such that every piece of data of a particular class is formatted using a common format.
- address lines may be merged into a single line, apartment number formatting may be standardized, and street types and directions may be formatted using common wording and punctuation.
- both addresses may be normalized to 200B-123 Main Street. Any of a variety of rules, such as total number of lines of data, casing requirements (e.g. all uppercase, all lowercase, all sentence case, etc.), removing accented characters and/or converting accented characters to their non-accented equivalents, converting words in foreign languages into a standardized language (e.g. converting French words to English), and the like may be used. Normalizing data prior to inserting the data into the graph database may allow for improved fraud detection by allowing for fuzzy linking of the data, thereby improving the accuracy of the determined fraud proximity scores.
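A rule-based normalizer for the address example above might look like the following sketch. The unit keywords, the street-type table, and the output layout are assumptions chosen so that the two example addresses converge on the same string; a production rule set would be considerably larger.

```python
import re

# Hypothetical rule tables for the illustration.
UNIT_PATTERN = re.compile(
    r"\b(?:suite|ste|unit|apt|apartment)\.?\s+(\d+)\s*([A-Za-z])?\b",
    re.IGNORECASE,
)
STREET_TYPES = {"st": "Street", "ave": "Avenue", "rd": "Road", "blvd": "Boulevard"}


def normalize_address(raw):
    """Merge lines, pull out the unit, expand street types, and fix casing."""
    text = " ".join(raw.split())  # collapse line breaks and repeated spaces
    unit = ""
    match = UNIT_PATTERN.search(text)
    if match:
        unit = match.group(1) + (match.group(2) or "").upper()
        text = text[:match.start()] + " " + text[match.end():]
    words = []
    for word in text.split():
        key = word.rstrip(".,").lower()
        words.append(STREET_TYPES.get(key, key.capitalize()))
    street = " ".join(words)
    return f"{unit}-{street}" if unit else street
```

Both `normalize_address("123 Main Street Suite 200b")` and `normalize_address("Unit 200 B 123 Main St")` return `200B-123 Main Street`, so the two records would link to the same attribute node.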
- sensitive data may be encrypted. It may be desirable to encrypt particular classes of data, such as those associated with personally identifiable information standards, to improve the security of graph databases.
- sensitive data may be encrypted by salting the data with a random string and hashing the salted data. Any of a variety of hashing algorithms, such as MD5, MD6, SHA-2, and SHA-3, may be used. The use of the salt ensures that the hash cannot be reversed without knowing the salt, which is not stored. Salting is particularly important for linking attributes that exist in a finite domain space (such as phone numbers that consist of exactly 10 numeric digits) as an unsalted hash may easily be reversed through brute force.
- the sensitive data may be decrypted by applying the hashing function to the encrypted data when the salt is known.
- the sensitive data may be re-encrypted on a schedule using a different salt value, such as a nightly encryption of sensitive data using a newly generated salt value. In this way, even if the salt value is discovered, the encrypted data is only vulnerable for a short time.
- the hashing of the data allows for fuzzy matching of normalized attributes, as each matching attribute will hash to the same value. This allows for the matching of data even when the underlying data values are not known.
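The salting and hashing step can be sketched with the Python standard library. The example uses SHA-256 (a member of the SHA-2 family) and a single salt shared by all values hashed in the same run, which is what allows equal normalized values to produce equal digests; the function names and the salt length are assumptions for the illustration.

```python
import hashlib
import secrets


def new_salt(n_bytes=16):
    """Generate a random salt, e.g. once per scheduled re-hashing run."""
    return secrets.token_bytes(n_bytes)


def protect(value, salt):
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# Two records with the same normalized phone number match after hashing,
# even though the stored digests do not reveal the number itself.
example_salt = new_salt()
digest_one = protect("1234567890", example_salt)
digest_two = protect("1234567890", example_salt)
digest_other_salt = protect("1234567890", new_salt())
```

Digests produced under different salts do not match, so all values that need to be compared must be re-hashed together whenever the salt is rotated.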
- One or more aspects discussed herein may be embodied in computer-usable or readable data and/or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices as described herein.
- program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device.
- the modules may be written in a source code programming language that is subsequently compiled for execution, or may be written in a scripting language such as (but not limited to) HTML or XML.
- the computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid-state memory, RAM, and the like.
- the functionality of the program modules may be combined or distributed as desired in various embodiments.
- the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like.
- Particular data structures may be used to more effectively implement one or more aspects discussed herein, and such data structures are contemplated within the scope of computer executable instructions and computer-usable data described herein.
- Various aspects discussed herein may be embodied as a method, a computing device, a system, and/or a computer program product.
Abstract
Description
- The instant application is a continuation of U.S. patent application Ser. No. 16/739,519, titled “Fraud Detection using Graph Databases” and filed Jan. 10, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
- A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
- Aspects of the disclosure relate generally to data storage and more specifically to graph databases.
- Fraud detection is a set of activities undertaken to determine attempts to gain access to one or more accounts. One common type of fraud in banking is customer account takeover, where someone illegally gains access to a victim's bank account. Other examples of fraud include the use of false identities and money laundering.
- Aspects described herein may address these and other problems, and generally improve the quality, efficiency, and speed of fraud detection by improving the ability of a system to store and process data.
- The following presents a simplified summary of various aspects described herein. This summary is not an extensive overview, and is not intended to identify key or critical elements or to delineate the scope of the claims. The following summary merely presents some concepts in a simplified form as an introductory prelude to the more detailed description provided below. Corresponding apparatus, systems, and computer-readable media are also within the scope of the disclosure.
- Aspects discussed herein relate to the storage of data in graph databases and detecting fraudulent behavior in the stored data. Fraud detection systems may use graph databases to store data, allowing for querying the graph database to obtain data using a variety of graph semantics such as nodes, edges, and properties. Graph databases in accordance with embodiments of the invention may include account nodes and attribute nodes, where nodes of the same type are not directly linked to each other. That is, account nodes are not linked to other account nodes and attribute nodes are not linked to other attribute nodes. When a particular node is updated, an updated node may be created with a higher version number than the existing node. The updated node may then be linked while preserving the previous version(s) of the node. Each node may include an indication of the node being associated with fraudulent activity. Fraud proximity scores (and other fraud indicators) may be calculated based on the relationships between the attribute nodes, account nodes, and fraud indicators within the graph database.
- Fraud may be detected by identifying situations where a fraudster is reusing permutations of (possibly stolen) credentials to open new accounts or to perform account takeovers. For example, a fraudster may use the same mailing address to open multiple accounts and/or take over an existing account by changing the mailing address on file to a fraudulent address in order to receive a new card in the mail. When the request to update the mailing address node for the account is received, the existing address node may be replicated and the new version of the address node may be created with the fraudulent mailing address. The account node may be connected to previous versions of the address nodes by an immutable linking feature, such as account number, such that the account node is associated with each version of the address nodes. Particular versions of the address node, such as the updated version inserted by the fraudster in this example, may be marked as fraudulent. In this way, accounts associated with the fraudulent versions of the address node may be identified. Additionally, when the account is recovered and a non-fraudulent address is associated with the account, the previously fraudulent address attribute node may be maintained as a historical record of the fraudulent activity.
- These features, along with many others, are discussed in greater detail below.
- The present disclosure is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
- FIG. 1 illustrates an example of a fraud detection system in which one or more aspects described herein may be implemented;
- FIG. 2 illustrates an example computing device in accordance with one or more aspects described herein;
- FIG. 3 illustrates an example graph database in accordance with one or more aspects described herein;
- FIG. 4 depicts a flow chart for inserting data into a graph database according to one or more aspects of the disclosure;
- FIG. 5 depicts a flow chart for calculating a fraud proximity score according to one or more aspects of the disclosure; and
- FIG. 6 depicts a flow chart for preprocessing data according to one or more aspects of the disclosure.
- In the following description of the various embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration various embodiments in which aspects of the disclosure may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present disclosure. Aspects of the disclosure are capable of other embodiments and of being practiced or being carried out in various ways. In addition, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. Rather, the phrases and terms used herein are to be given their broadest interpretation and meaning.
- By way of introduction, aspects discussed herein relate to storing data using graph databases and identifying fraudulent behavior within the stored data. Fraud detection systems may use graph databases to store data, allowing for querying the graph database to obtain data using a variety of graph semantics such as nodes, edges, and properties. Graph databases in accordance with embodiments of the invention may include account nodes and attribute nodes, where nodes of the same type are not directly linked to each other. That is, account nodes are not linked to other account nodes and attribute nodes are not linked to other attribute nodes. When a particular node is updated, an updated node may be created. The updated node may then be linked to a copy of an account node with a higher version number, allowing the preservation of previous states of the account. Each node may include an indication of the node being associated with fraudulent activity.
- Fraud may be detected by identifying situations where a fraudster is reusing permutations of (possibly stolen) credentials to open new accounts or to perform account takeovers. For example, a fraudster may use the same mailing address to open multiple accounts and/or take over an existing account by changing the mailing address on file to a fraudulent address in order to receive a new card in the mail. When the associated mailing address node for the account is changed, the account node may be replicated and the new address node may be created with the fraudulent mailing address. The address node is connected to previous versions of the address nodes through a connection to an account node by an immutable linking feature, such as account number. Particular versions of the account node, such as the updated version created as a result of the fraudster updating an address in this example, may be marked as fraudulent. In this way, addresses associated with the fraudulent versions of the account node may be identified. Additionally, when the account is recovered and a non-fraudulent address is associated with the account, the previously fraudulent address attribute node may be maintained as a historical record of the fraudulent activity. However, it should be noted that any of a variety of data, such as customer name, phone numbers, date of birth, etc. may be utilized as described in more detail herein.
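The versioning behavior in the account takeover example can be sketched as an append-only history. The sketch assumes that each version of an account is a plain dictionary sharing the immutable account number, and that an address change creates a new version instead of overwriting the old one; the field names and the function are hypothetical.

```python
def apply_address_change(versions, new_address, fraudulent=False):
    """Return a new history with one more account version appended.

    versions: list of version dictionaries for one account, oldest first.
    Earlier versions are left untouched so that history is preserved.
    """
    latest = versions[-1]
    updated = dict(
        latest,
        version=latest["version"] + 1,
        address=new_address,
        fraudulent=fraudulent,
    )
    return versions + [updated]


# Hypothetical takeover and recovery of account 123.
history = [
    {"account_number": "123", "version": 1, "address": "123 Q Avenue", "fraudulent": False}
]
history = apply_address_change(history, "456 Z Ave", fraudulent=True)  # takeover
history = apply_address_change(history, "123 Q Avenue")  # account recovered
```

After recovery the history holds three versions: the current one is not marked as fraudulent, while version 2 remains flagged as a record of the fraudulent address change.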
- Fraud proximity scores (and other fraud indicators) may be calculated based on the relationships between the attribute nodes, account nodes, and fraud indicators within the graph database. In existing fraud systems, accounts are forever penalized for being associated with fraudulent activity. Fraud detection systems in accordance with embodiments of the invention allow for a separation between the fraudulent activity and legitimate accounts, thereby improving the ability of the fraud detection systems to store data, particularly historical changes to particular nodes within the graph database, and accurately determine current and historical fraudulent activity. In several embodiments, fraudulent activity may be identified based on a fraud proximity score. The fraud proximity score may be used to dynamically rate particular nodes for fraudulent activity from the perspective of a particular node within the graph database. In a number of embodiments, data being stored in the graph database may be preprocessed into a common format, thereby facilitating easy comparison and scoring of the data. Additionally, whitelists may be used to drop common/not useful attributes from the data, thereby improving the efficiency of the graph database to store data and eliminating potential false indicators of fraudulent activity.
- FIG. 1 illustrates a fraud detection system 100 in accordance with an embodiment of the invention. The fraud detection system 100 includes at least one client device 110 and/or at least one fraud detection server system 120 in communication via a network 130. It will be appreciated that the network connections shown are illustrative and any means of establishing a communications link between the computers may be used. The existence of any of various network protocols such as TCP/IP, Ethernet, FTP, HTTP and the like, and of various wireless communication technologies such as GSM, CDMA, WiFi, and LTE, is presumed, and the various computing devices described herein may be configured to communicate using any of these network protocols or technologies. Any of the devices and systems described herein may be implemented, in whole or in part, using one or more computing devices described with respect to FIG. 2.
- Client devices 110 may provide data to and/or obtain data from the at least one fraud detection server system 120 as described herein. Fraud detection server systems 120 may store and process a variety of data as described herein. The network 130 may include a local area network (LAN), a wide area network (WAN), a wireless telecommunications network, and/or any other communication network or combination thereof.
- Some or all of the data described herein may be stored using any of a variety of data storage mechanisms, such as databases. These databases may include, but are not limited to relational databases, hierarchical databases, distributed databases, in-memory databases, flat file databases, XML databases, NoSQL databases, graph databases, and/or a combination thereof. The data transferred to and from various computing devices in the fraud detection system 100 may include secure and sensitive data, such as confidential documents, customer personally identifiable information, and account data. It may be desirable to protect transmissions of such data using secure network protocols and encryption and/or to protect the integrity of the data when stored on the various computing devices. For example, a file-based integration scheme or a service-based integration scheme may be utilized for transmitting data between the various computing devices. Data may be transmitted using various network communication protocols. Secure data transmission protocols and/or encryption may be used in file transfers to protect the integrity of the data, for example, File Transfer Protocol (FTP), Secure File Transfer Protocol (SFTP), and/or Pretty Good Privacy (PGP) encryption. In many embodiments, one or more web services may be implemented within the various computing devices. Web services may be accessed by authorized external devices and users to support input, extraction, and manipulation of data between the various computing devices in the fraud detection system 100. Web services built to support a personalized display system may be cross-domain and/or cross-platform, and may be built for enterprise use. Data may be transmitted using the Secure Sockets Layer (SSL) or Transport Layer Security (TLS) protocol to provide secure connections between the computing devices. Web services may be implemented using the WS-Security standard, providing for secure SOAP messages using XML encryption. Specialized hardware may be used to provide secure web services. For example, secure network appliances may include built-in features such as hardware-accelerated SSL and HTTPS, WS-Security, and/or firewalls. Such specialized hardware may be installed and configured in the fraud detection system 100 in front of one or more computing devices such that any external devices may communicate directly with the specialized hardware.
- Turning now to FIG. 2, a computing device 200 in accordance with an embodiment of the invention is shown. The computing device 200 may include a processor 203 for controlling overall operation of the computing device 200 and its associated components, including RAM 205, ROM 207, input/output device 209, communication interface 211, and/or memory 215. A data bus may interconnect processor(s) 203, RAM 205, ROM 207, memory 215, I/O device 209, and/or communication interface 211. In some embodiments, computing device 200 may represent, be incorporated in, and/or include various devices such as a desktop computer, a computer server, a mobile device, such as a laptop computer, a tablet computer, a smart phone, any other types of mobile computing devices, and the like, and/or any other type of data processing device.
- Input/output (I/O) device 209 may include a microphone, keypad, touch screen, and/or stylus through which a user of the computing device 200 may provide input, and may also include one or more of a speaker for providing audio output and a video display device for providing textual, audiovisual, and/or graphical output. Communication interface 211 may include one or more transceivers, digital signal processors, and/or additional circuitry and software for communicating via any network, wired or wireless, using any protocol as described herein. Software may be stored within memory 215 to provide instructions to processor 203 allowing computing device 200 to perform various actions. For example, memory 215 may store software used by the computing device 200, such as an operating system 217, application programs 219, and/or an associated internal database 221. The various hardware memory units in memory 215 may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Memory 215 may include one or more physical persistent memory devices and/or one or more non-persistent memory devices. Memory 215 may include, but is not limited to, random access memory (RAM) 205, read only memory (ROM) 207, electronically erasable programmable read only memory (EEPROM), flash memory or other memory technology, optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store the desired information and that may be accessed by processor 203.
- Processor 203 may include a single central processing unit (CPU), which may be a single-core or multi-core processor, or may include multiple CPUs. Processor(s) 203 and associated components may allow the computing device 200 to execute a series of computer-readable instructions to perform some or all of the processes described herein. Although not shown in FIG. 2, various elements within memory 215 or other components in computing device 200 may include one or more caches, for example, CPU caches used by the processor 203, page caches used by the operating system 217, disk caches of a hard drive, and/or database caches used to cache content from database 221. For embodiments including a CPU cache, the CPU cache may be used by one or more processors 203 to reduce memory latency and access time. A processor 203 may retrieve data from or write data to the CPU cache rather than reading/writing to memory 215, which may improve the speed of these operations. In some examples, a database cache may be created in which certain data from a database 221 is cached in a separate smaller database in a memory separate from the database, such as in RAM 205 or on a separate computing device. For instance, in a multi-tiered application, a database cache on an application server may reduce data retrieval and data manipulation time by not needing to communicate over a network with a back-end database server. These types of caches and others may be included in various embodiments, and may provide potential advantages in certain implementations of devices, systems, and methods described herein, such as faster response times and less dependence on network conditions when transmitting and receiving data.
- Although various components of computing device 200 are described separately, functionality of the various components may be combined and/or performed by a single component and/or multiple computing devices in communication without departing from the invention.
- FIG. 3 depicts an example graph database in accordance with one or more aspects described herein. The graph database 300 includes account node 310, labeled Tom with account number 123 and version 1, account node 312, labeled Sally with account number 158 and version 1, account node 314, labeled Dave with account number 8560 and version 1, account node 316, labeled Jane with account number 2587 and version 1, and account node 318, labeled Bill with account number 9874 and version 1. Graph database 300 also includes attribute node 320, having two versions: version 1 having value “123 Q Avenue” and a fraud indicator of 0, version 2 having value “456 Z Ave” with a fraud indicator of 1. Graph database 300 further includes attribute node 322, version 1 with value “123-45-6789” and a fraud indicator of 0, attribute node 324, version 1 with value “(123) 456-7890” and a fraud indicator of 0, and attribute node 326, version 1 with value “jedi@usa.com” and a fraud indicator of 0. Graph database 300 also includes query node 330, having identifier 123 and associated with account node 310, and query node 332, having identifier 9874 and associated with account node 318. It should be noted that, in a variety of embodiments, each account node includes an associated query node that may be used to query the graph database as described herein. However, depending on the requirements of specific embodiments of the invention, the account nodes and/or attribute nodes may or may not be versioned. For example, some embodiments of the invention may have versioned account nodes but not versioned attribute nodes. Similarly, a variety of embodiments of the invention may employ versioned attribute nodes but not versioned account nodes. Several embodiments of the invention may include versioning for both account nodes and attribute nodes.
- Account nodes may include a unique identifier and static information for an account, such as customer name, date of birth, a fraudulent flag, a version number, account status, account number, account open date, account closed date, and the like. The unique identifier may be used as a primary key for the account node. An account status may indicate the current status of an account such as, but not limited to, open, closed, and voluntarily closed. Attribute nodes may include a value, a version number, and a fraudulent flag. The value may be used as the primary key of the attribute node. Attribute nodes may be associated with dynamic information of an account such as, but not limited to, mailing address, email address, social security number, phone number, etc. In many embodiments, an attribute node includes a label indicating the class of data stored in the attribute node. In a variety of embodiments, the data stored in an attribute node may be disambiguated to differentiate different data types that may be confusingly similar, such as social security numbers and phone numbers. Attribute nodes may be linked to account nodes via an edge having a label and a weight. The label of the edge may indicate the class of data stored using the attribute node. For example, if an attribute node stores a social security number, the corresponding edges may have the label “SSN.” However, the label on an edge need not correspond to the data type of the corresponding attribute node. The weight of an edge may be pre-determined and/or determined dynamically based on its associated account nodes and/or attribute nodes. An edge weight may be any value, including positive and negative values. In a variety of embodiments, the weight of an edge is modified based on a fraud proximity score and/or whitelist as described in more detail herein.
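The node and edge structure described above can be sketched with simple record types. The class and field names below are assumptions for the illustration (only a subset of the listed account fields is shown), and the identifier `"a-310"` is a made-up key; the values are taken from the FIG. 3 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountNode:
    unique_id: str  # primary key of the account node
    account_number: str
    customer_name: str
    version: int = 1
    fraudulent: bool = False
    status: str = "open"


@dataclass(frozen=True)
class AttributeNode:
    value: str  # primary key of the attribute node
    label: str  # class of data, e.g. "Address" or "SSN"
    version: int = 1
    fraudulent: bool = False


@dataclass(frozen=True)
class Edge:
    account_id: str
    attribute_value: str
    attribute_version: int
    label: str
    weight: float = 1.0  # may be set to zero to exclude the relationship


# Tom's account linked to version 2 of the address attribute, which is flagged.
tom = AccountNode(unique_id="a-310", account_number="123", customer_name="Tom")
address_v2 = AttributeNode(value="456 Z Ave", label="Address", version=2, fraudulent=True)
tom_address = Edge(tom.unique_id, address_v2.value, address_v2.version, label="Address")
```

Because there is no edge type joining two account nodes or two attribute nodes, the structure is bipartite by construction.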
- As account nodes may only be connected to attribute nodes within graph database 300, account-to-account connections are an even number of levels deep. This structure of the graph database may be used to discover connections to known fraud accounts when querying the graph database 300. For example, Bill's two-level deep connections will return Dave (account node 314) and Jane (account node 316), while Bill's four-level deep connections will return Sally (account node 312) and Jane (account node 316). Multiple account nodes may link to the same attribute node and/or various versions thereof. For example, multiple people may live at the same address. Nodes are linked using edges indicating a relationship between the linked nodes and having a label indicating a particular attribute described by the relationship. For example, account node 310 is linked to version 2 of attribute node 320 via an edge having a label of “Address,” thereby indicating that account 123 has an address of 456 Z Ave. Similarly, account node 312 is linked to version 1 of attribute node 320 via an edge having a label of “Address,” thereby indicating that account 158 has an address of 123 Q Ave. In the illustrated embodiment, a fraud indicator of zero may be used to indicate a non-fraudulent account and a fraud indicator of one may be used to indicate a fraudulent account. However, it should be noted that any fraud indicator, including fraud indicators that utilize more than two values to indicate a degree of fraudulence associated with a node, may be used.
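The even-depth property can be illustrated with a depth-limited search over a small bipartite graph. The graph below is a made-up example and is not the edge set of FIG. 3, which is not fully enumerated in the text; the function and variable names are likewise hypothetical.

```python
from collections import deque


def accounts_within(adjacency, accounts, source, max_depth):
    """Return {account: depth} for accounts reachable within max_depth edges."""
    depths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand past the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return {n: d for n, d in depths.items() if n in accounts and n != source}


# Hypothetical graph: Bill and Dave share a phone, Dave and Sally share an email.
example_adjacency = {
    "bill": ["phone"],
    "dave": ["phone", "email"],
    "sally": ["email"],
    "phone": ["bill", "dave"],
    "email": ["dave", "sally"],
}
example_accounts = {"bill", "dave", "sally"}
two_levels = accounts_within(example_adjacency, example_accounts, "bill", 2)
four_levels = accounts_within(example_adjacency, example_accounts, "bill", 4)
```

Every account is found at an even depth, since each account-to-account hop passes through exactly one attribute node.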
FIG. 4 depicts a flow chart for inserting data into a graph database according to one or more aspects of the disclosure. Some or all of the steps of process 400 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- At step 410, updated data may be obtained. In several embodiments, the updated data is obtained from a computing device associated with a particular account. The updated data may concern an account and/or a particular property of the account. The updated data may update data already associated with an account and/or add new data to an account. Any property of an account, including those described herein, may be added and/or updated. In many embodiments, the updated data is preprocessed using one or more of a variety of techniques, such as those described in more detail with respect to FIG. 6. In a variety of embodiments, the updated data indicates that a particular attribute of an account has been flagged as fraudulent or non-fraudulent.
- At step 412, a node may be determined. The node may be determined based on the updated data. In several embodiments, the updated data includes an indication of the class of data being updated and the class indication may be used to determine an appropriate node. In several embodiments, the updated data includes an account number that may be used to identify a particular node, such as an account node and/or attribute node, to be updated based on the updated data. In a variety of embodiments, the updated data is associated with a particular account node and adds a new attribute to the associated account. In a number of embodiments, the updated data may not be associated with any account nodes currently stored using a graph database.
- At step 414, an updated node may be generated. In a variety of embodiments, the updated data contains updated data for an existing node. An updated node may be generated based on an existing node and the updated data, where the updated node has a higher version number than the existing node. In this way, the updated node is associated with the corresponding existing node. For example, if the updated data includes a new address for an account, an account attribute node may be created that stores the new address, with a version indicator determined based on the version of the previous account attribute node. When the updated data is not associated with any node within the graph database, the updated data may include a new node to be inserted into the graph database.
- At step 416, edge data may be generated. The generated edge data may indicate the relationship between the account node indicated in the updated data and the updated node. The generated edge data may have a label corresponding to the class of data indicated in the updated data. The generated edge data may have a weight determined based on the label of the edge data and/or any other criteria, such as the difference in time between when the previous node was created and the updated data was received. For example, a recent change to a particular attribute of an account may be indicative of fraud, and more recently created edges may be given a greater weight in determining a fraud proximity score for an account. In several embodiments, the updated node includes an account node and the generated edge data may link the updated account node to a query node associated with the account node. In this way, a query node may link to every version of an account node, thereby facilitating the querying of different versions of an account stored within a graph database.
- At step 418, the graph database may be updated. The graph database may be updated to store the updated node and edge data. In several embodiments, updating the graph database includes adding a new version to an existing node and adding updated edge data to link particular account nodes to the new version of the existing node. In a number of embodiments, updating the graph database includes adding newly created nodes to the graph database and associating existing nodes to the newly created nodes using the generated edge data. In many embodiments, when a newly created account node is added to the graph database, a new query node may be generated based on the newly created account node and the new query node may also be added to the graph database.
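Steps 412 through 418 of process 400 can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the disclosed implementation: the `TinyGraph` class, the choice to key version history by (account, label), the 30-day recency window, and the weights 2.0 and 1.0 are all assumptions chosen to show version numbering and recency-based edge weighting.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SECONDS_PER_DAY = 86400.0


@dataclass
class AttributeVersion:
    value: str
    version: int
    created_at: float  # seconds since an arbitrary epoch


@dataclass
class Edge:
    account_id: str
    label: str
    version: int
    weight: float


class TinyGraph:
    def __init__(self) -> None:
        self.versions: Dict[Tuple[str, str], List[AttributeVersion]] = {}
        self.edges: List[Edge] = []

    def apply_update(self, account_id: str, label: str, value: str, now: float) -> Edge:
        # Step 412: determine the node from the account number and data class.
        history = self.versions.setdefault((account_id, label), [])
        if history:
            previous = history[-1]
            # Step 414: the updated node gets a higher version number.
            version = previous.version + 1
            # Step 416: a recent change is weighted more heavily (assumed rule).
            age_days = (now - previous.created_at) / SECONDS_PER_DAY
            weight = 2.0 if age_days < 30 else 1.0
        else:
            version, weight = 1, 1.0
        # Step 418: store the updated node and the generated edge.
        history.append(AttributeVersion(value, version, now))
        edge = Edge(account_id, label, version, weight)
        self.edges.append(edge)
        return edge


graph = TinyGraph()
first = graph.apply_update("123", "Address", "123 Q Ave", now=0.0)
second = graph.apply_update("123", "Address", "456 Z Ave", now=5 * SECONDS_PER_DAY)
third = graph.apply_update("123", "Address", "789 Y Ave", now=45 * SECONDS_PER_DAY)
```

Earlier versions are kept rather than overwritten, so a query can still reach the address an account used before the change.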
FIG. 5 depicts a flow chart for calculating a fraud proximity score according to one or more aspects of the disclosure. Some or all of the steps of process 500 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- At step 510, a graph database may be obtained. The graph database may be obtained from any of a variety of computing devices as described herein. The graph database may contain data for a variety of accounts, stored using a set of account nodes and attribute nodes as described herein. The graph database may be queried to determine features within the graph database. For example, a graph database may be queried to determine a number of unique account attributes (e.g. account numbers, social security numbers, etc.) stored in the graph database, the size of the graph database, and any of a variety of other queries. In several embodiments, a graph database may be queried to calculate a fraud proximity score for a particular account.
- At step 512, a source account node may be determined. The source account node may be the account node for which a fraud proximity score is to be calculated. In several embodiments, the source account node is determined by querying the graph database to determine a query node linking to one or more versions of the account node. The query node may be used to determine a particular version of the associated account node to be used as the source account node.
- At step 514, one or more paths to suspicious account nodes may be determined. An account node may be indicated as suspicious when one or more attribute nodes associated with the account node have a fraud indicator set. In several embodiments, an account node includes a suspicious account indicator that identifies a particular account node as being suspicious. In several embodiments, the paths to suspicious account nodes are determined using the source account node as the root node of a graph traversal algorithm. Any of a variety of graph traversal algorithms, such as depth-first search and breadth-first search, may be used to determine a path from the source account node to a suspicious account node. In several embodiments, a path to a suspicious account node may be determined by finding a path from the source account node to an attribute node with a fraud indicator set, then determining the set of account nodes linked to the attribute node by at least one edge. The fraud indicator may be set for the particular version of the attribute node linked to the account node and/or for a different version of the attribute node that is not directly linked to the account node. The linked account nodes may be indicated as suspicious account nodes based on their relationship to the fraudulent attribute node. For example, a large number of accounts linked on a common IP address may have many paths to suspicious account nodes even if the majority of accounts are not fraudulent.
- At step 516, a fraud proximity score may be calculated. The fraud proximity score may be calculated based on the number of paths to suspicious account nodes and/or the magnitude of the suspiciousness of a particular account node. For example, account nodes that are indicated as suspicious based on a relationship to a fraudulent social security number attribute node may be more likely to be fraudulent than account nodes that are indicated as suspicious based on a relationship to a fraudulent address attribute node. In many embodiments, the degree to which a particular node is indicated as suspicious may be determined based on the edge weight for the edge connecting the account node to the attribute node. In a variety of embodiments, the degree to which a particular node is indicated as suspicious may be determined based on how far removed the version of the attribute node linked to the account node is from the version of the attribute node indicated as fraudulent. For example, an account node linked to a fraudulent account through an intermediary account is less suspicious than one that is directly linked to the fraudulent account. In a variety of embodiments, the contribution of edge weights decays as the depth increases. The fraud proximity score may be normalized based on a percentage of all paths within the graph database that end at a fraudulent node and/or a suspicious account node.
- In a number of embodiments, a fraud proximity score for a source account node (FPS_A) is given using the following equation:
-
- where P_A is the set of all possible paths from source account node A to suspicious account nodes, n_{P_A} is the number of paths in P_A, n_P is the number of paths connecting the source node to non-fraudulent accounts, n_R is the length of path R, r is each relationship in path R, λ is a decay factor, and e_r is the edge weight of relationship r in path R. The depth of r in R may be determined based on the distance from the source node to the relationship r.
- In a variety of embodiments, the decay factor is a number between zero and one with a default value of 0.5. In several embodiments, particular relationships between nodes may be excluded from the calculation of a fraud proximity score by setting the edge weight of the corresponding edge to zero. The fraud proximity scores can be associated with and/or stored using the source node. In several embodiments, when the fraud proximity score for the source node exceeds a threshold value, the source node can be indicated as fraudulent. The threshold value can be pre-determined and/or determined dynamically based on the fraud proximity scores for multiple nodes within the graph database.
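The equation itself is not reproduced in this text, so the following Python sketch should be read as one plausible combination of the quantities defined above, not as the disclosed formula. It assumes that each suspicious path contributes the sum of its decayed edge weights divided by the path length, that depth is counted from zero at the source node, and that the total is normalized by the number of suspicious plus non-fraudulent paths. The function name, the example weights, and the threshold of 0.25 are hypothetical.

```python
from typing import Sequence


def fraud_proximity_score(
    suspicious_paths: Sequence[Sequence[float]],
    n_clean_paths: int,
    decay: float = 0.5,
) -> float:
    """Score a source account from the edge weights along its suspicious paths.

    suspicious_paths: one sequence of edge weights per path to a suspicious
        account, ordered from the source node outward.
    n_clean_paths: number of paths from the source to non-fraudulent accounts.
    decay: decay factor between zero and one (default 0.5).
    """
    total_paths = len(suspicious_paths) + n_clean_paths
    if total_paths == 0:
        return 0.0
    total = 0.0
    for path in suspicious_paths:
        if not path:
            continue
        # Edges further from the source contribute less (assumed: depth starts at 0).
        decayed = sum(weight * decay ** depth for depth, weight in enumerate(path))
        total += decayed / len(path)
    return total / total_paths


# Two suspicious paths; the 0.0 weight models a whitelisted (excluded) edge.
paths = [[1.0, 1.0], [1.0, 0.0, 1.0, 1.0]]
score = fraud_proximity_score(paths, n_clean_paths=2)

THRESHOLD = 0.25  # hypothetical pre-determined threshold
is_fraudulent = score > THRESHOLD
```

Under these assumptions the first path contributes 1.5 / 2 and the second 1.375 / 4, and dividing their sum by the four paths in total gives the final score.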
FIG. 6 depicts a flow chart for preprocessing data according to one or more aspects of the disclosure. Some or all of the steps of process 600 may be performed using any of the computing devices and/or combination thereof described herein. In a variety of embodiments, some or all of the steps described below may be combined and/or divided into sub-steps as appropriate.
- At step 610, whitelist data may be generated. Whitelist data may be used to filter common, default, and/or invalid values from data that is stored, or is to be stored, using a graph database. For example, a default invalid email address may be used during an account creation and any account node linking to an email attribute node having the default email address may be considered to not have an associated email address. The whitelist data may be pre-determined and/or determined dynamically based on the data stored in the graph database. Pre-determined whitelists may include known and/or default values, such as default phone numbers, default email addresses, and/or any other default value for any attribute associated with an account. Dynamic whitelists may be automatically generated based on the frequency with which a piece of data occurs within the graph database and the fraud rate associated with the piece of data. That is, if an attribute is shared by many accounts but none of the associated accounts are suspicious and/or fraudulent, the attribute may be excluded (e.g. whitelisted) from processing, such as when processing fraud proximity scores. The threshold for fraud rate and/or frequency may be determined dynamically and the dynamic whitelist may be regenerated periodically (and/or on demand) to ensure that the values stored using the dynamic whitelist are accurate. If there are some fraudulent accounts associated with a particular attribute, a threshold value of fraudulent activity associated with the attribute may need to be reached before an attribute is removed from the dynamic whitelist.
- At step 612, edge weights may be updated. Edge weights may be updated based on the generated whitelist data. For example, certain types of relationships may be excluded by setting the weight of edges corresponding to the relationship to zero. These whitelisted attribute nodes are shared by many accounts and do not generate useful networks, such as those for determining fraud proximity scores. By removing these relationships, the ability of fraud detection systems to identify fraud within graph databases is improved by reducing the amount of data to be processed and removing noise from the generated fraud proximity scores.
- At step 614, data may be normalized. Data may be normalized by converting the data to a common format and/or removing special characters from the data. For example, account numbers, telephone numbers, addresses, and the like may be provided in a variety of formats that do not allow for easy matching between different pieces of data of the same class. For example, the following addresses may refer to the same geographic location but do not match: 123 Main Street Suite 200b and Unit 200B 123 Main St. Data may be normalized based on rules for particular classes of data, such that every piece of data of a particular class is formatted using a common format. For example, for address data, address lines may be merged into a single line, apartment number formatting may be standardized, and street types and directions may be formatted using common wording and punctuation. Returning to the previous example, both addresses may be normalized to 200B-123 Main Street. Any of a variety of rules, such as total number of lines of data, casing requirements (e.g. all uppercase, all lowercase, all sentence case, etc.), removing accented characters and/or converting accented characters to their non-accented equivalents, converting words in foreign languages into a standardized language (e.g. converting French words to English), and the like may be used. Normalizing data prior to inserting the data into the graph database may allow for improved fraud detection by allowing for fuzzy linking of the data, thereby improving the accuracy of the determined fraud proximity scores.
- At step 616, sensitive data may be encrypted. It may be desirable to encrypt particular classes of data, such as those associated with personally identifiable information standards, to improve the security of graph databases. In several embodiments, sensitive data may be encrypted by salting the data with a random string and hashing the salted data. Any of a variety of hashing algorithms, such as MD5, MD6, SHA-2, and SHA-3, may be used. The use of the salt ensures that the hash cannot be reversed without knowing the salt, which is not stored. Salting is particularly important for linking attributes that exist in a finite domain space (such as phone numbers that consist of exactly 10 numeric digits) as an unsalted hash may easily be reversed through brute force. The sensitive data may be decrypted by applying the hashing function to the encrypted data when the salt is known. In a variety of embodiments, the sensitive data may be re-encrypted on a schedule using a different salt value, such as a nightly encryption of sensitive data using a newly generated salt value. In this way, even if the salt value is discovered, the encrypted data is only vulnerable for a short time. The hashing of the data allows for fuzzy matching of normalized attributes, as each matching attribute will hash to the same value. This allows for the matching of data even when the underlying data values are not known.
- One or more aspects discussed herein may be embodied in computer-usable or readable data and/or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices as described herein. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device.
The modules may be written in a source code programming language that is subsequently compiled for execution, or may be written in a scripting language such as (but not limited to) HTML or XML. The computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid-state memory, RAM, and the like. As will be appreciated by one of skill in the art, the functionality of the program modules may be combined or distributed as desired in various embodiments. In addition, the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like. Particular data structures may be used to more effectively implement one or more aspects discussed herein, and such data structures are contemplated within the scope of computer executable instructions and computer-usable data described herein. Various aspects discussed herein may be embodied as a method, a computing device, a system, and/or a computer program product.
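The preprocessing steps of FIG. 6 (dynamic whitelisting, normalization, and salted hashing) can be illustrated with a short Python sketch. All function names, the digits-only phone rule, the minimum of three sharing accounts, and the zero fraud-rate cut-off are assumptions made for illustration; SHA-256 is used here as one member of the SHA-2 family.

```python
import hashlib
import re
import secrets


def normalize_phone(raw: str) -> str:
    """Strip everything except digits so differently formatted numbers match."""
    return re.sub(r"\D", "", raw)


def salted_hash(value: str, salt: bytes) -> str:
    """Prepend the salt and hash the result with SHA-256."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def dynamic_whitelist(links, fraudulent_accounts, min_accounts=3, max_fraud_rate=0.0):
    """Return attribute values shared by many accounts with a low fraud rate.

    links: iterable of (account_id, attribute_value) pairs.
    fraudulent_accounts: set of account ids flagged as fraudulent.
    """
    accounts_by_value = {}
    for account_id, value in links:
        accounts_by_value.setdefault(value, set()).add(account_id)
    whitelist = set()
    for value, accounts in accounts_by_value.items():
        fraud_rate = len(accounts & fraudulent_accounts) / len(accounts)
        if len(accounts) >= min_accounts and fraud_rate <= max_fraud_rate:
            whitelist.add(value)
    return whitelist


# The same phone number in two formats hashes to the same value under one salt.
salt = secrets.token_bytes(16)
hash_a = salted_hash(normalize_phone("(555) 010-0199"), salt)
hash_b = salted_hash(normalize_phone("555.010.0199"), salt)

links = [
    ("a1", "noreply@example.com"),
    ("a2", "noreply@example.com"),
    ("a3", "noreply@example.com"),
    ("a4", "x@example.com"),
    ("a5", "x@example.com"),
]
whitelisted = dynamic_whitelist(links, fraudulent_accounts={"a5"})
```

Normalizing before hashing is what lets matching attributes produce the same digest; rotating the salt, as the text describes, would require re-hashing every stored value with the new salt.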
- Although the present invention has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. In particular, any of the various processes described above may be performed in alternative sequences and/or in parallel (on different computing devices) in order to achieve similar results in a manner that is more appropriate to the requirements of a specific application. It is therefore to be understood that the present invention may be practiced otherwise than specifically described without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/944,932 US11316874B2 (en) | 2020-01-10 | 2020-07-31 | Fraud detection using graph databases |
US17/726,827 US11843617B2 (en) | 2020-01-10 | 2022-04-22 | Fraud detection using graph databases |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/739,519 US10778706B1 (en) | 2020-01-10 | 2020-01-10 | Fraud detection using graph databases |
US16/944,932 US11316874B2 (en) | 2020-01-10 | 2020-07-31 | Fraud detection using graph databases |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/739,519 Continuation US10778706B1 (en) | 2020-01-10 | 2020-01-10 | Fraud detection using graph databases |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/726,827 Continuation US11843617B2 (en) | 2020-01-10 | 2022-04-22 | Fraud detection using graph databases |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210218760A1 | 2021-07-15 |
US11316874B2 | 2022-04-26 |
Family
ID=72425660
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/739,519 Expired - Fee Related US10778706B1 (en) | 2020-01-10 | 2020-01-10 | Fraud detection using graph databases |
US16/944,932 Active 2040-01-25 US11316874B2 (en) | 2020-01-10 | 2020-07-31 | Fraud detection using graph databases |
US17/726,827 Active US11843617B2 (en) | 2020-01-10 | 2022-04-22 | Fraud detection using graph databases |
Country Status (1)
Country | Link |
---|---|
US (3) | US10778706B1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023093638A1 (en) * | 2021-11-24 | 2023-06-01 | 百果园技术(新加坡)有限公司 | Abnormal data identification method and apparatus, and device and storage medium |
US11704680B2 (en) * | 2020-08-13 | 2023-07-18 | Oracle International Corporation | Detecting fraudulent user accounts using graphs |
WO2024015423A1 (en) * | 2022-07-12 | 2024-01-18 | Akamai Technologies, Inc. | Real-time detection of online new-account creation fraud using graph-based neural network modeling |
US20240143658A1 (en) * | 2022-11-01 | 2024-05-02 | Alipay (Hangzhou) Information Technology Co., Ltd. | Methods and apparatuses for inserting data into graph database |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11580560B2 (en) * | 2019-07-19 | 2023-02-14 | Intuit Inc. | Identity resolution for fraud ring detection |
US20220198471A1 (en) * | 2020-12-18 | 2022-06-23 | Feedzai - Consultadoria E Inovação Tecnológica, S.A. | Graph traversal for measurement of fraudulent nodes |
CN113869904B (en) * | 2021-08-16 | 2022-09-20 | 工银科技有限公司 | Suspicious data identification method, device, electronic equipment, medium and computer program |
US20230107703A1 (en) * | 2021-10-06 | 2023-04-06 | The Toronto-Dominion Bank | Systems and methods for automated fraud detection |
CN114925217B (en) * | 2022-05-24 | 2023-05-02 | 中国电子科技集团公司第十研究所 | High-value path discovery method based on relation attribute weighting |
US20240144275A1 (en) * | 2022-10-28 | 2024-05-02 | Hint, Inc. | Real-time fraud detection using machine learning |
CN116258420B (en) * | 2023-05-11 | 2023-08-01 | 中南大学 | Product quality detection method, device, terminal equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
US11843617B2 (en) | 2023-12-12 |
US10778706B1 (en) | 2020-09-15 |
US11316874B2 (en) | 2022-04-26 |
US20220247765A1 (en) | 2022-08-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| AS | Assignment | Owner name: CAPITAL ONE SERVICES, LLC, VIRGINIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, JONATHAN SHEK WING;HARIHARA, VIDHYASAGAR MAHADEVAN;INDYARTA, MICHELLE;AND OTHERS;SIGNING DATES FROM 20200106 TO 20200109;REEL/FRAME:053380/0885 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: PRE-INTERVIEW COMMUNICATION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |