WO2023239759A1 - Probabilistic entity resolution using micro-graphs - Google Patents

Probabilistic entity resolution using micro-graphs

Info

Publication number
WO2023239759A1
Authority
WO
WIPO (PCT)
Prior art keywords
name
address
cross
communities
machine learning
Prior art date
Application number
PCT/US2023/024654
Other languages
English (en)
Inventor
Greg Rothman
Original Assignee
Kinesso, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kinesso, LLC
Publication of WO2023239759A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • Entity resolution systems are used to determine the identity of objects within a universe of objects, and to correctly associate data concerning each object with its corresponding object. Entity resolution systems are used, for example, to track information about a universe of consumers within a particular geographic region.
  • An identity graph is a data structure used within an entity resolution system to organize data concerning the objects, wherein nodes are datapoints and the edges between the nodes indicate relationships between datapoints pertaining to objects. The edges may be used to connect datapoints that pertain to the same object.
  • Consumer identity graphs may be extremely large, including thousands of datapoints pertinent to each of hundreds of millions of consumers.
  • The information to build an identity graph may be contained in records that come from many different source types. Sources may contain common identifiers to distinguish different entities, or may contain identifiers unique to their own use cases. Consumer entity resolution systems must bring together all “touchpoints” (i.e., points of contact for a consumer), demographic markers, and personal identifiers into a single profile. Unifying data records into single profiles enables messaging to consumers across multiple channels, such as email and telephone. Having a “single view” of each consumer across touchpoints reduces duplication, increases messaging effectiveness, and reduces costs. Benefits to the consumers include relevant messaging and a reduction in the time and effort required to manage duplicative marketing. Privacy may also be more easily protected.
  • The present invention is directed to a probabilistic entity resolution system that utilizes machine learning.
  • The system utilizes cohabitation pairs for generating labeled inputs.
  • Traditional methods use graph algorithms that construct one large graph of nodes and edges and utilize transitivity to connect records.
  • The techniques of the present invention go against that prior teaching by creating many smaller micro-graphs and using optimization methods under constraints to link records together.
  • A number of unique features characterize the present invention.
  • Unique combinations of a given name key create row combination inputs into the machine learning system.
  • The numerically connected row combinations form a machine learning prospecting window.
  • Machine learning features are then created from summary statistics for the name key aggregations.
  • Specific location identity features are created for each name key.
  • The name key is generated by stacking multiple phonetic encoders. This technique allows for an intelligently constructed oversized prospecting window for the network analysis subsystem later in the process. Utilizing the network analysis subsystem in an iterative fashion, the machine learning prospecting window is scanned. At each iterative step, the network analysis subsystem constrains its output under strict personal identification constraints. This processing allows for the creation of millions of micro-graphs using the name key from the machine learning layers previously described. All of the micro-graphs may then be run in parallel for efficient processing in a production environment, greatly increasing the computational efficiency of the identity resolution solution.
  • FIG. 1 is an overall flow diagram illustrating a method according to an embodiment of the present invention.
  • FIG. 2 is a flow diagram illustrating stage one of a method according to an embodiment of the present invention.
  • FIG. 3 is a flow diagram illustrating stage two of a method according to an embodiment of the present invention.
  • FIG. 4 is a flow diagram illustrating stage three of a method according to an embodiment of the present invention.
  • FIG. 5 is a flow diagram illustrating stage four of a method according to an embodiment of the present invention.
  • The problem space addressed by certain embodiments of the invention is characterized by an often vast quantity of unconnected records from various, unrelated sources.
  • There are many potential fields in these records, including name and address components, phone and email touchpoints, and demographic markers such as age.
  • Each record will include a subset of all possible fields, but the records will not necessarily all contain the same fields.
  • The task then is bringing the records that belong to a single entity together in some manner so that the single view of the customer may be realized. Merging records that pertain to two separate consumers must be avoided, since this will cause confusion when the resulting graphs are used for messaging.
  • Conversely, failing to merge data that pertains to the same consumer, and instead treating such data as pertaining to multiple different consumers, will result in duplication of messaging, wasting resources and reducing messaging effectiveness.
  • The number of records N may be arbitrarily large, and in a real-world application may be in the billions.
  • A solution is desired that considers all potential partitions of the records, rather than a subset of potential partitions, since any limitation in this regard may reduce the quality of the result. This solution therefore should include the ability to find an optimal partition among all possible partitions.
  • Deterministic approaches (unlike probabilistic approaches) are defined by rigid rules. For example, record A is linked to record B if a rule indicates they should be connected. There is no way to formally quantify the relationship between A and B. If B were also linked to C, but A were not linked to C, then a decision based on another rule would need to be made regarding the linkage of A, B, and C together.
  • Under the deterministic approach, the prospecting window for record connections is fixed. If a contradiction within a set of records arises, a new rule must be written to resolve the contradiction, or the contradiction must be ignored. The view is limited and cannot see beyond the rules that have been written. The consequence is that the prospecting window for potential connections is inherently limited.
  • The probabilistic approach, by contrast, opens the prospecting window to an infinite size. The probabilistic approach allows all records to be considered for a connection because they will have a numeric relationship based on their processing in the trained machine learning model. As will be seen, determining which subsets of records should be connected is the task of the network analysis subsystem.
  • The network analysis subsystem can consume the numeric linkages to form an optimal partition of the records by optimizing a mathematical function, and thus achieve a rigorous solution.
  • A process for probabilistic entity resolution may now be described with reference to FIG. 1.
  • The process begins with the ingestion of potentially billions of name and address records 10.
  • The records may include information pertaining to the social security number, date of birth, first sale or service date, email, phone, record history, and other data for individual consumers.
  • The records may come from a diverse array of sources, may be received in various forms, may be received through different channels, and may be received at different times.
  • Record fields may be malformed, incomplete, contain errors, or may be missing entirely.
  • The challenge then is to develop a system under these constraints that can accurately bring together full, partial, and incomplete records belonging to the same entity.
  • The present invention achieves this objective.
  • a method according to an embodiment of the present invention may be described as consisting of four general stages. Each stage will be described in more detail below, but will be described here in a general fashion. It should be understood, however, that the division into stages is arbitrary, and other divisions into stages are within the scope of the present invention.
  • Stage one uses semi-supervised machine learning (specifically, cross-address processing as further explained below) to evaluate records across different addresses for each unique name. This produces node/edge pairs of numerical connections.
  • Stage two uses unsupervised learning to connect records across each unique name using the network analysis subsystem. This produces communities of partial entities from the node/edge pairs of stage one.
  • Stage three applies supervised learning (specifically, cross-name processing as further explained below) to evaluate partial entities across different names. This again produces node/edge pairs of numerical connections.
  • Stage four again applies unsupervised learning to connect partial entities across name variations using the network analysis subsystem to form the communities comprised of the final entities. The result is a number of micro-graphs, each pertaining to a unique person within the universe of consumers.
  • The search space to find potential record connections begins by transforming the first and last name found in a record into a key at step 20.
  • This key is referred to herein as the name key.
  • The name key can take many forms. For example, it could be a concatenation of the sorted first and last name. It could be a phonetic encoding of the concatenated sorted first and last name. It could be a phonetic encoding of the sorted concatenation of the first and last name after substituting nicknames and abbreviations for the first name. For purposes of this description, the simplest case will be used, where the name key represents the concatenation of the unsorted first and last name, although the invention is not so limited in alternative embodiments.
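As a sketch of the simplest case described above, the name key may be formed by normalizing and concatenating the unsorted first and last names. The function name and the uppercase-letters-only normalization here are illustrative assumptions, not taken from the specification:

```python
import re

def name_key(first: str, last: str) -> str:
    """Simplest-case name key: concatenation of the unsorted first
    and last name, normalized to uppercase letters only (the
    normalization choice is an illustrative assumption)."""
    def clean(s: str) -> str:
        return re.sub(r"[^A-Z]", "", s.upper())
    return clean(first) + clean(last)
```

With this choice, `name_key("John", "Smith")` and `name_key("john", "smith")` yield the same key, so differently cased records share a search universe.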
  • Each unique name key is treated as its own universe and a dataset is constructed that computes the probability that two distinct addresses from the input sources belong to the same instance of the unique name key. This is referred to here as a cross-address search.
  • Each unique name/address combination will yield a row with its raw personal information, at step 22.
  • An example with limited data is shown in the table below:
  • T is the number of unique names in the raw data
  • k1 is the number of unique addresses associated with the first unique name
  • kT is the number of unique addresses associated with the T-th unique name
  • Rows are then created to form a table, wherein each row contains a different address pair combination, at step 24.
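The row construction of step 24 can be sketched as follows, assuming the input is simply a list of (name key, address) pairs; the names and data shapes here are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def cross_address_rows(records):
    """Build one row per unordered address-pair combination within
    each unique name key (a sketch of step 24).
    records: iterable of (name_key, address) tuples."""
    addresses = defaultdict(set)
    for name, addr in records:
        addresses[name].add(addr)
    rows = []
    for name, addrs in addresses.items():
        # every unordered pair of distinct addresses for this name key
        for a1, a2 in combinations(sorted(addrs), 2):
            rows.append((name, a1, a2))
    return rows
```

A name key seen at k unique addresses yields k(k-1)/2 rows, which is what makes the prospecting window per name key tractable.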
  • The personal information generated in this way must be transformed into features consumable by a machine learning model.
  • Personal information associated with each name/address record often includes malformed and missing data.
  • A feature space is therefore developed that can adapt to unpredictable and chaotic patterns from the input sources.
  • Such personal information includes social security number and the like as listed above.
  • Each row in the cross-address dataset is given a variable-length array associated with one of these categories for each address in the row.
  • Each component of every tuple is a variable-length array of data, at step 28.
  • For each tuple, one may then compute summary statistics comparing all the elements in component1 to all the elements in component2, at step 30. For example, given a row, one may compute the maximum Levenshtein distance by comparing all the elements in component1 to all the elements in component2. One may also compute the minimum Levenshtein distance in the same fashion. One may compute the maximum or minimum length by comparing the length of component1 to the length of component2. One may compute a Boolean (i.e., a flag) by creating an empty value if component1 is empty or component2 is empty. In addition, one may compute a minimum or maximum distance in days for dob and fsd (since these fields are both date fields) by comparing all of the elements in component1 to all of the elements in component2.
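The per-tuple summary statistics of step 30 can be sketched as below. The Levenshtein implementation is a standard dynamic-programming edit distance, and the feature names are illustrative assumptions; the date-distance features are omitted for brevity:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def component_features(c1, c2):
    """Summary statistics comparing every element of component1
    to every element of component2 (a sketch of step 30)."""
    dists = [levenshtein(x, y) for x in c1 for y in c2]
    return {
        "max_lev": max(dists) if dists else None,
        "min_lev": min(dists) if dists else None,
        "max_len": max(len(c1), len(c2)),
        "min_len": min(len(c1), len(c2)),
        "empty_flag": not c1 or not c2,
    }
```

The `None` values for empty components give the model an explicit missing-data signal, in the spirit of the empty-value flag described above.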
  • The location identity feature summary set forth herein will only detail the social security number (ssn) component, but the same procedure is used to extract features for the date of birth (dob) component. For each ZIP code, city, and 50-mile radius around the centroid of the ZIP code, one may count the number of unique ssn/name combinations. (Other identifiers may be used in alternative embodiments.) For each phone number, its area code is extracted and transformed to its 50-meter home geography via latitude/longitude transformation. Boolean features characterizing the two addresses and their relationship to the ssn/name counts are formed.
  • The cross-address feature space then, in this example, is built from phone, email, dob, fsd, ssn, area code, ZIP_ssn, City_ssn, Lat1_Lon1_ssn, ZIP_dob, City_dob, and Lat1_Lon1_dob values for each address in the row.
  • The resulting feature space is composed of W features.
  • A cross-address binary classifier is trained to output the probability that the two addresses on the same row belong to the same instance of the name in their row.
  • The cross-address model is semi-supervised, so it trains on partially labeled datasets. The system derives a subset of the data universe that is positively labeled. Labeling is done by finding unique name pairs on a filtered subset of the data universe under strict conditions.
  • The cross-address process begins by identifying records whose addresses identify as business properties or communal housing, such as apartments, dormitories, or condominiums without a unit designation. These rows are removed from consideration, at step 34. The remaining addresses are further reduced by filtering out those that have more than a particular number of unique names, such as fifty unique names, associated with them. For each remaining address, all of the two-pair combinations of the unique names at that address are constructed, at step 36. For example, suppose that at address X there are three unique names, A, B, and C. The name pair combinations at address X are then (A,B), (B,C), and (A,C).
  • The next step is to compute the number of unique addresses at which each unique name pair was seen together. For example, if name pair (A,B) was seen together at addresses X1, X2, X3, and X4, then the pair shares four unique addresses. All name pairs that share three or more addresses together are retained and will be referred to herein as cohabitations or cohabs, at step 38. All other name pairs are removed.
  • The cohabs are labels for the cross-address search model. The total number of labeled points is equal to the number of unique cohabs multiplied by twice the number of shared addresses.
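The cohab derivation of steps 34 through 38 can be sketched as follows. The input mapping and parameter names are illustrative, and the removal of business and communal-housing addresses is assumed to have already happened upstream:

```python
from collections import defaultdict
from itertools import combinations

def find_cohabs(addr_to_names, max_names=50, min_shared=3):
    """Derive cohabitation (cohab) label pairs, a sketch of steps 34-38.
    addr_to_names: mapping from an address to the set of unique
    names seen at that address."""
    pair_addrs = defaultdict(set)
    for addr, names in addr_to_names.items():
        if len(names) > max_names:   # too many unique names: drop address
            continue
        # all two-pair combinations of names at this address (step 36)
        for a, b in combinations(sorted(names), 2):
            pair_addrs[(a, b)].add(addr)
    # retain pairs sharing at least min_shared addresses (step 38)
    return {pair: addrs for pair, addrs in pair_addrs.items()
            if len(addrs) >= min_shared}
```

The retained pairs then serve as the positive labels for the semi-supervised cross-address model.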
  • Community detection analysis solutions are a family of network analysis solutions that serve to form communities.
  • The community detection subsystem of certain embodiments of the present invention ingests the nodes and weighted edges of a graph, and extracts communities by optimizing an objective function with respect to a resolution parameter.
  • The resolution parameter lies in the interval of 0 to 1 and controls the quantity and density of the formed communities.
  • Single-entity communities will display consistent identity data among date of birth, social security number, first name, and other data. Multiple-entity communities will contain inconsistent identity data.
  • The goal of community detection is to identify the largest single-entity communities possible. Some communities are very hard to surface, and may contain nodes and edges that do not make up single-entity communities at lower resolution levels. In order to break the multi-entity communities into single-entity communities, the system re-enters their nodes and edges into the community detection subsystem at a higher resolution level. Over a series of steps, with increasing resolution values, the community detection solution will separate single-entity communities from multiple-entity communities.
  • At each step, the single-entity communities will be fixed and their nodes and edges set aside.
  • The multi-entity communities (and their nodes and edges) will be re-entered into the successive step. This process will begin with a resolution value slightly above 0 and end at a resolution value slightly below 1.
  • At the end of the process, the outputted communities will consist solely of single-entity communities. For example, consider the following edge and weight set with a resolution parameter of x:
  • EW1—EW2 {name1, name2, .9}
  • EW1—EW3 {name1, name3, .2}
  • EW1—EW4 {name1, name4, .2}
  • EW1—EW5 {name1, name5, .15}
  • EW2—EW3 {name2, name3, .3}
  • EW2—EW4 {name2, name4, .1}
  • EW2—EW5 {name2, name5, .2}
  • EW3—EW4 {name3, name4, .5}
  • The Leiden algorithm, an algorithm for detecting communities in large networks, is applied as shown at step 42 of FIG. 3, producing communities {EW1,EW2} and {EW3,EW4,EW5}:
  • {EW1,EW2} is a single-entity community because its social security number and date of birth data match.
  • {EW3,EW4,EW5} is a multiple-entity community because its social security number data does not match for all members of the community.
  • {EW1,EW2} is set aside because it is a single-entity community, while EW3, EW4, and EW5 are re-entered by incrementing the resolution to a level of x+.1 at step 48, and again applying the Leiden algorithm at step 42.
  • At the higher resolution, the community detection solution creates two communities, {EW3} and {EW4,EW5}.
  • {EW3} is a single-entity community because its social security number data and date of birth data match.
  • {EW4,EW5} is a single-entity community because its social security number data and dob data also match.
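The fix-and-re-enter control flow described above can be sketched as follows. Note a loud simplification: the specification uses the Leiden algorithm with a resolution parameter (available, for example, through igraph-based libraries), whereas this stand-in detector merely takes connected components over edges whose weight meets a threshold. Only the iterative loop structure mirrors the specification:

```python
def is_single_entity(nodes, identity):
    """A community is single-entity when its identity data (e.g.,
    ssn/dob, supplied per node in `identity`) is consistent."""
    return len({identity[n] for n in nodes}) == 1

def components(nodes, edges, threshold):
    """Stand-in community detector: union-find connected components
    over edges whose weight meets the threshold. (The patented
    system uses Leiden with a resolution parameter instead.)"""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n
    for (a, b), w in edges.items():
        if a in parent and b in parent and w >= threshold:
            parent[find(a)] = find(b)
    comms = {}
    for n in nodes:
        comms.setdefault(find(n), set()).add(n)
    return list(comms.values())

def refine(nodes, edges, identity, resolution=0.1, step=0.1):
    """Iteratively raise the resolution; fix single-entity communities
    and re-enter multi-entity ones (the loop of steps 42-48)."""
    final, pending = [], set(nodes)
    while pending and resolution < 1.0:
        for comm in components(pending, edges, resolution):
            if len(comm) == 1 or is_single_entity(comm, identity):
                final.append(comm)       # set aside single-entity community
                pending -= comm
        resolution += step               # re-enter the rest at higher resolution
    final.extend({n} for n in pending)   # anything left becomes isolated
    return final
```

Run on the worked example's edge set, the loop fixes {EW1,EW2} early and keeps splitting the remainder until every community is single-entity.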
  • Each output row (Name, Address1, Address2, Probability) of the cross-address model belongs to a unique name.
  • Each unique name becomes its own micro-graph.
  • Each of these micro-graphs is a network wherein the addresses become nodes, and the weights connecting the nodes are the probabilities produced by the machine learning model.
  • The probabilities can be considered edges in a graph, in which the addresses are nodes and the name (in this case Name1) is the name of the network or micro-graph.
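Grouping the cross-address output rows into per-name micro-graphs can be sketched as below; the function and field names are illustrative:

```python
from collections import defaultdict

def build_micro_graphs(rows):
    """rows: (name, address1, address2, probability) tuples from the
    cross-address model. Each unique name becomes a micro-graph whose
    nodes are addresses and whose edge weights are the probabilities."""
    graphs = defaultdict(dict)
    for name, a1, a2, p in rows:
        graphs[name][(a1, a2)] = p
    return dict(graphs)
```

Each per-name edge dictionary can then be handed to the community detection subsystem independently, which is what makes the micro-graphs amenable to parallel processing.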
  • Edges and nodes are fed into the community detection subsystem.
  • The Leiden algorithm at step 42 uses the nodes and edges to partition each name into communities, where each community will have a collection of addresses.
  • The partial entity John Smith_{Address1,Address2} will be comprised of the ssn, dob, fsd, email, and phone of two records, and will be represented by the aggregated features of its corresponding raw data.
  • Each partial entity will have one or multiple addresses associated with it. For a given name, if an address was not in any row that had a probability above the determined threshold, then it was not matched to another address, so it remains isolated. For example, suppose that John Thomas has three addresses, Address1, Address2, and Address3.
  • Address3 does not have any probability values above the threshold with another address, so John Thomas_{Address3} will remain a partial entity with a single address, while John Thomas_{Address1,Address2} will be a partial entity with multiple addresses. Iteration continues until the max resolution is detected at step 46.
  • The cross-name search model compares partial entities across different names to determine the probability that both partial entities belong together.
  • Each partial entity has one unique name.
  • The entry point for matching across partial entities is to define a search space based on the unique name of each partial entity. The search space will determine which partial entity combinations to consider for a match.
  • A full entity may have many names associated with it.
  • The full entity may use its legal name on a given source, a nickname or abbreviation on another source, have a misspelled name on another source, and a name change (maiden name, alias, etc.) on another source.
  • A model therefore needs to bring all of the possible name variations of an entity together.
  • To create the rows (i.e., the search space) for the cross-name search model, an identifier is created that combines phonetic word encoders and subsequence identification on each name variation on the partial entities from stage two.
  • A stepwise approach is used.
  • The first step applies the match rating approach (MRA), a phonetic algorithm for indexing words by their pronunciation.
  • The encoding rules are: (1) delete all vowels unless the vowel begins the word; (2) remove the second consonant of any double consonants present; and (3) reduce codes to six letters by joining the first three and last three letters only.
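The three encoding rules above can be implemented directly. This sketch follows only the rules as stated; the full match rating approach also includes comparison rules that are not shown here:

```python
def mra_encode(name: str) -> str:
    """Match rating encoding per the three stated rules:
    (1) drop vowels unless the vowel begins the word,
    (2) drop the second of any doubled letters,
    (3) if longer than six letters, join the first three and last three."""
    s = "".join(ch for ch in name.upper() if ch.isalpha())
    # rule 1: keep a leading vowel, drop all other vowels
    s = s[:1] + "".join(ch for ch in s[1:] if ch not in "AEIOU")
    # rule 2: collapse doubled letters (only consonants remain after rule 1)
    deduped = []
    for ch in s:
        if not deduped or deduped[-1] != ch:
            deduped.append(ch)
    s = "".join(deduped)
    # rule 3: reduce to six letters by joining first three and last three
    return s if len(s) <= 6 else s[:3] + s[-3:]
```

For example, "Phillips" encodes to "PHLPS" and "Washington" to "WSHGTN".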
  • A metaphone encoding is also applied. Example rules from the original metaphone approach include: 'T' transforms to 'X' if followed by 'IA' or 'IO'; 'TH' transforms to '0'; and 'T' is dropped if followed by 'CH'.
  • Double metaphone improves on this original metaphone approach based on irregularities in English and some other languages, but has a much more complex ruleset.
  • The match rating encoded identifier is then truncated to its first three characters, resulting in the following:
  • The system then finds the most frequent three-letter sub-sequence (mfs) associated with it at step 56, such that when there are multiple three-letter sub-sequences, the most common among all phonetic encodings is chosen. This may be illustrated with another example:
  • Each partial entity can belong to one and only one mfs-id.
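One plausible reading of the mfs step, in which "sub-sequence" is taken to mean a contiguous three-letter window over each phonetic encoding (an assumption, since the specification's own worked example did not survive extraction):

```python
from collections import Counter

def most_frequent_subsequence(encodings, k=3):
    """Among a group of phonetic encodings, pick the most common
    k-letter contiguous window as the mfs-id (a sketch of step 56)."""
    counts = Counter()
    for enc in encodings:
        for i in range(len(enc) - k + 1):
            counts[enc[i:i + k]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

For the encodings `["SMTH", "SMT", "XSMT"]`, the window "SMT" occurs in all three, so it becomes the mfs-id; every partial entity in the group then maps to exactly one mfs-id.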
  • Feature generation in the cross-name search model follows the same pattern as feature generation in the cross-address search model, except that the system removes the social security number feature because it is a label, and replaces it with the address features.
  • The dataset has the form as shown below:
  • For each row in the cross-name model there are two partial entities, and each partial entity has a variable-length address set associated with it from stage two.
  • For the two address sets Address1 and Address2, summary statistics are computed between them, such as the minimum and maximum values of the address lengths.
  • Boolean features are constructed by comparing the phonetic encodings for name1 and name2 on each row in the cross-name dataset. The resulting cross-name feature space will have Y features.
  • The cross-name classification model trains on the labeled subset to determine the probability that the two partial entities that comprise the same row belong to the same full entity, at step 60.
  • The returned dataset has the form: (mfs-id, Partial-entity1, Partial-entity2, probability).
  • Edges and nodes are fed into the community detection subsystem, which partitions each mfs-id into communities at step 64, where each community will have a collection of partial entities.
  • The probabilities are the edges in this community, and the partial entities are the nodes for each mfs-id network.
  • The resulting micro-graphs may now be used in a production environment, wherein the entity resolution process has already been performed to provide a single view of each object within the overall graph system, such as in the examples described herein where the objects are consumers.
  • The result may be millions of individual micro-graphs, rather than a single overall graph, used for various production purposes.
  • Processing using these micro-graphs may be performed in parallel, potentially increasing computational efficiency and speed by orders of magnitude.
  • The ability to massively parallelize the processing can result in dramatic savings in cost and time.
  • The systems and methods described herein may in various embodiments be implemented by any combination of hardware and software.
  • The systems and methods may be implemented by a computer system or a collection of computer systems, each of which includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors.
  • the program instructions may implement the functionality described herein.
  • the various systems and displays as illustrated in the figures and described herein represent example implementations. The order of any method may be changed, and various elements may be added, modified, or omitted.
  • A computing system or computing device as described herein may implement a hardware portion of a cloud computing system or non-cloud computing system, as forming parts of the various implementations of the present invention.
  • The computer system may be any of various types of devices, including, but not limited to, a commodity server, personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, a consumer device, application server, storage device, telephone, mobile telephone, or in general any type of computing node, compute node, compute device, and/or computing device.
  • The computing system includes one or more processors (any of which may include multiple processing cores, which may be single or multi-threaded) coupled to a system memory via an input/output (I/O) interface.
  • The computer system further may include a network interface coupled to the I/O interface.
  • The computer system may be a single processor system including one processor, or a multiprocessor system including multiple processors.
  • The processors may be any suitable processors capable of executing computing instructions. For example, in various embodiments, they may be general-purpose or embedded processors implementing any of a variety of instruction set architectures. In multiprocessor systems, each of the processors may commonly, but not necessarily, implement the same instruction set.
  • The computer system also includes one or more network communication devices (e.g., a network interface) for communicating with other systems and/or components over a communications network, such as a local area network, wide area network, or the Internet.
  • A client application executing on the computing device may use a network interface to communicate with a server application executing on a single server or on a cluster of servers that implement one or more of the components of the systems described herein in a cloud computing or non-cloud computing environment as implemented in various subsystems.
  • A server application executing on a computer system may use a network interface to communicate with other instances of an application that may be implemented on other computer systems.
  • The computing device also includes one or more persistent storage devices and/or one or more I/O devices.
  • The persistent storage devices may correspond to disk drives, tape drives, solid state memory, other mass storage devices, or any other persistent storage devices.
  • The computer system (or a distributed application or operating system operating thereon) may store instructions and/or data in persistent storage devices, as desired, and may retrieve the stored instructions and/or data as needed.
  • The computer system may implement one or more nodes of a control plane or control system, and persistent storage may include the SSDs attached to that server node. Multiple computer systems may share the same persistent storage devices or may share a pool of persistent storage devices, with the devices in the pool representing the same or different storage technologies.
  • The computer system includes one or more system memories that may store code/instructions and data accessible by the processor(s).
  • The system’s memory capabilities may include multiple levels of memory and memory caches in a system designed to swap information in memories based on access speed, for example.
  • The interleaving and swapping may extend to persistent storage in a virtual memory implementation.
  • The technologies used to implement the memories may include, by way of example, static random-access memory (RAM), dynamic RAM, read-only memory (ROM), non-volatile memory, or flash-type memory.
  • Multiple computer systems may share the same system memories or may share a pool of system memories.
  • System memory or memories may contain program instructions that are executable by the processor(s) to implement the routines described herein.
  • Program instructions may be encoded in binary, Assembly language, any interpreted language such as Java, compiled languages such as C/C++, or in any combination thereof; the particular languages given here are only examples.
  • Program instructions may implement multiple separate clients, server nodes, and/or other components.
  • Program instructions may include instructions executable to implement an operating system (not shown), which may be any of various operating systems, such as UNIX, LINUX, Solaris™, MacOS™, or Microsoft Windows™. Any or all of the program instructions may be provided as a computer program product, or software, that may include a non-transitory computer-readable storage medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to various implementations.
  • A non-transitory computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer).
  • A non-transitory computer-accessible medium may include computer-readable storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM coupled to the computer system via the I/O interface.
  • A non-transitory computer-readable storage medium may also include any volatile or non-volatile media such as RAM or ROM that may be included in some embodiments of the computer system as system memory or another type of memory.
  • Program instructions may be communicated using an optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.) conveyed via a communication medium such as a network and/or a wired or wireless link, such as may be implemented via a network interface.
  • A network interface may be used to interface with other devices, which may include other computer systems or any type of external electronic device.
  • System memory, persistent storage, and/or remote storage accessible on other devices through a network may store data blocks, replicas of data blocks, metadata associated with data blocks and/or their state, database configuration information, and/or any other information usable in implementing the routines described herein.
  • The I/O interface may coordinate I/O traffic between processors, system memory, and any peripheral devices in the system, including through a network interface or other peripheral interfaces.
  • The I/O interface may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory) into a format suitable for use by another component (e.g., processors).
  • The I/O interface may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example.
  • some or all of the functionality of the I/O interface, such as an interface to system memory, may be incorporated directly into the processor(s).
  • a network interface may allow data to be exchanged between a computer system and other devices attached to a network, such as other computer systems (which may implement one or more storage system server nodes, primary nodes, read-only nodes, and/or clients of the database systems described herein), for example.
  • the I/O interface may allow communication between the computer system and various I/O devices and/or remote storage.
  • Input/output devices may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems.
  • the user interfaces described herein may be visible to a user using various types of display screens, which may include CRT displays, LCD displays, LED displays, and other display technologies.
  • the inputs may be received through the displays using touchscreen technologies, and in other implementations the inputs may be received through a keyboard, mouse, touchpad, or other input technologies, or any combination of these technologies.
  • similar input/output devices may be separate from the computer system and may interact with one or more nodes of a distributed system that includes the computer system through a wired or wireless connection, such as over a network interface.
  • the network interface may commonly support one or more wireless networking protocols (e.g., Wi-Fi/IEEE 802.11, or another wireless networking standard).
  • the network interface may support communication via any suitable wired or wireless general data networks, such as other types of Ethernet networks, for example.
  • the network interface may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
  • a read-write node and/or read-only nodes within the database tier of a database system may present database services and/or other types of data storage services that employ the distributed storage systems described herein to clients as network-based services.
  • a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network.
  • a web service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL).
  • Other systems may interact with the network-based service in a manner prescribed by the description of the network-based service’s interface.
  • the network-based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.
  • a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request.
  • a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP).
  • a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP).
  • network-based services may be implemented using Representational State Transfer (REST) techniques rather than message-based techniques.
  • a network-based service implemented according to a REST technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE.
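As a concrete illustration of the REST-style invocation described above, the sketch below builds HTTP requests whose method (PUT, GET, or DELETE) names the operation and whose URL names the resource, rather than wrapping the operation in a SOAP message envelope. The endpoint URL and resource identifiers are hypothetical examples, not taken from the application; this is a minimal sketch of the convention, not the claimed system.

```python
from urllib.request import Request

# Hypothetical REST endpoint; the URL and resource path are illustrative only.
BASE_URL = "http://example.com/api/identities"

def build_request(method, resource_id, body=None):
    """Build (but do not send) an HTTP request for a REST-style operation.

    Under REST conventions, the operation is conveyed by the HTTP method
    and the target resource by the URL.
    """
    return Request(f"{BASE_URL}/{resource_id}", data=body, method=method)

# Create, fetch, and delete a hypothetical identity resource.
create = build_request("PUT", "abc123", body=b'{"name": "J. Smith"}')
fetch = build_request("GET", "abc123")
remove = build_request("DELETE", "abc123")

print(create.get_method(), create.full_url)
print(fetch.get_method(), remove.get_method())
```

A message-based (SOAP) client would instead encode the operation and its parameters in an XML envelope POSTed to a single endpoint; the REST form above pushes that information into the method and URL.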

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is an entity resolution system that uses cohabitation pairs to generate labeled inputs. Combinations of a given name key create row-combination inputs in a machine learning system to form a prospecting window. Machine learning features are then created from summary statistics for the name-key aggregations. Location-specific identity features are created for each name key. The name key is created by stacking multiple phonetic encoders, and the machine learning prospecting window is scanned. At each iterative step, the network analysis solution constrains its output under strict personal identification constraints. This processing enables the creation of millions of identity micro-graphs using the name key from the machine learning layers, and all of the micro-graphs can then be run in parallel for efficient processing.
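The name key described in the abstract is created by stacking multiple phonetic encoders. The publication does not specify which encoders are stacked; the sketch below stands in a standard American Soundex plus a crude consonant-skeleton encoder as hypothetical layers, concatenated into a single key, purely to illustrate how stacking encoders makes spelling variants of a name collide on the same key.

```python
def soundex(name):
    """Standard American Soundex: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (result + "000")[:4]

def skeleton(name):
    """Crude second encoder: first letter plus deduplicated consonants."""
    out = name[0].upper()
    for ch in name[1:].lower():
        if ch.isalpha() and ch not in "aeiouy" and ch.upper() != out[-1]:
            out += ch.upper()
    return out

def name_key(name):
    """Stacked name key: the two phonetic encodings joined together."""
    return soundex(name) + "|" + skeleton(name)

print(name_key("Smith"))
print(name_key("Smyth"))
```

Here "Smith" and "Smyth" produce the same key on both layers, so the two spellings would aggregate under one name key; the actual system may use different or additional encoders.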
PCT/US2023/024654 2022-06-09 2023-06-07 Probabilistic entity resolution using micro-graphs WO2023239759A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263350667P 2022-06-09 2022-06-09
US63/350,667 2022-06-09

Publications (1)

Publication Number Publication Date
WO2023239759A1 true WO2023239759A1 (fr) 2023-12-14

Family

ID=89118889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/024654 WO2023239759A1 (fr) Probabilistic entity resolution using micro-graphs

Country Status (1)

Country Link
WO (1) WO2023239759A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080019496A1 (en) * 2004-10-04 2008-01-24 John Taschereau Method And System For Providing Directory Assistance
US20150356088A1 (en) * 2014-06-06 2015-12-10 Microsoft Corporation Tile-based geocoder
US20170091692A1 (en) * 2015-09-30 2017-03-30 LinkedIn Corporation Inferring attributes of organizations using member graph
US20200342006A1 (en) * 2019-04-29 2020-10-29 Adobe Inc. Higher-Order Graph Clustering
US20210357375A1 (en) * 2020-05-12 2021-11-18 Hubspot, Inc. Multi-service business platform system having entity resolution systems and methods

Similar Documents

Publication Publication Date Title
US11238240B2 (en) Semantic map generation from natural-language-text documents
  • CN111125460B (zh) Information recommendation method and apparatus
WO2019118388A1 (fr) Indexation rapide avec graphes et codes de régression compacts sur des réseaux sociaux en ligne
  • CN112241481A (zh) Cross-modal news event classification method and system based on graph neural networks
US11423249B2 (en) Computer architecture for identifying data clusters using unsupervised machine learning in a correlithm object processing system
  • CN112784009B (zh) Topic word mining method and apparatus, electronic device, and storage medium
  • CN111737315B (zh) Address fuzzy matching method and apparatus
  • CN115293919A (zh) Graph neural network prediction method and system for out-of-distribution generalization in social networks
  • CN106844553A (zh) Sample-data-based data probing and expansion method and apparatus
  • CN115730087A (zh) Knowledge-graph-based contradiction and dispute analysis and early-warning method and application thereof
  • CN113535977A (zh) Knowledge graph fusion method, apparatus, and device
Cao et al. A new method to construct the KD tree based on presorted results
Costa et al. A blocking scheme for entity resolution in the semantic web
US11354533B2 (en) Computer architecture for identifying data clusters using correlithm objects and machine learning in a correlithm object processing system
Alyas et al. Query optimization framework for graph database in cloud dew environment
  • CN116757280A (zh) Knowledge graph multi-relation link prediction method based on graph transformation networks
  • CN116723090A (zh) Alarm root cause localization method and apparatus, electronic device, and readable storage medium
Li et al. Consistency preserving database watermarking algorithm for decision trees
  • WO2023239759A1 (fr) Probabilistic entity resolution using micro-graphs
  • JP6261669B2 (ja) Query correction system and method
US11455568B2 (en) Computer architecture for identifying centroids using machine learning in a correlithm object processing system
Vanamala et al. Rare association rule mining for data stream
  • JP2022188894A (ja) Correlation rule generation program, apparatus, and method
  • CN109299260B (zh) Data classification method and apparatus, and computer-readable storage medium
  • JP7221590B2 (ja) Extraction apparatus, extraction method, and extraction program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23820385

Country of ref document: EP

Kind code of ref document: A1