CN110674231A - Data lake-oriented user ID integration method and system - Google Patents
- Publication number: CN110674231A (application CN201910952703.7A)
- Authority: CN (China)
- Prior art keywords: user, data, event, attributes, integration
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/285: Clustering or classification (under G06F16/284, Relational databases)
- G06F16/24564: Applying rules; Deductive queries (under G06F16/2455, Query execution)
- G06F16/24578: Query processing with adaptation to user needs using ranking
- G06F16/2462: Approximate or statistical queries (under G06F16/2458, Special types of queries)
Abstract
The invention discloses a data lake-oriented user ID integration method and system, belonging to the technical field of the Internet. The method and system are suitable for building various big data platforms based on data lakes. Specifically, the embodiments of the invention adopt a general-purpose data structure, calculation process, and integration interface, thereby greatly reducing the labor cost and error risk of manually customizing user ID integration programs, data structures, and calculation programs for specific problems. In addition, the embodiments read records containing user ID association information from multiple data sources through a message system, automatically and dynamically associate the different identifiers (IDs) belonging to the same user according to preset rules, assign a unique global identifier, and feed changes in the associations among IDs back to the application systems through the message system in real time. The application systems rely on this information to associate scattered user data, thereby integrating user data on the data lake.
Description
Technical Field
The invention relates to the technical field of internet, in particular to a user ID integration method and system facing a data lake.
Background
The vast amount of visitor, user, or customer (hereinafter collectively referred to as "user") data has overwhelmed the data warehouse technology of the relational database era. To address this problem, the Data Lake approach adopts a new method of big data management and application: it advocates storing data first and analyzing (applying) it afterwards, which reduces data integration and maintenance costs. Distributed storage and computing clusters based on Apache Hadoop are currently the most popular data lake operating environment. Unfortunately, development efficiency remains poor. Either each data application project starts from scratch, which is inefficient, or some critical tasks must be automated to accelerate the various kinds of later data analysis and application.
"User identification integration", also called "uniquely identifying users" or "uniformly identifying users", is one of these key tasks. Its main content is as follows: different identifiers (hereinafter referred to as "user IDs") that belong to the same user but come from multiple data sources need to be dynamically associated. Subsequent jobs can then re-integrate data belonging to the same user based on this unified association. A user ID may identify a device, such as a virtual ID (guest ID, User-ID, account ID, etc.) set in a cookie by the web browser, an IDFA (Identifier for Advertisers), an Android ID of the Android system, a Media Access Control (MAC) address of the device, or an International Mobile Equipment Identity (IMEI); a user ID may also identify an offline individual or group, such as an identity card number, passport number, mobile phone number, or landline number; a user ID may also identify a system account, such as a WeChat, QQ, website, or App account, or an internal user account of the various management systems within an enterprise.
The conventional user ID integration method is: a set of application-specific "user ID integration" programs, data structures, and rules are manually customized for each enterprise, and the relationships between these IDs are then calculated and maintained on the data lake. The development and test procedures are time-consuming, labor-consuming and prone to errors. When a new user ID type is added, the program and rules must be modified and then retested and deployed. Such basic work urgently requires a user ID integration method or system having a general-purpose data structure, calculation process and integration interface, which is adaptable to data lake-based calculation.
Associated with this problem are "cross-media integration" and "cross-screen recognition" techniques within the field of digital marketing. They address the problem of how to identify individuals across devices and media, improving the accuracy of marketing positioning. Generally, the method comprises two steps: firstly, searching and identifying the association existing between user IDs by adopting various methods (predefined rules or machine learning technology); the users are then identified collectively based on these associations. In view of the incompleteness of user data in the field of digital marketing, they focus on the first step of finding and identifying possible ID associations in a specific type of data, ignoring practical technical problems and challenges encountered by subsequent uniform identification, specific problems including:
1. Because enterprise data differs and ID association constraints differ as well, how can enterprise-defined constraints and rules be embedded in a unified identification algorithm?
2. The associations between IDs can change often, especially relationships predicted by machine learning models. How can these changes be tracked dynamically and the unified identification adjusted accordingly?
3. How can "unified user identification" provide flexible docking and integration modes for big data applications in the data lake environment, such as a Data Warehouse, a Customer Data Platform (CDP), or a Data Management Platform (DMP), so that each application can integrate user data according to its own characteristics?
Disclosure of Invention
The invention aims to provide a data lake-oriented user ID integration method and system to solve the problems in the background technology.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a data lake-oriented user ID integration method comprises the following steps:
collecting raw user data from one or more data sources in a data lake and constructing an ID event;
inputting and storing the ID event in a first distributed message queue supporting persistent storage;
reading the ID event from the first distributed message queue, generating an ID connection record, and updating an ID graph;
acquiring valid ID connection records from the ID graph according to the ID connection attributes;
acquiring the connected components in the ID graph over the valid ID connection records, forming a user ID group from each connected component, and updating an ID mapping table;
and storing the mapping records between user IDs and user ID groups in the ID mapping table in a second distributed message queue supporting persistent storage.
In an embodiment of the present invention, the ID event includes a data source identifier, an occurrence time, an event type, and a single user ID.
In another preferred embodiment of the present invention, the ID event includes a data source identifier, an occurrence time, an event type, and two associated user IDs.
In another preferred embodiment of the present invention, the ID connection attribute includes multiple sets of source attributes; each source attribute includes a data source ID, statistical indicators calculated from ID events, and a validation state at the data source level, which determines whether the association between two user IDs is valid.
In another preferred embodiment of the present invention, the ID connection attribute further includes a global attribute; the global attribute includes the validation state of the ID connection and statistical indicators calculated from the source attributes and ID events, and the validation state of the ID connection determines whether the ID connection record is valid.
The embodiment of the invention also provides a user ID integration system facing the data lake, which comprises:
a construction module for collecting raw user data from one or more data sources in a data lake and constructing ID events;
a first storage module for entering and storing ID events in a first distributed message queue supporting persistent storage;
an association module for reading the ID events from the first distributed message queue, generating ID connection records, and updating an ID graph;
a judging module for acquiring valid ID connection records from the ID graph according to the ID connection attributes;
an integration module for acquiring the connected components in the ID graph over the valid ID connection records, forming a user ID group from each connected component, and updating an ID mapping table;
and a second storage module for storing the mapping records between user IDs and user ID groups in the ID mapping table in a second distributed message queue supporting persistent storage.
Compared with the prior art, the technical scheme provided by the embodiment of the invention has the following technical effects:
The embodiments of the invention provide a general-purpose, automated user ID unification method and system for data lakes, suitable for building various big data platforms based on data lakes. Specifically, the embodiments adopt a general-purpose data structure, calculation process, and integration interface, thereby greatly reducing the labor cost and error risk of manually customizing user ID integration programs, data structures, and calculation programs for specific problems. In addition, the embodiments read records containing user ID association information from multiple data sources through a message system, automatically and dynamically associate the different identifiers (IDs) belonging to the same user according to preset rules, assign a unique global identifier, and feed changes in the associations among IDs back to the application systems through the message system in real time. The application systems rely on this information to associate scattered user data, thereby integrating user data on the data lake.
Drawings
Fig. 1 is a schematic diagram of a user having multiple ID tags.
FIG. 2 is a schematic diagram of an ID event production process.
FIG. 3 is a schematic diagram of a logical structure of a data lake-oriented user ID integration method.
Fig. 4 is a schematic structural diagram of an ID map formed by the ID join record.
Fig. 5 is a diagram illustrating the structure of the ID mapping table.
FIG. 6 is a flow diagram illustrating a process for computing an ID event.
FIG. 7 is a flowchart illustrating the calculation of the update ID map.
FIG. 8 is a diagram illustrating the relationship between three types of business rules.
Fig. 9 is a flowchart illustrating the execution of searching for an affected user ID.
Fig. 10 is a schematic diagram illustrating an execution flow of searching a user ID group in which a user ID is located.
FIG. 11 is a flowchart illustrating an implementation of updating an ID mapping table.
FIG. 12 is a schematic structural diagram of a data lake-oriented user ID integration system.
In the figures: 100-user; 105-websites and Apps used by the user; 110-one or more IDs generated while the user uses the websites and Apps; 120-relationships between IDs; 200-ID Event list output process; 205-ID Event production; 210-stored data of various types related to user IDs; 215-ID Event list; 300-user ID integration engine; 305-ID Event production process; 310-administrator; 315-big data applications; 320-ID Event queue; 325-ID fusion service; 330-ID mapping table; 335-business rules; 340-ID mapping change event message queue; 345-ID graph; 350-data lake; 400-ID graph sample; 405-edge between user IDs a and b; 410-statistical data and validation status of associations between user IDs; 415-edge between user IDs a and b; 420-ID connection attributes; 425-global attributes; 430-source attributes; 435-statistical indicators; 500-internal structure of an ID graph; 505-example of an ID mapping table; 510-user ID j; 601-635 are the computing steps executed periodically by the ID fusion service; 701-735 are the computing steps for updating/generating ID-Link records; 810-SLA rules; 815-GLA rules; 820-LDS rules; 825-SLA Status; 835-GLA Status; 910-930 are the computing steps for the set of affected user IDs; 1010-1035 are the computing steps for user ID groups; 1110-1135 are the computing steps for updating the ID mapping table.
Detailed Description
The following specific embodiments are specifically and clearly described in the technical solutions of the present application with reference to the drawings provided in the present specification. The drawings in the specification are for clarity of presentation of the technical solutions of the present application, and do not represent shapes or sizes in actual production or use, and reference numerals of the drawings are not limited to the claims involved.
In addition, in the description of the present application, terms used should be construed broadly, and specific meanings of the terms may be understood by those skilled in the art according to actual situations. For example, the terms "disposed" and "disposed," as used in this application, may be defined as either a contact or a non-contact arrangement, etc.; the terms "connected" and "coupled" as used herein may be defined as mechanically, electrically, or both fixedly and removably coupled; all the terms of orientation used are used with reference to the drawings or are based on the direction defined by the actual situation and the common general knowledge.
Example 1
The embodiment provides a data lake-oriented user ID integration method, which comprises the following steps:
(1) collecting raw user data from one or more data sources in a data lake and constructing an ID event;
(2) inputting and storing the ID event in a first distributed message queue supporting persistent storage;
(3) reading the ID event from the first distributed message queue, generating an ID connection record, and updating an ID graph;
(4) acquiring valid ID connection records from the ID graph according to the ID connection attributes; the ID connection attribute comprises multiple sets of source attributes and a global attribute; each source attribute comprises a data source ID, statistical indicators calculated from ID events, and a validation state at the data source level, which determines whether the association between two user IDs is valid; the global attribute comprises the validation state of the ID connection and statistical indicators calculated from the source attributes and ID events, and the validation state of the ID connection determines whether the ID connection record is valid;
(5) acquiring the connected components in the ID graph over the valid ID connection records, forming a user ID group from each connected component, and updating an ID mapping table;
(6) storing the mapping records between user IDs and user ID groups in the ID mapping table in a second distributed message queue supporting persistent storage.
Referring to Fig. 1, 100 is the user, 105 is the websites and Apps used by the user, 110 is one or more IDs generated while the user uses the websites and Apps, and 120 is the relationships, obtained by some method from log data, between new user IDs and the various existing user IDs. If the association rules between user IDs are deterministic, the relationships can be registered or extracted directly according to those rules. If the relationships cannot be determined directly from rules, a more complex machine learning method can predict, from user behavior data or characteristics of the IDs, the probability that the user IDs belong to the same user, and the relationships are then established from the prediction results. Note that this embodiment does not include these extraction and analysis methods; their results serve as input data of the invention.
Specifically, the input data of this embodiment is the "ID Event", a data record stating that a new user ID was found, a new ID relationship was found, or a relationship between user IDs changed. ID Events can be obtained on the data lake through the various registration, extraction, and analysis methods of the information systems. An ID Event includes a data source identifier, an occurrence time, an event type, and either a single user ID or two associated user IDs.
Referring to Fig. 2, 210 is the stored data of various types associated with user IDs, 205 is ID Event production based on existing rules or analysis methods, and 215 is the output ID Event list, which contains ID Events of various types. Each entry in the ID Event list has at least five attributes, structurally defined as <ID1, ID2, timestamp, type, source>, where ID1 is the master user ID and ID2 is the slave user ID associated with it. If ID2 is absent, the event indicates that a new user ID was found. timestamp is the date and time of the event (Fig. 2 uses UNIX timestamps). type is the type of the ID Event; there are two types, addition and removal, represented by "new" and "remove" in Fig. 2. source is the source ID of the ID Event; it uniquely identifies the data source and can be a data table name, a file name, a system name, or the like.
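As an illustration only, the five-attribute ID Event record described above could be represented as follows. This is a minimal sketch: the class name `IDEvent`, the field names, and the sample values are assumptions made for the example and are not part of the patent text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IDEvent:
    """One record of the ID Event list: <ID1, ID2, timestamp, type, source>."""
    id1: str              # master user ID
    id2: Optional[str]    # associated slave user ID; None means "new user ID found"
    timestamp: int        # UNIX timestamp of the event
    type: str             # "new" or "remove"
    source: str           # identifier of the data source (table, file, system name)

    def __post_init__(self):
        # Reject event types other than the two named in the description.
        if self.type not in ("new", "remove"):
            raise ValueError(f"unknown ID event type: {self.type!r}")

    @property
    def is_single_id(self) -> bool:
        """True when the event only reports a newly found user ID."""
        return self.id2 is None


# Hypothetical sample values.
event = IDEvent("1001_13810091001", "1002_u42", 1570000000, "new", "crm_orders")
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch; it fits events that are appended to a persistent queue and never modified afterwards.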
In addition, user IDs may be normalized so that IDs of different users, generated by different ID types and different applications, never coincide or conflict. Specifically, each application domain may be assigned a distinct type ID, which is prepended to the original user ID and separated from it by an underscore. For example, if the original user ID is the mobile phone number "13810091001" and user mobile phone numbers are assigned the type ID 1001, the normalized user ID becomes "1001_13810091001".
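The normalization described above can be sketched in a few lines. The function names `normalize_user_id` and `split_user_id` are hypothetical; only the "type ID, underscore, original ID" layout comes from the description.

```python
from typing import Tuple


def normalize_user_id(type_id: int, raw_id: str) -> str:
    """Prefix the original user ID with its type ID, separated by an underscore."""
    return f"{type_id}_{raw_id.strip()}"


def split_user_id(normalized_id: str) -> Tuple[str, str]:
    """Recover (type ID, original ID); split only on the first underscore,
    because the original ID may itself contain underscores."""
    type_id, raw_id = normalized_id.split("_", 1)
    return type_id, raw_id


normalized = normalize_user_id(1001, "13810091001")
```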
Referring to Fig. 3, which shows the logical structure of a "user ID integration engine" implementing the method of this embodiment. Specifically, the user ID integration engine first inputs the generated ID Events into a first distributed message queue (the ID Event queue); the ID fusion service reads the ID Events from the ID Event queue, generates ID-Link records, and updates the ID graph (ID-Graph); it then generates ID mapping records from the ID graph according to the business rules and updates the ID mapping table; next, the ID fusion service sends ID mapping change events to a second distributed message queue (the ID mapping change event queue) to notify multiple big data applications; after a big data application receives a change event, it can recalculate its application data according to the ID mapping table, for example an indicator for evaluating user value. In addition, an administrator can configure the business rules based on business needs and expert knowledge.
To suit the large computation scale, high concurrency, and high scalability requirements of a data lake, the user ID integration engine needs a distributed message system that supports persistent storage (such as Apache Kafka or RabbitMQ) to implement the ID mapping change event queue and the ID Event queue. The ID graph and the ID mapping table must use a high-performance NoSQL database (usually HBase) that supports persistent key-value storage and access.
Referring to Fig. 4, which illustrates the structure of an ID graph, 400 is an example. The nodes a-i are different normalized user IDs. User IDs are connected by undirected edges, and each such connection is called an ID-Link. The rectangular box on an ID-Link indicates that each edge has corresponding attributes, called ID connection attributes (Link Attributes). The two user IDs that form the edge are referred to as ID1 and ID2. 405 represents an edge between user IDs a and b, and 420 is its connection attribute. The ID connection attributes hold the statistics and validation status of the association between user IDs. As shown at 410, ID connection attributes fall into two broad categories: Source Link Attributes (SLA) and Global Link Attributes (GLA). In addition, each ID node has a set of statistics, the Link Density Statistics (LDS), computed over all IDs associated with that ID.
Specifically, the source attribute SLA holds the statistics and validation status of an ID-Link from one particular data source. Out-of-order events can be handled effectively through a well-designed f-list and business rules. An ID-Link can have multiple SLAs, one for each data source (source) that appears in its ID Events. The SLA is defined as <source, f-list, SLA-status>, where source is the data source ID of an ID Event, f-list is a set of statistical indicators for this link, and SLA-status is the validation status. A validation status of "true" indicates that, at the level of this data source, the association between the two user IDs is judged valid. Whether an association is valid depends on the business rules and the values in the f-list. The f-list indicators are calculated from ID Events; the invention does not limit the specific indicators or how they are calculated. In one implementation example, the f-list uses the following statistical indicators:
1. Number of occurrences of association events
2. Number of days on which association events occurred
3. Date and time of the most recent occurrence
4. Date and time of the earliest occurrence
5. Number of occurrences of deletion events
6. Type of the most recent event (delete/associate)
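The six indicators above could be maintained incrementally, one ID Event at a time. The sketch below is an assumption-laden illustration: the class `SourceFList` and its field names are hypothetical, days are counted in UTC, and the "most recent" and "earliest" times are tracked over events of both types. Comparing timestamps instead of trusting arrival order is one way to tolerate out-of-order events.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class SourceFList:
    occ_times: int = 0                                   # 1. association events seen
    occ_dates: Set[str] = field(default_factory=set)     # distinct days, backs indicator 2
    last_seen: Optional[int] = None                      # 3. most recent event time
    first_seen: Optional[int] = None                     # 4. earliest event time
    remove_times: int = 0                                # 5. deletion events seen
    last_type: Optional[str] = None                      # 6. type of the most recent event

    @property
    def occ_days(self) -> int:
        """2. Number of distinct days with an association event."""
        return len(self.occ_dates)

    def update(self, timestamp: int, event_type: str) -> None:
        if event_type == "new":
            self.occ_times += 1
            day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
            self.occ_dates.add(day.isoformat())
        elif event_type == "remove":
            self.remove_times += 1
        else:
            raise ValueError(f"unknown event type: {event_type!r}")
        # Use timestamps, not arrival order, so late events do not corrupt the state.
        if self.first_seen is None or timestamp < self.first_seen:
            self.first_seen = timestamp
        if self.last_seen is None or timestamp >= self.last_seen:
            self.last_seen = timestamp
            self.last_type = event_type
```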
In addition, the global attribute GLA holds statistical data and validation status calculated based on the data of the SLA. Only one GLA exists for one ID-Link, which is specifically defined as follows: < f-list, GLA-status >, f-list is a set of statistical indicators for this association, GLA-status is the validation status of the ID-Link. The validation state, if "true", indicates that the ID-Link between the two user IDs is determined to be valid. The index of the f-list is calculated based on SLA and ID-Event, and the invention does not limit the specific index and the realization method of the f-list. In a certain implementation example, the f-list adopts the following statistical indexes:
1. Number of SLAs whose validation status is "true"
2. Total number of occurrences of association events
3. Total number of days on which association events occurred
4. Number of occurrences of deletion events
5. Date and time of the most recent occurrence
6. Date and time of the earliest occurrence
The value of GLA-Status is determined by the business rule and the value on the f-list. If GLA-Status equals true, this ID-Link is considered valid, otherwise, the ID-Link is considered invalid.
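A global f-list of this kind can be derived by aggregating the per-source SLA records of one ID-Link. The following is a minimal sketch under stated assumptions: each SLA is a plain dict with hypothetical keys, total days are summed per source (so a day seen in two sources counts twice), and the threshold rule in `gla_status` is only one example of a business rule, not the rule of the invention.

```python
from typing import Dict, List


def aggregate_gla(slas: List[Dict]) -> Dict:
    """Build the global f-list of one ID-Link from its per-source SLA records."""
    return {
        "valid_sla_count": sum(1 for s in slas if s["status"]),
        "occ_times": sum(s["occ_times"] for s in slas),
        "occ_days": sum(s["occ_days"] for s in slas),
        "remove_times": sum(s["remove_times"] for s in slas),
        "last_seen": max((s["last_seen"] for s in slas), default=None),
        "first_seen": min((s["first_seen"] for s in slas), default=None),
    }


def gla_status(gla: Dict, min_valid_slas: int = 1) -> bool:
    """Example rule (an assumption): valid if enough data sources consider the link valid."""
    return gla["valid_sla_count"] >= min_valid_slas


# Hypothetical SLA records for one ID-Link.
slas = [
    {"source": "web", "status": True, "occ_times": 3, "occ_days": 2,
     "remove_times": 0, "first_seen": 100, "last_seen": 500},
    {"source": "crm", "status": False, "occ_times": 1, "occ_days": 1,
     "remove_times": 1, "first_seen": 50, "last_seen": 300},
]
gla = aggregate_gla(slas)
```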
Further, the f-list statistical programs compute the statistical indicators of the SLA and the GLA, and are called the sf-list statistical program and the gf-list statistical program respectively. Each program updates the statistical indicators (f-list) of the corresponding ID-Link according to the attributes and type of the ID Event. The invention does not limit how the f-list statistical programs are implemented; they can be custom-developed to meet the requirements of different application scenarios. The Link Density Statistics (LDS) of a given ID are computed over all IDs associated with that ID. The invention does not limit the specific indicators of the LDS or how they are implemented. In one implementation example, the LDS includes the following types of statistical indicators:
1. Number of valid IDs (GLA status = 1)
2. Number of valid IDs of a certain type (GLA status = 1)
3. Number of all IDs
4. Number of IDs of a certain type
The system can use these metrics to check whether multiple associations for a certain ID meet business rules.
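The four LDS indicators could be computed for one ID by scanning its ID-Links. In this sketch the edge representation `(id1, id2, gla_status)` and the function names are assumptions; the ID type is read from the normalized "type_original" prefix described earlier.

```python
from typing import Dict, Iterable, Tuple


def id_type(user_id: str) -> str:
    """Type prefix of a normalized user ID such as '1001_13810091001'."""
    return user_id.split("_", 1)[0]


def link_density(user_id: str, links: Iterable[Tuple[str, str, bool]]) -> Dict:
    """Count the IDs linked to user_id, overall and per type, valid and total."""
    stats = {"valid": 0, "all": 0, "valid_by_type": {}, "all_by_type": {}}
    for id1, id2, status in links:
        if user_id not in (id1, id2):
            continue  # edge does not touch this ID
        other = id2 if id1 == user_id else id1
        t = id_type(other)
        stats["all"] += 1
        stats["all_by_type"][t] = stats["all_by_type"].get(t, 0) + 1
        if status:  # GLA status is true
            stats["valid"] += 1
            stats["valid_by_type"][t] = stats["valid_by_type"].get(t, 0) + 1
    return stats


# Hypothetical edges: (ID1, ID2, GLA status).
links = [
    ("1001_a", "1003_x", True),
    ("1001_a", "1003_y", False),
    ("1002_b", "1001_a", True),
    ("1002_b", "1003_x", True),
]
lds = link_density("1001_a", links)
```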
As for the business rules in Fig. 3, their role is to determine the status values (true/false) of the GLA and SLA on an ID-Link from the statistics of the GLA, SLA, and LDS. The business rules are divided into three categories according to the statistical indicators they use: SLA rules, GLA rules, and LDS rules. Any number of business rules is allowed, and each is a production rule of the form IF … THEN …, with the specific format: IF boolean-function THEN DO Action. The boolean-function is a Boolean function that can be evaluated on the following parameters:
1. Statistical indicators of the SLA, GLA, or LDS
2. User IDs of the edge
3. Types of the user IDs
If true, then the Action is executed. The invention does not limit how the boolean-function is specifically determined, but only the output is limited to one boolean value (true/false). One implementation sample uses boolean expressions, such as the 2 expressions for fig. 4:
1. ID1=a AND ID2=b AND f-list.OccTimes>=3
2. ID1.type=1001 AND ID2.type=1002 AND f-list.OccDays>1
where expression 1 applies to the edge <a, b>, and OccTimes is the number of times the edge's association occurred. Expression 2 applies to all edges whose user ID types are 1001 and 1002, and OccDays is the number of days on which the edge's association occurred.
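One possible way to encode such IF … THEN … production rules is as pairs of a Boolean function and an action. The sketch below expresses the two sample expressions as Python callables; the classes `Edge` and `Rule`, the action `set_valid`, and the dictionary layout of the f-list are all assumptions for illustration, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Edge:
    id1: str
    id2: str
    f_list: Dict[str, int]
    status: bool = False   # validation status set by rule actions


def id_type(user_id: str) -> str:
    """Type prefix of a normalized user ID; an un-normalized ID is returned as is."""
    return user_id.split("_", 1)[0]


@dataclass
class Rule:
    condition: Callable[[Edge], bool]   # the boolean-function
    action: Callable[[Edge], None]      # the Action


def set_valid(edge: Edge) -> None:
    edge.status = True


rules: List[Rule] = [
    # 1. ID1=a AND ID2=b AND f-list.OccTimes>=3
    Rule(lambda e: e.id1 == "a" and e.id2 == "b" and e.f_list["OccTimes"] >= 3,
         set_valid),
    # 2. ID1.type=1001 AND ID2.type=1002 AND f-list.OccDays>1
    Rule(lambda e: id_type(e.id1) == "1001" and id_type(e.id2) == "1002"
         and e.f_list["OccDays"] > 1,
         set_valid),
]


def apply_rules(edge: Edge, rules: List[Rule]) -> None:
    """IF boolean-function THEN DO Action, for every rule in order."""
    for rule in rules:
        if rule.condition(edge):
            rule.action(edge)
```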
Different types of rules have different Actions: the Action of an SLA rule sets the status in the SLA and is called SLA-Action; the Action of a GLA rule sets the status of the GLA and is called GLA-Action; the Action of an LDS rule sets the GLA status of all ID-Links of the current ID and is called LDS-Action. The logic of SLA-Action and GLA-Action is simple and needs no customization, but the logic of LDS-Action can be complex and should allow custom development. An LDS-Action readjusts the GLA-Status of all ID-Links according to the LDS data. For example, suppose the business requires that a Chinese mobile phone number cannot be associated with more than one identity card number. An LDS rule can be set for this: IF the number of valid IDs of the identity card type > 1 THEN remove_old_links.
Here remove_old_links is a custom function whose logic is: keep the GLA-Status of the most recently appearing ID-Link as true and set the rest to false.
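A custom LDS action of this kind might look as follows. This is a sketch under assumptions: each ID-Link is a dict with hypothetical keys `last_seen` and `gla_status`, the caller passes only the links of the current ID to IDs of the restricted type, and "most recently appearing" is interpreted as the largest `last_seen` value.

```python
from typing import Dict, List


def remove_old_links(links: List[Dict]) -> None:
    """Keep only the most recent valid ID-Link valid; invalidate the others in place."""
    valid = [link for link in links if link["gla_status"]]
    if len(valid) <= 1:
        return  # the constraint "at most one valid link" already holds
    newest = max(valid, key=lambda link: link["last_seen"])
    for link in valid:
        link["gla_status"] = link is newest


# Hypothetical links from one mobile phone number to identity card IDs.
id_card_links = [
    {"id": "1003_A", "last_seen": 100, "gla_status": True},
    {"id": "1003_B", "last_seen": 300, "gla_status": True},
    {"id": "1003_C", "last_seen": 200, "gla_status": False},
]
remove_old_links(id_card_links)
```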
The sequence of execution of the different types of business rules is shown in figure 8. The present invention does not limit the order of execution between rules of the same type. Each rule in the same type may set an execution priority value, which the administrator may set. The system sorts the rules according to priority value, and then executes the rules from top to bottom according to the priority value.
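Priority-ordered execution within one rule type can be sketched as a sort followed by a loop. The assumptions here are that a smaller priority value runs earlier (the text does not say which direction is used) and that rules with equal priority keep their configured order; the class `PrioritizedRule` is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PrioritizedRule:
    name: str
    priority: int                      # set by the administrator
    apply: Callable[[Dict], None]      # evaluates the condition and runs the Action


def run_rules(rules: List[PrioritizedRule], context: Dict) -> List[str]:
    """Run rules of one type in priority order and return the execution order."""
    executed = []
    # sorted() is stable, so ties keep the order in which the rules were configured.
    for rule in sorted(rules, key=lambda r: r.priority):
        rule.apply(context)
        executed.append(rule.name)
    return executed
```

Because later rules can overwrite the status set by earlier ones, the priority order directly determines the final result.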
The invention does not limit how business rules are implemented; developers can design a suitable implementation according to the system environment and engineering requirements. One of our implementations uses a script engine to interpret and execute the business rules, with custom functions also written in the scripting language. In another, custom functions are implemented directly in a compiled language such as Java. An implementation based on a relational database is also possible: relation tables store the parameters of the boolean-function and the Action names, and dedicated code converts them into rules and executes the related Actions.
Referring to Fig. 5, 500 is the internal structure of an ID-Graph, in which a black attribute box marks an edge whose GLA-Status is "true". We call these valid ID-Links. The 4 connected subgraphs of user IDs (connected components) formed by the valid edges within 500 are circled with dashed lines; the system calls each of them a user ID group. The system assigns unique IDs to these user ID groups, G1-G4 respectively. User ID j does not enter the ID-Graph because it is not associated with any other ID; the system also assigns a unique ID to its user ID group. We refer to these virtual IDs as GIDs. A GID points to an "individual", which may be a person, a family, or a small group. Table 505 in Fig. 5 is a sample ID mapping table, which stores the mapping between user IDs and GIDs. The external big data applications in Fig. 3 can merge scattered data that belongs to the same user but carries different user IDs according to the ID mapping table.
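Grouping user IDs by connected components over valid ID-Links can be sketched with a union-find structure. The function name `build_id_mapping`, the edge tuple layout, and the GID naming scheme `G1`, `G2`, … are assumptions of this example; a production system would also need GIDs that stay stable across updates, which this sketch does not attempt.

```python
from typing import Dict, Iterable, List, Tuple


def build_id_mapping(user_ids: Iterable[str],
                     links: Iterable[Tuple[str, str, bool]]) -> Dict[str, str]:
    """Map every user ID to the GID of its connected component of valid ID-Links."""
    parent: Dict[str, str] = {u: u for u in user_ids}

    def find(u: str) -> str:
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving keeps trees shallow
            u = parent[u]
        return u

    for id1, id2, valid in links:
        for u in (id1, id2):
            parent.setdefault(u, u)         # IDs seen only on edges still get a node
        if valid:                           # only valid ID-Links join two groups
            parent[find(id1)] = find(id2)

    groups: Dict[str, List[str]] = {}
    for u in sorted(parent):
        groups.setdefault(find(u), []).append(u)

    mapping: Dict[str, str] = {}
    for n, members in enumerate(sorted(groups.values()), start=1):
        for u in members:
            mapping[u] = f"G{n}"            # hypothetical GID naming
    return mapping


# Hypothetical graph: c-d is an invalid link, j has no links at all.
mapping = build_id_mapping(
    ["a", "b", "c", "d", "e", "j"],
    [("a", "b", True), ("b", "c", True), ("c", "d", False), ("d", "e", True)],
)
```

An isolated ID such as `j` ends up in a group of its own, matching the description that it still receives a unique ID.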
Regarding updates to the ID mapping table, the ID fusion service needs to notify the external big data application in FIG. 3 whenever the ID mapping table changes. Each notification is called an ID mapping change event and has the form <GID, event-type>, where event-type is the event type. The event types and the corresponding actions are defined as follows:
1. new: a GID has been newly generated; fuse the data of the user IDs corresponding to the GID.
2. removed: the group corresponding to the GID has been merged, split, or deleted; remove the fused data corresponding to the GID.
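A consumer of these events could be structured roughly as below. The `IdMappingChangeEvent` class, the `handle` function, and the in-memory `fused` store are hypothetical names for this sketch; the source fixes only the <GID, event-type> pair and the two event types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set


class EventType(Enum):
    NEW = "new"
    REMOVED = "removed"


@dataclass(frozen=True)
class IdMappingChangeEvent:
    gid: str
    event_type: EventType


def handle(event: IdMappingChangeEvent, id_mapping: Dict[str, str], fused: Dict[str, Set[str]]) -> None:
    """Apply one change event to an application's fused-data store."""
    if event.event_type is EventType.NEW:
        # Fuse: collect every user ID that currently maps to the new GID.
        fused[event.gid] = {uid for uid, gid in id_mapping.items() if gid == event.gid}
    elif event.event_type is EventType.REMOVED:
        # Un-fuse: drop whatever was fused under the old GID.
        fused.pop(event.gid, None)


# G1 and G2 have been merged into G5 in this made-up scenario.
id_mapping = {"a": "G5", "b": "G5", "c": "G5"}
fused: Dict[str, Set[str]] = {"G1": {"a", "b"}, "G2": {"c"}}
events = [
    IdMappingChangeEvent("G1", EventType.REMOVED),
    IdMappingChangeEvent("G2", EventType.REMOVED),
    IdMappingChangeEvent("G5", EventType.NEW),
]
for event in events:
    handle(event, id_mapping, fused)
```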
After receiving a GID change event, the application looks up the ID mapping table according to the event type and fuses the data of the user IDs together. The invention does not specify how data fusion is performed; it specifies only the action that corresponds to each ID mapping change event. The application decides for itself how to "fuse" or "un-fuse" data. Three common methods are suggested below.
1. Post index
This is the simplest method: the original user data is kept unchanged in place and no ID mapping change events are processed. Only when the data is used is the ID-Mapping Table queried by user ID to obtain the GID; the other user IDs with the same GID are then queried, and finally the data corresponding to these user IDs is combined. One of our application cases uses this method to compute KPIs in scheduled batches.
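The lookup-at-read-time approach can be sketched as follows, under the assumption that the mapping table fits in a dictionary and that the KPI is a simple order total. The function names and sample data are hypothetical.

```python
from typing import Dict, List


def group_of(user_id: str, id_mapping: Dict[str, str]) -> List[str]:
    """Return all user IDs sharing the GID of user_id (just user_id if it is unmapped)."""
    gid = id_mapping.get(user_id)
    if gid is None:
        return [user_id]
    return sorted(uid for uid, g in id_mapping.items() if g == gid)


def order_total(user_id: str, id_mapping: Dict[str, str], orders_by_user: Dict[str, List[float]]) -> float:
    """Combine the data of every user ID in the group only at the moment it is needed."""
    return sum(sum(orders_by_user.get(uid, [])) for uid in group_of(user_id, id_mapping))


id_mapping = {"a": "G1", "b": "G1", "c": "G2"}
orders_by_user = {"a": [10.0, 5.0], "b": [20.0], "c": [7.0]}
total = order_total("b", id_mapping, orders_by_user)
```

Because nothing is precomputed, this variant always reflects the current mapping table, at the cost of two lookups per read.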
2. Instant mark
The original user data remains unchanged in place, but a GID marker is added to each user ID record. When a new GID appears, the corresponding user IDs are read from the ID-Mapping Table and the data records of those user IDs are marked with the new GID; when the event type of a GID is "removed", the old GID is cleared from the data records. When the data is used, the data records belonging to the same GID can be processed together. One of our application cases uses this method to unify user identification in a customer order database.
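A sketch of the marking approach on a small list of order records follows. The record layout and the function names `mark_new` and `clear_removed` are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def mark_new(gid: str, id_mapping: Dict[str, str], records: List[dict]) -> int:
    """On a "new" event: stamp the GID on every record whose user ID belongs to it."""
    members = {uid for uid, g in id_mapping.items() if g == gid}
    marked = 0
    for rec in records:
        if rec["user_id"] in members:
            rec["gid"] = gid
            marked += 1
    return marked


def clear_removed(gid: str, records: List[dict]) -> int:
    """On a "removed" event: clear the old GID marker from the records that carry it."""
    cleared = 0
    for rec in records:
        if rec.get("gid") == gid:
            rec["gid"] = None
            cleared += 1
    return cleared


orders: List[dict] = [
    {"order": 1, "user_id": "a", "gid": "G1"},
    {"order": 2, "user_id": "b", "gid": "G2"},
    {"order": 3, "user_id": "z", "gid": None},
]
id_mapping = {"a": "G3", "b": "G3"}    # made-up scenario: G1 and G2 merged into G3
clear_removed("G1", orders)
clear_removed("G2", orders)
mark_new("G3", id_mapping, orders)
same_customer: List[Optional[int]] = [o["order"] for o in orders if o["gid"] == "G3"]
```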
3. Instant aggregation
The original user data is kept unchanged in place, and the data of multiple user IDs is aggregated in advance by GID. When a new GID appears, the corresponding user IDs are read from the ID-Mapping Table, the data records corresponding to those IDs are located, and the data is aggregated; when the event type of a GID is "removed", the aggregate record corresponding to that GID is deleted directly or marked for deletion (to be cleaned up later). Real-time KPI computing applications may use this method.
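Pre-aggregation by GID might be sketched as below. The aggregate here is a plain sum of amounts, and the `deleted` flag stands in for the deletion mark; both, along with the function names, are assumptions of this sketch.

```python
from typing import Dict, List


def on_new(gid: str, id_mapping: Dict[str, str],
           amounts_by_user: Dict[str, List[float]], aggregates: Dict[str, dict]) -> None:
    """On a "new" event: aggregate the data of all user IDs mapped to the GID."""
    members = [uid for uid, g in id_mapping.items() if g == gid]
    total = sum(sum(amounts_by_user.get(uid, [])) for uid in members)
    aggregates[gid] = {"total": total, "deleted": False}


def on_removed(gid: str, aggregates: Dict[str, dict], hard_delete: bool = False) -> None:
    """On a "removed" event: delete the aggregate or only mark it for later cleanup."""
    if hard_delete:
        aggregates.pop(gid, None)
    elif gid in aggregates:
        aggregates[gid]["deleted"] = True


amounts_by_user = {"a": [10.0, 5.0], "b": [20.0]}
aggregates: Dict[str, dict] = {
    "G1": {"total": 15.0, "deleted": False},
    "G2": {"total": 20.0, "deleted": False},
}
id_mapping = {"a": "G3", "b": "G3"}    # made-up scenario: G1 and G2 merged into G3
on_removed("G1", aggregates)                      # soft delete
on_removed("G2", aggregates, hard_delete=True)    # immediate delete
on_new("G3", id_mapping, amounts_by_user, aggregates)
```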
The ID fusion service reads ID events from the ID Event queue, generates ID-Link records, and updates the ID graph (hereinafter, the ID-Graph). It then generates ID mapping records from the ID-Graph according to the business rules and updates the ID mapping table. The ID fusion service sends ID mapping change events to a queue to notify the related external big data applications. The internal flow is shown in FIG. 6.
The ID fusion service periodically performs the calculation process of FIG. 6. Its inputs are the ID mapping table, the ID-Graph, the ID-Event queue, and the business rules. Events are read from the ID-Event queue, the connections between user IDs in the ID-Graph are updated or created, and the SLA and GLA statuses are calculated according to the business rules. After the update is complete, the ID-Mapping Table is searched to determine which user IDs are affected by the new events, and the user ID groups associated with these affected IDs are recalculated. Finally, new GIDs are generated for the new ID groups, the ID mapping table is updated, and GID update events are sent to the ID mapping change event queue.
Referring to FIG. 7, FIG. 7 shows the calculation process of the procedure that updates or generates ID-Link records (UPDATE_ID_GRAPH). First, the ID-Link matching e.id1 and e.id2 of the ID-Event is looked up in the ID-Graph; if no such connection exists, a new ID-Link is created for the current event and its data is initialized. The SLA is then updated, and the SLA-Rules are executed based on the SLA and the ID-Link. Similarly, the GLA is updated and the GLA-Rules are executed. Finally, the LDS of each ID is updated and the LDS-Rules are executed. The results of executing the SLA-Rules, GLA-Rules, and LDS-Rules can affect SLA-Status and GLA-Status.
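The order of steps in this procedure can be sketched as follows. SLA, GLA, and LDS are defined elsewhere in the document, so their shapes here are assumptions: the SLA is modeled as a per-source event count with a status, the GLA as a total count with a status, and the LDS as the number of links attached to each ID. The two rules (a minimum count per source, and "valid if any source is valid") are hypothetical, and LDS-Rules are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class IdEvent:
    source: str
    id1: str
    id2: str


@dataclass
class SourceAttrs:                 # assumed shape of the per-source attributes (SLA)
    count: int = 0
    status: bool = False           # SLA-Status


@dataclass
class IdLink:
    id1: str
    id2: str
    sla: Dict[str, SourceAttrs] = field(default_factory=dict)
    total: int = 0                 # assumed global statistic (GLA)
    gla_status: bool = False       # GLA-Status


def update_id_graph(graph: Dict[Tuple[str, str], IdLink], lds: Dict[str, int],
                    e: IdEvent, min_count: int = 2) -> IdLink:
    # 1. Find the ID-Link matching e.id1 and e.id2 (undirected), or create it.
    key = (min(e.id1, e.id2), max(e.id1, e.id2))
    link = graph.get(key)
    created = link is None
    if link is None:
        link = IdLink(*key)
        graph[key] = link
    # 2. Update the SLA, then run a hypothetical SLA-Rule.
    attrs = link.sla.setdefault(e.source, SourceAttrs())
    attrs.count += 1
    attrs.status = attrs.count >= min_count
    # 3. Update the GLA, then run a hypothetical GLA-Rule.
    link.total += 1
    link.gla_status = any(a.status for a in link.sla.values())
    # 4. Update the LDS of each ID (assumed: number of links per ID).
    if created:
        for uid in key:
            lds[uid] = lds.get(uid, 0) + 1
    return link


graph: Dict[Tuple[str, str], IdLink] = {}
lds: Dict[str, int] = {}
first_status = update_id_graph(graph, lds, IdEvent("crm", "u1", "u2")).gla_status
second = update_id_graph(graph, lds, IdEvent("crm", "u2", "u1"))
```

The second event lists the same pair in reverse order and still lands on the same link, which then becomes valid under the assumed minimum count of 2.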
Referring to FIG. 9, FIG. 9 shows how the affected user IDs are searched for in the ID-Mapping Table (T) according to the ID set S. To improve computational efficiency, the set S of user IDs contained in the new ID-Event data is used as the starting point of the search. First, the GIDs corresponding to the user IDs in S are looked up in T (915). Then, using the resulting GID set G, all user IDs associated with those GIDs are queried from T, and the affected mappings <user ID, GID> are put into the set P. Finally, the set P is combined with the new user IDs in the set S. P is the initial set for recalculating the mappings between GIDs and user IDs.
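The search can be sketched with plain sets, using the names T, S, G, and P from the text. Returning the combined set of affected user IDs as a third value is a choice made for this sketch, and the sample data is invented.

```python
from typing import Dict, Set, Tuple


def find_affected(T: Dict[str, str], S: Set[str]) -> Tuple[Set[str], Set[Tuple[str, str]], Set[str]]:
    """Return (G, P, affected user IDs) for the event user IDs in S."""
    # GIDs of the user IDs in S that already appear in the mapping table T.
    G = {T[uid] for uid in S if uid in T}
    # Every <user ID, GID> mapping that belongs to one of those GIDs.
    P = {(uid, gid) for uid, gid in T.items() if gid in G}
    # User IDs to recalculate: those in P plus the IDs of S that are new to T.
    affected_ids = {uid for uid, _ in P} | S
    return G, P, affected_ids


T = {"a": "G1", "b": "G1", "c": "G2", "d": "G3"}
S = {"a", "x"}                    # "x" is a new user ID not yet in T
G, P, affected_ids = find_affected(T, S)
```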
Referring to FIG. 10, starting from the set of affected user IDs, the user ID groups containing those user IDs are found. This computation is similar to finding, from a set of vertices (user IDs), all connected components of an undirected graph. Since there are many ways to solve this problem, the present invention does not limit the specific implementation, and the implementer can select a suitable algorithm according to their own technical and equipment conditions. This embodiment uses the calculation process shown in FIG. 10, which searches for directly connected user IDs in parallel and then merges the resulting user ID groups.
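As one of the many possible methods mentioned above, the sketch below finds the groups with a sequential breadth-first search over the valid edges. It is not the parallel search-and-merge process of FIG. 10; it only illustrates the connected-component computation itself. The function name and sample edges are hypothetical, and a seed with no valid edge is returned as a singleton group in this sketch.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def user_id_groups(seeds: Iterable[str], valid_edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Return the connected components (user ID groups) reachable from the seed user IDs."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for u, v in valid_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for seed in sorted(seeds):            # sorted only to make the output order stable
        if seed in seen:
            continue                      # already part of an earlier group
        seen.add(seed)
        component = {seed}
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        groups.append(component)
    return groups


edges = [("a", "b"), ("b", "c"), ("d", "e")]
groups = user_id_groups({"a", "c", "d", "j"}, edges)
```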
Referring to FIG. 11, FIG. 11 shows how new GIDs are assigned to the newly generated user ID groups, how the ID mapping table is updated, and how GID update events are sent to the ID mapping change event queue. Steps 1115-1130 in FIG. 11 are the process of generating the new GIDs. Steps 1115 and 1120 assign a new GID to each user ID group. Step 1125 searches the ID mapping table to establish the relationship between the old GIDs and the new GIDs. Step 1130 uses this relationship to remove the user ID groups whose structure has not changed (that is, groups for which the old and new GIDs cover the same set of user IDs). Finally, step 1135 sends "new" events to the ID mapping change event queue for the new GIDs and sends "removed" events to the same queue for the old GIDs. After receiving a GID change event, the application can query the ID mapping table in an appropriate manner according to the event type and fuse the data of the user IDs together.
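The GID refresh can be sketched as below. For brevity this sketch checks whether a group is unchanged before allocating a GID, rather than allocating first and filtering afterwards as steps 1115-1130 do; the resulting mapping and events are intended to be the same. The function name, the "G<number>" naming scheme, and the sample data are assumptions.

```python
from itertools import count
from typing import Dict, List, Set, Tuple


def refresh_gids(T: Dict[str, str], groups: List[Set[str]], start: int) -> List[Tuple[str, str]]:
    """Update the mapping table T in place and return the <GID, event-type> events to send."""
    old = dict(T)                                   # snapshot of the mapping before the update
    members: Dict[str, Set[str]] = {}
    for uid, gid in old.items():
        members.setdefault(gid, set()).add(uid)

    counter = count(start)
    new_gids: List[str] = []
    replaced: Set[str] = set()
    for group in groups:
        old_gids = {old[uid] for uid in group if uid in old}
        # Unchanged structure: one old GID that covers exactly the same user IDs.
        if len(old_gids) == 1 and members[next(iter(old_gids))] == set(group):
            continue
        gid = f"G{next(counter)}"                   # hypothetical GID naming scheme
        for uid in group:
            T[uid] = gid
        new_gids.append(gid)
        replaced |= old_gids

    return [(g, "new") for g in new_gids] + [(g, "removed") for g in sorted(replaced)]


T = {"a": "G1", "b": "G1", "c": "G2", "d": "G3"}
events = refresh_gids(T, [{"a", "b", "c"}, {"d"}], start=4)
```

In this made-up run the groups of G1 and G2 have merged, so both old GIDs are removed and one new GID is announced, while the unchanged group of G3 produces no event.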
Embodiment 2
Referring to FIG. 12, this embodiment provides a data lake-oriented user ID integration system for implementing the method. Specifically, the user ID integration system includes a construction module, a first storage module, an association module, a determination module, an integration module, and a second storage module. The construction module collects original user data from one or more data sources in the data lake and constructs ID events. The first storage module enters and stores the ID events in a first distributed message queue supporting persistent storage. The association module reads the ID events from the first distributed message queue, generates ID connection records, and updates the ID graph. The determination module obtains the valid ID connection records from the ID graph according to the ID connection attributes. The integration module obtains the connected components of the ID graph over the valid ID connection records, forms the obtained connected components into user ID groups, and updates the ID mapping table. The second storage module stores the mapping records between user IDs and user ID groups from the ID mapping table in a second distributed message queue supporting persistent storage.
An ID event includes a data source identifier, an occurrence time, an event type, and a user ID; the user ID may be a single user ID or two associated user IDs. The ID connection attributes comprise multiple groups of source attributes together with global attributes. The source attributes comprise a data source ID, statistical indicators calculated from the ID events, and a data-source-level validity status, which determines whether the association between two user IDs is valid. The global attributes comprise the validity status of the ID connection and statistical indicators calculated from the source attributes and the ID events; the validity status of the ID connection determines whether the ID connection record is valid. The way the user ID integration system performs user ID integration is as described in Embodiment 1 above.
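The ID event described here could be modeled as a small immutable record, as sketched below. The field names, the event type strings, and the sample identifiers are hypothetical; the source specifies only which pieces of information an ID event carries.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class IdEvent:
    source_id: str                 # data source identifier
    occurred_at: datetime          # occurrence time
    event_type: str                # type of the event (values here are made up)
    id1: str                       # the single user ID, or the first of two associated IDs
    id2: Optional[str] = None      # the second associated user ID, if any

    def is_association(self) -> bool:
        """True when the event links two user IDs rather than carrying a single one."""
        return self.id2 is not None


login = IdEvent("crm", datetime(2019, 10, 9, tzinfo=timezone.utc), "login", "phone:138", "cookie:abc")
visit = IdEvent("web", datetime(2019, 10, 9, tzinfo=timezone.utc), "page_view", "cookie:xyz")
```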
In summary, the embodiments of the present invention provide a general, data lake-oriented method and system for automatically unifying user IDs, suitable for building various big data platforms on top of data lakes. Specifically, the embodiments adopt a general data structure, calculation process, and integration interface, which greatly reduces the labor cost and error risk of manually customizing a user ID integration program, data structure, and calculation program for each specific problem. In addition, the embodiments read records containing user ID association information from multiple data sources through the message system, automatically and dynamically associate the different identifiers or IDs belonging to the same user according to preset rules, assign a unique global identifier, and feed changes in the associations among IDs back to the application systems through the message system in real time. The application systems rely on this information to associate scattered user data, thereby integrating user data on the data lake.
It should be noted that the above embodiments only describe the technical solutions and technical features of the present application specifically and clearly. Aspects or features that belong to the prior art or to common general knowledge, as understood by those skilled in the art, are not described in detail in the above embodiments.
Of course, the technical solutions of the present application are not limited to the above embodiments. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may also be combined as appropriate to form other embodiments that those skilled in the art can understand.
Claims (10)
1. A data lake-oriented user ID integration method is characterized by comprising the following steps:
collecting raw user data from one or more data sources in a data lake and constructing an ID event;
inputting and storing the ID event in a first distributed message queue supporting persistent storage;
reading the ID event from the first distributed message queue, generating an ID connection record, and updating an ID graph;
acquiring valid ID connection records from the ID graph according to ID connection attributes;
acquiring connected components in the ID graph for the valid ID connection records, forming the acquired connected components into user ID groups, and updating an ID mapping table;
and storing the mapping records between user IDs and user ID groups in the ID mapping table in a second distributed message queue supporting persistent storage.
2. The method as claimed in claim 1, wherein the ID event includes data source identification, occurrence time, event type and unique user ID.
3. The method of claim 1, wherein the ID event comprises data source identification, time of occurrence, type of event and two associated user IDs.
4. The method as claimed in claim 1, wherein the ID connection attributes include a plurality of groups of source attributes, the source attributes include a data source ID, a statistical indicator calculated based on ID events, and a data-source-level validity status, and the data-source-level validity status determines whether the association between two user IDs is valid.
5. The method of claim 4, wherein the ID connection attributes further comprise global attributes, the global attributes comprise a validity status of the ID connection and statistical indicators calculated based on the source attributes and the ID events, and the validity status of the ID connection determines whether the ID connection record is valid.
6. A data lake oriented user ID integration system, comprising:
a construction module for collecting original user data from one or more data sources in a data lake and constructing ID events;
a first storage module for entering and storing the ID events in a first distributed message queue supporting persistent storage;
an association module for reading the ID events from the first distributed message queue, generating ID connection records, and updating an ID graph;
a determination module for acquiring valid ID connection records from the ID graph according to ID connection attributes;
an integration module for acquiring connected components in the ID graph for the valid ID connection records, forming the acquired connected components into user ID groups, and updating an ID mapping table;
and a second storage module for storing the mapping records between user IDs and user ID groups in the ID mapping table in a second distributed message queue supporting persistent storage.
7. The data lake oriented user ID integration system of claim 6, wherein the ID events comprise data source identification, time of occurrence, type of event and unique user ID.
8. The data lake oriented user ID integration system of claim 6, wherein the ID events comprise data source identification, time of occurrence, type of event and two associated user IDs.
9. The data lake-oriented user ID integration system of claim 6, wherein the ID connection attributes comprise a plurality of groups of source attributes, the source attributes comprise a data source ID, a statistical indicator calculated based on ID events, and a data-source-level validity status, the data-source-level validity status determining whether an association between two user IDs is valid.
10. The data lake-oriented user ID integration system of claim 9, wherein the ID connection attributes further comprise global attributes, the global attributes comprising a validity status of the ID connection and statistical indicators calculated based on the source attributes and the ID events, the validity status of the ID connection determining whether an ID connection record is valid.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910952703.7A CN110674231A (en) | 2019-10-09 | 2019-10-09 | Data lake-oriented user ID integration method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110674231A (en) | 2020-01-10 |
Family
ID=69081121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910952703.7A Pending CN110674231A (en) | 2019-10-09 | 2019-10-09 | Data lake-oriented user ID integration method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110674231A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346377A (en) * | 2013-07-31 | 2015-02-11 | 克拉玛依红有软件有限责任公司 | Method for integrating and exchanging data on basis of unique identification |
CN109271382A (en) * | 2018-08-17 | 2019-01-25 | 广东技术师范学院 | A kind of data lake system towards full data shape opening and shares |
CN109298840A (en) * | 2018-11-19 | 2019-02-01 | 平安科技(深圳)有限公司 | Data integrating method, server and storage medium based on data lake |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488261A (en) * | 2020-03-11 | 2020-08-04 | 北京健康之家科技有限公司 | User behavior analysis system, method, storage medium and computing device |
CN114339729A (en) * | 2020-09-30 | 2022-04-12 | 阿里巴巴集团控股有限公司 | Method and device for generating equipment identifier, electronic equipment and storage medium |
CN114339729B (en) * | 2020-09-30 | 2024-09-17 | 阿里巴巴集团控股有限公司 | Device identifier generation method and device, electronic device and storage medium |
CN112395321A (en) * | 2020-12-03 | 2021-02-23 | 恩亿科(北京)数据科技有限公司 | User ID (identity) association method and system and batch-type and streaming-type data processing method |
CN113626482A (en) * | 2021-08-17 | 2021-11-09 | 北京深演智能科技股份有限公司 | Query method and device based on system fusion ID table |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200110 |