CN112395321A

CN112395321A - User ID (identity) association method and system and batch-type and streaming-type data processing method

Info

Publication number: CN112395321A
Application number: CN202011394658.7A
Authority: CN
Inventors: 邵真奇; 张波
Original assignee: Enyike Beijing Data Technology Co ltd
Current assignee: Enyike Beijing Data Technology Co ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2021-02-23

Abstract

The application relates to a user association method, a user association system and a SuperID calculation method based on batch and stream calculation, wherein the user association method comprises the following steps: a data obtaining step, configured to obtain multiple original IDs of a user to be processed in an upstream system and a binding relationship between the original IDs, and obtain unique identification values of the original IDs, where the unique identification values include a type IDType of the original ID and a value IDValue corresponding to the original ID; a SuperID defining step for defining a SuperID for identifying the original IDs connected with each other through a binding relationship or multiple binding relationships; and a user ID association step, which is used for obtaining the original ID belonging to the same SuperID based on a binding rule to obtain an association ID. By the method and the device, the problems of excessive association and wrong association of the user ID are solved, and hardware cost and maintenance cost are reduced.

Description

User ID (identity) association method and system and batch-type and streaming-type data processing method

Technical Field

The present application relates to the field of internet technologies, and in particular, to a method and a system for associating a user ID, a batch data processing method, and a streaming data processing method.

Background

As digitization advances, more and more consumer actions can be collected and recorded digitally, for example advertising activity, app activity, WeChat activity, and offline purchasing activity.

Because each domain is independent, the original ID recorded for a consumer differs between the logs of different domains. For example, advertising activity generally uses the IMEI/IDFA as the consumer's unique identifier; WeChat activity generally uses open_id/unity_id as the consumer identifier; purchasing typically uses a membership number as the unique identifier. That is, without "user ID association," business owners cannot connect consumer data across domains or, more precisely, analyze the full-path behavior of an individual consumer across IDs in order to form more comprehensive insights and strategies.

However, simply connecting the IDs of all channels to one another ("full association") may cause excessive association in scenarios that require high data accuracy. For example, two member accounts registered through a browser on the same computer are clearly associated with the same cookie, yet treating the two member accounts as the same person is not accurate enough. In the "full association" mode, user ID association is maximized, but wrong associations also increase greatly, which causes business losses, specifically: in data analysis, excessive associated IDs cause analysis errors and anomalies; in marketing outreach, excessive associated IDs waste touchpoints and annoy consumers. In addition, the graph databases commonly used in the industry suffer from high hardware cost and high maintenance cost.

A better solution that conforms to business logic is therefore needed.

Disclosure of Invention

The embodiment of the application provides a user association method, a user association system and a SuperID calculation method based on batch and stream calculation, solves the problems of excessive association and error association of user IDs, and reduces hardware cost and maintenance cost.

In a first aspect, an embodiment of the present application provides a user ID association method, including:

a data obtaining step, configured to obtain multiple original IDs of a user to be processed and a binding relationship between the original IDs, and obtain unique identification values of the original IDs, where the unique identification values include a type IDType of the original ID and an IDValue corresponding to the original ID;

a SuperID defining step for defining a SuperID to identify the original IDs connected to each other through at least one binding relationship;

and a user ID association step, which is used for obtaining the original ID belonging to the same SuperID based on a binding rule to obtain an association ID.

In some embodiments, the SuperID takes its value from an anchor point, and the anchor point is the ID with the highest business priority and/or the ID or record with the earliest time.
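
As a rough illustration of anchor-based valuation, the Python sketch below picks the anchor of one SuperID group: the member with the highest business priority, with ties broken by the earliest recorded time. The priority table `PRIORITY`, the `Member` record and the string format returned by `superid_value` are assumptions made for this example; the application does not prescribe them.

```python
from dataclasses import dataclass

# Hypothetical business priorities: a larger number means a higher priority.
PRIORITY = {"member_id": 3, "mobile": 2, "openid": 1}


@dataclass(frozen=True)
class Member:
    id_type: str     # IDType of the original ID
    id_value: str    # IDValue of the original ID
    first_seen: int  # earliest time the ID was recorded (e.g. epoch seconds)


def pick_anchor(members):
    """Return the member with the highest priority; ties go to the earliest."""
    return min(members, key=lambda m: (-PRIORITY[m.id_type], m.first_seen))


def superid_value(members):
    """Derive the SuperID value from the anchor (the format is an assumption)."""
    anchor = pick_anchor(members)
    return f"{anchor.id_type}:{anchor.id_value}"


group = [
    Member("openid", "o-1", first_seen=100),
    Member("mobile", "m-2", first_seen=300),
    Member("mobile", "m-1", first_seen=200),
]
```

Here both mobile numbers outrank the openid, and `m-1` is chosen as the anchor because it was recorded earlier than `m-2`.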

In some of these embodiments, the binding rule further comprises:

rule one: each original ID can be directly bound to only one higher-priority original ID, which ensures that multiple original IDs are not associated merely because they are associated with the same lower-priority original ID;

rule two: each original ID cannot be directly bound to multiple original IDs of the same priority;

rule three: when an original ID has multiple higher-priority binding relationships, only one valid binding relationship is kept.

Under these binding rules, the association with the highest business priority is selected from among multiple conflicting associations, which ensures reliable and more accurate association.

In some embodiments, keeping the only valid binding relationship in rule three specifically includes:

taking the binding relationship with the highest priority among the multiple higher-priority binding relationships;

and, if binding relationships of the same priority exist, taking the most recently generated binding record as the only valid binding relationship.

Under these binding rules, each SuperID is guaranteed to correspond to at most one highest-priority ID, which further improves the reliability of SuperID binding.
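
The selection described by rule one and rule three can be sketched as follows. This is a minimal Python illustration, assuming that priorities are numeric (larger means higher) and that every candidate binding already points from a lower-priority ID to a higher-priority one; the `Binding` record and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    low_id: str         # the lower-priority original ID
    high_id: str        # the higher-priority original ID it is bound to
    high_priority: int  # priority of high_id (assumed numeric, larger is higher)
    created_at: int     # generation time of the binding record


def effective_binding(bindings) -> Optional[Binding]:
    """Rule three: keep the highest-priority binding, newest record on ties."""
    if not bindings:
        return None
    return max(bindings, key=lambda b: (b.high_priority, b.created_at))


def resolve(all_bindings):
    """Rule one: keep exactly one upward binding per lower-priority ID."""
    by_low = {}
    for b in all_bindings:
        by_low.setdefault(b.low_id, []).append(b)
    return {low: effective_binding(bs) for low, bs in by_low.items()}


candidates = [
    Binding("cookie_1", "mobile_1", high_priority=2, created_at=10),
    Binding("cookie_1", "member_1", high_priority=3, created_at=5),
    Binding("cookie_1", "member_2", high_priority=3, created_at=8),
]
kept = resolve(candidates)
```

In this example the cookie keeps only its binding to `member_2`, so the two member accounts that share a browser cookie are no longer merged through that cookie.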

In some embodiments, the SuperID is calculated subject to parameter limits, which further include:

if the number of connections of an original ID exceeds m, performing data cleaning on the pairwise binding relationships between the original IDs;

if the number of original IDs corresponding to one SuperID exceeds m, resetting and recording those original IDs, where the value of m can be set flexibly according to the actual application scenario.
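
The first parameter limit can be illustrated with a short Python sketch that counts how many pairwise bindings each original ID takes part in. What "data cleaning" does exactly is not specified in the application; dropping every binding that touches an over-connected ID is an assumption made here, and the function names are hypothetical.

```python
from collections import Counter


def over_connected(pairs, m):
    """Return the IDs that appear in more than m pairwise bindings."""
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    return {i for i, d in degree.items() if d > m}


def clean(pairs, m):
    """Drop every binding touching an over-connected ID (one possible policy)."""
    noisy = over_connected(pairs, m)
    return [(a, b) for a, b in pairs if a not in noisy and b not in noisy]


pairs = [("cookie_x", f"member_{i}") for i in range(4)] + [("mobile_1", "openid_1")]
```

With m set to 3, `cookie_x` has four connections and is cleaned out; with m set to 4 the same data passes unchanged.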

In a second aspect, an embodiment of the present application provides a user ID association system, configured to execute the user ID association method according to the first aspect, where the system includes:

the data acquisition module is used for acquiring a plurality of original IDs of a user to be processed and binding relations among the original IDs and acquiring unique identification values of the original IDs, wherein the unique identification values comprise the type IDType of the original IDs and the value IDvalue corresponding to the original IDs;

a SuperID definition module for defining a SuperID to identify the original IDs connected to each other through at least one binding relationship;

and the user ID association module is used for acquiring the original ID belonging to the same SuperID based on a binding rule to obtain an association ID.

In a third aspect, an embodiment of the present application provides a batch data processing method, including:

a data acquisition step, which is used for acquiring a batch binding relationship through an upstream application and transmitting the batch binding relationship to a SuperID service;

a SuperID obtaining step, configured to execute the user ID association method according to the first aspect, perform calculation according to the batch binding relationship, and output a batch of SuperIDs to a SuperID result file;

and a SuperID application step, configured to obtain the SuperID result file through a downstream application, and perform data processing based on the SuperID result file. By way of example and not limitation, the downstream application performs database updating, data governance, data integration, data analysis, or batch ID export based on the SuperID result file.

In a fourth aspect, an embodiment of the present application provides a streaming data processing method, including:

a data acquisition step, which is used for acquiring a stream type binding relationship through an upstream application and transmitting the stream type binding relationship to a SuperID service; optionally, the streaming binding relationship is transmitted through a Kafka distributed streaming processing platform or an application program interface API;

a SuperID obtaining step, configured to execute the user ID association method according to the first aspect, and perform calculation according to the streaming binding relationship to obtain a SuperID result file, where the SuperID result file further includes: a SuperID change result file and a SuperID full result file; specifically, the SuperID change result file is output in real time by the SuperID service based on the real-time binding relationship of the upstream application; the SuperID full result file is obtained by the SuperID service based on a binding relationship with daily frequency.

And a SuperID application step, configured to obtain the SuperID result file through a downstream application, and perform data processing based on the SuperID result file. By way of example and not limitation, the downstream application performs database updates based on the SuperID result file.

It should be noted that the SuperID-based batch data processing method is generally used in scenarios with low real-time requirements but large data volumes, whereas the SuperID-based streaming data processing method is generally used in scenarios with high real-time requirements but small data volumes.

In a fifth aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the user ID association method according to the first aspect when executing the computer program.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the user ID association method as described in the first aspect above.

Compared with the prior art, the user association method and the user association system and the SuperID calculation method based on batch and stream calculation effectively solve the problems of excessive association and error association of the user IDs and provide a reliable and accurate user ID association scheme.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow chart diagram illustrating a user ID association method according to an embodiment of the present application;

FIG. 2 is a block diagram of a user ID association system according to an embodiment of the present application;

FIG. 3 is a schematic flow chart diagram of a batch data processing method according to an embodiment of the present application;

FIG. 4 is a schematic illustration of a batch data processing method according to an embodiment of the present application;

FIG. 5 is a flow chart diagram of a streaming data processing method according to an embodiment of the application;

FIG. 6 is a schematic diagram of a streaming data processing method according to an embodiment of the application;

FIG. 7 is a schematic diagram of a SuperID association relationship in accordance with a preferred embodiment of the present application;

FIG. 8 is a schematic diagram of another SuperID association relationship in accordance with a preferred embodiment of the present application;

FIG. 9 is a schematic diagram of another SuperID association relationship in accordance with the preferred embodiment of the present application;

FIG. 10 is a logical schematic diagram of a sub-step of a streaming data processing method according to an embodiment of the present application.

Description of the drawings:

101. data acquisition module; 102. SuperID definition module; 103. user ID association module.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof, as used in this application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like in this application are not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes an association relationship of associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.

The embodiment provides a user ID association method. Fig. 1 is a schematic flowchart of a user ID association method according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:

a data obtaining step S101, configured to obtain multiple original IDs of a user to be processed and a binding relationship between the original IDs, and obtain unique identification values of the original IDs, where the unique identification values include a type IDType of the original ID and an IDValue corresponding to the original ID;

a SuperID defining step S102, configured to define a SuperID to identify original IDs connected to each other through at least one binding relationship; specifically, the SuperID takes its value from an anchor point, the anchor point being the ID with the highest business priority and/or the ID or record with the earliest time, and in this embodiment the ID with the highest business priority is preferentially used as the value anchor of the SuperID.

A user ID associating step S103, configured to obtain an original ID belonging to the same SuperID based on a binding rule, to obtain an associated ID, where the binding rule further includes:

rule one: each original ID can be directly bound to only one higher-priority original ID, so that multiple original IDs cannot become associated merely by being associated with the same lower-priority original ID;

rule two: each original ID cannot be directly bound to multiple original IDs of the same priority;

rule three: when an original ID has multiple higher-priority binding relationships, only one valid binding relationship is kept.

In some embodiments, keeping the only valid binding relationship in rule three specifically includes:

taking the binding relationship with the highest priority among the multiple higher-priority binding relationships;

and, if binding relationships of the same priority exist, taking the most recently generated binding record as the only valid binding relationship.

Under these binding rules, each SuperID is guaranteed to correspond to at most one highest-priority ID, which ensures the reliability of SuperID binding.

In some embodiments, the SuperID is calculated subject to parameter limits, which further include:

if the number of connections of an original ID exceeds m, performing data cleaning on the pairwise binding relationships between the original IDs;

if the number of original IDs corresponding to one SuperID exceeds m, resetting and recording those original IDs, where the value of m can be set flexibly according to the actual application scenario. By way of example and not limitation, with m set to 5, if an upstream browser repeatedly registers 25 accounts by cheating, it can be determined from the value of m that invalid accounts exist among them, and unbinding is performed accordingly.
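
The browser example can be sketched in Python as follows. The sketch assumes that "resetting" means removing the oversized group from the result and that "recording" means appending an entry to a log; both are interpretations, and `enforce_group_limit` is a hypothetical helper.

```python
def enforce_group_limit(groups, m):
    """Separate groups that stay valid from groups reset for exceeding m members."""
    kept, reset_log = {}, []
    for superid, members in groups.items():
        if len(members) > m:
            # Record the oversized group and leave it out of the kept result.
            reset_log.append({"superid": superid, "size": len(members)})
        else:
            kept[superid] = members
    return kept, reset_log


groups = {
    "A": {"mobile_1", "openid_1"},
    # One browser cookie plus 25 accounts registered through it.
    "B": {"cookie_9"} | {f"account_{i}" for i in range(25)},
}
kept, reset_log = enforce_group_limit(groups, m=5)
```

Group B holds 26 original IDs, well above m = 5, so it is logged and dropped while group A is left untouched.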

The embodiments of the present application are described and illustrated below by means of preferred embodiments.

FIGS. 7 to 9 are schematic diagrams of SuperID association relationships according to the preferred embodiment of the present application. As shown in FIG. 7, when a binding relationship is obtained,

for example, on January 1, mobile_1 is bound to openid_1:

CREATE(mobile_1:id{name:"mobile_1"}),(openid_1:id{name:"openid_1"})

CREATE(mobile_1)-[r1:binding{type:'default'}]->(openid_1)

mobile_1 and openid_1 are then considered to belong to the same SuperID A, which yields the association shown in FIG. 7.

As another example, on January 2, mobile_2 is bound to openid_2:

CREATE(mobile_2:id{name:"mobile_2"}),(openid_2:id{name:"openid_2"})

CREATE(mobile_2)-[r2:binding{type:'default'}]->(openid_2)

The newly added mobile_2 and openid_2 belong to SuperID B; the updated SuperID association relationship is shown in FIG. 8.

If, in the data newly added on January 3, mobile_1 is bound to openid_2:

MATCH(mobile_1:id{name:"mobile_1"}),(openid_2:id{name:"openid_2"})

CREATE(mobile_1)-[r3:binding{type:'default'}]->(openid_2)

then the four IDs mobile_1, openid_1, mobile_2 and openid_2 are connected together and all belong to SuperID A; the final updated SuperID association relationship is shown in FIG. 9.
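
The three-day example can be replayed without a graph database. The Python sketch below uses a union-find (disjoint-set) structure, which is a technique chosen here for illustration and is not named in the application; it tracks plain connectivity only and does not apply the binding rules or priorities.

```python
class UnionFind:
    """Minimal disjoint-set structure over string IDs."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra  # the root of the first argument survives

    def groups(self):
        out = {}
        for x in list(self.parent):
            out.setdefault(self.find(x), set()).add(x)
        return sorted(sorted(g) for g in out.values())


uf = UnionFind()
uf.union("mobile_1", "openid_1")  # January 1
day1 = uf.groups()
uf.union("mobile_2", "openid_2")  # January 2
day2 = uf.groups()
uf.union("mobile_1", "openid_2")  # January 3
day3 = uf.groups()
```

After the third binding, `day3` holds a single group containing all four IDs, matching the merged state of FIG. 9.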

It should be noted that, in practical applications, a single binding relationship may contain more than two IDs; in this case, the original message is first split into pairwise binding relationships, which are then processed.
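
One way to perform such a split is shown in the Python sketch below, which expands a message into every two-ID combination. Emitting all pairs (rather than, say, a chain of neighbouring pairs) is an assumption, as is the `(id_a, id_b, ts)` record shape.

```python
from itertools import combinations


def split_pairwise(ids, ts):
    """Expand one multi-ID binding message into two-ID binding records."""
    unique = sorted(set(ids))  # de-duplicate and fix the order
    return [(a, b, ts) for a, b in combinations(unique, 2)]


records = split_pairwise(["mobile_1", "openid_1", "member_1"], ts=1609459200)
```

A message carrying three IDs yields three pairwise records; a message with a single ID yields none.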

It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.

The embodiment of the present application further provides a user ID association system for implementing the foregoing embodiment and preferred embodiments; what has already been described is not repeated here. As used below, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the system described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 2 is a block diagram of a structure of a user ID association system according to an embodiment of the present application, and referring to fig. 2, the system includes:

the data acquisition module 101 is configured to acquire a plurality of original IDs of a user to be processed and a binding relationship between the original IDs, and acquire unique identification values of the original IDs, where the unique identification values include a type IDType of the original ID and an IDValue corresponding to the original ID;

a SuperID definition module 102, configured to define a SuperID to identify original IDs that are connected to each other through at least one binding relationship;

the user ID association module 103 is configured to obtain an original ID belonging to the same SuperID based on a binding rule, and obtain an association ID.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

The embodiment also provides a batch data processing method based on the SuperID. FIG. 3 is a schematic flowchart of the batch data processing method according to an embodiment of the present application, and FIG. 4 is a schematic diagram of the principle of the batch data processing method according to an embodiment of the present application; referring to FIGS. 3 and 4 in combination, the method includes the following steps:

a data obtaining step S201, configured to obtain a batch binding relationship through an upstream application, and transmit the batch binding relationship to a SuperID service;

a SuperID obtaining step S202, configured to execute the user ID association method according to the above embodiment through the SuperID service, perform calculation according to the batch binding relationship, and output a batch of SuperIDs to a SuperID result file;

a SuperID application step S203, configured to obtain the SuperID result file through a downstream application, and perform data processing based on the SuperID result file. By way of example and not limitation, the downstream application performs database updating, data governance, data integration, data analysis, or batch ID export based on the SuperID result file.
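
Steps S201 to S203 can be sketched end to end in Python. The CSV layouts (`id_a,id_b` in, `id,superid` out) and the use of the smallest member ID as a placeholder SuperID value are assumptions made for this example; an actual implementation would apply the anchor and binding rules described earlier, and would read and write real files rather than strings.

```python
import csv
import io
from collections import defaultdict


def run_batch(bindings_csv: str) -> str:
    """Read 'id_a,id_b' rows, group connected IDs, return 'id,superid' rows."""
    graph = defaultdict(set)
    for id_a, id_b in csv.reader(io.StringIO(bindings_csv)):
        graph[id_a].add(id_b)
        graph[id_b].add(id_a)

    superid_of = {}
    for start in sorted(graph):
        if start in superid_of:
            continue
        # Depth-first walk over one connected group of original IDs.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        label = min(group)  # placeholder SuperID value, not the anchor rule
        for node in group:
            superid_of[node] = label

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for node in sorted(superid_of):
        writer.writerow([node, superid_of[node]])
    return out.getvalue()


batch = "mobile_1,openid_1\nmobile_2,openid_2\nmobile_1,openid_2\n"
result_file = run_batch(batch)
```

A downstream application would then load `result_file` to update its own tables, which corresponds to step S203.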

An embodiment of the present application further provides a streaming data processing method based on the SuperID. FIG. 5 is a schematic flowchart of the streaming data processing method according to an embodiment of the present application, and FIG. 6 is a schematic diagram of the principle of the streaming data processing method according to an embodiment of the present application; referring to FIGS. 5 and 6, the method includes:

a data obtaining step S301, configured to obtain a streaming binding relationship through an upstream application, and transmit the streaming binding relationship to a SuperID service; optionally, the streaming binding relationship is transmitted through the Kafka distributed stream processing platform or an application program interface (API);

a SuperID obtaining step S302, configured to execute the user ID association method according to the foregoing embodiment through the SuperID service, and perform calculation according to the streaming binding relationship to obtain a SuperID result file, where the SuperID result file further includes: a SuperID change result file and a SuperID full result file; specifically, the SuperID change result file is output in real time by the SuperID service based on the real-time binding relationships of the upstream application, and the SuperID full result file is obtained by the SuperID service based on the binding relationships at a daily frequency;

a SuperID application step S303, configured to obtain the SuperID result file through a downstream application, and perform data processing based on the SuperID result file. By way of example and not limitation, the downstream application performs database updates based on the SuperID result file.
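
The difference between the change result and the full result can be illustrated with a toy in-memory service in Python. The class `SuperIDStream`, its `S0`, `S1`, ... labels and the merge direction are hypothetical; transport through Kafka or an API, persistence, and the binding rules are all left out.

```python
class SuperIDStream:
    """Toy in-memory SuperID service fed one binding at a time."""

    def __init__(self):
        self.superid_of = {}  # original ID -> SuperID label
        self.members = {}     # SuperID label -> set of original IDs
        self._next = 0

    def _ensure(self, original_id):
        """Register an unseen ID under a fresh label; return True if it was new."""
        if original_id in self.superid_of:
            return False
        label = f"S{self._next}"
        self._next += 1
        self.superid_of[original_id] = label
        self.members[label] = {original_id}
        return True

    def on_binding(self, id_a, id_b):
        """Apply one binding and return the change records it caused."""
        changed = {i for i in (id_a, id_b) if self._ensure(i)}
        keep, drop = self.superid_of[id_a], self.superid_of[id_b]
        if keep != drop:
            # Merge the second group into the first; its members change SuperID.
            for i in self.members.pop(drop):
                self.superid_of[i] = keep
                self.members[keep].add(i)
                changed.add(i)
        return {i: self.superid_of[i] for i in sorted(changed)}

    def full_snapshot(self):
        """Full result, e.g. for a daily export."""
        return dict(sorted(self.superid_of.items()))


svc = SuperIDStream()
c1 = svc.on_binding("mobile_1", "openid_1")
c2 = svc.on_binding("mobile_2", "openid_2")
c3 = svc.on_binding("mobile_1", "openid_2")
```

Each call to `on_binding` returns only the IDs whose SuperID changed, which plays the role of the change result file, while `full_snapshot` plays the role of the full result file.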

Fig. 10 is a logic schematic diagram of step S302 of the streaming data processing method according to an embodiment of the present application, and referring to fig. 10, implementation logic of the step further includes:

first, acquiring mapping data of the upstream binding data;

then, performing data iteration, which further comprises:

A. converting the data format of the binding data into:

(id, binding relationship detail mapping, binding timestamp ts);

B. judging whether the binding relationship is a center vertex, that is, whether the SuperID set can be completely calculated based on the current binding relationship;

if the binding relationship is not the center vertex, continuing to collect binding relationships and calculate the SuperID set;

and if the binding relationship is the center vertex, obtaining the current SuperID set, ending the iteration, and entering the next step.

Finally, based on the id priority, calculating the unique SuperID value of the SuperID set, and judging whether the current calculation result affects other existing SuperID sets:

if it does, calculating the range of influence and outputting the results of all affected SuperID sets;

and if it does not, directly outputting the result of the current SuperID set.
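
The logic of FIG. 10 might be sketched in Python as follows. The sketch reads the center-vertex test as "no further IDs can be reached from the collected set", which is an interpretation rather than something the application states; the priority table, the `bound_to` mapping key and all function names are likewise assumptions.

```python
PRIORITY = {"mobile": 2, "openid": 1}  # hypothetical id-type priorities


def to_records(bindings):
    """Step A: reshape (id_a, id_b, ts) into (id, detail mapping, ts) rows."""
    rows = []
    for id_a, id_b, ts in bindings:
        rows.append((id_a, {"bound_to": id_b}, ts))
        rows.append((id_b, {"bound_to": id_a}, ts))
    return rows


def collect_set(seed, rows):
    """Step B: keep collecting bindings until the set stops growing."""
    group, frontier = {seed}, {seed}
    while frontier:
        reached = {d["bound_to"] for i, d, _ in rows if i in frontier}
        frontier = reached - group
        group |= frontier
    return group


def superid_value(group):
    """Final step: the highest-priority member names the set (ties: smallest id)."""
    return min(group, key=lambda i: (-PRIORITY[i.split("_")[0]], i))


def affected(group, existing):
    """Existing SuperID labels whose members overlap the freshly computed set."""
    return sorted(label for label, ids in existing.items() if ids & group)


rows = to_records([("mobile_1", "openid_1", 1), ("mobile_2", "openid_2", 2),
                   ("mobile_1", "openid_2", 3)])
group = collect_set("mobile_1", rows)
existing = {"A": {"mobile_1", "openid_1"}, "B": {"mobile_2", "openid_2"}}
```

In this run the new binding touches both existing sets, so `affected` returns both labels and the results of both would be re-output.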

In addition, the user ID association method described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device, which may include a processor and a memory storing computer program instructions.

Specifically, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

The memory may include, among other things, mass storage for data or instructions. By way of example and not limitation, the memory may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a non-volatile memory. In particular embodiments, the memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode DRAM (FPM DRAM), an Extended Data Out DRAM (EDO DRAM), a Synchronous DRAM (SDRAM), and the like.

The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor. The processor implements any of the user ID association methods in the above embodiments by reading and executing computer program instructions stored in the memory.

In addition, in combination with the user ID association method in the foregoing embodiments, an embodiment of the present application may provide a computer-readable storage medium for its implementation. The computer-readable storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement any of the user ID association methods in the above embodiments.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A user ID association method is characterized by comprising the following steps:

2. The user ID association method according to claim 1, wherein the SuperID takes its value from an anchor point, and the anchor point is the ID with the highest business priority and/or the ID or record with the earliest time.

3. The method of claim 1, wherein the binding rule further comprises:

rule one: each original ID can be directly bound to only one higher-priority original ID;

4. The user ID association method according to claim 3, wherein in rule three, taking the unique and valid binding relationship specifically includes:

5. The method according to claim 2 or 4, wherein the SuperID is calculated according to a parameter constraint, and the parameter constraint further comprises:

and if the number of the original IDs corresponding to one SuperID exceeds m, resetting the original IDs and recording.

6. A user ID association system for performing the user ID association method according to any one of claims 1 to 5, comprising:

7. A batch data processing method, characterized by comprising:

a SuperID obtaining step, configured to execute the user ID association method according to any one of claims 1 to 5 through the SuperID service, perform calculation according to the batch binding relationship, and output a batch of SuperIDs to a SuperID result file;

and a SuperID application step, which is used for acquiring a SuperID result file through a downstream application and carrying out data processing based on the SuperID result file.

8. A streaming data processing method, comprising:

a data acquisition step, which is used for acquiring a stream type binding relationship through an upstream application and transmitting the stream type binding relationship to a SuperID service;

a SuperID obtaining step, configured to execute the user ID association method according to any one of claims 1 to 5 through the SuperID service, and perform calculation according to the streaming binding relationship to obtain a SuperID result file, where the SuperID result file further includes: a SuperID change result file and a SuperID full result file;

and a SuperID application step, configured to obtain the SuperID result file through a downstream application, and perform data processing based on the SuperID result file.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the user ID association method according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the user ID associating method according to any one of claims 1 to 5.