CN110825919B

CN110825919B - ID data processing method and device

Info

Publication number: CN110825919B
Application number: CN201810814300.1A
Authority: CN
Inventors: 贺勇; 李楠; 龚坚
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2023-05-26
Anticipated expiration: 2038-07-23
Also published as: CN110825919A

Abstract

Embodiments of the invention provide an ID data processing method and device. A large amount of ID data is acquired from each service system, and an ID node association graph is constructed from the acquired ID data, with each node corresponding to one piece of ID data. All communication branches (i.e. connected components) of the ID node association graph are then found, and each communication branch is encoded according to a preset encoding rule to obtain a corresponding unique characteristic identifier, so that the ID data of all nodes in a communication branch belong to that identifier. Because all of a person's ID data hang directly under one unique characteristic identifier, retrieval is much faster, and all of that person's ID data can be obtained with a single search.

Description

ID data processing method and device

Technical Field

The present invention relates to the field of big data processing technologies, and in particular, to an ID data processing method and apparatus.

Background

In data management services, many kinds of data must be managed. For example, in big data centered on "people", a "one person, one file" record must be established for each person, gathering data from the various service systems. Each person has many kinds of ID data, the total volume of ID data is in the billions, different occasions may use different IDs, and different people may hold different types of IDs. There are also other IDs such as mobile phone numbers, communication device IDs, and various network accounts (Alipay account, WeChat ID, QQ number, email address, Weibo account, etc.).

Therefore, when establishing "one person, one file" for a data management service and fusing data from multiple service systems, the IDs that each system holds for a person must be unified, and the various IDs of the same person must be linked together. At the same time, the ID data and ID association data number in the billions, so how to compute over data at this scale is a pressing problem.

Disclosure of Invention

The invention provides an ID data processing method and device, which can greatly accelerate the ID data retrieval speed.

The embodiment of the invention provides an ID data processing method, which comprises the following steps:

constructing an ID node association graph according to the acquired ID data, wherein each node corresponds to one ID data;

solving a communication branch in the ID node association graph;

and coding each communication branch according to a preset coding rule to obtain a unique characteristic identifier corresponding to the communication branch.

Optionally, constructing an ID node association graph according to the acquired ID data includes:

acquiring a plurality of ID data from each service system, wherein the ID data at least comprises an ID type and an ID number;

according to the association relation among the ID data, an ID node association diagram is constructed by using the association among the IDs as undirected edges, and each node in the ID node association diagram uses own ID data as the attribution ID thereof.

Optionally, the obtaining all the communication branches in the ID node association graph includes:

Step A: each node receives the home IDs (attribution IDs) sent by all of its neighboring nodes in the ID node association graph, selects the smallest of these home IDs, and sets it as MIN_ID. The node compares MIN_ID with its own home ID: if MIN_ID is smaller, the node sets MIN_ID as its new home ID, sets an ID update flag, and sends the updated home ID to all of its neighboring nodes; if MIN_ID is greater than or equal to its own home ID, the home ID is kept unchanged;

Step B: if the home ID of any node has an update flag, Step A is repeated until the home IDs of all nodes are no longer updated;

Step C: nodes with the same home ID are assigned to the same communication branch, and the ID data and home ID of each node in the communication branch are output.
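Steps A to C amount to a minimum-label propagation over an undirected graph. The following is a minimal single-machine sketch, not the distributed implementation; the function name `connected_components` and the edge-list input format are assumptions made for illustration. Instead of exchanging messages, each round simply reads the current home IDs of a node's neighbors, which gives the same result.

```python
from collections import defaultdict


def connected_components(edges, nodes=()):
    """Group nodes by the smallest node ID reachable from them (Steps A to C)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    for n in nodes:                      # isolated nodes have no edges
        neighbors.setdefault(n, set())

    # Initially every node uses its own ID as its home ID.
    home = {n: n for n in neighbors}

    changed = True
    while changed:                       # Step B: repeat while any node updated
        changed = False
        new_home = dict(home)
        for node, adj in neighbors.items():
            if not adj:
                continue
            min_id = min(home[m] for m in adj)   # Step A: smallest neighbor home ID
            if min_id < home[node]:
                new_home[node] = min_id
                changed = True
        home = new_home

    # Step C: nodes sharing a home ID form one communication branch.
    branches = defaultdict(set)
    for node, h in home.items():
        branches[h].add(node)
    return dict(branches)
```

The number of rounds needed grows with the longest chain in a branch, since the smallest ID travels one edge per round.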

Optionally, encoding each communication branch according to a preset encoding rule to obtain a unique feature identifier corresponding to the communication branch, including:

according to the ID data and the attribution ID of each node in each communication branch, each communication branch is encoded by utilizing a preset encoding rule to obtain a unique characteristic identifier corresponding to the communication branch;

The unique characteristic identifier consists of 32 hexadecimal digits (0-F), defined as follows:

Digits 1 to 17: 17 digits, representing the generation time of the unique characteristic identifier, which ensures that different people never receive the same unique characteristic identifier, whether initialization and updates run on the same day or several updates run within one day;

Digits 18 to 19: 2 digits, reserved, used to distinguish the type of unique characteristic identifier;

Digits 20 to 25: 6 digits, the primary tag sequence;

Digits 26 to 31: 6 digits, the secondary tag sequence;

Digit 32: 1 digit, the check digit.
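The field layout above can be illustrated by splitting an identifier into its parts. This is only a sketch of the positions listed in the text. The field names and the function `split_identifier` are hypothetical, and the time encoding and the check-digit algorithm are not specified here, so the sketch does not compute or verify them.

```python
# (name, start, end) as zero-based slice bounds over the 32-digit string.
FIELDS = (
    ("generation_time", 0, 17),      # digits 1 to 17
    ("type", 17, 19),                # digits 18 to 19 (reserved)
    ("primary_sequence", 19, 25),    # digits 20 to 25
    ("secondary_sequence", 25, 31),  # digits 26 to 31
    ("check_digit", 31, 32),         # digit 32
)

HEX_DIGITS = set("0123456789ABCDEF")


def split_identifier(uid):
    """Split a 32-digit hexadecimal identifier into its named fields."""
    uid = uid.upper()
    if len(uid) != 32 or any(c not in HEX_DIGITS for c in uid):
        raise ValueError("expected exactly 32 hexadecimal digits")
    return {name: uid[start:end] for name, start, end in FIELDS}
```

For example, `split_identifier("0" * 17 + "01" + "00000A" + "00000B" + "F")["type"]` yields `"01"`.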

Optionally, after each communication branch is encoded according to a preset encoding rule to obtain a corresponding unique feature identifier, the method further includes:

acquiring newly added ID data in each service system at regular intervals;

determining the attribution ID of the newly-added ID according to the original ID data and the newly-added ID data by combining all the communication branches and the unique characteristic identifiers corresponding to the communication branches;

and solving a new communication branch, and coding the new communication branch according to a preset coding rule to obtain a corresponding unique characteristic identifier.

Optionally, determining the attribution ID of the newly added ID according to the original acquired ID data and the newly added ID data in combination with all the communication branches and the unique feature identifiers corresponding to the communication branches includes:

Determining whether an association relationship exists between the newly added ID and the originally acquired ID according to the originally acquired ID data and the newly added ID data;

if the association relation exists, determining the unique feature identifier of the communication branch to which the original acquired ID belongs as the attribution ID of the newly added ID; if the association relation does not exist, the attribution ID of the newly added ID is the ID data.
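A small sketch of this rule follows. The function name, the pair-list format of `associations`, and the mapping `uid_of_existing` (existing ID to its unique characteristic identifier) are assumptions for illustration. When a new ID is associated with several existing IDs, the sketch takes the first match and leaves any merging to the propagation that follows.

```python
def home_id_for_new_id(new_id, associations, uid_of_existing):
    """Return (home_id, is_unique_identifier) for a newly added ID.

    associations: iterable of (id_a, id_b) association pairs.
    uid_of_existing: dict mapping each previously acquired ID to the
        unique characteristic identifier of its communication branch.
    """
    for a, b in associations:
        # Find the ID on the other end of an association involving new_id.
        other = b if a == new_id else a if b == new_id else None
        if other is not None and other in uid_of_existing:
            # Associated with a known ID: inherit that branch's identifier.
            return uid_of_existing[other], True
    # No association with a known ID: the new ID is its own home ID.
    return new_id, False
```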

Optionally, the method for obtaining the new communication branch and coding the new communication branch according to a preset coding rule to obtain a unique feature identifier corresponding to the new communication branch includes:

First step: each node sends its home ID to all of its neighbor nodes; if the home ID is a unique characteristic identifier, it carries a type flag marking it as such;

Second step: each node receives the home IDs sent by all of its neighbor nodes and sets two temporary variables: T_ID1 stores the smallest received home ID whose type is unique characteristic identifier, and T_ID2 stores the smallest received home ID whose type is not unique characteristic identifier;

Third step: each node compares its own home ID with T_ID1 or T_ID2, specifically:

If the node's own home ID is a unique characteristic identifier, it is compared with T_ID1: if T_ID1 is not empty and is smaller, the node updates its home ID to T_ID1 with type unique characteristic identifier and sets the home ID update flag; if T_ID1 is empty, the home ID is unchanged;

If the node's own home ID is not a unique characteristic identifier and T_ID1 is not empty, the node updates its home ID to T_ID1 with type unique characteristic identifier and sets the home ID update flag;

If the node's own home ID is not a unique characteristic identifier and T_ID1 is empty, the node compares its own home ID with T_ID2: if T_ID2 is not empty and is smaller than the node's own home ID, the node updates its home ID to T_ID2 with type non-unique characteristic identifier and sets the home ID update flag;

Fourth step: if the home ID of any node has an update flag, the first to third steps are repeated until the home IDs of all nodes are no longer updated; iteration then stops, yielding the new communication branches, and each new communication branch is encoded according to the preset encoding rule to obtain its corresponding unique characteristic identifier.
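The four steps can be sketched on a single machine as follows. Home IDs are modeled as `(value, is_uid)` tuples, where `is_uid` stands in for the type flag; the function name `propagate_typed` and the adjacency-dict input are assumptions for illustration. The encoding of new branches is left out.

```python
def propagate_typed(neighbors, home):
    """Propagate home IDs, preferring existing unique identifiers.

    neighbors: dict mapping node -> set of adjacent nodes.
    home: dict mapping node -> (value, is_uid) initial home ID.
    """
    home = dict(home)
    changed = True
    while changed:                                   # fourth step
        changed = False
        new_home = dict(home)
        for node, adj in neighbors.items():
            received = [home[m] for m in adj]        # first and second steps
            uids = [v for v, is_uid in received if is_uid]
            plain = [v for v, is_uid in received if not is_uid]
            t_id1 = min(uids) if uids else None      # smallest unique identifier
            t_id2 = min(plain) if plain else None    # smallest ordinary home ID

            value, is_uid = home[node]               # third step
            if is_uid:
                if t_id1 is not None and t_id1 < value:
                    new_home[node] = (t_id1, True)
                    changed = True
            elif t_id1 is not None:
                new_home[node] = (t_id1, True)
                changed = True
            elif t_id2 is not None and t_id2 < value:
                new_home[node] = (t_id2, False)
                changed = True
        home = new_home
    return home
```

Nodes that end with `is_uid` set to `False` belong to new communication branches that still need an identifier; nodes that end with the same unique identifier have been merged into one branch.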

Optionally, the method further comprises:

performing pairwise identification on every two unique feature identifiers, based on the unique feature identifiers corresponding to the communication branches;

if two unique feature identifiers are identified as belonging to the same person, outputting the two identifiers as a connecting edge, i.e. an edge exists between the communication branches corresponding to the two identifiers, thereby obtaining a connection relation graph among all communication branches;

determining, according to the connection relation graph among all communication branches, the communication branches that are connected to one another as one largest communication branch;

and selecting the smallest unique characteristic identifier in the largest communication branch as the home unique characteristic identifier of that largest communication branch, thereby merging and gathering the unique characteristic identifiers of the same person.
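The merging steps above can be sketched as follows. The function name `merge_identifiers` and the caller-supplied predicate `same_person` are hypothetical. A small union-find structure stands in for solving the connection relation graph, and always attaching the larger root to the smaller one makes each group's representative its smallest identifier.

```python
from itertools import combinations


def merge_identifiers(uids, same_person):
    """Map each identifier to the smallest identifier of its merged group."""
    uids = list(uids)
    parent = {u: u for u in uids}

    def find(u):
        # Follow parent links to the root, shortening the path on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Pairwise identification: each positive pair becomes a connecting edge.
    for a, b in combinations(uids, 2):
        if same_person(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)   # smaller root wins

    return {u: find(u) for u in uids}
```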

Optionally, the method further comprises:

and creating a forward index and an inverse index for the unique feature identifiers after merging and collecting, namely, creating a corresponding relation between the unique feature identifier of the attribution of each maximum communication branch and the ID data of each node in the unique feature identifier.
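A minimal sketch of the two indexes follows. The text does not say which direction is "forward", so this sketch assumes the forward index maps a unique feature identifier to its ID data and the inverse index maps each ID back to its identifier. The function names are hypothetical.

```python
from collections import defaultdict


def build_indexes(home_uid):
    """Build both indexes from a dict mapping ID data -> home unique identifier."""
    forward = defaultdict(set)            # identifier -> all IDs under it
    for id_data, uid in home_uid.items():
        forward[uid].add(id_data)
    inverse = dict(home_uid)              # ID -> identifier
    return dict(forward), inverse


def all_ids_of(id_data, forward, inverse):
    """Convert the input ID to its identifier, then fetch every ID under it."""
    return forward[inverse[id_data]]
```

With these two lookups, any one input ID returns all IDs of the same person.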

The application also provides an ID data processing device, comprising:

the initialization module is used for constructing an ID node association diagram according to the acquired ID data, and each node corresponds to one ID data; solving a communication branch in the ID node association graph; and coding each communication branch according to a preset coding rule to obtain a unique characteristic identifier corresponding to the communication branch.

Optionally, the initialization module is specifically configured to:

acquiring a large amount of ID data from each service system, wherein the ID data at least comprises an ID type and an ID number;

Optionally, the initialization module is specifically further configured to perform the following steps:

Optionally, the initialization module is specifically further configured to:

and according to the ID data and the attribution ID of each node in each communication branch, encoding each communication branch by utilizing a preset encoding rule to obtain a unique characteristic identifier corresponding to the communication branch.

Optionally, the apparatus further comprises:

the updating module is used for periodically acquiring newly-added ID data in each service system; determining the attribution ID of the newly added ID according to the original ID data and the newly added ID data and combining all the communication branches and the unique characteristic identifiers corresponding to the communication branches, solving a new communication branch, and encoding the new communication branch according to a preset encoding rule to obtain the unique characteristic identifier corresponding to the new communication branch.

Optionally, the updating module is specifically configured to:

Optionally, the updating module is specifically further configured to perform the following steps:

Optionally, the apparatus further comprises:

the unique feature identification merging module is used for carrying out one-to-one identification on each two unique feature identifications according to the unique feature identifications corresponding to each communication branch; if the two unique feature identifiers are identified to be the same person, outputting the two unique feature identifiers to generate a connected edge, namely, edge connection exists between the communication branches corresponding to the two unique feature identifiers, so that a connection relation diagram between all the communication branches is obtained; determining a plurality of communication branches with connection relations as the largest communication branch according to the connection relation diagram among all the communication branches; and selecting the smallest unique characteristic identifier from the largest communication branch as the attribution unique characteristic identifier of the largest communication branch, namely merging and collecting the unique characteristic identifiers of the same person.

Optionally, the apparatus further comprises:

the establishing module is used for establishing a forward index and an inverse index for the unique feature identifiers after merging and gathering, namely establishing a corresponding relation between the attribution unique feature identifier of each maximum communication branch and the ID data of each node in the unique feature identifier.

According to the embodiment of the application, a large amount of ID data are acquired from each service system, an ID node association diagram is constructed according to the acquired ID data, and each node corresponds to one ID data; all communication branches in the ID node association diagram are obtained; and encoding each communication branch according to a preset encoding rule to obtain a corresponding unique characteristic identifier, namely, the ID data of all nodes of each communication branch belong to the unique characteristic identifier. Because various ID data of each person are directly hung under a unique characteristic identifier, the searching speed can be greatly increased, and all ID data of the person can be obtained by searching once.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of an ID data processing method according to an embodiment of the present invention;

FIG. 2 is a flowchart of an ID data processing method according to another embodiment of the present invention;

FIG. 3 is a flowchart illustrating an ID data processing method according to another embodiment of the present invention;

FIG. 4 is a flowchart illustrating an initialization step according to another embodiment of the present invention;

FIG. 5 is a flowchart of a daily update step for the unique identifier according to another embodiment of the present invention;

FIG. 6 is a flowchart of a step of merging multiple unique identifiers of the same person in another embodiment of the present invention;

FIG. 7 is a schematic diagram of an ID data processing device according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a server according to another embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. "Plurality" generally means at least two, but does not exclude the case of at least one.

It should be understood that the term "and/or" as used herein merely describes an association between the associated objects and means that three relationships are possible. For example, A and/or B may represent: A exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a product or system comprising such elements.

For ID data and ID association data numbering in the hundreds of millions, the inventors studied existing ID retrieval methods and found the following. Existing methods search an index of stored ID-to-ID relations, so that inputting one ID returns another ID. When the database holds no direct association between two IDs, the second ID can be reached only indirectly, by repeatedly searching with intermediate IDs. For example, if two IDs are indirectly associated through 3 other IDs, 4 searches are needed. The target cannot be obtained directly, and retrieval efficiency drops sharply.
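The cost of such indirect retrieval can be illustrated with a small sketch that counts how many successive lookups separate two IDs in an index of direct associations. The function name `lookups_needed` and the dict-of-sets index format are assumptions for illustration only.

```python
from collections import deque


def lookups_needed(direct, start, target):
    """Count the successive direct-association lookups from start to target.

    direct: dict mapping an ID to the set of IDs directly associated with it.
    Returns None when target cannot be reached from start.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for nxt in direct.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None
```

For two IDs linked only through a chain of 3 intermediate IDs, the function returns 4, matching the example in the text.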

Therefore, the inventors improved on existing ID data retrieval methods. Based on a large amount of ID data and the association relations among IDs, a unified unique characteristic identifier is built for all associated IDs according to the encoding rule set by the invention. Once it is built, the various IDs of the same person hang directly under the same unique characteristic identifier, so all other IDs of the person can be obtained by converting the input ID into the unique characteristic identifier and then searching once. This greatly speeds up retrieval and returns all of the person's ID data in a single search.

In the invention, to facilitate the management and cross-retrieval of person-related data in various scenarios, a globally unique, uniformly formatted, and invariant ID is designed for each person; this ID is called the unique feature identifier. The various IDs of the person are merged and gathered under that unique feature identifier, so that a person in any region or country has one uniform ID, and inputting any one ID returns all the other IDs of the corresponding person.

FIG. 1 is a flowchart of an ID data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

101. constructing an ID node association graph according to the acquired ID data, wherein each node corresponds to one ID data;

the specific implementation method comprises the following steps: acquiring a large amount of ID data from each service system, wherein the ID data at least comprises an ID type and an ID number; according to the association relation among the ID data, an ID node association diagram is constructed by using the association among the IDs as undirected edges, and each node in the ID node association diagram uses own ID data as the attribution ID thereof.

102. Solving a communication branch in the ID node association graph;

Specifically, the step 102 of obtaining the communication branch in the ID node association graph includes:

103. And coding each communication branch according to a preset coding rule to obtain a unique characteristic identifier corresponding to the communication branch.

The specific implementation method comprises the following steps: according to the ID data and the attribution ID of each node in each communication branch, each communication branch is encoded by utilizing a preset encoding rule to obtain a unique characteristic identifier corresponding to the communication branch;

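The unique characteristic identifier consists of 32 hexadecimal digits (0-F), defined as follows:

Digits 1 to 17: 17 digits, representing the generation time of the unique characteristic identifier;

Digits 18 to 19: 2 digits, reserved, used to distinguish the type of unique characteristic identifier;

Digits 20 to 25: 6 digits, the primary tag sequence;

Digits 26 to 31: 6 digits, the secondary tag sequence;

Digit 32: 1 digit, the check digit.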

For example, each data service system holds a great deal of ID data about people, as well as relation data (dependence, association, etc.) between the various IDs of the same person. In a residence permit system, an identity card or passport number is used to apply for a residence permit, so there is a dependence between the residence permit and the identity card, or between the residence permit and the passport. In a passport handling system, an identity card may be used to apply for a passport, so there is a dependence between the passport and the identity card. In a vehicle management system, license plate numbers are associated with identity cards, and so on. From this association data between IDs, a large-scale undirected graph (i.e. the ID node association graph) with billions of edges and nodes can be constructed. IDs linked by an association relation belong to the same person, so all communication branches can be found on the ID node association graph, and each communication branch holds the ID data of one person. Each communication branch is then encoded according to the designed rule to obtain a unique feature identifier, and all IDs in the communication branch are gathered under that identifier, which can then be used to identify the person the communication branch represents.

FIG. 2 is a flowchart of an ID data processing method according to another embodiment of the present invention. As shown in FIG. 2, the method includes:

201. acquiring newly added ID data in each service system at regular intervals;

202. determining the attribution ID of the newly-added ID according to the original ID data and the newly-added ID data by combining all the communication branches and the unique characteristic identifiers corresponding to the communication branches;

specifically, according to the original acquired ID data and the newly added ID data, combining all the communication branches and the unique feature identifiers corresponding to the communication branches to determine the attribution ID of the newly added ID, including:

Determining whether an association relationship exists between the newly added ID and the originally acquired ID according to the originally acquired ID data and the newly added ID data; if the association relation exists, determining the unique feature identifier of the communication branch to which the original acquired ID belongs as the attribution ID of the newly added ID; if the association relation does not exist, the attribution ID of the newly added ID is the ID data.

203. And solving a new communication branch, and coding the new communication branch according to a preset coding rule to obtain a corresponding unique characteristic identifier.

The method specifically comprises the following steps:

In practical applications, each data service system generates new ID data every day. This includes, for example, ID data of new applicants (i.e. no ID of the person exists in the originally acquired ID data); ID data that associates several communication branches belonging to the same person (i.e. the originally acquired ID data could not link the IDs of those communication branches); and IDs newly added by a person who already has a unique feature identifier. Even so, the same person may still have several largest communication branches (several unique feature identifiers), because there may be no association edge between them. However, the technical scheme of the present application runs once a day, and as soon as such association data appears, the associated communication branches are merged.

To handle these cases, this embodiment determines the home ID of each newly added ID from the originally acquired ID data and the newly added ID data, in combination with all communication branches and their corresponding unique feature identifiers. It then solves for the new communication branches and encodes each of them according to the preset encoding rule to obtain its unique feature identifier. In this way the unique feature identifiers are updated automatically, multiple unique feature identifiers of the same person are merged, and new IDs are attributed to existing unique feature identifiers.

FIG. 3 is a flowchart illustrating an ID data processing method according to another embodiment of the present invention. As shown in FIG. 3, the method includes:

301. carrying out one-to-one identification on each two unique feature identifiers according to the unique feature identifiers corresponding to each communication branch;

302. if the two unique feature identifiers are identified to be the same person, outputting the two unique feature identifiers to generate a connected edge, namely, edge connection exists between the communication branches corresponding to the two unique feature identifiers, so that a connection relation diagram between all the communication branches is obtained;

303. determining a plurality of communication branches with connection relations as the largest communication branch according to the connection relation diagram among all the communication branches;

304. And selecting the smallest unique characteristic identifier from the largest communication branch as the attribution unique characteristic identifier of the largest communication branch, namely merging and collecting the unique characteristic identifiers of the same person.

In practical applications, the ID data obtained from the service systems has quality problems. For example, some records of foreign nationals have the same passport number and the same birth date, no conflict in gender, and similar names (for example, the family name and given name appear in a different order, characters are missing, or the family name is absent), but different nationalities (possibly because the ID data was collected incorrectly). In such cases, the embodiment shown in FIG. 3 of the present application can identify the records as the same person and merge and gather that person's multiple unique feature identifiers.
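One possible shape of such a same-person check is sketched below. The record field names, the similarity measure (Python's `difflib.SequenceMatcher` applied to order-normalized names), and the threshold of 0.8 are all assumptions chosen for illustration; the text does not specify the similarity algorithm. Nationality is deliberately not compared, since the example above treats it as possibly mis-collected.

```python
from difflib import SequenceMatcher


def _normalize(name):
    # Lower-case and sort the name parts so that "Smith John" and
    # "John Smith" compare as equal.
    return " ".join(sorted(name.lower().split()))


def same_person(a, b, name_threshold=0.8):
    """Heuristic same-person test for two passport-based records (dicts)."""
    if a["passport_no"] != b["passport_no"]:
        return False
    if a["birth_date"] != b["birth_date"]:
        return False
    # Gender only rules a match out when both records state it and differ.
    if a.get("gender") and b.get("gender") and a["gender"] != b["gender"]:
        return False
    ratio = SequenceMatcher(None, _normalize(a["name"]), _normalize(b["name"])).ratio()
    return ratio >= name_threshold
```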

It should be noted that in this embodiment, a forward index and an inverse index may be created for the unique feature identifiers after merging and collecting, that is, a correspondence between the unique feature identifier of each maximum connected branch and the ID data of each node in the unique feature identifier is created, so as to facilitate ID data search with different requirements.

The specific implementation of the technical scheme of the application is described in detail below. The scheme is divided into four steps: an initialization step, a daily update step, the collection of multiple unique feature identifiers of the same person based on a similarity algorithm, and the creation of a forward index and an inverse index of the unique feature identifiers.

The initialization step only runs once and comprises the steps of extracting ID relation data, constructing an ID relation graph, solving all communication branches and unique feature identification codes;

The daily update step runs once a day and comprises: extracting the updated ID relation data; combining it with the unique feature identifier data of the previous day; constructing a relation graph, carrying unique feature identifiers, from the full ID data of the previous day and the daily update data; solving for the existing unique feature identifiers and the new communication branches; encoding unique feature identifiers for the new communication branches; and merging the existing unique feature identifiers with the newly added ones;

The step of collecting multiple unique feature identifiers of the same person based on a similarity algorithm must be run once after initialization and after each daily update, and specifically comprises: identifying the same person, and collecting the unique feature identifiers of the same person.

The forward index and the reverse index of the unique feature identification are established for storage and fast cross-retrieval.

Fig. 4 is a flowchart illustrating an initialization step according to another embodiment of the present invention, where the initialization step shown in fig. 4 specifically includes:

Extracting ID relation data: the dependence data between IDs is extracted from each business system.

Constructing an ID relation diagram: constructing an ID node association Graph by using an ODPS-Graph large-scale Graph calculation tool;

Because each vertex in Graph must carry an ID that uniquely identifies it within the graph, and each vertex is one piece of ID data, a unique identification must be provided for each value (e.g. each identity card number, each passport number) of each kind of ID data (e.g. identity card, passport). Inspection of the ID data shows that values may repeat across different types of IDs, so the "ID type" must be added. In particular, passport numbers of different nationalities may repeat, so the nationality must be added for passports, whereas other personal ID data does not need it. Passports are therefore uniquely identified by "ID type + ID number + nationality", and other types of IDs by "ID type + ID number". After the vertex IDs of Graph are set in this way, the graph is constructed with the associations between IDs as undirected edges.
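The vertex-key rule and the undirected graph construction can be sketched in plain Python as follows. This does not use ODPS-Graph; the function names, the `"+"` separator, the type label `"passport"`, and the tuple format of the input records are assumptions for illustration.

```python
def node_key(id_type, id_number, nationality=None):
    """Build a vertex key that is unique across ID types."""
    if id_type == "passport":
        # Passport numbers can repeat across countries, so nationality is required.
        if not nationality:
            raise ValueError("passport keys require a nationality")
        return f"{id_type}+{id_number}+{nationality}"
    return f"{id_type}+{id_number}"


def build_graph(associations):
    """Build an undirected adjacency dict from pairs of ID records.

    associations: iterable of (left, right) pairs, where each side is a
    tuple (id_type, id_number) or (id_type, id_number, nationality).
    """
    graph = {}
    for left, right in associations:
        a, b = node_key(*left), node_key(*right)
        graph.setdefault(a, set()).add(b)   # store the edge in both directions
        graph.setdefault(b, set()).add(a)
    return graph
```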

Maximum connected subgraph: after the ID node association Graph is constructed, all communication branches (connected components) in it are solved using a union-find approach; in the ODPS-Graph implementation, the communication branches are solved with a message propagation mechanism, as follows.

At initialization, each node in the ID node association graph uses its own ID as its home ID.

The first step: each node propagates its home ID to each node adjacent to itself.

The second step: each node receives the messages (the home IDs of its neighboring nodes) sent by all its neighbors, selects the smallest of these home IDs as MIN_ID, and compares MIN_ID with its own home ID. If MIN_ID is smaller than its own home ID, the node sets MIN_ID as its new home ID and sets its own Change flag to true; otherwise its home ID is unchanged. If its home ID was updated, the node sends the new home ID to all its neighbors.

If the Change flag of any node is true (i.e., an update occurred), the second step is repeated; the iteration stops when no node's home ID is updated any longer, i.e., upon convergence.

Finally, nodes with the same home ID belong to the same communication branch (the home ID serves as the communication branch number); each node and its home ID are output, which yields all the communication branches in the graph.
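The propagation steps above can be sketched on a single machine as follows. This is only an in-memory illustration of the minimum-home-ID iteration; it does not use the ODPS-Graph API, and the function name `connected_branches` is hypothetical.

```python
from collections import defaultdict


def connected_branches(edges, nodes=()):
    """Return {node: home ID}; nodes sharing a home ID form one branch."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    for n in nodes:
        neighbors[n]  # touching the key registers isolated nodes

    # Initialization: every node is its own home ID.
    home = {n: n for n in neighbors}

    changed = True
    while changed:
        changed = False
        new_home = dict(home)  # one synchronous round of messages
        for n, nbrs in neighbors.items():
            if not nbrs:
                continue
            min_id = min(home[m] for m in nbrs)
            if min_id < home[n]:
                new_home[n] = min_id
                changed = True  # plays the role of the Change flag
        home = new_home
    return home
```

For example, `connected_branches([("a", "b"), ("b", "c"), ("x", "y")])` assigns home ID "a" to the first three nodes and "x" to the other two.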

Unique feature identifier encoding: after all communication branches in the graph are obtained, each communication branch number could serve as a unique feature identifier, but the communication branch numbers are not uniform in format (because the smallest node ID in a communication branch is chosen as its number, some numbers may be of the form "identity card + identity card number" while others are "passport + passport number + nationality"). The communication branch numbers therefore need to be unified, that is, encoded as unified unique feature identifiers. The encoding rule is as follows:

The unique feature identifier consists of 32 hexadecimal digits (0-F), defined as follows:

Digits 1 to 17: 17 digits, the generation time of the unique feature identifier, expressed as the number of milliseconds since 1970-01-01 (inclusive). This granularity is sufficient: unique feature identifiers generated at initialization and by an update on the same day (or by multiple updates within one day) do not coincide for different persons;

Digits 18 to 19: 2 digits, reserved, for distinguishing types of unique feature identifiers in the future;

Digits 20 to 25: 6 digits, primary sequence (may use the worker_id of the distributed computation);

Digits 26 to 31: 6 digits, secondary sequence (may use the sequence number within the worker of the distributed computation);

Digit 32: 1 digit, check digit (derived by summing the first four fields, modulo 16).

Example (the hyphens only separate the five fields for readability):

00000000000000146-00-000014-10052D-9
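The field layout can be sketched as a pair of helpers that assemble and split an identifier. The names `encode_uid`, `split_uid` and `check_digit` are hypothetical. The check-digit rule used here (sum of the hexadecimal digit values of the first 31 digits, modulo 16) is an assumption: the text does not spell out the exact summation, so this variant is only a placeholder and is not claimed to reproduce the check digit of the sample above.

```python
FIELD_WIDTHS = (17, 2, 6, 6)  # time, reserved, primary, secondary (hex digits)


def check_digit(body: str) -> str:
    # ASSUMPTION: digit-value sum modulo 16; the source does not define
    # the summation precisely.
    return format(sum(int(c, 16) for c in body) % 16, "X")


def encode_uid(millis: int, reserved: int, primary: int, secondary: int) -> str:
    """Assemble a 32-hex-digit identifier from its four numeric fields."""
    parts = []
    for value, width in zip((millis, reserved, primary, secondary), FIELD_WIDTHS):
        if not 0 <= value < 16 ** width:
            raise ValueError(f"value {value} does not fit in {width} hex digits")
        parts.append(format(value, f"0{width}X"))
    body = "".join(parts)
    return body + check_digit(body)


def split_uid(uid: str):
    """Split an identifier (hyphens optional) into its five fields."""
    digits = uid.replace("-", "")
    if len(digits) != 32:
        raise ValueError("a unique feature identifier has 32 hex digits")
    fields, pos = [], 0
    for width in FIELD_WIDTHS + (1,):
        fields.append(digits[pos:pos + width])
        pos += width
    return tuple(fields)
```

Applied to the sample, `split_uid` yields the fields "00000000000000146", "00", "000014", "10052D" and "9".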

The unique feature identifier has the following technical characteristics:

the unique feature identifier is uniform, unique and invariant;

the unique feature identifier can be mapped one-to-one to integer (Long-type) values occupying 32 × 4 = 128 bits = 16 bytes, which facilitates storage and computation in the relevant application systems.
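The one-to-one mapping between the 32-digit hexadecimal form and a 128-bit value can be illustrated as follows. The helper names are hypothetical, and Python's arbitrary-precision `int` stands in for whatever fixed-width integer type an application system would actually use.

```python
def uid_to_int(uid: str) -> int:
    """Interpret the 32 hex digits (hyphens optional) as one integer."""
    return int(uid.replace("-", ""), 16)


def uid_to_bytes(uid: str) -> bytes:
    """Pack the identifier into 16 bytes (128 bits), big-endian."""
    return uid_to_int(uid).to_bytes(16, "big")


def int_to_uid(value: int) -> str:
    """Inverse mapping: zero-padded, upper-case, 32 hex digits."""
    return format(value, "032X")
```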

The encoding can be implemented using MapReduce:

Mapper: outputs key-value pairs, where the key is the communication branch number and the value is each item of person ID data belonging to that communication branch.

Reducer: through the output of the Mapper, the IDs of persons in the same communication branch are gathered under the same Reducer, whose key is the communication branch number; the branch is then numbered according to the unique feature identifier encoding rule. The primary sequence (digits 20 to 25) uses the number of the worker on which the Reducer runs (the whole computation task is divided among multiple workers for distributed computation, and one worker is responsible for several Reducers), and the secondary sequence (digits 26 to 31) uses the Reducer's sequence number within its worker (every Reducer in the same worker has an internal number). Finally, each Reducer outputs <unique feature identifier, person ID 1 under the communication branch>, <unique feature identifier, person ID 2 under the communication branch>, …, one row per ID.
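The Mapper and Reducer roles can be imitated in memory as below. This sketch does not use any real MapReduce API: the grouping dictionary stands in for the shuffle phase, a single worker (number 0) is assumed, and `make_uid` is a hypothetical callback that turns a worker number and a sequence number into a unique feature identifier.

```python
from collections import defaultdict


def mapper(node_to_branch):
    """Emit (communication branch number, person ID) pairs."""
    for node_id, branch_no in node_to_branch.items():
        yield branch_no, node_id


def reducer(node_ids, uid):
    """Emit one <unique feature identifier, person ID> row per ID."""
    for node_id in sorted(node_ids):
        yield uid, node_id


def run_encoding(node_to_branch, make_uid, worker_id=0):
    # Stand-in for the shuffle: group mapper output by branch number.
    groups = defaultdict(list)
    for branch_no, node_id in mapper(node_to_branch):
        groups[branch_no].append(node_id)

    rows = []
    # Each group is handled by one reducer; its position within the
    # worker supplies the secondary sequence.
    for sequence, branch_no in enumerate(sorted(groups)):
        uid = make_uid(worker_id=worker_id, sequence=sequence)
        rows.extend(reducer(groups[branch_no], uid))
    return rows
```

A caller would pass the home-ID mapping produced by the communication branch step together with an encoder that follows the 32-digit rule.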

FIG. 5 is a flowchart of the daily update step of the unique feature identifiers according to another embodiment of the present invention, where the daily update step shown in FIG. 5 specifically includes:

New ID data arrives every day. It includes ID data of newly arrived persons (i.e., persons with no ID data in the previous ID data), ID data that links multiple communication branches belonging to the same person (branches that the previous ID data could not link), and ID data that belongs to a unique feature identifier already present in the previous ID data. These three cases respectively require encoding new unique feature identifiers, merging multiple unique feature identifiers of the same person, and attributing the new IDs to existing unique feature identifiers. The three operations need the unique feature identifier result data of the previous day, the full ID association data of the previous day, and the newly added ID association data, so a graph must be constructed by combining these three data sets.

First, the ID association data of the previous day and the newly added ID association data are combined to construct a Graph (the ID node association graph); the node ID rule is the same as in the initialization step. The unique feature identifier result data of the previous day is then combined: for a node that exists in that result data, the initial home ID is its unique feature identifier; for a node that does not, the initial home ID is its own node ID. This yields a relation graph in which some nodes carry unique feature identifiers.

Then, a union-find style algorithm is used to attribute the new IDs, solve the new communication branches, and merge multiple unique feature identifiers of the same person; the concrete implementation uses the message propagation mechanism of ODPS-Graph.

The home ID of each node has already been set during the preceding initialization: for a node with a unique feature identifier, the home ID is that unique feature identifier; otherwise, the home ID is its node ID.

The first step: each node sends its home ID to all its neighbor nodes; each home ID carries a type (i.e., whether it is a unique feature identifier).

The second step: each node receives the messages (the home IDs of its neighbor nodes) sent by all its neighbors and sets two temporary variables, both initialized to NULL. The first, T_ID1, stores the smallest home ID of the unique feature identifier type among the received messages. The second, T_ID2, stores the smallest home ID that is not a unique feature identifier among the received messages. The node then compares them with its own home ID.

If the node's own home ID is a unique feature identifier, it is compared with T_ID1: if T_ID1 is not NULL and is smaller, the node updates its home ID to T_ID1, with type unique feature identifier, and sets its Change flag to true; if T_ID1 is NULL, the home ID stays unchanged. This case merges multiple unique feature identifiers of the same person, selecting the smallest one.

If the node's own home ID is not a unique feature identifier and T_ID1 is not NULL, the node updates its home ID to T_ID1, with type unique feature identifier, and sets its Change flag to true. This is the case in which a new ID finds an existing unique feature identifier to belong to. Otherwise, if T_ID1 is NULL, the node compares its own home ID with T_ID2: if T_ID2 is not NULL and is smaller than its own home ID, the node updates its home ID to T_ID2, with the non-unique-feature-identifier type, and sets its Change flag to true. This case is either a new person or a new ID that has not yet found the unique feature identifier it belongs to (if the iteration has stopped, it is a new person).

Two outputs are thus obtained. One covers merged persons and new IDs attributed to existing unique feature identifiers; the resulting unique feature identifier directly replaces, or serves as, the person's unique feature identifier, i.e., the updated unique feature identifier. The other covers new persons, who had no ID data and hence no unique feature identifier before; for them a communication branch number (a temporary unique feature identifier) is obtained, which must still be converted into a unique feature identifier according to the encoding rule.
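The typed propagation of the daily update can be sketched in memory as follows. This is an illustration under stated assumptions, not the ODPS-Graph implementation: a home ID is modelled as a tuple `(is_uid, value)`, `None` plays the role of NULL, and the function name `daily_update` is hypothetical.

```python
def daily_update(neighbors, home):
    """neighbors: {node: set of nodes}; home: {node: (is_uid, value)}."""
    home = dict(home)
    changed = True
    while changed:
        changed = False
        new_home = dict(home)  # one synchronous round of messages
        for node, nbrs in neighbors.items():
            msgs = [home[m] for m in nbrs]
            # T_ID1: smallest neighbor home ID that is a unique feature identifier.
            t_id1 = min((v for is_uid, v in msgs if is_uid), default=None)
            # T_ID2: smallest neighbor home ID that is a plain node ID.
            t_id2 = min((v for is_uid, v in msgs if not is_uid), default=None)
            own_is_uid, own = home[node]
            if own_is_uid:
                # Merge identifiers of the same person: keep the smallest.
                if t_id1 is not None and t_id1 < own:
                    new_home[node] = (True, t_id1)
            elif t_id1 is not None:
                # A new ID joins an existing unique feature identifier.
                new_home[node] = (True, t_id1)
            elif t_id2 is not None and t_id2 < own:
                # A new person: converge on the smallest node ID.
                new_home[node] = (False, t_id2)
            if new_home[node] != home[node]:
                changed = True  # plays the role of the Change flag
        home = new_home
    return home
```

After convergence, entries whose first component is true carry an updated unique feature identifier, and entries whose first component is false carry a temporary communication branch number that still needs encoding.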

Furthermore, for each communication branch of a new person, the unique feature identifier is obtained by encoding according to the unique feature identifier encoding rule; the concrete implementation uses a MapReduce task, as described in the initialization step.

Finally, the updated unique feature identifiers are merged with the unique feature identifiers of the new persons to obtain the result of this run.

Fig. 6 is a flowchart of the step of merging multiple unique feature identifiers of the same person according to another embodiment of the present invention, where the step shown in fig. 6 specifically includes:

It should be noted that the merging step shown in fig. 6 differs from the merging of multiple unique feature identifiers of the same person in the daily update step shown in fig. 5. In the daily update step, the merge arises because the historical ID data was insufficient and the new ID data links the unique feature identifiers. The merge shown in fig. 6 arises from quality problems in the ID data and requires identification with an approximate matching algorithm. For example, the ID data contains some foreigners whose passport numbers are the same, whose dates of birth are the same, whose genders do not conflict, and whose names are similar (e.g., surname and given name in a different order, a few characters more or fewer, or a missing surname), but who have multiple nationalities (possibly ID data collection errors). Two steps are therefore involved: identifying the same person, and merging the multiple unique feature identifiers of that person.

First, a MapReduce task is used to identify the same person. Records of the same person must at least share the certificate type, certificate number and date of birth, so the key output by the Mapper is "certificate type + certificate number + date of birth" and the value is the unique feature identifier, name, gender and nationality. All unique feature identifiers with the same "certificate type + certificate number + date of birth" are thereby gathered into the same Reducer. Within each Reducer, the unique feature identifiers are compared pairwise to determine whether each pair represents the same person. The specific strategy is as follows (all of the following conditions must be satisfied simultaneously):

1. The genders do not conflict, i.e., it is not the case that one is "male" and the other is "female".

2. The names are similar (case-insensitive). The name similarity rules are as follows (satisfying any one suffices): (1) the strings are equal; (2) the character sets are the same (e.g., both names are Chinese, or both are English) and, in addition, one of the following holds: one name is a substring of the other; or, if both are Chinese, the length of the name minus the length of the longest common subsequence is at most 1, i.e., the names differ by at most 1 character; or, if both are non-Chinese, the length of the name minus the length of the longest common subsequence is at most 3, i.e., the names differ by at most 3 characters.
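The two conditions can be sketched as follows. Several details are assumptions of this illustration: "the length of the name" is taken to be the length of the longer name, a name counts as Chinese only if every non-space character lies in the CJK Unified Ideographs block U+4E00 to U+9FFF, and the gender labels "male" and "female" as well as all function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_chinese(name: str) -> bool:
    # ASSUMPTION: a simple code-point range test decides the character set.
    return all("\u4e00" <= ch <= "\u9fff" for ch in name if not ch.isspace())


def names_similar(n1: str, n2: str) -> bool:
    a, b = n1.casefold(), n2.casefold()
    if a == b:                          # rule (1): equal strings
        return True
    if is_chinese(a) != is_chinese(b):  # rule (2) needs the same character set
        return False
    if a in b or b in a:                # one name is a substring of the other
        return True
    # ASSUMPTION: the gap is measured against the longer name.
    gap = max(len(a), len(b)) - lcs_length(a, b)
    return gap <= (1 if is_chinese(a) else 3)


def same_person(p: dict, q: dict) -> bool:
    """Both conditions must hold: no gender conflict, similar names."""
    if {p["gender"], q["gender"]} == {"male", "female"}:
        return False
    return names_similar(p["name"], q["name"])
```

The records passed to `same_person` are assumed to have already been grouped by "certificate type + certificate number + date of birth", as the Reducer grouping guarantees.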

Once two unique feature identifiers are determined to be the same person, they are output as a connecting edge: the larger one is unique feature identifier 1 and the smaller one is unique feature identifier 2, and the output <unique feature identifier 1, unique feature identifier 2> indicates that the persons represented by the two identifiers are the same person and that unique feature identifier 1 is to be merged into unique feature identifier 2.

After the MapReduce task, only the unique feature identifiers that need to be merged, together with the smaller identifiers they merge into, are obtained; the final merge target of each is not yet known, because merges may be indirect. For example, <unique feature identifier 1, unique feature identifier 2> indicates that unique feature identifier 1 is merged into unique feature identifier 2, and <unique feature identifier 2, unique feature identifier 3> indicates that unique feature identifier 2 is merged into unique feature identifier 3, so unique feature identifier 1 must ultimately be merged into unique feature identifier 3; other merges involving unique feature identifier 1 may also exist in the Reducer. The MapReduce output alone is therefore insufficient, and a Graph step that solves communication branches is needed.

Because every unique feature identifier in a communication branch is connected to the smallest one by an edge or is indirectly reachable from it, the smallest unique feature identifier in the branch is chosen as the home unique feature identifier of all identifiers in that branch. The Graph is therefore constructed using the MapReduce output as edges, and all communication branches are obtained with the Graph communication branch solving algorithm described above; the smallest unique feature identifier in a branch is the communication branch number of that branch. Each unique feature identifier is then output together with the number of the branch that contains it, which completes the merging of the unique feature identifiers of the same person.
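Resolving indirect merges to the smallest identifier of each communication branch can be sketched with a small union-find structure. This swaps the message propagation of the Graph step for an in-memory union-find, purely for illustration; the function name `merge_targets` is hypothetical.

```python
def merge_targets(edges):
    """Map every identifier on an edge to the smallest one in its branch."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            small, large = sorted((root_a, root_b))
            parent[large] = small  # the smaller root always survives

    return {uid: find(uid) for uid in list(parent)}
```

Because the larger root is always attached beneath the smaller one, the root of every branch is its smallest identifier, which matches the choice of communication branch number described above.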

Finally, a forward index and an inverted index are created for the unique feature identifiers to serve different retrieval needs. This guarantees that for any ID in the original input table, at least one corresponding unique feature identifier can be found; conversely, for any unique feature identifier, the associated IDs of different types can easily be enumerated. The former is called the "inverted index" and the latter the "forward index".
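The two lookup directions can be sketched with plain dictionaries. The in-memory layout, the row shape `(unique feature identifier, ID type, ID number)` and the function name `build_indexes` are assumptions of this illustration and are not the index formats of the embodiment.

```python
from collections import defaultdict


def build_indexes(rows):
    """rows: iterable of (unique feature identifier, ID type, ID number)."""
    forward = defaultdict(lambda: defaultdict(list))
    inverted = {}
    for uid, id_type, id_number in rows:
        # Forward index: identifier -> IDs of each type.
        forward[uid][id_type].append(id_number)
        # Inverted index: one ID -> the identifier it belongs to.
        inverted[(id_type, id_number)] = uid
    return {uid: dict(types) for uid, types in forward.items()}, inverted
```

A lookup such as `inverted[("phone", "13800000000")]` then returns the person's identifier in one step, and `forward[uid]` lists every ID recorded for that person.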

The forward index format is exemplified as follows:

the inverted index format is exemplified as follows:

fig. 7 is a schematic structural diagram of an ID data processing device according to an embodiment of the present invention, as shown in fig. 7, including:

Optionally, the initialization module is specifically configured to:

Optionally, the initialization module is specifically further configured to:

Optionally, the apparatus further comprises:

Optionally, the updating module is specifically configured to:

Optionally, the apparatus further comprises:

The apparatus shown in this embodiment may perform the method embodiments shown in fig. 1 to fig. 6, and the implementation principle and technical effects thereof will not be described again.

Accordingly, the embodiments of the present application further provide a computer readable storage medium storing a computer program, where the computer program when executed by a computer can implement the steps or operations related to the ID data processing device in the above method embodiments, which are not described herein.

Fig. 8 is a schematic structural diagram of a server according to another embodiment of the present invention, as shown in fig. 8, including:

a memory 82, a processor 81, and a communication component 83;

a communication component 83 for acquiring a large amount of ID data from each service system;

a memory 82 for storing a computer program;

a processor 81 coupled with the memory and the communication component for executing a computer program for:

Solving a communication branch in the ID node association graph;

Optionally, the processor 81 is further configured to:

acquiring newly added ID data in each service system at regular intervals;

and obtaining a new communication branch, and encoding the new communication branch according to a preset encoding rule to obtain a unique characteristic identifier corresponding to the new communication branch.

Optionally, the processor 81 is further configured to:

Further, as shown in fig. 8, the server further includes: a display 84, a power supply component 85, an audio component 86, and other components. Only some of the components are schematically shown in fig. 8, which does not mean that the server comprises only the components shown in fig. 8.

The server in this embodiment may execute the method embodiments shown in fig. 1 to 6, and the implementation principle and technical effects thereof will not be described in detail.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An ID data processing method, comprising:

solving a communication branch of the ID node association graph;

coding each communication branch according to a preset coding rule to obtain a unique characteristic identifier corresponding to the communication branch;

the construction of the ID node association graph according to the acquired ID data comprises the following steps: acquiring a plurality of ID data from each service system, wherein the ID data at least comprises an ID type and an ID number; according to the association relation among the ID data, using the association among the IDs as an undirected edge to construct an ID node association graph, wherein each node in the ID node association graph uses own ID data as an attribution ID thereof;

the step of obtaining the communication branch in the ID node association graph comprises the following steps:

2. The method of claim 1, wherein the encoding each communication branch according to a preset encoding rule to obtain a corresponding unique feature identifier thereof comprises:

the unique feature identifier consists of 32 hexadecimal digits (0-F), defined as:

Digits 20 to 25: 6 digits, primary sequence;

Digits 26 to 31: 6 digits, secondary sequence;

Digit 32: 1 digit, check digit.

3. The method according to claim 1 or 2, wherein after encoding each communication branch according to a preset encoding rule to obtain a unique feature identifier corresponding to the communication branch, the method further comprises:

acquiring newly added ID data in each service system at regular intervals;

4. A method according to claim 3, wherein determining the home ID of the newly added ID based on the original acquired ID data and the newly added ID data in combination with all the communication branches and their corresponding unique feature identifiers comprises:

5. The method of claim 4, wherein obtaining a new communication branch and encoding the new communication branch according to a preset encoding rule to obtain a unique feature identifier corresponding to the new communication branch comprises:

6. The method as recited in claim 5, further comprising:

7. The method as recited in claim 6, further comprising:

8. An ID data processing apparatus, comprising:

the initialization module is used for constructing an ID node association diagram according to the acquired ID data, and each node corresponds to one ID data; solving a communication branch in the ID node association graph; coding each communication branch according to a preset coding rule to obtain a unique characteristic identifier corresponding to the communication branch;

the initialization module is specifically configured to: acquiring a large amount of ID data from each service system, wherein the ID data at least comprises an ID type and an ID number; according to the association relation among the ID data, using the association among the IDs as an undirected edge to construct an ID node association graph, wherein each node in the ID node association graph uses own ID data as an attribution ID thereof;

The initialization module is specifically further configured to execute the following steps:

9. The apparatus of claim 8, wherein the initialization module is further specifically configured to:

10. The apparatus according to claim 8 or 9, further comprising:

11. The apparatus of claim 10, wherein the update module is specifically configured to:

12. The apparatus of claim 11, wherein the update module is further configured to perform the following steps:

13. The apparatus as recited in claim 12, further comprising:

14. The apparatus as recited in claim 13, further comprising: