CN113486218A - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113486218A
Application number: CN202111036091.0A
Authority: CN
Inventors: 常霄; 王托; 黎积东; 陈晓倩
Original assignee: Beijing Century TAL Education Technology Co Ltd
Current assignee: Beijing Century TAL Education Technology Co Ltd
Priority date: 2021-09-06
Filing date: 2021-09-06
Publication date: 2021-10-08
Anticipated expiration: 2041-09-06
Also published as: CN113486218B

Abstract

The present disclosure relates to a data processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring user identification data in a database and a first target association relation among the user identification data; constructing a connected graph according to the user identification data and the first target association relation, wherein the connected graph comprises a plurality of connected subgraphs; for each connected subgraph, excluding a root node in the connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph; obtaining the similarity between the at least two sub-connected derivative graphs; and determining a second target association relation of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generating a target connected subgraph based on the second target association relation, so that user identification data corresponding to the same natural person are connected in series and data islands are eliminated.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.

Background

With the gradual maturity of internet technology, people's consumption and behavior habits have changed greatly, and the ways in which people connect to the internet have diversified. In daily life, a user can access the business systems of different domains of a company at any time and place through a mobile phone APP, a PC, a WeChat applet, H5, O2O or the like, in order to browse, query or consult content of interest. Accordingly, different behavior feature data of the same user are generated in the business systems of the different domains of the company. Before being processed, these behavior feature data are isolated from each other, so they cannot be utilized and are inconvenient to manage. To strengthen data management, many companies therefore establish a person-centered 'one person, one file' data management service: the behavior feature data in the various business systems are gathered, and the behavior feature data of the same user across the whole company are then connected in series, so that data islands are eliminated.

In the prior art, user identification data having an association relation are associated by constructing a connected graph; however, when the same connected graph comprises user identification data of different natural persons, the natural persons cannot be distinguished.

Disclosure of Invention

To solve the above technical problem, or at least partially solve it, the present disclosure provides a data processing method and apparatus, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a data processing method, including:

acquiring user identification data in a database and a first target association relation among the user identification data;

constructing a connected graph according to the user identification data and the first target incidence relation, wherein the connected graph comprises a plurality of connected subgraphs, each node in each connected subgraph corresponds to one user identification data, and each connecting line in each connected subgraph corresponds to one first target incidence relation;

for each connected subgraph, excluding a root node in the connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph; obtaining the similarity between the at least two sub-connected derivative graphs; and determining a second target incidence relation of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generating a target connected subgraph based on the second target incidence relation of the sub-connected derivative graphs, wherein the target connected subgraph comprises nodes of at least two sub-connected derivative graphs.

Optionally, the determining a second target association relationship of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generating a target connected subgraph based on the second target association relationship of the sub-connected derivative graphs includes:

determining a second target association relation of each master node in the sub-connected derivative graphs according to the user identification feature similarity between the master nodes in the sub-connected derivative graphs, and generating a target sub-connected derivative graph based on the second target association relation of the sub-connected derivative graphs;

and adding the root node excluded from the connected subgraph to the generated target sub-connected derivative graph to generate a target connected subgraph.

Optionally, the determining, according to the user identification feature similarity between the master nodes in the sub-connected derivative graphs, a second target association relation of each master node in the sub-connected derivative graphs, and generating a target sub-connected derivative graph based on the second target association relation of the sub-connected derivative graphs includes:

when the user identification similarity of the master nodes in the sub-connected derivative graphs meets the preset user identification similarity, establishing a second target association relation between the master nodes;

and generating a target sub-connected derivative graph according to the second target association relation between the master nodes.

Optionally, a second target association relation of each master node corresponding to the slave nodes in the sub-connected derivative graphs is determined according to the user identification feature similarity between the slave nodes in the sub-connected derivative graphs, and a target sub-connected derivative graph is generated based on the second target association relation of the sub-connected derivative graphs.

Optionally, the determining, according to the user identification feature similarity between the slave nodes in the sub-connected derivative graphs, a second target association relation of each master node corresponding to the slave nodes in the sub-connected derivative graphs, and generating a target sub-connected derivative graph based on the second target association relation of the sub-connected derivative graphs includes:

when the user identification similarity of the slave nodes of the sub-connected derivative graphs meets the preset user identification similarity, establishing a second target association relation between the master nodes that have the first target association relation with the slave nodes;

and generating a target sub-connected derivative graph according to the second target association relation between the master nodes that have the first target association relation with the slave nodes.
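The slave-node variant above can be sketched as follows. This is only an illustration under stated assumptions: the node names, the similarity scores, the `master_of` mapping and the use of `>=` for "meets the preset similarity" are all hypothetical and are not specified by the disclosure.

```python
def associate_masters_via_slaves(slave_similarity, master_of, threshold):
    """Link the master nodes of any two slave nodes whose similarity meets the threshold.

    slave_similarity: dict mapping a pair of slave nodes to a similarity score.
    master_of: dict mapping each slave node to the master node it is attached to
               through a first target association relation.
    """
    pairs = set()
    for (slave_a, slave_b), score in slave_similarity.items():
        if score >= threshold:  # assumption: "meets" is read as greater-or-equal
            master_a, master_b = master_of[slave_a], master_of[slave_b]
            if master_a != master_b:
                pairs.add(tuple(sorted((master_a, master_b))))
    return sorted(pairs)


# Hypothetical data, chosen only to exercise the function.
master_of = {"slave x1": "master X", "slave y1": "master Y", "slave z1": "master Z"}
slave_similarity = {
    ("slave x1", "slave y1"): 0.9,
    ("slave x1", "slave z1"): 0.1,
    ("slave y1", "slave z1"): 0.2,
}
second_target = associate_masters_via_slaves(slave_similarity, master_of, 0.75)
```

With these made-up scores only the masters of `slave x1` and `slave y1` become associated.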

Optionally, the obtaining of the user identification data in the database and the first target association relationship between the user identification data includes:

acquiring user identification data in a database and a first association relation among the user identification data;

when a plurality of identical first association relations exist between two user identification data, selecting the first association relation with the highest confidence as the first target association relation.
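A minimal sketch of this selection step is given below. The `Association` record, its `confidence` field and the sample values are hypothetical; the disclosure does not state how confidence is represented or computed.

```python
from typing import NamedTuple


class Association(NamedTuple):
    source: str        # one user identification datum, e.g. "phone 1"
    target: str        # the other user identification datum, e.g. "service 1"
    confidence: float  # hypothetical confidence score of this association


def select_target_associations(associations):
    """Keep, for each unordered pair of identifiers, the highest-confidence association."""
    best = {}
    for assoc in associations:
        key = frozenset((assoc.source, assoc.target))  # direction is ignored
        if key not in best or assoc.confidence > best[key].confidence:
            best[key] = assoc
    return list(best.values())


raw = [
    Association("phone 1", "service 1", 0.6),
    Association("service 1", "phone 1", 0.9),  # same pair, higher confidence
    Association("device 1", "service 3", 0.8),
]
targets = select_target_associations(raw)
```

Treating the pair as unordered is a design choice of this sketch; a directed relation would use a tuple key instead.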

Optionally, after determining the second target association relationship of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs and generating a target connected subgraph based on the second target association relationship of the sub-connected derivative graphs, the method further includes:

assigning a unique identifier to each of the target connected subgraphs.

Optionally, after assigning a unique identifier to each of the target connected subgraphs, the method further includes:

periodically extracting newly added user identification data in a database and a newly added first target association relation among the user identification data;

adding the newly added user identification data as a new node into the connected graph;

connecting, according to the newly added first target association relation, the user identification data having the first target association relation in the connected subgraph through connecting lines;

assigning a unique identifier to a connected subgraph to which no unique identifier is assigned;

when a connected subgraph with two or more unique identifiers exists, one of the two or more unique identifiers is selected as a final unique identifier according to a set rule.

Optionally, the selecting one of the two or more unique identifiers as a final unique identifier according to a set rule includes:

selecting, from the two or more unique identifiers, the one with the earliest allocation time as the final unique identifier.
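The earliest-allocation rule can be illustrated with the short sketch below. The identifier strings and timestamps are hypothetical, and storing the allocation time in a dictionary is an assumption of this example.

```python
from datetime import datetime


def choose_final_identifier(allocated):
    """Return the identifier with the earliest allocation time.

    allocated: dict mapping each unique identifier to its allocation time.
    """
    return min(allocated, key=allocated.get)


# Two previously separate subgraphs were joined by a newly added association,
# so the merged connected subgraph carries two identifiers (hypothetical values).
allocated = {
    "person-0002": datetime(2021, 9, 6, 12, 0),
    "person-0001": datetime(2021, 9, 1, 8, 30),
}
final_id = choose_final_identifier(allocated)
```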

In a second aspect, an embodiment of the present disclosure provides a data processing apparatus, including:

the data acquisition module is used for acquiring user identification data in a database and a first target association relation among the user identification data;

the connected graph constructing module is used for constructing a connected graph according to the user identification data and the first target incidence relation, the connected graph comprises a plurality of connected subgraphs, each node in each connected subgraph corresponds to one user identification data, and each connecting line in each connected subgraph corresponds to one first target incidence relation;

the target connected subgraph generation module is used for eliminating a root node in each connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraphs; obtaining the similarity between the at least two sub-connected derivative graphs; and determining a second target incidence relation of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generating a target connected subgraph based on the second target incidence relation of the sub-connected derivative graphs, wherein the target connected subgraph comprises nodes of at least two sub-connected derivative graphs.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method according to any one of the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to implement the data processing method according to any one of the first aspect when executed by a processor.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:

the data processing method and apparatus, the electronic device and the storage medium provided by the embodiments of the present disclosure construct a plurality of connected subgraphs according to user identification data in a database and a first target association relation among the user identification data, exclude a root node in each connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph, determine a second target association relation of the sub-connected derivative graphs according to the similarity between the at least two sub-connected derivative graphs, and generate a target connected subgraph based on the second target association relation. In this way, when the same connected subgraph includes user identification data of different natural persons, the user identification data corresponding to the different natural persons are distinguished according to the similarity between the sub-connected derivative graphs, and the user identification data corresponding to the same natural person are connected in series, so that data islands are eliminated. The scheme has strong extensibility and low computation cost, effectively addresses the complex identification process, high implementation threshold and poor practical deployability of the prior art, and has high value for popularization and application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is apparent that those skilled in the art can obtain other drawings from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a data processing method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a connected subgraph provided by the embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 4 is a schematic structural diagram of another connected subgraph provided by the embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 6 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 7 is a schematic flow chart diagram of another data processing method provided by the embodiments of the present disclosure;

FIG. 8 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 9 is a schematic flow chart diagram illustrating a further data processing method provided by an embodiment of the present disclosure;

FIG. 10 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 11 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 12 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 13 is a schematic structural diagram of another connected subgraph provided by the embodiments of the present disclosure;

FIG. 14 is a schematic flow chart diagram illustrating a further data processing method provided by an embodiment of the present disclosure;

FIG. 15 is a schematic flow chart diagram illustrating a further data processing method provided by an embodiment of the present disclosure;

FIG. 16 is a schematic flow chart diagram illustrating a further data processing method provided by an embodiment of the present disclosure;

FIG. 17 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 18 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.

Detailed Description

In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.

The technical scheme of the disclosure can be applied to electronic equipment, wherein the electronic equipment can be a computer, a tablet, a mobile phone or other intelligent terminal equipment and the like. The electronic device is provided with a display screen, wherein the display screen can be a touch screen or a non-touch screen, and for the electronic device with the touch screen, a user can realize interactive operation with the electronic device through gestures, fingers or touch tools (such as a stylus pen). For the electronic device without the touch screen, the interactive operation with the electronic device can be realized through an external device (for example, a mouse, a keyboard, a camera or the like) or voice recognition, expression recognition or the like.

The present disclosure does not limit the type of operating system of the electronic device, which may be, for example, an Android system, a Linux system, a Windows system or an iOS system.

FIG. 1 is a schematic flow chart of a data processing method provided in an embodiment of the present disclosure; this embodiment is applicable to data processing scenarios. The method of this embodiment may be executed by a data processing apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device, so as to implement the data processing method described in any embodiment of the present disclosure.

In the prior art, user identification data having an association relation are associated by constructing a connected subgraph, but when the same connected subgraph comprises user identification data of different natural persons, the natural persons cannot be distinguished. For example, when the same phone ID associates the user identification data corresponding to natural person 1 with the user identification data corresponding to natural person 2, the two sets of user identification data in the connected subgraph cannot be told apart. To avoid misidentifying multiple children in a family who share one phone or device as the same natural person, the embodiments of the present disclosure provide a data processing method.

As shown in fig. 1, the method specifically includes the following steps:

S10, obtaining the user identification data in the database and the first target association relation among the user identification data.

The user identification data refers to an identification ID (such as a device ID, a cell phone number, an IP address, an account ID, a certain information ID, a certain public number ID, an encrypted phone, a service ID, and the like) of the user.

Specifically, when the database holds the service data of an education company, a user logs in to different service systems through a mobile phone number. The login may be performed, for example, through a mobile phone APP, a PC, a WeChat applet, H5 or O2O, and the service systems may be, for example, service system 1, service system 2 and service system 3; the different service systems correspond to different identification IDs. After logging in to a service system, the user can participate in different business activities on that system, and the different business activities also correspond to different identification IDs. The user therefore has a plurality of different identification IDs in the service systems of different domains, and in order to eliminate data islands, this embodiment needs to connect the user information data of the user across the whole company in series.

Illustratively, a user logs in to service system 1, service system 2 and service system 3 through mobile phone number 1, and logs in to service system 3 and service system 4 through device 1. The user performs business activity 1 and business activity 2 on service system 1, and business activity 2 on service system 2. In this case, the identification ID corresponding to mobile phone number 1 is phone 1, the identification ID corresponding to device 1 is device 1, and the identification IDs corresponding to service systems 1 to 4 are service 1, service 2, service 3 and service 4, respectively. The identification ID corresponding to business activity 1 on service system 1 is a certain public number ID1 (public number ID1 below), the identification ID corresponding to business activity 2 on service system 1 is service 1 student number 1, and the identification ID corresponding to business activity 2 on service system 2 is service 2 student number 1. The first target association relation then exists between phone 1 and each of service 1, service 2 and service 3, between device 1 and each of service 3 and service 4, between service 1 and each of public number ID1 and service 1 student number 1, and between service 2 and service 2 student number 1.

S20, constructing a connected graph according to the user identification data and the first target incidence relation, wherein the connected graph comprises a plurality of connected subgraphs.

Each node in each connected subgraph corresponds to one user identification data, and each connecting line in each connected subgraph corresponds to one first target incidence relation.

In this embodiment, all the user identification data are regarded as nodes on a connected subgraph one by one, and if an association relationship exists between any two user identification data, the nodes corresponding to the two user identification data with the association relationship on the connected subgraph are connected by one connecting line, so as to form a complete connected subgraph.

Illustratively, the connected subgraph formed by phone 1, device 1, service 1, service 2, service 3, service 4, public number ID1, service 1 student number 1 and service 2 student number 1 is shown in FIG. 2.
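Step S20 can be sketched as a connected-components search over the identifiers. The sketch assumes the edges of the example above, with phone 1 read as linked to services 1 to 3; the breadth-first search, the function name and the extra isolated identifier `device 9` are choices of this illustration and are not taken from the disclosure.

```python
from collections import defaultdict, deque


def connected_subgraphs(nodes, edges):
    """Group nodes into connected subgraphs with a breadth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, subgraphs = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), set()
        while queue:
            node = queue.popleft()
            members.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subgraphs.append(members)
    return subgraphs


# First target association relations of the example (one edge per relation).
edges = [
    ("phone 1", "service 1"), ("phone 1", "service 2"), ("phone 1", "service 3"),
    ("device 1", "service 3"), ("device 1", "service 4"),
    ("service 1", "public number ID1"),
    ("service 1", "service 1 student number 1"),
    ("service 2", "service 2 student number 1"),
]
# "device 9" is a hypothetical unrelated identifier, added to show a second subgraph.
nodes = sorted({n for edge in edges for n in edge}) + ["device 9"]
subgraphs = connected_subgraphs(nodes, edges)
```

The nine identifiers of the example end up in one connected subgraph, and the hypothetical `device 9` forms a subgraph of its own.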

S30, for each connected subgraph, eliminating root nodes in the connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph; and acquiring the similarity between at least two sub-connected derivative graphs.

In FIG. 2, the user identification data having the first target association relation are associated through phone 1 and device 1. However, if a family includes two natural persons who both use phone 1 and device 1, the connected subgraph constructed in FIG. 2 includes user identification data of different natural persons but cannot distinguish them. To avoid misidentifying multiple children in a family who share one phone or device as the same natural person, after the plurality of connected subgraphs are constructed, the root nodes in each connected subgraph are excluded. The root nodes of the connected subgraph in FIG. 2 are phone 1 and device 1; excluding them yields the sub-connected derivative graphs corresponding to the connected subgraph. As shown in FIG. 3, the connected subgraph in FIG. 2 yields four sub-connected derivative graphs A, B, C and D. The master node of sub-connected derivative graph A is service 1, and its slave nodes are public number ID1 and service 1 student number 1; the master node of sub-connected derivative graph B is service 2, and its slave node is service 2 student number 1; the master node of sub-connected derivative graph C is service 3; and the master node of sub-connected derivative graph D is service 4.

After the at least two sub-connected derivative graphs corresponding to the connected subgraph in FIG. 2 are obtained, the similarity between them is obtained, specifically the similarity between each pair of sub-connected derivative graphs: A and B, A and C, A and D, B and C, B and D, and C and D.

It should be noted that obtaining the similarity between the at least two sub-connected derivative graphs may mean obtaining the similarity between the master nodes of the at least two sub-connected derivative graphs, or obtaining the similarity between their slave nodes.

Illustratively, as shown in FIG. 4, the similarity between the master node of sub-connected derivative graph A and the master node of sub-connected derivative graph B is 0.25, the similarity between the master nodes of A and C is 1, the similarity between the master nodes of A and D is 0, the similarity between the master nodes of B and C is 0, the similarity between the master nodes of B and D is 1, and the similarity between the master nodes of C and D is 0.

In addition, the similarity between the slave nodes of the sub-connected derivative graphs can also be acquired.
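The root-exclusion step can be sketched as follows. Two points are assumptions of this illustration rather than statements of the disclosure: the root nodes are passed in as given (how they are identified is not described here), and a master node is taken to be any remaining node that was directly linked to a root, which matches the example but is not stated as a definition.

```python
def sub_connected_derivative_graphs(edges, roots):
    """Drop the root nodes and return the remaining components with master and slave nodes."""
    roots = set(roots)
    adjacency, masters = {}, set()
    for a, b in edges:
        if a in roots and b in roots:
            continue
        if a in roots or b in roots:
            kept = b if a in roots else a
            adjacency.setdefault(kept, set())  # keep the node, drop the edge to the root
            masters.add(kept)                  # assumption: neighbours of a root are masters
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, graphs = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # depth-first search over the root-free graph
            node = stack.pop()
            members.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        graphs.append({"masters": members & masters, "slaves": members - masters})
    return graphs


edges = [
    ("phone 1", "service 1"), ("phone 1", "service 2"), ("phone 1", "service 3"),
    ("device 1", "service 3"), ("device 1", "service 4"),
    ("service 1", "public number ID1"),
    ("service 1", "service 1 student number 1"),
    ("service 2", "service 2 student number 1"),
]
graphs = sub_connected_derivative_graphs(edges, roots=["phone 1", "device 1"])
```

For the example edges, the four returned entries correspond to sub-connected derivative graphs A, B, C and D in that order, because Python dictionaries preserve insertion order.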

S40, determining a second target incidence relation of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generating a target connected subgraph based on the second target incidence relation of the sub-connected derivative graphs.

With reference to FIG. 4, since the similarity between the master node of sub-connected derivative graph A and the master node of sub-connected derivative graph C is 1, and the similarity between the master node of sub-connected derivative graph B and the master node of sub-connected derivative graph D is 1, a second target association relation exists between the master nodes of A and C, and between the master nodes of B and D. The target sub-connected derivative graphs generated based on these second target association relations are shown in FIG. 5. After the target sub-connected derivative graphs are obtained, the excluded root nodes are added back to them to generate the target connected subgraphs, as shown in FIG. 6. Corresponding to the connected subgraph of FIG. 2, the generated target connected subgraphs include target connected subgraph A and target connected subgraph B: target connected subgraph A corresponds to the user identification data of one natural person in the family, and target connected subgraph B corresponds to the user identification data of the other natural person in the family.

The data processing method provided by the embodiment of the present disclosure constructs a plurality of connected subgraphs according to user identification data in a database and a first target association relation among the user identification data, excludes a root node in each connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph, determines a second target association relation of the sub-connected derivative graphs according to the similarity between the at least two sub-connected derivative graphs, and generates a target connected subgraph based on the second target association relation. In this way, when the same connected subgraph comprises user identification data of different natural persons, the user identification data corresponding to the different natural persons are distinguished according to the similarity between the sub-connected derivative graphs, and the user identification data corresponding to the same natural person are connected in series, so that data islands are eliminated. The method has strong extensibility and low computation cost, effectively addresses the complex identification process, high implementation threshold and poor practical deployability of the prior art, and has high value for popularization and application.

FIG. 7 is a schematic flow chart of another data processing method provided in an embodiment of the present disclosure. This embodiment builds on the foregoing embodiment, and one implementation of step S40 includes:

S41, determining a second target association relation of each master node in the sub-connected derivative graphs according to the user identification feature similarity between the master nodes in the sub-connected derivative graphs, and generating a target sub-connected derivative graph based on the second target association relation of the sub-connected derivative graphs.
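Adding the excluded root nodes back can be sketched as below. The rule used here, namely that a root is attached to every target sub-connected derivative graph containing a node it had a first target association relation with, is an assumption of this illustration; the disclosure only states that the excluded root nodes are added. The node labels follow the example, with phone 1 read as linked to services 1 to 3.

```python
def add_roots_back(target_sub_graphs, edges, roots):
    """Attach each root to every merged graph that contains one of its former neighbours."""
    roots = set(roots)
    target_connected = []
    for members in target_sub_graphs:
        members = set(members)
        linked_roots = set()
        for a, b in edges:
            if a in roots and b in members:
                linked_roots.add(a)
            elif b in roots and a in members:
                linked_roots.add(b)
        target_connected.append(members | linked_roots)
    return target_connected


edges = [
    ("phone 1", "service 1"), ("phone 1", "service 2"), ("phone 1", "service 3"),
    ("device 1", "service 3"), ("device 1", "service 4"),
    ("service 1", "public number ID1"),
    ("service 1", "service 1 student number 1"),
    ("service 2", "service 2 student number 1"),
]
# Merged graphs A+C and B+D, as in the example.
target_sub_graphs = [
    {"service 1", "public number ID1", "service 1 student number 1", "service 3"},
    {"service 2", "service 2 student number 1", "service 4"},
]
target_a, target_b = add_roots_back(target_sub_graphs, edges, ["phone 1", "device 1"])
```

Under this rule both shared roots, phone 1 and device 1, appear in each of the two target connected subgraphs, while the service-level identifiers stay separated per natural person.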

Optionally, when the user identifier similarity of the master node in the sub-connected derivative graph meets a preset user identifier similarity, a second target association relationship between the master nodes is established.

When a user logs in different service systems through a mobile phone number or equipment, because the user information is filled in the login information as an unnecessary item, the information filled in when the user logs in the service systems is different.

For example, in the process that a user logs in a service system 1, a service system 2 and a service system 3 respectively through a mobile phone number 1, and the user logs in the service system 3 and the service system 4 respectively through a device 1, if the user logs in the service system 1 through the mobile phone number 1, a name is filled: the usual age, age: 10, grade: grade four, city: beijing information, a user fills in a name when logging in a business system 2 through a mobile phone number 1: second, age: 6, grade: grade information, which is filled in grade when logging in the service system 3 through the mobile phone number 1: the information of the four grades, the grade is filled in when the device 1 logs in the service system 4: information of grade one.

As shown in fig. 8, the sub-connected derivative graphs corresponding to the connected subgraph in fig. 2 include four sub-connected derivative graphs A, B, C and D. The user identifier of the master node of sub-connected derivative graph A is name: the usual age, age: 10, grade: grade four, city: Beijing; the user identifier of the master node of sub-connected derivative graph B is name: second, age: 6, grade: grade one; the user identifier of the master node of sub-connected derivative graph C is grade: grade four; and the user identifier of the master node of sub-connected derivative graph D is grade: grade one. The user identification feature similarity between the master nodes of sub-connected derivative graphs A, B, C and D is shown in fig. 4. The similarity between the master nodes of A and B is 0.25, between the master nodes of A and C is 1, between the master nodes of A and D is 0, between the master nodes of B and C is 0, between the master nodes of B and D is 1, and between the master nodes of C and D is 0.

If the preset user identification similarity is 75%, the master nodes of sub-connected derivative graphs A and C meet the preset similarity, as do the master nodes of sub-connected derivative graphs B and D. A second target association relationship therefore exists between the master node of sub-connected derivative graph A and the master node of sub-connected derivative graph C, and between the master node of sub-connected derivative graph B and the master node of sub-connected derivative graph D. The target sub-connected derivative graph generated based on these second target association relationships is shown in fig. 5.
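The thresholding in this example can be illustrated with a minimal Python sketch. It takes the pairwise similarity values quoted above as given input and does not attempt to reproduce how the similarity itself is computed. The function name `link_master_nodes` and the dictionary layout are assumptions made for illustration only.

```python
from itertools import combinations

# Pairwise master-node similarities quoted in the example (graphs A-D).
SIMILARITY = {
    ("A", "B"): 0.25,
    ("A", "C"): 1.0,
    ("A", "D"): 0.0,
    ("B", "C"): 0.0,
    ("B", "D"): 1.0,
    ("C", "D"): 0.0,
}


def link_master_nodes(graph_names, similarity, threshold):
    """Return the pairs whose similarity meets the preset threshold.

    Hypothetical helper: each returned pair stands for one second
    target association relationship between two master nodes.
    """
    links = []
    for a, b in combinations(sorted(graph_names), 2):
        if similarity.get((a, b), 0.0) >= threshold:
            links.append((a, b))
    return links


links = link_master_nodes(["A", "B", "C", "D"], SIMILARITY, threshold=0.75)
print(links)
```

With the 75% threshold, only the pairs A and C and B and D are linked, which matches the second target association relationships described in the example.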

S42, adding the root node excluded from the connected subgraph to the generated target sub-connected derivative graph to generate a target connected subgraph.

After the second target association relationship of each master node in the sub-connected derivative graphs is determined according to the user identification feature similarity between the master nodes and the target sub-connected derivative graph is generated, the root nodes excluded from the connected subgraph are added to the generated target sub-connected derivative graph to generate the target connected subgraph. Specifically, the root nodes excluded from the connected subgraph in fig. 2 are phone 1 and device 1, where phone 1 has a first target association relationship with service 1, service 2 and service 3, respectively, and device 1 has a first target association relationship with service 3 and service 4, respectively. The target connected subgraph is generated by adding phone 1 and device 1 to the target sub-connected derivative graph generated in fig. 5, as shown in fig. 6.
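As a rough illustration of step S42, the following Python sketch restores the excluded root nodes to a graph that holds only second target association relationships. It assumes, for illustration only, that the master nodes of sub-connected derivative graphs A, B, C and D are the nodes for service 1 to service 4, and it labels each edge with the kind of association it represents. The names `add_edge`, `restore_roots` and `ROOT_EDGES` are hypothetical.

```python
from collections import defaultdict


def add_edge(graph, u, v, kind):
    """Add an undirected edge labelled with its association kind."""
    graph[u][v] = kind
    graph[v][u] = kind


# Target sub-connected derivative graph: second target associations only.
# Assumption: the master nodes of A, B, C, D are services 1, 2, 3, 4.
target = defaultdict(dict)
add_edge(target, "service 1", "service 3", "second")  # A linked to C
add_edge(target, "service 2", "service 4", "second")  # B linked to D

# Root nodes excluded earlier, with the first target associations
# listed in the text above.
ROOT_EDGES = {
    "phone 1": ["service 1", "service 2", "service 3"],
    "device 1": ["service 3", "service 4"],
}


def restore_roots(graph, root_edges):
    """Re-attach each excluded root node through its first target edges."""
    for root, neighbours in root_edges.items():
        for neighbour in neighbours:
            add_edge(graph, root, neighbour, "first")
    return graph


target_connected = restore_roots(target, ROOT_EDGES)
print(sorted(target_connected["phone 1"]))
```

Keeping the edge kind on every connecting line lets later steps tell the original first target associations apart from the similarity-derived second target associations.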

The data processing method provided by the embodiment of the disclosure determines the second target association relationship of each master node in the sub-connected derivative graphs according to the user identification feature similarity between the master nodes in the sub-connected derivative graphs, and generates the target connected subgraph based on the second target association relationship of the sub-connected derivative graphs. In this way, when the same connected subgraph includes user identification data of different natural persons, the user identification data corresponding to different natural persons in the connected subgraph are distinguished according to the similarity between the master nodes in the sub-connected derivative graphs, and the user identification data corresponding to the same natural person can be connected in series, thereby eliminating data islands.

Fig. 9 is a further data processing method provided in an embodiment of the present disclosure, where the present embodiment is based on the foregoing embodiment, and another implementation manner of step S40 includes:

S43, determining a second target association relationship between the master nodes corresponding to the slave nodes in the sub-connected derivative graphs according to the user identification feature similarity between the slave nodes in the sub-connected derivative graphs, and generating a target sub-connected derivative graph based on the second target association relationship of the sub-connected derivative graphs.

Optionally, when the user identification similarity between slave nodes of the sub-connected derivative graphs meets the preset user identification similarity, a second target association relationship is established between the master nodes that have the first target association relationship with those slave nodes.

A target sub-connected derivative graph is then generated according to the second target association relationship between the master nodes that have the first target association relationship with the slave nodes.

When a user logs in to different service systems through a mobile phone number or a device, filling in user information is optional at login, so the user may not fill in any information when logging in to a service system.

Illustratively, suppose a user logs in to service system 1, service system 2 and service system 3 through mobile phone number 1, and logs in to service system 3 and service system 4 through device 1, without filling in any information during login. After logging in, the user follows a public account Open in service system 1, associates student number 123 in service system 1, and associates student number 123 in service system 2.

As shown in fig. 10, the sub-connected derivative graphs corresponding to the connected subgraph in fig. 2 include four sub-connected derivative graphs A, B, C and D. The user identifier of the first slave node ID1 of sub-connected derivative graph A is Open, the user identifier of the second slave node service 1 of sub-connected derivative graph A is 123, and the user identifier of the slave node service 2 of sub-connected derivative graph B is 123. The user identification feature similarity between the slave nodes of sub-connected derivative graph A and the slave nodes of sub-connected derivative graph B is shown in fig. 11, and the similarity between the slave node of sub-connected derivative graph A and the slave node of sub-connected derivative graph B is 1.

If the preset user identification similarity is 75%, the slave node of sub-connected derivative graph A and the slave node of sub-connected derivative graph B meet the preset user identification similarity, so a second target association relationship exists between the master node of sub-connected derivative graph A and the master node of sub-connected derivative graph B. The target sub-connected derivative graph generated based on this second target association relationship is shown in fig. 13.
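The slave-node variant can be sketched in Python as follows. The similarity measure used here (1.0 for identical identifier values, otherwise 0.0) is an assumption chosen so that the two slave nodes with identifier 123 score 1 as in the example; the source does not state the actual measure. The empty slave lists for C and D and the names `slave_similarity` and `link_masters_by_slaves` are likewise illustrative assumptions.

```python
def slave_similarity(id_a, id_b):
    """Assumed measure: 1.0 for identical identifier values, else 0.0."""
    return 1.0 if id_a == id_b else 0.0


def link_masters_by_slaves(slaves_by_graph, threshold):
    """Link the master nodes of two graphs when any pair of their slave
    nodes meets the preset similarity threshold."""
    names = sorted(slaves_by_graph)
    links = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            best = max(
                (
                    slave_similarity(x, y)
                    for x in slaves_by_graph[a]
                    for y in slaves_by_graph[b]
                ),
                default=0.0,  # no slave nodes to compare
            )
            if best >= threshold:
                links.append((a, b))
    return links


# Slave-node identifiers from the example; C and D are assumed empty.
SLAVES = {"A": ["Open", "123"], "B": ["123"], "C": [], "D": []}
master_links = link_masters_by_slaves(SLAVES, threshold=0.75)
print(master_links)
```

Under these assumptions only the master nodes of A and B are linked, consistent with the second target association relationship described above.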

S44, adding the root node excluded from the connected subgraph to the generated target sub-connected derivative graph to generate a target connected subgraph.

After the second target association relationships of the master nodes corresponding to the slave nodes are determined according to the user identification feature similarity between the slave nodes in the sub-connected derivative graphs and the target sub-connected derivative graph is generated based on those second target association relationships, the root nodes excluded from the connected subgraph are added to the generated target sub-connected derivative graph to generate the target connected subgraph. Specifically, the root nodes excluded from the connected subgraph in fig. 2 are phone 1 and device 1, where phone 1 has a first target association relationship with service 1, service 2 and service 3, respectively, and device 1 has a first target association relationship with service 3 and service 4, respectively. The target connected subgraph is generated by adding phone 1 and device 1 to the target sub-connected derivative graph generated in fig. 5, as shown in fig. 12.

According to the data processing method provided by the embodiment of the disclosure, the second target association relationship of each master node corresponding to the slave nodes in the sub-connected derivative graphs is determined according to the user identification feature similarity between the slave nodes in the sub-connected derivative graphs, and the target connected subgraph is generated based on the second target association relationship of the sub-connected derivative graphs. In this way, when the same connected subgraph includes user identification data of different natural persons, the user identification data corresponding to different natural persons in the connected subgraph are distinguished according to the similarity between the slave nodes in the sub-connected derivative graphs, and the user identification data corresponding to the same natural person can be connected in series, thereby eliminating data islands.

Fig. 14 is a schematic flowchart of another data processing method provided in an embodiment of the present disclosure, where the present embodiment is based on the foregoing embodiment, and one implementation manner of step S10 includes:

S11, obtaining the user identification data in the database and the first association relationship among the user identification data.

Illustratively, if a user logs in to service system 3 through device 1 and also through device 2, that is, the user logs in to service system 3 with different devices, a first association relationship exists between service system 3 and each of device 1 and device 2.

S12, when a plurality of same association relations exist between two user identification data, selecting the association relation with higher confidence as a first target association relation.

At this time, the first association relationship between service system 3 and device 1 and the first association relationship between service system 3 and device 2 are the same kind of association relationship, that is, both are association relationships between a device and service system 3. To improve the confidence of the data, the association relationship with the higher confidence is selected as the first target association relationship. The confidence may be based on, for example, the number of logins, the login duration, and the like. For example, if device 1 has more logins and a longer login duration, the confidence of the first association relationship between device 1 and service system 3 is higher, so that relationship is taken as the first target association relationship.
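A minimal Python sketch of step S12 follows. It assumes that confidence is ranked first by login count and then by login duration. The source names both signals but does not say how they are combined, so this ordering, the `Association` record and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One candidate first association relationship (hypothetical record)."""
    source: str
    target: str
    login_count: int
    login_seconds: int


def pick_first_target(candidates):
    """Select the candidate with the highest confidence.

    Assumption: compare by login count, then by login duration.
    """
    return max(candidates, key=lambda a: (a.login_count, a.login_seconds))


# Made-up figures for the two devices that log in to service system 3.
candidates = [
    Association("device 1", "service system 3", login_count=12, login_seconds=5400),
    Association("device 2", "service system 3", login_count=3, login_seconds=600),
]
chosen = pick_first_target(candidates)
print(chosen.source)
```

Any other monotonic confidence score could replace the tuple key without changing the selection logic.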

Fig. 15 is a schematic flowchart of another data processing method provided in the embodiment of the present disclosure, and the present embodiment is based on the foregoing embodiment, where after step S40, the method further includes:

S50, assigning a unique identifier to each target connected subgraph.

After a plurality of target connected subgraphs are generated through the above steps, all nodes having an association relationship are joined by connecting lines, so that originally independent nodes form connected subgraphs. A connected subgraph may contain two or more nodes joined by connecting lines, or only a single node, but all nodes in a connected subgraph are regarded as behavior tracks left by the same user in the business systems of different domains of a company through different connection modes. By assigning a unique identifier to each target connected subgraph, all user identification data corresponding to a user can later be retrieved through the unique identifier when the user identification data of that user is searched.
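The identifier assignment in step S50 can be sketched in Python as a connected-component search followed by numbering. The breadth-first traversal, the `person-0001` identifier format and the sample node names (including `device 9`) are illustrative assumptions and are not taken from the source.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Group nodes into connected subgraphs with a breadth-first search."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


def assign_identifiers(components):
    """Map every node to the identifier of its connected subgraph.

    The identifier format is a hypothetical choice.
    """
    return {
        node: f"person-{index:04d}"
        for index, component in enumerate(components, start=1)
        for node in component
    }


nodes = ["phone 1", "service 1", "service 3", "device 9"]
edges = [("phone 1", "service 1"), ("phone 1", "service 3")]
ids = assign_identifiers(connected_components(nodes, edges))
print(ids)
```

A single-node subgraph such as `device 9` still receives its own identifier, matching the remark that a connected subgraph may contain only one node.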

Fig. 16 is a schematic flowchart of another data processing method provided in the embodiment of the present disclosure, and the present embodiment is based on the foregoing embodiment, where after step S50, the method further includes:

S60, periodically extracting the newly added user identification data in the database and the newly added first target association relationship among the user identification data.

Because new user identification data are added to the business systems of different domains of a company every day, the user identification data need to be maintained periodically, that is, classified into the connected graph. The period may be one hour, twelve hours or one day; its value is not limited in the present application.

S70, adding the newly added user identification data as a new node into the connected graph.

The newly added user identification data obtained periodically may include user identification data and/or first target association relationships that have already been categorized. These are excluded first to obtain the uncategorized user identification data and/or first target association relationships, which are then integrated. Specifically, the uncategorized user identification data are added to the connected graph as new nodes.

S80, connecting the user identification data having the first target association relationship in the connected graph through a connecting line according to the newly added first target association relationship.

After the uncategorized user identification data are added to the connected graph as new nodes, the nodes having a first target association relationship in the connected graph are connected through connecting lines according to the uncategorized first target association relationships. As a result, an originally isolated connected subgraph consisting of a single node is merged into a larger connected subgraph, or two or more connected subgraphs are merged into one larger connected subgraph.
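Steps S60 to S80 can be illustrated with a small Python sketch based on a union-find (disjoint-set) structure, which is one common way to merge connected subgraphs incrementally; the source does not prescribe a particular data structure. The class name `ConnectedGraph` and the sample batch are hypothetical.

```python
class ConnectedGraph:
    """Tracks which nodes belong to the same connected subgraph."""

    def __init__(self):
        self.parent = {}

    def add_node(self, node):
        """Add a node; return False if it was already categorized."""
        if node in self.parent:
            return False
        self.parent[node] = node
        return True

    def find(self, node):
        """Return the representative of the node's connected subgraph."""
        while self.parent[node] != node:
            # Path halving keeps later lookups short.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def connect(self, u, v):
        """Draw a connecting line, merging the two subgraphs if needed."""
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            self.parent[root_v] = root_u

    def same_subgraph(self, u, v):
        return self.find(u) == self.find(v)


# Existing connected graph with two separate connected subgraphs.
graph = ConnectedGraph()
for name in ["phone 1", "service 1", "device 1", "service 4"]:
    graph.add_node(name)
graph.connect("phone 1", "service 1")
graph.connect("device 1", "service 4")

# Periodic batch: "phone 1" is already categorized and is skipped.
new_nodes = ["service 3", "phone 1"]
new_edges = [("phone 1", "service 3"), ("device 1", "service 3")]
added = [name for name in new_nodes if graph.add_node(name)]
for u, v in new_edges:
    graph.connect(u, v)
print(added, graph.same_subgraph("service 1", "service 4"))
```

In this made-up batch the single new node `service 3` bridges the two existing subgraphs, which is the second merging case described above.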

S90, assigning a unique identifier to each connected subgraph that has not been assigned a unique identifier.

A connected subgraph without an assigned unique identifier is a connected subgraph composed of newly added user identification data and/or first target association relationships. Similarly, a stable, unique and persistent identifier is assigned to each such newly added connected subgraph to identify it.

S100, when a connected subgraph with two or more unique identifiers exists, selecting one of the two or more unique identifiers as a final unique identifier according to a set rule.

Since each originally isolated connected subgraph has its own identifier, a merged connected subgraph carries multiple identifiers, which is inconvenient for subsequent data integration and analysis; therefore only one of the identifiers is retained.

Specifically, the one with the earliest allocation time is selected from the two or more unique identifiers as the final unique identifier, or one is randomly selected from the two or more unique identifiers as the final unique identifier.

It should be noted that, taking the allocation time as an example, since identifiers are allocated in order, the rule that an earlier allocation has a higher priority can be followed: in the subsequent merging of connected subgraphs, the identifier with the earliest allocation time, that is, the identifier with the highest priority, is selected from the multiple identifiers as the final identifier.
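The earliest-allocation rule of step S100 can be sketched in Python as follows. The identifier strings and timestamps are made up for illustration, and breaking ties by identifier string is an assumption not stated in the source.

```python
from datetime import datetime


def final_identifier(allocated_at):
    """Return the identifier with the earliest allocation time.

    allocated_at maps identifier -> allocation time. Ties are broken
    by the identifier string (an illustrative assumption).
    """
    return min(allocated_at, key=lambda ident: (allocated_at[ident], ident))


# Identifiers carried by one merged connected subgraph (hypothetical).
merged = {
    "person-0007": datetime(2021, 9, 3, 8, 0),
    "person-0002": datetime(2021, 9, 1, 12, 30),
    "person-0011": datetime(2021, 9, 5, 17, 45),
}
kept = final_identifier(merged)
print(kept)
```

Keeping the oldest identifier means that records already keyed by it need no update after a merge, while the discarded identifiers can be remapped to it.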

Fig. 17 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 17, the data processing apparatus includes:

the data obtaining module 210 is configured to obtain user identification data in a database and a first target association relationship between each user identification data.

The connected graph constructing module 220 is configured to construct a connected graph according to the user identification data and the first target association relationship, where the connected graph includes a plurality of connected subgraphs, each node in each connected subgraph corresponds to one user identification data, and each connecting line in each connected subgraph corresponds to one first target association relationship.

A target connected subgraph generation module 230, configured to, for each connected subgraph, exclude the root node in the connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph; obtain the similarity between the at least two sub-connected derivative graphs; and determine a second target association relationship of the sub-connected derivative graphs according to the similarity between the sub-connected derivative graphs, and generate a target connected subgraph based on the second target association relationship of the sub-connected derivative graphs, wherein the target connected subgraph comprises nodes of at least two sub-connected derivative graphs.

In the data processing apparatus provided in the embodiment of the present disclosure, the connected graph constructing module constructs a plurality of connected subgraphs according to the user identification data in the database acquired by the data obtaining module and the first target association relationship between the user identification data. For each connected subgraph, the target connected subgraph generation module excludes the root node in the connected subgraph to obtain at least two sub-connected derivative graphs corresponding to the connected subgraph, determines the second target association relationship of the sub-connected derivative graphs according to the similarity between the at least two sub-connected derivative graphs, and generates the target connected subgraph based on the second target association relationship of the sub-connected derivative graphs. In this way, when the same connected subgraph includes user identification data of different natural persons, the user identification data corresponding to different natural persons in the connected subgraph are distinguished according to the similarity between the sub-connected derivative graphs, and the user identification data corresponding to the same natural person can be connected in series to eliminate data islands. The apparatus is highly extensible and has a low computation cost, effectively solves the problems of a complex identification process, a high technical implementation threshold and poor practical deployability in the prior art, and has high popularization and application value.

The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

Fig. 18 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure, as shown in fig. 18, the electronic device includes a processor 610, a memory 620, an input device 630, and an output device 640; the number of processors 610 in the computer device may be one or more, and one processor 610 is taken as an example in fig. 18; the processor 610, the memory 620, the input device 630, and the output device 640 in the electronic apparatus may be connected by a bus or other means, and the bus connection is exemplified in fig. 18.

The memory 620 is provided as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. The processor 610 executes various functional applications of the computer device and data processing by executing software programs, instructions and modules stored in the memory 620, namely, implements the method provided by the embodiment of the present invention.

The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 620 may further include memory located remotely from the processor 610, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device 640 may include a display device such as a display screen.

The disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are used to implement the methods provided by the embodiments of the present invention.

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data processing method, comprising:

2. The data processing method according to claim 1, wherein the determining a second target association relationship of the sub-connected derivatives according to the similarity between the sub-connected derivatives, and generating a target connected subgraph based on the second target association relationship of the sub-connected derivatives comprises:

3. The data processing method according to claim 2, wherein the determining a second target association relationship of each main node in the sub-connectivity derivative graph according to the user identification feature similarity between each main node in the sub-connectivity derivative graph, and generating a target sub-connectivity derivative graph based on the second target association relationship of the sub-connectivity derivative graph includes:

4. The data processing method according to claim 1, wherein the determining a second target association relationship of the sub-connected derivatives according to the similarity between the sub-connected derivatives, and generating a target connected subgraph based on the second target association relationship of the sub-connected derivatives comprises:

5. The data processing method according to claim 4, wherein the determining, according to the similarity of the user identification features between the slave nodes in the sub-connectivity derivative graph, a second target association relationship of each master node in the sub-connectivity derivative graph corresponding to the slave node, and generating a target sub-connectivity derivative graph based on the second target association relationship of the sub-connectivity derivative graph includes:

6. The data processing method according to claim 1, wherein the obtaining of the user identification data in the database and the first target association relationship between each of the user identification data comprises:

7. The data processing method according to claim 1, wherein after determining the second target association relationship of the sub-connected derivatives according to the similarity between the sub-connected derivatives, and generating a target connected subgraph based on the second target association relationship of the sub-connected derivatives, the method further comprises:

assigning a unique identifier to each of the target connected subgraphs.

8. The data processing method of claim 7, wherein after assigning a unique identifier to each of the target connected subgraphs, further comprising:

9. The data processing method of claim 8, wherein the selecting one of the two or more unique identifiers as a final unique identifier according to a set rule comprises:

10. A data processing apparatus, comprising:

11. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

when executed by the one or more processors, cause the one or more processors to implement a data processing method as claimed in any one of claims 1 to 9.

12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 9.