CN117952411A - Knowledge graph-based risk prediction method, device, equipment and storage medium

Info

Publication number: CN117952411A
Application number: CN202311486042.6A
Authority: CN
Inventors: 夏志超; 王鑫; 肖冰; 陆全; 蒋宁; 吴海英
Original assignee: Mashang Xiaofei Finance Co Ltd
Current assignee: Mashang Xiaofei Finance Co Ltd
Priority date: 2023-11-08
Filing date: 2023-11-08
Publication date: 2024-04-30

Abstract

The application discloses a knowledge graph-based risk prediction method, apparatus, device and storage medium. The method comprises the following steps: acquiring a target device data table and corresponding target user data, wherein the target device data table comprises device data of N users, the target user data comprises user characteristic data of the N users, and N is an integer greater than 1; converting the target device data table into a target user relationship table based on the device data of the tagged users in the target device data table, wherein the target user relationship table represents the association relationship among the N users; generating knowledge graph data based on the target user relationship table; and performing risk prediction on the unlabeled users among the N users based on the knowledge graph data, the target user data and the risk labels of the tagged users.

Description

Knowledge graph-based risk prediction method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a risk prediction method, apparatus, device, and storage medium based on a knowledge graph.

Background

Risk prediction is an important link in risk control. The difficulty of risk prediction based on big data lies in integrating data from different sources to construct a risk control engine, so that risk cases such as identity falsification, group fraud, and applications packaged by intermediaries are effectively identified.

In the risk prediction scheme in the related art, a risk prediction model is usually trained by using original sample user data, and risk prediction is performed on a user by using the risk prediction model. However, the data volume in the big data scene is huge, the data sources are wide, the process of training the risk prediction model is time-consuming and resource-consuming, and the data from different sources cannot be effectively integrated together, so that the risk prediction result is inaccurate.

Disclosure of Invention

The embodiment of the application aims to provide a risk prediction method, device, equipment and storage medium based on a knowledge graph, which are used for improving risk prediction efficiency and accuracy and reducing resource consumption.

In order to achieve the above object, the embodiment of the present application adopts the following technical scheme:

In a first aspect, an embodiment of the present application provides a risk prediction method based on a knowledge graph, including:

acquiring a target device data table and corresponding target user data, wherein the target device data table comprises device data of N users, the target user data comprises user characteristic data of the N users, and N is an integer greater than 1;

converting the target device data table into a target user relationship table based on the device data of the tagged users in the target device data table, wherein the target user relationship table represents the association relationship among the N users;

generating knowledge graph data based on the target user relationship table;

and performing risk prediction on the unlabeled users among the N users based on the knowledge graph data, the target user data and the risk labels of the tagged users.

In a second aspect, an embodiment of the present application provides a risk prediction apparatus based on a knowledge graph, including:

the acquisition unit is used for acquiring a target device data table and corresponding target user data, wherein the target device data table comprises device data of N users, the target user data comprises user characteristic data of the N users, and N is an integer greater than 1;

the conversion unit is used for converting the target equipment data table into a target user relationship table based on the equipment data of the tagged users in the target equipment data table, wherein the target user relationship table represents the association relationship among the N users;

the construction unit is used for generating knowledge graph data based on the target user relation table;

and the prediction unit is used for predicting risk of unlabeled users in the N users based on the knowledge graph data, the target user data and the risk labels of the labeled users.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the knowledge-graph based risk prediction method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the knowledge-graph based risk prediction method according to the first aspect.

The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:

The device data of different users is used as the basis for associating those users, so that a knowledge graph representing the association relationship among the users is established. Based on the association relationship represented by the knowledge graph and the user characteristic data of the users with risk labels in the knowledge graph, risk prediction is performed on the unlabeled users, which can improve the accuracy and efficiency of risk prediction. On this basis, the construction process of the knowledge graph is improved: the device data of different users is stored in the form of a data table, table operations are performed on the device data table based on the device data of the tagged users, the device data table is converted into a user relationship table representing the association relationship among the users, and the knowledge graph data is generated from the user relationship table. This simulates the traditional knowledge graph construction process while simplifying the graph construction operations and reducing resource consumption, which further improves risk prediction efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

Fig. 1 is a schematic flow chart of a risk prediction method based on a knowledge graph according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a data association process according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of a risk prediction method based on a knowledge graph according to another embodiment of the present application;

Fig. 4 is a schematic diagram of a data reduction strategy according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a risk prediction device based on a knowledge graph according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that embodiments of the application may be practiced in orders other than those illustrated or described herein. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally means that the associated objects are in an "or" relationship.

Partial conceptual description:

Cypher: the query language of the Neo4j graph database.

Clique: a set of nodes with directed edges between each pair.

Isolated node: a node is a basic building block of a network, and an isolated node is a special kind of node. An isolated node is a node that exists alone in a network and has no connection relationship with other nodes. That is, the node is not connected to any other node.

Undirected graph: a graph whose edges have no direction is called an undirected graph.

In the present application, graph analysis is an analysis technique based on a relational network. It can effectively associate user data from different sources (such as structured data and unstructured data), establish a knowledge graph representing the association relationship between different user data, and objectively and intuitively reveal potential risk patterns and rules, which helps improve the accuracy and efficiency of risk prediction. On this basis, the embodiment of the application provides a risk prediction method based on a knowledge graph: the device data of different users is used as the basis for associating those users, and a knowledge graph representing the association relationship among the users is established; then, based on the association relationship represented by the knowledge graph and the user characteristic data of the users with risk labels in the knowledge graph, risk prediction is performed on the unlabeled users, so that the accuracy and efficiency of risk prediction can be improved.

In addition, the application also examines the traditional knowledge graph construction method and finds that it needs to store nodes, relations, attributes and the like in a graph database, then query isolated nodes, clusters and the like through Cypher, construct the association relationships among different users based on the query results, and finally construct the knowledge graph. Because a risk prediction scenario generally involves a large number of user nodes, the traditional construction method makes the whole process cumbersome and consumes a huge amount of resources. For this reason, the embodiment of the application improves the construction process of the knowledge graph on top of the knowledge graph-based risk prediction method: the device data of different users is stored in the form of a data table, table operations are performed on the device data table based on the device data of the tagged users, the device data table is converted into a user relationship table representing the association relationship among the users, and the knowledge graph data is generated from the user relationship table. This simulates the traditional knowledge graph construction process, thereby simplifying the graph construction operations, reducing resource consumption, and further improving risk prediction efficiency.

The risk prediction method based on the knowledge graph provided by the embodiment of the application can be applied to various risk prediction scenes. For example, in an anti-fraud prediction scenario, it may be predicted whether there is a fraud risk for an unlabeled user by the risk prediction method provided by the embodiment of the present application. In another example, in an identity auditing scenario, whether the unlabeled user has an identity counterfeit risk or not can be predicted by the risk prediction method provided by the embodiment of the application. For another example, in a fund service scenario, by using the risk prediction method provided by the embodiment of the present application, it is predicted whether the unlabeled user has a credit risk, so as to determine whether to provide a fund service to the unlabeled user.

It should be understood that the risk prediction method based on the knowledge graph provided by the embodiment of the application can be executed by the electronic device, and in particular can be executed by the processor of the electronic device. The electronic device may be a terminal device such as a smart phone, tablet computer, notebook computer, desktop computer, intelligent voice interaction device, intelligent home appliance, intelligent watch, vehicle terminal, aircraft, etc.; or the electronic device may be a server, such as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing services.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Referring to fig. 1, a risk prediction method based on a knowledge graph according to an embodiment of the present application may include the following steps:

S102, acquiring a target device data table and corresponding target user data.

The target device data table includes device data of N users, N being an integer greater than 1. The N users include tagged users and untagged users. The tagged users have corresponding risk tags that are used to indicate whether the tagged users are risk users. The unlabeled users do not have a corresponding risk label, i.e. the users to be predicted.

The device data of a user represents the devices used by the user and can serve as a basis for associating different users. As one example, the user's device data includes the field value of the user in the target device field. The target device field may be selected according to actual needs and includes, for example, but is not limited to, at least one of the following fields: device identification, emergency contact, device type, reserved phone number, etc. It is worth noting that sensitive information in the device data is obtained after user authorization and desensitization.

In practical application, the fields in the target device data table may be set according to practical needs, which is not limited in the embodiment of the present application. As one example, the fields in the target device data table include a user identification and a target device field, the user identification being used to uniquely identify the user.

The target user data includes user characteristic data of the N users, which can be used as a basis for predicting whether the users are at risk. The user characteristic data may be set according to the actual risk prediction scenario, which is not limited in the embodiment of the present application. For example, in an anti-fraud scenario, the user characteristic data may include, but is not limited to, user name, age, education level, and the like. It is worth noting that sensitive information in the user characteristic data is obtained after user authorization and desensitization.

In the embodiment of the present application, the target device data table and the corresponding target user data may be obtained in various manners, which is not limited in the embodiment of the present application.

In one embodiment, the original device data table may be obtained from the user database as the target device data table. Because the original device data table contains more comprehensive device data, using it as the target device data table for risk prediction can further improve prediction accuracy.

For big data risk prediction scenarios, the original device data table typically contains a huge amount of device data, so performing risk prediction directly on the original device data table would result in low prediction efficiency. In another embodiment, in order to balance prediction accuracy and prediction efficiency, the original device data table may be reduced: device data with little effect on user association is filtered out and representative device data is retained, so that prediction efficiency is improved while prediction accuracy is preserved.

Specifically, the target device data table is obtained by:

S121, acquiring an original device data table from a user database.

The user database stores the device data and user characteristic data of a large number of users. The user database may be a Hive data warehouse. In this case, the original device data table may be obtained from the user database by performing Hive SQL operations on the user database.

The original device data table comprises the field values of M users in K device fields, where M is an integer greater than N and K is an integer greater than 1. In practical application, the K device fields may be set according to actual needs, which is not limited in the embodiment of the present application. For example, in an anti-fraud scenario, the K device fields may include, but are not limited to: bank card reserved phone number, application mobile phone number, emergency contact, device identification, bank credit reserved phone number, and the like. It should be noted that the field values of the device fields are obtained after user authorization and desensitization.

S122, determining a target device field from the K device fields based on the field values, in the K device fields, of the risk users among the M users.

Because the device data of risk users is an important reference in knowledge graph-based risk prediction, determining the target device field from the K device fields based on the risk users' field values helps to accurately distinguish risk users from non-risk users and provides reliable data support for subsequent risk prediction, thereby improving prediction accuracy.

As an example, step S122 may include the following steps: determining the number of risk users corresponding to each of the K device fields based on the field values of the risk users in the K device fields; and selecting, from the K device fields, the device fields whose number of risk users meets a preset number condition as target device fields.

The preset number condition may be set according to actual needs, for example, the number of risk users exceeds a preset number threshold, or the proportion of risk users ranks in the top 40%, which is not limited in the embodiment of the present application.

For example, let K=13, i.e., the K device fields include device field 1 to device field 13, let the preset number condition be that the proportion of risk users ranks in the top 5, and let M=26098. For each device field, the number of risk users whose field value in that device field is not null is counted as the number of risk users corresponding to the device field, and the ratio between this number and M is calculated to obtain the proportion of risk users corresponding to the device field; the device fields 1 to 13 are then sorted in descending order of the proportion of risk users, giving the sorting result shown in Table 1 below; finally, the top 5 device fields, i.e., device field 1, device field 4, device field 3, device field 6 and device field 9, are selected as target device fields.

TABLE 1

By determining the target device field from the K device fields, the dimension of the target device data table may be reduced to improve risk prediction efficiency.
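The counting and ranking in step S122 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the application: the function name `select_target_fields`, the dictionary-based row layout, the sample field names and the sample values are all hypothetical, and empty strings are treated as null.

```python
def select_target_fields(risk_user_rows, device_fields, total_users, top_k):
    """Rank device fields by risk-user proportion and keep the top_k fields.

    risk_user_rows: one dict per risk user, mapping field name -> field value.
    total_users: M, the number of users in the original device data table.
    """
    ratios = {}
    for field in device_fields:
        # Count the risk users whose value in this field is not null.
        count = sum(1 for row in risk_user_rows if row.get(field) not in (None, ""))
        ratios[field] = count / total_users
    # Sort fields from the highest to the lowest risk-user proportion.
    ranked = sorted(device_fields, key=lambda f: ratios[f], reverse=True)
    return ranked[:top_k], ratios


# Hypothetical sample: 3 risk users among M = 10 users, 3 device fields.
SAMPLE_RISK_ROWS = [
    {"device_id": "D1", "reserved_phone": "X", "emergency_contact": None},
    {"device_id": "D2", "reserved_phone": "Y", "emergency_contact": "C"},
    {"device_id": "D3", "reserved_phone": None, "emergency_contact": None},
]
SAMPLE_FIELDS = ["device_id", "emergency_contact", "reserved_phone"]

if __name__ == "__main__":
    print(select_target_fields(SAMPLE_RISK_ROWS, SAMPLE_FIELDS, 10, 2))
```

A "top 5" condition as in the example above would correspond to `top_k=5`; a threshold-based condition would replace the slice with a filter on `ratios`.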

S123, grouping the M users based on the field values of the M users in the target device field, and selecting some of the M users based on the number of users contained in each group.

When the M users are grouped, users with the same field value in the target device field may be placed in the same group, and the number of users included in each group may be counted.
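The grouping and counting described here can be sketched in a few lines of Python. The names `group_users_by_field_value` and `group_sizes`, the row layout and the sample data are hypothetical; skipping users whose field value is null is also an assumption, since the text does not say how such users are handled.

```python
from collections import defaultdict


def group_users_by_field_value(rows, field):
    """Return {field value: set of user ids that share this value}."""
    groups = defaultdict(set)
    for row in rows:
        value = row.get(field)
        if value not in (None, ""):  # assumption: null values form no group
            groups[value].add(row["user_id"])
    return dict(groups)


def group_sizes(groups):
    """Return (field value, user count) pairs, largest group first."""
    return sorted(
        ((value, len(users)) for value, users in groups.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical rows of an original device data table.
SAMPLE_ROWS = [
    {"user_id": "U1", "device_id": "D1"},
    {"user_id": "U2", "device_id": "D1"},
    {"user_id": "U3", "device_id": "D2"},
    {"user_id": "U4", "device_id": None},
]

if __name__ == "__main__":
    print(group_sizes(group_users_by_field_value(SAMPLE_ROWS, "device_id")))
```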

When selecting some of the M users, as an example, the groups obtained by grouping are sorted in descending order of the number of users; each group is traversed based on the sorting result; when the i-th group is reached, a difference threshold corresponding to the i-th group is determined based on the differences in the number of users between adjacent groups among the first i-1 groups, where i is an integer greater than 1; if the difference in the number of users between the i-th group and the (i-1)-th group is greater than the difference threshold, the value i-1 is determined as the target user number corresponding to the target device field; and users of the target user number are selected from the M users.

More specifically, the average of the differences in the number of users between adjacent groups among the first i-1 groups may be used as the difference threshold corresponding to the i-th group.
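One literal reading of the traversal rule can be sketched as follows. This is an illustration under assumptions, not the implementation of the application: the function name `target_group_count` is hypothetical, the threshold is taken as the mean of the adjacent differences among the groups before the i-th one, the traversal starts at i = 3 because at least two earlier groups are needed to form a difference, returning `None` when no group exceeds its threshold is a choice made for the sketch, and the group sizes in the sample are made up.

```python
def target_group_count(user_counts):
    """Return i-1 for the first group i whose gap to group i-1 exceeds the threshold.

    user_counts: number of users per group, in any order.
    Returns None when no group exceeds its threshold (assumption of this sketch).
    """
    counts = sorted(user_counts, reverse=True)
    # diffs[j] is the difference between group j+1 and group j+2 (1-based groups).
    diffs = [counts[j] - counts[j + 1] for j in range(len(counts) - 1)]
    for i in range(3, len(counts) + 1):
        earlier = diffs[: i - 2]              # adjacent differences within groups 1..i-1
        threshold = sum(earlier) / len(earlier)
        gap = diffs[i - 2]                    # difference between group i-1 and group i
        if gap > threshold:
            return i - 1
    return None


# Made-up group sizes: the gap between the 3rd and 4th group stands out.
SAMPLE_COUNTS = [100, 95, 90, 40, 38]

if __name__ == "__main__":
    print(target_group_count(SAMPLE_COUNTS))
```

For the made-up sizes above, the differences are 5, 5, 50 and 2, so the rule fires at i = 4 and the sketch returns 3.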

For example, table 2 below shows 14 groups obtained by grouping M users, the number of users included in each group, and the arrangement order of each group.

TABLE 2

Order of arrangement	Number of users
Group 1	444076665
Group 2	141643452
Group 3	51649567
Group 4	25595189
Group 5	13719172
Group 6	8088778
Group 7	5075675
……	……
Group 14	732950

By traversing and analyzing Table 2, it is determined that the difference in the number of users between the 6th group and the 5th group is far smaller than the differences in the number of users between adjacent groups among the first 5 groups; accordingly, the target user number corresponding to the target device field is determined to be 5.

It should be noted that, if there is more than one target device field, then for each target device field the M users are grouped by their field values in that field in the above manner, the target user number corresponding to that field is determined based on the number of users included in each group, and users of that target user number are selected from the M users.

As another example, a corresponding user proportion may also be determined for each group based on the number of users included in each group, e.g., a group with a larger number of users may correspond to a larger proportion of users and a group with a smaller number of users may correspond to a smaller proportion of users; the users of the corresponding user proportion are then randomly extracted from each group of users.

In step S123, the M users are grouped based on their field values in the target device field and the number of users in each group is counted, so that field values shared by an abnormally large number of users can be accurately identified; some of the M users are then selected based on the number of users in each group, which determines a reasonable sampling size, i.e., the number of users extracted from the M users, and reduces the data scale.

S124, filtering the original device data table based on the selected users and the target device field to obtain the target device data table.

As an example, the selected users may be used as the N users in the target device data table, and the field values of these users in the target device field may be screened from the original device data table as the device data of the N users, so as to obtain the target device data table.

As another example, user characteristic data of a portion of users is obtained from a user database; determining N users with user characteristic data meeting preset characteristic conditions from part of users; and selecting field values of N users in the target equipment field from the original equipment data table as equipment data of the N users to obtain the target equipment data table.

The preset feature condition may be set according to actual needs, for example, the user feature data is a non-null value, which is not limited in the embodiment of the present application.
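The second filtering example can be sketched as below. This is a simplified illustration: the function name `build_target_device_table`, the row layout and the sample values are hypothetical, and the preset feature condition is assumed to be "the user characteristic data is non-null", as in the example just given.

```python
def build_target_device_table(original_rows, selected_users, target_fields, user_features):
    """Keep selected users with non-null feature data and project the target fields."""
    kept_users = {
        user_id for user_id in selected_users
        if user_features.get(user_id)  # assumed condition: feature data is non-null
    }
    table = []
    for row in original_rows:
        if row["user_id"] in kept_users:
            projected = {"user_id": row["user_id"]}
            projected.update({field: row.get(field) for field in target_fields})
            table.append(projected)
    return table


# Hypothetical original device data table and user characteristic data.
SAMPLE_ORIGINAL_ROWS = [
    {"user_id": "U1", "device_id": "D1", "device_type": "phone"},
    {"user_id": "U2", "device_id": "D2", "device_type": "tablet"},
    {"user_id": "U3", "device_id": "D3", "device_type": "phone"},
]
SAMPLE_FEATURES = {"U1": {"age": 30}, "U2": None}

if __name__ == "__main__":
    print(build_target_device_table(
        SAMPLE_ORIGINAL_ROWS, ["U1", "U2"], ["device_id"], SAMPLE_FEATURES))
```

In the sample, U2 is dropped because it has no feature data and U3 is dropped because it was not among the selected users, so only U1's `device_id` remains.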

In this example, some of the users selected from the original device data table may not have generated any specific business behavior, for example relatives or friends who appear only as emergency contacts. Because the user characteristic data is an important input for risk prediction based on the knowledge graph data, all users in the knowledge graph data must have valid node characteristic data. For this reason, the selected users are screened again by checking whether their user characteristic data meets the preset feature condition, which yields the N users, and the field values of the N users in the target device field are used as the device data of the N users. This ensures that the node characteristic data of all users in the subsequently generated knowledge graph data is valid, which improves prediction accuracy, and it further reduces the scale of the target device data table, which improves prediction efficiency.

It should be noted that, in practical application, the user database is dynamically updated. To ensure the timeliness and accuracy of prediction, the original device data table may be a full table that is partitioned by time and updated daily, i.e., the original device data table includes the field values of the M users in the K device fields for each day. Accordingly, the target device data table obtained from the original device data table is also a daily updated table partitioned by time, i.e., the target device data table includes the device data of the N users for each day. On this basis, daily risk users can be predicted based on the target device data table.

S104, converting the target device data table into a target user relation table based on the device data of the tagged user in the target device data table.

The target user relationship table represents an association relationship between N users.

The amount of data in the target device data table is still large and is dynamically updated, while the device data of the tagged users is small in scale. Because users with the same device data are generally associated, analyzing the association relationship among the N users through the device data of the tagged users and converting the target device data table into the target user relationship table can improve graph construction efficiency, accurately simulate the graph construction process, and improve graph construction accuracy.

In an embodiment, the step S104 may include the following steps:

S141, generating a device data sub-table based on the device data of the tagged user.

As an example, a data table is created that contains user identifications and target device fields, user identifications of tagged users are written into the user identification fields, field values of the tagged users in the target device fields are written into the target device fields, and a device data sub-table is obtained. For example, the following Table 3 shows a device data sub-table.

TABLE 3

User identification	Target device field
U2	P1
U2	P3
U4	P3
U4	P5

In practical applications, the target device data table may include user identities and device data of N users. In this case, the target device data table may be queried based on the user identification of the tagged user, resulting in the user identification of the tagged user and the device data.

For example, table 4 shows a target device data table.

TABLE 4

Referring to Fig. 2, the user identifiers of the tagged users include U2 and U4; the field values P1 and P3 corresponding to the user identifier U2 in the target device field, and the field values P3 and P5 corresponding to the user identifier U4 in the target device field, may then be obtained from the target device data table shown in Table 4.

S142, taking identical device data as a first connection condition, performing a join operation on the user identifiers of the tagged users in the device data sub-table and the user identifiers of the N users in the target device data table to obtain a first user relationship table.

The first user relationship table is used for representing the association relationship between the tagged users and the N users.

As an example, the device data sub-table may be used as the left table (table_name1) and the target device data table as the right table (table_name2); with identical device data as the first connection condition, i.e., on table_name1.target device field = table_name2.target device field, a left join statement is used to join the user identifiers in the device data sub-table with the user identifiers in the target device data table, yielding the first user relationship table.

For example, taking the device data sub-table shown in Table 3 and the target device data table shown in Table 4, and taking identical field values in the target device field as the first connection condition, the first user relationship table shown in Table 5 is obtained by performing a left join operation on the device data sub-table and the target device data table. The join of Table 3 and Table 4, shown in Fig. 2, simulates the process by which the traditional knowledge graph construction method establishes the association relationship between the tagged users U2 and U4 and the users U1 to U5.

TABLE 5
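The left join of step S142 can be tried out with an in-memory SQLite database, as sketched below. This is an illustration under assumptions, not the Hive SQL of the application: the table and column names (`target_device`, `device_sub`, `user_id`, `device_value`) and the function name `first_user_relation` are hypothetical, and the sample rows are invented data that merely agrees with Table 3 for users U2 and U4.

```python
import sqlite3

# Invented (user_id, device_value) rows; only U2 and U4 follow Table 3.
SAMPLE_TARGET_ROWS = [
    ("U1", "P1"), ("U1", "P2"),
    ("U2", "P1"), ("U2", "P3"),
    ("U3", "P2"), ("U3", "P4"),
    ("U4", "P3"), ("U4", "P5"),
    ("U5", "P4"), ("U5", "P5"),
]
SAMPLE_TAGGED_USERS = ["U2", "U4"]


def first_user_relation(target_rows, tagged_users):
    """Left-join the tagged users' device rows with all rows on equal device value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_device (user_id TEXT, device_value TEXT)")
    conn.execute("CREATE TABLE device_sub (user_id TEXT, device_value TEXT)")
    conn.executemany("INSERT INTO target_device VALUES (?, ?)", target_rows)
    # Device data sub-table: the rows that belong to tagged users.
    marks = ",".join("?" for _ in tagged_users)
    conn.execute(
        "INSERT INTO device_sub SELECT user_id, device_value "
        f"FROM target_device WHERE user_id IN ({marks})",
        list(tagged_users),
    )
    rows = conn.execute(
        "SELECT DISTINCT s.user_id, t.user_id "
        "FROM device_sub AS s LEFT JOIN target_device AS t "
        "ON s.device_value = t.device_value "
        "ORDER BY s.user_id, t.user_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for pair in first_user_relation(SAMPLE_TARGET_ROWS, SAMPLE_TAGGED_USERS):
        print(pair)
```

Each returned pair links a tagged user to a user that shares at least one device value with it; `DISTINCT` is added in the sketch to collapse pairs that share more than one value.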

S143, taking identical device data as a second connection condition, performing a self-join operation on the user identifiers of the N users in the target device data table to obtain a second user relationship table.

The second user relationship table is used for representing the association relationship among the N users.

As an example, the target device data table may be used as both the left table (table_name1) and the right table (table_name2); with identical device data as the second connection condition, i.e., on table_name1.target device field = table_name2.target device field, a left join statement is used to join the left table and the right table, yielding the second user relationship table.

For example, taking the target device data table shown in Table 4, and taking identical field values in the target device field as the second connection condition, the second user relationship table shown in Table 6 is obtained by performing a left join operation on two copies of the target device data table. The self-join of Table 4, shown in Fig. 2, simulates the process by which the traditional knowledge graph construction method establishes the association relationship among the users U1 to U5.

TABLE 6

User identification 3	User identification 4
U1	U1
U1	U2
U1	U3
U2	U1
U2	U4
U3	U1
U3	U5
U4	U2
U4	U5
U5	U3
U5	U4
U5	U5

S144, the first user relation table and the second user relation table are connected to obtain a target user relation table.

Because the first user relationship table reflects the association relationships between the tagged users and the N users, and the second user relationship table reflects the association relationships among the N users, the target user relationship table can be obtained by joining the first user relationship table and the second user relationship table on identical user identifiers.

As one example, the first user relationship table includes a first start point user identifier representing a tagged user and a first end point user identifier representing one of the N users. The second user relationship table includes a second start point user identifier representing one of the N users and a second end point user identifier representing a user associated with the user having the second start point user identifier.

Accordingly, step S144 includes: taking identical first end point user identifier and second start point user identifier as a third connection condition, performing a connection operation on the first user relationship table and the second user relationship table to obtain the target user relationship table, wherein the target user relationship table comprises the first end point user identifier, the second start point user identifier and the second end point user identifier.
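As an illustration of the third connection condition, the following sqlite3 sketch joins a first and a second user relationship table on "first end point user identifier = second start point user identifier". The table names (first_rel, second_rel), the column names user_id_1 to user_id_4 and the rows are hypothetical and are not taken from tables 5 to 7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- user_id_1: first start point, user_id_2: first end point
    CREATE TABLE first_rel (user_id_1 TEXT, user_id_2 TEXT);
    -- user_id_3: second start point, user_id_4: second end point
    CREATE TABLE second_rel (user_id_3 TEXT, user_id_4 TEXT);
    INSERT INTO first_rel VALUES ('U2', 'U1'), ('U2', 'U2');
    INSERT INTO second_rel VALUES ('U1', 'U2'), ('U1', 'U3'), ('U2', 'U1');
""")

# Third connection condition: first end point id equals second start point id.
target_relation = conn.execute("""
    SELECT f.user_id_1, f.user_id_2, s.user_id_3, s.user_id_4
    FROM first_rel AS f
    LEFT JOIN second_rel AS s
      ON f.user_id_2 = s.user_id_3
    ORDER BY f.user_id_1, f.user_id_2, s.user_id_4
""").fetchall()

for row in target_relation:
    print(row)
```

Each resulting row chains a tagged user to a directly associated user and then to that user's own associates, so one row carries a two-hop path.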

For example, taking the first user relationship table shown in table 5 and the second user relationship table shown in table 6, user identification 1 is the first start point user identifier, user identification 2 is the first end point user identifier, user identification 3 is the second start point user identifier, and user identification 4 is the second end point user identifier. On this basis, taking identical user identification 2 and user identification 3 as the third connection condition, a left join operation on the first user relationship table and the second user relationship table yields the target user relationship table shown in table 7. The process of connecting table 5 and table 6, shown in fig. 2, can simulate the process of establishing the association relationships among the users U1 to U5 in the conventional knowledge graph construction manner.

TABLE 7

An embodiment of the present application is herein shown as a specific implementation of S104 described above. Of course, it should be understood that S104 may be implemented in other manners, which are not limited by the embodiment of the present application.

And S106, generating knowledge graph data based on the target user relation table.

Knowledge graph data is typically represented by a node set V and an edge set E, commonly denoted G = (V, E). Each of the N nodes in V represents a user; each edge in E connects two nodes vi and vj and is represented by the node pair (vi, vj).

In an embodiment, the step S106 may include the following steps: converting the target user relationship table into a plurality of user relationship sub-tables, each user relationship sub-table comprising two of a first start point user identifier, a first end point user identifier, a second start point user identifier and a second end point user identifier; then, knowledge-graph data is generated based on the plurality of user relationship sub-tables.

For example, taking the target user relationship table shown in table 7 as an example, splitting the target user relationship table can obtain the following 3 user relationship sub-tables and fields contained in each user relationship sub-table:

User relationship sub-table 1: user identification 1, user identification 2;

User relationship sub-table 2: user identification 2, user identification 3;

User relationship sub-table 3: user identification 3, user identification 4.

Further, each user is used as a node, and a node set V is generated; for each user relation sub-table, generating a corresponding connection edge based on each record in the user relation sub-table, thereby obtaining an edge set E.

In this embodiment, the target user relationship table is split into a plurality of user relationship sub-tables, and each user relationship sub-table only includes a combination of two user identifiers, so that one piece of data in each user relationship sub-table is a connection edge between two nodes, and further knowledge graph data can be quickly generated based on the plurality of user relationship sub-tables, without complex data format conversion.
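The splitting step and the construction of V and E can be sketched in plain Python as follows. The rows are hypothetical; treating edges as undirected and discarding self-loops are assumptions made for this sketch (the description only states that an undirected graph may be used).

```python
from itertools import chain

# Hypothetical rows of a target user relationship table:
# (user id 1, user id 2, user id 3, user id 4)
target_relation = [
    ("U2", "U1", "U1", "U3"),
    ("U2", "U2", "U2", "U4"),
]

# Column index pairs defining the three user relationship sub-tables.
SUB_TABLE_COLUMNS = [(0, 1), (1, 2), (2, 3)]

sub_tables = [
    [(row[a], row[b]) for row in target_relation]
    for a, b in SUB_TABLE_COLUMNS
]

# Node set V: every user appearing anywhere in the table.
nodes = set(chain.from_iterable(target_relation))

# Edge set E: one undirected edge per sub-table record.
# Assumption: records linking a user to itself are skipped.
edges = {
    frozenset(pair)
    for table in sub_tables
    for pair in table
    if pair[0] != pair[1]
}

print(sorted(nodes))
print(sorted(tuple(sorted(e)) for e in edges))
```

Because every sub-table record is already a pair of user identifiers, no further format conversion is needed before the records become edges.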

An embodiment of the present application is shown here as a specific implementation of S106. Of course, it should be understood that S106 may be implemented in other manners, which are not limited by the embodiment of the present application. For example, each user is taken as a node to generate the node set V; for each user, the target user relationship table is queried to obtain the other users associated with that user, and corresponding connecting edges are created between the nodes of users that have an association relationship, thereby obtaining the edge set E.

In practical applications, the knowledge graph represented by the knowledge graph data may have various forms, such as directed graph, undirected graph, and the like, which is not limited in the embodiment of the present application. As an example, an undirected graph may be employed in embodiments of the present application.

S108, performing risk prediction on unlabeled users in the N users based on the knowledge graph data, the target user data and the risk labels of the labeled users.

In an embodiment, the step S108 may include the following steps: encoding the user characteristic data of the unlabeled users based on at least one encoding strategy to obtain node characteristic data corresponding to the unlabeled users in the knowledge-graph data; encoding the user characteristic data and the risk labels of the tagged users based on at least one encoding strategy to obtain node characteristic data corresponding to the tagged users in the knowledge graph data; and carrying out risk prediction on the unlabeled users based on the graph neural network algorithm, the knowledge graph data and the node characteristic data of N users in the knowledge graph.

In connection with fig. 3, as an example, to accommodate user characteristic data from different sources, the at least one encoding strategy may include one-hot encoding (One-hot) and encoding by a feature representation model. For example, if the user characteristic data is continuous data, it may be one-hot encoded to obtain the corresponding node characteristic data; if the user characteristic data is discrete data, it may be encoded by the feature representation model to obtain the corresponding node characteristic data. The feature representation model may be obtained by training LightGBM on sample user characteristic data and the corresponding node characteristic data.
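One-hot encoding, one of the strategies named above, maps each distinct value of a field to a binary indicator vector. A minimal sketch in plain Python follows; it assumes a hypothetical field whose values have already been reduced to a small set of labels, and it does not show the LightGBM-based feature representation model.

```python
def one_hot_encode(values):
    """Return the sorted category list and one indicator vector per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # exactly one position is set per value
        vectors.append(vector)
    return categories, vectors


# Hypothetical field values for four users.
categories, vectors = one_hot_encode(["low", "high", "low", "mid"])
print(categories)
print(vectors)
```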

In the embodiment of the present application, the graph neural network algorithm may be any of the graph neural network algorithms commonly used in the art, such as the graph convolutional network (Graph Convolutional Network, GCN) and the GraphSAGE algorithm. The algorithm may be selected according to actual needs, which is not limited by the embodiment of the present application.

In connection with fig. 3, as an example, the graph neural network algorithm is the GraphSAGE algorithm. The key to GraphSAGE is to replace sampling over the entire knowledge graph data with sampling over the neighbor nodes of the current node. During sampling, the neighbor nodes of the target node are randomly sampled, and the number of neighbors sampled at the k-th hop is at most Sk; for example, the first hop samples 3 neighbor nodes and the second hop samples 5 neighbor nodes. Next, the node characteristic data of the neighbor nodes obtained by the second-hop sampling is aggregated to generate the node characteristic data of the first-hop neighbor nodes, and the node characteristic data of the first-hop neighbor nodes is then aggregated to generate the node characteristic data of the target node. Finally, the node characteristic data of the target node is input into a fully connected network for risk prediction, yielding a prediction of whether the user corresponding to the target node is at risk.
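The sample-then-aggregate flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: the aggregator is a simple mean over a node's own vector and its sampled neighbors' vectors, there are no learned weights, the fully connected prediction network is omitted, and the graph and feature vectors are hypothetical. The hop sizes 3 and 5 follow the example in the text.

```python
import random


def sample_neighbors(adj, node, max_samples, rng):
    """Randomly sample at most max_samples neighbors of node."""
    neighbors = adj.get(node, [])
    if len(neighbors) <= max_samples:
        return list(neighbors)
    return rng.sample(neighbors, max_samples)


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def aggregate(adj, features, node, hop_sizes, rng):
    """Aggregate sampled neighbor features into node, outermost hop first."""
    if not hop_sizes:
        return features[node]
    sampled = sample_neighbors(adj, node, hop_sizes[0], rng)
    neighbor_vecs = [
        aggregate(adj, features, n, hop_sizes[1:], rng) for n in sampled
    ]
    # Assumption: mean over the node's own vector and its neighbors' vectors.
    return mean_vectors([features[node]] + neighbor_vecs)


# Hypothetical undirected graph and 2-dimensional node features.
adj = {"U1": ["U2", "U3"], "U2": ["U1", "U4"], "U3": ["U1"], "U4": ["U2"]}
features = {
    "U1": [1.0, 0.0], "U2": [0.0, 1.0], "U3": [1.0, 1.0], "U4": [0.0, 0.0],
}

rng = random.Random(0)
result = aggregate(adj, features, "U1", [3, 5], rng)
print(result)
```

Because only sampled neighborhoods are visited, the cost per target node depends on the hop sizes rather than on the size of the whole graph.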

In practical applications, as shown in fig. 4, the fully connected network may be trained using the full historical device data of the past 90 days and the corresponding full historical user characteristic data. Specifically, the full historical device data of the first 60 days and the corresponding full historical user characteristic data are taken as the training set (Train), and those of the last 30 days are taken as the test set (Test), as shown by the white boxes in fig. 4. The training set and the test set are then filtered separately to obtain the historical device data tables used for training the fully connected network and the corresponding historical user characteristic data, as shown by the black boxes in fig. 4. After the fully connected network is trained on the historical device data table and the corresponding historical user characteristic data in the training set, the trained network is tested on the historical device data table and the corresponding historical user characteristic data in the test set.

If the test passes, risk prediction is performed on the current day's knowledge graph data (Pre) using the trained fully connected network, thereby obtaining the risk users of the current day.
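The 60-day/30-day split of the past 90 days described above can be sketched as follows. The record layout (a dict with user_id and date keys), the function name split_by_date and the reference date are hypothetical; the filtering step and the model itself are not shown.

```python
from datetime import date, timedelta


def split_by_date(records, today, train_days=60, test_days=30):
    """Split records from the past train_days + test_days days by date."""
    test_start = today - timedelta(days=test_days)
    train_start = test_start - timedelta(days=train_days)
    # Earlier window: training set; most recent window: test set.
    train = [r for r in records if train_start <= r["date"] < test_start]
    test = [r for r in records if test_start <= r["date"] < today]
    return train, test


today = date(2024, 1, 1)  # arbitrary reference day for the sketch
records = [
    {"user_id": f"U{i}", "date": today - timedelta(days=i)}
    for i in range(1, 91)  # one hypothetical record per day, 90 days back
]

train, test = split_by_date(records, today)
print(len(train), len(test))
```

Splitting by time rather than at random keeps every test record later than every training record, which mirrors how the model is later applied to the current day's data.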

It can be understood that the GCN algorithm is a full-graph training method: the nodes of the entire graph must be updated in every iteration, which is time-consuming, or even infeasible, when the knowledge graph is large. In addition, when training on multiple days of data, there is one set of knowledge graph data per day, so the full data volume is very large, and knowledge graph construction and model training can hardly be completed. Training and predicting with GraphSAGE can greatly reduce the scale of the graph as well as the scale of the training and test datasets.

An embodiment of the present application is shown here as a specific implementation of S108. Of course, it should be understood that S108 may also be implemented by various graph analysis methods commonly used in the art, which is not limited by the embodiment of the present application.

According to the knowledge graph-based risk prediction method provided by the embodiment of the present application, the device data of different users serves as the basis for associating those users, so that a knowledge graph representing the association relationships among different users is established. Risk prediction is then performed on unlabeled users based on the association relationships represented by the knowledge graph and on the user characteristic data of the users carrying risk labels in the knowledge graph, which can improve the accuracy and efficiency of risk prediction. On this basis, the knowledge graph construction process is improved: the device data of different users is stored in the form of a data table; the device data table is operated on based on the device data of the tagged users and converted into a user relationship table representing the association relationships among different users; and knowledge graph data is generated based on the user relationship table, thereby simulating the construction of the knowledge graph. This can simplify the construction process of the knowledge graph, reduce resource consumption, and further improve the efficiency of risk prediction.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Based on the same inventive concept, the embodiment of the application also provides a risk prediction device based on the knowledge graph. Referring to fig. 5, a schematic structural diagram of a risk prediction apparatus 500 based on a knowledge graph according to an embodiment of the present application is provided, where the apparatus 500 may include: an obtaining unit 510, a conversion unit 520, a construction unit 530 and a prediction unit 540.

An obtaining unit 510, configured to obtain a target device data table and corresponding target user data, where the target device data table includes device data of N users, and the target user data includes user feature data of the N users, where N is an integer greater than 1.

A conversion unit 520, configured to convert the target device data table into a target user relationship table based on the device data of the tagged user in the target device data table, where the target user relationship table represents the association relationship between the N users.

A construction unit 530, configured to generate knowledge-graph data based on the target user relationship table.

A prediction unit 540, configured to perform risk prediction on unlabeled users in the N users based on the knowledge-graph data, the target user data, and the risk labels of the labeled users.

In one embodiment, the converting unit 520 converts the target device data table into a target user relationship table by:

Generating a device data sub-table based on the device data of the tagged user;

Taking identical device data as a first connection condition, performing a connection operation on the user identifiers of the tagged users in the device data sub-table and the user identifiers of the N users in the target device data table to obtain a first user relationship table, wherein the first user relationship table is used for representing the association relationship between the tagged users and the N users;

Taking identical device data as a second connection condition, performing a self-join operation on the user identifiers of the N users in the target device data table to obtain a second user relationship table, wherein the second user relationship table is used for representing the association relationship among the N users;

Performing a connection operation on the first user relationship table and the second user relationship table to obtain the target user relationship table.

In an embodiment, the first user relationship table includes a first start point user identifier representing the tagged user and a first end point user identifier representing one of the N users;

The second user relationship table comprises a second start point user identifier and a second end point user identifier, wherein the second start point user identifier represents one of the N users, and the second end point user identifier represents a user associated with the user having the second start point user identifier;

The conversion unit 520 performs the following steps when performing a connection operation on the first user relationship table and the second user relationship table to obtain the target user relationship table:

Taking identical first end point user identifier and second start point user identifier as a third connection condition, performing a connection operation on the first user relationship table and the second user relationship table to obtain the target user relationship table, wherein the target user relationship table comprises the first end point user identifier, the second start point user identifier and the second end point user identifier.

In one embodiment, the construction unit 530 generates knowledge-graph data by:

Converting the target user relationship table into a plurality of user relationship sub-tables, each user relationship sub-table comprising two of the first start point user identifier, the first end point user identifier, the second start point user identifier and the second end point user identifier;

and generating the knowledge graph data based on the plurality of user relation sub-tables.

In one embodiment, the target device data table is obtained by:

Acquiring an original device data table from a user database, wherein the original device data table comprises field values of M users in K device fields, M is an integer greater than N, and K is an integer greater than 1;

Determining a target device field from the K device fields based on the field values, in the K device fields, of the risk users among the M users;

Grouping the M users based on the field values of the M users in the target device field, and selecting partial users from the M users based on the number of users contained in each group;

Filtering the original device data table based on the partial users and the target device field to obtain the target device data table.

In an embodiment, the obtaining unit 510 performs the following steps when determining a target device field from the K device fields based on field values of risk users among the M users in the K device fields:

Determining the number of risk users corresponding to each device field in the K device fields based on the field values, in the K device fields, of the risk users among the M users;

Selecting, from the K device fields, the device fields whose number of risk users meets the preset number condition as the target device fields.
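The two steps above can be sketched in plain Python. Several points are assumptions made only for illustration: a risk user is counted for a device field when that user has a non-empty value in the field, the preset number condition is taken to be "at least min_risk_users", and the field names (imei, mac), the rows and the function name select_target_fields are hypothetical.

```python
def select_target_fields(rows, risk_users, device_fields, min_risk_users):
    """Count risk users per device field and keep fields meeting the condition."""
    counts = {}
    for field in device_fields:
        users_with_value = {
            row["user_id"]
            for row in rows
            if row["user_id"] in risk_users and row.get(field) is not None
        }
        counts[field] = len(users_with_value)
    # Assumed preset number condition: count >= min_risk_users.
    selected = [f for f in device_fields if counts[f] >= min_risk_users]
    return selected, counts


# Hypothetical original device data table with K = 2 device fields.
rows = [
    {"user_id": "U1", "imei": "A", "mac": None},
    {"user_id": "U2", "imei": "A", "mac": "M1"},
    {"user_id": "U3", "imei": "B", "mac": None},
]

selected, counts = select_target_fields(
    rows, risk_users={"U1", "U2"}, device_fields=["imei", "mac"], min_risk_users=2
)
print(selected, counts)
```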

In one embodiment, the obtaining unit 510 performs the following steps when selecting a part of users from the M users based on the number of users included in each group:

Sorting the groups obtained by grouping in descending order of the number of users;

Traversing each group based on the sorting result;

When traversing to the i-th group, determining a difference threshold corresponding to the i-th group based on the differences in the number of users between adjacent groups among the first i-1 groups, wherein i is an integer greater than 1;

If the difference in the number of users between the i-th group and the (i-1)-th group is greater than the difference threshold, determining the value of i-1 as the target user quantity corresponding to the target device field;

Selecting the target user quantity of users from the M users.
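The traversal above can be sketched as follows. How the difference threshold is derived from the earlier differences is not specified in this passage, so the sketch assumes it is their mean; it also starts at i = 3, since the first i-1 groups contain no adjacent pair when i = 2. The function name find_target_count and the group sizes are hypothetical.

```python
def find_target_count(group_sizes):
    """Return i-1 for the first group i whose drop exceeds the threshold."""
    sizes = sorted(group_sizes, reverse=True)  # descending number of users
    for i in range(3, len(sizes) + 1):  # i is 1-based, as in the text
        # Differences between adjacent groups among the first i-1 groups.
        earlier_gaps = [sizes[j - 1] - sizes[j] for j in range(1, i - 1)]
        # Assumption: the threshold is the mean of the earlier differences.
        threshold = sum(earlier_gaps) / len(earlier_gaps)
        gap = sizes[i - 2] - sizes[i - 1]  # (i-1)-th group minus i-th group
        if gap > threshold:
            return i - 1
    return len(sizes)  # no sharp drop found: keep every group


target_count = find_target_count([10, 50, 8, 41, 45])
print(target_count)
```

With the sample sizes sorted as 50, 45, 41, 10, 8, the first drop that exceeds the running threshold is the one from 41 to 10, so the traversal stops at the fourth group.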

In an embodiment, the obtaining unit 510 performs the following steps when filtering the original device data table based on the partial users and the target device field to obtain the target device data table:

acquiring user characteristic data of the partial users from the user database;

Determining N users with user characteristic data meeting preset characteristic conditions from the partial users;

Selecting the field values of the N users in the target device field from the original device data table as the device data of the N users to obtain the target device data table.

In an embodiment, the prediction unit 540 performs risk prediction on the unlabeled user among the N users by:

Encoding the user characteristic data of the unlabeled user based on at least one encoding strategy to obtain node characteristic data corresponding to the unlabeled user in the knowledge-graph data;

encoding the user characteristic data and the risk labels of the tagged users based on the at least one encoding strategy to obtain node characteristic data corresponding to the tagged users in the knowledge graph data;

And carrying out risk prediction on the unlabeled users based on a graph neural network algorithm, the knowledge graph data and the node characteristic data of the N users in the knowledge graph.

In one embodiment, the at least one encoding strategy comprises at least one of the following strategies: one-hot encoding, and encoding through a feature representation model.

It is obvious that the risk prediction device based on a knowledge graph provided in the embodiment of the present application can be used as an execution subject of the risk prediction method based on a knowledge graph shown in fig. 1, for example, in the risk prediction method based on a knowledge graph shown in fig. 1, step S102 may be executed by the obtaining unit 510 in the risk prediction device based on a knowledge graph shown in fig. 5, step S104 may be executed by the converting unit 520 in the risk prediction device based on a knowledge graph shown in fig. 5, step S106 may be executed by the constructing unit 530 in the risk prediction device based on a knowledge graph shown in fig. 5, and step S108 may be executed by the predicting unit 540 in the risk prediction device based on a knowledge graph shown in fig. 5.

According to another embodiment of the present application, each unit in the risk prediction apparatus based on a knowledge graph shown in fig. 5 may be separately or completely combined into one or several additional units, or some unit(s) thereof may be further split into a plurality of units with smaller functions to form the same operation, which does not affect the implementation of the technical effects of the embodiments of the present application. The above units are divided based on logic functions, and in practical applications, the functions of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the risk prediction device based on the knowledge graph may also include other units, and in practical applications, these functions may also be implemented with assistance of other units, and may be implemented by cooperation of multiple units.

According to another embodiment of the present application, the knowledge-graph-based risk prediction apparatus as shown in fig. 5 may be constructed by running a computer program (including program code) capable of executing the steps involved in the corresponding method as shown in fig. 1 on a general-purpose computing device such as a computer including a central processing unit (Central Processing Unit, CPU), a random access storage medium (Random Access Memory, RAM), a Read-Only Memory (ROM), etc., and a storage element, and implementing the knowledge-graph-based risk prediction method of the embodiment of the present application. The computer program may be recorded on, for example, a computer readable storage medium, transferred to, and run in, an electronic device via the computer readable storage medium.

Fig. 6 is a schematic structural view of an electronic device according to an embodiment of the present application. Referring to fig. 6, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (Random-Access Memory, RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, network interface, and memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 6, but this does not mean that there is only one bus or only one type of bus.

The memory is used for storing programs. In particular, a program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the risk prediction device based on the knowledge graph on the logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:

Generating knowledge graph data based on the target user relationship table;

The method executed by the risk prediction device based on the knowledge graph disclosed in the embodiment of fig. 1 of the present application may be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

The electronic device may further execute the method of fig. 1 and implement the functions of the risk prediction device based on the knowledge-graph in the embodiments shown in fig. 1 to 4, which are not described herein.

Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or a logic device.

The embodiments of the present application also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to perform the operations of:

Generating knowledge graph data based on the target user relationship table;

In summary, the foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant parts, refer to the description of the method embodiments.

Claims

1. A risk prediction method based on a knowledge graph, characterized by comprising the following steps:

Generating knowledge graph data based on the target user relationship table;

2. The method of claim 1, wherein the converting the target device data table into a target user relationship table based on the device data of the tagged user in the target device data table comprises:

3. The method of claim 2, wherein the first user relationship table includes a first start point user identifier and a first end point user identifier, the first start point user identifier representing the tagged user, the first end point user identifier representing one of the N users;

The second user relation table comprises a second starting point user identifier and a second end point user identifier, wherein the second starting point user identifier represents one of the N users, and the second end point user identifier represents an associated user of the user represented by the second starting point user identifier;

The step of performing a connection operation on the first user relation table and the second user relation table to obtain the target user relation table includes:

4. The method of claim 3, wherein the generating knowledge-graph data based on the target user relationship table comprises:

5. The method of claim 1, wherein the target device data table is obtained by:

6. The method of claim 5, wherein the determining the target device field from the K device fields based on the field value of the risk user among the M users in the K device fields comprises:

7. The method of claim 5, wherein selecting a portion of the M users based on the number of users included in each group comprises:

Traversing each group based on the ranking result;

And selecting the users with the target user quantity from the M users.

8. The method of claim 5, wherein filtering the original device data table based on the partial user and the target device field to obtain the target device data table comprises:

acquiring user characteristic data of the partial users from the user database;

9. The method of claim 1, wherein the risk prediction of an unlabeled user of the N users based on the knowledge-graph data, the target user data, and the risk labels of the labeled users comprises:

10. A risk prediction apparatus based on knowledge-graph data, comprising:

11. An electronic device, comprising:

A processor;

A memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the knowledge-graph based risk prediction method of any one of claims 1 to 9.

12. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the knowledge-graph based risk prediction method of any one of claims 1 to 9.