CN115238588A

CN115238588A - Graph data processing method, risk prediction model training method and device

Info

Publication number: CN115238588A
Application number: CN202210946400.6A
Authority: CN
Inventors: 何免; 符国辉; 郭磊; 何保健
Original assignee: Tongdun Technology Co ltd
Current assignee: Tongdun Technology Co ltd
Priority date: 2022-08-08
Filing date: 2022-08-08
Publication date: 2022-10-25

Abstract

The disclosure relates to a graph data processing method, a risk prediction model training method, and corresponding devices, in the technical field of computers. In the method, a first weight represents the relevance between an attribute node and the user nodes of a target classification; a second weight corresponding to each user node is determined from the first weights of the attribute nodes connected to that user node, so that the second weight represents the relevance between the user node and the user nodes of the target classification; on this basis, user nodes whose second weight meets a preset pruning condition are pruned. The scheme effectively removes redundant nodes from the graph data, namely those whose relevance to the user nodes of the target classification meets the preset pruning condition, so that the graph data is simplified accurately and effectively. Model training on the pruned graph data can then improve the convergence efficiency of training and the expression effect of the model, and helps ensure the application performance of the model.

Description

Graph data processing method, risk prediction model training method and device

Technical Field

The disclosure relates to the technical field of computers, in particular to a graph data processing method, a risk prediction model training method and a risk prediction model training device.

Background

In data processing, improving the results of clustering, classification, generation, and similar tasks on target data generally requires accurate feature extraction for the entities and relationships in the data. Particularly in fields such as finance, social networking, and medical care, the data exhibits strong group and relational characteristics, which places higher demands on feature extraction.

In traditional data processing, data features are usually extracted manually. Manual extraction is costly and inefficient, and it struggles to fully uncover the information that expresses group and relational characteristics in the data, so the intended processing effect is difficult to achieve.

Graph Neural Network (GNN) algorithms learn graph data with a neural network in order to extract and explore node features and carry out the target processing. Nodes in graph data, together with the edges connecting them, can describe many-to-many group characteristics among entities, so the data characteristics of fields such as finance and social networking can be expressed better.

When graph data is constructed from a sample, the entities in the sample and the relationships among them are generally all turned into nodes and edges. The resulting excess of nodes and edges makes the graph data complicated and redundant, so that graph neural network training on it converges slowly and yields a poor expression effect, which in turn degrades the actual performance of the model in applications such as clustering, classification, and generation.

It is noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.

Disclosure of Invention

The method can remove redundant nodes in the graph data and effectively simplify the graph data, so that the convergence efficiency of training and the expression effect of a model are improved in the model training based on the graph data, and the application performance of the model is ensured.

According to a first aspect of the present disclosure, there is provided a graph data processing method, which may include: acquiring original graph data, wherein the original graph data comprises nodes and edges connecting the nodes, the nodes comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relations between the user entities and the attribute entities; determining a first weight corresponding to each attribute node according to the target nodes connected to the attribute node, wherein a target node is a user node belonging to a target classification; determining a second weight corresponding to each user node according to the first weights of the attribute nodes connected to the user node; and pruning the user node to obtain target graph data when the second weight meets a preset pruning condition.

Optionally, determining the first weight corresponding to each attribute node according to the target nodes connected to the attribute node, where a target node is a user node belonging to the target classification, includes: determining, in the original graph data, a first association number of each attribute node according to the target nodes connected to the attribute node; and determining the first weight corresponding to the attribute node according to the distribution of the first association number.

Optionally, determining the first weight corresponding to the attribute node according to the distribution of the first association number includes: dividing the attribute nodes according to the distribution of the first association number to obtain two or more node sets; and assigning first weights to the attribute nodes in the different node sets according to the variation trend of the first association number across the node sets.
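As an illustration only, the division into node sets and the per-set weight assignment could be sketched as follows. The cut points, the weight values, and the names `assign_first_weights`, `cut_points`, and `set_weights` are hypothetical; the disclosure does not prescribe how the sets are delimited or which weights they receive. The sketch simply assumes that a larger first association number maps to a larger first weight.

```python
from bisect import bisect_right


def assign_first_weights(assoc_counts, cut_points, set_weights):
    """Give each attribute node the weight of the node set its count falls into.

    assoc_counts: {attribute_node: first association number}
    cut_points:   ascending boundaries between node sets (hypothetical values)
    set_weights:  one weight per node set, i.e. len(cut_points) + 1 entries
    """
    if len(set_weights) != len(cut_points) + 1:
        raise ValueError("need exactly one weight per node set")
    return {
        node: set_weights[bisect_right(cut_points, count)]
        for node, count in assoc_counts.items()
    }


# Three node sets: count < 1, 1 <= count < 5, count >= 5 (assumed boundaries).
counts = {"phone_1": 0, "ip_1": 2, "device_1": 7}
weights = assign_first_weights(counts, cut_points=[1, 5], set_weights=[0.0, 0.5, 1.0])
```

In practice the boundaries might be derived from quantiles of the observed counts rather than fixed by hand; that choice is left open here.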

Optionally, the second weight includes an average weight and a maximum weight, and pruning the user node to obtain the target graph data when the second weight meets the preset pruning condition includes: pruning the user node to obtain the target graph data when the average weight is smaller than an average weight threshold and the maximum weight is smaller than a maximum weight threshold.

Optionally, after acquiring the original graph data, the method further includes: determining, in the original graph data, a second association number of each user node according to the attribute nodes connected to the user node. In this case, pruning the user node to obtain the target graph data when the second weight meets the preset pruning condition includes: pruning the user node to obtain the target graph data when the second weight meets the preset pruning condition and the second association number is smaller than an association number threshold.

Optionally, the user nodes in the original graph data further include user features, and after acquiring the original graph data, the method further includes: determining, in the original graph data, a feature missing rate of each user node according to the user features of the user node. In this case, pruning the user node to obtain the target graph data when the second weight meets the preset pruning condition includes: pruning the user node to obtain the target graph data when the second weight meets the preset pruning condition and the feature missing rate is greater than a feature missing rate threshold.

According to a second aspect of the present disclosure, there is provided a risk prediction model training method, which may include: acquiring target graph data, wherein the target graph data is obtained through the graph data processing method of the first aspect, and the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes are used for representing risk relationship features of the user nodes; sampling and aggregating user nodes on the target graph data with a risk prediction model to obtain a prediction classification of each user node; and determining a Focal Loss function value corresponding to the prediction classification, and adjusting the risk prediction model until it converges according to the Focal Loss function value.
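For reference, Focal Loss is commonly written as FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The following is a minimal per-sample sketch of that formula; the function name and the default values of `gamma` and `alpha` are assumptions chosen for illustration, and a single `alpha` is used for every class for simplicity. The disclosure does not specify these hyperparameters.

```python
import math


def focal_loss(p_target, gamma=2.0, alpha=0.25):
    """Focal Loss for one sample, given the predicted probability of its true class.

    gamma down-weights well-classified samples; alpha is a constant scaling
    factor here (assumed values, not taken from the disclosure).
    """
    p = min(max(p_target, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)  # confidently correct prediction, small loss
hard = focal_loss(0.1)  # badly misclassified sample, much larger loss
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy, which is a convenient sanity check.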

Optionally, the risk prediction model includes a sampling sub-model and an aggregation sub-model, and sampling and aggregating user nodes on the target graph data with the risk prediction model to obtain the prediction classification of each user node includes: sampling, with the sampling sub-model, a preset number of neighbor user nodes for each user node in the target graph data to obtain sub-target graph data; determining, with the aggregation sub-model, a third weight of each neighbor user node in the sub-target graph data, wherein the third weight represents the importance of the neighbor user node to the user node; aggregating the neighbor user nodes in the sub-target graph data according to the third weight to obtain an aggregation result of the user node; and determining the prediction classification of the user node according to the aggregation result.

According to a third aspect of the present disclosure, there is provided a graph data processing apparatus, which may include: a graph data acquisition module, a first weight determination module, a second weight determination module, and a graph data pruning module; the graph data acquisition module is used for acquiring original graph data, the original graph data comprises nodes and edges connecting the nodes, the nodes comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relations between the user entities and the attribute entities; the first weight determination module is used for determining a first weight corresponding to each attribute node according to the target nodes connected to the attribute node, wherein a target node is a user node belonging to a target classification; the second weight determination module is used for determining a second weight corresponding to each user node according to the first weights of the attribute nodes connected to the user node; and the graph data pruning module is used for pruning the user node to obtain target graph data when the second weight meets a preset pruning condition.

Optionally, the first weight determination module includes a first association number sub-module and a first weight determination sub-module; the first association number sub-module is used for determining, in the original graph data, a first association number of each attribute node according to the target nodes connected to the attribute node; and the first weight determination sub-module is used for determining the first weight corresponding to the attribute node according to the distribution of the first association number.

Optionally, the first weight determination sub-module includes a node set dividing unit and a first weight assigning unit; the node set dividing unit is used for dividing the attribute nodes according to the distribution of the first association number to obtain two or more node sets; and the first weight assigning unit is used for assigning first weights to the attribute nodes in the different node sets according to the variation trend of the first association number across the node sets.

Optionally, the second weight includes an average weight and a maximum weight, and the graph data pruning module is specifically configured to prune the user node to obtain the target graph data when the average weight is smaller than the average weight threshold and the maximum weight is smaller than the maximum weight threshold.

Optionally, the apparatus further includes a second association quantity determining module, configured to determine, in the original graph data, a second association quantity of the user node according to the attribute node to which the user node is connected; the graph data pruning module is specifically configured to prune the user node to obtain the target graph data when the second weight meets the preset pruning condition and the second association number is smaller than the association number threshold.

Optionally, the apparatus further includes a feature missing rate determining module, configured to determine, in the original graph data, a feature missing rate of the user node according to a user feature of the user node; the graph data pruning module is specifically configured to prune the user node to obtain the target graph data when the second weight meets the preset pruning condition and the feature missing rate is greater than the feature missing rate threshold.

According to a fourth aspect of the present disclosure, there is provided a risk prediction model training apparatus, which may include: a target graph data acquisition module and a risk prediction model training module; the target graph data acquisition module is used for acquiring target graph data, the target graph data is obtained through the graph data processing apparatus of the third aspect, and the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes are used for representing risk relationship features of the user nodes; and the risk prediction model training module is used for sampling and aggregating user nodes on the target graph data with a risk prediction model to obtain a prediction classification of each user node, determining a Focal Loss function value corresponding to the prediction classification, and adjusting the risk prediction model until it converges according to the Focal Loss function value.

Optionally, the risk prediction model includes a sampling sub-model and an aggregation sub-model, and the risk prediction model training module includes a sampling sub-module, an aggregation sub-module and a prediction sub-module; the sampling submodule is used for sampling neighbor user nodes with preset sampling quantity of each user node in the target graph data by adopting a sampling submodel to obtain sub-target graph data; the aggregation sub-module is used for determining a third weight of each neighbor user node by adopting an aggregation sub-model in the sub-target graph data, wherein the third weight is used for representing the importance degree of the neighbor user nodes to the user nodes, and is also used for aggregating the neighbor user nodes in the sub-target graph data according to the third weight to obtain an aggregation result of the user nodes; and the prediction submodule is used for determining the prediction classification of the user node according to the aggregation result.

According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the graph data processing method of the first aspect described above, or the risk prediction model training method of the second aspect.

According to a sixth aspect of the present disclosure, there is provided an electronic device comprising:

a processor; and

a memory for storing a computer program for the processor;

wherein the processor is configured to implement the graph data processing method of the first aspect described above, or the risk prediction model training method of the second aspect, via execution of a computer program.

The graph data processing method first obtains original graph data, which comprises nodes and edges connecting the nodes; the nodes can comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relations between the user entities and the attribute entities. In the original graph data, a first weight corresponding to each attribute node can be determined according to the target nodes connected to the attribute node, a target node being a user node belonging to a target classification. Further, a second weight corresponding to each user node is determined according to the first weights of the attribute nodes connected to the user node, and the user node can be pruned according to whether the second weight meets a preset pruning condition, so as to obtain target graph data. In this scheme, the first weight represents the relevance between an attribute node and the user nodes of the target classification, and the second weight of a user node is determined from the first weights of the attribute nodes connected to it, so the second weight can represent the relevance between the user node and the user nodes of the target classification. On this basis, user nodes whose second weight meets the preset pruning condition are pruned, so that redundant nodes in the graph data, namely those whose relevance to the user nodes of the target classification meets the preset pruning condition, can be effectively removed and the graph data is simplified accurately and effectively. Model training on the pruned graph data can thus improve the convergence efficiency of training and the expression effect of the model, and helps guarantee the application performance of the model.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.

Fig. 1 is a flowchart illustrating steps of a graph data processing method according to an embodiment of the present disclosure.

Fig. 2 is a second flowchart illustrating steps of a graph data processing method according to an embodiment of the disclosure.

Fig. 3 is a partial schematic diagram of raw graph data provided by an embodiment of the present disclosure.

Fig. 4 is a third flowchart illustrating steps of a graph data processing method according to an embodiment of the present disclosure.

Fig. 5 is a fourth flowchart illustrating steps of a graph data processing method according to an embodiment of the present disclosure.

Fig. 6 is a flowchart illustrating steps of a risk prediction model training method according to an embodiment of the present disclosure.

Fig. 7 illustrates one of the schematic structural diagrams of a graph data processing apparatus in the embodiment of the present disclosure.

Fig. 8 illustrates a second schematic structural diagram of a graph data processing apparatus in an embodiment of the disclosure.

Fig. 9 illustrates a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

Fig. 1 is a flowchart illustrating steps of a graph data processing method according to an embodiment of the present disclosure, and as shown in fig. 1, the method may include steps 101 to 104. As follows:

Step 101, obtaining original graph data, wherein the original graph data comprises nodes and edges connecting the nodes, the nodes comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relations between the user entities and the attribute entities.

The sample can be data containing entities and relationships among the entities, the entities can be abstract objects of people, concepts and things in the real world, and the relationships can represent the connections among the people, the concepts and the things in the real world.

In the embodiment of the present disclosure, the entities may include user entities and attribute entities. A user entity may be an entity with the characteristics of a person; for example, user entities may include natural persons, enterprises, businesses, organizations, and the like. An attribute entity may be an entity arising from an attribute of a user entity, such as a phone number, an email address, an IP (Internet Protocol) address, a geographic address, a physical address, a network account, and the like. It should be noted that the division between user entities and attribute entities may vary according to the requirements and purpose of the data. For example, a natural person and an enterprise may be different user entities, or the enterprise may instead be an attribute entity corresponding to the natural person under an ownership relationship, so as to represent the natural person's ownership of the enterprise; likewise, different natural persons may be different user entities, or one natural person may be an attribute entity corresponding to another natural person under a relationship between the two, so as to represent that relationship. A relationship may be the association between a user entity and an attribute entity, and its type and direction may differ according to how the user entity and the attribute entity are associated. For example, a natural person may have a living relationship with a geographic address, a login relationship with an IP address, a holding relationship with an email address, a relationship with another natural person, and the like.

In the embodiment of the present disclosure, graph data is data in which entities and relationships between entities are used as description objects. On the basis of the sample, nodes corresponding to the entities can be constructed, and edges connecting the nodes are constructed according to the relation between the entities, so that original graph data corresponding to the sample is obtained. The nodes constructed by the user entities in the original graph data can be user nodes, the nodes constructed by the attribute entities can be attribute nodes, and the division of the user nodes and the attribute nodes can be different according to different requirements and purposes of the data.
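A minimal sketch of this construction, assuming the sample has already been reduced to (user, relation, attribute) triples, might look like the following. The triple layout, the function name `build_graph`, and the example identifiers are hypothetical and serve only to show the user-node / attribute-node split.

```python
from collections import defaultdict


def build_graph(relations):
    """Build a user-attribute graph from (user, relation, attribute) triples."""
    user_to_attrs = defaultdict(set)   # user node -> connected attribute nodes
    attr_to_users = defaultdict(set)   # attribute node -> connected user nodes
    edge_types = {}                    # (user, attribute) -> relation label
    for user, relation, attribute in relations:
        user_to_attrs[user].add(attribute)
        attr_to_users[attribute].add(user)
        edge_types[(user, attribute)] = relation
    return user_to_attrs, attr_to_users, edge_types


# Hypothetical sample: two persons sharing one IP address.
sample = [
    ("person_a", "lives_at", "address_1"),
    ("person_a", "logs_in_from", "ip_1"),
    ("person_b", "logs_in_from", "ip_1"),
]
user_to_attrs, attr_to_users, edge_types = build_graph(sample)
```

Keeping both adjacency directions makes the later steps cheap: weights on attribute nodes are computed from `attr_to_users`, and weights on user nodes from `user_to_attrs`.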

It should be noted that the source of the sample may be selected according to the requirements and purpose of the data processing. For example, the sample may be interaction data in the social domain, which may include different user account entities, home entities, and the like, together with follow relationships, comment relationships, forwarding relationships, comment like and dislike relationships, and the like; the sample may also be transaction data in the financial field, which may include different natural person entities, phone number entities, card number entities, personal device entities, Point of Sale (POS) entities, and the like, together with phone number holding relationships, card holding relationships, personal device usage relationships, point-of-sale consumption relationships, and the like. In addition, the sample is accessed, collected, stored, and applied to subsequent analysis and processing only after the user has been clearly informed of the content collected, the purpose of the data, the processing method, and similar information, and has given consent and authorization; the user can also be provided with ways to access, correct, and delete the data and to revoke the consent and authorization.

Moreover, the original graph data can be constructed on the basis of the sample in various ways, for example, an entity and a relationship can be identified in the sample by adopting a manual marking way to further construct the corresponding original graph data, or an entity identification algorithm can be adopted to perform feature identification and marking extraction on the entity in the sample, and an algorithm based on supervised learning or semi-supervised learning is adopted to extract the relationship between the entities, so as to further construct the corresponding original graph data.

Step 102, determining a first weight corresponding to each attribute node according to the target nodes connected to the attribute node, wherein a target node is a user node belonging to a target classification.

Two or more classifications of the user entities in the sample can be obtained according to the requirements of data processing and application. For example, in the social domain the user entities can be classified into active users and churned users according to the association relations between the user entities and the attribute entities, and in the financial field the user entities can be classified into safe users and risk users according to the association relations between the user entities and the attribute entities. The classification of a user entity is determined by behavior data that the user objectively generates in social practice.

In the embodiment of the present disclosure, each attribute node in the original graph data may be connected to one or more user nodes, and the user nodes connected to an attribute node may belong to different classifications, among which a target classification may be chosen according to the requirements of data processing and application. Depending on the application scenario, when detecting activity on a social platform, either the active-user classification or the churned-user classification can serve as the target classification; when making predictions for risk control in the financial field, either the safe-user classification or the risk-user classification can serve as the target classification. The target classification may be a classification related to the purpose of the data processing, in which case the target nodes, that is, the user nodes belonging to the target classification, are also related to that purpose; a person skilled in the art may select different target classifications according to actual needs.

In the embodiment of the present disclosure, the first weight of the attribute node may be determined according to the target node to which the attribute node is connected, so that the first weight may represent the association between the attribute node and the user node belonging to the target class.
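One simple starting point, sketched below under stated assumptions, is to count for each attribute node how many of its connected user nodes are target nodes (the first association number used in the optional refinement). The function name and the example data are hypothetical, and the raw count is only a precursor: how it is mapped to a first weight is a separate design choice.

```python
def first_association_counts(attr_to_users, target_users):
    """Count, per attribute node, the connected user nodes in the target classification."""
    return {
        attr: len(users & target_users)
        for attr, users in attr_to_users.items()
    }


# Hypothetical graph: u1 and u2 belong to the target classification.
attr_to_users = {
    "ip_1": {"u1", "u2", "u3"},
    "phone_1": {"u3"},
    "device_1": {"u1", "u2"},
}
target_users = {"u1", "u2"}
assoc_counts = first_association_counts(attr_to_users, target_users)
```

Here `phone_1` is connected to no target node at all, so any weighting scheme built on these counts would give it the weakest association with the target classification.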

Step 103, determining a second weight corresponding to each user node according to the first weights of the attribute nodes connected to the user node.

After the first weight corresponding to each attribute node is determined, the second weight of each user node can be determined from the first weights of the attribute nodes connected to that user node. Because the first weight represents the association between an attribute node and the target nodes, the second weight synthesizes the associations between the target nodes and all attribute nodes connected to the user node, and can therefore indirectly represent the association between the user node and the target nodes.
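Following the optional form in which the second weight consists of an average weight and a maximum weight, the aggregation could be sketched as below. The function name, the example values, and the handling of user nodes without any attribute node are assumptions made for illustration.

```python
def second_weights(user_to_attrs, first_weight):
    """Return {user: (average weight, maximum weight)} over connected attribute nodes."""
    result = {}
    for user, attrs in user_to_attrs.items():
        values = [first_weight[attr] for attr in attrs]
        if values:
            result[user] = (sum(values) / len(values), max(values))
        else:
            # Assumption: a user node with no attribute nodes gets zero weights.
            result[user] = (0.0, 0.0)
    return result


# Hypothetical first weights and connections.
first_weight = {"ip_1": 1.0, "device_1": 0.5, "phone_1": 0.0}
user_to_attrs = {"u1": ["ip_1", "device_1"], "u3": ["phone_1"], "u4": []}
user_weights = second_weights(user_to_attrs, first_weight)
```

The average captures how strongly the user node is tied to the target classification overall, while the maximum preserves a single strong tie that an average over many weak attribute nodes would dilute.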

Step 104, pruning the user node to obtain target graph data when the second weight meets the preset pruning condition.

The preset pruning condition can be set based on the target classification, the second weight, and the like, so as to suit the application requirements of the graph data. For example, different preset pruning conditions can be set according to how the user nodes of the target classification relate to those requirements. When the graph data is applied to predicting, identifying, or classifying user nodes of the target classification, so that user nodes of the target classification need to be retained, the preset pruning condition can be set so that a user node is pruned when the relevance represented by its second weight is lower than a threshold. Conversely, when the graph data is applied to processing user nodes of other classifications, the preset pruning condition can be set so that a user node is pruned when the relevance represented by its second weight is higher than the threshold. On this basis, pruning the user nodes whose second weights meet the preset pruning condition yields target graph data that is simplified, accurate, reliable, and of high quality, so that the target graph data can meet actual application requirements.
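For the first case, where user nodes of the target classification are to be retained, the pruning step could be sketched as follows, using the optional condition that both the average weight and the maximum weight fall below their thresholds. The function name, threshold values, and example data are hypothetical; attribute nodes left without any user node are not cleaned up in this sketch.

```python
def prune_users(user_to_attrs, second_weight, avg_threshold, max_threshold):
    """Drop user nodes whose average and maximum weights are both below threshold."""
    kept = {}
    for user, attrs in user_to_attrs.items():
        avg_w, max_w = second_weight[user]
        if avg_w < avg_threshold and max_w < max_threshold:
            continue  # weakly related to the target classification: prune
        kept[user] = attrs
    return kept


# Hypothetical second weights: (average weight, maximum weight) per user node.
graph = {"u1": {"ip_1", "device_1"}, "u3": {"phone_1"}, "u5": {"ip_1", "phone_1"}}
weights = {"u1": (0.75, 1.0), "u3": (0.0, 0.0), "u5": (0.1, 0.9)}
target_graph = prune_users(graph, weights, avg_threshold=0.2, max_threshold=0.5)
```

Note that `u5` survives even though its average weight is low, because one of its attribute nodes is strongly tied to the target classification. For the opposite case described above, the comparison would be inverted.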

Fig. 2 is a second flowchart illustrating steps of a graph data processing method according to an embodiment of the disclosure, and as shown in fig. 2, the method may include steps 201 to 205.

Step 201, obtaining original graph data, where the original graph data includes nodes and edges connecting the nodes, the nodes include user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relations between the user entities and the attribute entities.

Step 201 may refer to the related description of step 101, and is not described herein again to avoid repetition.

Fig. 3 is a partial schematic diagram of original graph data provided in an embodiment of the present disclosure, and as shown in fig. 3, the original graph data is constructed according to transaction data in the financial field, and includes "natural person a", "natural person b", "natural person c", "natural person d", "natural person e", "geographic address a", "geographic address b", "IP address", "credit card number", "telephone number", "sales terminal", "personal device", and "email address".

As shown in fig. 3, if "natural person a" is taken as the user node, then "natural person b" and "natural person c" are attribute nodes of "natural person a" based on a relative relationship, "geographic address a" is an attribute node of "natural person a" based on a residence relationship, "IP address" is an attribute node of "natural person a" based on a login relationship, "credit card number" is an attribute node of "natural person a" based on a card-holding relationship, "telephone number" is an attribute node of "natural person a" based on a telephone-holding relationship, "sales terminal" is an attribute node of "natural person a" based on a consumption relationship, "personal device" is an attribute node of "natural person a" based on a device-use relationship, and "email address" is an attribute node of "natural person a" based on an email-holding relationship.

Alternatively, if "natural person b" is taken as the user node, then "natural person a" and "natural person c" are attribute nodes of "natural person b" based on the relative relationship, and "geographic address b" is an attribute node of "natural person b" based on the residence relationship.

Step 202, in the original graph data, according to a target node connected with each attribute node, determining a first association number of the attribute nodes, wherein the target node is a user node belonging to a target classification.

In the original graph data, all user nodes connected to each attribute node can be obtained, the target nodes belonging to the target classification are then identified among those user nodes, and the number of such target nodes is recorded as the first association number corresponding to the attribute node. For the target node, the target classification, and the like, refer to the related description of step 102, which is not repeated here.

For example, taking "natural person a" as an attribute node, the user nodes connected to it based on the relative relationship include "natural person b" and "natural person c"; taking "sales terminal" as an attribute node, the user nodes connected to it based on the consumption relationship include "natural person a" and "natural person d"; taking "geographic address a" as an attribute node, the user nodes connected to it based on the residence relationship include "natural person a" and "natural person e"; and taking "personal device" as an attribute node, the user nodes connected to it based on the device-use relationship include "natural person a" and "natural person b".

On this basis, taking the case where the target classification is risk users as an example, if "natural person b", "natural person d", and "natural person e" are marked as risk users based on the behavior characteristics of the corresponding transaction data, then the first association number corresponding to "natural person a" is 1, the first association number corresponding to "sales terminal" is 1, the first association number corresponding to "geographic address a" is 1, and the first association number corresponding to "personal device" is 1, denoted as RS {"natural person a": 1, "sales terminal": 1, "geographic address a": 1, "personal device": 1, ...}.
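The counting in step 202 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the edge list is a hypothetical encoding of only the four attribute nodes discussed in the fig. 3 example, and the function name first_association_numbers is not from the disclosure.

```python
from collections import defaultdict

# Hypothetical (attribute node, user node) edges taken from the fig. 3 example.
edges = [
    ("natural person a", "natural person b"),
    ("natural person a", "natural person c"),
    ("sales terminal", "natural person a"),
    ("sales terminal", "natural person d"),
    ("geographic address a", "natural person a"),
    ("geographic address a", "natural person e"),
    ("personal device", "natural person a"),
    ("personal device", "natural person b"),
]

# User nodes marked as belonging to the target classification (risk users).
risk_users = {"natural person b", "natural person d", "natural person e"}


def first_association_numbers(edges, target_users):
    """For each attribute node, count connected user nodes in the target class."""
    counts = defaultdict(int)
    for attribute_node, user_node in edges:
        counts[attribute_node] += int(user_node in target_users)
    return dict(counts)


rs = first_association_numbers(edges, risk_users)
```

Under these assumptions each of the four attribute nodes is connected to exactly one risk user, which agrees with the RS values given in the example.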

In an embodiment of the method disclosed herein, the attribute nodes may also be screened according to actual application requirements and data quality, and the first association numbers of the attribute nodes that remain after screening are obtained. For example, when the relationship corresponding to an attribute node interferes with the pruning effect, so that user nodes that should be removed are retained or user nodes that should be retained are removed, the association between that attribute node and the user nodes belonging to the target classification may be disregarded, and the first association numbers of the other attribute nodes are determined after that attribute node is screened out. After pruning, the pruning effect can be evaluated according to whether the number of user nodes, the number of attribute nodes, the depiction of the association relationships, and the like in the target graph data meet the application requirements. When the pruning effect does not meet expectations, the attribute nodes can be screened out one at a time, recomputing the statistics each time, to identify and eliminate the attribute nodes that interfere with the pruning effect.

For example, in a risk prediction scenario in the financial field, a "sales terminal" is usually installed in a physical store. If actual statistics show that the consumption relationship with "sales terminal" as the attribute node interferes with the pruning effect, the first association numbers of the other attribute nodes can be obtained after "sales terminal" is screened out of the original graph data.

Step 203, determining a first weight corresponding to the attribute node according to the distribution of the first association number.

The distribution of the first association number can be determined from the distinct values that the first association number takes and from the number of attribute nodes that have each value. Generally, the larger the first association number, the stronger the association between the attribute node and the user nodes belonging to the target classification, so a larger first weight may be assigned to attribute nodes with a larger first association number and a smaller first weight to attribute nodes with a smaller first association number. Further, the fewer attribute nodes share a given value of the first association number, the more easily the association between those attribute nodes and the user nodes belonging to the target classification is overlooked in statistics, so a larger first weight may be assigned to the attribute nodes with that value to increase their share in the calculation.

In one method embodiment of the present disclosure, step 203 may include steps S11 to S12, as follows:

Step S11, dividing the attribute nodes according to the distribution of the first association number to obtain two or more node sets.

According to the distribution condition of the first association number, attribute nodes with similar values and concentrated distribution can be divided into corresponding node sets, so that a plurality of node sets are obtained, and the distribution condition of the first association number is represented by different node sets.

For example, the attribute nodes are sorted according to the first association number to obtain a sorted sequence of the attribute nodes, denoted as RN {"natural person a": 1, "natural person c": 1, "geographic address a": 1, "geographic address b": 1, "personal device": 1, "IP address": 0, "credit card number": 0, "telephone number": 0, "email address": 0}.

Here, the item to the left of each ":" is an attribute node, and the number to its right is the first association number corresponding to that attribute node.

Step S12, assigning first weights to the attribute nodes in the different node sets according to the variation trend of the first association number among the different node sets.

Because the node sets are divided according to the distribution of the first association number, the distribution intervals of the different node sets follow a certain variation trend. For example, the attribute nodes, sorted by first association number from large to small, are divided into four node sets covering 0% to 10%, 10% to 30%, 30% to 60%, and 60% to 100% of the sorted sequence according to how the first association numbers cluster. First weights are then assigned according to the magnitude of the first association number and the number of attribute nodes sharing the same value: for example, the first weight of the attribute nodes in the 0% to 10% set is 4, that of the 10% to 30% set is 3, that of the 30% to 60% set is 2, and that of the 60% to 100% set is 1. Those skilled in the art can set the specific values of the first weight according to application requirements, calculation conditions, and the like. In this variation trend, as the value of the first association number decreases from one node set to the next, the number of attribute nodes in each node set increases and the assigned first weight decreases. Those skilled in the art may determine how to assign the first weight along this trend according to the actual distribution of the first association number, and the disclosure is not limited in this respect.
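Steps S11 and S12 can be sketched as a rank-percentile bucketing. The sketch below is illustrative only: it assumes the 10% / 30% / 60% / 100% boundaries and the weights 4, 3, 2, 1 from the example, splits purely by rank position (so attribute nodes with equal first association numbers could in principle land in different sets, which a real implementation may want to avoid), and uses hypothetical node names and counts.

```python
def assign_first_weights(first_assoc, buckets=((10, 4), (30, 3), (60, 2), (100, 1))):
    """Rank attribute nodes by first association number (descending) and
    assign each node the weight of the rank-percentile bucket it falls into.

    buckets holds (upper bound in percent, weight) pairs in ascending order.
    """
    ranked = sorted(first_assoc, key=first_assoc.get, reverse=True)
    total = len(ranked)
    weights = {}
    for position, node in enumerate(ranked, start=1):
        for upper_percent, weight in buckets:
            # Integer comparison of position / total <= upper_percent / 100.
            if position * 100 <= upper_percent * total:
                weights[node] = weight
                break
    return weights


# Hypothetical first association numbers for ten attribute nodes.
first_assoc = {f"attr{i}": n for i, n in enumerate([9, 7, 6, 4, 3, 3, 1, 1, 0, 0])}
first_weights = assign_first_weights(first_assoc)
```

With ten nodes, the top-ranked node receives weight 4, the next two receive 3, the next three receive 2, and the remaining four receive 1, so the set sizes grow as the first association number falls.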

Step 204, determining a second weight corresponding to the user node according to the first weight corresponding to the attribute node connected with each user node.

Step 204 may refer to the related description of step 103, and is not described herein again to avoid repetition.

In one method embodiment of the present disclosure, the second weight includes an average weight and a maximum weight.

The second weight corresponding to a user node is obtained by selecting, calculating, and converting the first weights corresponding to the attribute nodes connected to that user node. For example, the first weights of the connected attribute nodes are compared, and the largest value is taken as the maximum weight, which serves as a second weight of the user node; or the first weights of the connected attribute nodes are summed and divided by the number of all attribute nodes connected to the user node, and the resulting average weight serves as a second weight of the user node.

For example, the user node 1 connects the attribute node 1, the attribute node 2, the attribute node 3, and the attribute node 4, where the first weight of the attribute node 1 is 2, the first weight of the attribute node 2 is 1, the first weight of the attribute node 3 is 2, and the first weight of the attribute node 4 is 1. On this basis, the second weight of the user node 1 includes an average weight of 1.5, and a maximum weight of 2.
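The example above can be reproduced with a few lines of Python. The function name second_weight and the handling of a user node with no attribute nodes are assumptions made for illustration.

```python
def second_weight(first_weights):
    """Return (average weight, maximum weight) over the first weights of the
    attribute nodes connected to one user node."""
    if not first_weights:
        # Assumption: a user node with no attribute nodes gets zero weights.
        return 0.0, 0
    return sum(first_weights) / len(first_weights), max(first_weights)


# First weights of attribute nodes 1 to 4 connected to user node 1.
average_weight, maximum_weight = second_weight([2, 1, 2, 1])
```

For user node 1 this gives an average weight of 1.5 and a maximum weight of 2, as stated in the example.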

Step 205, pruning the user nodes under the condition that the average weight is smaller than the average weight threshold and the maximum weight is smaller than the maximum weight threshold, to obtain target graph data.

The average weight threshold and the maximum weight threshold are used to measure the association between the user node and the target nodes, and thereby to determine whether the user node is pruned. When the average weight is smaller than the average weight threshold and the maximum weight is smaller than the maximum weight threshold, the association between the user node and the target nodes is determined to meet the pruning condition, and the user node is removed from the original graph data. The two thresholds can be adjusted according to the numbers of nodes in the original graph data and in the target graph data, that is, before and after pruning. For example, when too few nodes remain after pruning, the thresholds can be lowered to retain more user nodes and avoid excessive pruning; when too many nodes remain after pruning, the thresholds can be raised to remove more user nodes and ensure that the target graph data is effectively simplified.

In one embodiment of the method disclosed herein, during model training based on the target graph data, the average weight threshold and the maximum weight threshold can also be treated as hyper-parameters of the model and tuned during training, so that they ensure effective simplification of the original graph data while avoiding excessive pruning.
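The step 205 condition can be sketched as a filter over user nodes. The node names, weights, and threshold values below are hypothetical and chosen only to show both outcomes of the condition.

```python
def prune_user_nodes(second_weights, avg_threshold, max_threshold):
    """Return the user nodes that survive pruning.

    A node is pruned only when BOTH its average weight and its maximum weight
    are below their respective thresholds.
    """
    return {
        user: (avg, mx)
        for user, (avg, mx) in second_weights.items()
        if not (avg < avg_threshold and mx < max_threshold)
    }


# Hypothetical (average weight, maximum weight) pairs per user node.
second_weights = {"user 1": (1.5, 2), "user 2": (1.0, 1), "user 3": (1.2, 4)}
kept = prune_user_nodes(second_weights, avg_threshold=2.0, max_threshold=3)
```

Here "user 3" is retained because its maximum weight is not below the maximum weight threshold, even though its average weight is low. Lowering either threshold in this sketch can only enlarge the retained set.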

Fig. 4 is a third flowchart of the steps of a graph data processing method provided in an embodiment of the present disclosure, and as shown in fig. 4, the method may include steps 401 to 405, as follows:

Step 401, obtaining original graph data, where the original graph data includes nodes and edges connecting the nodes, the nodes include user nodes and attribute nodes, the user nodes are constructed based on user entities in the samples, the attribute nodes are constructed based on attribute entities in the samples, and the edges are constructed based on association relationships between the user entities and the attribute entities.

Step 401 may refer to the related descriptions of step 101 and step 201, and is not described herein again to avoid repetition.

Step 402, in the original graph data, determining a second association number of the user node according to the attribute nodes connected with the user node.

The second association quantity may be the quantity of attribute nodes connected by the user node in the original graph data, and the edge is constructed based on the association relationship between the user entity and the attribute entity, so the second association quantity may represent the isolation degree of the user node in the original graph data, and a larger second association quantity indicates that the user node is directly associated with more attribute nodes in the original graph data, and a smaller second association quantity indicates that the user node is directly associated with fewer attribute nodes in the original graph data.

In an embodiment of the method of the present disclosure, the second association number may be the number of all attribute nodes connected to the user node, or the number of only some of those attribute nodes, selected according to the requirements of data processing and application. As shown in fig. 3, taking "natural person a" as the user node, the second association number may be 9, counting all attribute nodes connected to "natural person a", or 2, counting only the two attribute nodes connected to "natural person a" based on the relative relationship, namely "natural person b" and "natural person c". The present disclosure is not particularly limited in this respect.
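The two ways of counting can be sketched as follows. The mapping below is a hypothetical encoding of the fig. 3 neighborhood of "natural person a" (attribute node to relationship type), and the relationship labels are illustrative names.

```python
# Hypothetical neighborhood of "natural person a": attribute node -> relationship.
neighbors_of_a = {
    "natural person b": "relative",
    "natural person c": "relative",
    "geographic address a": "residence",
    "IP address": "login",
    "credit card number": "card holding",
    "telephone number": "telephone holding",
    "sales terminal": "consumption",
    "personal device": "device use",
    "email address": "email holding",
}


def second_association_number(neighbors, relations=None):
    """Count connected attribute nodes, optionally restricted to some relationships."""
    if relations is None:
        return len(neighbors)
    return sum(1 for relation in neighbors.values() if relation in relations)


all_count = second_association_number(neighbors_of_a)
relative_count = second_association_number(neighbors_of_a, relations={"relative"})
```

Counting every connected attribute node gives 9, and restricting the count to the relative relationship gives 2, matching the two values in the example.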

Step 403, according to the target node connected to each attribute node, determining a first weight corresponding to the attribute node, where the target node is a user node belonging to a target classification.

Step 403 may refer to the related descriptions of step 102, step 202, and step 203, and is not described herein again to avoid repetition.

Step 404, determining a second weight corresponding to the user node according to the first weight corresponding to the attribute node connected with each user node.

Step 404 may refer to the related descriptions of step 103 and step 204, and is not described herein again to avoid repetition.

Step 405, pruning the user nodes to obtain target graph data under the condition that the second weight meets the preset pruning condition and the second association number is smaller than the association number threshold.

The association number threshold may be a threshold indicating that the isolation degree of the user node meets the pruning requirement. On the basis that the second weight meets the preset pruning condition, it can further be confirmed whether the second association number of the user node is smaller than the association number threshold. In this way, simplification of the original graph data takes into account both the association between the user node and the user nodes belonging to the target classification and the isolation degree of the user node in the original graph data, which ensures accurate, sufficient, and appropriate pruning of the original graph data and yields target graph data that meets the application requirements.

Fig. 5 is a fourth flowchart of the steps of a graph data processing method provided in an embodiment of the present disclosure, and as shown in fig. 5, the method may include steps 501 to 505, as follows:

Step 501, obtaining original graph data, where the original graph data includes nodes and edges connecting the nodes, the nodes include user nodes and attribute nodes, the user nodes are constructed based on user entities in the samples, the attribute nodes are constructed based on attribute entities in the samples, and the edges are constructed based on association relationships between the user entities and the attribute entities.

Step 501 may refer to the related descriptions of step 101 and step 201, and is not described herein again to avoid repetition.

In one method embodiment of the present disclosure, the user nodes in the original graph data further include user features.

The user features may be features of the user entity obtained by processing the sample through feature engineering according to the data processing and application requirements. If the original graph data is constructed based on transaction data in the financial field, the user features of a user node may include social attribute features, risk behavior features, and the like of the user entity. The social attribute features may include the age, gender, occupation, and the like of the user entity. The risk behavior features may include credit investigation features, overdue features, consumption behavior features, fund flow features, blacklist features, and the like; the credit investigation features may include the number of credit inquiries, the borrowing amount, the number of borrowings, and the like, and the overdue features may include the number of overdue events, the overdue amount, the overdue duration, and the like.

Step 502, in the original graph data, determining the feature missing rate of the user node according to the user feature of the user node.

In feature engineering, the types of user features to be extracted can be predetermined according to the data processing and application requirements, and the user features corresponding to each user entity are then extracted. In actual processing, owing to limitations in data acquisition, transmission, and processing, features of a user entity may be missing, so that not all predetermined user feature types can be acquired and the user node constructed from that user entity lacks some user features. In the embodiment of the present disclosure, the feature missing rate of a user node may be determined as the proportion of the predetermined user feature types for which the user node has no user feature.
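A minimal sketch of the feature missing rate follows. It assumes that a feature counts as missing when its key is absent or its value is None, and the feature names are hypothetical; the disclosure does not fix either choice.

```python
def feature_missing_rate(user_features, expected_features):
    """Share of the predetermined feature types that are missing for a user node.

    A feature is treated as missing when it is absent or None (an assumption);
    a value of 0 is a valid, present value.
    """
    missing = sum(1 for name in expected_features if user_features.get(name) is None)
    return missing / len(expected_features)


# Hypothetical predetermined feature types and one user node's features.
expected = ["age", "gender", "occupation", "overdue_count", "overdue_amount"]
rate = feature_missing_rate({"age": 35, "overdue_count": 0}, expected)
```

In this example three of the five predetermined feature types are missing, so the feature missing rate is 0.6.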

Step 503, determining a first weight corresponding to each attribute node according to a target node connected to each attribute node, where the target node is a user node belonging to a target classification.

Step 503 may refer to the related descriptions of step 102, step 202, and step 203, and is not described herein again to avoid repetition.

Step 504, determining a second weight corresponding to the user node according to the first weight corresponding to the attribute node connected with each user node.

Step 504 may refer to the related descriptions of step 103 and step 204, which are not repeated herein for avoiding repetition.

Step 505, pruning the user node to obtain target graph data under the condition that the second weight meets the preset pruning condition and the feature missing rate is greater than the feature missing rate threshold.

The feature missing rate can be used to represent the amount of information the user node provides in the original graph data: the higher the feature missing rate, the less information the user node provides. The feature missing rate threshold is the threshold at which the amount of information provided by the user node is low enough to meet the pruning requirement. On the basis that the second weight meets the preset pruning condition, it can further be confirmed whether the feature missing rate of the user node is greater than the feature missing rate threshold, for example, whether it is greater than 90%. In this way, simplification of the original graph data takes into account both the association between the user node and the user nodes belonging to the target classification and the amount of information the user node provides in the original graph data, so that redundant user nodes providing little information are removed, the original graph data is pruned accurately, sufficiently, and appropriately, and target graph data that meets the application requirements is obtained.

In an embodiment of the method of the present disclosure, the second association number and the feature missing rate may also be considered together: after the original graph data is obtained, the second association number and the feature missing rate corresponding to each user node are obtained, and the user node is pruned when the second association number is smaller than the association number threshold and the feature missing rate is greater than the feature missing rate threshold.
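The combined condition can be sketched as a single predicate. The sketch assumes, in line with steps 405 and 505, that the second-weight condition (here the step 205 form with average and maximum weights) must also hold; the parameter names and the threshold values in the usage lines are hypothetical.

```python
def should_prune(avg_weight, max_weight, assoc_number, missing_rate, *,
                 avg_threshold, max_threshold, assoc_threshold, missing_threshold):
    """Prune only when the weight condition, the isolation condition, and the
    low-information condition all hold (an assumed combination)."""
    weight_condition = avg_weight < avg_threshold and max_weight < max_threshold
    isolated = assoc_number < assoc_threshold
    low_information = missing_rate > missing_threshold
    return weight_condition and isolated and low_information


thresholds = dict(avg_threshold=2.0, max_threshold=3,
                  assoc_threshold=2, missing_threshold=0.9)
isolated_sparse_user = should_prune(1.5, 2, 1, 0.95, **thresholds)
well_connected_user = should_prune(1.5, 2, 9, 0.95, **thresholds)
```

Under these values the first user node is pruned, while the second is retained because it is connected to nine attribute nodes.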

In the graph data processing method, original graph data is first obtained, where the original graph data includes nodes and edges connecting the nodes, the nodes may include user nodes and attribute nodes, the user nodes are constructed based on user entities in the samples, the attribute nodes are constructed based on attribute entities in the samples, and the edges are constructed based on association relationships between the user entities and the attribute entities. In the original graph data, a first weight corresponding to each attribute node can be determined according to the target nodes connected to that attribute node, where a target node is a user node belonging to a target classification. A second weight corresponding to each user node is then determined according to the first weights corresponding to the attribute nodes connected to that user node, and the user node is pruned according to whether the second weight meets the preset pruning condition, so that target graph data is obtained. In this scheme, the first weight represents the association between an attribute node and the user nodes of the target classification, and the second weight, determined from the first weights of the attribute nodes connected to a user node, represents the association between that user node and the user nodes of the target classification. Pruning the user nodes whose second weights meet the preset pruning condition therefore effectively removes redundant nodes whose association with the user nodes of the target classification meets the preset pruning condition, and simplifies the graph data accurately and effectively. When a model is trained on the pruned graph data, the convergence efficiency of training and the expressive power of the model can be improved, and the application performance of the model can be ensured.

Further, on the basis that the second weight meets the preset pruning condition, a second association number of the user node, determined according to the attribute nodes connected to the user node in the original graph data, can be obtained, and the user node is pruned when the second association number is also smaller than the association number threshold. Because the second association number is the number of attribute nodes directly associated with the user node in the original graph data, it represents the isolation degree of the user node, so that relatively isolated user nodes with a low degree of association are removed from the original graph data. Alternatively, the feature missing rate of the user features of the user node in the original graph data can be obtained, and the user node is pruned when the feature missing rate is also greater than the feature missing rate threshold, so that user nodes with a high feature missing rate that would hinder model convergence are removed from the original graph data. The second association number and the feature missing rate can also be obtained together and both used to further simplify the original graph data while accurately retaining the required information.

Fig. 6 is a flowchart illustrating steps of a risk prediction model training method provided in an embodiment of the present disclosure, and as shown in fig. 6, the method may include steps 601 to 603, as follows:

Step 601, obtaining target graph data, where the target graph data is obtained through the graph data processing method described above, and the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes are used to represent risk relationship features of the user nodes.

On this basis, the target graph data can be obtained by processing the original graph data through the graph data processing method shown in figs. 1 to 5. The original graph data is constructed based on transaction data in the financial field. The attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes are used to represent risk relationship features of the user nodes, where a risk relationship feature refers to an association relationship between a user entity and an attribute entity that is used to measure the risk degree of the user entity.

In one embodiment of the method of the present disclosure, entities such as natural persons, geographic addresses, IP addresses, credit card numbers, telephone numbers, sales terminals, personal devices, and email addresses may be selected, together with relationships such as the relative relationship, residence relationship, login relationship, card-holding relationship, telephone-holding relationship, consumption relationship, device-use relationship, and email-holding relationship, to depict the risk behavior in a credit card fraud scenario accurately and fully, which is more conducive to the convergence of the risk prediction model. A sales terminal generally refers to a consumption medium in a physical store, and a personal device generally refers to an individual's online consumption medium; both a sales terminal and a personal device can typically be used for consumption by different user entities.

Step 602, sampling and aggregating user nodes for the target graph data by using a risk prediction model, and obtaining a prediction classification of each user node.

After the target graph data is obtained, it can be input into the risk prediction model, so that the risk prediction model performs sampling and aggregation of the user nodes on the target graph data to obtain the prediction classification of each user node. The risk prediction model may be constructed using a graph neural network, such as GCN (Graph Convolutional Network), GraphSAGE, or GAT (Graph Attention Network), which is not specifically limited by the present disclosure.

In one method embodiment of the present disclosure, the risk prediction model includes a sampling submodel and an aggregation submodel, and then step 602 includes steps S21 to S24. As follows:

Step S21, sampling a preset sampling number of neighbor user nodes for each user node in the target graph data by adopting the sampling sub-model, to obtain sub-target graph data.

The sampling sub-model in the risk prediction model is used to sample the features of the neighbor user nodes of a user node, so as to obtain the features representing those neighbor user nodes. In the embodiment of the present disclosure, a GraphSAGE mechanism may be adopted for the feature sampling, that is, a preset sampling number of neighbor user nodes is sampled for each user node in the target graph data. The preset sampling number may include the number of layers of neighbor user nodes and the number of neighbor user nodes in each layer. For example, a fixed number of first-degree neighbor user nodes and a fixed number of second-degree neighbor user nodes may be sampled for the user node; when the number of available neighbor user nodes is smaller than the fixed number, the neighbor user nodes may be sampled repeatedly (that is, with replacement) to obtain the sub-target graph data corresponding to the user node. Sampling a preset number of neighbor user nodes reduces the computing resources that would be consumed by processing the full target graph data, shortens model training time, and improves model convergence efficiency.
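The fixed-size, layer-by-layer neighbor sampling described above can be sketched in plain Python. This is a simplified illustration of GraphSAGE-style sampling, not the sampling sub-model itself; the adjacency dictionary, the function names, and the fan-out values are hypothetical.

```python
import random


def sample_neighbors(neighbors, sample_size, rng):
    """Sample a fixed number of neighbors; fall back to sampling with
    replacement when fewer neighbors than sample_size are available."""
    if not neighbors:
        return []
    if len(neighbors) >= sample_size:
        return rng.sample(neighbors, sample_size)
    return [rng.choice(neighbors) for _ in range(sample_size)]


def sample_layers(graph, node, fanouts, rng):
    """Return [[node], first-degree samples, second-degree samples, ...]."""
    layers = [[node]]
    for fanout in fanouts:
        frontier = []
        for current in layers[-1]:
            frontier.extend(sample_neighbors(graph.get(current, []), fanout, rng))
        layers.append(frontier)
    return layers


# Hypothetical user-to-user adjacency derived from the target graph data.
graph = {"u1": ["u2", "u3", "u4"], "u2": ["u1"], "u3": ["u1", "u4"], "u4": ["u1", "u3"]}
layers = sample_layers(graph, "u1", fanouts=(2, 2), rng=random.Random(0))
```

With a fan-out of 2 per layer, the first layer holds 2 sampled neighbors and the second layer holds 4, regardless of each node's true degree; "u2" has only one neighbor, so that neighbor is drawn twice.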

Step S22, in the sub-target graph data, determining a third weight of each neighbor user node by adopting the aggregation sub-model, where the third weight is used to represent the importance of the neighbor user node to the user node.

The aggregation sub-model in the risk prediction model is used to aggregate the features of the neighbor user nodes to obtain the features representing the user node. In the embodiment of the present disclosure, a GAT mechanism may be adopted to aggregate the features of the neighbor user nodes, that is, during aggregation a third weight of each neighbor user node is computed by a shallow neural network based on an attention mechanism, and the third weight represents the importance of the neighbor user node to the user node.

In one embodiment of the method of the present disclosure, after determining the third weight for different neighboring user nodes corresponding to the user node, normalization processing may be performed on the third weight, so that the third weight is easy to compare between different neighboring user nodes. Wherein the third weight may be normalized by a softmax function.
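The softmax normalization of the third weights can be sketched as follows. The raw scores are hypothetical stand-ins for the unnormalized attention outputs; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical unnormalized third weights of three neighbor user nodes.
third_weights = softmax([2.0, 1.0, 0.1])
```

After normalization the weights sum to 1 and keep the order of the raw scores, so they can be compared directly across the neighbor user nodes of one user node.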

Step S23, aggregating the neighbor user nodes in the sub-target graph data according to the third weight to obtain an aggregation result of the user node.

The third weight represents the importance of each neighbor user node relative to the user node. In aggregating the features of the neighbor user nodes, weighting by the third weight therefore increases the contribution of the features of more important neighbor user nodes to the aggregation result and reduces the contribution of the features of less important ones. Compared with directly aggregating the features of the neighbor user nodes, this effectively improves how accurately the aggregation result represents the user node.
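The weighted aggregation can be sketched as a weighted sum of neighbor feature vectors. The feature vectors and weights below are hypothetical, the weights are assumed to be already normalized, and a real GAT layer would also apply a learned linear transform and a nonlinearity, which this sketch omits.

```python
def aggregate(neighbor_features, third_weights):
    """Weighted sum of neighbor feature vectors, one weight per neighbor."""
    dimension = len(neighbor_features[0])
    return [
        sum(weight * features[d] for weight, features in zip(third_weights, neighbor_features))
        for d in range(dimension)
    ]


# Two hypothetical neighbor user nodes with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0]]
weighted = aggregate(features, [0.75, 0.25])   # attention-weighted aggregation
unweighted = aggregate(features, [0.5, 0.5])   # plain mean, for comparison
```

The weighted result is pulled toward the more important first neighbor, whereas the plain mean treats both neighbors equally.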

Step S24: determining the prediction classification of the user node according to the aggregation result.

After the aggregation result forms the representation of the user node, the prediction classification of the user node can be determined based on the aggregation result. Depending on the aggregation result, the prediction classification may be any classification of the user entities in the sample.

Step 603: determining a Focal Loss function value corresponding to the prediction classification, and adjusting the risk prediction model until convergence according to the Focal Loss function value.

In the model training process, limitations in how data is generated, acquired, and processed in real scenarios may cause a sample imbalance problem; that is, the number of user entities is unevenly distributed across the user entity classifications. Sample imbalance may prevent the model from fully learning the features of user entities of different classifications, leading to low convergence efficiency and poor model performance. At present, sample imbalance is often alleviated by oversampling and undersampling. However, graph data is sparse and has a mesh structure of nodes and edges, so oversampling and undersampling can change the structure of the graph data and affect how the graph data expresses entities and relationships.

In the embodiment of the present disclosure, a Focal Loss function is used for iteration. For example, a weighting factor $\alpha_t$ may be added to the binary cross entropy loss function; the weighting factor may be set according to the proportions of samples in the different classifications so as to down-weight the classifications that hold a larger share of the samples, thereby balancing the uneven proportions of samples across classifications, as shown in the following formula (1):

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t) \tag{1}$$

where, following the standard definition used with Focal Loss,

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$$

In formula (1), $p$ is the predicted probability of the classification $y = 1$, so that $p_t$ is the predicted probability of the correct classification; $y$ is the classification label, which in a binary problem takes the value 1 or 0 to represent the two classifications; $\mathrm{CE}(p_t)$ is the balanced cross entropy, which addresses the imbalance between different classifications in the sample by adding the weighting factor $\alpha_t$ to the binary cross entropy; and the weighting factor $\alpha_t$ is set according to the distribution of the samples across the classifications.

To further account for the fact that samples differ in how difficult they are for the model to classify, a modulation coefficient $\gamma$ may be added to reduce the weight of samples that are easy to classify, so that the model focuses more on samples that are difficult to classify, as shown in the following formula (2):

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \tag{2}$$

where $\mathrm{FL}(p_t)$ is the Focal Loss function and $\gamma$ is the modulation coefficient. For samples that are easy to classify, $p_t$ approaches 1 and $(1 - p_t)^{\gamma}$ approaches 0; for samples that are difficult to classify, $p_t$ approaches 0 and $(1 - p_t)^{\gamma}$ approaches 1. The loss of easy-to-classify samples in the Focal Loss function is thereby reduced while the loss of difficult-to-classify samples is essentially unchanged, so that the weight of difficult-to-classify samples in the loss function increases over the whole training process and the model's ability to learn from them improves.

On this basis, combining formulas (1) and (2) gives formula (3):

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{3}$$

In formula (3), the weighting factor $\alpha_t$ balances the importance of samples of different classifications, and the modulation coefficient $\gamma$ focuses the loss function on training with difficult-to-classify samples. The loss function thus avoids the effect that oversampling and undersampling have on the graph data structure, and it balances the importance of the different classifications while also accounting for differences in classification difficulty, which further improves the prediction accuracy of the model and safeguards its application performance.
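Formula (3) can be illustrated for a single binary sample with the sketch below. It rests on assumptions that the disclosure does not fix: the function name `focal_loss` is hypothetical, the defaults `alpha=0.25` and `gamma=2.0` are only illustrative, and the weighting factor is taken to be `alpha` for label 1 and `1 - alpha` for label 0.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary Focal Loss for one sample, in the form of formula (3).

    p is the predicted probability of class 1 and y is the label (1 or 0).
    The alpha and gamma defaults are illustrative assumptions.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # assumed class weighting
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)  # confidently correct sample, loss is damped
hard = focal_loss(0.10, 1)  # badly misclassified sample, loss stays large
```

Setting `gamma=0.0` removes the modulation term and recovers the balanced cross entropy of formula (1), which is a convenient sanity check when tuning the two parameters.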

For example, taking transaction data in the financial field as the sample, in a credit card anti-fraud scenario the proportion of risk users is 0.0003%, so the sample is imbalanced and risk users are difficult to identify and classify effectively. After adjustment with the Focal Loss function, the KS (Kolmogorov-Smirnov) value of the trained risk prediction model improves by 6%, which indicates a better ability to distinguish samples of different classifications and effectively improves the performance of the risk prediction model in predicting and classifying users.
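For readers unfamiliar with the KS value, the sketch below shows one common way to compute it for a binary classifier, as the largest gap between the cumulative true positive rate and false positive rate over all score thresholds. The function name `ks_statistic` and the sample scores are hypothetical, and the disclosure does not state how its KS value was calculated.

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with equal scores share one threshold, so process them together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        best = max(best, abs(tp / pos - fp / neg))
        i = j
    return best

# Hypothetical risk scores: the two risk users (label 1) score highest.
ks = ks_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

A KS value near 1 means the scores separate the two classifications almost perfectly, and a value near 0 means the scores carry little discriminating information.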

In the risk prediction model training method provided by the present disclosure, target graph data processed by the above graph data processing method is acquired, where the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes represent risk relationship features of the user nodes. A risk prediction model is then used to sample and aggregate the user nodes in the target graph data to obtain the prediction classification of each user node, and the risk prediction model is adjusted until convergence according to the Focal Loss function value corresponding to the prediction classification, yielding the trained risk prediction model. In this scheme, the risk prediction model is trained on the target graph data. During pruning of the graph data, the first weight represents the relevance between an attribute node and the user nodes of the target classification, and the second weight of a user node is determined from the first weights, so that the second weight represents the relevance between that user node and the user nodes of the target classification. A user node is therefore pruned when its second weight meets the preset pruning condition. With the target classification set according to data analysis, processing, and application requirements, the scheme effectively removes redundant nodes whose relevance to the user nodes of the target classification meets the preset pruning condition, so the graph data is simplified accurately and effectively. In addition, using the Focal Loss function for model iteration mitigates sample class imbalance and accounts for samples of differing classification difficulty, which improves training convergence efficiency and the expressive performance of the model and safeguards its application performance. Furthermore, because the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes represent the risk relationship features of the user nodes, the trained risk prediction model can accurately classify and predict the risk relationships of unknown user nodes.

The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.

Fig. 7 is a schematic structural diagram of a graph data processing apparatus 700 according to an embodiment of the present disclosure, and as shown in fig. 7, the apparatus may include: a graph data acquisition module 701, a first weight determination module 702, a second weight determination module 703 and a graph data pruning module 704; the graph data acquiring module 701 is configured to acquire original graph data, where the original graph data includes nodes and edges connecting the nodes, the nodes include user nodes and attribute nodes, the user nodes are constructed based on user entities in the sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relationships between the user entities and the attribute entities; a first weight determining module 702, configured to determine, according to a target node connected to each attribute node, a first weight corresponding to the attribute node, where the target node is a user node belonging to a target classification; a second weight determining module 703, configured to determine, according to a first weight corresponding to an attribute node to which each user node is connected, a second weight corresponding to the user node; and the graph data pruning module 704 is configured to prune the user node to obtain the target graph data when the second weight meets a preset pruning condition.

In an apparatus embodiment of the present disclosure, the first weight determination module 702 includes a first association number sub-module and a first weight determination sub-module; the first association number sub-module is used for determining a first association number of each attribute node according to the target nodes connected to the attribute node in the original graph data; and the first weight determination sub-module is used for determining the first weight corresponding to the attribute node according to the distribution of the first association number.

In an embodiment of the apparatus of the present disclosure, the first weight determination sub-module includes a node set dividing unit and a first weight assigning unit; the node set dividing unit is used for dividing the attribute nodes according to the distribution of the first association number to obtain more than two node sets; and the first weight assigning unit is used for assigning first weights corresponding to the attribute nodes to the different node sets according to the variation trend of the first association number among the different node sets.
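The first association number and first weight determination described above might be sketched as follows. All names, the bucket boundaries `(1, 3)`, and the weight values `(0.0, 0.5, 1.0)` are hypothetical; the disclosure only requires that attribute nodes be divided into node sets by the distribution of the first association number and that weights follow its trend across the sets.

```python
import bisect

def first_association_numbers(edges, target_users):
    """Count, per attribute node, the connected user nodes of the target classification."""
    counts = {}
    for user, attr in edges:
        counts.setdefault(attr, 0)
        if user in target_users:
            counts[attr] += 1
    return counts

def assign_first_weights(counts, boundaries=(1, 3), weights=(0.0, 0.5, 1.0)):
    """Place each attribute node in a node set by its count and give the set one weight.

    Assumed rule: count < 1 is set 0, 1 <= count < 3 is set 1, count >= 3 is set 2,
    so the weight rises with the first association number.
    """
    return {attr: weights[bisect.bisect_right(boundaries, c)]
            for attr, c in counts.items()}

# Hypothetical (user node, attribute node) edges; u1 to u3 are target nodes.
edges = [("u1", "phone_a"), ("u2", "phone_a"), ("u3", "phone_a"),
         ("u3", "device_b"), ("u4", "device_b"), ("u5", "ip_c")]
counts = first_association_numbers(edges, {"u1", "u2", "u3"})
first_weights = assign_first_weights(counts)
```

Assigning one weight per node set, rather than using the raw count, keeps the weights bounded and makes them less sensitive to a few attribute nodes with very large counts.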

In an embodiment of the apparatus of the present disclosure, if the second weight includes an average weight and a maximum weight, the graph data pruning module 704 is specifically configured to prune the user node to obtain the target graph data when the average weight is smaller than an average weight threshold and the maximum weight is smaller than a maximum weight threshold.

In an embodiment of the apparatus of the present disclosure, the apparatus may further include a second association number determining module, configured to determine, in the original graph data, a second association number of the user node according to the attribute node to which the user node is connected; the graph data pruning module 704 is specifically configured to prune the user node to obtain the target graph data when the second weight meets the preset pruning condition and the second association number is smaller than the association number threshold.

In an apparatus embodiment of the present disclosure, the apparatus may further include a feature missing rate determining module, configured to determine, in the original graph data, a feature missing rate of the user node according to a user feature of the user node; the graph data pruning module 704 is specifically configured to prune the user node to obtain the target graph data when the second weight meets a preset pruning condition and the feature missing rate is greater than the feature missing rate threshold.
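The pruning conditions handled by the graph data pruning module 704 in the embodiments above can be summarized in the following sketch. The function names, thresholds, and sample values are hypothetical. The sketch assumes that the average weight and maximum weight are the mean and maximum of the first weights of the attribute nodes connected to a user node, and it treats the second association number and the feature missing rate as optional extra conditions.

```python
def second_weight(attr_first_weights):
    """Return (average weight, maximum weight) over a user's connected attribute nodes."""
    if not attr_first_weights:
        return 0.0, 0.0
    return sum(attr_first_weights) / len(attr_first_weights), max(attr_first_weights)

def should_prune(avg_w, max_w, avg_thr, max_thr,
                 assoc_count=None, assoc_thr=None,
                 missing_rate=None, missing_thr=None):
    """Decide whether a user node meets the pruning conditions."""
    prune = avg_w < avg_thr and max_w < max_thr
    if assoc_count is not None and assoc_thr is not None:
        prune = prune and assoc_count < assoc_thr      # relatively isolated node
    if missing_rate is not None and missing_thr is not None:
        prune = prune and missing_rate > missing_thr   # mostly missing features
    return prune

# Hypothetical user node connected to two weakly relevant attribute nodes.
avg_w, max_w = second_weight([0.0, 0.5])
decision = should_prune(avg_w, max_w, avg_thr=0.3, max_thr=0.6,
                        assoc_count=1, assoc_thr=2,
                        missing_rate=0.9, missing_thr=0.8)
```

Requiring both the average and the maximum to fall below their thresholds keeps any user node that has even one strongly relevant attribute node, which makes the pruning conservative.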

The graph data processing device provided by the present disclosure first acquires original graph data, where the original graph data includes nodes and edges connecting the nodes, the nodes may include user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relationships between the user entities and the attribute entities. In the original graph data, a first weight corresponding to each attribute node is determined according to the target nodes connected to that attribute node, where a target node is a user node belonging to the target classification. A second weight corresponding to each user node is then determined according to the first weights of the attribute nodes connected to that user node, and the user node is pruned depending on whether its second weight meets the preset pruning condition, yielding the target graph data. In this scheme, the first weight represents the relevance between an attribute node and the user nodes of the target classification, and the second weight, determined from the first weights of the attribute nodes connected to a user node, represents the relevance between that user node and the user nodes of the target classification. Pruning the user nodes whose second weight meets the preset pruning condition therefore effectively removes redundant nodes whose relevance to the user nodes of the target classification meets the preset pruning condition, so the graph data is simplified accurately and effectively. When the pruned graph data is used for model training, training convergence efficiency and the expressive performance of the model can be improved, and the application performance of the model can be guaranteed.

Further, on the basis that the second weight meets the preset pruning condition, a second association number of a user node may be determined according to the attribute nodes connected to the user node in the original graph data, and the user node is pruned when the second association number is smaller than the association number threshold, so as to remove relatively isolated, weakly associated user nodes from the original graph data. The feature missing rate of a user node with respect to its user features in the original graph data may also be obtained, and the user node is pruned when the feature missing rate is greater than the feature missing rate threshold, so as to remove user nodes whose high feature missing rate would hinder model convergence. The second association number and the feature missing rate may also be used together to further simplify the original graph data while accurately retaining the required information.

Fig. 8 is a second schematic structural diagram of a risk prediction model training apparatus 800 according to an embodiment of the disclosure. As shown in Fig. 8, the apparatus may include: a target graph data acquisition module 801 and a risk prediction model training module 802; the target graph data acquisition module 801 is configured to acquire target graph data, where the target graph data is obtained by the graph data processing apparatus of the third aspect, and the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes are used to represent risk relationship features of the user nodes; the risk prediction model training module 802 is configured to perform sampling and aggregation of user nodes on the target graph data by using a risk prediction model, obtain a prediction classification for each user node, determine a Focal Loss function value corresponding to the prediction classification, and adjust the risk prediction model until convergence according to the Focal Loss function value.

In an embodiment of the present disclosure, the risk prediction model includes a sampling submodel and an aggregation submodel, and the risk prediction model training module 802 includes a sampling submodule, an aggregation submodule, and a prediction submodule; the sampling submodule is used for sampling neighbor user nodes with a preset sampling quantity of each user node in the target graph data by adopting a sampling submodel to obtain sub-target graph data; the aggregation sub-module is used for determining a third weight of each neighbor user node in the sub-target graph data by adopting an aggregation sub-model, wherein the third weight is used for representing the importance degree of the neighbor user nodes to the user nodes, and is also used for aggregating the neighbor user nodes in the sub-target graph data according to the third weight to obtain an aggregation result of the user nodes; and the prediction submodule is used for determining the prediction classification of the user node according to the aggregation result.
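The sampling sub-module's behavior can be sketched as fixed-size neighbor sampling. The function name `sample_neighbors`, the adjacency data, and the choice to sample with replacement when a user node has fewer neighbors than the preset sampling number are assumptions in the style of GraphSAGE, not details taken from the disclosure.

```python
import random

def sample_neighbors(adjacency, sample_size, seed=0):
    """Keep a preset number of neighbor user nodes for every user node."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = {}
    for user, nbrs in adjacency.items():
        if not nbrs:
            sampled[user] = []  # isolated node, nothing to sample
        elif len(nbrs) >= sample_size:
            sampled[user] = rng.sample(nbrs, sample_size)  # without replacement
        else:
            # Assumed fallback: repeat neighbors to reach the preset number.
            sampled[user] = [rng.choice(nbrs) for _ in range(sample_size)]
    return sampled

# Hypothetical neighbor lists of user nodes in the target graph data.
adjacency = {"u1": ["u2", "u3", "u4", "u5"], "u2": ["u1"], "u6": []}
sub_graph = sample_neighbors(adjacency, sample_size=2)
```

Sampling a fixed number of neighbors bounds the cost of each aggregation step regardless of how many neighbors a user node has in the full graph.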

The risk prediction model training device provided by the present disclosure acquires target graph data processed by the above graph data processing method, where the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes represent risk relationship features of the user nodes. A risk prediction model is then used to sample and aggregate the user nodes in the target graph data to obtain the prediction classification of each user node, and the risk prediction model is adjusted until convergence according to the Focal Loss function value corresponding to the prediction classification, yielding the trained risk prediction model. In this scheme, the risk prediction model is trained on the target graph data. During pruning of the graph data, the first weight represents the association between an attribute node and the user nodes of the target classification, and the second weight of a user node is determined from the first weights, so that the second weight represents the association between that user node and the user nodes of the target classification. A user node is therefore pruned when its second weight meets the preset pruning condition. With the target classification set according to data analysis, processing, and application requirements, the scheme effectively removes redundant nodes whose association with the user nodes of the target classification meets the preset pruning condition, so the graph data is simplified accurately and effectively. In addition, using the Focal Loss function for model iteration mitigates sample class imbalance and accounts for samples of differing classification difficulty, which improves training convergence efficiency and the expressive performance of the model and safeguards its application performance. Furthermore, because the attribute nodes in the target graph data and the edges connecting the user nodes and the attribute nodes represent the risk relationship features of the user nodes, the trained risk prediction model can accurately classify and predict the risk relationships of unknown user nodes.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."

An electronic device 900 according to this embodiment of the disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.

As shown in Fig. 9, the electronic device 900 takes the form of a general-purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, and a bus 930 that couples various system components including the storage unit 920 and the processing unit 910.

The storage unit stores program code that can be executed by the processing unit 910, so that the processing unit 910 performs the steps according to various exemplary embodiments of the present disclosure described in the "exemplary method" section of this specification.

The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 9201 and/or a cache storage unit 9202, and may further include a read only storage unit (ROM) 9203.

Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 900 may also communicate with one or more external devices (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through the display unit 940 and an input/output (I/O) interface 950 connected to the display unit 940. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.

In an embodiment of the present disclosure, there is also provided a program product for implementing the above method, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Furthermore, the above-described drawings are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A graph data processing method, the method comprising:

acquiring original graph data, wherein the original graph data comprises nodes and edges connecting the nodes, the nodes comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on incidence relations between the user entities and the attribute entities;

determining a first weight corresponding to each attribute node according to a target node connected with each attribute node, wherein the target node is the user node belonging to a target classification;

determining a second weight corresponding to the user node according to the first weight corresponding to the attribute node connected with each user node;

and under the condition that the second weight meets a preset pruning condition, pruning the user node to obtain target graph data.

2. The method according to claim 1, wherein the determining, according to a target node connected to each attribute node, a first weight corresponding to the attribute node, the target node being the user node belonging to a target class, comprises:

in the original graph data, determining a first association number of the attribute nodes according to the target node connected with each attribute node;

and determining the first weight corresponding to the attribute node according to the distribution condition of the first association number.

3. The method according to claim 2, wherein the determining the first weight corresponding to the attribute node according to the distribution of the first association number comprises:

dividing the attribute nodes according to the distribution condition of the first association number to obtain more than two node sets;

and assigning, according to the variation trend of the first association number among the different node sets, the first weights corresponding to the attribute nodes in the different node sets.
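By way of illustration only, the division into node sets and the assignment of first weights can be sketched as a simple binning step. The cut points and the per-set weights are assumptions supplied by the caller; the claim does not specify them. The function name `assign_first_weights` is hypothetical.

```python
from bisect import bisect_right


def assign_first_weights(association_counts, boundaries, set_weights):
    """Map each attribute node to the weight of the node set it falls into.

    association_counts: dict of attribute -> first association number.
    boundaries: sorted cut points; a count c lands in set bisect_right(boundaries, c).
    set_weights: one weight per node set, len(boundaries) + 1 entries.
    """
    if len(set_weights) != len(boundaries) + 1:
        raise ValueError("need exactly one weight per node set")
    return {
        attr: set_weights[bisect_right(boundaries, count)]
        for attr, count in association_counts.items()
    }


counts = {"phone_a": 0, "ip_x": 2, "dev_z": 7}
# Assumed sets: count < 1, 1 <= count < 3, count >= 3, with increasing weights.
weights = assign_first_weights(counts, boundaries=[1, 3], set_weights=[0.0, 0.5, 1.0])
```

Here the weights increase with the association number, which is one possible reading of "variation trend"; other monotone schemes would fit the same sketch.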

4. The method according to claim 1, wherein the second weight includes an average weight and a maximum weight, and the pruning the user node to obtain the target graph data when the second weight satisfies a preset pruning condition includes:

and pruning the user nodes under the condition that the average weight is smaller than an average weight threshold value and the maximum weight is smaller than a maximum weight threshold value to obtain target graph data.
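By way of illustration only, the two-part condition of claim 4 can be sketched as below. The threshold values are assumptions, and treating a user with no connected attributes as having both weights equal to zero is also an assumption of this sketch. The function name `should_prune` is hypothetical.

```python
def should_prune(attr_weights, avg_threshold, max_threshold):
    """Return True when both the average and the maximum weight are below threshold.

    attr_weights: first weights of the attribute nodes connected to one user node.
    """
    # Assumption: a user without attributes gets average and maximum weight 0.
    average = sum(attr_weights) / len(attr_weights) if attr_weights else 0.0
    maximum = max(attr_weights, default=0.0)
    return average < avg_threshold and maximum < max_threshold


weak_user = should_prune([0.0, 0.5], avg_threshold=0.5, max_threshold=1.0)
strong_link_user = should_prune([0.2, 0.2, 1.0], avg_threshold=0.6, max_threshold=1.0)
```

The second user has a low average weight but one attribute with the highest weight, so it is retained; requiring both conditions keeps users that have even a single strong link to the target classification.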

5. The method of claim 1, further comprising, after the acquiring of the original graph data:

determining a second association number of the user nodes according to the attribute nodes connected with the user nodes in the original graph data;

the pruning the user node to obtain target graph data when the second weight meets a preset pruning condition comprises the following steps:

and pruning the user nodes to obtain the target graph data under the condition that the second weight meets a preset pruning condition and the second association number is smaller than an association number threshold.

6. The method of claim 5, wherein the user node in the original graph data further comprises a user feature, and the method further comprises, after the acquiring of the original graph data:

determining the feature missing rate of the user node according to the user feature of the user node in the original graph data;

the pruning the user node to obtain target graph data when the second weight meets a preset pruning condition comprises:

and pruning the user node to obtain the target graph data under the condition that the second weight meets a preset pruning condition and the feature missing rate is greater than a feature missing rate threshold.
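By way of illustration only, the additional conditions of claims 5 and 6 can be sketched as follows. The sketch assumes that a missing user feature is stored as `None`, that the feature missing rate is the fraction of such entries, and that a user with no features at all has a missing rate of 1.0. The function names and threshold values are hypothetical.

```python
def feature_missing_rate(features):
    """Fraction of user features whose value is missing (None)."""
    if not features:
        return 1.0  # Assumption: no features at all counts as fully missing.
    missing = sum(1 for value in features.values() if value is None)
    return missing / len(features)


def should_prune_user(weight_condition_met, degree, features,
                      degree_threshold, missing_threshold):
    """Combine the weight condition with the degree and missing-rate conditions.

    degree: second association number, i.e. number of connected attribute nodes.
    """
    return (
        weight_condition_met
        and degree < degree_threshold
        and feature_missing_rate(features) > missing_threshold
    )


features = {"age": None, "income": None, "city": "HZ", "device_count": None}
rate = feature_missing_rate(features)
sparse_user = should_prune_user(True, 1, features, degree_threshold=2, missing_threshold=0.5)
connected_user = should_prune_user(True, 3, features, degree_threshold=2, missing_threshold=0.5)
```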

7. A method for risk prediction model training, the method comprising:

acquiring target graph data, wherein the target graph data is obtained by the graph data processing method of any one of claims 1 to 6, and the attribute nodes and the edges connecting the user nodes and the attribute nodes in the target graph data are used for representing risk relationship features of the user nodes;

sampling and aggregating the user nodes on the target graph data by adopting a risk prediction model to obtain the prediction classification of each user node;

and determining a Focal Loss function value corresponding to the prediction classification, and adjusting the risk prediction model according to the Focal Loss function value until the risk prediction model converges.
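By way of illustration only, a binary Focal Loss value can be computed as below. The sketch uses the commonly cited form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); the defaults `alpha=0.25` and `gamma=2.0` are conventional choices and are assumptions here, since the claim does not state any parameter values. The function name `focal_loss` is hypothetical.

```python
import math


def focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-12):
    """Mean binary focal loss.

    probs: predicted probabilities of the positive (target) classification.
    labels: ground-truth labels, 1 for positive and 0 for negative.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)  # Guard against log(0).
        # (1 - p_t) ** gamma down-weights examples that are already well classified.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


easy = focal_loss([0.9], [1])
hard = focal_loss([0.6], [1])
```

With `gamma=0` the expression reduces to an alpha-weighted cross-entropy, which is a convenient sanity check when tuning the parameters on class-imbalanced risk data.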

8. The method of claim 7, wherein the risk prediction model comprises a sampling submodel and an aggregation submodel, and the sampling and aggregation of the user nodes on the target graph data by using the risk prediction model to obtain the prediction classification of each user node comprises:

sampling, for each user node in the target graph data, a preset sampling number of neighbor user nodes by adopting the sampling sub-model to obtain sub-target graph data;

in the sub-target graph data, determining a third weight of each neighbor user node by adopting the aggregation sub-model, wherein the third weight is used for representing the importance degree of the neighbor user node to the user node;

aggregating the neighbor user nodes in the sub-target graph data according to the third weight to obtain an aggregation result of the user nodes;

and determining the prediction classification of the user node according to the aggregation result.
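By way of illustration only, the sampling and weighted aggregation of claim 8 can be sketched in plain Python. The sketch assumes uniform random sampling of neighbors and an attention-style third weight obtained as a softmax over dot-product similarity; the claim does not specify either choice. All function names are hypothetical, and a non-empty neighbor list is assumed in the aggregation step.

```python
import math
import random


def sample_neighbors(neighbors, k, rng):
    """Sample at most k neighbor user nodes (assumed: uniform, without replacement)."""
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)


def third_weights(center, neighbor_feats):
    """Importance of each neighbor (assumed form: softmax of dot-product scores)."""
    scores = [sum(c * n for c, n in zip(center, feat)) for feat in neighbor_feats]
    top = max(scores)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def aggregate(center, neighbor_feats):
    """Weighted sum of neighbor features using the third weights."""
    weights = third_weights(center, neighbor_feats)
    return [
        sum(w * feat[i] for w, feat in zip(weights, neighbor_feats))
        for i in range(len(center))
    ]


rng = random.Random(0)
sampled = sample_neighbors(list(range(10)), 3, rng)
center = [1.0, 0.0]
neighbor_feats = [[1.0, 0.0], [0.0, 1.0]]
weights = third_weights(center, neighbor_feats)
result = aggregate(center, neighbor_feats)
```

The aggregated vector would then be passed to a classifier to obtain the prediction classification; that final step is omitted here because the claim leaves its form open.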

9. A graph data processing apparatus, characterized in that the apparatus comprises:

a graph data acquisition module, configured to acquire original graph data, wherein the original graph data comprises nodes and edges connecting the nodes, the nodes comprise user nodes and attribute nodes, the user nodes are constructed based on user entities in a sample, the attribute nodes are constructed based on attribute entities in the sample, and the edges are constructed based on association relationships between the user entities and the attribute entities;

a first weight determining module, configured to determine, according to a target node to which each attribute node is connected, a first weight corresponding to the attribute node, where the target node is the user node belonging to a target classification;

a second weight determining module, configured to determine, according to the first weight corresponding to the attribute node to which each user node is connected, a second weight corresponding to the user node;

and a graph data pruning module, configured to prune the user node to obtain target graph data when the second weight meets a preset pruning condition.

10. A risk prediction model training apparatus, the apparatus comprising:

a target graph data acquisition module, configured to acquire target graph data, where the target graph data is acquired by the graph data processing apparatus according to claim 9, and attribute nodes in the target graph data, and edges connecting the user nodes and the attribute nodes are used to represent risk relationship features of the user nodes;

a risk prediction model training module, configured to sample and aggregate the user nodes on the target graph data by adopting a risk prediction model to obtain the prediction classification of each user node;

wherein the risk prediction model training module is further configured to determine a Focal Loss function value corresponding to the prediction classification, and to adjust the risk prediction model according to the Focal Loss function value until the risk prediction model converges.