CN115293872A - Method for establishing risk identification model and corresponding device

Info

Publication number: CN115293872A
Authority: CN (China)
Prior art keywords: graph, network, subgraph, edges, training
Legal status: Pending
Application number: CN202210793704.3A
Other languages: Chinese (zh)
Inventor
李金膛
陈亮
吴若凡
朱亮
田胜
但家旺
孟昌华
王维强
Current Assignee: Alipay Hangzhou Information Technology Co Ltd
Original Assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210793704.3A
Publication of CN115293872A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 — Computer-aided design [CAD]
    • G06F30/20 — Design optimisation, verification or simulation
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F2119/00 — Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 — Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]


Abstract

Embodiments of this specification provide a method and a corresponding apparatus for establishing a risk identification model. The method comprises: obtaining a heterogeneous network graph constructed from network behavior data of users, wherein the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior subjects and behavior objects, and the edges are determined according to behavior relations between the behavior subjects and the behavior objects; masking edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph; training a graph autoencoder using the remaining subgraph and the mask subgraph, wherein the graph autoencoder comprises an encoding network and a first decoding network, the encoding network obtains a representation vector for each node from the input remaining subgraph, the first decoding network predicts the masked edges using the representation vectors of the nodes, and the training objective comprises minimizing the difference between the prediction result and the mask subgraph; and constructing a risk identification model using the encoding network of the trained graph autoencoder. The method and apparatus can improve the recognition effect of the risk identification model.

Description

Method for establishing risk identification model and corresponding device
Technical Field
One or more embodiments of the present disclosure relate to the field of artificial intelligence technologies, and in particular, to a method and a corresponding apparatus for establishing a risk identification model.
Background
Today, as internet technology develops, users' online behaviors present various risks. For example, there may be various forms of fraud, such as network transaction fraud, fake part-time-job fraud, and online dating fraud, as well as other risk forms such as money laundering and cheating. In actual risk control scenarios, the graph neural network model is a widely applied deep neural network model. Graph neural networks show strong learning and representation capability in modeling the association relations between nodes in a graph structure. However, current representation learning based on graph neural networks adopts supervised or semi-supervised modes, so its effect depends to a great extent on labeled data. In risk control scenarios, labeled data is scarce, difficult to obtain, and costly, which easily leads to poor representation learning by the graph neural network and, in turn, degrades the recognition effect of the risk identification model.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure disclose a method for establishing a risk identification model and a corresponding apparatus, so as to improve an identification effect of the risk identification model.
According to a first aspect, the present disclosure provides a method of establishing a risk identification model, the method comprising:
obtaining a heterogeneous network graph constructed from network behavior data of users, wherein the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior subjects and behavior objects, and the edges are determined according to behavior relations between the behavior subjects and the behavior objects;
masking edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph;
training a graph autoencoder using the remaining subgraph and the mask subgraph; wherein the graph autoencoder comprises an encoding network and a first decoding network; the encoding network obtains a representation vector for each node from the input remaining subgraph, the first decoding network predicts the masked edges using the representation vectors of the nodes, and the training objective comprises: minimizing the difference between the prediction result and the mask subgraph;
and constructing a risk identification model using the encoding network of the trained graph autoencoder.
According to an implementable manner of the embodiments of the present application, masking the edges in the heterogeneous network graph comprises:
randomly sampling edges in the heterogeneous network graph, forming a mask subgraph from the M sampled edges, and masking the M sampled edges in the heterogeneous network graph to obtain a remaining subgraph; or,
randomly sampling edges in the heterogeneous network graph and taking M1 sampled edges as root nodes; generating random-walk paths starting from each root node, forming a mask subgraph from the obtained paths, and masking the M2 edges contained in the paths to obtain a remaining subgraph;
wherein M, M1 and M2 are positive integers.
According to an implementable manner of the embodiments of the present application, the graph autoencoder further comprises a second decoding network, wherein the second decoding network predicts the degree of each node using the representation vectors of the nodes;
the training objective further comprises: minimizing the difference between the prediction result of the second decoding network and the degree of each node in the heterogeneous network graph.
According to an implementable manner of the embodiments of the present application, training the graph autoencoder using the remaining subgraph and the mask subgraph comprises:
determining a total training loss in each iteration, wherein the total training loss is determined from a first training loss and a second training loss; the first training loss is obtained from the difference between the prediction result of the first decoding network and the mask subgraph, and the second training loss is obtained from the difference between the prediction result of the second decoding network and the degree of each node in the heterogeneous network graph; and updating the model parameters of the graph autoencoder using the value of the total training loss until a preset training end condition is reached.
According to an implementable manner of an embodiment of the present application, the risk identification model is used for performing risk identification on a target node, a target edge or a target subgraph in the heterogeneous network graph.
According to an implementable manner of the embodiments of the present application, the constructing a risk identification model using the encoding network of the trained graph autoencoder comprises:
acquiring training data of a risk identification model;
and performing transfer learning of the risk identification model on the encoding network of the trained graph autoencoder using the training data, wherein the risk identification model comprises the encoding network and a classification network.
According to an implementable manner in an embodiment of the present application, the obtaining of the training data of the risk identification model includes at least one of:
obtaining nodes labeled as risky users and non-risky users from the heterogeneous network graph as training data; or,
obtaining edges labeled as risky behaviors and non-risky behaviors from the heterogeneous network graph as training data; or,
obtaining subgraphs labeled as risky user sets and non-risky user sets from the heterogeneous network graph as training data.
In a second aspect, an apparatus for establishing a risk identification model is provided, the apparatus comprising:
a network behavior data acquisition unit configured to acquire a heterogeneous network graph constructed from network behavior data of users, wherein the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior subjects and behavior objects, and the edges are determined according to behavior relations between the behavior subjects and the behavior objects;
a graph masking unit configured to mask edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph;
a graph training unit configured to train a graph autoencoder using the remaining subgraph and the mask subgraph; wherein the graph autoencoder comprises an encoding network and a first decoding network; the encoding network obtains a representation vector for each node from the input remaining subgraph, the first decoding network predicts the masked edges using the representation vectors of the nodes, and the training objective comprises: minimizing the difference between the prediction result and the mask subgraph;
and a model building unit configured to build a risk identification model using the encoding network of the trained graph autoencoder, wherein the risk identification model is used to perform risk identification on target nodes, target edges or target subgraphs in an input network graph to be identified.
According to a third aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method as described above.
According to a fourth aspect, the present disclosure provides a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and the processor, when executing the executable code, implements the method as described above.
According to the specific embodiments provided by the present application, the application can achieve the following technical effects:
1) By masking edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph, and having the graph autoencoder predict the mask subgraph from the remaining subgraph for self-supervised learning, the method, unlike the traditional contrastive learning mode, is not limited by the amount of labeled data; this ensures the representation capability of the trained encoding network and improves the recognition effect of the risk identification model.
2) By performing path-level edge masking via random-walk paths, path-level learning can be achieved, the graph autoencoder can capture long-range features of nodes, overfitting is avoided, and the representation vectors produced by the encoding network have better generalization and robustness.
3) The application adopts an asymmetric graph autoencoder structure in which edge reconstruction and node-degree reconstruction are realized by the first decoding network and the second decoding network, respectively; performing graph representation learning with these two reconstruction tasks improves the representation capability of the encoding network and further improves the recognition effect of the risk identification model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 illustrates an exemplary system architecture diagram to which embodiments of the disclosure may be applied;
FIG. 2 is a flowchart of a method for establishing a risk identification model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a principle of constructing a risk identification model according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an apparatus for establishing a risk identification model according to an embodiment of the present application.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
To alleviate the difficulty of obtaining labeled data, graph self-supervised learning has been proposed and has become a brand-new mode of graph representation learning. Graph self-supervised learning can reduce the excessive dependence of graph representation models on labeled data and provides a new idea for training on large amounts of unlabeled data.
At present, existing graph self-supervised learning is mainly represented by contrastive learning, which constructs multiple contrastive views of graph data through data augmentation. The training objective is to minimize the distance between positive sample pairs from different views and to maximize the distance between negative sample pairs from different views. However, this method has two main disadvantages:
1) The effect of the graph representation model depends on the data augmentation algorithm and the selection of positive and negative sample pairs, and manual selection is often needed to guarantee model performance.
2) Multiple contrastive views need to be constructed through graph data augmentation, and generating the model's representations under different views during training incurs large extra computational overhead.
The present application provides a new graph self-supervised learning manner to improve graph representation capability. To facilitate understanding, the system architecture on which the application is based is first described. FIG. 1 illustrates an exemplary system architecture to which embodiments of the disclosure may be applied. The system mainly comprises an apparatus for establishing a risk identification model and a risk identification apparatus. The apparatus for establishing the risk identification model acquires batches of user network behavior data from a data warehouse and analyzes them to establish the risk identification model.
The risk identification apparatus performs risk identification on target nodes, target edges or target subgraphs in graph data using the trained risk identification model.
The apparatus for establishing the risk identification model and the risk identification apparatus in the system can be implemented on the server side. The server side can be a single server, a server group composed of multiple servers, or a cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system that remedies the defects of high management difficulty and weak service expansibility in traditional physical hosts and Virtual Private Server (VPS) services. Besides the server side, they may also be implemented on a computer terminal with strong computing capability.
It should be understood that the number of risk identification model building devices, risk identification devices, and data warehouses in fig. 1 is merely illustrative. There may be any number of risk identification model building devices, risk identification devices, and data repositories, as desired for an implementation.
FIG. 2 is a flowchart of a method for establishing a risk identification model according to an embodiment of the present disclosure. It can be understood that the method may be performed by the apparatus for establishing a risk identification model in the system shown in FIG. 1. Referring to FIG. 2, the method includes:
Step 202: obtaining a heterogeneous network graph constructed from network behavior data of users, wherein the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior subjects and behavior objects, and the edges are determined according to behavior relations between the behavior subjects and the behavior objects.
Step 204: masking edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph.
Step 206: training a graph autoencoder using the remaining subgraph and the mask subgraph; the graph autoencoder comprises an encoding network and a first decoding network; the encoding network obtains a representation vector for each node from the input remaining subgraph, the first decoding network predicts the masked edges using the representation vectors of the nodes, and the training objective comprises: minimizing the difference between the prediction result and the mask subgraph.
Step 208: constructing a risk identification model using the encoding network of the trained graph autoencoder.
According to the technical content provided by this embodiment, a mask subgraph and a remaining subgraph are obtained by masking edges in the heterogeneous network graph, and the graph autoencoder predicts the mask subgraph from the remaining subgraph for self-supervised learning. Compared with the traditional contrastive learning mode, this is not limited by the amount of labeled data, which ensures the representation capability of the trained encoding network and improves the recognition effect of the risk identification model.
The respective steps shown in fig. 2 will be explained below.
First, the above step 202, namely "acquiring a heterogeneous network graph constructed by using network behavior data of a user" will be described in detail with reference to an embodiment.
While using the network, a user generates a large amount of network behavior data recorded by the server side; such data is usually stored in a data warehouse and reflects the associations between a large number of behavior subjects and behavior objects.
Risk identification is usually scene-specific, as are the types of behavior subjects, behavior objects and network behaviors to be attended to and analyzed in a scene. Therefore, behavior subjects, behavior objects and network behaviors of the types corresponding to the target scene can be obtained from the data warehouse to construct the heterogeneous network graph. Constructing the heterogeneous network graph for a specific scene can greatly reduce the size of the graph data. The heterogeneous network graph comprises nodes and edges; the nodes comprise behavior subjects and behavior objects, and the edges are determined according to the network behavior relations between them.
The behavior subject types, behavior object types and network behavior types corresponding to the target scene can be set in advance according to experience.
Taking network transaction risk as an example, a behavior subject may be an account, a bank card, etc. A behavior object may also be an account or a bank card, or a red envelope identifier, etc. That is, the behavior subjects and behavior objects are finance-related subjects and objects. The edges between nodes can be finance-related behavior relations occurring between behavior subjects and behavior objects, such as payment, deposit and withdrawal, contract binding, and red-envelope sending and receiving.
Taking online dating fraud risk as an example, a behavior subject may be a social network account, an instant messaging account, a financial account, a host address, a client identifier, and so on. The edges between nodes may be acts of sending a friend request, adding a friend, chatting, transferring money, sending and receiving red envelopes, sending links, and so on.
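For illustration only, the scene-specific graph construction described above can be sketched as follows. All identifiers (record fields, behavior-type names) are hypothetical and not taken from this application; the point is the filtering by scene-specific behavior types and the subject/object-to-node mapping.

```python
from collections import namedtuple

# Hypothetical record layout: (behavior subject, behavior type, behavior object).
Record = namedtuple("Record", ["subject", "behavior", "object"])

def build_heterogeneous_graph(records, scene_behavior_types):
    """Build (nodes, edges) for one target scene.

    Filtering by the behavior types configured for the scene greatly reduces
    the size of the graph data, as described above.
    """
    nodes, edges = set(), []
    for r in records:
        if r.behavior not in scene_behavior_types:
            continue  # drop behaviors irrelevant to this risk scene
        nodes.add(r.subject)
        nodes.add(r.object)
        edges.append((r.subject, r.object, r.behavior))
    return nodes, edges

records = [
    Record("account_A", "payment", "account_B"),
    Record("account_A", "send_red_envelope", "red_envelope_1"),
    Record("account_C", "chat", "account_B"),  # not a finance-related behavior
]
# Network-transaction-risk scene: keep only finance-related behavior types.
nodes, edges = build_heterogeneous_graph(records, {"payment", "send_red_envelope"})
```

Under this sketch, `account_C` never enters the graph because its only behavior is outside the configured scene.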
The following describes step 204, namely "masking edges in the heterogeneous network graph to obtain a mask subgraph and a remaining subgraph", in detail with reference to an embodiment.
As one implementable manner, when masking (Mask) edges in the heterogeneous network graph G in this step, the edges in the heterogeneous network graph may be randomly sampled. The sampled edges form a mask subgraph, denoted $G_{mask}$. Masking the sampled edges in the heterogeneous network graph yields a remaining subgraph, denoted $G'$. In the field of artificial intelligence, masking means occluding; in this embodiment the sampled edges in the heterogeneous network graph are occluded so that they are treated as non-existent edges in the remaining subgraph. Under this strategy, each edge in the heterogeneous network graph is in fact regarded as an independent sample: M edges are randomly sampled and masked, producing a remaining subgraph containing the structure of the unmasked edges, while the mask subgraph contains the structure of the masked edges. M is a positive integer, and may be an empirical or experimental value.
The remaining subgraph $G'$ and the mask subgraph $G_{mask}$ can be expressed as:
$G = G' \cup G_{mask}, \quad G' \cap G_{mask} = \emptyset$    (1)
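The edge-level strategy can be sketched in a few lines. This is a hedged illustration (the sampling procedure and data layout are assumptions, not the application's exact implementation): sample M edges to form the mask subgraph, and treat them as non-existent in the remaining subgraph G'.

```python
import random

def edge_level_mask(edges, m, seed=42):
    """Randomly sample m edges as the mask subgraph; the rest form G'."""
    rng = random.Random(seed)
    masked_idx = set(rng.sample(range(len(edges)), m))
    mask_subgraph = [e for i, e in enumerate(edges) if i in masked_idx]
    remaining_subgraph = [e for i, e in enumerate(edges) if i not in masked_idx]
    return mask_subgraph, remaining_subgraph

graph_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
mask_sub, remaining_sub = edge_level_mask(graph_edges, m=2)
# The two subgraphs partition the edge set, matching formula (1).
```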
the edge-level masking method cannot capture remote information in the graph structure of the heterogeneous network graph, and is difficult to learn the correlation between different edges. The present application thus also provides another way to implement masking based on path level. The edges in the heterogeneous network graph G may be randomly sampled, and M1 sampled edges are used as root nodes. Then, each root node is taken as a starting point to carry out path random walk, and each obtained path forms a mask subgraph
Figure BDA0003734772380000084
And (4) carrying out mask processing on M2 edges contained in each path to obtain a residual subgraph G'. Wherein, M1 and M2 are positive integers, and can be empirical values or experimental values.
Random Walk (Random Walk) is a mathematical statistical model that generates a series of paths, each of which is Random. Random walk to generate mask subgraph
Figure BDA0003734772380000085
Can be expressed as:
Figure BDA0003734772380000086
the RandomWalk () represents a function adopted by a random walk model, wherein R is a root node, n is the number of paths, and l is the length of the path. The number of paths and the path length may adopt empirical values and experimental values, and may be specifically determined according to sparsity of the heterogeneous network graph, and if the structure of the heterogeneous network graph is sparse, n and l may take smaller values; if the structure of the heterogeneous network graph is compact, n and l can take larger values.
The random walk may be an unbiased random walk or a biased random walk (Node 2 VecWalk), and since the random walk is the existing technology at present, this is only utilized in the embodiment of the present application, and therefore, detailed description is not given.
Similar to the edge-level mask approach, the remaining subgraph G' and the mask subgraph
Figure BDA0003734772380000091
The relationship (c) can still be expressed by the formula (1). The mask mode of the path level can better construct a mask subgraph, so that the model learns the correlation between the node representations on the mask path through the rest subgraphs, and the long-distance features of the nodes are captured. And the generalization and robustness of the representation learned by the model can be enhanced through path learning.
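The path-level strategy above can be sketched as follows, under stated assumptions: unbiased walks are used for simplicity (the text also allows biased Node2Vec-style walks), the graph is treated as undirected, and roots are drawn as endpoints of randomly sampled edges.

```python
import random
from collections import defaultdict

def path_level_mask(edges, m1, n, l, seed=7):
    """Mask every edge visited by n random walks of length l from m1 roots."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # treat the graph as undirected for the walk
    # Roots are endpoints of randomly sampled edges, per the text above.
    roots = [rng.choice(rng.choice(edges)) for _ in range(m1)]
    masked = set()
    for root in roots:
        for _ in range(n):            # n walks per root
            cur = root
            for _ in range(l):        # each walk has length l
                nxt = rng.choice(adj[cur])
                masked.add((min(cur, nxt), max(cur, nxt)))
                cur = nxt
    mask_subgraph = [e for e in edges if (min(e), max(e)) in masked]
    remaining_subgraph = [e for e in edges if (min(e), max(e)) not in masked]
    return mask_subgraph, remaining_subgraph

ring = [(i, (i + 1) % 6) for i in range(6)]  # a small 6-node cycle graph
mask_sub, remaining_sub = path_level_mask(ring, m1=2, n=1, l=2)
```

As with the edge-level sketch, the two returned subgraphs partition the edge set of the input graph, so formula (1) still holds.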
The above step 206, namely "training the graph autoencoder using the remaining subgraph and the mask subgraph", is described in detail below with reference to an embodiment.
A graph autoencoder also adopts self-supervised learning; its idea is to reconstruct the input data in order to learn an effective graph representation. The structure of the graph autoencoder in this embodiment may be as shown in FIG. 3 and mainly comprises an encoding network and a first decoding network.
The encoding network is used to obtain a representation vector Z for each node from the input remaining subgraph G'. The encoding network may adopt a graph neural network, which may include but is not limited to a graph convolutional network, a graph attention network, and the like.
Taking a two-layer graph convolutional network as an example, the working principle of the encoding network can be expressed as:
$Z = f(X, \hat{A}) = \hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)}) W^{(1)}$    (3)
where f() is the processing function adopted by the encoder and ReLU() is the linear rectification function, one of the activation functions. X is the node feature matrix, and $W^{(l)}$ is the parameter matrix of the l-th layer. $\hat{A}$ is the normalized adjacency matrix of the graph G', which can be determined by:
$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$    (4)
where $\tilde{A} = A + I$ is the adjacency matrix A of G' with self-loops added, i.e., A plus the identity matrix I, and $\tilde{D}$ is the degree matrix of G' after the self-loops are added. The degree matrix is a diagonal matrix whose diagonal elements are the degrees of the nodes in the graph G'; the degree of a node is the number of edges connected to it.
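The encoder of formulas (3) and (4) can be sketched in numpy. The toy graph, feature dimensions and weight shapes below are illustrative assumptions; the code shows only the normalized-adjacency computation and the two-layer propagation.

```python
import numpy as np

def normalized_adjacency(a):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}, formula (4)."""
    a_tilde = a + np.eye(a.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(a, x, w0, w1):
    """Z = A_hat ReLU(A_hat X W0) W1, formula (3): one row of Z per node."""
    a_hat = normalized_adjacency(a)
    h = np.maximum(0.0, a_hat @ x @ w0)         # first layer with ReLU
    return a_hat @ h @ w1                       # second layer

rng = np.random.default_rng(0)
a = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                    # adjacency of a 3-node path graph
x = rng.normal(size=(3, 4))                     # node feature matrix X
z = gcn_encode(a, x, rng.normal(size=(4, 8)), rng.normal(size=(8, 2)))
```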
The first decoding network is used to predict the masked edges using the representation vectors of the nodes, which is equivalent to reconstructing the mask subgraph, i.e., predicting from the representation vectors whether a masked edge exists between two nodes. After the heterogeneous network graph is masked in step 204, the mask subgraph naturally provides positive samples, and edges that do not exist between nodes in the original graph provide natural negative samples, so samples need not be constructed and labeled manually. The first decoding network may be implemented using a classification network.
The prediction result output by the first decoding network should be as consistent as possible with the mask subgraph, so the training objective of the graph autoencoder may include: minimizing the difference between the prediction result of the first decoding network and the mask subgraph.
As one of the realizable ways, the first training loss L may be constructed using the training objectives described above structure For example, a cross entropy loss function is used, as shown in the following formula:
Figure BDA0003734772380000101
wherein z is i The subscript denoting node i, i.e., z, is the node identification. (i, j) represents an edge between node i and node j.
Where h () is a processing function adopted by the first decoding network, and can be represented as:
h(x,y)=σ(x T y) (6)
alternatively, the first and second electrodes may be,
h(x,y)=σ(MLP(x·y)) (7)
wherein σ () is a sigmoid activation function. MLP () is a function of the fully connected network and can be expressed as:
MLP(x·y)=ReLU(ReLU((x·y)W^(0))W^(1))…W^(l-1) (8)
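The simpler decoder of formula (6) — the sigmoid of a dot product of two characterization vectors — can be sketched directly in plain Python. The function name `edge_score` is ours; the patent only defines h() abstractly.

```python
import math

# Sketch of the decoder from formula (6): h(x, y) = sigmoid(x^T y).
def edge_score(x, y):
    """Probability that a (masked) edge exists between two nodes,
    given their characterization vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 / (1.0 + math.exp(-dot))

# Similar vectors -> score near 1; orthogonal vectors -> score 0.5.
print(edge_score([1.0, 2.0], [1.0, 2.0]))  # sigmoid(5) ≈ 0.993
print(edge_score([1.0, 0.0], [0.0, 1.0]))  # 0.5
```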
as an achievable way, the parameters of the model (graph self-encoder) can be updated in a way such as gradient descent by using the value of the first training loss in each iteration until the preset training end condition is met. The training ending condition may include, for example, that a value of the training loss is less than or equal to a preset training loss threshold, the number of iterations reaches a preset number threshold, and the like.
Still further, the graph self-encoder may include a second decoding network, which predicts the degree of each node using that node's characterization vector. After edges are masked, the degrees of some nodes change, and the second decoding network is used to reconstruct the degree information of the nodes. The second decoding network may be implemented using a regression network (e.g., a fully connected network).
The prediction result output by the second decoding network should match the degree of each node in the heterogeneous network graph as closely as possible; therefore, the training objectives of the graph self-encoder may further include minimizing the difference between the prediction result of the second decoding network and the degree of each node in the heterogeneous network graph.
As one of the realizable ways, a second training loss L_degree can be constructed using the output of the second decoding network, for example with a mean squared error loss function, as shown in the following formula:

L_degree = (1/|V|) Σ_{i∈V} (d_i − d̂_i)² (9)

where d_i denotes the degree of node i in the heterogeneous network graph and d̂_i denotes the degree of node i predicted by the second decoding network.
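The mean-squared-error objective over node degrees is straightforward to sketch. This is our illustration of formula (9), not code from the patent; the function name and the toy values are assumptions.

```python
# Sketch of the second training loss of formula (9): mean squared error
# between each node's true degree in the full heterogeneous graph and
# the degree predicted by the second decoding network.
def degree_loss(true_degrees, predicted_degrees):
    n = len(true_degrees)
    return sum((d - p) ** 2 for d, p in zip(true_degrees, predicted_degrees)) / n

# Three nodes with true degrees (2, 3, 1); the decoder predicts (2.0, 2.5, 1.5).
print(degree_loss([2, 3, 1], [2.0, 2.5, 1.5]))  # (0 + 0.25 + 0.25) / 3 ≈ 0.1667
```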
As another achievable way, the value of the total training loss L can be determined in each iteration, and the parameters of the model (the graph self-encoder) are updated by means such as gradient descent until a preset training end condition is met. The training end condition may include, for example, the value of the training loss being less than or equal to a preset training loss threshold, the number of iterations reaching a preset number threshold, and so on.
The total training loss is determined jointly by L_structure and L_degree, for example using the following formula:

L = L_structure + αL_degree (10)

where α is a hyperparameter controlling the weight of L_degree.
By adopting an asymmetric graph self-encoder structure, edge reconstruction and node-degree reconstruction are carried out by the first and second decoding networks respectively. Performing graph representation learning with these two reconstruction tasks improves the representation capability of the encoding network and, in turn, the identification effect of the risk identification model.
After the graph self-encoder is obtained through training, its encoding network is extracted; given graph data as input, the encoding network outputs the characterization vectors of all nodes. A specific risk identification model can then be constructed by connecting the encoding network to a downstream task.
The above step 208, namely "building a risk identification model by using a coding network in a trained graph self-encoder" will be described in detail below with reference to the embodiments.
The main structure of the risk identification model comprises the coding network of the graph self-encoder obtained through the above training steps and a classification network. Given the heterogeneous network graph as input, the coding network outputs the characterization vectors of all nodes in it, and the classification network uses these characterization vectors for risk identification. The objects of risk identification mainly include, but are not limited to, three types: identifying whether a target node in the heterogeneous network graph is a risky user, identifying whether a target edge in the heterogeneous network graph represents risky behavior, and identifying whether a target subgraph in the heterogeneous network graph is a set of risky users.
As shown in fig. 3, constructing the risk identification model requires transfer learning on top of the already trained coding network using labeled training data. The training process corresponding to the transfer learning is supervised: first, training data for the risk identification model is obtained, which in practice is obtained by labeling the heterogeneous network graph; then, transfer learning of the risk identification model is performed on the coding network of the trained graph self-encoder using this training data. During transfer learning, since the coding network has already learned the characterization vectors of the nodes, only the classification network needs to be learned, so learning is fast.
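The idea that only the classification network needs to be learned can be sketched with frozen embeddings and a small classifier head. This is our simplification, not the patent's implementation: the encoder's output is treated as a fixed list of vectors, and the "classification network" is reduced to a logistic-regression head trained by stochastic gradient descent; all names and toy values are assumptions.

```python
import math

# Sketch of the transfer-learning step: the encoder is already trained,
# so its output embeddings are fixed, and only a classifier head
# (here a logistic-regression head) is fitted on labeled nodes.
def train_classifier(embeddings, labels, lr=0.5, epochs=200):
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y  # gradient of cross-entropy w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy frozen embeddings: risky nodes cluster at +1, safe nodes at -1.
emb = [[1.0], [0.9], [-1.0], [-0.8]]
lab = [1, 1, 0, 0]
w, b = train_classifier(emb, lab)
score = 1.0 / (1.0 + math.exp(-(w[0] * 1.0 + b)))
print(score > 0.9)  # True: a clearly risky embedding scores high
```

Because the encoder is frozen, each update touches only the small head, which is why this stage converges quickly compared with training the whole model.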
If the risk identification model identifies the target node, when training data is acquired, nodes marked as risk users and non-risk users can be acquired from the heterogeneous network graph to serve as the training data, and the marked nodes serve as samples.
For example, messages may be obtained from databases of official institutions such as public security bureaus or courts indicating that certain users are illegal users, low-credit users, and the like; the users indicated by such messages can be determined to be known risky users, and the corresponding nodes are determined and labeled in the heterogeneous network graph. As another example, users who are complained about frequently can be considered known risky users, and their corresponding nodes are determined and labeled in the heterogeneous network graph. As yet another example, some users may be detected as risky by existing high-accuracy detection tools, or identified as risky manually, and the corresponding nodes are determined and labeled in the heterogeneous network graph.

Similarly, some users are explicitly non-risky. For example, a message obtained from an official channel may indicate that certain users are highly recommended or approved, or are highly reputable users, such as users with substantial charitable activity, users who promote city construction, or users rated as models; these are determined to be known safe users, and the corresponding nodes are determined and labeled in the heterogeneous network graph. As another example, some users may be detected as safe by existing high-accuracy detection tools, or identified as safe manually, and the corresponding nodes are determined and labeled in the heterogeneous network graph.
If the risk recognition model is used for recognizing the target edge, when the training data is obtained, edges marked as risk behaviors and non-risk behaviors can be obtained from the heterogeneous network graph to serve as the training data, and the marked edges are samples.
For example, messages indicating that some user behaviors are behaviors that violate laws, regulations, and the like or result in violations of laws, regulations, and the like may be obtained from a database of an official agency such as a public security, a court, and the like, and the user behaviors indicated by the messages may be determined as known risk behaviors, and corresponding edges may be determined and labeled in the heterogeneous network. For another example, if some user behaviors are complained, the user behaviors may be considered as known risk behaviors, and corresponding edges are determined and labeled in the heterogeneous network. For another example, some existing high-accuracy detection tools detect that some user behaviors are risk behaviors, or some risk behaviors can be identified in a manual identification mode, and corresponding edges are determined and labeled in the heterogeneous network.
Similarly, there are also some user behaviors that are explicitly non-risk behaviors. For example, obtaining a message from some official channel indicates that some user behavior is highly recommended or approved, such as charitable behavior, investment behavior that promotes city construction, etc., which are determined to be known security behavior, and corresponding edges are determined in the heterogeneous network. For another example, some user behaviors can be detected to be security behaviors through some existing high-accuracy detection tools, or some security behaviors can be identified through a manual identification mode, and corresponding edges are determined and labeled in the heterogeneous network graph.
If the risk recognition model is used for recognizing a target subgraph, subgraphs labeled as a risk user set and a non-risk user set can be obtained from the heterogeneous network graph as training data when training data are obtained, and the labeled subgraphs are used as samples. One more typical set of risk users is a sub-graph of the corresponding nodes and edges of the users in a fraud group in the heterogeneous network graph.
For example, messages indicating that some user sets are fraud groups can be obtained from a database of an official institution such as public security, court, etc., the user sets indicated by the messages can be determined as risk user sets, and corresponding sub-graphs can be determined and labeled in the heterogeneous network graph. For another example, if some detection technical means detect that some users are fraud groups, corresponding sub-graphs are determined and labeled in the heterogeneous network graph.
It should be noted that, during the training process of the risk identification model, the adjacency matrix and the node feature matrix of the heterogeneous network graph are mainly input into the risk identification model. The node characteristic data adopted in the node characteristic matrix may be, for example, a node type, a registration duration, and related attribute information of a corresponding user. The characteristics of the edges, such as behavior type, behavior time, behavior location, behavior times, etc., may be fused in determining the node characteristics. Since the present application does not make any changes to this section, it will not be described in detail here.
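The paragraph above describes fusing edge features (behavior type, counts, and so on) into node features. A minimal sketch of that fusion, under our own assumptions about the data layout (a dict of node attributes and a list of typed edges; all names are ours):

```python
# Sketch: each node's feature record aggregates counts of the behavior
# types on its incident edges, alongside node-level attributes.
def build_node_features(nodes, edges, behavior_types):
    """`nodes` maps node id -> dict of node attributes (e.g. node type).
    `edges` is a list of (u, v, behavior_type) triples."""
    features = {}
    for nid, attrs in nodes.items():
        counts = {t: 0 for t in behavior_types}
        features[nid] = {"node_type": attrs["node_type"], **counts}
    for u, v, btype in edges:
        features[u][btype] += 1
        features[v][btype] += 1
    return features

nodes = {0: {"node_type": "account"}, 1: {"node_type": "bank_card"}}
edges = [(0, 1, "binding"), (0, 1, "payment"), (0, 1, "payment")]
feats = build_node_features(nodes, edges, ["binding", "payment"])
print(feats[0])  # {'node_type': 'account', 'binding': 1, 'payment': 2}
```

In practice the records would be encoded into the numeric node feature matrix, but the aggregation step is the part the text describes.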
After the training of the risk recognition model is completed, the trained risk recognition model can be used for risk recognition, namely, the risk recognition model is used for carrying out risk recognition on target nodes, target edges or target sub-graphs in the heterogeneous network graph. For example, the information of the target node, the graph adjacency matrix and the node feature matrix of the heterogeneous network graph are input into a risk identification model, and a risk identification result for the target node is output by the risk identification model, for example, whether the target node has a preset type of risk or not.
Taking network transaction risk as an example, the heterogeneous network graph obtained from user network behavior data comprises nodes such as accounts, bank cards, and red envelope ids. The edges among the nodes are the finance-related behavior relationships that occur among accounts, bank cards, red envelope ids, and so on, such as payment, deposit and withdrawal, signing and card binding, and red envelope sending and receiving. First, edges in the heterogeneous network graph are masked to obtain a mask subgraph and a remaining subgraph, and then the graph self-encoder is trained using the remaining subgraph and the mask subgraph.
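The masking step itself is simple to sketch: randomly sample M edges as the mask subgraph; the remaining subgraph keeps all nodes but drops those edges. This is our illustration (names and the fixed seed are assumptions), matching the random-sampling variant described earlier.

```python
import random

# Sketch: split an edge list into a mask subgraph (M sampled edges)
# and a remaining subgraph (everything else; all nodes are kept).
def mask_edges(edges, m, seed=0):
    rng = random.Random(seed)
    masked = rng.sample(edges, m)
    masked_set = set(masked)
    remaining = [e for e in edges if e not in masked_set]
    return masked, remaining

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
mask_sub, rest_sub = mask_edges(edges, 2)
print(len(mask_sub), len(rest_sub))  # 2 3
```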
After the training is finished and the graph self-encoder is obtained, the coding network in the graph self-encoder is connected with a classification network to construct a risk identification model. Nodes in the heterogeneous network graph that are definitely non-risk users and risk users are labeled. And training the risk recognition model by using the labeled heterogeneous network diagram.
After this training is finished, a risk identification model is obtained; it can perform risk identification on a target node in the heterogeneous network graph to determine whether the target node is a risky user.
In the training process of the risk identification model, the training target is to minimize the difference between the classification network's output for a sample and that sample's labeled result. This supervised learning process constructs a loss function from the training target and, in each iteration, updates the model parameters using the value of the loss function by means such as gradient descent, until a preset training end condition is met. The training end condition may include, for example, the value of the loss function being less than or equal to a preset threshold, the number of iterations reaching a preset number threshold, and so on. Either only the parameters of the classification network are updated, or the parameters of both the coding network and the classification network are updated.
The foregoing is a detailed description of the methods provided by the present disclosure and has described certain embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The apparatus provided by the present disclosure is described in detail below. Fig. 4 is a block diagram illustrating an apparatus for establishing a risk identification model according to an embodiment of the present disclosure, and as shown in fig. 4, the apparatus 400 may include: a graph acquisition unit 401, a graph masking unit 402, a graph training unit 403, and a model construction unit 404. The main functions of each component unit are as follows:
the graph obtaining unit 401 is configured to obtain a heterogeneous network graph constructed by using network behavior data of a user, where the heterogeneous network graph includes nodes and edges, the nodes include behavior bodies and behavior objects, and the edges are determined according to behavior relationships between the behavior bodies and the behavior objects.
And a graph masking unit 402 configured to perform masking processing on edges in the heterogeneous network graph to obtain a masked subgraph and remaining subgraphs.
A graph training unit 403 configured to train a graph self-encoder using the remaining subgraphs and the masked subgraphs; wherein the graph self-encoder comprises an encoding network and a first decoding network; the coding network obtains the characteristic vector of each node by using the input residual subgraph, the first decoding network predicts the edge to be masked by using the characteristic vector of each node, and the training target comprises the following steps: minimizing the difference between the predicted result and the mask subgraph.
And a model building unit 404 configured to build a risk recognition model by using the trained coding network in the graph self-encoder, wherein the risk recognition model is used for performing risk recognition on a target node, a target edge or a target sub-graph in the input network graph to be recognized.
As one of the realizable manners, the graph masking unit 402 may be specifically configured to randomly sample edges in the heterogeneous network graph, configure M sampled edges as a mask subgraph, and mask M sampled edges in the heterogeneous network graph to obtain a remaining subgraph; wherein M is a positive integer.
As one of the realizable manners, the graph masking unit 402 may be specifically configured to randomly sample edges in the heterogeneous network graph, and take M1 sampled edges as root nodes; respectively taking each node as a starting point to carry out path random walk, forming a mask subgraph by each obtained path, and carrying out mask processing on M2 edges contained in each path to obtain a residual subgraph; wherein M1 and M2 are positive integers.
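The path-wise variant — run a random walk from each sampled root and mask every edge on the walk — can be sketched as follows. This is our illustration, not the patent's code: the adjacency-dict representation, the walk length, and the undirected edge key are assumptions, and the walk shown is the unbiased one.

```python
import random

# Sketch: unbiased random walk from each root; every traversed edge
# is added to the mask set (the mask subgraph's edges).
def random_walk_mask(adj, roots, walk_len, seed=0):
    rng = random.Random(seed)
    masked = set()
    for root in roots:
        node = root
        for _ in range(walk_len):
            neighbors = adj.get(node, [])
            if not neighbors:
                break
            nxt = rng.choice(neighbors)
            masked.add((min(node, nxt), max(node, nxt)))  # undirected key
            node = nxt
    return masked

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
masked = random_walk_mask(adj, roots=[0], walk_len=3)
print(sorted(masked))  # the edges traversed by the walk from node 0
```

A biased (Node2Vec-style) walk would only change how `nxt` is chosen; the masking logic is the same.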
The random walk may be an unbiased random walk or a biased random walk (e.g., a Node2Vec walk).
Furthermore, the graph self-encoder may further include a second decoding network, and the second decoding network predicts the degree of each node by using the characterization vector of each node. The training target further comprises: the difference between the prediction result of the second decoding network and the degree of each node in the heterogeneous network graph is minimized.
As one of the realizable ways, the graph training unit 403 determines a total training loss in each iteration, the total training loss being determined by a first training loss resulting from a difference between the prediction result of the first decoding network and the mask subgraph and a second training loss resulting from a difference between the prediction result of the second decoding network and the degrees of each node in the heterogeneous network graph; and updating the model parameters of the graph self-encoder by using the value of the total training loss until a preset training end condition is reached.
The constructed risk identification model is used for carrying out risk identification on target nodes, target edges or target subgraphs in the heterogeneous network graph.
As one of the realizable ways, the model building unit 404 may be specifically configured to: acquiring training data of a risk identification model; and performing transfer learning of a risk identification model on the coding network in the trained graph self-encoder by using the training data, wherein the risk identification model comprises the coding network and the classification network.
The model construction unit 404 may perform at least one of the following when obtaining the training data of the risk recognition model:
acquiring nodes marked as risk users and non-risk users from the heterogeneous network graph as training data; or
obtaining edges marked as risk behaviors and non-risk behaviors from the heterogeneous network graph as training data; or
subgraphs labeled as a risk user set and a non-risk user set are obtained from the heterogeneous network graph as training data.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The computer storage media described above may take any combination of one or more computer-readable media, including, but not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above embodiments are only examples of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for establishing a risk identification model, the method comprising:
the method comprises the steps that a heterogeneous network graph constructed by using network behavior data of a user is obtained, the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior bodies and behavior objects, and the edges are determined according to behavior relations between the behavior bodies and the behavior objects;
masking the edges in the heterogeneous network graph to obtain a mask subgraph and a residual subgraph;
training a graph self-encoder using the residual subgraph and the mask subgraph; wherein the graph self-encoder comprises an encoding network and a first decoding network; the coding network obtains a feature vector of each node by using the input residual subgraph, the first decoding network predicts the masked edge by using the feature vector of each node, and the training target comprises the following steps: minimizing the difference between the prediction result and the mask subgraph;
and constructing a risk identification model by using a coding network in the trained graph self-encoder.
2. The method of claim 1, wherein masking edges in the heterogeneous network graph comprises:
randomly sampling edges in the heterogeneous network graph, forming M sampled edges into a mask subgraph, and masking the M sampled edges in the heterogeneous network graph to obtain a residual subgraph; or,
randomly sampling edges in the heterogeneous network graph, and taking M1 sampled edges as root nodes; respectively taking each node as a starting point to carry out path random walk, forming a mask subgraph by each obtained path, and carrying out mask processing on M2 edges contained in each path to obtain a residual subgraph;
wherein M, M1 and M2 are positive integers.
3. The method of claim 1, wherein the graph self-encoder further comprises a second decoding network that predicts the degree of each node using the characterization vector of each node;
the training target further comprises: minimizing a difference between a prediction result of the second decoding network and a degree of each node in the heterogeneous network graph.
4. The method of claim 3, wherein training the graph self-encoder using the remaining subgraphs and mask subgraphs comprises:
determining a total training loss in each iteration, wherein the total training loss is determined by a first training loss and a second training loss, the first training loss is obtained by the difference between the prediction result of the first decoding network and a mask subgraph, and the second training loss is obtained by the difference between the prediction result of the second decoding network and the degree of each node in the heterogeneous network graph; and updating the model parameters of the graph self-encoder by using the value of the total training loss until a preset training ending condition is reached.
5. The method of claim 1, wherein the risk identification model is used to identify risk for a target node, a target edge, or a target subgraph in the heterogeneous network graph.
6. The method according to any one of claims 1 to 5, wherein the constructing a risk identification model using the trained coding network in the graph self-encoder comprises:
acquiring training data of a risk identification model;
and performing transfer learning of a risk identification model on a coding network in the trained graph self-encoder by using the training data, wherein the risk identification model comprises the coding network and a classification network.
7. The method of claim 6, wherein the obtaining training data for a risk identification model comprises at least one of:
acquiring nodes marked as risk users and non-risk users from the heterogeneous network graph as training data; or
obtaining edges labeled as risky behaviors and non-risky behaviors from the heterogeneous network graph as training data; or
and acquiring subgraphs labeled as a risk user set and a non-risk user set from the heterogeneous network graph as training data.
8. An apparatus for establishing a risk identification model, the apparatus comprising:
the graph acquisition unit is configured to acquire the heterogeneous network graph constructed by using network behavior data of a user, the heterogeneous network graph comprises nodes and edges, the nodes comprise behavior bodies and behavior objects, and the edges are determined according to behavior relations between the behavior bodies and the behavior objects;
the graph masking unit is configured to mask edges in the heterogeneous network graph to obtain a mask subgraph and residual subgraphs;
a graph training unit configured to train the graph self-encoder using the remaining subgraph and the mask subgraph; wherein the graph self-encoder comprises an encoding network and a first decoding network; the coding network obtains a feature vector of each node by using the input residual subgraph, the first decoding network predicts the masked edge by using the feature vector of each node, and the training target comprises the following steps: minimizing the difference between the prediction result and the mask subgraph;
and the model building unit is configured to build a risk recognition model by utilizing the coding network in the trained graph self-coder, wherein the risk recognition model is used for carrying out risk recognition on target nodes, target edges or target subgraphs in the input network graph to be recognized.
9. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1 to 7.
10. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code, the processor when executing the executable code implementing the method of any one of claims 1 to 7.
CN202210793704.3A 2022-07-07 2022-07-07 Method for establishing risk identification model and corresponding device Pending CN115293872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210793704.3A CN115293872A (en) 2022-07-07 2022-07-07 Method for establishing risk identification model and corresponding device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210793704.3A CN115293872A (en) 2022-07-07 2022-07-07 Method for establishing risk identification model and corresponding device

Publications (1)

Publication Number Publication Date
CN115293872A true CN115293872A (en) 2022-11-04

Family

ID=83822508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210793704.3A Pending CN115293872A (en) 2022-07-07 2022-07-07 Method for establishing risk identification model and corresponding device

Country Status (1)

Country Link
CN (1) CN115293872A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882767A (en) * 2023-09-08 2023-10-13 之江实验室 Risk prediction method and device based on imperfect heterogeneous relation network diagram
CN116882767B (en) * 2023-09-08 2024-01-05 之江实验室 Risk prediction method and device based on imperfect heterogeneous relation network diagram

Similar Documents

Publication Publication Date Title
CN112926990B (en) Method and device for fraud detection
CN110309840A (en) Risk trade recognition methods, device, server and storage medium
CN113536383B (en) Method and device for training graph neural network based on privacy protection
CN111179089B (en) Money laundering transaction identification method, device and equipment
CN114818999B (en) Account identification method and system based on self-encoder and generation countermeasure network
CN110796269B (en) Method and device for generating model, and method and device for processing information
Amado et al. Lstm-based goal recognition in latent space
CN113657896A (en) Block chain transaction topological graph analysis method and device based on graph neural network
CN114548300B (en) Method and device for explaining service processing result of service processing model
CN115293247A (en) Method for establishing risk identification model, risk identification method and corresponding device
CN113011884A (en) Account feature extraction method, device and equipment and readable storage medium
CN111079930B (en) Data set quality parameter determining method and device and electronic equipment
CN115293235A (en) Method for establishing risk identification model and corresponding device
CN113362852A (en) User attribute identification method and device
CN115130542A (en) Model training method, text processing device and electronic equipment
CN115293872A (en) Method for establishing risk identification model and corresponding device
CN113656699A (en) User feature vector determination method, related device and medium
CN115935265B (en) Method for training risk identification model, risk identification method and corresponding device
CN112257689A (en) Training and recognition method of face recognition model, storage medium and related equipment
CN116541792A (en) Method for carrying out group partner identification based on graph neural network node classification
CN116595486A (en) Risk identification method, risk identification model training method and corresponding device
CN115170136A (en) Method and device for updating trusted model
CN115936104A (en) Method and apparatus for training machine learning models
CN112507185B (en) User portrait determination method and device
CN114418767A (en) Transaction intention identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination