CN113360908A - Data processing method, violation recognition model training method and related equipment

Info

Publication number: CN113360908A
Application number: CN202110678064.7A
Authority: CN
Inventors: 赵宏宇; 赵国庆; 蒋宁; 王洪斌; 吴海英; 林亚臣
Original assignee: Mashang Consumer Finance Co Ltd
Current assignee: Mashang Consumer Finance Co Ltd
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-09-07

Abstract

The invention provides a data processing method, a violation recognition model training method and related equipment, wherein the method comprises the following steps: acquiring operation behavior data in a preset time period; determining a first feature vector and a second feature vector based on the operation behavior data, wherein the first feature vector is a vector of the graph data, and the second feature vector is a vector of the operation behavior feature data; the graph data is used for representing the structural characteristics of the operation behaviors, and the operation behavior characteristic data is used for reflecting the relevant information of the operation behaviors; inputting the first feature vector and the second feature vector into the violation identification model to obtain an identification result; the violation identification model comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network. Due to the fact that the structural information of the nodes in the graph data is blended into the second feature vector corresponding to the abnormal operation behavior, accuracy of identifying the abnormal operation behavior is improved.

Description

Data processing method, violation recognition model training method and related equipment

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method, a violation recognition model training method and related equipment.

Background

With the development of computer technology, the security of operation behaviors has become an important indicator for data management. At present, a determination condition for abnormal operation behaviors is generally set based on experience, and abnormal operation behaviors are identified against that condition in order to improve the security of data management. For example, when the number of operation behaviors generated by a user within a preset time period is greater than a preset value, the user's operation behavior is determined to be abnormal. Because abnormal operation behaviors are complex, such determination conditions identify them with poor accuracy.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a violation identification model training method and related equipment, and aims to solve the problem that abnormal operation behaviors are identified with poor accuracy.

In a first aspect, an embodiment of the present invention provides a data processing method, including:

acquiring operation behavior data in a preset time period;

determining a first feature vector and a second feature vector based on the operation behavior data, wherein the first feature vector is a vector of graph data, the second feature vector is a vector of operation behavior feature data, the graph data is used for representing structural features of operation behaviors, the graph data comprises a plurality of nodes and connecting lines of the nodes, each node is used for representing one piece of data in the operation behavior feature data, and each connecting line is used for representing that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

inputting the first feature vector and the second feature vector into a violation identification model to obtain an identification result;

the violation identification model comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network; the self-coding network is used for supervised learning of abnormal operation behaviors, the input of the first neural network layer is the first feature vector, the output of the first neural network layer is a first graph embedding vector, the first graph embedding vector is used for representing structural information of nodes in the graph data, the input of the second neural network layer is the first graph embedding vector and the second feature vector, and the output of the second neural network layer is the identification result.

In a second aspect, an embodiment of the present invention provides a method for training a violation recognition model, including:

acquiring sample data to be trained;

determining a third feature vector and a fourth feature vector based on the sample data, wherein the third feature vector is a vector of graph data, the fourth feature vector is a vector of operation behavior feature data, the graph data is used for representing structural features of operation behaviors, the graph data comprises a plurality of nodes and connecting lines between the nodes, each node is used for representing one piece of data in the operation behavior feature data, and each connecting line is used for representing that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

training a violation identification model to be trained based on the third feature vector and the fourth feature vector to obtain the violation identification model;

the violation identification model to be trained comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network; the self-coding network is used for supervised learning of abnormal operation behaviors, the input of the first neural network layer is the third feature vector, the output of the first neural network layer is a second graph embedding vector, the second graph embedding vector is used for representing structural information of nodes in the graph data, the input of the second neural network layer is the second graph embedding vector and the fourth feature vector, and the output of the second neural network layer is an identification result of the abnormal operation behaviors.

Because the structural information of the nodes in the graph data is merged into the fourth feature vector, the model is trained on both the operation behavior feature data and the structural information of the nodes in the graph data, so that the accuracy of abnormal operation behavior identification can be improved when the trained violation identification model is used for identification.

In a third aspect, an embodiment of the present invention provides a data processing apparatus, including:

the first acquisition module is used for acquiring operation behavior data in a preset time period;

a first determining module, configured to determine a first feature vector and a second feature vector based on the operation behavior data, where the first feature vector is a vector of graph data, the second feature vector is a vector of operation behavior feature data, the graph data is used to represent a structural feature of an operation behavior, the graph data includes a plurality of nodes and connection lines between nodes, each node is used to represent one piece of data in the operation behavior feature data, and each connection line is used to represent that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

the input module is used for inputting the first feature vector and the second feature vector into a violation identification model to obtain an identification result;

In a fourth aspect, an embodiment of the present invention provides a device for training a violation recognition model, including:

the second acquisition module is used for acquiring sample data to be trained;

a second determining module, configured to determine, based on the sample data, a third feature vector and a fourth feature vector, where the third feature vector is a vector of graph data, the fourth feature vector is a vector of operation behavior feature data, the graph data is used to represent a structural feature of an operation behavior, the graph data includes multiple nodes and connection lines between the nodes, each node is used to represent one piece of data in the operation behavior feature data, and each connection line is used to represent that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

the training module is used for training the violation identification model to be trained based on the third feature vector and the fourth feature vector to obtain the violation identification model;

In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the data processing method described above or implements the steps of the violation recognition model training method described above.

In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above data processing method or the steps of the above violation recognition model training method.

The embodiment of the invention acquires operation behavior data in a preset time period; determines a first feature vector and a second feature vector based on the operation behavior data, wherein the first feature vector is a vector of graph data, the second feature vector is a vector of operation behavior feature data, the graph data is used for representing structural features of operation behaviors, the graph data comprises a plurality of nodes and connecting lines between the nodes, each node is used for representing one piece of data in the operation behavior feature data, and each connecting line is used for representing that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior feature data is used for reflecting relevant information of operation behaviors; and inputs the first feature vector and the second feature vector into a violation identification model to obtain an identification result; the violation identification model comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network; the self-coding network is used for supervised learning of abnormal operation behaviors, the input of the first neural network layer is the first feature vector, the output of the first neural network layer is a first graph embedding vector, the first graph embedding vector is used for representing structural information of nodes in the graph data, the input of the second neural network layer is the first graph embedding vector and the second feature vector, and the output of the second neural network layer is the identification result.
In this way, the structural information of the nodes in the graph data is merged into the second feature vector corresponding to the abnormal operation behavior, so that the accuracy of the violation identification model in identifying the abnormal operation behavior is improved. In addition, the abnormal operation behaviors are supervised and learned in the self-coding network, so that the task learning of the abnormal operation behaviors based on the graph data can be realized, and the accuracy of the violation identification model in identifying the abnormal operation behaviors is further improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a data processing method provided by an embodiment of the invention;

FIG. 2 is a second flowchart of a data processing method according to an embodiment of the present invention;

FIG. 3 is a network structure diagram of a first neural network layer in the data processing method according to the embodiment of the present invention;

FIG. 4 is a flowchart of a method for training a violation identification model according to an embodiment of the present invention;

FIG. 5 is a block diagram of a data processing apparatus provided by an embodiment of the present invention;

FIG. 6 is a block diagram of an exemplary violation identification model training apparatus;

FIG. 7 is a block diagram of an electronic device provided in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, which is a flowchart of a data processing method according to an embodiment of the present invention, the method includes the following steps:

Step 101, acquiring operation behavior data in a preset time period;

The data processing method provided by the embodiment of the application can be applied to a data management operation platform or system and is used for managing the operation behaviors of users on the platform or system. A user can register an account on the platform or system and perform corresponding operations on the account, such as resource storage or resource transfer-out. The operation behavior data can be understood as the behavior data generated when the user performs such operations.

Optionally, the operation behavior data may be acquired from a historical time period of the data management operation platform or system.

Step 102, determining a first feature vector and a second feature vector based on the operation behavior data, where the first feature vector is a vector of graph data, the second feature vector is a vector of the operation behavior feature data, the graph data is used for representing structural features of an operation behavior, the graph data includes a plurality of nodes and connection lines between the nodes, each node is used for representing one piece of data in the operation behavior feature data, and each connection line is used for representing that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

As shown in FIG. 2, in the embodiment of the present application, after the operation behavior data is obtained, data processing may be performed on it to obtain graph data representing the structural features of the operation behaviors and operation behavior feature data reflecting relevant information of the operation behaviors; that is, each piece of data in the operation behavior feature data reflects the operation behavior information generated by one operation behavior. Each node in the graph data corresponds to one piece of data in the operation behavior feature data, i.e., one node represents one operation behavior, and connection lines are established between nodes whose operation behaviors have an association relationship within the preset time period, so as to reflect the change of resources in the account.
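As a minimal sketch of how such graph data might be assembled, the following builds an adjacency matrix with one node per operation record. The record layout and the association rule (one operation's target account is another operation's source account) are hypothetical; the source does not specify how the association relationship is determined.

```python
import numpy as np

# Hypothetical operation records: (record_id, source_account, target_account).
records = [
    ("op0", "acct_a", "acct_b"),
    ("op1", "acct_b", "acct_c"),
    ("op2", "acct_c", "acct_d"),
    ("op3", "acct_x", "acct_y"),
]

def build_adjacency(records):
    """Return a symmetric 0/1 adjacency matrix with one node per record.

    Assumed association rule (not from the source): two operations are
    linked when the target account of one is the source account of the other.
    """
    n = len(records)
    adj = np.zeros((n, n), dtype=np.float32)
    for i, (_, _, dst_i) in enumerate(records):
        for j, (_, src_j, _) in enumerate(records):
            if i != j and dst_i == src_j:
                adj[i, j] = adj[j, i] = 1.0
    return adj

adjacency = build_adjacency(records)
```

Here `op0`, `op1` and `op2` form a chain, while `op3` stays isolated, which mirrors the contrast between chained and discrete structures.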

It should be understood that if the operation behavior data includes a large number of normal operation behaviors, the node structure of the graph data appears as a relatively discrete structure. If the operation behavior data includes a large number of abnormal operation behaviors, the node structure of the graph data appears as a chain operation behavior structure, a nested ring operation behavior structure, a centralized transfer-in and decentralized transfer-out operation behavior structure and a decentralized transfer-in and centralized transfer-out operation behavior structure.

Optionally, the operation behavior feature data includes user information, resource interaction information, and derived features. The user information may include a user type, an identity card, an e-mail address, a telephone number, a home address, an Internet Protocol (IP) address acquired when an operation is performed, a Global Positioning System (GPS) location, and the like.

Step 103, inputting the first feature vector and the second feature vector into a violation identification model to obtain an identification result;

In an embodiment of the present application, the first graph embedding vector may include a matrix formed by embedding vectors of a plurality of nodes. As shown in fig. 2, the first neural network layer may output a first graph embedding vector based on the first feature vector, and the second neural network layer may splice an embedding vector of each node in the first graph embedding vector with a corresponding second feature vector, and then identify the spliced vector to determine whether the operation behavior is abnormal. Due to the fact that the structural information of the nodes in the graph data is blended into the second feature vector corresponding to the operation behaviors, accuracy of the violation identification model in identifying the abnormal operation behaviors is improved. In addition, the abnormal operation behavior is supervised and learned in the self-coding network, so that the task learning of the abnormal operation behavior can be carried out based on the graph data, and the accuracy of the violation identification model in identifying the abnormal operation behavior can be further improved.
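The splicing step can be sketched as a row-wise concatenation. This is only an illustration with made-up shapes and values; the function name `splice_features` is hypothetical, and the ensemble learner that would consume the spliced matrix is not specified by the source and is left out.

```python
import numpy as np

def splice_features(graph_embedding, behavior_features):
    """Concatenate each node's embedding with its behavior feature vector."""
    if graph_embedding.shape[0] != behavior_features.shape[0]:
        raise ValueError("both inputs need one row per node")
    return np.concatenate([graph_embedding, behavior_features], axis=1)

# Illustrative shapes only: 3 nodes, 2-dim embeddings, 4 behavior features.
embedding = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])
features = np.arange(12, dtype=float).reshape(3, 4)
spliced = splice_features(embedding, features)
```

Each row of `spliced` then carries both the structural information and the behavior features of one node.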

Optionally, as shown in fig. 3, in some embodiments, the first neural network layer includes a self-coding subnetwork formed by sequentially connecting 2 × N layers of first Fully-Connected networks (FCs), and an abnormal operation behavior task learning subnetwork formed by sequentially connecting M layers of second Fully-Connected networks;

the input of the self-coding sub-network is the first characteristic vector, the output of the first full-connection network structure at the Nth layer is the first graph embedding vector, the input of the abnormal operation behavior task learning sub-network is the first graph embedding vector, and the output is a prediction result of whether the node has the abnormal operation behavior or not based on the first graph embedding vector.

In this embodiment, the values of N and M may be set according to actual needs; for example, in some embodiments, N is 2 and M is 2. The first feature vector described above can be understood as an adjacency matrix composed of the operation behavior nodes. The self-coding sub-network can be understood as a self-coding network based on Structural Deep Network Embedding (SDNE), which forms an encoder and a decoder. The self-coding sub-network is used for learning the graph data in the training process: its input is the adjacency matrix (i.e., the first feature vector) and its output is the reconstructed adjacency matrix.

Optionally, the abnormal operation behavior task learning sub-network may be a multi-layer neural network, i.e., a Multi-Layer Perceptron (MLP). Taking the first graph embedding vector as input, the abnormal operation behavior task learning sub-network can predict whether a node exhibits abnormal operation behavior based on the node and its structural information.
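The structure described above (2 × N fully-connected layers for the self-coding sub-network, M fully-connected layers for the task learning sub-network) can be sketched as a forward pass. This is an untrained illustration with N = 2 and M = 2; the layer widths, the sigmoid and tanh activations and the random initialisation are assumptions, not details taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Randomly initialised weights and zero bias for one fully-connected layer."""
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(adjacency, encoder, decoder, task_head):
    """Return (embedding, reconstruction, anomaly probability) per node."""
    h = adjacency
    for w, b in encoder:            # layers 1..N of the self-coding sub-network
        h = sigmoid(h @ w + b)
    embedding = h                   # output of the N-th layer
    for w, b in decoder:            # layers N+1..2N reconstruct the adjacency rows
        h = sigmoid(h @ w + b)
    reconstruction = h
    t = embedding
    for w, b in task_head[:-1]:     # hidden layers of the task sub-network
        t = np.tanh(t @ w + b)
    w, b = task_head[-1]
    prob = sigmoid(t @ w + b).ravel()
    return embedding, reconstruction, prob

n_nodes, hidden, embed = 6, 4, 2    # hypothetical sizes, N = 2 and M = 2
encoder = [dense(n_nodes, hidden), dense(hidden, embed)]
decoder = [dense(embed, hidden), dense(hidden, n_nodes)]
task_head = [dense(embed, hidden), dense(hidden, 1)]

adjacency = (rng.random((n_nodes, n_nodes)) < 0.3).astype(float)
z, x_hat, y_hat = forward(adjacency, encoder, decoder, task_head)
```

The embedding `z` is taken from the middle of the self-coding sub-network, so the same vector feeds both the decoder and the task head.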

It should be noted that, in the embodiment of the present application, the loss function $L$ of the violation identification model is formed from a reconstruction loss, a task loss $L_0$ and a regularization loss $\lambda L_{reg}$. The reconstruction loss includes a first-order similarity loss $L_{1st}$ between neighboring nodes and a second-order similarity loss $L_{2st}$ between neighboring nodes. The loss function $L$ satisfies:

$L = L_{1st} + \alpha L_{2st} + \beta L_0 + \lambda L_{reg}$

wherein $\alpha$, $\beta$ and $\lambda$ are adjusting parameters.

Optionally, in some embodiments, the first-order similarity loss $L_{1st}$ satisfies the following condition:

wherein $s_{i,j}$ represents the connection between the i-th node and the j-th node, $z_i$ represents the graph embedding vector of the i-th node in the first graph embedding vector, and $z_j$ represents the graph embedding vector of the j-th node in the first graph embedding vector.
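The expression for the first-order similarity loss is not reproduced in this text. A form consistent with the symbols defined above, and with the first-order proximity loss of SDNE on which the self-coding sub-network is described as being based, would be the following. It is offered as an assumption for orientation, not as the formula of the original filing; the summation bound $n$ (the number of nodes) is likewise assumed.

```latex
L_{1st} = \sum_{i,j=1}^{n} s_{i,j} \, \lVert z_i - z_j \rVert_2^2 \tag{1}
```

Under this form, a pair of connected nodes ($s_{i,j} > 0$) is penalized in proportion to the squared distance between their embeddings, while unconnected pairs contribute nothing.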

It will be appreciated that determining the first-order similarity loss $L_{1st}$ by formula (1) above aims to constrain the latent space so that the representation vectors of the i-th node and its first-order neighbor node (i.e., the j-th node) are close to each other, with a small difference between them, thereby preserving the local structure as far as possible.

It should be noted that the i-th node may be understood as the i-th node obtained by arranging the nodes in a preset order, or as the node numbered i after the nodes are numbered. In some embodiments, the i-th node may also be referred to as node i; similarly, the j-th node may also be referred to as node j.

Optionally, in some embodiments, the second-order similarity loss $L_{2st}$ satisfies the following condition:

wherein $x_i$ is the neighbor vector of the i-th node in the first feature vector, $b_i$ is a weight greater than 1, $\odot$ indicates that a Hadamard product is calculated, and $\hat{x}_i$ represents the reconstructed adjacency vector output for the i-th node by the self-coding sub-network.
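The expression for the second-order similarity loss is likewise not reproduced in this text. Writing $\hat{x}_i$ for the reconstructed adjacency vector of the i-th node and $\odot$ for the Hadamard product, a form consistent with the definitions above and with the second-order proximity loss of SDNE would be the following. This is an assumption, including the bound $n$ (the number of nodes); in SDNE the weight $b_i$ is larger than 1 only on entries where a connection exists and equals 1 elsewhere.

```latex
L_{2st} = \sum_{i=1}^{n} \lVert (\hat{x}_i - x_i) \odot b_i \rVert_2^2 \tag{2}
```

The weighting makes the failure to reconstruct an existing connection cost more than the failure to reconstruct an absent one, which matters because adjacency matrices are typically sparse.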

It will be appreciated that determining the second-order similarity loss $L_{2st}$ by formula (2) above measures the similarity of the neighbor set of the i-th node, with the aim of making the representation vectors of nodes with more similar neighbors more similar, thereby preserving the global structure.

Optionally, the task loss $L_0$ is the error in identifying abnormal operation behaviors based on the first graph embedding vector. This loss introduces the objective of abnormal operation behavior task learning and, in some embodiments, may be calculated using a cross-entropy function. For example, in some embodiments, the task loss $L_0$ satisfies the following condition:

wherein $y$ represents the true label (abnormal operation behavior or non-abnormal operation behavior), and $\hat{y}$ represents the prediction result output by the abnormal operation behavior task learning sub-network.
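The expression for the task loss is not reproduced in this text. Since the source states that a cross-entropy function may be used, and writing $\hat{y}$ for the prediction output by the task learning sub-network, the standard binary cross-entropy would read as follows. The per-node index $i$ and the sum over the $m$ labeled nodes are assumptions added for concreteness.

```latex
L_0 = -\sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
```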

In the embodiment of the present application, the task loss $L_0$ takes task learning as its objective, so that the first graph embedding vector can be better combined with the second feature vector, and the accuracy of identifying abnormal operation behaviors is improved.

It should be noted that, in the embodiment of the present application, training the abnormal operation behavior task learning sub-network requires labels (abnormal operation behavior labels or normal operation behavior labels), whereas training the self-coding sub-network does not require labeled operation behavior data. Therefore, in the embodiments of the present application, a large amount of unlabeled operation behavior data and a small amount of labeled operation behavior data may be used for model training. It should be understood that the prediction result output by the abnormal operation behavior task learning sub-network is not the final abnormal operation behavior recognition result, but only a prediction obtained from the relationships between operation behaviors.
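One way to train on mostly unlabeled data is to evaluate the task loss only on the nodes that carry a label, while the reconstruction terms still use every node. The sketch below assumes a binary cross-entropy averaged over labeled nodes; the function name `masked_task_loss` and the boolean mask are hypothetical and not taken from the source.

```python
import numpy as np

def masked_task_loss(y_true, y_prob, has_label, eps=1e-12):
    """Mean binary cross-entropy over labelled nodes; unlabelled nodes are ignored."""
    has_label = np.asarray(has_label, dtype=bool)
    if not has_label.any():
        return 0.0
    y = np.asarray(y_true, dtype=float)[has_label]
    p = np.clip(np.asarray(y_prob, dtype=float)[has_label], eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Five nodes, only the first two carry a label (1 = abnormal, 0 = normal).
labels = [1, 0, 0, 0, 0]          # entries for unlabelled nodes are placeholders
probs = [0.9, 0.2, 0.7, 0.4, 0.6]
mask = [True, True, False, False, False]
loss = masked_task_loss(labels, probs, mask)
```

With this masking, the predictions for the three unlabeled nodes have no influence on the task loss.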

Optionally, in some embodiments, the regularization loss may also be referred to as a penalty term, and its calculation satisfies the following condition:

wherein $\theta$ denotes all parameters to be trained in the model. This term applies L2 regularization to the model, which reduces the complexity of the model and helps avoid overfitting.
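The expression for the regularization loss is not reproduced in this text. Given that the term is described as L2 regularization over all trainable parameters $\theta$, a conventional form would be the following. The factor of one half is a common convention and is an assumption here; any constant factor can be absorbed into the adjusting parameter $\lambda$.

```latex
L_{reg} = \frac{1}{2} \lVert \theta \rVert_2^2
```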

It should be noted that, various optional implementations described in the embodiments of the present invention may be implemented in combination with each other or implemented separately, and the embodiments of the present invention are not limited thereto.
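To make the combined loss concrete, the sketch below evaluates the weighted sum of the four terms with NumPy. The individual term forms follow the usual SDNE-style definitions and a binary cross-entropy, which are assumptions rather than the formulas of the original filing; `b_weight` (the weight applied where a connection exists) and all default values are hypothetical.

```python
import numpy as np

def total_loss(s, x, x_hat, z, y, y_prob, params,
               alpha=1.0, beta=1.0, lam=1e-4, b_weight=5.0, eps=1e-12):
    """Evaluate L = L_1st + alpha*L_2st + beta*L_0 + lam*L_reg (assumed term forms)."""
    # First-order term: linked nodes should have close embeddings.
    diff = z[:, None, :] - z[None, :, :]
    l_1st = float(np.sum(s * np.sum(diff ** 2, axis=2)))
    # Second-order term: reconstruction error, weighted more on existing links.
    b = np.where(x > 0, b_weight, 1.0)
    l_2st = float(np.sum(((x_hat - x) * b) ** 2))
    # Task term: binary cross-entropy of the predicted anomaly probability.
    p = np.clip(y_prob, eps, 1.0 - eps)
    l_0 = float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    # Regularization term: half the squared L2 norm of all trainable parameters.
    l_reg = 0.5 * float(sum(np.sum(w ** 2) for w in params))
    return l_1st + alpha * l_2st + beta * l_0 + lam * l_reg

# Tiny hand-checkable example with two linked nodes.
s = np.array([[0.0, 1.0], [1.0, 0.0]])
z = np.array([[0.0, 0.0], [1.0, 0.0]])
x_hat = np.array([[0.0, 0.5], [1.0, 0.0]])
y = np.array([1.0, 0.0])
y_prob = np.array([0.5, 0.5])
params = [np.array([[2.0]])]
loss = total_loss(s, s, x_hat, z, y, y_prob, params,
                  alpha=1.0, beta=1.0, lam=1.0, b_weight=2.0)
```

In this example the adjacency matrix `s` doubles as the input `x`, matching the description of the first feature vector as an adjacency matrix.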

It should be understood that the above data management operation platform or system may be a financial management platform or system, such as a bank system, in which case the user is a bank user and the operation behavior is a transaction. The following description takes transactions as the operation behavior: the operation behavior data may be understood as transaction data, the abnormal operation behavior may be understood as a money laundering behavior, and the violation identification model may be understood as an anti-money laundering model. In the embodiment of the application, when the method is applied to anti-money laundering identification, it specifically comprises the following steps:

Step 201, acquiring transaction data in a preset time period;

In the embodiment of the present application, the transaction data may be currency transaction data or virtual currency (e.g., bitcoin) transaction data, namely transaction behavior data generated within the preset time period.

Step 202, determining a first feature vector and a second feature vector based on the transaction data, wherein the first feature vector is a vector of graph data used for representing a transaction structure, the second feature vector is a vector of transaction feature data, the graph data comprises a plurality of nodes and connection lines between the nodes, each node is used for representing one piece of data in the transaction feature data, and each connection line is used for representing a fund flow of the transaction; the transaction feature data is used for reflecting transaction conditions;

As shown in FIG. 2, in the embodiment of the present application, after the transaction data is obtained, data processing may be performed on it to obtain graph data representing the transaction structure and transaction feature data reflecting the transaction conditions; that is, each piece of data in the transaction feature data reflects the transaction information generated by one transaction. Each node in the graph data corresponds to one piece of data in the transaction feature data, i.e., one node represents one transaction, and connection lines are established between nodes whose transactions have an association relationship within the preset time period, so as to reflect the fund flow.

It should be understood that if the transaction data includes a large number of normal transactions, the node structure corresponding to the graph data is embodied as a relatively discrete structure. Under the condition that the transaction data comprises a large number of abnormal transactions, the node structures corresponding to the graph data are embodied as a chain transaction structure, a nested ring transaction structure, a centralized transfer-in and decentralized transfer-out transaction structure and a decentralized transfer-in and centralized transfer-out transaction structure.
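As a rough illustration of how structural differences of this kind could show up in graph data, the sketch below flags nodes whose incoming and outgoing link counts are strongly unbalanced. Treating fan-in and fan-out as indicators of the transfer-in and transfer-out structures named above is an interpretation, and the threshold `min_links` is hypothetical; in the described method such structure is learned by the first neural network layer rather than detected by rules.

```python
import numpy as np

def fan_patterns(directed_adj, min_links=3):
    """Find nodes with many incoming but few outgoing links, and vice versa."""
    adj = np.asarray(directed_adj)
    in_deg, out_deg = adj.sum(axis=0), adj.sum(axis=1)
    fan_in = np.flatnonzero((in_deg >= min_links) & (out_deg <= 1))
    fan_out = np.flatnonzero((out_deg >= min_links) & (in_deg <= 1))
    return {"fan_in": fan_in.tolist(), "fan_out": fan_out.tolist()}

# Node 0 -> node 1 -> nodes 2, 3, 4 (fan-out at node 1);
# nodes 5, 6, 7 -> node 8 -> node 9 (fan-in at node 8).
adj = np.zeros((10, 10), dtype=int)
adj[0, 1] = 1
adj[1, [2, 3, 4]] = 1
adj[[5, 6, 7], 8] = 1
adj[8, 9] = 1
patterns = fan_patterns(adj)
```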

Optionally, the transaction feature data comprises user information, fund transaction information and derived features. The user information may include a user type, an identity card, an e-mail address, a telephone number, a home address, an Internet Protocol (IP) address acquired during a transaction, a Global Positioning System (GPS) location, and the like; the fund transaction information comprises transaction time, transaction amount, purchased items, profit conditions, redemption time and the like; the derived features may include the number of transfers in and out, the ratio of transfers in and out to the total, the sum of transfers in and out, statistical features of transfers in and out, and the difference and absolute difference between adjacent transactions.
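Some of the derived features, such as the difference and absolute difference between adjacent transactions and simple statistical features, can be sketched as follows. The function name `derived_features`, the choice of mean and standard deviation as the statistics, and the sample amounts are illustrative assumptions.

```python
from statistics import mean, pstdev

def derived_features(amounts):
    """Simple derived features from a time-ordered list of transaction amounts."""
    diffs = [b - a for a, b in zip(amounts, amounts[1:])]
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": mean(amounts) if amounts else 0.0,
        "std": pstdev(amounts) if amounts else 0.0,
        "adjacent_diff": diffs,
        "adjacent_abs_diff": [abs(d) for d in diffs],
    }

features = derived_features([100.0, 250.0, 50.0])
```

Values of this kind would sit alongside the user information and fund transaction information in the second feature vector.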

Step 203, inputting the first feature vector and the second feature vector into an anti-money laundering model to obtain a recognition result;

The anti-money laundering model comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, and the first neural network layer is a self-coding network; the self-coding network is used for supervised learning of abnormal transaction behaviors, the input of the first neural network layer is the first feature vector, the output of the first neural network layer is a first graph embedding vector, the first graph embedding vector is used for representing structural information of nodes in the graph data, the input of the second neural network layer is the first graph embedding vector and the second feature vector, and the output of the second neural network layer is the identification result.

In an embodiment of the present application, the first graph embedding vector may include a matrix formed by the embedding vectors of a plurality of nodes. As shown in FIG. 2, the first neural network layer may output a first graph embedding vector based on the first feature vector, and the second neural network layer may splice the embedding vector of each node in the first graph embedding vector with the corresponding second feature vector, and then identify the spliced vector to determine whether the transaction is an abnormal transaction, where an abnormal transaction may be understood as a money laundering transaction. Because the structural information of the nodes in the graph data is merged into the second feature vector corresponding to the transaction, the accuracy of the anti-money laundering model in recognizing abnormal transactions is improved. In addition, supervised learning of abnormal transaction behaviors is added to the self-coding network, so that abnormal transaction task learning can be performed based on the graph data, and the accuracy of the anti-money laundering model in recognizing abnormal transactions can be further improved.

It should be noted that the data processing method provided in the embodiment of the present application may also be used to implement management of secure login. For example, in some embodiments, the operation behavior may be understood as a login behavior of a user, the violation identification model is used to identify an abnormal login behavior of the user, and corresponding security policy management is performed on an account of the user based on the identified abnormal login behavior. In this case, the operation behavior data may be data generated by a user logging on once, which is recorded by the platform or the system.

Further, as shown in fig. 4, an embodiment of the present application further provides a method for training a violation recognition model, including the following steps:

step 401, obtaining sample data to be trained;

step 402, determining a third feature vector and a fourth feature vector based on the sample data, where the third feature vector is a vector of graph data, the fourth feature vector is a vector of operation behavior feature data, the graph data is used for representing structural features of an operation behavior, the graph data includes a plurality of nodes and connection lines between the nodes, each node is used for representing one piece of data in the operation behavior feature data, and each connection line is used for representing that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

step 403, training the violation identification model to be trained based on the third feature vector and the fourth feature vector to obtain the violation identification model;

the violation identification model to be trained comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network based on multi-task learning; the self-coding network based on multi-task learning is used for supervised learning of abnormal operation behaviors; the input of the first neural network layer is the third feature vector, and its output is a second graph embedding vector, which represents structural information of nodes in the graph data; the input of the second neural network layer is the second graph embedding vector and the fourth feature vector, and its output is a recognition result of the abnormal operation behaviors.

In this embodiment of the application, the sample data to be trained may be operation behavior data within a period of time, and may include normal operation behavior data and abnormal operation behavior data. The operational behavior data may be monetary transaction data or virtual currency (e.g., bitcoin) transaction data, or the like. The operation behavior data is transaction behavior data generated in a preset time period.

Optionally, tags may be added to the sample data by manual labeling, so that part of the sample data carries a tag; the tag may be an abnormal transaction tag.

In the embodiment of the application, after the sample data is obtained, data processing may be performed on the sample data to obtain graph data used for representing an operation behavior structure and operation behavior feature data used for reflecting relevant information of an operation behavior, that is, each piece of data in the transaction feature data is used for reflecting one piece of operation behavior information generated by one transaction. Each node in the graph data corresponds to one piece of data in the operation behavior feature data, namely one node is used for representing one operation behavior, and the change information of resources in the account is reflected by establishing a connection line for the nodes with the operation behavior association relation in the preset time.
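As a rough illustration of turning sample data into graph data, the sketch below treats each transaction as a node and connects two nodes when they are associated. The record fields (`id`, `payer`, `payee`, `time`) and the association rule (funds received in one transaction are spent in a later one) are assumptions made for this example; the application does not define the rule in this passage.

```python
def build_graph(transactions):
    """Return (node_ids, adjacency); adjacency[i][j] == 1 marks an association."""
    node_ids = [t["id"] for t in transactions]
    n = len(transactions)
    adjacency = [[0] * n for _ in range(n)]
    for i, a in enumerate(transactions):
        for j, b in enumerate(transactions):
            # Assumed rule: the payee of transaction a pays in a later transaction b.
            if i != j and a["payee"] == b["payer"] and a["time"] < b["time"]:
                adjacency[i][j] = 1
                adjacency[j][i] = 1  # the connection line is treated as undirected
    return node_ids, adjacency


# Hypothetical records within one preset time period.
transactions = [
    {"id": "t1", "payer": "A", "payee": "B", "time": 1},
    {"id": "t2", "payer": "B", "payee": "C", "time": 2},
    {"id": "t3", "payer": "D", "payee": "E", "time": 3},
]
node_ids, adjacency = build_graph(transactions)
for node, row in zip(node_ids, adjacency):
    print(node, row)
```

Each row of the adjacency matrix is the neighbor vector of one node and can serve as that node's input to the self-coding network.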

In an embodiment of the present application, the second graph embedding vector may include a matrix formed by embedding vectors of a plurality of nodes. The first neural network layer can output a second graph embedding vector based on the third feature vector, the second neural network layer can splice the embedding vector of each node in the second graph embedding vector with the corresponding fourth feature vector, and then the spliced vectors are identified to judge whether the operation behaviors are abnormal or not.

The structural information of the nodes in the graph data is merged into the fourth feature vector, so that the operation behavior feature data and the structural information of the nodes in the graph data are trained, and the accuracy of abnormal operation behavior identification can be improved when the abnormal operation behavior identification is carried out by using the violation identification model obtained by training. Meanwhile, the abnormal operation behavior is supervised and learned in the self-coding network, so that the task learning of the abnormal operation behavior can be carried out based on the graph data, and the accuracy of the violation identification model for identifying the abnormal operation behavior can be further improved.

Optionally, the first neural network layer comprises a self-coding sub-network formed by sequentially connecting 2 × N layers of first fully-connected network structures, and an abnormal operation behavior task learning sub-network formed by sequentially connecting M layers of second fully-connected network structures;

and N and M are positive integers, the input of the self-coding sub-network is the third feature vector, the output of the Nth layer first full-connection network structure is the second graph embedding vector, the input of the abnormal operation behavior task learning sub-network is the second graph embedding vector, and the output is a prediction result of the abnormal operation behavior of the node based on the second graph embedding vector.
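The layer arrangement can be sketched as an untrained forward pass. The layer sizes, the sigmoid activation, and the random weights are assumptions chosen for illustration; only the topology (2 × N autoencoder layers, embedding taken at layer N, M task layers fed by the embedding) follows the description.

```python
import math
import random


def dense(vector, weights, biases):
    """One fully-connected layer followed by a sigmoid activation."""
    out = []
    for row, b in zip(weights, biases):
        s = sum(w * v for w, v in zip(row, vector)) + b
        out.append(1.0 / (1.0 + math.exp(-s)))
    return out


def init_layers(sizes, rng):
    """Random weights and zero biases for consecutive layers of the given sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, autoencoder, task_head):
    """Return (embedding, reconstruction, abnormality score) for one node."""
    n = len(autoencoder) // 2            # the self-coding sub-network has 2 * N layers
    h, embedding = x, None
    for index, (w, b) in enumerate(autoencoder, start=1):
        h = dense(h, w, b)
        if index == n:
            embedding = h                # output of the N-th layer
    reconstruction = h
    p = embedding
    for w, b in task_head:               # M task-learning layers
        p = dense(p, w, b)
    return embedding, reconstruction, p[0]


rng = random.Random(0)
# Hypothetical sizes: N = 2 (4 -> 3 -> 2 -> 3 -> 4) and M = 2 (2 -> 2 -> 1).
autoencoder = init_layers([4, 3, 2, 3, 4], rng)
task_head = init_layers([2, 2, 1], rng)
embedding, reconstruction, score = forward([0, 1, 1, 0], autoencoder, task_head)
print(len(embedding), len(reconstruction))
```

Because both sub-networks share the first N layers, gradients from the task prediction would also shape the embedding during training.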

It should be noted that the abnormal operation behavior task learning sub-network helps the model generalize to the anti-money laundering task: a hypothesis space that performs well on a sufficient number of training tasks (graph embedding generation tasks) will also perform well when learning the anti-money laundering task in the same environment.

Optionally, the loss function L of the violation identification model to be trained is the same as the loss function L in the data processing method embodiment; for details, refer to the data processing method embodiment, which are not repeated here.

It should be noted that the operation behavior may be understood as an operation behavior for performing a transaction, and the abnormal transaction behavior may be understood as a money laundering behavior. The training method for the violation identification model may be set according to actual needs; for example, in some embodiments, batch gradient descent may be used for training. The training process is shown in fig. 3. Since money laundering transactions have a time attribute, the edge data E_t of each time interval (time step) is selected to construct a small adjacency matrix A_t, i.e. a local graph, which is trained as one batch to generate a local graph embedding vector V_t. Tags Y_t must be input for the tagged items in the training set or validation set (e.g. batch t), whereas no tags are input for the test set (e.g. batch q). Data is input in chronological order, no connections are made between different time steps, and the amount of edge data E_t may differ from one time step to another.
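The per-time-step batching can be sketched as follows. The edge tuple layout `(time_step, u, v)` and the symmetric adjacency matrix are assumptions for this example; the sketch only shows that each time step yields its own local graph and that steps are processed in chronological order without cross-step connections.

```python
from collections import defaultdict


def local_graphs(edges):
    """Group (time_step, u, v) edges by step and build one small adjacency matrix per step."""
    by_step = defaultdict(list)
    for step, u, v in edges:
        by_step[step].append((u, v))
    batches = []
    for step in sorted(by_step):                      # chronological order
        nodes = sorted({n for edge in by_step[step] for n in edge})
        index = {node: k for k, node in enumerate(nodes)}
        matrix = [[0] * len(nodes) for _ in nodes]
        for u, v in by_step[step]:
            matrix[index[u]][index[v]] = 1
            matrix[index[v]][index[u]] = 1            # assumed undirected
        batches.append((step, nodes, matrix))
    return batches


# Hypothetical edge data spanning two time steps of different sizes.
edges = [(2, "c", "d"), (1, "a", "b"), (1, "b", "e"), (2, "d", "f"), (2, "c", "f")]
batches = local_graphs(edges)
for step, nodes, matrix in batches:
    print(step, nodes, matrix)
```

Each returned matrix would be fed to the model as one batch, with tags supplied only for tagged training or validation items.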

It should be noted that, in the second neural network layer, an ensemble learning algorithm based on decision trees may be used as the classifier to identify abnormal operation behaviors. Optionally, the ensemble learning algorithm based on decision trees may be Random Forest (RF), XGBoost, or the like. Such algorithms have the advantages that feature importance can be obtained automatically, so that feature screening can be performed effectively, and that the computation can be parallelized, so that the violation identification model of the present invention has good stability and generalization.

For a better understanding of the present application, the following detailed description is given by way of specific examples.

For example, bitcoin transaction data may be employed as the experimental data, wherein the transaction data includes legitimate transactions and illegitimate money laundering transactions (i.e., abnormal transactions). Graph data and transaction feature data can be determined based on the transaction data, wherein nodes in the graph data represent entities of transactions, and lines between the nodes represent transaction flows of bitcoins. The experiment used the first 94 regular features of each transaction and a total of 49 time steps, 34 for training and 15 for testing. AE (autoencoder) in the experiment denotes the self-encoding method corresponding to the self-coding network of the present application. The classifiers used were Random Forest (RF) and XGBoost, with the same parameter settings. Because the data is imbalanced, the metrics are reported only for illegal transactions. The experimental results are shown below:

No.	Algorithm	Precision	Recall	F1
1	RF	0.8491	0.6962	0.7651
2	AE+RF	0.9635	0.8283	0.8908
3	XGBoost	0.9707	0.6427	0.7733
4	AE+XGBoost	0.9054	0.8042	0.8518

As can be seen from the experiments, algorithm 1 and algorithm 3 use only the transaction feature data of a single transaction and do not consider the connections between transactions, so their results are mediocre. After the association relation between transactions is added (algorithms 2 and 4), most evaluation indexes of the model improve (the indexes are precision, recall, and F1, the score calculated from precision and recall), and recall in particular improves markedly. Therefore, by adopting the scheme of the application, more abnormal transactions can be identified while precision is maintained.
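The reported F1 values can be checked against the first two metric columns with a short calculation, treating F1 as the harmonic mean of precision and recall. The numbers are copied from the results table; small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the experimental results.
reported = {
    "RF": (0.8491, 0.6962, 0.7651),
    "AE+RF": (0.9635, 0.8283, 0.8908),
    "XGBoost": (0.9707, 0.6427, 0.7733),
    "AE+XGBoost": (0.9054, 0.8042, 0.8518),
}

for name, (p, r, f1) in reported.items():
    print(name, round(f1_score(p, r), 4), f1)
```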

Referring to fig. 5, fig. 5 is a structural diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 5, the data processing apparatus 500 includes:

a first obtaining module 501, configured to obtain operation behavior data within a preset time period;

a first determining module 502, configured to determine, based on the operation behavior data, a first feature vector and a second feature vector, where the first feature vector is a vector of graph data, the second feature vector is a vector of operation behavior feature data, the graph data is used to represent a structural feature of an operation behavior, the graph data includes a plurality of nodes and connection lines between nodes, each node is used to represent one piece of data in the operation behavior feature data, and each connection line is used to represent that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

an input module 503, configured to input the first feature vector and the second feature vector to the violation identification model, so as to obtain an identification result;

the violation identification model comprises a first neural network layer constructed based on a graph embedding algorithm and a second neural network layer constructed based on an ensemble learning algorithm, wherein the first neural network layer is a self-coding network based on multi-task learning; the self-coding network based on multi-task learning is used for supervised learning of abnormal operation behaviors; the input of the first neural network layer is the first feature vector, and its output is a first graph embedding vector, which represents structural information of nodes in the graph data; the input of the second neural network layer is the first graph embedding vector and the second feature vector, and its output is the identification result.

the input of the self-coding sub-network is the first characteristic vector, the output of the first full-connection network structure at the Nth layer is the first graph embedding vector, the input of the abnormal operation behavior task learning sub-network is the first graph embedding vector, and the output is a prediction result of the abnormal operation behavior of the node based on the first graph embedding vector.

Optionally, the loss function L of the violation identification model is formed based on a reconstruction loss, a task loss L₀ and a regularization loss L_reg, where the reconstruction loss includes a first-order similarity loss L_1st representing first-order similarity between neighboring nodes and a second-order similarity loss L_2nd representing second-order similarity between neighboring nodes, and the loss function L satisfies:

Optionally, the first-order similarity loss L_1st satisfies:

Optionally, the second-order similarity loss L_2nd satisfies:

Optionally, the task loss L₀ satisfies:

The electronic device provided in the embodiment of the present invention can implement each process implemented by the electronic device in the method embodiments of fig. 1 to fig. 3, and is not described herein again to avoid repetition.

Referring to fig. 6, fig. 6 is a structural diagram of a violation recognition model training apparatus according to an embodiment of the present invention, and as shown in fig. 6, the violation recognition model training apparatus 600 includes:

a second obtaining module 601, configured to obtain sample data to be trained;

a second determining module 602, configured to determine, based on the sample data, a third feature vector and a fourth feature vector, where the third feature vector is a vector of graph data, the fourth feature vector is a vector of operation behavior feature data, the graph data is used to represent a structural feature of an operation behavior, the graph data includes a plurality of nodes and connection lines between nodes, each node is used to represent one piece of data in the operation behavior feature data, and each connection line is used to represent that the operation behavior feature data corresponding to two nodes have an association relationship; the operation behavior characteristic data is used for reflecting relevant information of operation behaviors;

a training module 603, configured to train a violation identification model to be trained based on the third feature vector and the fourth feature vector, to obtain the violation identification model;

Optionally, the loss function L of the violation identification model to be trained is formed based on a reconstruction loss, a task loss L₀ and a regularization loss L_reg, where the reconstruction loss includes a first-order similarity loss L_1st representing first-order similarity between neighboring nodes and a second-order similarity loss L_2nd representing second-order similarity between neighboring nodes, and the loss function L satisfies:

wherein s_i,j represents the connection between the i-th node and the j-th node, z_i represents the graph embedding vector of the i-th node in the second graph embedding vector, and z_j represents the graph embedding vector of the j-th node in the second graph embedding vector.

wherein x_i is the neighbor vector of the i-th node in the third feature vector, b_i is a weight greater than 1, and the product operator in the formula denotes a Hadamard (element-wise) product.

Optionally, the task loss L₀ satisfies:

wherein the prediction term represents the prediction result output by the abnormal operation behavior task learning sub-network.
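As an illustration only, the sketch below computes a first-order and a second-order similarity loss in the squared-distance form commonly used by graph autoencoders. These forms are assumptions and may differ from the expressions of this application. The names `first_order_loss`, `second_order_loss`, `x_hat` (the reconstructed neighbor vector) and `beta` are hypothetical, and applying the weight greater than 1 only to non-zero entries of the neighbor vector is likewise an assumption.

```python
def first_order_loss(s, z):
    """Sum over node pairs of s[i][j] times the squared distance between z[i] and z[j]."""
    total = 0.0
    for i in range(len(z)):
        for j in range(len(z)):
            total += s[i][j] * sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
    return total


def second_order_loss(x, x_hat, beta):
    """Squared reconstruction error with an element-wise weight (beta > 1 on non-zero entries)."""
    total = 0.0
    for x_i, x_hat_i in zip(x, x_hat):
        for value, recon in zip(x_i, x_hat_i):
            b = beta if value != 0 else 1.0   # assumed element of the weight vector b_i
            total += ((recon - value) * b) ** 2
    return total


# Hypothetical toy graph with two connected nodes.
s = [[0, 1], [1, 0]]                  # connections s_i,j
z = [[0.0, 1.0], [1.0, 1.0]]          # embedding vectors z_i
x = [[0, 1], [1, 0]]                  # neighbor vectors x_i
x_hat = [[0.5, 0.5], [0.5, 0.5]]      # assumed reconstructions from the decoder

l_1st = first_order_loss(s, z)
l_2nd = second_order_loss(x, x_hat, beta=2.0)
print(l_1st, l_2nd)
```

The first term pulls the embeddings of connected nodes together, while the second penalizes failing to reconstruct existing connections more heavily than failing to reconstruct absent ones.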

The electronic device provided in the embodiment of the present invention can implement each process implemented by the electronic device in the method embodiment of fig. 4, and is not described here again to avoid repetition.

Fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.

The electronic device 700 includes, but is not limited to: a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, a power supply 711, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 7 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.

The processor 710 is configured to perform the following operations:

acquiring operation behavior data in a preset time period;

Alternatively, processor 710 is configured to perform the following operations:

acquiring sample data to be trained;

It should be understood that, in the embodiment of the present invention, the radio frequency unit 701 may be used for receiving and sending signals during message transmission and reception or during a call; specifically, it receives downlink data from a base station and forwards the data to the processor 710 for processing, and it transmits uplink data to the base station. In general, the radio frequency unit 701 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 701 may also communicate with a network and other devices through a wireless communication system.

The electronic device provides wireless broadband internet access to the user via the network module 702, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.

The audio output unit 703 may convert audio data received by the radio frequency unit 701 or the network module 702 or stored in the memory 709 into an audio signal and output as sound. Also, the audio output unit 703 may also provide audio output related to a specific function performed by the electronic apparatus 700 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 703 includes a speaker, a buzzer, a receiver, and the like.

The input unit 704 is used to receive audio or video signals. The input unit 704 may include a Graphics Processing Unit (GPU) 7041 and a microphone 7042. The graphics processor 7041 processes image data of a still picture or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 706. The image frames processed by the graphics processor 7041 may be stored in the memory 709 (or other storage medium) or transmitted via the radio frequency unit 701 or the network module 702. The microphone 7042 may receive sound and process it into audio data. In phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 701 and then output.

The electronic device 700 also includes at least one sensor 705, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 7061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 7061 and/or a backlight when the electronic device 700 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 705 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.

The display unit 706 is used to display information input by the user or information provided to the user. The Display unit 706 may include a Display panel 7061, and the Display panel 7061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 707 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 7071 (e.g., operations by a user on or near the touch panel 7071 using a finger, a stylus, or any other suitable object or attachment). The touch panel 7071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 710, and receives and executes commands from the processor 710. In addition, the touch panel 7071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 707 may include other input devices 7072 in addition to the touch panel 7071. In particular, the other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described herein again.

Further, the touch panel 7071 may be overlaid on the display panel 7061, and when the touch panel 7071 detects a touch operation on or near the touch panel 7071, the touch operation is transmitted to the processor 710 to determine the type of the touch event, and then the processor 710 provides a corresponding visual output on the display panel 7061 according to the type of the touch event. Although the touch panel 7071 and the display panel 7061 are shown in fig. 7 as two separate components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 7071 and the display panel 7061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.

The interface unit 708 is an interface for connecting an external device to the electronic apparatus 700. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 708 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 700 or may be used to transmit data between the electronic apparatus 700 and the external device.

The memory 709 may be used to store software programs as well as various data. The memory 709 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 709 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 710 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 709 and calling data stored in the memory 709, thereby monitoring the whole electronic device. Processor 710 may include one or more processing units; preferably, the processor 710 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 710.

The electronic device 700 may also include a power supply 711 (e.g., a battery) for providing power to the various components, and preferably, the power supply 711 may be logically coupled to the processor 710 via a power management system, such that functions of managing charging, discharging, and power consumption may be performed via the power management system.

In addition, the electronic device 700 includes some functional modules that are not shown, and are not described in detail herein.

Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 710, a memory 709, and a computer program stored in the memory 709 and capable of running on the processor 710, where the computer program, when executed by the processor 710, implements each process of the above-mentioned data processing method or the violation identification model training method, and can achieve the same technical effect, and in order to avoid repetition, it is not described herein again.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned data processing method or violation identification model training method embodiment, and can achieve the same technical effect, and is not described here again to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A data processing method, comprising:

acquiring operation behavior data in a preset time period;

2. The method of claim 1, wherein the first neural network layer comprises a self-coding subnetwork formed by sequentially connecting 2 x N layers of first fully-connected networks, and an abnormal operation behavior task learning subnetwork formed by sequentially connecting M layers of second fully-connected networks;

and N and M are positive integers, the input of the self-coding sub-network is the first feature vector, the output of the Nth layer of first fully-connected network is the first graph embedding vector, the input of the abnormal operation behavior task learning sub-network is the first graph embedding vector, and the output is a prediction result of the abnormal operation behavior of the node based on the first graph embedding vector.

3. The method of claim 2, wherein the loss function L of the violation identification model is formed based on a reconstruction loss, a task loss L₀ and a regularization loss L_reg, where the reconstruction loss includes a first-order similarity loss L_1st representing first-order similarity between neighboring nodes and a second-order similarity loss L_2nd representing second-order similarity between neighboring nodes, and the loss function L satisfies:

4. A method for training a violation recognition model is characterized by comprising the following steps:

acquiring sample data to be trained;

5. The method of claim 4, wherein the first neural network layer comprises a self-coding subnetwork formed by sequentially connecting 2 x N layers of first fully-connected networks, and an abnormal operation behavior task learning subnetwork formed by sequentially connecting M layers of second fully-connected networks;

and N and M are positive integers, the input of the self-coding sub-network is the third feature vector, the output of the Nth layer of first fully-connected network is the second graph embedding vector, the input of the abnormal operation behavior task learning sub-network is the second graph embedding vector, and the output is a prediction result of the abnormal operation behavior of the node based on the second graph embedding vector.

6. The method of claim 5, wherein the loss function L of the violation identification model to be trained is formed based on a reconstruction loss, a task loss L₀ and a regularization loss L_reg, where the reconstruction loss includes a first-order similarity loss L_1st representing first-order similarity between neighboring nodes and a second-order similarity loss L_2nd representing second-order similarity between neighboring nodes, and the loss function L satisfies:

7. A data processing apparatus, comprising:

8. An apparatus for training a violation recognition model, comprising:

the second acquisition module is used for acquiring sample data to be trained;

9. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the data processing method according to any one of claims 1 to 3, or implementing the steps of the violation recognition model training method according to any one of claims 4 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 3, or carries out the steps of the violation recognition model training method according to any one of claims 4 to 6.