CN114861746A

CN114861746A - Anti-fraud identification method and device based on big data and related equipment

Info

Publication number: CN114861746A
Application number: CN202111537687.9A
Authority: CN
Inventors: 沈越
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-08-05
Anticipated expiration: 2041-12-15
Also published as: CN114861746B

Abstract

The invention relates to the technical field of artificial intelligence and discloses an anti-fraud identification method and device based on big data, computer equipment and a storage medium. The method comprises: obtaining user information of each user in a user group, wherein the user information comprises a user identification and an attribute feature set corresponding to the user identification; inputting the user identification, the attribute feature set and a preset abnormal attribute feature set into a trained isolated forest model for abnormal recognition to obtain high-risk user identifications; importing the user information into a graph database to generate a graph model; obtaining a community sub-network with each high-risk user identification as a center node based on the graph model and a preset data query request; calculating the high-risk probability of each node in the community sub-network; extracting features of the community sub-network based on a neural network to obtain sample data of each node; and inputting the sample data and the high-risk probability into a logistic regression model for training to obtain an anti-fraud recognition group model. The method improves the recognition accuracy of fraud groups.

Description

Anti-fraud identification method and device based on big data and related equipment

Technical Field

The invention relates to the technical field of data processing, in particular to an anti-fraud identification method and device based on big data, computer equipment and a storage medium.

Background

With the rapid development of the internet, new credit transaction modes have emerged, online transactions have become part of people's lives, and anti-fraud has become an important link in enterprise credit risk management and control. At present, constructing an anti-fraud model requires a large amount of historical data. This historical data contains not only a large amount of historical default data and fraud data but also a large amount of useless data, so an anti-fraud model constructed from it identifies fraud groups with low accuracy.

Disclosure of Invention

The embodiment of the invention provides an anti-fraud identification method, an anti-fraud identification device, computer equipment and a storage medium based on big data, so as to improve the identification accuracy of fraudulent groups.

In order to solve the above technical problem, an embodiment of the present application provides an anti-fraud identification method based on big data, including:

acquiring user information of each user in a user group, wherein the user information comprises a user identification and an attribute feature set corresponding to the user identification;

inputting the user identification, the attribute feature set and a preset abnormal attribute feature set into a trained isolated forest model, performing abnormal recognition by adopting the trained isolated forest model to obtain an abnormal user identification, and taking the abnormal user identification as a high-risk user identification;

importing each user information into a graph database, taking all the user identifications as nodes, and performing edge connection on any two nodes with an incidence relation to generate a graph model;

acquiring a community sub-network with each high-risk user identifier as a central node based on the graph model and a preset data query request, and performing sample expansion on the community sub-network to obtain an expanded sample network;

calculating the abnormal user proportion of each node in each extended sample network as high risk probability;

based on a neural network, performing feature extraction on the extended sample network to obtain feature data of each node as sample data;

and inputting the sample data and the high-risk probability into a logistic regression model for training until the logistic regression model converges to obtain an anti-fraud recognition cluster model.

In order to solve the above technical problem, an embodiment of the present application further provides an anti-fraud recognition apparatus based on big data, including:

the first acquisition module is used for acquiring user information of each user in a user group, wherein the user information comprises a user identification and an attribute feature set corresponding to the user identification;

the abnormal user identification module is used for inputting the user identification, the attribute feature set and a preset abnormal attribute feature set into a trained isolated forest model, performing abnormal identification by adopting the trained isolated forest model to obtain an abnormal user identification, and taking the abnormal user identification as a high-risk user identification;

the graph model generating module is used for importing each user information into a graph database, taking all the user identifications as nodes, and performing edge connection on any two nodes with incidence relation to generate a graph model;

the sample expansion module is used for acquiring a community sub-network taking each high-risk user identifier as a central node based on the graph model and a preset data query request, and performing sample expansion on the community sub-network to obtain an expanded sample network;

the calculation module is used for calculating the abnormal user proportion of each node in each extended sample network as the high risk probability;

the characteristic extraction module is used for extracting the characteristics of the extended sample network based on a neural network to obtain the characteristic data of each node as sample data;

and the training module is used for inputting the sample data and the high-risk probability into a logistic regression model for training until the logistic regression model converges to obtain an anti-fraud recognition cluster model.

In order to solve the technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the big data based anti-fraud identification method when executing the computer program.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above big data-based anti-fraud identification method.

The embodiment of the invention provides a big data-based anti-fraud recognition method, a device, computer equipment and a storage medium. User information of each user in a user group is acquired, wherein the user information comprises a user identification and an attribute feature set corresponding to the user identification. The user identification, the attribute feature set and a preset abnormal attribute feature set are input into a trained isolated forest model, abnormal recognition is performed by the trained isolated forest model to obtain an abnormal user identification, and the abnormal user identification is taken as a high-risk user identification. Each user information is imported into a graph database, all the user identifications are taken as nodes, and edge connection is performed on any two nodes with an incidence relation to generate a graph model. A community sub-network taking each high-risk user identification as a central node is acquired based on the graph model and a preset data query request, and sample expansion is performed on the community sub-network to obtain an extended sample network. The abnormal user proportion of each node in each extended sample network is calculated as a high-risk probability. Features of the extended sample network are extracted based on a neural network to obtain feature data of each node as sample data. The sample data and the high-risk probability are input into a logistic regression model for training until the logistic regression model converges, so as to obtain an anti-fraud recognition cluster model.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a big-data based anti-fraud identification method of the present application;

FIG. 3 is a schematic block diagram of one embodiment of a big data based anti-fraud identification apparatus according to the present application;

FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application;

FIG. 5 is a schematic structural diagram of a decision tree in an embodiment of the anti-fraud group model building method of the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the anti-fraud identification method based on big data provided in the embodiments of the present application is executed by a server, and accordingly, an anti-fraud identification apparatus based on big data is disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.

Referring to fig. 2, fig. 2 shows a big data-based anti-fraud recognition method according to an embodiment of the present invention, which is described by taking the application of the method to the server in fig. 1 as an example, and is detailed as follows:

S201: Acquire user information of each user in a user group, wherein the user information comprises a user identification and an attribute feature set corresponding to the user identification.

Specifically, the user identifier may consist of digits, letters or other characters, or a combination thereof. One user identifier represents one user and is unique, which facilitates confirming the user. The attribute feature set includes a plurality of attribute features, such as a mobile phone number, a unique identifier of a mobile phone, friend relationships, a browser cookie of the same client, and the like.

S202: Input the user identification, the attribute feature set and the preset abnormal attribute feature set into the trained isolated forest model, perform abnormal recognition by adopting the trained isolated forest model to obtain an abnormal user identification, and take the abnormal user identification as a high-risk user identification.

Specifically, the preset abnormal attribute feature set is obtained by analyzing attribute feature data corresponding to historical abnormal users. The isolated forest (isolation forest) model is a fast, tree-ensemble-based abnormal detection method whose core idea is that an abnormal point is an outlier that is easy to isolate. The isolated forest model therefore generates a plurality of trees by splitting on random features and random thresholds until the trees reach a certain height or each leaf node contains only one point, thereby separating out the outliers. In the present application, the user identification, the attribute feature set and the preset abnormal attribute feature set are imported into the trained isolated forest model, the abnormal attribute features in the abnormal attribute feature set are used as judgment conditions, and the user identifications are divided according to whether the attribute feature set contains attribute features consistent with the abnormal attribute features, so as to obtain the abnormal user identification.
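The core idea described above (outliers are isolated after fewer random splits) can be illustrated with a minimal sketch. This is the generic isolation forest idea only, not the rule-based variant of the present application; the function names `build_tree`, `build_forest`, `path_length` and `average_path_length`, the tree count and the height limit are all assumptions made for illustration.

```python
import math
import random


def build_tree(points, depth, max_depth, rng):
    """Isolate points with random-feature / random-threshold splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:  # constant feature: nothing to split on
        return {"size": len(points)}
    threshold = rng.uniform(low, high)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def expected_extra_depth(size):
    """Approximate average depth of an unbuilt subtree holding `size` points."""
    if size <= 1:
        return 0.0
    return 2.0 * (math.log(size - 1) + 0.5772156649) - 2.0 * (size - 1) / size


def path_length(tree, point, depth=0):
    """Number of splits needed to reach the leaf that contains `point`."""
    if "feature" not in tree:
        return depth + expected_extra_depth(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["threshold"] else "right"
    return path_length(tree[branch], point, depth + 1)


def build_forest(points, n_trees=100, seed=0):
    """Build `n_trees` random trees with a height limit of ceil(log2(n))."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    return [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]


def average_path_length(forest, point):
    """Short average paths indicate points that are easy to isolate."""
    return sum(path_length(tree, point) for tree in forest) / len(forest)
```

A point far away from a dense cluster is typically cut off by the first split, so its average path length is much shorter than that of the clustered points.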

S203: Import each user information into a graph database, take all user identifications as nodes, and perform edge connection on any two nodes with an association relation to generate a graph model.

Specifically, the graph database includes, but is not limited to, Neo4j, ArangoDB, OrientDB, and the like. A graph database is a type of NoSQL (non-relational) database that applies graph theory to store relationship information between entities. For example, for the relationships between people in a social network, each person is represented as a node in the graph database, and relationships between people are represented by edges between nodes.

S204: Acquire a community sub-network taking each high-risk user identifier as a central node based on the graph model and a preset data query request, and perform sample expansion on the community sub-network to obtain an extended sample network.

Specifically, a preset data query request is generated through a client and is used for querying the community sub-network with each high-risk user identifier as a central node. Based on the preset data query request, an associated-data query for each high-risk user identifier is performed in the graph model to obtain a plurality of community sub-networks. Further, the plurality of community sub-networks are oversampled based on the SMOTE (Synthetic Minority Oversampling Technique) algorithm to generate a specific number of extended sample networks.
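The query for a community sub-network centered on a high-risk user can be sketched as a bounded breadth-first search over an in-memory adjacency map. The text does not specify the query itself, so the hop limit `max_hops`, the function name `community_subnetwork` and the dictionary-based graph are assumptions; in practice the query would be issued to the graph database.

```python
from collections import deque


def community_subnetwork(adjacency, center, max_hops=2):
    """Return the nodes within `max_hops` of `center` and the edges among them.

    `adjacency` maps each user identifier to the set of identifiers it is
    connected to. The hop limit is a hypothetical parameter.
    """
    hops = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    nodes = set(hops)
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes
    }
    return nodes, edges
```

Calling the function once per high-risk user identifier yields one sub-network per center node.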

S205: Calculate the abnormal user proportion of each node in each extended sample network as the high-risk probability. Specifically, the high-risk probability L is calculated according to formula (1):

L = A / B    (1)

In formula (1), A is the number of high-risk user identifiers directly associated with the node, and B is the number of all user identifiers directly associated with the node.
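Formula (1) can be computed directly from the neighbor sets of the graph. The following is a minimal sketch; the function name `high_risk_probability` and the choice to return 0.0 for a node with no neighbors (where B = 0) are assumptions, since the text does not cover that case.

```python
def high_risk_probability(adjacency, high_risk_ids):
    """Compute L = A / B for every node.

    A: number of directly associated high-risk user identifiers.
    B: number of all directly associated user identifiers.
    """
    probabilities = {}
    for node, neighbors in adjacency.items():
        b = len(neighbors)
        a = sum(1 for neighbor in neighbors if neighbor in high_risk_ids)
        # Assumption: an isolated node (B = 0) is given probability 0.0.
        probabilities[node] = a / b if b else 0.0
    return probabilities
```

For example, a node connected to one high-risk user and one ordinary user has A = 1 and B = 2, so L = 0.5.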

S206: Based on the neural network, perform feature extraction on the extended sample network to obtain feature data of each node as sample data.

Specifically, the neural network is a convolutional neural network (CNN). In the present application, the CNN is constructed as follows: the hyper-parameters and loss function of the CNN are set; a community sub-network training set and known node classification labels are used as input values of the CNN, and training begins; the loss is calculated from the output of the CNN and the labels, and error back propagation is then performed, during which the parameters of the CNN are updated by gradient-based parameter optimization; after a termination condition is met, iterative training terminates, and the trained CNN is used as the neural network.

S207: Input the sample data and the high-risk probability into the logistic regression model for training until the logistic regression model converges to obtain the anti-fraud recognition cluster model.

Specifically, the sample data and the high-risk probability are input into a logistic regression model for training. When the difference between the prediction probability of a node output by the logistic regression model and the high-risk probability corresponding to the node is smaller than a preset threshold, the logistic regression model has converged; the model parameters of the logistic regression model are updated accordingly, and the anti-fraud recognition cluster model is obtained. The preset threshold can be obtained by analyzing historical empirical data. The logistic regression model can be expressed as:

y = w_1 × z_1 + w_2 × z_2 + w_3 × z_3 + ... + w_m × z_m + b    (2)

In formula (2), y is the prediction probability of the node; z_1, z_2, z_3, ..., z_m represent the m attribute features corresponding to the node in the sample data; w_1, w_2, w_3, ..., w_m represent the model parameters corresponding to the m attribute features; b is the model offset value; and m is a positive integer.
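The training loop of S207 can be sketched as plain gradient descent that stops once every prediction is within the preset threshold of its high-risk probability. Formula (2) as written is a linear score; passing that score through a sigmoid, as standard logistic regression does, is an assumption here, as are the function names, the learning rate and the epoch limit.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear_score(weights, bias, features):
    """Formula (2): w_1*z_1 + ... + w_m*z_m + b."""
    return sum(w * z for w, z in zip(weights, features)) + bias


def train(samples, targets, lr=0.5, threshold=0.01, max_epochs=20000):
    """Fit weights so each prediction is within `threshold` of its target.

    `samples` holds one feature vector per node and `targets` the matching
    high-risk probabilities. Hyper-parameter values are hypothetical.
    """
    m, n = len(samples[0]), len(samples)
    weights, bias = [0.0] * m, 0.0
    for _ in range(max_epochs):
        preds = [sigmoid(linear_score(weights, bias, z)) for z in samples]
        errors = [p - t for p, t in zip(preds, targets)]
        # Convergence test described in the text: difference below threshold.
        if max(abs(e) for e in errors) < threshold:
            break
        # Gradient of the cross-entropy loss with soft (probability) targets.
        for j in range(m):
            weights[j] -= lr * sum(e * z[j] for e, z in zip(errors, samples)) / n
        bias -= lr * sum(errors) / n
    return weights, bias
```

Because the targets are probabilities rather than 0/1 labels, the cross-entropy gradient is used with soft targets; a squared-error loss would be another reasonable choice.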

In some optional implementation manners of this embodiment, in step S202, the preset abnormal attribute feature set includes Z abnormal attribute features, where Z is a positive integer greater than 0, the user identifier, the attribute feature set, and the preset abnormal attribute feature set are input to the trained isolated forest model, and the trained isolated forest model is used to perform abnormal recognition to obtain an abnormal user identifier, and using the abnormal user identifier as a high-risk user identifier includes the following steps S2020 to S2023:

Step S2020: generate a root node of the isolated forest model, take the abnormal attribute features as judgment conditions, take the root node as the abnormal root node, and take all the user identifications and attribute features as the data samples corresponding to the abnormal root node.

Step S2021: extract an abnormal attribute feature from the preset abnormal attribute feature set based on a preset extraction rule.

Specifically, the preset extraction rule may be set according to the importance degree of each abnormal attribute feature in the preset abnormal attribute feature set, where the importance degree of an abnormal attribute feature may be determined by manual analysis according to the actual application scenario.

Step S2022: generate a first child node and a second child node corresponding to the abnormal root node, and perform binary classification on the data samples according to the abnormal attribute feature to obtain new data samples.

Specifically, based on a similarity algorithm, the similarity between the abnormal attribute feature and each attribute feature in the data samples is calculated to obtain a similarity value. If the similarity value is greater than a preset similarity threshold, the attribute feature is judged to be consistent with the abnormal attribute feature; otherwise, it is judged to be inconsistent. Attribute features consistent with the abnormal attribute feature, together with their corresponding user identifiers, form the first-class data samples; attribute features inconsistent with the abnormal attribute feature, together with their corresponding user identifiers, form the second-class data samples. The first-class data samples are used as the new data samples. The similarity algorithm includes, but is not limited to, Euclidean distance and cosine similarity.

Step S2023: take the first child node as the abnormal root node, take the new data samples as the data samples of the abnormal root node, and return to step S2021 to continue execution until the preset abnormal attribute feature set is empty, so as to obtain the decision tree.

In order to better understand the above step S2020 to step S2023, the above step S2020 to step S2023 are further described by taking an example, which is as follows:

As shown in fig. 5, assume that the user identifiers are user identifier a, user identifier b, user identifier c and user identifier d, and that the attribute feature set corresponding to each user identifier includes three attribute features: an identity card number, a mobile phone number and a browser cookie of the same client. The preset abnormal attribute feature set includes three abnormal attribute features: an abnormal identity card number, an abnormal mobile phone number and an abnormal browser cookie of the same client. The preset extraction rule follows the order: abnormal identity card number, abnormal browser cookie of the client, abnormal mobile phone number. First, the abnormal identity card number is extracted from the preset abnormal attribute feature set; according to whether the attribute feature set corresponding to a user identifier contains an identity card number whose similarity value with the abnormal identity card number exceeds the preset threshold, user identifier a, user identifier b and user identifier c form one group and user identifier d forms another, giving two groups. Second, the abnormal browser cookie of the client is extracted from the preset abnormal attribute feature set; according to whether the attribute feature set corresponding to a user identifier contains a browser cookie of the client whose similarity value with the abnormal browser cookie of the client exceeds the preset threshold, user identifier a, user identifier b and user identifier c are divided into two groups, {user identifier a, user identifier b} and {user identifier c}. Third, the abnormal mobile phone number is extracted from the preset abnormal attribute feature set; according to whether the attribute feature set corresponding to a user identifier contains a mobile phone number whose similarity value with the abnormal mobile phone number exceeds the preset threshold, user identifier b and user identifier c are divided into two groups, {user identifier b} and {user identifier c}, and the decision tree is finally obtained.

An abnormal user identifier is then acquired as a high-risk user identifier according to the decision result of the decision tree.

Specifically, the decision result of the decision tree is a first child node and a second child node of the deepest layer of the decision tree, and a data sample corresponding to the first child node is obtained, where the data sample includes an abnormal user identifier.
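Steps S2020 to S2023 can be sketched as a loop that repeatedly keeps only the samples consistent with the next abnormal attribute feature. The data below is hypothetical and is not the example of fig. 5; the exact-match similarity, the function names and the attribute names are likewise assumptions made for illustration.

```python
def exact_match(value, abnormal_value):
    """Hypothetical similarity: 1.0 for identical values, otherwise 0.0."""
    return 1.0 if value == abnormal_value else 0.0


def split_by_abnormal_features(samples, abnormal_features, similarity, threshold):
    """Walk down the first-child branch, one abnormal feature per level.

    `samples` maps user identifier -> {feature name: value}.
    `abnormal_features` is a list of (feature name, abnormal value) pairs,
    already ordered by the preset extraction rule.
    """
    current = dict(samples)  # data samples of the abnormal root node
    levels = []
    for name, abnormal_value in abnormal_features:
        first, second = {}, {}
        for user_id, features in current.items():
            consistent = similarity(features[name], abnormal_value) > threshold
            (first if consistent else second)[user_id] = features
        levels.append((name, set(first), set(second)))
        current = first  # the first child node becomes the new root
    return set(current), levels
```

The identifiers left in the deepest first child node are the abnormal user identifiers; `levels` records the two groups produced at each split, which is enough to draw the tree.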

In this embodiment, user information of each user in a user group is obtained, the user information including a user identifier and an attribute feature set corresponding to the user identifier. The user identifier, the attribute feature set and a preset abnormal attribute feature set are input into a trained isolated forest model, abnormal recognition is performed by the trained isolated forest model to obtain an abnormal user identifier, and the abnormal user identifier is taken as a high-risk user identifier. Each user information is imported into a graph database, all the user identifiers are taken as nodes, and edge connection is performed on any two nodes having an association relation to generate a graph model. A community sub-network taking each high-risk user identifier as a central node is obtained based on the graph model and a preset data query request, and sample expansion is performed on the community sub-network to obtain an extended sample network. The abnormal user proportion of each node in each extended sample network is calculated as the high-risk probability. Feature extraction is performed on the extended sample network based on a neural network to obtain feature data of each node as sample data, and the sample data and the high-risk probability are input into a logistic regression model for training until the logistic regression model converges, so as to obtain an anti-fraud recognition cluster model. By using a graph model to connect associated user identifiers with edges, performing sample expansion on the community sub-networks centered on high-risk users, and calculating the high-risk probability of each user identifier, the sample data is increased through sample expansion, which helps improve the parameter accuracy of the trained logistic regression model and thereby the recognition accuracy of fraud groups.

In some optional implementation manners of this embodiment, in step S203, importing the user information into the graph database, taking all the user identifiers as nodes, and performing edge connection on any two nodes having an association relationship to generate the graph model includes the following steps S2030 to S2031:

Step S2030: calculate the attribute feature similarity between any two nodes to obtain an attribute feature association value.

Specifically, a similarity algorithm is adopted to calculate the attribute feature similarity between two nodes, wherein the similarity algorithm includes, but is not limited to, Euclidean distance and cosine similarity.

Step S2031: if the attribute feature association value is larger than a preset threshold, determine that an association relationship exists between the two nodes, perform edge connection, and generate the graph model.
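Step S2031 amounts to thresholding a pairwise score. A minimal sketch follows, assuming the association value is available as a function of two node identifiers; the names `build_graph` and `association_value` are hypothetical, and a real deployment would write the edges to the graph database rather than to a dictionary.

```python
from itertools import combinations


def build_graph(node_ids, association_value, threshold):
    """Connect every pair of nodes whose association value exceeds the threshold."""
    adjacency = {node: set() for node in node_ids}
    for u, v in combinations(node_ids, 2):
        if association_value(u, v) > threshold:
            # Edges are undirected, so record both directions.
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency
```

Comparing all pairs costs O(n^2) evaluations, so for a large user group some form of candidate pre-filtering would likely be needed; the text does not address this.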

Specifically, the association relationship includes a first-degree association and a second-degree association, where K is a positive integer greater than or equal to 1, where the first-degree association refers to direct association between any two nodes, and the second-degree association refers to a second-degree indirect association between any two nodes.

In this embodiment, the association relationship between nodes is determined by calculating the attribute feature similarity between any two nodes, and the graph model is generated based on the association relationship, which improves the accuracy of identifying fraud groups.

In some optional implementation manners of this embodiment, in step S2030, calculating the attribute feature similarity between any two nodes to obtain the attribute feature association value includes:

Acquiring the attribute feature sets of any two nodes as a first attribute feature set and a second attribute feature set respectively.

Specifically, it is assumed that the attribute feature set of node A includes an identity card number, a mobile phone number, and a browser cookie of the same client and serves as the first attribute feature set, and that the attribute feature set of node B includes an identity card number, a mobile phone number, and a browser cookie of the same client and serves as the second attribute feature set.

Calculating the attribute feature similarity of the same type of attribute features in the first attribute feature set and the second attribute feature set to obtain a similarity value set.

Specifically, based on a similarity algorithm, the similarity value between each attribute feature in the first attribute feature set and the attribute feature of the same type in the second attribute feature set is calculated to obtain a plurality of similarity values, which form the similarity value set.

Carrying out weighted summation on the similarity value set to obtain the attribute feature association value.

Specifically, the weight of each similarity value is related to the importance degree of the corresponding attribute feature: the more important the attribute feature, the higher the weight of its similarity value. The sum of all the weights is 1, and the value obtained by weighted summation is used as the attribute feature association value.
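The weighted summation can be sketched as follows. The weights, attribute names and exact-match similarity are hypothetical values chosen for illustration; the only constraint taken from the text is that the weights sum to 1.

```python
def exact_match(first_value, second_value):
    """Hypothetical similarity: 1.0 for identical values, otherwise 0.0."""
    return 1.0 if first_value == second_value else 0.0


def association_value(first_set, second_set, weights, similarity=exact_match):
    """Weighted sum of per-attribute similarity values of two nodes.

    `first_set` and `second_set` map attribute names to values and
    `weights` maps the same attribute names to their importance.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        weight * similarity(first_set[name], second_set[name])
        for name, weight in weights.items()
    )
```

With weights of 0.5 for the identity card number, 0.3 for the mobile phone number and 0.2 for the cookie, two nodes that share the identity card number and the cookie but not the phone number get an association value of about 0.7.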

In this embodiment, performing weighted summation on the similarity value set of any two nodes improves the accuracy of the association relationship between the two nodes, which further improves the accuracy of identifying fraud groups.

In some optional implementation manners of this embodiment, in step S204, the sample expansion is performed on the community sub-network, and obtaining an expanded sample network includes steps S2040 to S2042:

Step S2040: based on the Euclidean distance, calculate the distance from each node in each community sub-network to the remaining nodes in the community sub-network to obtain the K-neighbor set of each node.

Specifically, the Euclidean distance is a commonly used definition of distance: it is the true distance between two points in an m-dimensional space, or the natural length of a vector (i.e., the distance from the point to the origin). In two-dimensional and three-dimensional space, the Euclidean distance is the actual distance between two points. In the present application, the distance from each node in a community sub-network to the remaining nodes in the community sub-network is calculated based on the Euclidean distance. If the distance from the node to another node in the community sub-network is smaller than a preset distance threshold, the other node is taken as a K neighbor of the node; the remaining K neighbors of the node are obtained in the same way, finally giving the K-neighbor set of the node.

Step S2041, at least one K neighbor is acquired from the K neighbor set.

Step S2042, the distance between the node and the K neighbor, the node and a preset random number are calculated, and an expansion sample network is obtained.

Specifically, the extended sample network includes the extended nodes and the original nodes of the community sub-network, and each extended sample X is obtained according to formula (3):

$$X = S_i + \mathrm{rand}(0,1) \times (S_j - S_i) \tag{3}$$

In formula (3), X is an extended network sample, $S_i$ is the node, $S_j$ is a K neighbor of the node, and $S_j - S_i$ is the distance between the node and the K neighbor.

It should be noted in particular that, before the extended sample network is calculated from the node, the distance between the node and the K neighbor, and the preset random number, a sampling magnification may also be set; the sampling magnification controls the number of extended nodes that are generated.

In this embodiment, sample expansion of the community sub-network increases the sample size, which helps improve the parameter accuracy of the trained anti-fraud recognition group model and thereby improves the accuracy with which the model identifies fraud groups.
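Steps S2040 to S2042 can be sketched as follows, assuming each node is represented by a numeric feature vector. The function names, the `magnification` parameter (standing in for the sampling magnification), and the sample coordinates are hypothetical.

```python
import math
import random


def k_neighbors(nodes, index, threshold):
    """Indices of nodes whose Euclidean distance to nodes[index] is below
    the preset distance threshold (the K neighbor rule of step S2040)."""
    return [
        j for j, other in enumerate(nodes)
        if j != index and math.dist(nodes[index], other) < threshold
    ]


def expand(nodes, threshold, magnification=1, rng=None):
    """Generate extended nodes with X = S_i + rand(0,1) * (S_j - S_i)."""
    rng = rng or random.Random()
    extended = []
    for i, s_i in enumerate(nodes):
        neighbors = k_neighbors(nodes, i, threshold)
        if not neighbors:
            continue  # a node without K neighbors yields no new samples
        for _ in range(magnification):
            s_j = nodes[rng.choice(neighbors)]  # step S2041: pick a K neighbor
            r = rng.random()                    # uniform random number in [0, 1)
            extended.append(tuple(a + r * (b - a) for a, b in zip(s_i, s_j)))
    # The extended sample network holds the original and the extended nodes.
    return list(nodes) + extended


community = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
network = expand(community, threshold=2.0, magnification=2, rng=random.Random(0))
```

Each extended node lies on the line segment between a node and one of its K neighbors, so the new samples stay inside the region already occupied by the community sub-network.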

In some optional implementations of this embodiment, after step S2042 of calculating the extended sample network from the node, the distance between the node and the K neighbor, and the preset random number, the method further includes:

The extended sample network is encoded based on an encoding model to obtain encoded samples.

Specifically, the encoding model converts human-readable data into machine-stored data (in the form of 0s and 1s) using a preset encoding scheme, that is, it converts a character stream into a byte stream. Common encoding schemes include ASCII, Latin, GBK, and UTF-8; the preset encoding scheme is set according to the actual application.

The encoded samples are expanded based on a generative adversarial network to obtain extended encoded samples.

Specifically, the generative adversarial network is constructed using the Deep & Cross Network (DCN) model and is obtained by pre-training. The generative adversarial network constructs data from a minority-class feature data set and outputs constructed feature data of the same kind as that data set. The amount of constructed feature data generated can be set as required by the actual situation. In this application, the minority-class feature data are the nodes in the community sub-network, and the generated constructed feature data are likewise nodes in the community sub-network.

A generative adversarial network mainly comprises a generator network model and a discriminator network model. Constructing the generative adversarial network with the DCN model therefore means constructing both the generator network model and the discriminator network model with the DCN model. The DCN is a cross network model composed of three layers: the first layer is an embedding and stacking layer, the second layer consists of a cross network and a deep network arranged in parallel, and the third layer is a combination layer that combines the outputs of the cross network and the deep network. The DCN can further abstract information while retaining the original feature information, can efficiently extract interaction information among a limited set of important features, requires no manual feature engineering or exhaustive search, and is easier to train than a general neural network. The DCN is also well suited to structured data, so the anti-fraud recognition group model trained on the extended samples achieves higher recognition accuracy, which improves the accuracy of identifying fraud groups.
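For illustration, the cross network part of a DCN can be sketched in plain Python. The update rule x_{l+1} = x_0 (x_l · w_l) + b_l + x_l is the cross layer of the published DCN architecture; whether the application uses exactly this form is an assumption, and the weights and inputs below are made-up values.

```python
def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    dot = sum(xi * wi for xi, wi in zip(xl, w))  # scalar product of xl and w
    return [x0_i * dot + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


def cross_network(x0, weights, biases):
    """Apply the cross layers in sequence, always crossing with the input x0."""
    xl = list(x0)
    for w, b in zip(weights, biases):
        xl = cross_layer(x0, xl, w, b)
    return xl


# Hypothetical stacked embedding vector and a single cross layer.
x0 = [1.0, 2.0]
out = cross_network(x0, weights=[[0.5, 0.5]], biases=[[0.0, 0.0]])
```

The `+ xl` term carries the previous layer's features forward unchanged, which is how the network retains the original feature information while adding explicit feature interactions.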

The extended encoded samples are decoded based on a decoding model to obtain decoded samples, and the decoded samples are taken as a new extended sample network.

Specifically, the decoding model reverses the mapping of the preset encoding scheme used by the encoding model, converting machine-stored data (in the form of 0s and 1s) back into human-readable data, that is, converting a byte stream into a character stream.
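The encoding and decoding steps amount to a round trip between a character stream and a byte stream. The sketch below assumes UTF-8 as the preset encoding scheme and JSON as the textual form of a node; both choices, and the sample record, are illustrative assumptions. The expansion by the generative adversarial network that sits between the two steps is not shown.

```python
import json


def encode_sample(sample, scheme="utf-8"):
    """Serialize a node record to text, then convert it to a byte stream."""
    return json.dumps(sample, sort_keys=True).encode(scheme)


def decode_sample(data, scheme="utf-8"):
    """Convert a byte stream back to text, then restore the node record."""
    return json.loads(data.decode(scheme))


# Hypothetical node record.
sample = {"node": "user_001", "features": [0.2, 0.7]}
encoded = encode_sample(sample)
decoded = decode_sample(encoded)
```

The same scheme must be used on both sides; decoding with a different scheme than the one used for encoding can raise an error or yield different characters.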

In this embodiment, performing a further sample expansion on the extended sample network increases the sample size again, which helps improve the parameter accuracy of the trained anti-fraud recognition group model and thereby improves the accuracy with which the model identifies fraud groups.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.

Fig. 3 shows a schematic block diagram of a big data-based anti-fraud recognition apparatus corresponding one-to-one to the big data-based anti-fraud recognition method of the above embodiment. As shown in Fig. 3, the big data-based anti-fraud recognition apparatus includes a first obtaining module 30, an abnormal user identification module 31, a graph model generating module 32, a sample expansion module 33, a calculating module 34, a feature extraction module 35, and a training module 36. The functional modules are described in detail as follows:

The first obtaining module 30 is configured to obtain user information of each user in a user group, where the user information includes a user identifier and an attribute feature set corresponding to the user identifier.

The abnormal user identification module 31 is configured to input the user identifier, the attribute feature set, and the preset abnormal attribute feature set into the trained isolated forest model, perform anomaly recognition with the trained isolated forest model to obtain an abnormal user identifier, and use the abnormal user identifier as a high-risk user identifier.

The graph model generating module 32 is configured to import the user information into the graph database, use all the user identifiers as nodes, and connect any two nodes having an association relationship with an edge to generate a graph model.

The sample expansion module 33 is configured to obtain, based on the graph model and a preset data query request, a community sub-network with each high-risk user identifier as its central node, and to perform sample expansion on the community sub-network to obtain an extended sample network.

The calculating module 34 is configured to calculate the abnormal user proportion of each node in each extended sample network as the high-risk probability.

The feature extraction module 35 is configured to perform feature extraction on the extended sample network based on the neural network to obtain feature data of each node as sample data.

The training module 36 is configured to input the sample data and the high-risk probability into the logistic regression model for training until the logistic regression model converges, to obtain the anti-fraud recognition group model.
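The work of the training module can be sketched as a logistic regression fitted by gradient descent, with the high-risk probabilities used as soft targets. This is a minimal sketch under stated assumptions: the text does not specify the optimizer, the loss, or the convergence criterion, so batch gradient descent on cross-entropy loss and a loss-change tolerance are assumptions, and all names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, targets, lr=0.5, tol=1e-6, max_epochs=10000):
    """Fit weights and bias by batch gradient descent on cross-entropy loss.

    targets may be probabilities in [0, 1] (the high-risk probabilities).
    Training stops once the loss changes by less than tol (assumed criterion).
    """
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    previous_loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, targets):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for k, xi in enumerate(x):
                grad_w[k] += (p - y) * xi
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        if abs(previous_loss - loss) < tol:
            break
        previous_loss = loss
    return weights, bias


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical one-dimensional feature data and high-risk probabilities.
samples = [[0.1], [0.2], [0.8], [0.9]]
targets = [0.1, 0.2, 0.8, 0.9]
weights, bias = train_logistic_regression(samples, targets)
```

After training, `predict(weights, bias, x)` returns an estimated high-risk probability for the feature vector `x` of a node.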

Optionally, the abnormal user identification module 31 includes:

The determining module is configured to generate a root node of the isolated forest model, use the abnormal attribute features as judgment conditions, take the root node as an abnormal root node, and take all the user identifiers and attribute features as the data samples corresponding to the abnormal root node.

The attribute feature extraction module is configured to extract an abnormal attribute feature from the preset abnormal attribute feature set based on a preset rule.

The classification module is configured to generate a first child node and a second child node corresponding to the abnormal root node, and to perform binary classification on the data samples according to the abnormal attribute feature to obtain new data samples.

The decision tree construction module is configured to take the first child node as the abnormal root node, take the new data samples as the data samples of the abnormal root node, return to the step of extracting an abnormal attribute feature from the preset abnormal attribute feature set based on a preset rule, and continue execution until the preset abnormal attribute feature set is empty, so as to obtain the decision tree.

The high-risk user identification determining module is configured to obtain the abnormal user identifier as the high-risk user identifier according to the decision result of the decision tree.
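The sub-modules above describe a loop that splits the data samples on one abnormal attribute feature at a time. The sketch below follows that description literally rather than the random-split procedure of the standard isolation forest algorithm. It assumes that the first child node receives the samples that satisfy the abnormal attribute feature, that the "preset rule" is simply list order, and that the users left at the end are the abnormal ones; all of these are assumptions, as are the names and data.

```python
def build_chain(samples, abnormal_features):
    """Split the samples on each abnormal attribute feature in turn.

    samples: dict mapping a user identifier to its attribute dict.
    abnormal_features: list of (name, predicate) pairs, consumed in order.
    Returns the split made at each level and the identifiers left at the end.
    """
    remaining = list(abnormal_features)
    current = dict(samples)       # data samples of the current abnormal root node
    levels = []
    while remaining:              # stop once the feature set is empty
        name, predicate = remaining.pop(0)
        first = {u: a for u, a in current.items() if predicate(a)}
        second = {u: a for u, a in current.items() if not predicate(a)}
        levels.append((name, sorted(first), sorted(second)))
        current = first           # the first child node becomes the new root
    return levels, sorted(current)


# Hypothetical users and abnormal attribute features.
users = {
    "u1": {"logins_per_day": 120, "devices": 9},
    "u2": {"logins_per_day": 3, "devices": 1},
    "u3": {"logins_per_day": 200, "devices": 2},
}
features = [
    ("many_logins", lambda a: a["logins_per_day"] > 100),
    ("many_devices", lambda a: a["devices"] > 5),
]
levels, high_risk = build_chain(users, features)
```

In this toy run only `u1` satisfies both abnormal attribute features, so it is the only identifier returned as high-risk.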

Optionally, the graph model generating module 32 includes:

The similarity calculation module is configured to calculate the attribute feature similarity between any two nodes to obtain an attribute feature association value.

The judging module is configured to determine, if the attribute feature association value is greater than a preset threshold, that an association relationship exists between the two nodes, and to connect them with an edge to generate the graph model.
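The thresholding rule of the judging module can be sketched with an in-memory adjacency mapping. The application imports the data into a graph database; the plain dictionary here is a stand-in, and the function name, node identifiers, and association values are hypothetical.

```python
from itertools import combinations


def build_graph(node_ids, association, threshold):
    """Connect every pair of nodes whose association value exceeds threshold.

    association: callable taking two node identifiers and returning a float.
    Returns an adjacency mapping: node identifier -> set of connected nodes.
    """
    graph = {node: set() for node in node_ids}
    for a, b in combinations(node_ids, 2):
        if association(a, b) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical precomputed association values for three users.
scores = {
    frozenset(("u1", "u2")): 0.9,
    frozenset(("u1", "u3")): 0.2,
    frozenset(("u2", "u3")): 0.6,
}
graph = build_graph(
    ["u1", "u2", "u3"],
    lambda a, b: scores[frozenset((a, b))],
    threshold=0.5,
)
```

Comparing every pair of nodes costs quadratic time in the number of users, so a large user group would likely need some form of candidate pre-filtering before the pairwise comparison.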

Optionally, the similarity calculation module includes:

The second obtaining module is configured to obtain the attribute feature sets of any two nodes as the first attribute feature set and the second attribute feature set, respectively.

The similarity value obtaining module is configured to calculate the attribute feature similarity of attribute features of the same type in the first attribute feature set and the second attribute feature set to obtain a similarity value set.

The summation module is configured to perform weighted summation on the similarity value set to obtain the attribute feature association value.

Optionally, the sample expansion module 33 includes:

The distance calculation module is configured to calculate, based on the Euclidean distance, the distance from each node in each community sub-network to the remaining nodes in the community sub-network to obtain the K neighbor set of each node.

The third obtaining module is configured to acquire at least one K neighbor from the K neighbor set.

The extended sample network acquisition module is configured to obtain the extended sample network by calculation from the node, the distance between the node and the K neighbor, and the preset random number.

Optionally, the extended sample network acquisition module further includes:

The encoding module is configured to encode the extended sample network based on the encoding model to obtain encoded samples.

The expansion module is configured to expand the encoded samples based on the generative adversarial network to obtain extended encoded samples.

The decoding module is configured to decode the extended encoded samples based on the decoding model to obtain decoded samples, and to take the decoded samples as a new extended sample network.

For the specific limitations of the big data-based anti-fraud recognition apparatus, reference may be made to the limitations of the big data-based anti-fraud recognition method above, and details are not repeated here. The modules in the big data-based anti-fraud recognition apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In order to solve the technical problem, the embodiment of the application further provides computer equipment. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having the memory 41, the processor 42, and the network interface 43 is shown, but it should be understood that not all of the shown components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing a communication connection between the computer device 4 and other electronic devices.

The present application further provides another embodiment, which is to provide a computer-readable storage medium storing an interface display program, which is executable by at least one processor to cause the at least one processor to execute the steps of the big data based anti-fraud identification method as described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It should be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the accompanying drawings show preferred embodiments of the application without limiting its scope. The application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims

1. An anti-fraud identification method based on big data is characterized by comprising the following steps:

2. The big-data-based anti-fraud recognition method according to claim 1, wherein the preset abnormal attribute feature set comprises Z abnormal attribute features, and the step of inputting the user identifier, the attribute feature set, and the preset abnormal attribute feature set into a trained isolated forest model, performing anomaly recognition with the trained isolated forest model to obtain an abnormal user identifier, and taking the abnormal user identifier as a high-risk user identifier comprises:

generating a root node of the isolated forest model, taking the abnormal attribute features as judgment conditions, taking the root node as an abnormal root node, and taking all the user identifiers and the attribute features as data samples corresponding to the abnormal root node;

extracting the abnormal attribute features from the preset abnormal attribute feature set based on a preset rule;

generating a first child node and a second child node corresponding to the abnormal root node, and performing binary classification on the data samples according to the abnormal attribute features to obtain new data samples;

taking the first child node as an abnormal root node, taking the new data samples as the data samples of the abnormal root node, returning to the step of extracting the abnormal attribute features from the preset abnormal attribute feature set based on a preset rule, and continuing to execute until the preset abnormal attribute feature set is empty, so as to obtain a decision tree;

3. The big-data-based anti-fraud recognition method according to claim 1, wherein the importing the user information into a graph database, using all the user identifiers as nodes, and performing edge connection on any two nodes having an association relationship to generate a graph model comprises:

calculating the attribute feature similarity between any two nodes to obtain an attribute feature association value;

and if the attribute feature association value is greater than a preset threshold, determining that an association relationship exists between the two nodes, performing edge connection, and generating a graph model.

4. The big-data-based anti-fraud recognition method according to claim 1, wherein the calculating the attribute feature similarity between any two nodes to obtain an attribute feature association value comprises:

acquiring attribute feature sets of any two nodes, and respectively taking the attribute feature sets as a first attribute feature set and a second attribute feature set;

calculating attribute feature similarity of the same type of attribute features in the first attribute feature set and the second attribute feature set to obtain a similarity value set;

and performing weighted summation on the similarity value set to obtain the attribute feature association value.

5. The big-data-based anti-fraud identification method according to claim 1, wherein the sample expanding the community sub-networks to obtain an expanded sample network comprises:

calculating the distance from each node in each community sub-network to the remaining nodes in the community sub-network based on the Euclidean distance to obtain a K neighbor set of each node;

acquiring at least one K neighbor from the K neighbor set;

and calculating the distance between the node and the K neighbor, the node and the preset random number to obtain an extended sample network.

6. The big data-based anti-fraud identification method according to claim 5, wherein the calculating the distance between the node and the K-neighbor, the node and the preset random number further comprises, after obtaining the extended sample network:

encoding the extended sample network based on an encoding model to obtain encoded samples;

expanding the encoded samples based on a generative adversarial network to obtain extended encoded samples;

and decoding the extended encoded samples based on a decoding model to obtain decoded samples, and taking the decoded samples as a new extended sample network.

7. An anti-fraud recognition apparatus based on big data, comprising:

the calculation module is used for calculating the abnormal user proportion of each node in each extended sample network as the high-risk probability;

8. The big-data-based anti-fraud recognition apparatus of claim 7, wherein the abnormal user identification module comprises:

the determining module is used for generating a root node of the isolated forest model, taking the abnormal attribute features as judgment conditions, taking the root node as an abnormal root node, and taking all the user identifiers and the attribute features as data samples corresponding to the abnormal root node;

the attribute feature extraction module is used for extracting the abnormal attribute features from the preset abnormal attribute feature set based on a preset rule;

the classification module is used for generating a first child node and a second child node corresponding to the abnormal root node, and performing binary classification on the data samples according to the abnormal attribute features to obtain new data samples;

a decision tree construction module, configured to use the first child node as an abnormal root node, use the new data sample as a data sample of the abnormal root node, and return to "extract the abnormal attribute features from the preset abnormal attribute feature set based on a preset rule", and continue execution until the preset abnormal attribute feature set is empty, so as to obtain a decision tree;

and the high-risk user identification determining module is used for acquiring an abnormal user identification as the high-risk user identification according to the decision result of the decision tree.

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the big data based anti-fraud identification method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the big-data based anti-fraud identification method according to any one of claims 1 to 6.