CN117131460A - Telecom fraud account identification model training method, device, equipment and medium - Google Patents

Telecom fraud account identification model training method, device, equipment and medium

Info

Publication number
CN117131460A
Authority
CN
China
Prior art keywords
data
model
training
node characteristic
account identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310778878.7A
Other languages
Chinese (zh)
Inventor
朱继良
陈政宇
姬赛霜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Boc Financial Technology Suzhou Co ltd
Original Assignee
Boc Financial Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Boc Financial Technology Suzhou Co ltd
Priority to CN202310778878.7A
Publication of CN117131460A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application provides a telecommunication fraud account identification model training method, device, equipment and medium. When the method is executed, target data are first acquired, the target data being data obtained by preprocessing data related to fraudulent activity. A database comprising node characteristic data and associated data is then constructed from the target data. Finally, a telecom fraud account identification model based on multi-model fusion is trained with an ensemble learning strategy according to the node characteristic data and the associated data. By using both the original user attribute information and the association relationships between users, and by fusing the results of a machine learning model and a graph neural network model, the overall prediction capability of the model is improved and the model can identify risk accounts more effectively. In this way, the accuracy with which the model identifies telecommunication fraud accounts can be improved.

Description

Telecom fraud account identification model training method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for training a telecommunication fraud account identification model.
Background
Network telecom fraud refers to fraud in which fraudsters use the Internet to fabricate false information via telephone, network, short message and the like, set up scams, carry out remote, non-contact fraud against victims, and induce the victims to pay or transfer money, thereby obtaining illegal benefits. To contain telecommunication fraud, risk accounts need to be identified quickly and accurately, which is a relatively common application field of artificial intelligence.
In the prior art, a wide feature table is first constructed based on basic information of a user, basic information of card-holding accounts, behavior features of the user and the like, and a machine learning or deep learning algorithm is then selected to model and predict the risk degree of the account.
In this modeling and prediction process, usually only the value of the user's own attributes is mined, while the value of the association relationships between users is ignored. In addition, a single machine learning or deep learning model is usually adopted, so the accuracy of the final recognition result is low. How to provide a telecom fraud account identification model that identifies telecom fraud accounts more accurately is therefore a problem to be solved.
Disclosure of Invention
In view of the above, the application provides a method, a device, equipment and a medium for training a telecom fraud account identification model, which aim to improve the accuracy of model identification of a telecom fraud account.
In a first aspect, the present application provides a telecommunications fraud account identification model training method, the method comprising:
acquiring target data; the target data are data preprocessed according to related data of fraudulent activity;
constructing a database according to the target data; the database comprises node characteristic data and associated data;
and training a telecom fraud account identification model based on multi-model fusion by utilizing an ensemble learning strategy according to the node characteristic data and the associated data.
Optionally, the training a telecom fraud account identification model based on multi-model fusion by utilizing an ensemble learning strategy according to the node characteristic data and the associated data includes:
training a machine model according to the node characteristic data and obtaining a prediction result of the machine model;
and carrying out fusion training on the graph neural network model by using the prediction result of the machine model, the node characteristic data and the associated data in a Stacking manner to obtain a multi-model fused telecom fraud account identification model.
Optionally, the carrying out fusion training on the graph neural network model by using the prediction result of the machine model, the node characteristic data and the associated data in a Stacking manner to obtain a multi-model fused telecom fraud account identification model includes:
obtaining a machine model node characteristic value of the node characteristic data based on a prediction result of the machine model;
inputting the machine model node characteristic value, the node characteristic data and the associated data into the graph neural network model for training so as to obtain the multi-model fused telecom fraud account identification model.
Optionally, the training the machine model according to the node characteristic data includes:
screening the node characteristic data by using at least one data screening mode to obtain screened first node characteristic data;
dividing a data set corresponding to the first node characteristic data into a training set and a testing set respectively; the training set is training data for model training, and the test set is test data for model testing;
and inputting the training set and the testing set into a machine model for training.
Optionally, when training the machine model according to the node characteristic data, the method includes:
performing parameter adjustment on the machine model by using Bayesian optimization;
the custom objective function for the Bayesian optimization is:
F = offks - abs(devks - offks) * λ
wherein offks is the degree of discrimination between good and bad samples (KS) on the cross-time test set calculated in each iteration round, devks is the degree of discrimination between good and bad samples (KS) on the cross-time training set calculated in each iteration round, abs denotes the absolute value, and λ is a hyperparameter.
Optionally, the machine model is a LightGBM model, and the graph neural network model is a GraphSAGE model.
Optionally, the constructing a database according to the target data includes:
constructing a first node characteristic matrix according to the node characteristic data;
and extracting node characteristics according to the first node characteristic matrix to obtain a second node characteristic matrix.
Optionally, the target data is obtained by:
acquiring original fraud related data;
preprocessing the related data of the original fraudulent activity to obtain target data; the preprocessing at least comprises missing value processing, continuous node characteristic processing, data normalization and data filtering.
In a second aspect, the present application provides a telecommunications fraud account identification model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring target data; the target data are data preprocessed according to related data of fraudulent activity;
the construction module is used for constructing a database according to the target data; the database comprises node characteristic data and associated data;
and the training module is used for training a telecom fraud account identification model based on multi-model fusion by utilizing an ensemble learning strategy according to the node characteristic data and the associated data.
In a third aspect, the present application provides an apparatus comprising a memory for storing instructions or code and a processor for executing the instructions or code to cause the apparatus to perform the telecommunications fraud account identification model training method of any preceding aspect.
In a fourth aspect, the present application provides a computer storage medium having code stored therein, which when executed, causes an apparatus executing the code to implement the telecommunications fraud account identification model training method of any preceding aspect.
The application provides a telecommunication fraud account identification model training method, device, equipment and medium. When the method is executed, target data are first acquired, the target data being data obtained by preprocessing data related to fraudulent activity. A database comprising node characteristic data and associated data is then constructed from the target data. Finally, a telecom fraud account identification model based on multi-model fusion is trained with an ensemble learning strategy according to the node characteristic data and the associated data. By using both the original user attribute information and the association relationships between users, and by fusing the results of a machine learning model and a graph neural network model, the overall prediction capability of the model is improved and the model can identify risk accounts more effectively. In this way, the accuracy with which the model identifies telecommunication fraud accounts can be improved.
Drawings
In order to more clearly illustrate this embodiment or the technical solutions of the prior art, the drawings that are required for the description of the embodiment or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for training a telecommunication fraud account identification model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for implementing step S103 according to one embodiment of the present application;
FIG. 3 is a flowchart of a method for implementing one possible implementation of step S1031 provided in an embodiment of the application;
FIG. 4 is a schematic structural diagram of a training device for a telecommunication fraud account identification model according to an embodiment of the present application.
Detailed Description
As described above, in the prior art, to identify a telecommunication fraud account, a wide feature table is generally constructed based on basic information of a user, basic information of card-holding accounts, behavior features of the user, and the like, and a machine learning algorithm or a deep learning algorithm is then selected to model and predict the risk degree of the account. The modeling data are selected by mining only the value of the user's own attributes while ignoring the value of the association relationships between users. However, the information value of the user's own attributes is limited, since they can provide only a single characteristic attribute during the occurrence of fraudulent activity. In addition, adopting a single machine learning model or deep learning model leads to low prediction accuracy.
In actual information exchange, users with potential risks can be further located by deeply mining other users that have an association relationship with known risky users, so that the accuracy of the model in identifying telecommunication fraud accounts is improved. Furthermore, multi-model fusion, for example fusing a traditional machine learning model with a graph neural network model, can improve the overall prediction capability of the fused model.
In view of the above, the application provides a method, a device, equipment and a medium for training a telecom fraud account identification model. When the method is executed, target data are first acquired, the target data being data obtained by preprocessing data related to fraudulent activity. A database comprising node characteristic data and associated data is then constructed from the target data. Finally, a telecom fraud account identification model based on multi-model fusion is trained with an ensemble learning strategy according to the node characteristic data and the associated data. By using both the original user attribute information and the association relationships between users, and by fusing the results of a machine learning model and a graph neural network model, the overall prediction capability of the model is improved and the model can identify risk accounts more effectively. In this way, the accuracy with which the model identifies telecommunication fraud accounts can be improved.
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
When original fraud-related data are acquired from a user account, the data may include data involving user privacy, such as the user's account information. Therefore, when the embodiment of the present application is applied to a specific product or technology, the permission or consent of the user needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, before raw data are obtained from a user's account, the user's permission or consent may need to be checked in order to obtain the user's authorization to acquire the raw data related to fraudulent activity in that account.
Referring to fig. 1, fig. 1 is a flowchart of a method for training a telecommunication fraud account identification model according to an embodiment of the present application, and in combination with fig. 1, the method for training a telecommunication fraud account identification model according to an embodiment of the present application may include:
S101: acquiring target data; the target data is data preprocessed according to related data of fraudulent activity.
In this embodiment, target data relating to fraudulent activity is first acquired. In the application, data of multiple dimensions can be acquired, wherein the data of multiple dimensions can come from internal data and external data. Specifically, the internal data is data included in the user account, such as basic information of the user, business information of the user, transaction information of the user, and the like. External data is data included in non-user accounts associated with the user, such as a peer blacklist, a user score for fraud risk associated with the user account, and the like. It should be noted that, in the process of obtaining the target data, if the target data relates to the user privacy data, permission and authorization of the user need to be obtained before the target data is obtained, and related laws and regulations and standards of related countries and regions are strictly complied with in the process of collecting, using and processing the user data.
Optionally, the target data is obtained by: acquiring original fraud-related data, and preprocessing the original fraud-related data to obtain the target data, wherein the preprocessing at least comprises missing value processing, continuous node characteristic processing, data normalization and data filtering.
Missing value processing may include filling missing values, deleting missing values, or leaving missing values unprocessed. Specifically, missing value filling refers to filling null values in the data; specific filling methods can include manual filling, special value filling, average value filling, regression methods and the like. Missing value deletion means that a field with too high a missing rate can be deleted directly. Leaving missing values unprocessed refers to performing data mining directly on data containing null values; methods that can do so include Bayesian networks, artificial neural networks, and the like.
Specifically, average value filling divides the attributes in the initial dataset into numerical attributes and non-numerical attributes and processes them separately. If the null value is numerical, the missing attribute value is filled with the average of the values of that attribute in all other objects; if the null value is non-numerical, the missing attribute value is filled, according to the mode principle in statistics, with the value of that attribute that occurs most frequently among all other objects. A similar method is conditional average filling (Conditional Mean Completer), in which the values used for averaging are not taken from all objects of the dataset but from objects having the same decision attribute value as the object in question. Both methods fill the missing attribute value with its most probable value, using most of the information in the existing data to infer the missing value. Regression methods establish a regression equation based on the complete data set, or use a regression algorithm from machine learning. For an object containing null values, the known attribute values are substituted into the equation to estimate the unknown attribute value, and the estimate is used for filling.
It should be noted that the missing value processing methods in the above embodiment of the present application are only optional examples. In practical applications, in order to improve the integrity and reliability of the original data, hot-deck imputation, the K-nearest-neighbor method, multiple imputation, or the C4.5 method may also be selected.
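As a non-limiting illustration, the average value filling described above (mean for numerical attributes, mode for non-numerical attributes) might be sketched as follows. The function name `fill_missing`, the field names and the sample records are hypothetical and are not taken from the application.

```python
from collections import Counter
from statistics import mean

def fill_missing(rows, column):
    """Fill None entries of `column` with the mean (numeric) or the mode (non-numeric)."""
    present = [r[column] for r in rows if r[column] is not None]
    if not present:
        return rows  # nothing to infer a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)                          # numerical attribute: average
    else:
        fill = Counter(present).most_common(1)[0][0]  # non-numerical: most frequent value
    return [dict(r, **{column: fill}) if r[column] is None else r for r in rows]

# Hypothetical records; the field names are illustrative only.
accounts = [
    {"age": 30, "occupation": "clerk"},
    {"age": None, "occupation": "clerk"},
    {"age": 50, "occupation": None},
    {"age": 40, "occupation": "driver"},
]
filled = fill_missing(fill_missing(accounts, "age"), "occupation")
```

Conditional average filling could be obtained from the same sketch by restricting `present` to rows that share the decision attribute value of the row being filled.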
Continuous node characteristic processing may include feature binning, discretization, and the like. Discretization refers to the process of converting a continuous attribute, feature or variable into a discrete or nominal attribute, feature, variable or interval. In the embodiment of the application, continuous variables are discretized; after discretization the model is more stable, which reduces the risk of overfitting. Feature binning, also called variable binning, is a method of processing continuous node features. Feature binning may enhance the interpretability and predictive ability of node features. During binning, missing values can participate as a special bin, which reduces the uncertainty introduced by filling missing values. Binning can also reduce the influence of outliers and thus of noise, which improves the stability and robustness of the model and its fitting capacity.
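A minimal sketch of feature binning in which missing values receive their own bin is given below for illustration. The cut points, the use of -1 as the missing-value bin and the name `bin_feature` are assumptions made for this example only.

```python
from bisect import bisect_right

def bin_feature(value, edges):
    """Return a bin index for `value`; a missing value gets its own bin, -1."""
    if value is None:
        return -1
    # bisect_right counts how many cut points are <= value
    return bisect_right(edges, value)

# Hypothetical cut points for a transaction-amount feature.
amount_edges = [100, 1000, 10000]
amounts = [50, 900, None, 1789, 250000]
binned = [bin_feature(a, amount_edges) for a in amounts]
```

In this sketch the extreme amount 250000 simply falls into the top bin, which is one way binning limits the influence of outliers.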
Data normalization may include mean normalization, min-max normalization, exponential normalization, and the like. Data normalization can improve the accuracy of the model and speed up its convergence.
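For illustration, two of the normalization methods named above might be sketched as follows. The formulas used here (min-max: (x - min) / (max - min); mean normalization: (x - mean) / (max - min)) are the common textbook definitions and are an assumption, since the application does not state them.

```python
def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def mean_normalize(values):
    """Center values on their mean and divide by the range."""
    lo, hi = min(values), max(values)
    avg = sum(values) / len(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - avg) / (hi - lo) for v in values]

balances = [0.0, 50.0, 100.0, 250.0]  # hypothetical balances
scaled = min_max_normalize(balances)
centered = mean_normalize(balances)
```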
Data filtering may include filtering out highly correlated features, filtering out features of low importance, and the like. Data filtering helps ensure that the data are reasonable, thereby improving the accuracy of the model.
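The filtering of highly correlated features might, as one possible sketch, be implemented with a greedy pass over pairwise Pearson correlations. The threshold of 0.95, the feature names and the greedy keep-first rule are assumptions of this example; filtering by feature importance is not shown.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0],  # exact multiple of feat_a, so redundant
    "feat_c": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(features)
```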
S102: constructing a database according to the target data; the database comprises node characteristic data and associated data.
In the present embodiment, after the target data is acquired, a database is constructed based on the target data. In particular, the database may include node characteristic data and association data. The node characteristic data is composed of basic characteristic data of multiple dimensions of a user, and can specifically comprise user characteristics, user product characteristics, transaction characteristics, behavior characteristics, card/account information, some external data and the like. The user characteristics may include: the user's age, education, occupation, marital status, income, account-opening bank, years since account opening, etc. The user product characteristics include: deposit, loan, credit card, financing, fund, insurance, payoff, water and electricity fees, mobile banking, online banking, etc. The transaction characteristics include: transaction amount, post-transaction balance, transaction frequency, trend of the transaction amount over a period of time, whether large transfers are involved, frequent transaction hours, frequent transaction amount intervals, and the like. The behavior characteristics include: login date, login time, login device type, login device ID number, login region, MAC address, login IP address, etc. The card/account information includes: card number, account number, card opening date, home branch, card opening mode (batch/non-batch), account type (class I, II or III account), end-of-day balance, monthly balance, yearly balance, etc. The external data may include peer information and an external anti-fraud score. Specifically, the peer information includes whether the user is on a peer blacklist, whether the user holds a card from another bank, etc.; the external anti-fraud score includes fraud risk values, fraud risk details, etc. The association data is derived from the transaction relationships between users.
The form of the association relationship among the plurality of users is not specifically limited; for ease of understanding, it is described below in the form of Table 1.
TABLE 1
from_id  to_id  time       amount
A        B      2022/3/26  1789
C        D      2022/3/25  50
B        C      2022/4/15  900
D        C      2022/4/17  188
E        F      2022/4/16  45
As shown in Table 1, users with a transaction relationship are connected by constructing directed edge list data. For example, an edge from A to B connects users A and B, an edge from C to D connects users C and D, an edge from B to C connects users B and C, and an edge from D to C connects users D and C. The directed edge list can also include the transaction time and the transaction amount. It is understood that the above association relationship can also be represented as a directed graph.
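By way of illustration only, the directed edge list of Table 1 could be turned into neighbor lookups as sketched below. The variable names `out_neighbors` and `in_neighbors` are hypothetical; the rows are those of Table 1.

```python
from collections import defaultdict

# Rows of Table 1: (from_id, to_id, time, amount)
edges = [
    ("A", "B", "2022/3/26", 1789),
    ("C", "D", "2022/3/25", 50),
    ("B", "C", "2022/4/15", 900),
    ("D", "C", "2022/4/17", 188),
    ("E", "F", "2022/4/16", 45),
]

out_neighbors = defaultdict(list)  # payer  -> payees
in_neighbors = defaultdict(list)   # payee  -> payers
for src, dst, _time, _amount in edges:
    out_neighbors[src].append(dst)
    in_neighbors[dst].append(src)
```

Keeping both directions makes it possible to look up, for any account, whom it paid and who paid it, which is the association information a graph model would consume.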
Optionally, the constructing a database according to the target data may include: and constructing a first node characteristic matrix according to the node characteristic data, and extracting node characteristics according to the first node characteristic matrix to obtain a second node characteristic matrix.
In this embodiment, since node feature data of multiple dimensions are obtained, the first node feature matrix may be constructed over these dimensions. The representation form of the first node feature matrix is not specifically limited in the embodiment of the present application; for ease of understanding, it is described below in the form of Table 2.
TABLE 2
id feat_0 feat_1 feat_2 feat_31 feat_32 label
As shown in table 2, id represents a user, feat_0 represents an age, feat_1 represents a deposit, feat_2 represents a transaction amount, feat_31 represents a card number, feat_32 represents blacklist data, and label represents a user tag. It should be noted that table 2 above is only one possible example of a partial matrix as an exemplary illustration of the present embodiment, and in the actual application process, the order and number of matrices may be adjusted according to the characteristics of the acquired data.
After the first node feature matrix is obtained, node feature extraction can be performed according to the first node feature matrix to obtain a second node feature matrix. The representation of the second node feature matrix is not particularly limited; for ease of understanding, it is described below in the form of Table 3.
TABLE 3
id  feat_0  feat_1  feat_2  feat_31   feat_32  label
0   0       2       1       -1        1        1
1   0       3       0.58    0.008984  0.14989  0
2   0       4       -1      -1        -1       0
3   1       -1      1       -1        -1       0
4   0       1       0.39    0.05943   0        0
As shown in Table 3, different user ids may correspond to different feature data. In particular, the second node characteristic matrix may be stored in the database in the form of a wide node feature table.
S103: training a telecom fraud account identification model based on multi-model fusion by utilizing an ensemble learning strategy according to the node characteristic data and the associated data.
In this embodiment, the multi-model fused telecommunication fraud account identification model is trained with an ensemble learning strategy using the node characteristic data and the associated data. The ensemble learning strategy may include the Boosting method, the Bagging (Bootstrap aggregating) method, and the Stacking method. Specifically, Boosting is based on a serial strategy, in which each new learner is generated from the previous learner. Representative algorithms are AdaBoost, boosting tree (BT), gradient boosting decision tree (GBDT) and XGBoost. The Bagging method samples randomly from the original sample set: in each round, n training samples are drawn with replacement from the original sample set, and k rounds of drawing yield k training sets. One model is trained on each training set, giving k base models; the k base models each predict the test set, and the k prediction results are aggregated. The Stacking method generally comprises two layers of models. The base model, namely the first-layer model, predicts the data to obtain the first-layer prediction results, and these prediction results are then used as features for further training by the second-layer model to obtain the final training result. The first-layer model may include XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), RandomForest, GBDT (Gradient Boosting Decision Tree), ExtraTrees, etc. The second-layer model may include an LR (Linear Regression) model or a graph neural network model; in particular, the graph neural network model may include GCN (Graph Convolution Networks), GraphSAGE (Graph Sample and Aggregate), GAT (Graph Attention Networks), Graph Pooling, and the like.
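The sampling and aggregation steps of the Bagging method described above might be sketched as follows. Training of the k base models is omitted; the function names, the fixed random seed and the use of a majority vote as the aggregation rule are assumptions of this illustration.

```python
import random

def bootstrap_sets(samples, n, k, seed=0):
    """Draw k training sets, each of n items sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in range(n)] for _ in range(k)]

def majority_vote(predictions):
    """Aggregate the 0/1 predictions of the k base models into one label."""
    return int(sum(predictions) * 2 > len(predictions))

samples = list(range(10))  # placeholder sample ids
training_sets = bootstrap_sets(samples, n=len(samples), k=5)

# Hypothetical predictions of 5 base models for one test sample.
label = majority_vote([1, 0, 1, 1, 0])
```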
Optionally, FIG. 2 is a flowchart of one possible implementation of step S103 provided by the embodiment of the present application. With reference to FIG. 2, the training of a telecom fraud account identification model based on multi-model fusion by using an ensemble learning strategy according to the node characteristic data and the associated data may include:
s1031, training a machine model according to the node characteristic data and obtaining a prediction result of the machine model.
Optionally, in this embodiment, when training the machine model according to the node feature data, the method may include: and performing parameter adjustment on the machine model by using Bayesian optimization, wherein the Bayesian optimization custom function formula is as follows:
F=offks-abs(devks-offks)*λ
wherein offks is a parameter measuring the discrimination between good and bad samples (the KS value) on the cross-time test set, calculated in each round of iteration; devks is the corresponding parameter on the cross-time training set, calculated in each round of iteration; abs denotes the absolute value; and λ is a hyperparameter that can be set between 0.1 and 0.4.
This embodiment provides an automatic parameter tuning method based on business data. That is, according to the cross-time validation set KS (offks) and the training set KS (devks) calculated in each round of iteration, the group of parameters that maximizes the objective function is found. These parameters have business meaning: they can change as adjacent business data changes, and automatic parameter tuning can improve the optimization efficiency of the model.
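As a minimal sketch of the objective F = offks - abs(devks - offks) * λ, the following Python computes a KS value from scores and labels and evaluates F for a set of candidate trials. The KS implementation (maximum gap between the cumulative bad rate and the cumulative good rate) is the usual credit-scoring definition and is an assumption here, since the application does not spell it out; the trial names and numbers are made up for illustration, and the Bayesian optimizer itself is not shown.

```python
def ks_statistic(scores, labels):
    """Max gap between cumulative bad (label 1) and good (label 0) rates."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("KS needs both good and bad samples")
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # consume all samples that share the same score before comparing
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

def tuning_objective(devks, offks, lam=0.2):
    """F = offks - abs(devks - offks) * lambda, with lambda in [0.1, 0.4]."""
    return offks - abs(devks - offks) * lam

# Hypothetical (devks, offks) results of three tuning rounds.
trials = {"trial_a": (0.60, 0.40), "trial_b": (0.45, 0.42), "trial_c": (0.50, 0.30)}
best = max(trials, key=lambda name: tuning_objective(*trials[name]))
print(best)
```

The penalty term lowers the score of parameter sets whose training KS is far from the cross-time KS, so a well-fitting but unstable configuration is not preferred over a slightly weaker, more stable one.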
S1032, performing fusion training on the prediction result of the machine model and the graph neural network model by means of Stacking, to obtain the multi-model fused telecom fraud account identification model.
In this embodiment, in training the multi-model fused telecom fraud account identification model, a machine model is first trained on the node characteristic data to obtain its prediction result. The prediction result produced by the trained machine model for each user's data is then used as a new user feature, and the new user feature, the node characteristic data and the associated data are used to train the graph neural network model, yielding the multi-model fused telecom fraud account identification model. It should be noted that, in the present application, one machine model may be selected to generate one new user feature, or multiple machine models may be selected to generate multiple user features, so as to improve the prediction capability of the multi-model fused telecom fraud account identification model.
Optionally, performing fusion training on the prediction result of the machine model and the graph neural network model by means of Stacking to obtain the multi-model fused telecom fraud account identification model includes: obtaining a machine model node characteristic value of the node characteristic data based on the prediction result of the machine model, and inputting the machine model node characteristic value, the node characteristic data and the associated data into the graph neural network model for training, so as to obtain the multi-model fused telecom fraud account identification model.
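The Stacking step can be pictured as appending each node's machine-model prediction to that node's feature vector before the graph neural network is trained. The sketch below assumes a simple dictionary layout keyed by node id; the layout, the function name and the example scores are hypothetical, and the graph neural network training itself is not shown.

```python
def stack_features(node_features, *score_maps):
    """Append each machine model's score for a node to that node's features.

    node_features: {node_id: [feature, ...]}
    score_maps:    one {node_id: score} mapping per first-layer machine model
    """
    stacked = {}
    for node, feats in node_features.items():
        # A KeyError here means a model produced no prediction for the node.
        extra = [scores[node] for scores in score_maps]
        stacked[node] = list(feats) + extra
    return stacked

node_features = {"acct_1": [1.0, 2.0], "acct_2": [3.0, 4.0]}
model_a_scores = {"acct_1": 0.9, "acct_2": 0.1}   # e.g. one machine model
model_b_scores = {"acct_1": 0.8, "acct_2": 0.2}   # an optional second model
print(stack_features(node_features, model_a_scores, model_b_scores))
```

Passing one score map corresponds to selecting a single machine model, and passing several corresponds to generating multiple new user features.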
Optionally, in an embodiment of the present application, the preferred machine model is a LightGBM model, and the graph neural network model is a GraphSAGE model.
In this embodiment, LightGBM (Light Gradient Boosting Machine) is a framework implementing the GBDT algorithm. It supports efficient parallel training and has the advantages of faster training speed, lower memory consumption, better accuracy, and support for distributed data processing. GBDT (Gradient Boosting Decision Tree) is a machine learning model whose main idea is to iteratively train weak classifiers (decision trees) to obtain an optimal model; it has the advantages of good training results and resistance to overfitting. GraphSAGE is an inductive learning framework. In a specific implementation, only the edges between training samples are retained during training, and the procedure then comprises two steps, sampling and aggregation: sampling refers to how a number of neighbors is sampled, and aggregation refers to combining the embeddings of the neighbor nodes, once obtained, to update the node's embedding. In this way, the association relations between the data are included in the training network, which improves the prediction accuracy of the model.
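The sampling and aggregation steps can be sketched as a single simplified layer in plain Python. This is an illustrative assumption rather than the GraphSAGE implementation used in the application: it uses mean aggregation and concatenation only, whereas a real GraphSAGE layer also applies a learned weight matrix and a nonlinearity. All names and the toy graph are hypothetical.

```python
import random

def sample_neighbors(adjacency, node, num_samples, rng):
    """Sampling step: keep at most num_samples neighbors of a node."""
    neighbors = adjacency.get(node, [])
    if len(neighbors) <= num_samples:
        return list(neighbors)
    return rng.sample(neighbors, num_samples)

def sage_mean_layer(features, adjacency, num_samples, seed=0):
    """Aggregation step: concatenate a node's own features with the mean
    of its sampled neighbors' features (zeros if it has no neighbors)."""
    rng = random.Random(seed)
    updated = {}
    for node, own in features.items():
        picked = sample_neighbors(adjacency, node, num_samples, rng)
        dim = len(own)
        if picked:
            mean = [sum(features[p][d] for p in picked) / len(picked)
                    for d in range(dim)]
        else:
            mean = [0.0] * dim
        updated[node] = list(own) + mean
    return updated

features = {"a": [1.0], "b": [2.0], "c": [4.0]}
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}  # edges a-b and a-c
print(sage_mean_layer(features, adjacency, num_samples=2))
```

Because the layer only needs a node's features and its sampled neighborhood, it can be applied to nodes that were not seen during training, which is what makes the approach inductive.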
In this embodiment, target data is acquired, the target data being data obtained by preprocessing data related to fraudulent activity; a database is constructed according to the target data, the database comprising node characteristic data and associated data; and a telecom fraud account identification model based on multi-model fusion is trained by using an ensemble learning strategy according to the node characteristic data and the associated data. The method uses the original information of user attributes and deeply mines the association relations between users, so that the model can identify risk accounts more effectively. By adopting model fusion, the results of the machine learning model and the graph neural network model are fused, which improves the overall prediction capability of the model.
In the embodiment of the present application, there are a plurality of possible implementations of step S1031 described in fig. 2, and the following description will be given. It should be noted that the implementations presented in the following description are only exemplary and not representative of all implementations of the embodiments of the present application.
Referring to fig. 3, fig. 3 is a flowchart of a method for implementing one possible implementation of step S1031 provided by the embodiment of the application, and in conjunction with the description of fig. 3, the method for training a telecommunication fraud account identification model provided by the embodiment of the application may include:
S10311, screening the node characteristic data by using at least one data screening mode to obtain screened first node characteristic data.
In this embodiment, when training the machine model according to the node feature data, the node feature data is first screened, and a specific data screening manner may include data screening by using one or more of an IV value, a PSI value, a Bivar graph, and the like.
The node characteristic data used in IV value screening is data that has undergone WOE (Weight of Evidence) encoding. There are two kinds of data: one is numerical variables produced by applying WOE to character-type variables, where each variable has no more than 10 distinct values; the other is numerical variables, which are either continuous and unbinned or already binned. A continuous variable is first divided into equal-frequency bins, for example 100 bins; the chi-square value of each pair of adjacent bins is then calculated, and the two bins with the smallest chi-square value are merged. The chi-square values are then recalculated and the two bins with the smallest chi-square value are merged again, until the number of bins is less than 10 and the proportion of at least one bin is greater than 0.05. Variables whose IV value is less than the threshold are then deleted. The PSI value is used to screen out variables with poor stability, to guarantee the stability of the model after it goes online. The Bivar graph uses equal-frequency or equal-distance binning to convert the bad-rate curve of the binning result from a non-strictly increasing curve into a strictly increasing curve, ensuring that the node characteristic data is accurate and usable.
In this embodiment, screening by the IV value, the PSI value and the Bivar graph yields the screened first node characteristic data.
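The chi-square merging of adjacent bins can be sketched as follows. Each bin is represented as a pair of [good count, bad count], which is an assumed layout; the stopping rule shown covers only the bin-count condition, and the bin-proportion condition mentioned above is left out of this sketch.

```python
def chi_square(bin_a, bin_b):
    """Chi-square statistic of the 2x2 table formed by two adjacent bins,
    each given as [good_count, bad_count]."""
    total = sum(bin_a) + sum(bin_b)
    if total == 0:
        return 0.0
    chi2 = 0.0
    for row in (bin_a, bin_b):
        for col in range(2):
            expected = sum(row) * (bin_a[col] + bin_b[col]) / total
            if expected > 0:
                chi2 += (row[col] - expected) ** 2 / expected
    return chi2

def chi_merge(bins, max_bins):
    """Repeatedly merge the adjacent pair with the smallest chi-square value
    until at most max_bins bins remain."""
    bins = [list(b) for b in bins]
    while len(bins) > max_bins:
        i = min(range(len(bins) - 1),
                key=lambda j: chi_square(bins[j], bins[j + 1]))
        bins[i] = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        del bins[i + 1]
    return bins

# Four equal-frequency bins; the first two and the last two look alike.
print(chi_merge([[10, 0], [9, 1], [0, 10], [1, 9]], max_bins=2))
```

A small chi-square value means the two bins have similar good/bad distributions, so merging them loses little information; for "fewer than 10 bins" one would call the function with max_bins=9.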
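For reference, the IV and PSI values mentioned above can be computed from binned counts as in the following sketch, which uses the standard definitions WOE = ln(good share / bad share), IV = Σ (good share - bad share) × WOE, and PSI = Σ (actual share - expected share) × ln(actual share / expected share). The zero-count floor, the epsilon and the example counts are assumptions added for illustration; the application does not state how empty bins are handled or which thresholds are used.

```python
import math

def woe_iv(bins, floor=0.5):
    """bins: list of (good_count, bad_count). Returns (WOE per bin, IV).

    Zero counts are replaced by `floor` to avoid log(0); this smoothing
    is an assumption of this sketch.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = max(good, floor) / total_good
        bad_share = max(bad, floor) / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

woes, iv = woe_iv([(80, 20), (20, 80)])
print(round(iv, 4), round(psi([50, 50], [60, 40]), 4))
```

A variable would then be dropped when its IV falls below the chosen threshold or its PSI indicates poor stability; some references define WOE with the opposite sign, which leaves IV unchanged.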
S10312, the data set corresponding to the first node characteristic data is divided into a training set and a testing set respectively.
In this embodiment, the training set is training data for model training, and the test set is test data for model testing.
S10313, inputting the training set and the testing set into a machine model for training and obtaining a prediction result of the machine model.
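Steps S10311 to S10313 do not state how the data set is divided. Since the tuning objective refers to cross-time sets, one plausible reading is a split by time, sketched below. The record layout, the `event_date` field name and the cutoff value are hypothetical and serve only to illustrate the idea.

```python
def split_by_time(records, cutoff, time_key="event_date"):
    """Put records before the cutoff into the training set and the rest
    into the test set. Dates are ISO strings, so string comparison works."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test

records = [
    {"account": "acct_1", "event_date": "2023-01-15", "label": 0},
    {"account": "acct_2", "event_date": "2023-02-20", "label": 1},
    {"account": "acct_3", "event_date": "2023-04-02", "label": 0},
]
train_set, test_set = split_by_time(records, cutoff="2023-03-01")
print(len(train_set), len(test_set))
```

A random split would satisfy the step as written equally well; the time-based variant simply checks the model on data from a later period than it was trained on.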
Based on the above telecom fraud account identification model training method, the embodiment of the application can also provide a telecom fraud account identification method, which is described below with reference to an embodiment.
The telecom fraud account identification method provided by the embodiment of the application can comprise the following steps:
Acquiring data to be identified.
In the embodiment of the present application, the data to be identified may be obtained by the method in step S101 in the above embodiment, and a detailed description is omitted here. It should be noted that, when the present application acquires the related data, the license or consent of the user needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant country and region.
Identifying the target data based on the telecom fraud account identification model to obtain a telecom fraud account identification result.
The telecom fraud account identification model can be obtained based on any implementation mode of the telecom fraud account identification model training method.
The embodiment of the application does not limit the execution main body of the telecom fraud account identification method, for example, the telecom fraud account identification method provided by the embodiment of the application can be applied to data processing equipment such as terminal equipment or a server. The terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), a tablet computer, or the like. The servers may be stand alone servers, clustered servers, or cloud servers.
Based on some specific implementation modes of the telecommunication fraud account identification model training method provided by the embodiment, the application also provides a corresponding device. The apparatus provided by the embodiment of the present application will be described in terms of functional modularization.
Referring to the schematic structure of the telecommunications fraud account identification model training apparatus 400 shown in fig. 4, the apparatus 400 includes an acquisition module 401, a construction module 402, and a training module 403.
An acquisition module 401, configured to acquire target data; the target data are data preprocessed according to related data of fraudulent activity;
a construction module 402, configured to construct a database according to the target data; the database comprises node characteristic data and associated data;
and the training module 403 is configured to train the telecommunication fraud account identification model based on multi-model fusion by using an integrated learning strategy according to the node characteristic data and the associated data.
The construction module 402 specifically includes: a first building unit and a second building unit.
The first construction unit is used for constructing a first node characteristic matrix according to the node characteristic data;
and the second construction unit is used for extracting node characteristics according to the first node characteristic matrix to obtain a second node characteristic matrix.
The training module 403 specifically includes: a first training unit and a second training unit.
The first training unit is used for training the machine model according to the node characteristic data and obtaining a prediction result of the machine model;
and the second training unit is used for performing fusion training on the graph neural network model with the prediction result of the machine model, the node characteristic data and the associated data by means of Stacking, so as to obtain the multi-model fused telecom fraud account identification model.
The first training unit is specifically configured to:
screening the node characteristic data by using at least one data screening mode to obtain screened first node characteristic data;
dividing a data set corresponding to the first node characteristic data into a training set and a testing set respectively; the training set is training data for model training, and the test set is test data for model testing;
and inputting the training set and the testing set into a machine model for training.
The second training unit is specifically configured to:
obtaining a machine model node characteristic value of the node characteristic data based on a prediction result of the machine model;
inputting the machine model node characteristic value, the node characteristic data and the associated data into the graph neural network model for training, so as to obtain the multi-model fused telecom fraud account identification model.
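The modular structure of apparatus 400 can be pictured as three pluggable components run in sequence. The following sketch is only an illustration of that composition: the class name, the callable signatures and the stub implementations are hypothetical and do not reflect any concrete implementation in the application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class TrainingApparatus:
    acquire: Callable[[], Any]                     # acquisition module 401
    construct: Callable[[Any], Tuple[Any, Any]]    # construction module 402
    train: Callable[[Any, Any], Any]               # training module 403

    def run(self):
        target_data = self.acquire()
        node_features, associations = self.construct(target_data)
        return self.train(node_features, associations)

# Stub modules standing in for the real acquisition, construction and training.
apparatus = TrainingApparatus(
    acquire=lambda: [{"id": "acct_1", "amount": 10.0, "peer": "acct_2"}],
    construct=lambda rows: (
        {r["id"]: [r["amount"]] for r in rows},    # node characteristic data
        [(r["id"], r["peer"]) for r in rows],      # associated data (edges)
    ),
    train=lambda feats, edges: {"nodes": len(feats), "edges": len(edges)},
)
print(apparatus.run())
```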
The embodiment of the application also provides corresponding equipment and a computer storage medium, which are used for realizing the scheme provided by the embodiment of the application.
The device comprises a memory and a processor, wherein the memory is used for storing instructions or codes, and the processor is used for executing the instructions or codes to enable the device to execute the telecommunication fraud account identification model training method according to any embodiment of the application.
The computer storage medium stores code, and when the code is executed, equipment for executing the code realizes the training method of the telecom fraud account identification model according to any embodiment of the application.
The terms "first" and "second" (where present) in names in the embodiments of the present application are used only to distinguish names and do not indicate an order.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the method according to the embodiments or some parts of the embodiments of the present application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present application without undue burden.
The foregoing description of the exemplary embodiments of the application is merely illustrative of the application and is not intended to limit the scope of the application.

Claims (11)

1. A method for training a telecommunications fraud account identification model, the method comprising:
acquiring target data; the target data are data preprocessed according to related data of fraudulent activity;
constructing a database according to the target data; the database comprises node characteristic data and associated data;
and training a telecom fraud account identification model based on multi-model fusion by utilizing an integrated learning strategy according to the node characteristic data and the associated data.
2. The method of claim 1, wherein training a telecom fraud account identification model based on multi-model fusion by utilizing an ensemble learning strategy according to the node characteristic data and the associated data comprises:
training a machine model according to the node characteristic data and obtaining a prediction result of the machine model;
and performing fusion training on the graph neural network model with the prediction result of the machine model, the node characteristic data and the associated data by means of Stacking, to obtain a multi-model fused telecom fraud account identification model.
3. The method according to claim 2, wherein performing fusion training on the prediction result of the machine model and the graph neural network model by means of Stacking to obtain a multi-model fused telecom fraud account identification model comprises:
obtaining a machine model node characteristic value of the node characteristic data based on a prediction result of the machine model;
inputting the machine model node characteristic value, the node characteristic data and the associated data into the graph neural network model for training, so as to obtain a multi-model fused telecom fraud account identification model.
4. The method of claim 2, wherein the training a machine model based on the node characteristic data comprises:
screening the node characteristic data by using at least one data screening mode to obtain screened first node characteristic data;
dividing a data set corresponding to the first node characteristic data into a training set and a testing set respectively; the training set is training data for model training, and the test set is test data for model testing;
and inputting the training set and the testing set into a machine model for training and obtaining a prediction result of the machine model.
5. The method of claim 2, wherein training the machine model based on the node characteristic data comprises:
performing parameter adjustment on the machine model by using Bayesian optimization;
the Bayesian optimization custom function formula is as follows:
F=offks-abs(devks-offks)*λ
wherein offks is a parameter of the degree of discrimination between good and bad samples of the cross-time test set calculated in each round of iteration, devks is a parameter of the degree of discrimination between good and bad samples of the cross-time training set calculated in each round of iteration, abs is the absolute value, and λ is a hyperparameter.
6. The method of claim 2, wherein the machine model is a LightGBM model and the graph neural network model is a GraphSAGE model.
7. The method of claim 1, wherein said constructing a database from said target data comprises:
constructing a first node characteristic matrix according to the node characteristic data;
and extracting node characteristics according to the first node characteristic matrix to obtain a second node characteristic matrix.
8. The method of claim 1, wherein the target data is obtained by:
acquiring original fraud related data;
preprocessing the related data of the original fraudulent activity to obtain target data; the preprocessing at least comprises missing value processing, continuous node characteristic processing, data normalization and data filtering.
9. A telecommunications fraud account identification model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring target data; the target data are data preprocessed according to related data of fraudulent activity;
the construction module is used for constructing a database according to the target data; the database comprises node characteristic data and associated data;
and the training module is used for training a telecom fraud account identification model based on multi-model fusion by utilizing an integrated learning strategy according to the node characteristic data and the associated data.
10. An electronic device comprising a memory for storing instructions or code and a processor for executing the instructions or code to cause the device to perform the telecommunications fraud account identification model training method of any of claims 1 to 8.
11. A computer storage medium having code stored therein, which when executed, causes an apparatus executing the code to implement the telecommunications fraud account identification model training method of any of claims 1-8.
CN202310778878.7A 2023-06-28 2023-06-28 Telecom fraud account identification model training method, device, equipment and medium Pending CN117131460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778878.7A CN117131460A (en) 2023-06-28 2023-06-28 Telecom fraud account identification model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310778878.7A CN117131460A (en) 2023-06-28 2023-06-28 Telecom fraud account identification model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117131460A true CN117131460A (en) 2023-11-28

Family

ID=88860673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310778878.7A Pending CN117131460A (en) 2023-06-28 2023-06-28 Telecom fraud account identification model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117131460A (en)

Similar Documents

Publication Publication Date Title
CN109255586B (en) Online personalized recommendation method for e-government affairs handling
CN112766550B (en) Random forest-based power failure sensitive user prediction method, system, storage medium and computer equipment
CN108269012A (en) Construction method, device, storage medium and the terminal of risk score model
CN111368147B (en) Graph feature processing method and device
CN110166344B (en) Identity identification method, device and related equipment
CN110837963A (en) Risk control platform construction method based on data, model and strategy
CN112989059A (en) Method and device for identifying potential customer, equipment and readable computer storage medium
CN111127062B (en) Group fraud identification method and device based on space search algorithm
CN112580902B (en) Object data processing method and device, computer equipment and storage medium
CN111639690A (en) Fraud analysis method, system, medium, and apparatus based on relational graph learning
CN114782161A (en) Method, device, storage medium and electronic device for identifying risky users
CN111062444A (en) Credit risk prediction method, system, terminal and storage medium
CN111986027A (en) Abnormal transaction processing method and device based on artificial intelligence
CN116401379A (en) Financial product data pushing method, device, equipment and storage medium
CN111127185A (en) Credit fraud identification model construction method and device
CN114331473A (en) Method and device for identifying telecommunication fraud event and computer-readable storage medium
CN115115369A (en) Data processing method, device, equipment and storage medium
CN113139876A (en) Risk model training method and device, computer equipment and readable storage medium
CN113112347A (en) Determination method of hasty collection decision, related device and computer storage medium
CN112200665A (en) Method and device for determining credit limit
CN117131460A (en) Telecom fraud account identification model training method, device, equipment and medium
CN107402984B (en) A kind of classification method and device based on theme
CN111951050A (en) Financial product recommendation method and device
CN116074135B (en) Quota configuration method and quota configuration device
CN109308565A (en) The recognition methods of crowd's performance ratings, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination