CN114693317A

CN114693317A - Secure federated telecommunication fraud detection method fusing homogeneous graph and bipartite graph

Info

Publication number: CN114693317A
Application number: CN202210397973.8A
Authority: CN
Inventors: 许国良; 张林泉
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2022-07-01

Abstract

The invention relates to a secure federated telecommunication fraud detection method fusing a homogeneous graph and a bipartite graph, belonging to the field of big data analysis and mining, and comprising the following steps. S1: based on user service data of a telecom operator, extract and preprocess the user's voice call data, short message communication data and mobile phone application access data. S2: construct a homogeneous graph of the telecom users' social network and a user-to-mobile-application bipartite graph data set. S3: construct a homogeneous graph embedding network for the social network homogeneous graph and a bipartite graph embedding network for the bipartite graph of users accessing mobile phone applications; sample the user nodes to obtain neighbor-node co-occurrence sequences, train iteratively to obtain embedded representations of all nodes, and fuse these as the embedded representation of each user. S4: each participant extracts local telecom user features from its own data; the local data of the different organizations are trained jointly using a secure federated gradient boosting tree classification model, which outputs the final fraud number prediction.

Description

Secure federated telecommunication fraud detection method fusing homogeneous graph and bipartite graph

Technical Field

The invention belongs to the field of big data analysis and mining, and relates to a secure federated telecommunication fraud detection method fusing a homogeneous graph and a bipartite graph.

Background

With the development of mobile communication and the spread of network applications, telecommunication and online fraud is intensifying worldwide and is becoming increasingly high-tech. With the rapid development of internet technology, telecom fraud has become a persistent social problem in countries around the world. At present, telecom network fraud is still carried out mainly by telephone contact, but it increasingly exhibits new characteristics such as intelligent tooling, industrialization and homogenization, and its targeting is shifting from indiscriminate mass fraud to precisely targeted fraud. Fraud channels are spreading from telephone calls, short messages and email to social networking sites and mobile phone applications; fraud techniques are continuously updated, technical countermeasures keep escalating, fraud scripts closely follow social hot topics and exploit personal private information, and fraud is shifting from domestic to cross-border operations.

Currently, fraud number detection schemes in the industry fall into two main categories: rule-based expert systems and machine-learning-based model systems. A rule-based expert system requires anti-fraud experts to manually analyze large amounts of normal and abnormal telecom data, accurately identify the behavior patterns of fraudsters, find important features that effectively distinguish fraudulent from non-fraudulent behavior, and write expert rules to detect fraud. Rule-based expert systems therefore depend heavily on the expertise and business knowledge of anti-fraud experts, and huge losses can result if the experts fail to recognize increasingly complex fraud patterns in time.

With the continuous growth of data scale and computing power, machine-learning-based model systems have emerged. Such systems typically perform feature analysis on historical transaction data, train and evaluate a model on the feature data set with a machine learning classification algorithm, and then apply the model to fraud number detection. Both rule-based expert systems and machine-learning-based model systems discover, from historical data, individual behavior patterns that recur when fraud occurs. As telecom fraud becomes more specialized, fraudsters can evade detection by changing their own techniques, but they can hardly change all of their association relationships. When the association network covers a sufficiently large range, fraudsters leave telltale traces however careful they are. Therefore, how to mine effective features from large-scale data to improve fraud detection performance is a direction researchers are currently exploring.

Now that data security receives increasing attention, telecom big data is often very difficult to use directly. Data is hard to integrate between operators and related enterprises, and even between different business departments of the same organization, so joint training on the telecom user feature data extracted by different parties is also a current research focus.

Disclosure of Invention

In view of the above, in order to make full use of the communication service data of each operator and the fraud number label data of the public security department to identify fraud numbers, the invention provides a graph-embedding-based method for extracting and classifying fraud number features from voice and short-message social graphs and a mobile phone application access bipartite graph.

In order to achieve the purpose, the invention provides the following technical scheme:

A secure federated telecommunication fraud detection method fusing a homogeneous graph and a bipartite graph comprises the following steps:

S1: based on user service data of a telecom operator, extract the user's voice call data, short message communication data and mobile phone application access data, and preprocess them;

S2: using the preprocessed data, construct a homogeneous graph of the telecom users' social network and a user-to-mobile-application bipartite graph data set; the graph data set comprises three weighted graphs, namely a voice social network homogeneous graph, a short-message social network homogeneous graph and a mobile phone application access bipartite graph, and the edge weights are obtained by statistical feature extraction and weighted aggregation according to the characteristics of each service;

S3: construct a homogeneous graph embedding network for the social network homogeneous graphs and a bipartite graph embedding network for the bipartite graph of users accessing mobile phone applications; sample the user nodes by graph embedding learning to obtain neighbor-node co-occurrence sequences, then obtain the embedded representation of each node by reconstructing the co-occurrence information through the embedding function with iterative negative-sampling training; fuse the trained embedding features as the embedded representation of the user;

S4: each participant extracts local telecom user features according to the characteristics of its local data, and a secure federated gradient boosting tree classification model is used to jointly train on the local data of the different organizations; encrypted sample alignment and encrypted model parameter exchange between organizations are performed through a trusted third-party server, thereby realizing multi-party joint model training; training uses a two-stage method in which the first stage screens the features and the second stage performs classification on the screened features, outputting the final fraud number prediction.

Further, step S1 specifically includes: constructing a fraud number detection data set from the different service data of users collected from a telecom operator; dividing the data into four types according to service characteristics: basic user information data, voice call data, short message communication data and mobile phone application access data; performing data cleaning on the collected data, including outlier handling, missing value handling and standardization; and labeling the extracted telecom users according to the available telecom fraud report information, with fraudulent users labeled 1 and non-fraudulent users labeled 0.
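The cleaning and labeling operations of step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the record keys `number` and `call_count`, the mean-fill strategy, and the 3-sigma clipping threshold are all hypothetical choices.

```python
from statistics import mean, pstdev

def preprocess(records, fraud_numbers, clip_z=3.0):
    """Clean one numeric feature column and attach fraud labels.

    records: list of dicts with hypothetical keys "number" and "call_count".
    fraud_numbers: set of numbers reported as fraudulent.
    """
    # Missing-value handling: fill gaps with the mean of the observed values.
    observed = [r["call_count"] for r in records if r["call_count"] is not None]
    fill = mean(observed)
    values = [r["call_count"] if r["call_count"] is not None else fill
              for r in records]

    # Outlier handling: clip values beyond clip_z standard deviations.
    mu, sigma = mean(values), pstdev(values)
    if sigma > 0:
        lo, hi = mu - clip_z * sigma, mu + clip_z * sigma
        values = [min(max(v, lo), hi) for v in values]

    # Standardization: zero mean, unit variance (z-score).
    mu, sigma = mean(values), pstdev(values)
    scaled = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

    # Labeling: 1 for reported fraud numbers, 0 otherwise.
    return [
        {"number": r["number"], "call_count": z,
         "label": int(r["number"] in fraud_numbers)}
        for r, z in zip(records, scaled)
    ]

records = [
    {"number": "A", "call_count": 10},
    {"number": "B", "call_count": None},
    {"number": "C", "call_count": 30},
]
cleaned = preprocess(records, fraud_numbers={"C"})
```

In a real deployment each feature column would be cleaned with a strategy suited to its distribution; the single column here only shows the order of the operations.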

Further, in step S2, the process of constructing the voice and short-message social network homogeneous graphs and the bipartite graph of users accessing mobile phone applications includes: for the voice and short message data, extracting the telecom user voice social graph $G_1$ from the caller-callee relationships of voice calls, and constructing the short-message social graph $G_2$ from the sender-receiver (uplink and downlink) relationships of short message communication; for the user internet log data, aggregating and merging the records of users accessing mobile phone applications to obtain the mobile phone application access bipartite graph $G_3$. All three graphs are weighted graphs: the edge weights of the voice social graph are computed as a weighted evaluation of the communication features between caller and callee, the edge weights of the short-message social graph as a weighted evaluation of the communication features between sender and receiver, and the edge weights of the application access bipartite graph as a weighted evaluation of the internet usage features of the applications accessed by the user.

Further, in step S2, constructing the telecom user social network homogeneous graphs and the user-to-mobile-application bipartite graph data set from the preprocessed data specifically includes:

The voice social network graph is $G_1 = (U_1, E_1)$ and the short-message social network graph is $G_2 = (U_2, E_2)$, where $U_i$ is the set of user nodes and $E_i$ is the set of communication relationships between users. Each edge $(i, j) \in E$ connects a pair of user nodes $(u_i, u_j)$ and carries a weight $w_{ij} \ge 0$ that represents the interaction between the two users.

For the voice social network graph $G_1$, the directed edge weight $w_{ij}^{(1)}$ between a user pair $(u_i, u_j)$ is obtained by extracting the call feature set $F_1 = \{f_1^{(1)}, f_2^{(1)}, \dots, f_n^{(1)}\}$ between the two users. The feature set $F_1$ includes, but is not limited to, the number-of-calls feature $f_1^{(1)}$, the total call duration feature $f_2^{(1)}$, the average call duration feature $f_3^{(1)}$, the call time-period feature $f_4^{(1)}$, the toll (long-distance) call feature $f_5^{(1)}$, the caller time-on-network feature $f_6^{(1)}$ and the callee time-on-network feature $f_7^{(1)}$. All elements of the set are then combined by weighted summation to obtain the edge weight:

$$w_{ij}^{(1)} = \sum_{k=1}^{n} \alpha_k f_k^{(1)}$$

where $\alpha_k$ is the weighting coefficient of the $k$-th feature and $n$ is the total number of extracted voice call features;
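The weighted summation of call features into a single edge weight can be illustrated with a short sketch. The feature values and coefficients below are illustrative assumptions; the patent does not specify how the coefficients are chosen or how the features are normalized.

```python
def edge_weight(features, coefficients):
    """Weighted sum of per-edge features: w = sum_k alpha_k * f_k."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return sum(a * f for a, f in zip(coefficients, features))

# Hypothetical, already-normalized call features for one (caller, callee) pair:
# [number of calls, total call duration, average call duration]
f_voice = [0.4, 0.7, 0.5]
alpha = [0.5, 0.3, 0.2]   # illustrative coefficients, not taken from the patent
w_ij = edge_weight(f_voice, alpha)
```

Normalizing the features to a common scale before summation keeps any single feature, such as total duration in seconds, from dominating the weight.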

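For the short-message social network graph $G_2$, the directed edge weight $w_{ij}^{(2)}$ between a user pair $(u_i, u_j)$ is likewise obtained by extracting the short-message communication feature set $F_2 = \{f_1^{(2)}, f_2^{(2)}, \dots, f_m^{(2)}\}$ between the two users. The feature set $F_2$ includes, but is not limited to, the number-of-messages-sent feature $f_1^{(2)}$, the total short-message byte count feature $f_2^{(2)}$, the average short-message byte count feature $f_3^{(2)}$, the sending time-period feature $f_4^{(2)}$, the verification-code indicator feature $f_5^{(2)}$ (whether the message is a verification code), the sender time-on-network feature $f_6^{(2)}$ and the receiver time-on-network feature $f_7^{(2)}$. The edge weight is the weighted sum:

$$w_{ij}^{(2)} = \sum_{k=1}^{m} \beta_k f_k^{(2)}$$

where $\beta_k$ is the weighting coefficient of the $k$-th feature and $m$ is the total number of extracted short message communication features;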

The mobile phone application access bipartite graph is $G_3 = (U_3, V_3, E_3)$, where $U_3$ is the set of user nodes, $V_3$ is the set of mobile phone application nodes, and $E_3$ is the set of edges representing users accessing mobile phone applications. Each edge carries a non-negative weight $w_{ij} \ge 0$ that represents the user's internet usage of the application. For the bipartite graph $G_3$, the directed edge weight $w_{ij}^{(3)}$ of a user-application pair $(u_i, v_j)$ is obtained by extracting the internet usage feature set $F_3 = \{f_1^{(3)}, f_2^{(3)}, \dots, f_k^{(3)}\}$ of the pair. The feature set $F_3$ includes, but is not limited to, the access count feature $f_1^{(3)}$, the total access duration feature $f_2^{(3)}$, the average access duration feature $f_3^{(3)}$, the total traffic consumed feature $f_4^{(3)}$, the average traffic consumed feature $f_5^{(3)}$, the user time-on-network feature $f_6^{(3)}$ and the mobile phone application category feature $f_7^{(3)}$. The edge weight is the weighted sum:

$$w_{ij}^{(3)} = \sum_{l=1}^{k} \gamma_l f_l^{(3)}$$

where $\gamma_l$ is the weighting coefficient of the $l$-th feature and $k$ is the total number of features extracted from the user application access data.
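A minimal sketch of building the weighted user-application bipartite graph from raw access records is given below. The record layout `(user, app, seconds)`, the use of only two features (access count and total duration), and the default coefficients are assumptions made for illustration.

```python
from collections import defaultdict

def build_bipartite_graph(access_log, gamma=(0.5, 0.5)):
    """Aggregate (user, app, seconds) records into weighted user-app edges.

    Returns (users, apps, edges), where edges maps (user, app) -> weight.
    gamma weights two illustrative features: access count and total duration.
    """
    count = defaultdict(int)
    total = defaultdict(float)
    for user, app, seconds in access_log:
        count[(user, app)] += 1
        total[(user, app)] += seconds

    edges = {pair: gamma[0] * count[pair] + gamma[1] * total[pair]
             for pair in count}
    users = {u for u, _ in edges}   # node set U3
    apps = {a for _, a in edges}    # node set V3
    return users, apps, edges

log = [("u1", "chat", 60.0), ("u1", "chat", 30.0), ("u2", "bank", 10.0)]
users, apps, edges = build_bipartite_graph(log)
```

Keeping the two node sets separate makes the bipartite structure explicit: every edge joins one element of `users` to one element of `apps`.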

Further, step S3 specifically includes the following steps:

S31: for the constructed voice social network homogeneous graph, short-message social network homogeneous graph and mobile phone application access bipartite graph, perform graph embedding training on the user nodes using the corresponding graph embedding model;

S32: for the homogeneous graphs, find the neighbor sequence set of each user node according to the first-order and second-order neighbor similarity between nodes; for the bipartite graph, find the neighbor sequence set of each user node according to the explicit and implicit relations;

S33: for the homogeneous graphs, concatenate the node embeddings obtained from first-order similarity training with those obtained from second-order similarity training to obtain the embedding vector of each user node; for the bipartite graph, jointly optimize the explicit and implicit relations to obtain the user node embedding vectors.
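One common way to obtain neighbor co-occurrence sequences on a weighted graph is a weighted random walk, sketched below. The text does not state which sampling procedure is used, so the walk itself, the adjacency layout and the function name `weighted_walk` are assumptions.

```python
import random

def weighted_walk(adj, start, length, rng):
    """Sample a walk whose next step is chosen proportionally to edge weight.

    adj: dict mapping node -> {neighbor: weight}.
    """
    walk = [start]
    while len(walk) < length:
        neighbors = adj.get(walk[-1], {})
        if not neighbors:
            break  # dead end: stop the walk early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

adj = {"a": {"b": 3.0, "c": 1.0}, "b": {"a": 3.0}, "c": {"a": 1.0}}
walk = weighted_walk(adj, "a", 5, random.Random(0))
```

Nodes that appear close together in many such walks are treated as co-occurring, which is the signal the embedding training then tries to reconstruct.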

Further, in step S3, for the homogeneous graphs, each user node is mapped from the graph domain to the embedding domain; that is, given the index $i$ of a user node, the embedding $\mathbf{u}_i$ of node $u_i$ is obtained directly. The mapping function is expressed as:

$$f(u_i) = \mathbf{u}_i = e_i^{\top} W$$

where $e_i \in \{0,1\}^N$ is the one-hot encoding of user node $u_i$ (with $N = |U|$ the number of user nodes), i.e. the $i$-th element $e_i[i]$ is 1 and all other elements are 0; $W \in \mathbb{R}^{N \times d}$ is the embedding parameter matrix to be learned, where $d$ is the embedding dimension; and the $i$-th row of $W$ is the embedded representation of node $u_i$;
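The following sketch shows that multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix. The matrix values are illustrative; in practice the product is never computed explicitly and the row is looked up by index.

```python
def one_hot(i, n):
    """One-hot vector of length n with a 1 at position i."""
    return [1.0 if k == i else 0.0 for k in range(n)]

def embed(i, W):
    """Compute e_i^T W explicitly; the result equals row i of W."""
    e = one_hot(i, len(W))
    d = len(W[0])
    return [sum(e[r] * W[r][c] for r in range(len(W))) for c in range(d)]

# N = 3 nodes, embedding dimension d = 2 (illustrative values).
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
u_1 = embed(1, W)
```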

For the bipartite graph, the original graph $G_3$ contains two types of node sets. Since the fraud number detection task only needs the features of user nodes, a user-node homogeneous graph $G_U$ is split out of the bipartite graph and its features are extracted as the implicit relation. The nodes of the bipartite graph are mapped from the graph domain to the embedding domain, with $\mathbf{u}_i$ and $\mathbf{v}_j$ denoting the embedding vectors of user node $u_i \in U_3$ and mobile phone application node $v_j \in V_3$ respectively;
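Splitting a user-only graph out of the bipartite graph can be sketched as a projection: two users are linked when they access the same application. The link weight used here, the sum over shared applications of the product of the two edge weights, is an assumption; the text does not specify how the weights of the user graph are derived.

```python
from collections import defaultdict
from itertools import combinations

def project_onto_users(edges):
    """Project a weighted user-app bipartite graph onto its user nodes.

    edges: dict mapping (user, app) -> weight.
    Returns a dict mapping (user_a, user_b) -> weight with user_a < user_b.
    """
    by_app = defaultdict(dict)
    for (user, app), w in edges.items():
        by_app[app][user] = w

    projected = defaultdict(float)
    for users in by_app.values():
        for a, b in combinations(sorted(users), 2):
            # Assumed weight: product of the two users' weights on the shared app.
            projected[(a, b)] += users[a] * users[b]
    return dict(projected)

edges = {
    ("u1", "chat"): 2.0, ("u2", "chat"): 3.0,
    ("u2", "bank"): 1.0, ("u3", "bank"): 4.0,
}
g_u = project_onto_users(edges)
```

Users `u1` and `u3` share no application, so no edge is created between them even though both are connected to `u2`.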

The key structural information of the user nodes in the graph domain is then extracted as co-occurrence information: the homogeneous graph network extracts the neighborhood information of the nodes according to their first-order and second-order similarity, and the bipartite graph network models and extracts the key structural information of the user nodes according to the explicit and implicit relations of the graph-domain nodes. The extracted graph-domain co-occurrence information is reconstructed from the embedded representations in the embedding domain, yielding reconstructed information for the homogeneous graphs and for the bipartite graph respectively. The mapping function and all parameters of the reconstructor are learned by optimizing an objective function defined on the co-occurrence information and the reconstructed information;

For the homogeneous graphs, the objective function to be optimized for first-order similarity is:

The objective function to be optimized for second-order similarity is:

For the bipartite graph $G_3$, the optimization objective function for modeling the explicit relation is:

The optimization objective function for modeling the implicit relation is:

By optimizing the objective function $O_5$ defined on the co-occurrence information and the reconstructed information, the mapping function and all parameters of the reconstructor are learned. The final joint optimization objective of the bipartite graph is:

$$\max\; O_5 = -\mu O_3 + \eta O_4$$

where μ and η are the hyper-parameters to be specified for combining the different components in the joint optimization.
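The formulas for the homogeneous-graph objectives are not reproduced in the text. As an assumption only: the first-order and second-order similarity described here closely mirror the LINE graph embedding model, in which the objectives take the form below. Here $\mathbf{u}_i$ is the embedding of node $u_i$, $\mathbf{u}_i^{cen}$ and $\mathbf{u}_j^{con}$ are its center and context embeddings, and $N$ is the number of user nodes. The patent's own formulas may differ.

```latex
p_1(u_i, u_j) = \frac{1}{1 + \exp\left(-\mathbf{u}_i^{\top}\mathbf{u}_j\right)},
\qquad
O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(u_i, u_j)

p_2(u_j \mid u_i) =
\frac{\exp\left((\mathbf{u}_j^{con})^{\top}\mathbf{u}_i^{cen}\right)}
     {\sum_{k=1}^{N} \exp\left((\mathbf{u}_k^{con})^{\top}\mathbf{u}_i^{cen}\right)},
\qquad
O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(u_j \mid u_i)
```

In LINE both objectives are minimized, and the softmax denominator in $p_2$ is approximated by negative sampling, which is consistent with the negative-sampling training mentioned in step S3.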

Further, step S4 specifically includes the following steps:

S41: concatenate the homogeneous graph and bipartite graph embedding vectors to obtain the final node embedding features, combine them with the basic user features and label information, and input them into the secure federated gradient boosting tree classification model for first-stage training;

S42: rank the features by the importance obtained from first-stage training, select the top $n$ features, and distribute the result to the participants so that each can refine its feature set;

S43: after the participants have screened their features, perform the second-stage federated gradient boosting tree classification training and output the fraud number predictions;

S44: process the final user classification results and output a list of suspected fraud numbers.
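The feature screening between the two training stages can be sketched as follows. The importance scores are assumed to come from the first-stage model; the feature names and score values here are hypothetical.

```python
def select_top_n(importances, n):
    """Return the names of the n most important features, highest first.

    importances: dict mapping feature name -> importance score from stage one.
    Ties are broken by feature name so the result is deterministic.
    """
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

def keep_columns(rows, selected):
    """Keep only the selected feature columns of each local sample."""
    return [{name: row[name] for name in selected if name in row}
            for row in rows]

# Hypothetical stage-one importances over embedding and basic user features.
stage_one = {"emb_0": 0.30, "emb_1": 0.05, "call_count": 0.45, "age_on_net": 0.20}
selected = select_top_n(stage_one, 2)

# Each participant filters its own local samples before stage two.
local_rows = [{"emb_0": 0.1, "emb_1": 0.9, "call_count": 12}]
stage_two_input = keep_columns(local_rows, selected)
```

Because each participant holds only some of the features, `keep_columns` silently skips selected features that the local data does not contain.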

Further, the two-stage training of the secure federated gradient boosting tree model comprises encrypted sample alignment and encrypted model training. During training, the central server exchanges the encrypted intermediate results and model parameters to finally obtain the optimal combination of model parameters. Encryption is based on the RSA algorithm and a hash function. Local data is only ever computed on locally; results are encrypted before being sent to the central server, and other participants cannot obtain the details of the local data, so the security of local data is ensured.
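Encrypted sample alignment based on RSA and a hash function is often realized as a blind-signature private set intersection, sketched below. The text does not give the protocol details, so this construction is an assumption. It is also simplified to two parties exchanging values directly rather than through the third-party server, and the toy key is far too small for real use.

```python
import hashlib
import math
import random

# Toy RSA key built from two Mersenne primes; insecure, for illustration only.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, held by party B only

def h(value):
    """Hash an identifier to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest(), "big") % N

def h2(number):
    """Second hash applied to signed values before they are compared."""
    return hashlib.sha256(str(number).encode()).hexdigest()

def align(ids_a, ids_b, rng):
    """Return the ids of party A that party B also holds."""
    # Party A blinds each hashed id with a random factor r^E.
    blinds = []
    for x in ids_a:
        r = rng.randrange(2, N)
        while math.gcd(r, N) != 1:
            r = rng.randrange(2, N)
        blinds.append((x, r, h(x) * pow(r, E, N) % N))

    # Party B signs the blinded values and publishes hashes of its own signed ids.
    signed = [pow(y, D, N) for _, _, y in blinds]
    tokens_b = {h2(pow(h(u), D, N)) for u in ids_b}

    # Party A removes the blinding factor and keeps ids whose token B also holds.
    return {x for (x, r, _), s in zip(blinds, signed)
            if h2(s * pow(r, -1, N) % N) in tokens_b}

common = align(
    ["13800000001", "13800000002", "13800000003"],
    ["13800000002", "13800000003", "13800000004"],
    random.Random(7),
)
```

Party B only ever sees blinded values, and party A only sees hashes of signed values, so neither learns the other's non-overlapping identifiers in this sketch.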

The beneficial effects of the invention are as follows. The method solves the problem of extracting features from the interaction patterns in telecom users' historical call records and internet usage data for the fraudulent-user detection task, and combines them with basic user information features obtained by feature engineering for classification by a machine learning model. It provides a more diverse feature extraction approach for the traditional fraud number detection task. It can be fused with and complement other traditional fraud number detection models, and generalizes well in fraud number detection tasks. The data the method requires can be processed in anonymized, encrypted form while achieving the same feature extraction effect, which has practical value for protecting user privacy. The invention can take data from different telecom operators and other related organizations as model input for joint training, and the secure federated machine learning model ensures that participants' data is not leaked to one another, so multi-party data can be fully used for telecom fraud detection while data security is preserved. As the use of private data becomes more strictly regulated, the scheme addresses the problems of data silos and data fragmentation. The invention uses two-stage training in multi-party joint modeling, which screens the features of the multi-party data and can improve the generalization ability of the model to some extent. This is a model optimization technique and can be applied to different training models.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic representation of the steps of the process of the present invention;

FIG. 2 is a schematic general flow diagram of the process of the present invention;

FIG. 3 is a schematic diagram of the voice and short-message social graph embedding module employed in the present invention;

FIG. 4 is a schematic diagram of a cell phone application access bipartite graph embedded module employed in the present invention;

FIG. 5 is a schematic diagram of a local machine learning classification module employed by the present invention;

FIG. 6 is a schematic diagram of a secure federal multi-party training model used in the present invention;

FIG. 7 is a schematic diagram of secure federated encrypted training in the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

The drawings are for illustration only; they are schematic representations rather than depictions of physical objects and are not to be construed as limiting the invention. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

The invention provides a secure federated telecommunication fraud detection method fusing a homogeneous graph and a bipartite graph, as shown in FIG. 1, which specifically comprises the following steps:

A fraud number detection data set is constructed from the different service data of users collected from a telecom operator. First, the data is divided into four types according to service characteristics: basic user information data, voice call data, short message communication data and mobile phone application access data. Data cleaning operations such as outlier handling, missing value handling and normalization are performed on the collected raw data, and the extracted telecom users are labeled according to the available telecom fraud report information, with fraudulent users labeled 1 and non-fraudulent users labeled 0.

Using the preprocessed data, the telecom user social network homogeneous graphs and the user-to-mobile-application bipartite graph data set are constructed; the data set includes the users' label information for binary fraud number classification training and testing. The specific construction process is as follows:

The voice social network graph is $G_1 = (U_1, E_1)$ and the short-message social network graph is $G_2 = (U_2, E_2)$, where $U_i$ is the set of user nodes and $E_i$ is the set of communication relationships between users. Each edge $(i, j) \in E$ connects a pair of user nodes $(u_i, u_j)$ and carries a weight $w_{ij} \ge 0$ that represents the interaction between the two users. For the voice social network graph $G_1$, the directed edge weight $w_{ij}^{(1)}$ between a user pair $(u_i, u_j)$ is obtained by extracting the call feature set $F_1 = \{f_1^{(1)}, f_2^{(1)}, \dots, f_n^{(1)}\}$ between the two users. The feature set $F_1$ includes, but is not limited to, the number-of-calls feature $f_1^{(1)}$, the total call duration feature $f_2^{(1)}$, the average call duration feature $f_3^{(1)}$, the call time-period feature $f_4^{(1)}$, the toll (long-distance) call feature $f_5^{(1)}$, the caller time-on-network feature $f_6^{(1)}$ and the callee time-on-network feature $f_7^{(1)}$. All elements of the set are then combined by weighted summation to obtain the edge weight:

$$w_{ij}^{(1)} = \sum_{k=1}^{n} \alpha_k f_k^{(1)}$$

where $\alpha_k$ is the weighting coefficient of the $k$-th feature and $n$ is the total number of extracted voice call features.

Likewise, for the short-message social network graph $G_2$, the directed edge weight $w_{ij}^{(2)}$ between a user pair $(u_i, u_j)$ is obtained by extracting the short-message communication feature set $F_2 = \{f_1^{(2)}, f_2^{(2)}, \dots, f_m^{(2)}\}$ between the two users. The feature set $F_2$ includes, but is not limited to, the number-of-messages-sent feature $f_1^{(2)}$, the total short-message byte count feature $f_2^{(2)}$, the average short-message byte count feature $f_3^{(2)}$, the sending time-period feature $f_4^{(2)}$, the verification-code indicator feature $f_5^{(2)}$ (whether the message is a verification code), the sender time-on-network feature $f_6^{(2)}$ and the receiver time-on-network feature $f_7^{(2)}$. The edge weight is the weighted sum:

$$w_{ij}^{(2)} = \sum_{k=1}^{m} \beta_k f_k^{(2)}$$

where $\beta_k$ is the weighting coefficient of the $k$-th feature and $m$ is the total number of extracted short message communication features.

The mobile phone application access bipartite graph is $G_3 = (U_3, V_3, E_3)$, where $U_3$ is the set of user nodes, $V_3$ is the set of mobile phone application nodes, and $E_3$ is the set of edges representing users accessing mobile phone applications. Each edge carries a non-negative weight $w_{ij} \ge 0$ that represents the user's internet usage of the application. For the bipartite graph $G_3$, the directed edge weight $w_{ij}^{(3)}$ of a user-application pair $(u_i, v_j)$ is obtained by extracting the internet usage feature set $F_3$ of the pair. The feature set $F_3$ includes, but is not limited to, the access count feature $f_1^{(3)}$, the total access duration feature $f_2^{(3)}$, the average access duration feature $f_3^{(3)}$, the total traffic consumed feature $f_4^{(3)}$, the average traffic consumed feature $f_5^{(3)}$, the user time-on-network feature $f_6^{(3)}$ and the mobile phone application category feature $f_7^{(3)}$.

A homogeneous graph embedding network is constructed for the voice and short-message social graphs, and a bipartite graph embedding network for the mobile phone application access bipartite graph. User nodes are sampled in an unsupervised manner to obtain neighbor-node co-occurrence sequences, and the embedded representation of each node is then obtained by reconstructing the co-occurrence information through the embedding function with iterative negative-sampling training.

The node embedding features output by the embedding models are concatenated, and the samples that have label data are selected and split proportionally by label into a training set and a test set as input to the classification model. The optimal model for fraud number classification is obtained through iterative training and evaluation of the model on the training and test sets. Finally, the model is used to predict the remaining user data, and the predictions are written to a suspected fraud number database for the operator's reference.
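Splitting the labeled samples proportionally by label can be sketched as a stratified split. The split ratio, the sample layout and the function name `stratified_split` are illustrative assumptions.

```python
import random

def stratified_split(samples, test_ratio, rng):
    """Split labeled samples so both sets keep roughly the original label ratio.

    samples: list of (features, label) pairs with label 0 or 1.
    """
    train, test = [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_ratio)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# 4 fraudulent and 16 normal samples (illustrative data).
samples = [([i], 1 if i < 4 else 0) for i in range(20)]
train, test = stratified_split(samples, 0.25, random.Random(0))
```

Stratifying matters here because fraud numbers are typically a small minority; a plain random split could leave the test set with no fraudulent samples at all.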

The invention also provides a secure federated telecommunication fraud detection device fusing a homogeneous graph and a bipartite graph, as shown in FIG. 2, which specifically comprises:

The raw data acquisition module first connects to the operator's data warehouse, periodically extracts user communication data and user mobile phone application access data via HiveSQL, and merges and aggregates the records by time period to obtain three user communication tables, which are stored in the storage module: voice call data, short message communication data and mobile phone application traffic usage data.

The graph data preprocessing module periodically reads the voice call, short message communication and mobile phone application traffic usage tables from storage, extracts the interaction relationships between users and between users and mobile phone applications in each table by merging and aggregation, and stores the three kinds of interaction graph data as adjacency lists.
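Aggregating call records into an adjacency list can be sketched as follows. The record layout `(caller, callee, seconds)` and the per-edge statistics kept are assumptions chosen to match the call count and duration features described earlier.

```python
from collections import defaultdict

def build_adjacency(call_records):
    """Aggregate (caller, callee, seconds) records into a directed adjacency list.

    Each neighbor entry keeps the call count, total duration and average duration.
    """
    adjacency = defaultdict(dict)
    for caller, callee, seconds in call_records:
        stats = adjacency[caller].setdefault(callee, {"calls": 0, "total_s": 0.0})
        stats["calls"] += 1
        stats["total_s"] += seconds
    for neighbors in adjacency.values():
        for stats in neighbors.values():
            stats["avg_s"] = stats["total_s"] / stats["calls"]
    return {node: dict(neighbors) for node, neighbors in adjacency.items()}

records = [("A", "B", 30.0), ("A", "B", 90.0), ("B", "C", 15.0)]
graph = build_adjacency(records)
```

An adjacency list stores only the edges that exist, which suits call graphs where each user contacts a tiny fraction of all other users.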

The graph embedding feature extraction module divides the three processed graph structures into two types and extracts features from each. The first type is the telecom user social network homogeneous graphs $G_1$ and $G_2$ based on voice and short message data. The second type is the mobile phone application access bipartite graph $G_3$ based on mobile phone application traffic usage.

FIG. 3 is a schematic diagram of the feature embedding network for the homogeneous graphs. For the homogeneous graphs $G_1$ and $G_2$, embedding feature extraction comprises the following specific steps:

Step one: the node embedding mapping module maps each user node from the graph domain to the embedding domain; that is, given the index $i$ of a user node, the embedding $\mathbf{u}_i$ of node $u_i$ is obtained directly. The mapping function can be expressed as:

$$f(u_i) = \mathbf{u}_i = e_i^{\top} W$$

where $e_i \in \{0,1\}^N$ is the one-hot encoding of user node $u_i$ (with $N = |U|$ the number of user nodes), i.e. the $i$-th element $e_i[i]$ is 1 and all other elements are 0. $W \in \mathbb{R}^{N \times d}$ is the embedding parameter matrix to be learned, where $d$ is the embedding dimension. The $i$-th row of $W$ is the embedded representation of node $u_i$.

Step two: the graph-domain co-occurrence information extraction module extracts the key structural information of the user nodes in the graph domain, i.e., it reconstructs the neighborhood information of each node from its first-order and second-order similarities.

The first-order similarity is the local pairwise similarity between user nodes in the network. Formally, if there is a direct edge between nodes u_i and u_j, the weight w_ij of that edge is the first-order similarity of the two nodes; if there is no direct edge, the first-order similarity is 0. For nodes u_i and u_j, the joint probability distribution over the undirected edge is defined as:

The empirical distribution among nodes in the embedding domain is defined as:

where v_i ∈ R^d denotes the d-dimensional vector representation of node u_i in the embedding domain.
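The two distributions are not reproduced in this text. As a minimal sketch, assuming the first-order similarity follows the widely used LINE formulation (an assumption, not confirmed by the source), the graph-domain distribution normalizes edge weights and the embedding-domain distribution is a sigmoid of the inner product of two node vectors. All function names are hypothetical.

```python
import math


def graph_domain_probability(edges):
    """Normalize edge weights so that they sum to 1 over all edges (assumed form)."""
    total = sum(edges.values())
    return {edge: weight / total for edge, weight in edges.items()}


def embedding_domain_probability(v_i, v_j):
    """Sigmoid of the inner product of two d-dimensional node vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    return 1.0 / (1.0 + math.exp(-dot))


# Hypothetical weighted edges of a small social graph.
edges = {("u1", "u2"): 3.0, ("u2", "u3"): 1.0}
print(graph_domain_probability(edges))
print(embedding_domain_probability([0.5, -0.2], [0.1, 0.4]))
```

Training then moves the embedding-domain values toward the graph-domain values, so that strongly connected users end up with similar vectors.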

The second-order similarity is the similarity between the neighborhoods of user nodes in the network. Formally, let p_u = (w_{u,1}, w_{u,2}, …, w_{u,|V|}) denote the first-order similarities between node u and all other nodes; the second-order similarity between nodes u and v is then given by the similarity of p_u and p_v. If nodes u and v share no common neighbor, their second-order similarity is 0. The second-order similarity can express global features of the graph.

For the second-order similarity, each node is characterized by two embedding vectors: one represents the node itself, namely the center-node embedding u_cen; the other represents the node as a context of other nodes, namely the neighborhood-node embedding u_con. In the graph domain, for an arbitrary edge (u_i, u_j) ∈ E, the joint distribution of the two nodes is defined as:

where w_ij is the weight of the edge between nodes u_i and u_j, d_i is the number of neighbor nodes of vertex u_i, and N(u_i) is the set of neighborhood nodes of u_i.

In the embedding domain, the conditional probability between nodes, i.e., the probability that u_j appears given u_i, is defined as:

where the expression uses the center-node embedding of node u_i and the neighborhood-node embeddings, and |V| denotes the number of neighborhood nodes.
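A minimal sketch of such a conditional probability, assuming the softmax form used by LINE for second-order similarity (an assumption; the source formula is not reproduced in this text). The function name and the example vectors are hypothetical.

```python
import math


def conditional_probability(center, contexts, j):
    """Softmax over inner products: probability of context node j given a center node.

    `center` is the center-node embedding of u_i and `contexts` holds the
    neighborhood-node (context) embeddings of all candidate nodes.
    """
    scores = [sum(a * b for a, b in zip(center, ctx)) for ctx in contexts]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[j] / sum(exps)


# Hypothetical context embeddings for three nodes.
contexts = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(conditional_probability([2.0, 0.0], contexts, 0))
```

Because the denominator sums over all |V| candidates, practical implementations of this family of methods usually approximate it, for example with negative sampling.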

Step three: the embedding-domain information reconstruction module reconstructs the extracted graph-domain co-occurrence information from the embedded representations in the embedding domain.

Step four: the objective function optimization module optimizes the objective function defined on the co-occurrence information and the reconstructed information, thereby learning all parameters involved in the mapping function and the reconstructor.

For the first-order similarity, the KL divergence is used to measure the difference between the two probability distributions. After omitting the constant term, the optimization objective for the first-order similarity is:

For the second-order similarity, the KL divergence is likewise used to measure the difference between the distributions; after omitting the constant term, the optimization objective for the second-order similarity is:
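The objective formulas are not reproduced in this text. As a sketch only, assuming the standard LINE objectives that result from dropping the constant terms of the KL divergence, they would take the following form. The symbols O_1, O_2, p_1, p_2 and the indexed embeddings are introduced here for illustration and are not taken from the source.

```latex
O_1 = -\sum_{(u_i,u_j)\in E} w_{ij}\,\log p_1(u_i,u_j),
\qquad
p_1(u_i,u_j) = \frac{1}{1+\exp\!\left(-v_i^{\top} v_j\right)}

O_2 = -\sum_{(u_i,u_j)\in E} w_{ij}\,\log p_2(u_j \mid u_i),
\qquad
p_2(u_j \mid u_i) = \frac{\exp\!\left(u_{con,j}^{\top}\,u_{cen,i}\right)}
                         {\sum_{k=1}^{|V|}\exp\!\left(u_{con,k}^{\top}\,u_{cen,i}\right)}
```

Both are weighted negative log-likelihoods: edges with larger weights w_ij contribute more, so minimizing them pulls the embedding-domain probabilities toward the graph-domain ones.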

FIG. 4 is a schematic diagram of the bipartite graph embedding network architecture. For the bipartite graph G₃, embedded feature extraction comprises the following steps:

Step one: the bipartite graph reconstruction module. The original bipartite graph G₃ contains two types of node sets. Because the fraud number detection task only needs to attend to the features of user nodes, this module only splits out the user-node-based homogeneous graph G_U, which is used as the implicit relationship for feature extraction.
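One common way to split a user-only graph out of a user-to-application bipartite graph is a one-mode projection. The sketch below assumes that two users are linked whenever they accessed the same application and that the edge weight is the number of shared applications; the source does not state how G_U is weighted, so this rule and the function name are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def project_users(user_app_edges):
    """Link two users for every app they both accessed; weight = number of shared apps."""
    users_by_app = defaultdict(set)
    for user, app in user_app_edges:
        users_by_app[app].add(user)
    weights = defaultdict(int)
    for users in users_by_app.values():
        # Sorting gives every undirected user pair a single canonical key.
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical user-to-application access edges.
edges = [
    ("u1", "app_a"), ("u2", "app_a"),
    ("u1", "app_b"), ("u2", "app_b"), ("u3", "app_b"),
]
print(project_users(edges))
```

The resulting graph contains only user nodes, so the homogeneous graph embedding already described can be applied to it unchanged.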

Step two: the node embedding mapping module maps each node of the bipartite graph from the graph domain to the embedding domain, where u_i and v_i denote the embedding vectors of user node u_i ∈ U₃ and mobile phone application node v_i ∈ V₃, respectively.

Step three: the graph-domain co-occurrence information extraction module extracts the key structural information of the user nodes in the graph domain.

Step four: the embedding-domain information reconstruction module reconstructs the extracted graph-domain co-occurrence information from the embedded representations in the embedding domain.

For the bipartite graph G₃, given a node pair (u_i, v_j) ∈ E₃, where u_i ∈ U₃ and v_j ∈ V₃, the joint probability between the two nodes in the graph domain is:

The empirical distribution of the nodes in the embedding domain is:

For the explicit relationship, the difference between the graph-domain and embedding-domain distributions is measured by the KL divergence, so the objective function is:

After ignoring the constant term, the final objective function is:

For the implicit-relationship homogeneous graph G_U of the bipartite graph, training and optimization use homogeneous graph embedding based on the first-order similarity. The joint probability distribution of the user nodes, the empirical distribution of the nodes in the embedding domain, and the objective function to be optimized are as follows:

The objective function O₅, defined on the co-occurrence information and the reconstructed information, is optimized to learn all parameters involved in the mapping function and the reconstructor. The final joint optimization objective is:

maximize O₅＝-μO₃+ηO₄

where O₃ is the explicit-relationship objective function of the bipartite graph nodes, O₄ is the implicit-relationship objective function of the bipartite graph nodes, and μ and η are hyperparameters to be specified that weight the different components in the joint optimization.

Through iterative optimization of the graph embedding module, three embedded vector feature representations of the user, X₁, X₂, and X₃, are obtained.

FIG. 5 shows the local classification model architecture for fraud user detection adopted by the invention, and FIG. 6 shows the joint training model architecture for secure federated learning across the local models of multiple parties. For the local model of each participant, the fraud user detection module first concatenates the user basic information features X₀ compiled by the data processing module with the user embedding features X₁, X₂, and X₃ to obtain a telecommunication user feature table, and combines it with the telecommunication user label data obtained from actual alarm information to form the fraud user detection data set. All participants construct their sample data sets in the same way. With a central server acting as coordinator, the participants then perform encrypted sample entity alignment, and the computation results of each participant's local model are encrypted and exchanged. The optimal model parameters are obtained through continued iterative optimization and used to predict fraud numbers, and users predicted to be fraud numbers are exported to a suspicious user list for further investigation. In this module, the local classification model used by each organization includes, but is not limited to, logistic regression, decision trees, deep learning networks, and ensemble learning.
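The claims state that encryption is based on the RSA algorithm and a hash function but do not give the alignment protocol. The sketch below assumes the commonly used blind-RSA private set intersection: the guest blinds its hashed identifiers, the host signs them with its private key, and the guest unblinds and compares fingerprints. Both roles run in one process here, and the tiny key built from two Mersenne primes is for illustration only; all names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy RSA key from two Mersenne primes; real deployments need keys of 2048 bits or more.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (requires Python 3.8+)


def hash_to_int(identifier):
    """Hash an identifier (e.g. a phone number) to an integer modulo N."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def fingerprint(value):
    """Second hash applied to signed values before they are compared."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def blind(identifier):
    """Guest side: hide H(id) behind a random factor r before sending it."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return r, (hash_to_int(identifier) * pow(r, E, N)) % N


def sign(value):
    """Host side: apply the private exponent without seeing the identifier."""
    return pow(value, D, N)


def align(guest_ids, host_ids):
    """Return the guest identifiers that the host also holds."""
    host_tokens = {fingerprint(sign(hash_to_int(h))) for h in host_ids}
    shared = []
    for identifier in guest_ids:
        r, blinded = blind(identifier)
        unblinded = (sign(blinded) * pow(r, -1, N)) % N  # equals H(id)^D mod N
        if fingerprint(unblinded) in host_tokens:
            shared.append(identifier)
    return shared


print(align(["138", "139", "150"], ["139", "150", "186"]))
```

Under this scheme the host never sees the guest's unblinded hashes, and the guest only learns which of its own identifiers are shared.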

FIG. 7 is a schematic diagram of secure federated multi-party joint encrypted training. Training proceeds in two stages. The first stage performs feature screening: after the first training round over each participant's features, feature importance weights are obtained, the features are sorted by these weights, and the top 50 are selected. The participants that own the selected features then carry out the second-stage joint modeling, and the result of the second-stage training is provided as output to the label owner, the operator. The operator extracts a list of fraud numbers from the prediction results as a reference.
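The screening step between the two training stages can be sketched as follows, assuming the first stage yields one importance weight per feature and that each feature has a known owner. The feature names, party names, and helper functions are hypothetical.

```python
def screen_features(importance, k=50):
    """Return the names of the k features with the highest importance weight."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def features_by_party(selected, ownership):
    """Group the selected features by the participant that owns them."""
    grouped = {}
    for name in selected:
        grouped.setdefault(ownership[name], []).append(name)
    return grouped


# Hypothetical first-stage importance weights and feature ownership.
importance = {"call_count": 0.4, "sms_count": 0.1, "app_visits": 0.3, "age": 0.2}
ownership = {
    "call_count": "operator",
    "sms_count": "operator",
    "app_visits": "partner",
    "age": "operator",
}
selected = screen_features(importance, k=2)
print(features_by_party(selected, ownership))
```

Only the participants that appear in the grouped result need to take part in the second-stage joint modeling.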

In a preferred embodiment, when a new type of telecommunication fraud emerges, the new fraud samples are classified and labeled, sample data of normal users and of the new fraud users are selected and fed into the trained model, and the model adapts to detecting the new fraud type through iterative optimization of its parameters.

By selecting data sets of different types and sizes at different stages, the embodiment of the invention realizes a telecommunication user fraud detection method based on the voice and short message social graphs and the mobile phone application access bipartite graph, and can detect and identify fraudulent users among telecommunication users.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph, characterized in that the method comprises the following steps:

2. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 1, wherein step S1 specifically includes: constructing a fraud number detection data set from the different service data of users collected from a telecom operator; dividing the data into the following four types according to service data characteristics: user basic information data, voice call data, short message communication data, and mobile phone application access data; performing data cleaning on the acquired data, including outlier handling, missing value handling, and standardization; and labeling the extracted telecommunication users according to the available telecommunication fraud report information, with fraud users labeled 1 and non-fraud users labeled 0.

3. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 1, wherein in step S2, constructing the voice and short message social network homogeneous graphs and the user mobile phone application access bipartite graph includes: for voice and short message data, extracting the voice social graph G₁ of telecommunication users from the calling and called relations of voice calls; constructing the short message social graph G₂ from the uplink and downlink sending and receiving relations of short message communication; for user internet log data, aggregating and merging the data by the records of users accessing mobile phone applications to obtain the mobile phone application access bipartite graph G₃; all three types of graph data are weighted graphs, wherein the edge weights of the voice social graph are weighted according to the communication relation features between calling and called parties, the edge weights of the short message social graph are weighted according to the communication relation features between sending and receiving users, and the edge weights of the mobile phone application access bipartite graph are weighted according to the internet access features of users accessing mobile phone applications.

4. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 3, wherein in step S2, constructing the telecommunication user social network homogeneous graph and user mobile phone application bipartite graph data sets from the preprocessed data specifically includes:

for the voice social network graph G₁, obtaining the directed edge weight by extracting the call feature set between (u_i, u_j), including an average call duration feature, a call time period feature, a toll call feature, a caller on-network time feature, and a called-party on-network time feature;

for the short message social network graph G₂, obtaining the directed edge weight by extracting the short message communication feature set between (u_i, u_j), including an average short message byte count feature, a short message sending time period feature, a feature indicating whether the short message is a verification code, a sender on-network time feature, and a receiver on-network time feature; wherein β_i is a weighting coefficient and m is the total number of extracted short message communication features;

for the mobile phone application access bipartite graph G₃, obtaining the edge weight by extracting the internet access feature set F₃ between (u_i, v_j), the feature set F₃ including but not limited to an access count feature f₁⁽³⁾, a total access duration feature, an average access duration feature, a total access traffic feature, an average traffic consumption feature, a user on-network time feature, and a mobile phone application category feature.

5. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 1, wherein step S3 specifically includes the following steps:

6. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 5, wherein in step S3, for the homogeneous graph, each user node is mapped from the graph domain to the embedding domain, that is, given the index i of a user node, the embedding of node u_i is obtained directly, and the mapping function is expressed as:

$$f(u_i) = e_i^{\top} W$$

where e_i ∈ {0,1}^N is the one-hot encoding of user node u_i (N = |U| is the number of user nodes), that is, the i-th element e_i[i] of the vector is 1 and all other elements are 0; W ∈ R^{N×d} is the embedding parameter matrix to be learned, where d is the embedding dimension; the i-th row of W is the embedded representation of node u_i;

for the bipartite graph, the user-node-based homogeneous graph G_U is split out and used as the implicit relationship for feature extraction, and each node of the bipartite graph is mapped from the graph domain to the embedding domain, where u_i and v_i denote the embedding vectors of user node u_i ∈ U₃ and mobile phone application node v_i ∈ V₃, respectively;

extracting the key structural information of the user nodes in the graph domain, and representing the reconstructed information in the embedding domain;

optimizing the objective function defined on the co-occurrence information and the reconstructed information, wherein the objective function to be optimized for the second-order similarity is:

optimizing the objective function O₅ defined on the co-occurrence information and the reconstructed information, and learning all parameters involved in the mapping function and the reconstructor; the final joint optimization objective function of the bipartite graph is:

maximize O₅＝-μO₃+ηO₄

7. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 1, wherein step S4 specifically includes the following steps:

S43: after the participants perform feature screening, performing the second-stage federated gradient boosting tree classification training and outputting the fraud number prediction results;

8. The telecommunication fraud security federation detection method fusing a homogeneous graph and a bipartite graph according to claim 1, wherein the two-stage training process of the secure federated gradient boosting tree model comprises encrypted sample alignment and encrypted model training; during training, the central server performs encrypted exchange of the intermediate computation results and the model parameters to finally obtain the optimal combination of model parameters; the encryption is based on the RSA algorithm and a hash function; during training, local data is computed only locally, the computation results are encrypted before being transmitted to the central server, and other participants cannot obtain the details of the local data.