CN113051557B - Social network cross-platform malicious user detection method based on longitudinal federal learning - Google Patents


Info

Publication number
CN113051557B
CN113051557B
Authority
CN
China
Prior art keywords
data
party
passive
active
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110275639.0A
Other languages
Chinese (zh)
Other versions
CN113051557A (en)
Inventor
张志勇
宋斌
梁腾翔
张丽丽
卫新乐
牛丹梅
李玉祥
张孝国
向菲
张蓝方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Science and Technology filed Critical Henan University of Science and Technology
Priority to CN202110275639.0A priority Critical patent/CN113051557B/en
Publication of CN113051557A publication Critical patent/CN113051557A/en
Application granted granted Critical
Publication of CN113051557B publication Critical patent/CN113051557B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/554Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A social network cross-platform malicious user detection method based on longitudinal federated learning comprises the following steps: step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on longitudinal federated learning; step 2, dividing the participants into an active party and a passive party, and preprocessing the sample data of both parties at a data preprocessing layer to obtain structured data; step 3, mapping the structured data produced by the data preprocessing layer to the sample data shared by the active party and the passive party; step 4, cooperatively training a global model under the definition of machine learning, encrypting and decrypting the data of the active and passive parties with homomorphic encryption to complete the federated learning layer training; step 5, updating the local model training parameters of both parties and outputting the prediction result; and step 6, transmitting the prediction result obtained by the federated learning layer back to each participant at the data application layer, thereby achieving a high-quality malicious user detection effect.

Description

Social network cross-platform malicious user detection method based on longitudinal federal learning
Technical Field
The invention belongs to the technical field of internet, and particularly relates to a social network cross-platform malicious user detection method based on longitudinal federal learning.
Background
Online Social Networks (OSNs) have developed rapidly: according to the 45th Statistical Report on Internet Development in China (March 2020), OSN users reached 904 million and the internet penetration rate reached 64.5%. While OSNs help people build social network application services, they have gradually become the primary target of malicious users attempting to carry out illegal activities and cause malicious harm, and these malicious behaviors inflict adverse effects and great damage on society.
At present, traditional machine learning methods, such as semi-supervised clustering and support vector machine classifiers, rely on big-data means to extract and train the behavioral features of malicious users, and achieve a high-quality detection effect on OSN platforms.
Technical scheme 1: the article "Detecting Malicious Social Bots Based on Clickstream Sequences" by Shi P et al. (IEEE Access, 2019) provides a malicious user detection algorithm based on spatial and temporal features, built on the transition probability features between context-aware click streams.
Technical scheme 2: the paper "Malicious program multi-feature detection on the Android platform" by Wu et al. (Journal of Chinese Computer Systems) provides a hybrid algorithm based on multiple types of features, constructing different classifiers from large-scale feature data to achieve efficient detection results.
However, the successful application of the above schemes depends on centralized social big data. In an actual application scenario, malicious users exhibit dispersion, latency, complexity and other characteristics, and the data of a single party can hardly meet the detection requirement, so the data of two or even multiple parties must be jointly trained to achieve a satisfactory detection effect. Secondly, as laws and regulations become sounder, emphasis on user privacy and data security has become a worldwide trend; for example, the General Data Protection Regulation (GDPR) issued by the European Union explicitly prohibits gathering the user data of each party without user consent. Therefore, how to solve the problem of data fragmentation while complying with laws and regulations is undoubtedly an important research subject in the current social network scenario.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a social network cross-platform malicious user detection method based on longitudinal federated learning, which fuses multi-party data for modeling analysis on the premise of guaranteeing the privacy of common user data, thereby achieving a high-quality malicious user detection effect.
In order to realize the technical purpose, the adopted technical scheme is as follows: a social network cross-platform malicious user detection method based on longitudinal federal learning comprises the following steps:
step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on longitudinal federated learning, wherein the architecture comprises a data preprocessing layer, an encryption sample alignment layer, a federated learning layer and a data application layer;
step 2, first selecting several participants and dividing them into an active party and a passive party: the active party provides both the sample data and the label values of users, while the passive party provides only the sample data; the sample data of both parties is then preprocessed at the data preprocessing layer to obtain structured data;
step 3, at the sample alignment layer, mapping the structured data produced by the data preprocessing layer to the sample data common to the active and passive parties, using a secure set-intersection scheme that combines the RSA asymmetric encryption algorithm with a hash mechanism;
step 4, after the processing of step 3, the active and passive parties have determined the sample data shared by both sides; using their local models, they cooperatively train a global model under the definition of machine learning, and the data of both parties is encrypted and decrypted with homomorphic encryption to complete the training of the federated learning layer;
step 5, after the federated learning layer is trained, the active and passive parties update their local model training parameters and output the prediction result;
and step 6, encapsulating a data calling interface at the data application layer, transmitting the prediction result obtained by the federated learning layer back to each participant, and each participant updating and classifying local data to obtain the malicious user detection result.
The specific implementation process of encrypting and decrypting the data of the active party and the passive party by using homomorphic encryption comprises the following steps:
step 2.1, the active party calculates the first-order and second-order gradient values of the sensitive data, encrypts them with additive homomorphic encryption, and then sends the encrypted gradient values to the passive party;
step 2.2, the passive party buckets all of its own features, maps each feature value into a bucket, aggregates the corresponding encrypted gradient values according to the bucketed feature values, and then sends the aggregated encrypted gradient information to the active party;
step 2.3, the active party decrypts the received aggregated encrypted gradient information to obtain the optimal division Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
step 2.4, the passive party uses the received feature ID and threshold ID to divide the total sample space I of the current node, where I_R + I_L = I and I_L, I_R are the left and right sample spaces respectively; it records the record ID, feature ID and threshold ID of the current node, and sends the record ID together with the divided left sample space I_L to the active party;
step 2.5, the active party divides the current node according to the record ID and the left sample space I_L, and proceeds to the division of the next node;
step 2.6, iterating the processes of (2.2) to (2.5); after all current decision trees have been constructed, the optimal weight of each leaf node j in the decision trees, w_j^* = -(\sum_{i \in I_j} g_i) / (\sum_{i \in I_j} h_i + \lambda), is calculated, finishing the training;
step 2.7, the active side sends the record ID of the current node and the threshold value of the characteristic to the passive side;
step 2.8, the passive side compares the threshold value result of the current node to obtain a search decision and sends the search decision to the active side;
step 2.9, the active party receives the search decision and descends to the corresponding child node, repeating until a leaf node is reached, yielding a classification label and its optimal weight;
and step 2.10, iterating the processes of (2.7) to (2.9), then performing a weighted summation of the optimal weights corresponding to the classification labels obtained by traversing all decision trees, finally obtaining the label sets of normal users and malicious users.
The preprocessing operation converts each participant's data into structured data through data cleaning, random sampling, data binning, data normalization and similar operations.
The invention has the beneficial effects that:
(1) The invention provides a social network cross-platform malicious user detection method based on longitudinal federated learning, realized on the premise of guaranteeing user privacy and data security.
(2) The invention provides an effective problem handling mechanism for unstructured data at a data preprocessing layer, and is used for solving the problem of multi-source isomerism of the data.
(3) According to the invention, a data application layer is constructed, and the social network cross-platform malicious user can be detected in real time by packaging a data call interface.
(4) The invention uses homomorphic encryption to encrypt and decrypt the data of the active and passive parties in the implementation of the algorithm. The algorithm is an end-to-end detection algorithm that attains the same accuracy as traditional machine learning methods while preserving privacy; a regularization penalty term is added to improve the generalization ability and detection effect of the model, and sensitive data is encrypted to effectively guarantee the security and accuracy of the model.
Drawings
FIG. 1 is a hierarchical architecture for cross-platform malicious user detection in a social network according to the present invention;
FIG. 2 is a flow of a data pre-processing layer of the present invention;
FIG. 3 is a sample alignment layer flow of the present invention;
FIG. 4 is a federated learning layer flow of the present invention;
FIG. 5 is a flow chart of a malicious user detection algorithm for multi-party privacy protection according to the present invention;
FIG. 6 is a social network cross-platform malicious user detection framework of the present invention;
FIG. 7 is a malicious user detection page of the multimedia social network CyVOD desktop version of the present invention.
Detailed Description
With the rapid development of online social networks, social networks help people build social network application services while gradually becoming the primary target of malicious users attempting to carry out illegal activities and cause malicious harm. Malicious users can lurk across multiple social network platforms, trying to steal users' privacy and infiltrate political topics by publishing false information; these behaviors inflict adverse effects and great damage on society. At present, existing machine learning detection methods achieve high-quality detection based on large-scale data; however, as laws and regulations become sounder, it is no longer permissible to concentrate the user data of all parties in one place. Therefore, by means of federated learning technology, the invention fuses multi-party data for modeling analysis on the premise of guaranteeing data security and user privacy protection, thereby achieving accurate detection of malicious users across social network platforms.
A social network cross-platform malicious user detection method based on longitudinal federal learning comprises the following steps:
step 1, as shown in fig. 1, a social network cross-platform malicious user detection hierarchical architecture based on longitudinal federated learning is constructed, and the architecture comprises a data preprocessing layer, an encryption sample alignment layer, a federated learning layer and a data application layer.
Data preprocessing layer: in an actual application scenario, owing to specific functional requirements, technical levels, storage modes and the like, the data of each participant usually does not exist in a structured form; the data preprocessing layer handles the conversion of such data into structured data during modeling. As shown in fig. 2, the preprocessing operation converts each participant's data into structured data through data cleaning, random sampling, data binning, and data normalization.
Sample alignment layer: before modeling, the sample alignment layer uses an encrypted ID matching technique to align the users shared by all participants, on the premise of guaranteeing user security and privacy protection.
Federated learning layer: the federated learning layer performs model training through encrypted parameter exchange. After the common samples of the two parties are determined, each participant can cooperatively train a global model under the machine learning definition. However, to prevent privacy disclosure during model training, the federated learning layer needs to introduce a trusted cooperating party and use a privacy protection technology (such as homomorphic encryption) to encrypt and decrypt sample data and coordinate the training process.
Data application layer: after the federated learning layer is trained, each participant updates its local training model and outputs the prediction result; the data application layer transmits the prediction result back to the terminal through an encapsulated data calling interface, and the terminal updates and classifies local data, providing a basis for malicious user detection.
Step 2: first select several participants and divide them into an active party and a passive party. The active party provides both the sample data and the label values of users, while the passive party provides only the sample data; the sample data of both parties is preprocessed at the data preprocessing layer to obtain structured data.
In the invention, data that could leak privacy during each participant's model training is called sensitive data. To ensure the security of sensitive data, a malicious user detection algorithm oriented to multi-party privacy protection is encapsulated in the federated learning layer, and a privacy protection method (homomorphic encryption) is adopted to encrypt the sensitive data, so that multi-party training can be carried out without exposing any participant's data. At the same time, the roles played by the participants in the algorithm are defined as the active party and the passive party, respectively.
Active party: provides the sample data and label values of users, plays the role of the cooperating party during training, participates in the encryption and decryption of sensitive data, and coordinates the training process.
A passive side: typically only sample data for the user is provided.
Step 3: at the sample alignment layer, the structured data produced by the data preprocessing layer is mapped to the sample data common to the active and passive parties, using a secure set-intersection scheme that combines the RSA asymmetric encryption algorithm with a hash mechanism.
Step 4: after the processing of step 3, the active and passive parties have determined the sample data shared by both sides; using their local models, they cooperatively train a global model under the definition of machine learning, and the data of both parties is encrypted and decrypted with homomorphic encryption to complete the training of the federated learning layer.
Step 5: after the federated learning layer is trained, the active and passive parties update their local model training parameters, and the prediction result is output to the data application layer.
Step 6: a data calling interface is encapsulated at the data application layer, the prediction result obtained by the federated learning layer is transmitted back to each participant, and each participant updates and classifies local data to obtain the malicious user detection result.
The invention sets the algorithm objective function as the sum of a loss function and a regularization penalty term; the regularization penalty term is introduced to control the complexity of the model and prevent overfitting, making the algorithm more efficient at classification during solving. The objective function is:

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)    (1)

where n is the number of user samples, t indexes the decision trees, the loss function l(y_i, \hat{y}_i) expresses the residual between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) is the regularization penalty term.
At the t-th iteration of the objective function, the structures and parameters of the first t-1 trees are already determined; by the forward stagewise additive method, the prediction for a sample at round t, \hat{y}_i^{(t)}, equals the prediction of the previous t-1 trees, \hat{y}_i^{(t-1)}, plus a new decision tree f_t(x_i), as shown in formula (2):

\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)    (2)
Substituting equation (2) into equation (1), the expanded objective function is given by equation (3):

Obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (3)
Next, a second-order Taylor expansion is applied to equation (3), as shown in equation (4):

Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)    (4)

where g_i and h_i are the first and second derivatives of the loss with respect to \hat{y}_i^{(t-1)}.
The regularization penalty term of the proposed algorithm can be expressed as:

\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2    (5)
where \gamma is a complexity parameter, T is the number of leaf nodes, and \lambda is the penalty parameter on the leaf-node weights w_j. Substituting formula (5) into formula (4), the objective function is further rewritten as:

Obj^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \tfrac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T    (6)
In formula (6), \lambda, \gamma, g_i, h_i are all known quantities and only w_j is unknown; I_j denotes the set of samples that fall on the same leaf node j during sample division. Minimizing (6) as a quadratic function of w_j (by taking its extremum) gives the optimal weight of leaf node j:

w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}    (7)

Substituting the optimal weight w_j^* back into formula (6), the optimal objective function is obtained as:

Obj^* = - \tfrac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T    (8)
To obtain the optimal division of the sample space, every time a node is split its samples are divided into two disjoint sample spaces. Let I_L, I_R be the sample spaces of the left and right subtrees respectively, with I_R + I_L = I the total sample space of the current node. The sums of the first-order and second-order gradients on the left and right sides are then:

G_L = \sum_{i \in I_L} g_i,  H_L = \sum_{i \in I_L} h_i,  G_R = \sum_{i \in I_R} g_i,  H_R = \sum_{i \in I_R} h_i    (9)
Finally, the evaluation value after splitting a leaf node minus the value before splitting is maximized, so the optimal division of the sample space is:

Divide_max = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma    (10)
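The optimal-weight and split-gain computations described above can be checked with a short plaintext sketch (illustrative only, not part of the patented protocol; the function names `leaf_weight` and `split_gain` are assumptions):

```python
# Plaintext sketch of the XGBoost-style leaf weight and split gain.

def leaf_weight(g, h, lam):
    """Optimal leaf weight: w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting a node's samples into left/right sample spaces."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

In the federated setting described below, the passive party only ever sees encrypted g_i and h_i, so these sums are aggregated over ciphertexts and decrypted by the active party.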
As the implementation of the algorithm shows, at each iteration t of the objective function, the first derivative g_i and second derivative h_i of the loss function l with respect to the prediction \hat{y}^{(t-1)} of the previous t-1 trees are computed, and the optimal weight and optimal division are obtained from g_i and h_i. The calculation of the optimal weights and optimal divisions therefore depends on g_i and h_i, whose computation in turn depends on the class label y_i in the samples. If g_i and h_i were exchanged directly during training, there would be a risk of privacy disclosure; the algorithm therefore stipulates that g_i and h_i must be calculated by the active party and encrypted using additive homomorphic encryption, so that the passive party cannot use the derivative information to deduce the label information during training.
The specific implementation process of encrypting and decrypting the data of the active side and the passive side by using homomorphic encryption is as shown in fig. 5:
step 2.1, the active party calculates the first-order gradient value g_i and the second-order gradient value h_i of the sensitive data, encrypts g_i and h_i using additive homomorphic encryption, and then sends the encrypted gradient values to the passive party;
step 2.2, the passive party buckets all of its own features, maps each feature value into a bucket, aggregates the corresponding encrypted gradient values according to the bucketed feature values, and then sends the aggregated encrypted gradient information to the active party;
step 2.3, the active party decrypts the received aggregated encrypted gradient information to obtain the optimal division Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
step 2.4, the passive party uses the received feature ID and threshold ID to divide the total sample space I of the current node, where I_R + I_L = I and I_L, I_R are the left and right sample spaces respectively; it records the record ID, feature ID and threshold ID of the current node, and sends the record ID together with the divided left sample space I_L to the active party;
step 2.5, the active party divides the current node according to the record ID and the left sample space I_L, and proceeds to the next node;
step 2.6, iterating the processes of (2.2) to (2.5); after all current decision trees have been constructed, the optimal weight of each leaf node j in the decision trees, w_j^* = -(\sum_{i \in I_j} g_i) / (\sum_{i \in I_j} h_i + \lambda), is calculated, finishing the training;
step 2.7, the active side sends the record ID of the current node and the threshold value of the characteristic to the passive side;
step 2.8, the passive side compares the threshold value result of the current node to obtain a search decision and sends the search decision to the active side;
step 2.9, the active party receives the search decision and descends to the corresponding child node, repeating until a leaf node is reached, yielding a classification label and its optimal weight;
and step 2.10, iterating the processes of (2.7) to (2.9), traversing all decision trees, performing a weighted summation of the optimal weights corresponding to the obtained classification labels, and finally obtaining the label sets of normal users and malicious users.
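The lookup protocol of steps 2.7-2.9 can be sketched in a minimal plaintext form, with the passive party answering threshold comparisons so the active party never sees the raw feature values (all class and method names here are illustrative assumptions, not from the patent):

```python
# Sketch of the tree-traversal lookup: the active party walks the tree, but
# each internal node's feature value and threshold live with the passive party,
# which only reveals a left/right search decision.

class Node:
    def __init__(self, record_id=None, left=None, right=None, leaf_weight=None):
        self.record_id = record_id      # lookup key held by the passive party
        self.left, self.right = left, right
        self.leaf_weight = leaf_weight  # set only on leaf nodes

class PassiveParty:
    def __init__(self, features, thresholds):
        self.features = features        # sample_id -> {record_id: feature value}
        self.thresholds = thresholds    # record_id -> split threshold

    def search_decision(self, sample_id, record_id):
        # Compares locally; returns only the branch, not the feature value.
        value = self.features[sample_id][record_id]
        return "left" if value <= self.thresholds[record_id] else "right"

def predict(root, passive, sample_id):
    """Active party descends until a leaf, asking the passive party at each node."""
    node = root
    while node.leaf_weight is None:
        decision = passive.search_decision(sample_id, node.record_id)
        node = node.left if decision == "left" else node.right
    return node.leaf_weight
```

Per step 2.10, the returned leaf weights from all trees would then be summed to produce the final normal/malicious classification.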
During algorithm training, more and more samples are gradually added to the left sample space, which is used to divide the current node, so the value with the maximum gain is easily found, i.e., the optimal division is obtained.
Example 1
In the invention, combining the multimedia social network CyVOD, a conventional federated learning framework is extended and improved to build a social network cross-platform malicious user detection framework based on longitudinal federated learning, as shown in fig. 6, which fuses secure and compliant multi-party data for modeling analysis and achieves high-quality detection of malicious users, thereby helping maintain the ecological environment of the social network.
The whole framework is divided into four parts, namely a data preprocessing stage, a sample alignment stage, a federal learning stage and a data application stage.
Data preprocessing stage: at this stage the Android mobile side (active party) and the PC website side (passive party) of CyVOD are selected as data providers, on which an OSN six-tuple (video, policy, guide, notification, post and false information) metadata experimental platform is built. In total there are click actions from 68 users: 28 static user-attribute features at the PC end totalling 50,898 records, and 40 dynamic user-attribute features at the mobile end totalling 1,076,307 records. An effective problem handling mechanism is set up at this stage and, as shown in fig. 2, the robustness of the training process is further improved by performing data cleaning, random sampling, data binning, numerical normalization and other operations on the raw data of all participants.
The invention adopts the following problem handling mechanism in the data preprocessing stage: (1) when the problems of repetition, deletion and the like occur, the sample data is processed by adopting the operations of deletion method, filling method and the like; (2) when the distribution is unbalanced, the sample data is randomly sampled, so that the model prediction and classification effects are improved; (3) when the continuous characteristic variable appears, the sample data is subjected to box separation, namely discretization is carried out on the continuous characteristic variable, so that the stability of the model is improved; (4) when the data dimension difference is obvious, normalization processing is carried out on the sample data, and the training speed and the convergence direction of the model are improved.
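The four operations of the problem handling mechanism can be sketched on a single numeric feature column as follows (a simplified illustration using only the Python standard library; the function names are assumptions, not from the patent):

```python
import random

def clean(values):
    """(1) Cleaning: drop missing entries and duplicates, keeping first occurrences."""
    seen, out = set(), []
    for v in values:
        if v is None or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out

def sample(values, k, seed=0):
    """(2) Random sampling: draw k values to rebalance a skewed distribution."""
    rng = random.Random(seed)
    return rng.sample(values, min(k, len(values)))

def bin_index(v, lo, hi, n_bins):
    """(3) Binning: discretize a continuous value into an equal-width bucket id."""
    if hi == lo:
        return 0
    i = int((v - lo) / (hi - lo) * n_bins)
    return min(i, n_bins - 1)

def normalize(values):
    """(4) Normalization: min-max scale values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi != lo else 0.0 for v in values]
```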
Sample alignment stage: at this stage, a secure intersection scheme combining the RSA algorithm with a hash function is adopted to map the common sample IDs of the Android mobile side and the PC website side. As shown in FIG. 3, the Android mobile side first generates an RSA public/private key pair and transmits the public key to the PC website side. The PC website side hash-maps its local data IDs with a hash function, ensuring that user IDs are never transmitted in plaintext; it then encrypts with the public key and sends the encrypted data samples to the Android mobile side. Upon receiving the passive party's data samples, the Android mobile side decrypts them with its local private key, hash-maps its own local data, and securely intersects it with the received PC website sample data. Finally, the Android mobile side sends the matched sample IDs to the PC website side, completing the sample alignment stage.
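The intersection flow described above can be sketched as follows. This is a deliberately simplified illustration with toy parameters: the RSA primes are tiny, the user IDs are invented, and the hashes are encrypted directly rather than via the blinded RSA-PSI used in production systems (direct hash-then-encrypt is deterministic and dictionary-attackable, so treat this strictly as a walkthrough of the message flow, not a secure protocol).

```python
import hashlib

def h(uid, n):
    # hash the user ID so it is never transmitted in plaintext
    return int.from_bytes(hashlib.sha256(uid.encode()).digest(), "big") % n

# --- Active party (Android mobile side): toy RSA key pair ---
# tiny primes for illustration only; real deployments use >= 2048-bit keys
p, q = 10007, 10009
n, e = p * q, 65537
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)          # private exponent, kept by the active party

active_ids  = {"alice", "bob", "carol"}   # illustrative IDs
passive_ids = {"bob", "carol", "dave"}

# Passive party (PC website side): hash local IDs, encrypt with the public key
cipher = {pow(h(uid, n), e, n): uid for uid in passive_ids}

# Active party: decrypt each received value with the private key,
# then intersect with the hashes of its own local IDs
local_hashes = {h(uid, n): uid for uid in active_ids}
matched = sorted(local_hashes[m] for c in cipher
                 if (m := pow(c, d, n)) in local_hashes)
```

Here `matched` recovers the common sample IDs ("bob" and "carol"), which the active party then returns to the passive party to finish alignment.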
Federated learning stage: as shown in FIG. 4, model training at this stage uses encrypted parameter exchange. After the Android mobile side and the PC website side have determined the sample data common to both parties, homomorphic encryption is introduced to encrypt sensitive data, preventing leakage of data and user privacy. The detailed process is as follows:
(1) The Android mobile side first computes the gradient values, encrypts them with additively homomorphic encryption, and sends the encrypted gradient values to the PC website side.
(2) The PC website side bins all of its features, mapping each feature value into a bucket; it then aggregates the corresponding encrypted gradient information according to the binned feature values and sends the aggregated result to the Android mobile side.
(3) The Android mobile side decrypts the received aggregated result, obtains the optimal split of the current node, and returns the feature ID and threshold ID of the current node to the PC website side.
(4) The PC website side receives the feature ID and threshold ID, splits the current sample space, records the current record ID, feature ID, and threshold ID, and sends the record ID together with the resulting left sample space to the Android mobile side.
(5) The Android mobile side splits the current node according to the record ID and the left sample space, then proceeds to split the next node.
(6) Steps (2)-(5) repeat until all decision trees are constructed, and the optimal weight of each leaf node is computed.
(7) After training is finished, the Android mobile side sends the record ID of the current node and the feature threshold to the PC website side.
(8) The PC website side compares against the threshold at the current node to obtain a search decision and sends it to the Android mobile side.
(9) The Android mobile side receives the search decision and descends to the corresponding child node, repeating until each leaf node is reached and its classification label and weight are obtained.
(10) Steps (7)-(9) repeat until all decision trees have been traversed; finally, the Android mobile side performs a weighted summation of the optimal weights corresponding to the traversed classification labels and outputs the label sets of normal users and malicious users.
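Steps (1)-(2) hinge on additively homomorphic encryption: the PC website side can sum encrypted gradients per bucket by multiplying ciphertexts, without ever decrypting them. The sketch below implements a toy Paillier cryptosystem, the additive scheme commonly used in SecureBoost-style vertical federated boosting, with tiny primes and small integer gradients; real deployments use large keys and fixed-point encodings of floating-point first- and second-order gradients. All variable names are illustrative.

```python
import math, random

# --- Toy Paillier cryptosystem (additively homomorphic) ---
# Tiny primes for illustration only; real systems use >= 1024-bit primes.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)          # valid simplification because g = n + 1

def encrypt(m):
    # E(m) = g^m * r^n mod n^2 for a random r coprime to n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Active party: encrypt per-sample gradients and send the ciphertexts
gradients = [3, 7, 2, 5, 1]            # toy integer gradients
cts = [encrypt(gi) for gi in gradients]

# Passive party: aggregate encrypted gradients per feature bucket by
# multiplying ciphertexts -- E(a) * E(b) mod n^2 decrypts to a + b
buckets = {0: [0, 2], 1: [1, 3, 4]}    # bucket -> sample indices
agg = {b: math.prod(cts[i] for i in idx) % n2 for b, idx in buckets.items()}

# Active party: decrypt only the per-bucket sums to evaluate candidate splits
sums = {b: decrypt(c) for b, c in agg.items()}
```

The passive party thus learns nothing about individual gradients, while the active party recovers exactly the per-bucket sums (here 3+2=5 and 7+5+1=13) needed to choose the best split.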
Data application stage: after the federated learning layer is trained, each participant updates its local training model and outputs a prediction result. At this stage the prediction result is returned to the terminal through the data-calling interface encapsulated by CyVOD, and the terminal updates and classifies its local data to obtain the malicious user detection result. As shown in FIG. 6, the PC website side marks malicious users so that an administrator can handle them promptly.
The method can also be packaged in a hardware device: the detection result is obtained directly by the hardware device and the processing result is shown on a display screen.

Claims (2)

1. A social network cross-platform malicious user detection method based on longitudinal federated learning, characterized in that the method comprises the following steps:
step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on longitudinal federated learning, wherein the architecture comprises a data preprocessing layer, an encryption sample alignment layer, a federated learning layer and a data application layer;
step 2, first selecting a plurality of participants and dividing them into an active party and a passive party, the active party providing both the users' sample data and label values and the passive party providing only the users' sample data, then preprocessing the sample data of the active party and the passive party at the data preprocessing layer to obtain structured data;
step 3, for the structured data produced by the data preprocessing layer, mapping the sample data common to the active party and the passive party at the sample alignment layer using a secure intersection scheme combining the RSA asymmetric encryption algorithm and a hash mechanism;
step 4, after step 3, the active party and the passive party having determined the sample data shared by both parties, the two parties cooperatively training a global model under the definition of machine learning using their local models, with homomorphic encryption used to encrypt and decrypt the active-party and passive-party data, thereby completing the training of the federated learning layer;
the specific process of encrypting and decrypting the active-party and passive-party data with homomorphic encryption comprises the following steps:
step 4.1, the active party computes first-order and second-order gradient values of the sensitive data, encrypts them with additively homomorphic encryption, and sends the encrypted gradient values to the passive party;
step 4.2, the passive party bins all of its features into buckets, maps each feature value into a bucket, aggregates the corresponding encrypted gradient information according to the binned feature values, and sends the aggregated encrypted gradient information to the active party;
step 4.3, the active party decrypts the received aggregated encrypted gradient information, obtains the optimal split Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
step 4.4, the passive party receives the feature ID and threshold ID and splits the total sample space I of the current node, where I_R + I_L = I, with I_L and I_R the left and right sample spaces respectively; it records the record ID, feature ID, and threshold ID of the current node, and sends the record ID together with the left sample space I_L to the active party;
step 4.5, the active party splits the current node according to the record ID and the left sample space I_L, and proceeds to split the next node;
step 4.6, iterating steps (4.2) to (4.5); after the construction of all the current decision trees is completed, computing the optimal weight of each leaf node j in the decision trees, w_j* = -(Σ_{i∈I_j} g_i) / (Σ_{i∈I_j} h_i + λ), where g_i and h_i are the first- and second-order gradients of the samples falling in leaf j and λ is a regularization term, finishing the training;
step 4.7, the active party sends the record ID of the current node and the feature threshold to the passive party;
step 4.8, the passive party compares against the threshold at the current node to obtain a search decision and sends the search decision to the active party;
step 4.9, the active party receives the search decision and descends to the corresponding child node, repeating until a leaf node is reached and its classification label and optimal weight are obtained;
step 4.10, iterating steps (4.7) to (4.9), then performing a weighted summation of the optimal weights corresponding to the classification labels obtained by traversing all decision trees, finally obtaining the label sets of normal users and malicious users;
step 5, after the federated learning layer is trained, the active party and the passive party updating their local model training parameters and outputting the prediction results to the data application layer;
and step 6, encapsulating a data-calling interface in the data application layer, returning the prediction results obtained by the federated learning layer to each participant, and each participant updating and classifying its local data to obtain the malicious user detection results.
2. The social network cross-platform malicious user detection method based on longitudinal federated learning of claim 1, wherein the preprocessing operation converts each participant's data into structured data through data cleaning, random sampling, data binning, and data normalization.
CN202110275639.0A 2021-03-15 2021-03-15 Social network cross-platform malicious user detection method based on longitudinal federal learning Active CN113051557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110275639.0A CN113051557B (en) 2021-03-15 2021-03-15 Social network cross-platform malicious user detection method based on longitudinal federal learning


Publications (2)

Publication Number Publication Date
CN113051557A CN113051557A (en) 2021-06-29
CN113051557B true CN113051557B (en) 2022-11-11

Family

ID=76512271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110275639.0A Active CN113051557B (en) 2021-03-15 2021-03-15 Social network cross-platform malicious user detection method based on longitudinal federal learning

Country Status (1)

Country Link
CN (1) CN113051557B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537333B (en) * 2021-07-09 2022-05-24 深圳市洞见智慧科技有限公司 Method for training optimization tree model and longitudinal federal learning system
CN113435537B (en) * 2021-07-16 2022-08-26 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113554182B (en) * 2021-07-27 2023-09-19 西安电子科技大学 Detection method and system for Bayesian court node in transverse federal learning system
CN113673700A (en) * 2021-08-25 2021-11-19 深圳前海微众银行股份有限公司 Longitudinal federal prediction optimization method, device, medium, and computer program product
CN113657615B (en) * 2021-09-02 2023-12-05 京东科技信息技术有限公司 Updating method and device of federal learning model
CN113506163B (en) * 2021-09-07 2021-11-23 百融云创科技股份有限公司 Isolated forest training and predicting method and system based on longitudinal federation
CN114065950B (en) * 2022-01-14 2022-05-03 华控清交信息科技(北京)有限公司 Gradient aggregation method and device in GBDT model training and electronic equipment
CN114118312B (en) * 2022-01-29 2022-05-13 华控清交信息科技(北京)有限公司 Vertical training method, device, electronic equipment and system for GBDT model
CN114239863B (en) * 2022-02-24 2022-05-20 腾讯科技(深圳)有限公司 Training method of machine learning model, prediction method and device thereof, and electronic equipment
CN114677200A (en) * 2022-04-01 2022-06-28 重庆邮电大学 Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning
US11593485B1 (en) * 2022-06-17 2023-02-28 Uab 360 It Malware detection using federated learning

Citations (6)

Publication number Priority date Publication date Assignee Title
CN108684043A (en) * 2018-05-15 2018-10-19 南京邮电大学 The abnormal user detection method of deep neural network based on minimum risk
CN110874649A (en) * 2020-01-16 2020-03-10 支付宝(杭州)信息技术有限公司 State machine-based federal learning method, system, client and electronic equipment
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111724174A (en) * 2020-06-19 2020-09-29 安徽迪科数金科技有限公司 Citizen credit point evaluation method applying Xgboost modeling
CN112364908A (en) * 2020-11-05 2021-02-12 浙江大学 Decision tree-oriented longitudinal federal learning method
CN112364943A (en) * 2020-12-10 2021-02-12 广西师范大学 Federal prediction method based on federal learning

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
FR3097353B1 (en) * 2019-06-12 2021-07-02 Commissariat Energie Atomique COLLABORATIVE LEARNING METHOD OF AN ARTIFICIAL NEURON NETWORK WITHOUT DISCLOSURE OF LEARNING DATA
CN112163979A (en) * 2020-10-19 2021-01-01 科技谷(厦门)信息技术有限公司 Urban traffic trip data analysis method based on federal learning


Non-Patent Citations (3)

Title
Dynamic Sample Selection for Federated Learning with Heterogeneous Data in Fog Computing;Lingshuang Cai 等;《ICC 2020 - 2020 IEEE International Conference on Communications (ICC)》;20200727;第1-6页 *
Survey of abnormal user identification techniques in social networks; Zhong Lijun et al.; Computer Engineering and Applications; 20180831; vol. 54, no. 16; pp. 13-23 *
Application of federated learning in the anti-fraud field of commercial banks; Su Jianming et al.; Financial Computer of China; 20210228; no. 2; pp. 39-42 *

Also Published As

Publication number Publication date
CN113051557A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113051557B (en) Social network cross-platform malicious user detection method based on longitudinal federal learning
Liu et al. Federated forest
Ahmed et al. Graph sample and hold: A framework for big-graph analytics
CN110011784B (en) KNN classification service system and method supporting privacy protection
Wan et al. Privacy-preservation for gradient descent methods
Gheid et al. Efficient and privacy-preserving k-means clustering for big data mining
CN115102763B (en) Multi-domain DDoS attack detection method and device based on trusted federal learning
Makkar et al. Secureiiot environment: Federated learning empowered approach for securing iiot from data breach
CN111143865B (en) User behavior analysis system and method for automatically generating label on ciphertext data
CN115242371B (en) Differential privacy-protected set intersection and base number calculation method, device and system thereof
Zhang et al. A survey on security and privacy threats to federated learning
Dhasade et al. TEE-based decentralized recommender systems: The raw data sharing redemption
Zhou et al. Securing federated learning enabled NWDAF architecture with partial homomorphic encryption
CN109508559B (en) Multi-dimensional data local privacy protection method based on connection function in crowd sensing system
Xu et al. Mining cloud 3D video data for interactive video services
Manzoor et al. Federated learning based privacy ensured sensor communication in IoT networks: a taxonomy, threats and attacks
Wei et al. Efficient multi-party private set intersection protocols for large participants and small sets
Sharma et al. Privacy-preserving boosting with random linear classifiers
CN115481415A (en) Communication cost optimization method, system, device and medium based on longitudinal federal learning
Wang et al. Federated cf: Privacy-preserving collaborative filtering cross multiple datasets
Zhou et al. A survey of security aggregation
Nie et al. A covert network attack detection method based on lstm
Sandeepa et al. Privacy of the Metaverse: Current Issues, AI Attacks, and Possible Solutions
Di Crescenzo et al. Encrypted-Input Program Obfuscation: Simultaneous Security Against White-Box and Black-Box Attacks
Shen et al. Secure Decentralized Aggregation to Prevent Membership Privacy Leakage in Edge-based Federated Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant