CN113051557B - Social network cross-platform malicious user detection method based on vertical federated learning - Google Patents
Social network cross-platform malicious user detection method based on vertical federated learning
- Publication number
- CN113051557B (application CN202110275639A)
- Authority
- CN
- China
- Prior art keywords
- data
- party
- passive
- active
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/554—Detecting local intrusion or implementing counter-measures involving event detection and direct action
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
A social network cross-platform malicious user detection method based on vertical federated learning comprises the following steps: step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on vertical federated learning; step 2, dividing the participants into an active party and a passive party, and preprocessing the sample data of the active party and the passive party at a data preprocessing layer to obtain structured data; step 3, mapping the structured data produced by the data preprocessing layer to the sample data shared by the active party and the passive party; step 4, cooperatively training a global model under the definition of machine learning, encrypting and decrypting the data of the active party and the passive party with homomorphic encryption, and thereby completing the federated learning layer training; step 5, updating the local model training parameters of each party and outputting the prediction result; and step 6, transmitting the prediction result obtained by the federated learning layer back to each participant at a data application layer, thereby achieving a high-quality malicious user detection effect.
Description
Technical Field
The invention belongs to the technical field of the Internet, and particularly relates to a social network cross-platform malicious user detection method based on vertical federated learning.
Background
Online Social Networks (OSNs) have developed rapidly: according to the 45th Statistical Report on China's Internet Development (March 2020), OSN users reached 904 million and Internet penetration reached 64.5%. While OSNs help people build social network application services, they have gradually become the primary target of malicious users attempting to carry out illegal activities and cause malicious harm, and these malicious behaviors bring adverse effects and great harm to society.
At present, traditional machine learning methods, such as semi-supervised clustering and support vector machine classifiers, rely on big-data techniques to extract and train on the behavioral features of malicious users, and achieve high-quality detection on OSN platforms.
Technical scheme 1: the article "Detecting Malicious Social Bots Based on Clickstream Sequences" by Shi P. et al. (IEEE Access, 2019) proposes a malicious user detection algorithm based on spatial and temporal features, using context-aware transition probability features between click streams.
Technical scheme 2: the paper "Malicious program multi-feature detection on the Android platform" by Wu et al. (Journal of Chinese Computer Systems) proposes a hybrid algorithm based on multiple feature types, constructing different classifiers from large-scale feature data to achieve efficient detection results.
However, the successful application of the above schemes relies on centralized social big data. In real application scenarios, malicious users exhibit dispersion, latency, complexity and other characteristics, and the data of any single party can hardly meet the detection requirements, so the data of two or even multiple parties must be jointly trained to achieve a satisfactory detection effect. Moreover, as laws and regulations mature, the emphasis on user privacy and data security has become a globally accepted trend; for example, the General Data Protection Regulation (GDPR) issued by the European Union explicitly prohibits gathering the user data of each party without user consent. Therefore, how to solve the problem of data fragmentation while complying with laws and regulations is undoubtedly an important research subject in current social network scenarios.
Disclosure of Invention
To solve the above technical problems, the invention provides a social network cross-platform malicious user detection method based on vertical federated learning, which fuses multi-party data for modeling and analysis on the premise of protecting the privacy of common user data, thereby achieving a high-quality malicious user detection effect.
To achieve this technical purpose, the adopted technical scheme is as follows: a social network cross-platform malicious user detection method based on vertical federated learning comprises the following steps:
step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on vertical federated learning, wherein the architecture comprises a data preprocessing layer, an encrypted sample alignment layer, a federated learning layer and a data application layer;
step 2, first selecting several participants and dividing them into an active party and a passive party, where the active party provides user sample data together with label values and a passive party provides only user sample data, and preprocessing the sample data of the active party and the passive party at the data preprocessing layer to obtain structured data;
step 3, at the sample alignment layer, mapping the structured data produced by the data preprocessing layer to the sample data common to the active party and the passive party, using a secure set-intersection scheme based on the RSA asymmetric encryption algorithm and a hash mechanism;
step 4, after the processing of step 3, the active party and the passive party have determined the sample data shared by both sides; using their local models, the active party and the passive party cooperatively train a global model under the definition of machine learning, and the data of the active party and the passive party are encrypted and decrypted with homomorphic encryption, completing the training of the federated learning layer;
step 5, after the federated learning layer is trained, the active party and the passive party update their local model training parameters and output the prediction result;
and step 6, encapsulating a data-calling interface in the data application layer, transmitting the prediction result obtained by the federated learning layer back to each participant, and letting each participant update and classify local data to obtain the malicious user detection result.
The specific implementation process of encrypting and decrypting the data of the active party and the passive party using homomorphic encryption comprises the following steps:
Step 2.1, the active party calculates the first-order gradient value and the second-order gradient value of the sensitive data, encrypts them using additively homomorphic encryption, and then sends the encrypted gradient values to the passive party;
Step 2.2, the passive party bins all of its features, maps each feature value into a bucket, aggregates the corresponding encrypted gradient information according to the binned feature values, and then sends the aggregated encrypted gradient information to the active party;
Step 2.3, the active party decrypts the received aggregated encrypted gradient information to obtain the optimal division Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
Step 2.4, the passive party receives the feature ID and threshold ID and divides the total sample space I of the current node, where I_R + I_L = I and I_L, I_R are the left and right sample spaces respectively; it records the record ID, feature ID and threshold ID of the current node, and sends the record ID and the divided left sample space I_L to the active party;
Step 2.5, the active party divides the current node according to the record ID and the left sample space I_L, and proceeds to the division of the next node;
Step 2.6, the processes (2.2) to (2.5) are iterated; after all the current decision trees have been constructed, the optimal weight w_j* of each leaf node in the decision trees is calculated, finishing the training;
Step 2.7, the active party sends the record ID of the current node and the feature's threshold to the passive party;
Step 2.8, the passive party compares the current node's feature value against the threshold to obtain a search decision and sends the search decision to the active party;
Step 2.9, the active party receives the search decision and moves to the corresponding child node, continuing until a leaf node is reached and a classification label and the label's optimal weight are obtained;
Step 2.10, the processes (2.7) to (2.9) are iterated; after all the decision trees have been traversed, a weighted summation is performed over the optimal weights corresponding to the obtained classification labels, finally yielding the label sets of normal users and malicious users.
The preprocessing operation converts each participant's data into structured data through data cleaning, random sampling, data binning, data normalization and similar operations.
The invention has the beneficial effects that:
(1) The invention provides a social network cross-platform malicious user detection method based on vertical federated learning, realized on the premise of ensuring user privacy and data security.
(2) The invention provides an effective problem-handling mechanism for unstructured data at the data preprocessing layer, used to solve the multi-source heterogeneity problem of the data.
(3) The invention constructs a data application layer; by encapsulating a data-calling interface, cross-platform malicious users in the social network can be detected in real time.
(4) The invention uses homomorphic encryption to encrypt and decrypt the data of the active party and the passive party during algorithm execution. The algorithm is an end-to-end detection algorithm that, under the premise of privacy protection, achieves the same accuracy as traditional machine learning methods. A regularization penalty term is added to the algorithm, improving the generalization ability and detection effect of the model, and sensitive data are encrypted, effectively guaranteeing the security and accuracy of the model.
Drawings
FIG. 1 is a hierarchical architecture for cross-platform malicious user detection in a social network according to the present invention;
FIG. 2 is a flow of a data pre-processing layer of the present invention;
FIG. 3 is a sample alignment layer flow of the present invention;
FIG. 4 is a federated learning layer flow of the present invention;
FIG. 5 is a flow chart of a malicious user detection algorithm for multi-party privacy protection according to the present invention;
FIG. 6 is a social network cross-platform malicious user detection framework of the present invention;
FIG. 7 is a malicious user detection page of the multimedia social network CyVOD desktop version of the present invention.
Detailed Description
With the rapid development of online social networks, social networks, while helping people build social network application services, have gradually become the primary target of malicious users attempting to carry out illegal activities and cause malicious harm. Malicious users can lurk across multiple social network platforms, attempting to steal users' privacy, infiltrate political topics and so on by publishing false information; these behaviors bring adverse effects and great harm to society. At present, existing machine learning detection methods achieve high-quality detection based on large-scale data. However, as laws and regulations mature, gathering the user data of all parties in one place is no longer permissible. Therefore, the invention relies on federated learning technology to fuse multi-party data for modeling and analysis on the premise of guaranteeing data security and user privacy protection, thereby achieving accurate detection of malicious users on social network platforms.
A social network cross-platform malicious user detection method based on vertical federated learning comprises the following steps:
Step 1: as shown in FIG. 1, a social network cross-platform malicious user detection hierarchical architecture based on vertical federated learning is constructed; the architecture comprises a data preprocessing layer, an encrypted sample alignment layer, a federated learning layer and a data application layer.
Data preprocessing layer: in actual application scenarios, due to differing functional requirements, technical levels, storage modes and so on, the data of each participant usually does not exist in a structured form; data preprocessing handles the conversion of raw data into structured data for the modeling process. As shown in FIG. 2, the preprocessing operation converts each participant's data into structured data by data cleansing, random sampling, data binning and data normalization.
Sample alignment layer: before the participants begin modeling, the sample alignment layer aligns the users shared by all participants using an encrypted ID-matching technique, on the premise of guaranteeing user security and privacy protection.
Federated learning layer: the federated learning layer performs model training through the exchange of encrypted parameters. After the common samples of the parties are determined, the participants can cooperatively train a global model under the machine learning definition. However, to prevent privacy leakage during model training, the federated learning layer needs to introduce a trusted coordinating party, which uses a privacy-protection technique (such as homomorphic encryption) to encrypt and decrypt sample data and to coordinate the training process.
Data application layer: after the federated learning layer finishes training, each participant updates its local training model and outputs the prediction result; the data application layer transmits the prediction result back to the terminal through an encapsulated data-calling interface, and the terminal updates and classifies local data, providing a basis for malicious user detection.
Step 2: first, several participants are selected and divided into an active party and a passive party; the party providing user sample data and label values acts as the active party, while a party providing only user sample data acts as the passive party. The sample data of the active party and the passive party are preprocessed at the data preprocessing layer to obtain structured data.
In the invention, data whose exposure would leak privacy during each participant's model training is called sensitive data. To guarantee the security of the sensitive data, a malicious user detection algorithm oriented to multi-party privacy protection is encapsulated in the federated learning layer, and a privacy-protection method (homomorphic encryption) is adopted to encrypt the sensitive data, so that multi-party training can be carried out without exposing any participant's data. At the same time, the roles played by the participants in the algorithm are defined as the active party and the passive party, respectively.
The active party: provides user sample data and label values, plays the role of coordinator in the training process, participates in the encryption and decryption of the sensitive data, and coordinates the training process.
The passive party: typically provides only user sample data.
Step 3: at the sample alignment layer, the structured data produced by the data preprocessing layer are mapped to the sample data common to the active party and the passive party, using a secure set-intersection scheme based on the RSA asymmetric encryption algorithm and a hash mechanism.
Step 4: after the processing of step 3, the active party and the passive party have determined the sample data shared by both sides; using their local models, the active party and the passive party cooperatively train a global model under the definition of machine learning, and the data of the active party and the passive party are encrypted and decrypted with homomorphic encryption, completing the training of the federated learning layer.
Step 5: after the federated learning layer is trained, the active party and the passive party update their local model training parameters, and the prediction result is output to the data application layer.
Step 6: a data-calling interface is encapsulated in the data application layer, the prediction result obtained by the federated learning layer is transmitted back to each participant, and each participant updates and classifies local data to obtain the malicious user detection result.
The invention sets the algorithm objective function as the sum of a loss function and a regularization penalty term; the regularization penalty term is introduced to control the complexity of the model and prevent overfitting, making the algorithm more efficient at classification during solving. The objective function is:

    Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)    (1)

where n is the number of user samples, t indexes the decision trees, l(y_i, \hat{y}_i) is the loss function expressing the residual between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) is the regularization penalty term.
At the t-th iteration of the objective function, the structures and parameters of the trees of the first t-1 rounds are already determined. Following forward stagewise additive modeling, the predicted value of a sample at round t equals the predicted value of the previous t-1 trees plus a new decision tree f_t(x_i), as shown in formula (2):

    \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)    (2)

Substituting formula (2) into formula (1), the expanded objective function is expressed as formula (3):

    Obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (3)
Next, a second-order Taylor expansion is applied to formula (3), as shown in formula (4):

    Obj^{(t)} \approx \sum_{i=1}^{n} [ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) ] + \Omega(f_t)    (4)

where g_i and h_i are the first-order and second-order derivatives of the loss function with respect to \hat{y}_i^{(t-1)}.
The regularization penalty term of the proposed algorithm can be expressed as:

    \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2    (5)

where \gamma is a complexity parameter, T is the number of leaf nodes, and \lambda is the penalty-degree parameter on the leaf-node weights w_j. Substituting formula (5) into formula (4) and dropping constant terms, the objective function is further rewritten as:

    Obj^{(t)} = \sum_{j=1}^{T} [ ( \sum_{i \in I_j} g_i ) w_j + \frac{1}{2} ( \sum_{i \in I_j} h_i + \lambda ) w_j^2 ] + \gamma T    (6)
in the formula (6), lambda, gamma, g i 、h i Are all known numbers, only w j As an unknown number, I j The method comprises the steps of calculating the optimal weight of a leaf node j according to the process of solving the extreme value of a unitary quadratic function of samples falling on the same leaf node j in the sample division process
To obtain the optimal division of the sample space, each time a node is split its samples are divided into two disjoint sample spaces. Let I_L, I_R be the sample spaces of the left and right subtrees respectively, with I_R + I_L = I the total sample space of the current node. The sums of the first-order and second-order gradients on the left and right sides are then expressed as:

    G_L = \sum_{i \in I_L} g_i, \quad G_R = \sum_{i \in I_R} g_i, \quad H_L = \sum_{i \in I_L} h_i, \quad H_R = \sum_{i \in I_R} h_i    (7)
Finally, the split gain is obtained by subtracting the evaluation value before splitting from the evaluation value after splitting the leaf node, and the maximum over all candidate splits gives the optimal division of the sample space:

    Divide_max = \frac{1}{2} [ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} ] - \gamma    (8)
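The optimal leaf weight and split-gain computation described above can be sketched numerically as follows; the gradient values, lambda and gamma below are illustrative, not taken from the patent:

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda), from the quadratic in w_j."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting the current node into left/right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy first/second-order gradients for five samples on one node.
g = [0.9, 0.8, -0.7, -0.6, -0.5]
h = [0.25] * 5

# Candidate split: the first two samples go left, the rest go right.
GL, HL = sum(g[:2]), sum(h[:2])
GR, HR = sum(g[2:]), sum(h[2:])

gain = split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.1)
w_left = leaf_weight(GL, HL, lam=1.0)
```

A positive gain means the split improves the objective more than the complexity cost \gamma; the training loop keeps the candidate with the largest gain.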
as can be seen from the implementation process of the algorithm, in the process of iterating the objective function t each time, the first derivative g of the prediction result y (t-1) of the loss function l relative to the previous t-1 trees is solved i And second derivative h i And according to g i And h i To obtain the optimal weight and the optimal partition. Therefore, we can easily find that the calculation of the optimal weights and optimal partitions depends on g i And h i And g is i And h i The computation depends on class label y in the sample i If g is directly used in the training process i And h i The exchange is carried out, there is a risk of privacy disclosure, so the algorithm herein sets g i And h i Must be calculated by the master and encrypted using additive homomorphism i And h i Encryption, so that the passive party cannot use the derivative information to deduce the label information during the training process.
The specific implementation process of encrypting and decrypting the data of the active party and the passive party using homomorphic encryption is shown in FIG. 5:
Step 2.1, the active party calculates the first-order gradient value g_i and the second-order gradient value h_i of the sensitive data, encrypts g_i and h_i using additively homomorphic encryption, and then sends the encrypted gradient values to the passive party;
Step 2.2, the passive party bins all of its features, maps each feature value into a bucket, aggregates the corresponding encrypted gradient information according to the binned feature values, and then sends the aggregated encrypted gradient information to the active party;
Step 2.3, the active party decrypts the received aggregated encrypted gradient information to obtain the optimal division Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
Step 2.4, the passive party receives the feature ID and threshold ID and divides the total sample space I of the current node, where I_R + I_L = I and I_L, I_R are the left and right sample spaces respectively; it records the record ID, feature ID and threshold ID of the current node, and sends the record ID and the divided left sample space I_L to the active party;
Step 2.5, the active party divides the current node according to the record ID and the left sample space I_L, and proceeds to the next node;
Step 2.6, the processes (2.2) to (2.5) are iterated; after all the current decision trees have been constructed, the optimal weight w_j* of each leaf node in the decision trees is calculated, finishing the training;
Step 2.7, the active party sends the record ID of the current node and the feature's threshold to the passive party;
Step 2.8, the passive party compares the current node's feature value against the threshold to obtain a search decision and sends the search decision to the active party;
Step 2.9, the active party receives the search decision and moves to the corresponding child node, continuing until a leaf node is reached and a classification label and the label's optimal weight are obtained;
Step 2.10, the processes (2.7) to (2.9) are iterated; after all the decision trees have been traversed, a weighted summation is performed over the optimal weights corresponding to the obtained classification labels, finally yielding the label sets of normal users and malicious users.
During algorithm training, more samples are gradually added to the left sample space, which is used to divide the current node, so the value with the maximum gain is readily found, that is, the optimal division is obtained.
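The prediction phase (steps 2.7 to 2.9) can be sketched as a joint tree traversal; the node layout, feature names and values below are hypothetical, and the cross-party message exchange is reduced to a function call:

```python
# The active party walks each tree; for split nodes whose feature belongs to
# the passive party, it asks the passive party only for the branch decision,
# so raw feature values never leave their owner.

# Internal node: (owner, feature, threshold, left, right); leaf: ("leaf", w).
tree = ("passive", "clicks_per_day", 50.0,
        ("active", "account_age_days", 30.0,
         ("leaf", 0.8),       # high weight -> leans malicious
         ("leaf", -0.2)),
        ("leaf", -0.6))

active_features = {"account_age_days": 12.0}
passive_features = {"clicks_per_day": 42.0}

def passive_decision(feature, threshold):
    # The passive party reveals only which branch to take, not the raw value.
    return "left" if passive_features[feature] <= threshold else "right"

def predict(node):
    while node[0] != "leaf":
        owner, feature, threshold, left, right = node
        if owner == "active":
            go_left = active_features[feature] <= threshold
        else:
            go_left = passive_decision(feature, threshold) == "left"
        node = left if go_left else right
    return node[1]   # leaf weight; summed over all trees, then thresholded

score = predict(tree)
```

Summing such leaf weights over all trees (step 2.10) and thresholding the total yields the normal/malicious label for each user.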
Example 1
In this embodiment, a conventional federated learning framework is extended and improved in combination with the multimedia social network CyVOD, and a social network cross-platform malicious user detection framework based on vertical federated learning is built, as shown in FIG. 6. Secure and compliant multi-party data are fused for modeling and analysis, achieving high-quality detection of malicious users and thereby maintaining the ecological environment of the social network.
The whole framework is divided into four parts: a data preprocessing stage, a sample alignment stage, a federated learning stage and a data application stage.
Data preprocessing stage: in this stage, the Android mobile party (active party) and the PC website party (passive party) of CyVOD are selected as data providers, and an OSN six-tuple (video, policy, guide, notification, post, false information) metadata experimental platform is built on this basis, covering the click actions of 68 users in total: the PC end contributes 28 static user-attribute features totaling 50,898 records, and the mobile end contributes 40 dynamic user-attribute features totaling 1,076,307 records. An effective problem-handling mechanism is set up in this stage; as shown in FIG. 2, data cleaning, random sampling, data binning, value normalization and other operations are performed on the raw data of all participants, further improving the robustness of the training process.
The invention adopts the following problem-handling mechanisms in the data preprocessing stage: (1) when duplicate or missing values occur, the sample data is processed by deletion or filling operations; (2) when the class distribution is unbalanced, the sample data is randomly sampled, improving the prediction and classification performance of the model; (3) when continuous feature variables appear, the sample data is binned, i.e., the continuous feature variables are discretized, improving the stability of the model; (4) when the data dimensions differ markedly in scale, the sample data is normalized, improving the training speed and convergence of the model.
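The four problem-handling mechanisms above can be roughly sketched in plain Python. This is a hedged illustration only: the function names, the mean-filling strategy and the equal-width binning are assumptions, since the patent does not fix concrete algorithms for these steps:

```python
import random

def fill_missing(values):
    """Mechanism (1): replace missing entries (None) with the column mean."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def random_sample(rows, k, seed=42):
    """Mechanism (2): random sampling, e.g. to rebalance a skewed class."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def equal_width_bins(values, n_bins=4):
    """Mechanism (3): discretize a continuous feature into n_bins buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def min_max_normalize(values):
    """Mechanism (4): scale a feature into [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

col = [3.0, None, 9.0, 6.0]
filled = fill_missing(col)        # None -> mean of 3, 9, 6 = 6.0
bins = equal_width_bins(filled)   # 4 equal-width buckets over [3, 9]
norm = min_max_normalize(filled)  # values scaled into [0, 1]
```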
A sample alignment stage: in this stage, a secure intersection scheme combining the RSA algorithm and a hash function is adopted to map the common sample IDs of the Android mobile party and the PC website party. As shown in FIG. 3, the Android mobile party first generates a public/private key pair with the RSA algorithm and transmits the public key to the PC website party. The PC website party hashes the local data IDs with a hash function so that user IDs are never transmitted in plaintext; it then encrypts them with the public key and sends the encrypted data samples to the Android mobile party. After receiving the passive party's data samples, the Android mobile party decrypts them with its local private key, hashes its own local data with the hash function, and takes the secure intersection with the received PC website party sample data. Finally, the Android mobile party sends the matched sample IDs to the PC website party, completing the sample alignment stage.
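The alignment flow can be sketched as follows. This is a simplified illustration of the described message sequence with toy RSA parameters; the primes, user IDs and exponent are assumptions, and a real deployment would use standards-grade key sizes and a blinded RSA-PSI variant so the ciphertexts protect the IDs beyond transport:

```python
import hashlib
from math import gcd

# --- Active party (Android mobile side): toy RSA key pair.
# Tiny primes for illustration; real deployments use >= 2048-bit moduli.
p, q = 32749, 32771
n, phi = p * q, (p - 1) * (q - 1)
e = 17
assert gcd(e, phi) == 1
d = pow(e, -1, phi)                 # private exponent, never leaves this party

def h(user_id):
    """Hash mapping so user IDs never travel in plaintext."""
    return int.from_bytes(hashlib.sha256(user_id.encode()).digest(), "big") % n

# --- Passive party (PC website side): hash local IDs, encrypt with the
# received public key (n, e), and send the ciphertexts over.
passive_ids = ["alice", "bob", "carol"]
ciphertexts = [pow(h(uid), e, n) for uid in passive_ids]

# --- Active party: decrypt with the private key, hash its own IDs, and take
# the intersection of the two hashed sets.
active_ids = ["bob", "carol", "dave"]
received = {pow(c, d, n) for c in ciphertexts}
local = {h(uid): uid for uid in active_ids}
common = sorted(local[x] for x in received & set(local))
print(common)                       # the matched sample IDs
```

The matched IDs (`common`) are what the active party sends back to close the alignment stage.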
A federated learning stage: as shown in FIG. 4, model training at this stage uses encrypted parameter exchange. After the Android mobile party and the PC website party have determined the sample data common to both sides, homomorphic encryption is introduced to encrypt sensitive data, preventing leakage of data and user privacy. The detailed process is as follows:
(1) The Android mobile party first computes the gradient values, encrypts them with additively homomorphic encryption, and sends the encrypted gradient values to the PC website party.
(2) The PC website party buckets all of its features, mapping each feature value into a bucket; it then aggregates the corresponding encrypted gradient information according to the bucketed feature values and sends the aggregation result to the Android mobile party.
(3) The Android mobile party decrypts the received aggregation result, obtains the optimal split of the current node, and returns the feature ID and threshold ID of the current node to the PC website party.
(4) The PC website party uses the received feature ID and threshold ID to divide the current sample space, records the current record ID, feature ID and threshold ID, and sends the record ID together with the divided left sample space to the Android mobile party.
(5) The Android mobile party splits the current node according to the record ID and the left sample space, then proceeds to split the next node.
(6) Steps (2) to (5) repeat until all the decision trees have been constructed, and the optimal weight of each leaf node is calculated.
(7) After training is finished, the Android mobile party sends the record ID of the current node and the feature threshold to the PC website party.
(8) The PC website party compares against the threshold at the current node to obtain a lookup decision and sends it to the Android mobile party.
(9) The Android mobile party receives the lookup decision and descends to the corresponding child node, continuing until each leaf node is reached and its classification label and weight are obtained.
(10) Steps (7) to (9) repeat until all the decision trees have been traversed; the Android mobile party then carries out weighted summation on the optimal weights corresponding to the traversed classification labels and outputs the label sets of normal users and malicious users.
A data application stage: after the federated learning layer has been trained, each participant updates its local training model and outputs a prediction result. At this stage, the prediction result is transmitted back to the terminal through the data calling interface encapsulated by CyVOD; the terminal updates and classifies the local data to obtain the malicious user detection result. As shown in FIG. 6, the PC website end marks the malicious users so that an administrator can deal with them in time.
The method can also be encapsulated in hardware equipment, so that the detection result is obtained directly with the hardware and the processing result is viewed on a display screen.
Claims (2)
1. A social network cross-platform malicious user detection method based on longitudinal federal learning is characterized in that: the method comprises the following steps:
step 1, constructing a social network cross-platform malicious user detection hierarchical architecture based on longitudinal federated learning, wherein the architecture comprises a data preprocessing layer, an encryption sample alignment layer, a federated learning layer and a data application layer;
step 2, firstly selecting a plurality of participants and dividing them into an active party and a passive party, wherein the participant providing both the sample data and the label values of users serves as the active party and the participant providing only the sample data of users serves as the passive party, and carrying out a preprocessing operation on the sample data of the active party and the passive party in the data preprocessing layer to obtain structured data;
step 3, for the structured data processed by the data preprocessing layer, mapping the common sample data of the active party and the passive party at the sample alignment layer using a secure intersection scheme based on the RSA asymmetric encryption algorithm and a hash mechanism;
step 4, after the processing of step 3, the active party and the passive party have determined the sample data shared by both sides; the active party and the passive party use their local models to cooperatively train a global model in the machine learning sense, and the data of the active party and the passive party are encrypted and decrypted by homomorphic encryption to complete the training of the federated learning layer;
the specific implementation process of encrypting and decrypting the data of the active side and the passive side by using homomorphic encryption comprises the following steps:
step 4.1, the active party calculates first-order and second-order gradient values of the sensitive data, encrypts them with additively homomorphic encryption, and then sends the encrypted gradient values to the passive party;
step 4.2, the passive party buckets all of its features, maps each feature value into a bucket, aggregates the corresponding encrypted gradient information according to the bucketed feature values, and then sends the aggregated encrypted gradient information to the active party;
step 4.3, the active party decrypts the received aggregated encrypted gradient information to obtain the optimal split Divide_max of the current node, and returns the feature ID and threshold ID of the current node to the passive party;
step 4.4, the passive party receives the feature ID and the threshold ID and divides the total sample space I of the current node, where I_R + I_L = I and I_L, I_R are the left and right sample spaces respectively; it records the record ID, feature ID and threshold ID of the current node and sends the record ID and the left sample space I_L to the active party;
step 4.5, the active party divides the current node according to the record ID and the left sample space I_L, and proceeds to the division of the next node;
step 4.6, iterating the processes (4.2) to (4.5); after the construction of all the current decision trees is completed, calculating the optimal weight of each leaf node in the decision trees and finishing the training;
step 4.7, the active side sends the record ID of the current node and the threshold value of the characteristic to the passive side;
step 4.8, the passive side compares the threshold value result of the current node to obtain a search decision and sends the search decision to the active side;
step 4.9, the active side receives the search decision and starts to go to the corresponding child node until reaching a leaf node to obtain a classification label and the optimal weight of the label;
step 4.10, iterating the processes of (4.7) - (4.9), then carrying out weighted summation on optimal weights corresponding to the classification labels obtained by traversing all the decision trees, and finally obtaining label sets of normal users and malicious users;
step 5, after the federated learning layer is trained, the active party and the passive party update their respective local model training parameters and output the prediction result to the data application layer;
and step 6, encapsulating a data calling interface in the data application layer, transmitting the prediction result obtained by the federated learning layer back to each participant, and each participant updating and classifying its local data to obtain the malicious user detection result.
2. The social network cross-platform malicious user detection method based on longitudinal federated learning of claim 1, wherein: the preprocessing operation is to convert the data of each participant into structured data through data cleaning, random sampling, data binning and data normalization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110275639.0A CN113051557B (en) | 2021-03-15 | 2021-03-15 | Social network cross-platform malicious user detection method based on longitudinal federal learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113051557A CN113051557A (en) | 2021-06-29 |
CN113051557B true CN113051557B (en) | 2022-11-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||