WO2024082514A1 - Service index prediction method and apparatus, and device and storage medium - Google Patents

Service index prediction method and apparatus, and device and storage medium

Info

Publication number
WO2024082514A1
Authority
WO
WIPO (PCT)
Prior art keywords
initiator
partner
data
data set
correlation coefficient
Prior art date
Application number
PCT/CN2023/079369
Other languages
French (fr)
Chinese (zh)
Inventor
孙银银
Original Assignee
上海零数众合信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海零数众合信息科技有限公司
Publication of WO2024082514A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption

Definitions

  • the embodiments of the present application relate to the field of computers, and in particular to a business indicator prediction method, apparatus, device and storage medium.
  • Vertical federated learning is often used when one participant has too few data dimensions to achieve the modeling goal with its own data alone. It is mostly used for joint modeling between different industries.
  • In vertical federated multicollinearity modeling, the data sets of the task initiator and the partner share a common sample space but have different feature spaces. Encryption algorithms are required to protect the data privacy of both the data user and the data provider.
  • Federated multicollinearity calculation implemented with a linear model requires multiple interactive training iterations to compute the fit, so its communication cost and computational complexity are high, and its results depend on the model hyperparameter settings. Therefore, how to improve the training efficiency of the federated learning model while ensuring its accuracy is a problem that needs to be solved.
  • the present application provides a business indicator prediction method, apparatus, device and storage medium, which can improve the computational efficiency and computational accuracy of linear models in vertical federated learning, and improve the prediction accuracy of the federated learning model, so that when predicting the user's business indicators through the federated learning model, the prediction accuracy of the user's business indicators can be improved.
  • a business indicator prediction method comprising:
  • model training data is determined from the initiator data set and the partner data set, and the model training data is used to train a federated learning model; the federated learning model is used to predict the business indicator value of the user.
  • a business indicator prediction device comprising:
  • a data set determination module used to determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning;
  • a data set encryption transmission module used to encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; and obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;
  • a correlation coefficient matrix determination module used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
  • a multicollinearity analysis module used for performing characteristic multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;
  • the model training module is used to determine the model training data from the initiator data set and the partner data set according to the result of feature multicollinearity analysis, and use the model training data to train the federated learning model; the federated learning model is used to predict the user's business indicator value.
  • an electronic device comprising:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the business indicator prediction method described in any embodiment of the present application.
  • a computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor to implement the business indicator prediction method described in any embodiment of the present application when executed.
  • In the technical solution of the embodiments of the present application, the initiator data set and the partner data set are determined according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; the initiator data set is encrypted by the initiator device and the encrypted initiator data set is sent to the partner device; the ciphertext correlation coefficient matrix between the initiator and the partner is obtained from the partner device and decrypted by the initiator device, and the plaintext correlation coefficient matrix is determined according to the decryption result; the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner are determined according to the initiator data set and the partner data set; feature multicollinearity analysis is performed on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and, according to the feature multicollinearity analysis result, the model training data is determined from the initiator data set and the partner data set and used to train the federated learning model.
  • The above scheme calculates the feature multicollinearity of the initiator and the partner in vertical federated learning using the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner.
  • the homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing, realize the training of the federated learning model, improve the training efficiency of the federated learning model, and ensure the interpretability of the linear model in the federated learning model.
  • the use of the federated learning model to predict the business indicator value of the user can improve the prediction accuracy of the business indicator value.
  • FIG1 is a flow chart of a business indicator prediction method provided in Embodiment 1 of the present application.
  • FIG2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application.
  • FIG3 is a flow chart of a business indicator prediction method provided in Embodiment 3 of the present application.
  • FIG4 is a schematic diagram of the structure of a business indicator prediction device provided in Embodiment 4 of the present application.
  • FIG5 is a schematic diagram of the structure of an electronic device provided in Embodiment 5 of the present application.
  • FIG1 is a flowchart of a business indicator prediction method provided in Embodiment 1 of the present application.
  • This embodiment can be applied to the case where a federated learning model for predicting business indicators is trained based on the initiator business data of the initiator and the partner business data of the partner in vertical federated learning. It is particularly suitable for the case where multicollinearity analysis is performed on the initiator business data and the partner business data by the correlation coefficient method, the training data of the federated learning model is determined from the multicollinearity analysis results, and the federated learning model for predicting business indicators is trained with that training data.
  • the method can be executed by a business indicator prediction device, which can be implemented in the form of hardware and/or software, and the business indicator prediction device can be configured in an electronic device. As shown in FIG1 , the method includes:
  • S110 Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.
  • vertical federated learning is generally applicable to a federated learning scenario composed of participants with the same sample space and different feature spaces on the data set.
  • a federated learning model can be collaboratively trained for different participants through vertical federated learning.
  • the participants are the initiator and the partner in vertical federated learning.
  • the initiator and the partner can be enterprises in different industries with cooperation needs.
  • the initiator's business data refers to the sample data that can characterize the user's business indicators on the initiator, obtained by the initiator with the user's permission;
  • the partner's business data refers to the sample data that can characterize the user's business indicators on the partner, obtained by the partner with the user's permission.
  • Business indicators refer to indicators used to measure a certain aspect of user behavior.
  • business indicators can be credit indicators or performance indicators of users on the initiator and the partner.
  • the initiator data set refers to a data set containing the initiator's feature data.
  • the partner data set refers to a data set containing the partner's feature data.
  • The initiator business data corresponding to the initiator in the vertical federated learning is obtained, feature extraction is performed on it to determine the feature data of the initiator business data, and the initiator data set is determined according to that feature data.
  • The partner business data corresponding to the partner in the vertical federated learning is obtained, feature extraction is performed on it to determine the feature data of the partner business data, and the partner data set is determined according to that feature data.
  • The method for determining the initiator data set and the partner data set may be: determining the users that the initiator and the partner have in common in the vertical federated learning, and determining, based on the identity identifiers of those common users, the data intersection between the initiator's business data and the partner's business data; the initiator's business data and the partner's business data are then processed respectively according to the data intersection to determine the initiator data set and the partner data set.
  • the identity identifier refers to data that can represent the identity of the user, and the identity identifier may include an ID number or a mobile phone number.
  • the initiator and the partner of vertical federated learning may have the same user. Therefore, when training the federated learning model, it is necessary to determine the same user of the initiator and the partner in the vertical federated learning, and obtain the identity of the same user with the user's permission. Based on the identity of the same user, determine the data intersection between the initiator's initiator business data and the partner's partner business data. Integrate the data intersection and the feature data of the initiator's business data to determine the initiator data set; integrate the data intersection and the feature data of the partner's business data to determine the partner data set.
  • the data intersection between the initiator's business data and the partner's business data is determined, and the initiator's data set is determined based on the data intersection and the initiator's business data; the partner data set is determined based on the data intersection and the partner's business data, which can improve the intuitiveness of the association relationship between the initiator's data set and the partner's data set, and facilitate the subsequent calculation of the initiator and partner's feature multicollinearity based on the initiator's data set and the partner's data set.
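  • The alignment step described above can be sketched as follows. This is only an illustration: the function name `align_by_identity`, the dictionary layout (identity identifier mapped to a feature row) and the sample values are assumptions, and the intersection is taken in plaintext, whereas a real deployment would need a privacy-preserving way to compare identifiers that the source does not describe.

```python
def align_by_identity(initiator_rows, partner_rows):
    """Keep only users present on both sides, in the same order on both sides.

    Both arguments map a (hypothetical) identity identifier to that party's
    feature row for the user.
    """
    common_ids = sorted(set(initiator_rows) & set(partner_rows))
    initiator_set = [initiator_rows[uid] for uid in common_ids]
    partner_set = [partner_rows[uid] for uid in common_ids]
    return common_ids, initiator_set, partner_set


# Illustrative data: two initiator features and one partner feature per user.
initiator_rows = {"u1": [0.2, 5.0], "u2": [0.9, 3.0], "u3": [0.4, 7.0]}
partner_rows = {"u2": [120.0], "u3": [80.0], "u4": [60.0]}

common_ids, initiator_set, partner_set = align_by_identity(initiator_rows, partner_rows)
```

  • After alignment, row k of the initiator data set and row k of the partner data set describe the same user, which is what the later cross-party correlation calculation relies on.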
  • the ciphertext correlation coefficient matrix is specifically the ciphertext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
  • the plaintext correlation coefficient matrix is specifically the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
  • the initiator device refers to the terminal device corresponding to the initiator
  • the partner device refers to the terminal device corresponding to the partner.
  • the correlation coefficient is a statistical indicator used to reflect the closeness of the correlation between variables.
  • the following sub-steps may be used to encrypt the initiator data set, decrypt the ciphertext correlation coefficient matrix, and determine the plaintext correlation coefficient matrix based on the decryption result:
  • The homomorphic encryption algorithm refers to an encryption algorithm whose ciphertexts support homomorphic operations: after the data is homomorphically encrypted and a specific calculation is performed on the ciphertext, the plaintext obtained by homomorphically decrypting the ciphertext result is equivalent to performing the same calculation directly on the plaintext data, making the data "computable but invisible".
  • the initiator device uses a homomorphic encryption algorithm to generate a key pair, which includes a public key and a private key.
  • the initiator device uses the public key in the key pair to encrypt the initiator data set.
  • S1202. Send the encrypted initiator data set to the partner device, so that the partner device can calculate the ciphertext feature correlation coefficient between the feature data in the initiator data set and the feature data in the partner data set based on the encrypted initiator data set and the multiplication characteristics of the homomorphic encryption algorithm, and determine the ciphertext correlation coefficient matrix between the initiator and the partner based on the ciphertext feature correlation coefficient.
  • the ciphertext feature correlation coefficient refers to the ciphertext correlation coefficient obtained by calculating the ciphertext between the feature data in the partner data set and the encrypted feature data in the initiator data set.
  • the encrypted initiator data set is sent to the partner device through the initiator device, so that the partner device calculates the ciphertext feature correlation coefficient between each feature data in the encrypted initiator data set and each feature data in the partner data set according to the multiplication characteristics of the homomorphic encryption algorithm, integrates the calculation results, and determines the ciphertext correlation coefficient matrix between the initiator's feature data and the partner's feature data.
  • the ciphertext correlation coefficient matrix is sent to the initiator device through the partner device.
  • the initiator device uses the private key in the key pair to decrypt the ciphertext correlation coefficient matrix, and determines the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set according to the decryption result.
  • The above scheme transforms the problem of calculating the feature correlation coefficients of the initiator and the partner in vertical federated learning into calculating the ciphertext correlation coefficient matrix between the initiator and the partner using a homomorphic encryption algorithm. Compared with using the regression coefficient method to calculate the feature multicollinearity of the initiator and the partner, this reduces the number of data interactions, simplifies the calculation process and improves the calculation efficiency.
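  • The exchange described in the sub-steps above can be sketched with a toy additively homomorphic scheme. The source does not name a specific algorithm; Paillier is used here purely as an assumption, with the "multiplication characteristics" read as multiplying a ciphertext by a plaintext value. The key size, the fixed-point scale `SCALE` and all function names are illustrative, the inputs are assumed to be Z-score standardized, and the parameters are far too small for real use.

```python
import math
import random

# Toy Paillier key built from two Mersenne primes (insecure, illustration only).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                      # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)           # valid because the generator is g = N + 1
SCALE = 10**6                  # fixed-point scale for real-valued features


def encode(x):
    """Map a real value to a fixed-point integer."""
    return round(x * SCALE)


def encrypt(m):
    """Encrypt integer m (taken mod N) under the public key."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Decrypt to a signed integer in (-N/2, N/2]."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def encrypted_dot(enc_column, plain_column):
    """Partner side: Enc(sum a_i * b_i) from Enc(a_i) and plaintext b_i."""
    acc = encrypt(0)
    for c, b in zip(enc_column, plain_column):
        # Enc(a)^b = Enc(a * b); multiplying ciphertexts adds the plaintexts.
        acc = acc * pow(c, encode(b) % N, N2) % N2
    return acc


def zscore(col):
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std for v in col]


def cross_correlation(z_init_cols, z_part_cols):
    n_samples = len(z_init_cols[0])
    # Initiator: encrypt every standardized value and send it to the partner.
    enc_cols = [[encrypt(encode(v)) for v in col] for col in z_init_cols]
    # Partner: build the ciphertext correlation coefficient matrix.
    cipher = [[encrypted_dot(ec, pc) for pc in z_part_cols] for ec in enc_cols]
    # Initiator: decrypt and undo the fixed-point scaling.
    return [[decrypt(c) / (SCALE * SCALE * n_samples) for c in row] for row in cipher]


z_init = [zscore([1, 2, 3, 4, 5]), zscore([2, 1, 4, 3, 6])]
z_part = [zscore([2, 4, 6, 8, 10]), zscore([5, 3, 4, 1, 2])]
plain_cross = cross_correlation(z_init, z_part)
```

  • In this sketch the partner never sees the initiator's values in plaintext, and the initiator only learns the cross-party correlation coefficients, which matches the single round trip described in the text.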
  • S130 Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
  • the initiator correlation coefficient matrix refers to a matrix composed of correlation coefficients between characteristic data in the initiator data set
  • the partner correlation coefficient matrix refers to a matrix composed of correlation coefficients between characteristic data in the partner data set.
  • The correlation coefficients between the characteristic data in the initiator's data set (that is, the characteristic data of the initiator) are calculated, and the initiator correlation coefficient matrix of the initiator is determined based on these correlation coefficients.
  • The correlation coefficients between the characteristic data in the partner data set (that is, the characteristic data of the partner) are calculated, and the partner correlation coefficient matrix of the partner is determined based on these correlation coefficients.
  • the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner may be determined according to the following sub-steps:
  • standardization processing is to convert the original data according to a certain proportion through certain mathematical transformation methods, so that it falls into a small specific interval.
  • the specific interval can be [0,1] or [-1,1], so as to eliminate the differences in characteristic attributes such as properties, dimensions and orders of magnitude between different variables, and convert it into a dimensionless relative value, that is, a standardized value, so that the values of each indicator are at the same order of magnitude, which facilitates the comprehensive analysis and comparison of indicators of different units or orders of magnitude.
  • the characteristic data in the initiator data set and the characteristic data in the partner data set may be standardized by using a Z-score standardization method to determine the initiator standardized data and the partner standardized data.
  • The method for obtaining the initiator's standardized data may be: determining the characteristic data in the initiator's data set and the number of samples corresponding to the characteristic data; calculating the average value of the characteristic data in the initiator's data set, and determining the characteristic value variance based on that average value and the number of samples of the initiator; and determining the initiator's standardized data based on the characteristic value variance, the characteristic data in the initiator's data set and the average value of the characteristic data in the initiator's data set.
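  • A minimal sketch of the Z-score standardization described above, for one feature column. The function name `zscore_column` is hypothetical, and dividing by the square root of the population variance (sum of squared deviations over the sample count) is an assumption about the exact formula, which the text does not spell out.

```python
import math


def zscore_column(values):
    """Standardize one feature column: (x - mean) / sqrt(variance)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    if variance == 0:
        raise ValueError("constant feature cannot be standardized")
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Illustrative column: mean 5, variance 4, standard deviation 2.
standardized = zscore_column([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

  • The standardized column has mean 0 and unit variance, so the correlation coefficient between two standardized columns reduces to the average of their element-wise products.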
  • S1302. Determine the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the characteristic data of the initiator's standardized data, and determine the partner correlation coefficient matrix of the partner according to the correlation coefficients between the characteristic data of the partner's standardized data.
  • the characteristic data of the initiator's standardized data refers to the standardized result of the characteristic data in the initiator's data set
  • the characteristic data of the partner's standardized data refers to the standardized result of the characteristic data in the partner's data set.
  • the correlation coefficients between the characteristic data of the initiator's standardized data are calculated and integrated to determine the initiator's correlation coefficient matrix.
  • the correlation coefficients between the characteristic data of the partner's standardized data are calculated and integrated to determine the correlation coefficient matrix between the partner's characteristics.
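  • Each party can compute its own correlation coefficient matrix locally, without encryption, because only its own features are involved. The sketch below assumes Pearson correlation on Z-score standardized columns; the function names and sample values are illustrative.

```python
import math


def zscore(col):
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std for v in col]


def correlation_matrix(columns):
    """Pairwise correlation of feature columns held by a single party."""
    z = [zscore(c) for c in columns]
    n = len(columns[0])
    # For standardized columns, the correlation is the mean of the products.
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


# Two illustrative initiator features over five aligned samples.
initiator_columns = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 6]]
r_init = correlation_matrix(initiator_columns)
```

  • The same function applied to the partner's columns would give the partner correlation coefficient matrix.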
  • The initiator data set and the partner data set are standardized, so that the problem of calculating the correlation coefficient matrix between the characteristic data of the initiator and the characteristic data of the partner is converted into a homomorphic multiplication problem between the two, which ensures the data privacy security of the initiator and the partner.
  • S140 Perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.
  • feature multicollinearity analysis is performed on all features of the initiator and the partner to obtain the multicollinearity values of each feature data of the initiator and the multicollinearity values of each feature data of the partner.
  • the multicollinearity values of each feature data of the initiator are the federated multicollinearity values of each feature data of the initiator in federated learning.
  • the multicollinearity values of each feature data of the partner are the federated multicollinearity values of each feature data of the partner in federated learning.
  • the federated learning model is used to predict the user's business indicator value.
  • the multicollinearity value of each feature data of the initiator and the multicollinearity value of each feature data of the partner are determined.
  • the model training data is determined from the feature data in the initiator's data set and the feature data in the partner's data set, and the model training data is used to train the federated learning model.
  • the business data of the user who needs to predict the business indicator value is used as the input data of the trained federated learning model, and the prediction result of the user's business indicator value is determined according to the output data of the trained federated learning model.
  • the feature data in the initiator's data set and the feature data in the partner's data set can be screened according to the feature multicollinearity analysis results and the multicollinearity threshold, and the feature data in the initiator's data set and the partner's data set whose multicollinearity values are less than the multicollinearity threshold are determined as model training data, and the federated learning model is trained using the model training data.
  • screening out feature data with multicollinearity values less than the multicollinearity threshold from the initiator data set and the partner data set as model training data for the federated learning model can improve the model training efficiency and the reliability of the model.
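  • The threshold screening described above amounts to a simple filter. In the sketch below the feature names, the multicollinearity values and the threshold of 10 are all hypothetical; the source does not specify a threshold value.

```python
def select_features(multicollinearity_by_feature, threshold):
    """Keep features whose multicollinearity value is below the threshold."""
    return [
        name
        for name, value in multicollinearity_by_feature.items()
        if value < threshold
    ]


# Hypothetical analysis result covering features from both parties.
analysis_result = {
    "initiator_f1": 1.3,
    "initiator_f2": 14.2,
    "partner_f1": 2.7,
    "partner_f2": 10.0,
}
training_features = select_features(analysis_result, threshold=10.0)
```

  • Because the comparison is strict, a feature whose value equals the threshold is excluded, in line with "less than the multicollinearity threshold" in the text.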
  • the technical solution provided in this embodiment determines an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; encrypts the initiator data set through the initiator device and sends the encrypted initiator data set to the partner device; and obtains the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device.
  • the ciphertext correlation coefficient matrix is decrypted through the initiator device, and the plaintext correlation coefficient matrix is determined based on the decryption result; the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner are determined based on the initiator data set and the partner data set; the initiator and partner data sets are analyzed for feature multicollinearity based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; based on the feature multicollinearity analysis results, the model training data is determined from the initiator data set and the partner data set, and the federated learning model is trained using the model training data.
  • The above scheme calculates the feature multicollinearity of the initiator and the partner in vertical federated learning using the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner.
  • the homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing, realize the training of the federated learning model, improve the training efficiency of the federated learning model, and ensure the interpretability of the linear model in the federated learning model.
  • the use of the federated learning model to predict the business indicator value of the user can improve the prediction accuracy of the business indicator value.
  • FIG2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application.
  • This embodiment is optimized on the basis of the above embodiment, and provides an implementation method for performing feature multicollinearity analysis on the initiator and the partner based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.
  • the method includes:
  • S210 Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.
  • S220 encrypting the initiator data set through the initiator device and sending the encrypted initiator data set to the partner device; and obtaining the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypting the ciphertext correlation coefficient matrix through the initiator device, and determining the plaintext correlation coefficient matrix according to the decryption result.
  • S230 Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
  • The correlation coefficient fusion matrix refers to the matrix obtained by combining the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.
  • the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix are combined to obtain a correlation coefficient fusion matrix.
  • the determinant value of the correlation coefficient fusion matrix is calculated, and the determinant value of the correlation coefficient fusion matrix is used as the complete matrix determinant value of the correlation coefficient fusion matrix.
  • matrix feature data refers to the feature data contained in the correlation coefficient fusion matrix.
  • Feature rows refer to the rows where the matrix feature data is located in the correlation coefficient fusion matrix.
  • Feature columns refer to the columns where the matrix feature data is located in the correlation coefficient fusion matrix.
  • Each item of matrix feature data in the correlation coefficient fusion matrix is determined; for each item, its feature row and feature column are deleted from the correlation coefficient fusion matrix to obtain the feature matrix corresponding to that matrix feature data, and the feature matrix determinant value corresponding to each item of matrix feature data is determined.
  • the ratio between the characteristic matrix determinant value and the complete matrix determinant value is used as the multicollinearity value of the matrix characteristic data corresponding to the characteristic matrix determinant, and the multicollinearity values corresponding to all matrix characteristic data are integrated to obtain the multicollinearity values of each characteristic data of the initiator and the multicollinearity values of each characteristic data of the partner.
  • the multicollinearity values of each characteristic data of the initiator and the multicollinearity values of each characteristic data of the partner are used as the characteristic multicollinearity analysis results.
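  • The determinant-ratio calculation described above can be sketched as follows. The block layout of the fusion matrix (initiator block top-left, partner block bottom-right, cross-party block and its transpose off the diagonal) is an assumption about how the three matrices are combined; function names and sample values are illustrative. For a correlation matrix, this ratio equals the corresponding diagonal entry of the inverse matrix, which is the quantity commonly called the variance inflation factor.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def fuse(r_init, r_part, r_cross):
    """Assemble [[R_init, R_cross], [R_cross^T, R_part]]."""
    r_cross_t = [list(col) for col in zip(*r_cross)]
    top = [list(ra) + list(rc) for ra, rc in zip(r_init, r_cross)]
    bottom = [list(rt) + list(rb) for rt, rb in zip(r_cross_t, r_part)]
    return top + bottom


def multicollinearity_values(fusion):
    """det(fusion without row i and column i) / det(fusion), for every feature i."""
    full = determinant(fusion)
    values = []
    for i in range(len(fusion)):
        minor = [
            [v for j, v in enumerate(row) if j != i]
            for k, row in enumerate(fusion)
            if k != i
        ]
        values.append(determinant(minor) / full)
    return values


# Two initiator features correlated at 0.5, one uncorrelated partner feature.
fusion = fuse([[1.0, 0.5], [0.5, 1.0]], [[1.0]], [[0.0], [0.0]])
values = multicollinearity_values(fusion)
```

  • In this example the two correlated initiator features get a value of 1/(1 - 0.5²) ≈ 1.33, while the uncorrelated partner feature gets 1.0. A near-singular fusion matrix (almost perfectly collinear features) would make the full determinant close to zero, so a practical implementation would need to guard that division.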
  • the federated learning model is used to predict the user's business indicator value.
  • the technical solution of this embodiment proposes a method for calculating the multicollinearity values of the initiator and the partner by performing feature multicollinearity analysis on the initiator and the partner.
  • the above scheme determines the correlation coefficient fusion matrix based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, so as to calculate the feature matrix determinant value corresponding to each feature data in the correlation coefficient fusion matrix based on the correlation coefficient fusion matrix.
  • From the feature matrix determinant value corresponding to each feature data and the complete matrix determinant value of the correlation coefficient fusion matrix, an accurate multicollinearity value of each feature data can be obtained, thereby improving the accuracy of the feature multicollinearity analysis results of the initiator and the partner.
  • FIG. 3 is a flow chart of a business indicator prediction method provided in Embodiment 3 of the present application.
  • This embodiment is optimized on the basis of the above embodiment, and provides an implementation scheme for filtering the feature data in the initiator's data set and the feature data in the partner's data set according to the data contribution of the feature data in the initiator's data set and the feature data in the partner's data set.
  • the method includes:
  • S310 Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.
  • S320 Determine the data contribution of the feature data in the initiator's data set and the feature data in the partner's data set.
  • the data contribution can be determined by the IV (Information Value) of the data. The IV indicates the contribution of feature data to target prediction, that is, the predictive ability of the feature. Generally speaking, the higher the IV, the stronger the predictive ability of the feature and the higher its information contribution.
  • the IV values of the feature data in the initiator's data set and the feature data in the partner's data set can be calculated by performing WOE (Weight of Evidence) weighted summation on the feature data in the initiator's data set and the feature data in the partner's data set, and the IV value can be used as the data contribution.
  • the feature data in the initiator's data set and the feature data in the partner's data set are subjected to WOE calculation, the IV value of each feature data is calculated, and the WOE calculation result is subjected to feature collinearity analysis.
  • the data set is screened for feature collinearity using the IV value and the feature collinearity threshold.
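The WOE weighted summation mentioned above can be illustrated with the commonly used definitions of WOE and IV. The sketch below is a single-party, plaintext version: the source does not spell out the binning or how the calculation is carried out across parties, so the function name `woe_iv`, the pre-binned input and the `smoothing` parameter are all assumptions made for illustration.

```python
import math


def woe_iv(bins, labels, smoothing=0.5):
    """Return ({bin: WOE}, IV) for one feature that is already binned.

    bins   -- bin identifier of each sample
    labels -- 0/1 target of each sample (1 = positive event)
    smoothing -- added to every count to avoid log(0) (an assumption, not from the source)
    """
    pos, neg = {}, {}
    for b, y in zip(bins, labels):
        counts = pos if y == 1 else neg
        counts[b] = counts.get(b, 0) + 1
    bin_ids = sorted(set(pos) | set(neg))
    total_pos = sum(pos.values()) + smoothing * len(bin_ids)
    total_neg = sum(neg.values()) + smoothing * len(bin_ids)
    woe, iv = {}, 0.0
    for b in bin_ids:
        share_pos = (pos.get(b, 0) + smoothing) / total_pos
        share_neg = (neg.get(b, 0) + smoothing) / total_neg
        woe[b] = math.log(share_pos / share_neg)
        # IV is the WOE-weighted sum of the share differences.
        iv += (share_pos - share_neg) * woe[b]
    return woe, iv


# Bin "A": 30 positives, 10 negatives; bin "B": 20 positives, 40 negatives.
bins = ["A"] * 40 + ["B"] * 60
labels = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40
woe, iv = woe_iv(bins, labels, smoothing=0.0)
```

In this example the two bins contribute `0.4 * ln(3)` and `0.4 * ln(2)`, so the IV is `0.4 * ln(6)`.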
  • S330 Filter the feature data in the initiator's data set and the feature data in the partner's data set according to the data contribution.
  • a contribution threshold is set, and according to the contribution threshold and the data contribution, the feature data in the initiator's data set and the feature data in the partner's data set are filtered and processed, and the feature data whose data contribution is less than the contribution threshold is filtered out from the feature data in the initiator's data set and the feature data in the partner's data set.
  • the contribution threshold may be 0.1.
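The threshold filter of S330 amounts to a simple selection. In the sketch below the function name `filter_by_contribution` and the feature names are hypothetical; only the threshold value 0.1 comes from the text.

```python
def filter_by_contribution(iv_by_feature, threshold=0.1):
    """Keep the features whose data contribution (IV) is not below the threshold."""
    return [name for name, iv in iv_by_feature.items() if iv >= threshold]


# Hypothetical IV values for three features of one party.
kept = filter_by_contribution({"age": 0.35, "region": 0.04, "balance": 0.10})
```

Features with a contribution exactly equal to the threshold are kept here, since the text only filters out contributions that are less than the threshold.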
  • the feature data in the initiator data set and the feature data in the partner data set in this step are the feature data in the initiator data set and the feature data in the partner data set after filtering, respectively.
  • S350 Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
  • S360 Perform feature multicollinearity analysis on the initiator and the partners based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix.
  • the federated learning model is used to predict the user's business indicator value.
  • the technical solution of this embodiment is to filter the feature data in the initiator's data set and the feature data in the partner's data set according to the contribution of the feature data in the initiator's data set and the contribution of the feature data in the partner's data set before determining the model training data based on the initiator's data set and the partner's data set, so as to obtain the training data of the federated learning model.
  • the above solution can improve the model training speed while ensuring the reliability of the federated learning model.
  • FIG. 4 is a schematic diagram of the structure of a business indicator prediction device provided in Embodiment 4 of the present application. This embodiment is applicable to the case where a federated learning model for predicting business indicators is trained based on the business data of the initiator and the business data of the partner in vertical federated learning.
  • the business indicator prediction device includes: a data set determination module 410, a data set encryption transmission module 420, a correlation coefficient matrix determination module 430, a multicollinearity analysis module 440 and a model training module 450.
  • the data set determination module 410 is used to determine the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning;
  • the data set encryption transmission module 420 is used to encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; and obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;
  • a correlation coefficient matrix determination module 430 is used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
  • a multicollinearity analysis module 440 is used to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;
  • the model training module 450 is used to determine the model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and use the model training data to train the federated learning model; the federated learning model is used to predict the user's business indicator value.
  • the technical solution provided in this embodiment determines the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning; encrypts the initiator data set through the initiator device and sends the encrypted initiator data set to the partner device; and obtains the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypts the ciphertext correlation coefficient matrix through the initiator device, and determines the plaintext correlation coefficient matrix according to the decryption result; determines the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set; performs feature multicollinearity analysis on the initiator and partner data sets according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; determines the model training data from the initiator data set and the partner data set according to the feature multicollinearity analysis results, and uses the model training data to train the federated learning model.
  • the above scheme calculates the feature multicollinearity of the initiator and the partner in the vertical federated learning based on the correlation coefficient method. No parameter adjustment is required during the calculation process, which reduces the number of data interactions when calculating the feature multicollinearity of the initiator and the interacting party.
  • the homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing, realize the training of the federated learning model, improve the training efficiency of the federated learning model, and ensure the interpretability of the linear model in the federated learning model.
  • the use of the federated learning model to predict the business indicator value of the user can improve the prediction accuracy of the business indicator value.
  • the data set encryption transmission module 420 is specifically used for:
  • the initiator device uses a homomorphic encryption algorithm to generate a key pair, and uses the public key in the key pair to encrypt the initiator data set;
  • the encrypted initiator data set is sent to the partner device, so that the partner device calculates the ciphertext feature correlation coefficient between the feature data in the initiator data set and the feature data in the partner data set according to the encrypted initiator data set and the multiplication characteristics of the homomorphic encryption algorithm, and determines the ciphertext correlation coefficient matrix between the initiator and the partner according to the ciphertext feature correlation coefficient;
  • the ciphertext correlation coefficient matrix is sent to the initiator device, so that the ciphertext correlation coefficient matrix is decrypted by the initiator device using the private key in the key pair, and the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set is determined according to the decryption result.
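The three-step exchange above can be sketched with a toy additively homomorphic scheme. Everything specific here is an assumption: the source does not name the homomorphic algorithm, so Paillier is used as a stand-in, the "multiplication characteristic" is read as multiplying a ciphertext by the partner's plaintext value, the fixed Mersenne primes make the keys insecure, and `SCALE` is an arbitrary fixed-point factor. The function names are hypothetical.

```python
import math
import secrets

SCALE = 10**6  # fixed-point factor (assumption); products carry SCALE**2


def keygen(p=2**61 - 1, q=2**89 - 1):
    """Toy Paillier keys from two fixed Mersenne primes. NOT secure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (may be negative) under public key n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1  # random blinding value in [1, n-1]
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    """Decrypt and map the result back to a signed integer."""
    lam, mu = private_key
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def encrypted_cross_sum(n, enc_x, y):
    """Partner side: E(sum x_i * y_i) from ciphertexts E(x_i) and plaintext y_i."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, yi in zip(enc_x, y):
        k = round(yi * SCALE) % n
        # c**k scales the hidden plaintext by k; multiplying ciphertexts adds plaintexts.
        acc = acc * pow(c, k, n2) % n2
    return acc


def cross_correlation(x, y):
    """One entry of the cross matrix; x and y are assumed already standardized."""
    n, private_key = keygen()                               # initiator
    enc_x = [encrypt(n, round(v * SCALE)) for v in x]       # initiator -> partner
    enc_sum = encrypted_cross_sum(n, enc_x, y)              # partner -> initiator
    return decrypt(n, private_key, enc_sum) / SCALE**2 / len(x)  # initiator


value = cross_correlation([1.5, -0.5, -1.0], [0.5, -1.5, 1.0])
```

Repeating `cross_correlation` for every pair of an initiator feature and a partner feature would fill the ciphertext correlation coefficient matrix entry by entry; the partner only ever handles ciphertexts of the initiator's values, and the initiator only sees the aggregated result.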
  • the multicollinearity analysis module 440 is specifically used for:
  • the ratio between the determinant value of the characteristic matrix and the determinant value of the complete matrix is used as the multicollinearity value of the matrix characteristic data.
  • the multicollinearity values of all matrix characteristic data are integrated to determine the multicollinearity values of the initiator and the partner.
  • the multicollinearity values of the initiator and the partner are used as the characteristic multicollinearity analysis results.
  • the data set determination module 410 is specifically used for:
  • the initiator's business data and the partner's business data are processed separately according to the data intersection to determine the initiator's data set and the partner's data set.
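Processing both sides according to the data intersection can be pictured as keeping only the samples that both parties hold, in one shared order. The sketch below uses a plain set intersection on sample IDs purely for illustration; the source does not say here how the intersection is obtained, and the function name `align_on_intersection` and the dictionary layout are hypothetical.

```python
def align_on_intersection(initiator_rows, partner_rows):
    """Keep only rows whose sample ID occurs on both sides, in one shared order.

    Both arguments map a sample ID to that party's feature row.
    """
    shared_ids = sorted(set(initiator_rows) & set(partner_rows))
    initiator_set = [initiator_rows[i] for i in shared_ids]
    partner_set = [partner_rows[i] for i in shared_ids]
    return shared_ids, initiator_set, partner_set


ids, init_set, part_set = align_on_intersection(
    {"u1": [0.2, 1.0], "u2": [0.5, 3.0], "u4": [0.1, 2.0]},
    {"u2": [7.0], "u3": [9.0], "u4": [4.0]},
)
```

After alignment, row `k` of the initiator data set and row `k` of the partner data set describe the same sample, which is what the later correlation calculations rely on.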
  • the correlation coefficient matrix determination module 430 is used to:
  • the initiator correlation coefficient matrix of the initiator is determined, and according to the correlation coefficients between the characteristic data of the partner's standardized data, the partner correlation coefficient matrix of the partner is determined.
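Each party's own correlation coefficient matrix can be computed locally from its standardized data. The following is a minimal sketch under the assumption that standardization means z-scoring with the population standard deviation; the function names `standardize` and `correlation_matrix` are hypothetical.

```python
import math


def standardize(columns):
    """Z-score each feature column using the population standard deviation."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        if std == 0:
            raise ValueError("constant feature cannot be standardized")
        result.append([(v - mean) / std for v in col])
    return result


def correlation_matrix(std_columns):
    """Pearson correlations of standardized columns: mean of elementwise products."""
    n = len(std_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in std_columns]
            for ci in std_columns]


# Three features of one party over four aligned samples (illustrative values).
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
matrix = correlation_matrix(standardize(columns))
```

Because the columns are standardized first, the correlation of two features reduces to the average of their products, which is also the form that makes the cross-party entries computable from a sum of products.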
  • model training module 450 is specifically used for:
  • the feature data in the initiator's dataset and the feature data in the partner's dataset are screened, and the feature data in the initiator's dataset and the partner's dataset whose multicollinearity values are less than the multicollinearity threshold are determined as model training data.
  • the federated learning model is trained using the model training data.
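The screening performed by the model training module can be sketched as a selection against a multicollinearity threshold. The function name `select_training_features`, the feature names and the default threshold of 10 are assumptions; the text in this section does not state a threshold value.

```python
def select_training_features(values_by_feature, threshold=10.0):
    """Keep the features whose multicollinearity value is below the threshold."""
    return [name for name, value in values_by_feature.items() if value < threshold]


# Hypothetical multicollinearity values for initiator and partner features.
selected = select_training_features(
    {"initiator.age": 2.9, "initiator.income": 14.2, "partner.usage": 1.1}
)
```

The columns named in `selected` would then be taken from the initiator data set and the partner data set as the model training data.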
  • the business indicator prediction device further includes:
  • a data contribution determination module used to determine the data contribution of the feature data in the initiator's data set and the feature data in the partner's data set;
  • the data filtering module is used to filter the feature data in the initiator's data set and the feature data in the partner's data set according to the data contribution.
  • the business indicator prediction device provided in this embodiment can be applied to the business indicator prediction method provided in any of the above embodiments, and has corresponding functions and beneficial effects.
  • Fig. 5 shows a block diagram of an electronic device 10 that can be used to implement an embodiment of the present application.
  • the electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • the electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (such as helmets, glasses, watches, etc.) and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present application described and/or required herein.
  • the electronic device 10 includes at least one processor 11, and a memory connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc., wherein the memory stores a computer program that can be executed by at least one processor, and the processor 11 can perform various appropriate actions and processes according to the computer program stored in the read-only memory (ROM) 12 or the computer program loaded from the storage unit 18 to the random access memory (RAM) 13.
  • in the RAM 13, various programs and data required for the operation of the electronic device 10 can also be stored.
  • the processor 11, ROM 12 and RAM 13 are connected to each other through a bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • a number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the processor 11 may be a variety of general and/or special processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 executes the various methods and processes described above, such as a business indicator prediction method.
  • the business indicator prediction method may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium, such as a storage unit 18.
  • part or all of the computer program may be loaded and/or installed on the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform the business indicator prediction method in any other appropriate manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • Various implementations can include: being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which can be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • the computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that when the computer programs are executed by the processor, the functions/operations specified in the flow charts and/or block diagrams are implemented.
  • the computer programs may be executed entirely on the machine, partially on the machine, as a stand-alone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
  • a computer readable storage medium may be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • a computer readable storage medium may be a machine readable signal medium.
  • more specific examples of a machine readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device.
  • Other types of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes frontend components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.
  • a computing system may include a client and a server.
  • the client and the server are generally remote from each other and usually interact through a communication network.
  • the client and server relationship is generated by computer programs running on the corresponding computers and having a client-server relationship with each other.
  • the server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak business scalability in traditional physical hosts and VPS services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed in the present application is a service index prediction method, comprising: determining an initiator data set of an initiator and a partner data set of a partner in vertical federated learning; encrypting the initiator data set by means of an initiator device, and sending the encrypted initiator data set to a partner device; decrypting a ciphertext correlation coefficient matrix by means of the initiator device, so as to determine a plaintext correlation coefficient matrix; determining an initiator correlation coefficient matrix and a partner correlation coefficient matrix according to the initiator data set and the partner data set; performing feature multi-collinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and determining model training data from the initiator data set and the partner data set according to the analysis result, and training a federated learning model by using the model training data. The training efficiency of a federated learning model is increased, and the feature interpretability of a linear model is ensured.

Description

A business indicator prediction method, apparatus, device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on October 19, 2022, with application number 202211278879.7, the entire contents of which are incorporated by reference into this application.
Technical Field
The embodiments of the present application relate to the field of computers, and in particular to a business indicator prediction method, apparatus, device and storage medium.
Background
Vertical federated learning is typically used when one of the participants has too few data dimensions to achieve the modeling goal with its own data alone, and it is mostly applied to joint modeling across different industries. In vertical federated modeling with multicollinearity analysis, the data sets of the task initiator and the partner share a common sample space but have different feature spaces. Encryption algorithms are required to protect the data privacy of both the data user and the data provider. At the same time, the multicollinearity between each feature and the other features must be calculated so that highly collinear features can be removed, improving the efficiency and accuracy of modeling. Federated multicollinearity calculation implemented with the linear model method requires multiple rounds of interactive, iterative training to calculate the goodness of fit; its communication cost and computational complexity are high, and its results depend on the model hyperparameter settings. Therefore, how to improve the training efficiency of the federated learning model while ensuring its accuracy is a problem that needs to be solved.
Summary
The present application provides a business indicator prediction method, apparatus, device and storage medium, which can improve the computational efficiency and accuracy of linear models in vertical federated learning and improve the prediction accuracy of the federated learning model, so that when a user's business indicators are predicted through the federated learning model, the prediction accuracy for those indicators is improved.
According to one aspect of the present application, a business indicator prediction method is provided, comprising:
determining an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning;
encrypting the initiator data set through the initiator device and sending the encrypted initiator data set to the partner device; and obtaining the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypting the ciphertext correlation coefficient matrix through the initiator device, and determining the plaintext correlation coefficient matrix according to the decryption result;
determining an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;
determining model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and training a federated learning model with the model training data; the federated learning model is used to predict the business indicator value of the user.
According to another aspect of the present application, a business indicator prediction device is provided, the device comprising:
a data set determination module, used to determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning;
a data set encryption transmission module, used to encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; and to obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;
a correlation coefficient matrix determination module, used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
a multicollinearity analysis module, used to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;
a model training module, used to determine model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and to train a federated learning model with the model training data; the federated learning model is used to predict the business indicator value of the user.
According to another aspect of the present application, an electronic device is provided, the electronic device comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the business indicator prediction method described in any embodiment of the present application.
According to another aspect of the present application, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, implement the business indicator prediction method described in any embodiment of the present application.
In the technical solution of the embodiments of the present application, an initiator data set and a partner data set are determined according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; the initiator data set is encrypted through the initiator device, and the encrypted initiator data set is sent to the partner device; the ciphertext correlation coefficient matrix between the initiator and the partner is obtained from the partner device and decrypted through the initiator device, and the plaintext correlation coefficient matrix is determined according to the decryption result; the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner are determined according to the initiator data set and the partner data set; feature multicollinearity analysis is performed on the initiator and partner data sets according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and model training data is determined from the initiator data set and the partner data set according to the feature multicollinearity analysis results and used to train the federated learning model. This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive, iterative training are needed to calculate the goodness of fit, the communication cost and computational complexity are high, and the results depend on the model hyperparameter settings. The above scheme calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions when calculating the feature multicollinearity of the initiator and the interacting party. The homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing and to train the federated learning model, which improves the training efficiency of the federated learning model while ensuring the interpretability of the linear model in the federated learning model. Using the federated learning model to predict the business indicator value of the user can improve the prediction accuracy of the business indicator value.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow chart of a business indicator prediction method provided in Embodiment 1 of the present application;
FIG. 2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application;
FIG. 3 is a flow chart of a business indicator prediction method provided in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a business indicator prediction apparatus provided in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
DETAILED DESCRIPTION
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application, not all of them.
It should be noted that terms such as "target" in the specification, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "including" and "etc." and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Embodiment 1
FIG. 1 is a flow chart of a business indicator prediction method provided in Embodiment 1 of the present application. This embodiment is applicable to training a federated learning model for predicting business indicators based on the initiator business data of the initiator and the partner business data of the partner in vertical federated learning. It is particularly applicable where multicollinearity analysis is performed on the initiator business data and the partner business data by the correlation coefficient method, the training data of the federated learning model is determined according to the multicollinearity analysis result, and the federated learning model for predicting business indicators is trained with that training data. The method may be executed by a business indicator prediction apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in FIG. 1, the method includes:
S110. Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning.
Here, vertical federated learning is generally applicable to federated learning scenarios in which the participants' data sets share the same sample space but have different feature spaces. Through vertical federated learning, different participants can collaboratively train one federated learning model. In this embodiment, the participants are the initiator and the partner in vertical federated learning. The initiator and the partner may be enterprises in different industries that need to cooperate. The initiator business data is sample data, obtained by the initiator with the user's permission, that characterizes the user's business indicators at the initiator; the partner business data is sample data, obtained by the partner with the user's permission, that characterizes the user's business indicators at the partner. A business indicator is an indicator that measures some aspect of a user's behavior; for example, it may be the user's credit indicator or performance indicator at the initiator and the partner. The initiator data set is a data set containing the initiator's feature data. The partner data set is a data set containing the partner's feature data.
Specifically, with the permission of the users corresponding to the initiator, the initiator business data corresponding to the initiator in vertical federated learning is obtained, feature extraction is performed on the initiator business data to determine its feature data, and the initiator data set is determined according to that feature data. With the permission of the users corresponding to the partner, the partner business data corresponding to the partner in vertical federated learning is obtained, feature extraction is performed on the partner business data to determine its feature data, and the partner data set is determined according to that feature data.
By way of example, the initiator data set and the partner data set may be determined as follows: determine the users common to the initiator and the partner in vertical federated learning and, based on the identity identifiers of those common users, determine the data intersection between the initiator business data of the initiator and the partner business data of the partner; then process the initiator business data and the partner business data respectively according to the data intersection to determine the initiator data set and the partner data set.
Here, an identity identifier is data that can characterize a user's identity; it may include an ID card number or a mobile phone number.
Specifically, the initiator and the partner of vertical federated learning may have users in common. Therefore, when training the federated learning model, the users common to the initiator and the partner are determined and, with the users' permission, their identity identifiers are obtained. Based on the identity identifiers of the common users, the data intersection between the initiator business data of the initiator and the partner business data of the partner is determined. The data intersection is combined with the feature data of the initiator business data to determine the initiator data set; the data intersection is combined with the feature data of the partner business data to determine the partner data set.
It can be understood that determining the data intersection between the initiator business data and the partner business data from the users common to the initiator and the partner, and then determining the initiator data set from the data intersection and the initiator business data and the partner data set from the data intersection and the partner business data, makes the association between the initiator data set and the partner data set more explicit and facilitates the subsequent calculation of the feature multicollinearity of the initiator and the partner from the two data sets.
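The alignment on common users described above can be sketched as follows. This is a minimal illustration only: it intersects identity identifiers in plaintext, whereas the source does not say how the intersection is computed between the two parties. The function name `align_by_identity`, the user identifiers and the feature values are all hypothetical.

```python
def align_by_identity(initiator_rows, partner_rows):
    """Keep only users present on both sides, in one shared order.

    Each argument maps an identity identifier to that party's feature
    values for the user (a hypothetical layout, not from the source).
    """
    shared_ids = sorted(set(initiator_rows) & set(partner_rows))
    initiator_dataset = [initiator_rows[uid] for uid in shared_ids]
    partner_dataset = [partner_rows[uid] for uid in shared_ids]
    return shared_ids, initiator_dataset, partner_dataset


# Hypothetical sample data: two initiator features, one partner feature.
initiator_rows = {"u1": [0.2, 5.0], "u2": [0.9, 3.0], "u4": [0.4, 1.0]}
partner_rows = {"u2": [12.0], "u3": [7.0], "u4": [9.5]}

shared_ids, initiator_dataset, partner_dataset = align_by_identity(
    initiator_rows, partner_rows
)
print(shared_ids)         # users held by both parties
print(initiator_dataset)  # initiator features, rows aligned with shared_ids
print(partner_dataset)    # partner features, rows aligned with shared_ids
```

Sorting the shared identifiers gives both parties the same row order, which the later per-feature products rely on.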
S120. Encrypt the initiator data set by the initiator device and send the encrypted initiator data set to the partner device; obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix by the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result.
Here, the ciphertext correlation coefficient matrix is the ciphertext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set. The plaintext correlation coefficient matrix is the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set. The initiator device is the terminal device corresponding to the initiator, and the partner device is the terminal device corresponding to the partner. A correlation coefficient is a statistical indicator that reflects how closely variables are correlated.
By way of example, the initiator data set may be encrypted, the ciphertext correlation coefficient matrix decrypted, and the plaintext correlation coefficient matrix determined from the decryption result through the following sub-steps:
S1201. Generate a key pair by the initiator device using a homomorphic encryption algorithm, and encrypt the initiator data set with the public key of the key pair.
Here, a homomorphic encryption algorithm is an encryption algorithm that satisfies the homomorphic property over ciphertexts: after data is homomorphically encrypted, a specific computation is performed on the ciphertext, and the plaintext obtained by homomorphically decrypting the result equals the result of performing the same computation directly on the plaintext data, so that the data is "computable but not visible".
Specifically, the initiator device generates a key pair using a homomorphic encryption algorithm; the key pair contains a public key and a private key. The initiator device encrypts the initiator data set with the public key of the key pair.
S1202. Send the encrypted initiator data set to the partner device, so that the partner device, according to the encrypted initiator data set and the multiplication property of the homomorphic encryption algorithm, calculates the ciphertext feature correlation coefficients between the feature data in the initiator data set and the feature data in the partner data set, and determines the ciphertext correlation coefficient matrix between the initiator and the partner according to the ciphertext feature correlation coefficients.
Here, a ciphertext feature correlation coefficient is a ciphertext correlation coefficient obtained by ciphertext computation between the feature data in the partner data set and the feature data in the encrypted initiator data set.
Specifically, the initiator device sends the encrypted initiator data set to the partner device. Using the multiplication property of the homomorphic encryption algorithm, the partner device calculates the ciphertext feature correlation coefficient between each item of feature data in the encrypted initiator data set and each item of feature data in the partner data set, and combines the results to determine the ciphertext correlation coefficient matrix between the initiator's feature data and the partner's feature data.
S1203. Send the ciphertext correlation coefficient matrix to the initiator device, so that the initiator device decrypts the ciphertext correlation coefficient matrix with the private key of the key pair and determines, according to the decryption result, the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
Specifically, the partner device sends the ciphertext correlation coefficient matrix to the initiator device. The initiator device decrypts the ciphertext correlation coefficient matrix with the private key of the key pair and determines, according to the decryption result, the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
It can be understood that the above solution turns the problem of calculating the feature correlation coefficients of the initiator and the partner in vertical federated learning into calculating the ciphertext correlation coefficient matrix between the initiator and the partner with a homomorphic encryption algorithm. Compared with the regression coefficient method, this reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner, simplifies the calculation flow and improves calculation efficiency.
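Sub-steps S1201 to S1203 can be illustrated with a toy additively homomorphic scheme. The source names only "a homomorphic encryption algorithm" and its "multiplication property"; the sketch below assumes a Paillier-style scheme in which a ciphertext can be multiplied by a plaintext scalar, and it uses two small, publicly known primes, so it is not secure and is for illustration only. The names `keygen`, `encrypt`, `decrypt`, `encrypted_dot` and the fixed-point factor `SCALE` are hypothetical.

```python
import math
import random

SCALE = 10**6  # fixed-point factor for encoding real values (assumption)


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair from two known Mersenne primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, value):
    """Initiator side (S1201): encrypt one real value under public key n."""
    m = round(value * SCALE) % n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    n2 = n * n
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, ciphertext, scale):
    """Initiator side (S1203): recover a signed real value."""
    lam, mu = private_key
    m = (pow(ciphertext, lam, n * n) - 1) // n * mu % n
    if m > n // 2:  # upper half of the range encodes negative numbers
        m -= n
    return m / scale


def encrypted_dot(n, encrypted_column, plain_column):
    """Partner side (S1202): Enc(sum a_i * b_i) from Enc(a_i) and plain b_i."""
    n2 = n * n
    acc = 1  # a valid encryption of zero
    for c, b in zip(encrypted_column, plain_column):
        k = round(b * SCALE) % n
        acc = acc * pow(c, k, n2) % n2  # scalar-multiply, then add
    return acc


n, private_key = keygen()
initiator_column = [1.5, -0.5, -1.0]  # illustrative values
partner_column = [2.0, 0.25, -1.0]    # illustrative values

encrypted_column = [encrypt(n, a) for a in initiator_column]
encrypted_sum = encrypted_dot(n, encrypted_column, partner_column)
# Both operands were scaled, so the product carries SCALE squared.
dot_product = decrypt(n, private_key, encrypted_sum, SCALE * SCALE)
print(dot_product)
```

When both columns have been standardized with the population standard deviation, dividing this inner product by the number of samples gives the correlation coefficient between the two features. The partner never sees the initiator's plaintext values, and only the aggregated result is decrypted by the initiator.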
S130. Determine the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
Here, the initiator correlation coefficient matrix is the matrix formed by the correlation coefficients between the items of feature data in the initiator data set. The partner correlation coefficient matrix is the matrix formed by the correlation coefficients between the items of feature data in the partner data set.
Specifically, the feature data in the initiator data set, that is, the initiator's feature data, is determined; the correlation coefficients between the items of the initiator's feature data are calculated; and the initiator correlation coefficient matrix of the initiator is determined according to those correlation coefficients. Likewise, the feature data in the partner data set, that is, the partner's feature data, is determined; the correlation coefficients between the items of the partner's feature data are calculated; and the partner correlation coefficient matrix of the partner is determined according to those correlation coefficients.
By way of example, the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner may be determined through the following sub-steps:
S1301. Standardize the feature data in the initiator data set and the feature data in the partner data set to determine the initiator standardized data and the partner standardized data.
Here, standardization converts the original data in a certain proportion by means of a mathematical transformation so that it falls into a small specific interval, which may be [0,1] or [-1,1]. This eliminates differences between variables in nature, dimension and order of magnitude and converts each variable into a dimensionless relative value, that is, a standardized value, so that the values of all indicators are on the same order of magnitude and indicators with different units or orders of magnitude can be analyzed and compared together.
Specifically, the feature data in the initiator data set and the feature data in the partner data set may be standardized by the Z-score standardization method to determine the initiator standardized data and the partner standardized data.
For example, the initiator standardized data may be obtained as follows: determine the feature data in the initiator data set and the number of samples corresponding to the feature data; calculate the mean of the feature data in the initiator data set; determine the feature variance according to that mean and the number of samples; and determine the initiator standardized data according to the feature variance, the feature data in the initiator data set and the mean of that feature data.
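The Z-score computation just described can be sketched for a single feature column as follows. The sketch assumes the population variance (division by the number of samples, not by one less), which the source does not specify, and the sample values are hypothetical.

```python
import math


def z_score(column):
    """Standardize one feature column: subtract the mean, divide by the
    population standard deviation (assumption: divisor is the sample count)."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]


# Hypothetical initiator feature with mean 5 and population variance 4.
standardized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(standardized)
```

A constant column has zero variance and would cause a division by zero here; such a feature carries no information and would need to be dropped or handled separately before standardization.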
S1302. Determine the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the items of feature data of the initiator standardized data, and determine the partner correlation coefficient matrix of the partner according to the correlation coefficients between the items of feature data of the partner standardized data.
Here, the feature data of the initiator standardized data is the standardized result of the feature data in the initiator data set. The feature data of the partner standardized data is the standardized result of the feature data in the partner data set.
Specifically, the correlation coefficients between the items of feature data of the initiator standardized data are calculated and combined to determine the initiator correlation coefficient matrix of the initiator. The correlation coefficients between the items of feature data of the partner standardized data are calculated and combined to determine the correlation coefficient matrix between the partner's features, that is, the partner correlation coefficient matrix.
Standardizing the initiator data set and the partner data set before calculating the initiator correlation coefficient matrix and the partner correlation coefficient matrix converts the problem of calculating the correlation coefficient matrix between the initiator's feature data and the partner's feature data into a homomorphic multiplication of the initiator's feature data and the partner's feature data, which safeguards the data privacy of both the initiator and the partner.
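The reason standardization reduces correlation to multiplication is that, for columns standardized with the population standard deviation, the Pearson correlation coefficient of two features equals the mean of the element-wise products of their standardized values. A minimal single-party sketch follows; the helper names and sample columns are hypothetical.

```python
import math


def z_score(column):
    """Standardize a column with the population standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


def correlation_matrix(columns_a, columns_b):
    """Entry [i][j] is the correlation of columns_a[i] with columns_b[j],
    computed as the mean product of standardized values."""
    n = len(columns_a[0])
    za = [z_score(c) for c in columns_a]
    zb = [z_score(c) for c in columns_b]
    return [[sum(x * y for x, y in zip(u, v)) / n for v in zb] for u in za]


# Hypothetical initiator features, one list per feature column.
initiator_columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
initiator_matrix = correlation_matrix(initiator_columns, initiator_columns)
print(initiator_matrix)
```

Passing one party's columns as both arguments gives that party's own correlation matrix; the cross-party matrix has the same form, except that the products are evaluated over ciphertexts.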
S140. Perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.
Specifically, according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix between the initiator's feature data and the partner's feature data, feature multicollinearity analysis is performed on all features of the initiator and the partner to obtain the multicollinearity value of each item of the initiator's feature data and the multicollinearity value of each item of the partner's feature data. The multicollinearity value of each item of the initiator's feature data is its federated multicollinearity value in federated learning. The multicollinearity value of each item of the partner's feature data is likewise its federated multicollinearity value in federated learning.
S150. According to the feature multicollinearity analysis result, determine the model training data from the initiator data set and the partner data set, and train the federated learning model with the model training data.
Here, the federated learning model is used to predict a user's business indicator value.
Specifically, the multicollinearity value of each item of the initiator's feature data and the multicollinearity value of each item of the partner's feature data are determined according to the feature multicollinearity analysis result. According to these multicollinearity values and a preset screening condition for model training data, the model training data is determined from the feature data in the initiator data set and the feature data in the partner data set, and the federated learning model is trained with the model training data. The business data of a user whose business indicator value is to be predicted is used as the input of the trained federated learning model, and the prediction result of the user's business indicator value is determined according to the output of the trained federated learning model.
By way of example, the feature data in the initiator data set and the feature data in the partner data set may be screened according to the feature multicollinearity analysis result and a multicollinearity threshold: the feature data in the initiator data set and the partner data set whose multicollinearity value is less than the multicollinearity threshold is determined as the model training data, and the federated learning model is trained with the model training data.
It can be understood that selecting, from the initiator data set and the partner data set, the feature data whose multicollinearity value is less than the multicollinearity threshold as the model training data of the federated learning model can improve both the training efficiency and the reliability of the model.
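The threshold screening can be sketched as a simple filter. The feature names, the multicollinearity values and the threshold of 10.0 below are hypothetical; the source does not state a particular threshold.

```python
def select_training_features(multicollinearity_values, threshold):
    """Keep the features whose multicollinearity value is below the threshold."""
    return [
        name
        for name, value in multicollinearity_values.items()
        if value < threshold
    ]


# Hypothetical per-feature multicollinearity values for both parties.
values = {
    "initiator.income": 2.1,
    "initiator.debt_ratio": 14.8,
    "partner.monthly_spend": 3.4,
    "partner.card_limit": 11.2,
}
selected = select_training_features(values, threshold=10.0)
print(selected)
```

Only the columns of the selected features would then be passed on as model training data.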
In the technical solution provided in this embodiment, an initiator data set and a partner data set are determined according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; the initiator data set is encrypted by the initiator device, and the encrypted initiator data set is sent to the partner device; a ciphertext correlation coefficient matrix between the initiator and the partner is obtained from the partner device and decrypted by the initiator device, and a plaintext correlation coefficient matrix is determined according to the decryption result; an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner are determined according to the initiator data set and the partner data set; feature multicollinearity analysis is performed on the initiator and partner data sets according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and, according to the feature multicollinearity analysis result, model training data is determined from the initiator data set and the partner data set and is used to train the federated learning model. This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive iterative training are needed to calculate the goodness of fit, communication overhead and computational complexity are high, and the result depends on the model's hyperparameter settings. The above solution calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is required during the calculation, and the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner is reduced. By using a homomorphic encryption algorithm together with the correlation coefficient method, the training data of the federated learning model is determined through distributed computation and the model is trained, which improves the training efficiency of the federated learning model while preserving the interpretability of the linear model in the federated learning model. Using the federated learning model to predict a user's business indicator value can improve the prediction accuracy of that value.
Embodiment 2
FIG. 2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application. This embodiment refines the above embodiment and provides an implementation of performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix. Specifically, as shown in FIG. 2, the method includes:
S210. Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning.
S220. Encrypt the initiator data set by the initiator device and send the encrypted initiator data set to the partner device; obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix by the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result.
S230. Determine the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
S240. Determine a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, and determine the complete matrix determinant value of the correlation coefficient fusion matrix.
Here, the correlation coefficient fusion matrix is the matrix obtained by combining the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix.
Specifically, the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix are combined to obtain the correlation coefficient fusion matrix. The determinant of the correlation coefficient fusion matrix is then calculated and used as the complete matrix determinant value.
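As an illustration of how the three matrices may be combined, the following minimal Python sketch assumes a block layout in which the initiator matrix and the partner matrix sit on the diagonal, and the plaintext (cross-party) matrix and its transpose fill the off-diagonal blocks. The present application does not fix this layout; the function name `fuse_correlation_matrices` and the sample values are hypothetical.

```python
def fuse_correlation_matrices(r_init, r_part, r_cross):
    """Assemble [[R_A, R_AB], [R_AB^T, R_B]] from nested lists.

    r_init  : n_a x n_a initiator correlation matrix
    r_part  : n_b x n_b partner correlation matrix
    r_cross : n_a x n_b plaintext (initiator vs. partner) correlation matrix
    The block layout is an assumption made for this sketch.
    """
    n_a, n_b = len(r_init), len(r_part)
    if len(r_cross) != n_a or any(len(row) != n_b for row in r_cross):
        raise ValueError("cross matrix must have shape n_a x n_b")
    # Top rows: initiator block followed by the cross block.
    top = [list(r_init[i]) + list(r_cross[i]) for i in range(n_a)]
    # Bottom rows: transposed cross block followed by the partner block.
    bottom = [[r_cross[i][j] for i in range(n_a)] + list(r_part[j])
              for j in range(n_b)]
    return top + bottom


# Hypothetical example: two initiator features and one partner feature.
r_a = [[1.0, 0.3], [0.3, 1.0]]
r_b = [[1.0]]
r_ab = [[0.5], [0.2]]
fused = fuse_correlation_matrices(r_a, r_b, r_ab)
```

Under this layout the fused matrix is symmetric with a unit diagonal, which is what a correlation matrix over the union of both parties' features would look like.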
S250: Determine the feature matrix determinant value obtained after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column in which the matrix feature data is located.
Here, matrix feature data refers to the feature data contained in the correlation coefficient fusion matrix. A feature row is the row of the correlation coefficient fusion matrix in which the matrix feature data is located, and a feature column is the column in which it is located.
Specifically, each item of matrix feature data constituting the correlation coefficient fusion matrix is determined. For each item, the feature row and feature column in which that matrix feature data is located are deleted from the correlation coefficient fusion matrix to obtain the feature matrix corresponding to that item, and the determinant value of that feature matrix is determined.
S260: Use the ratio of the feature matrix determinant value to the complete matrix determinant value as the multicollinearity value of the matrix feature data, integrate the multicollinearity values of all matrix feature data to determine the multicollinearity values of the initiator and the partner, and use the multicollinearity values of the initiator and the partner as the feature multicollinearity analysis result.
Specifically, the ratio of the feature matrix determinant value to the complete matrix determinant value is used as the multicollinearity value of the matrix feature data to which that feature matrix corresponds. The multicollinearity values of all matrix feature data are integrated to obtain the multicollinearity value of each item of feature data of the initiator and of each item of feature data of the partner, and these values are used as the feature multicollinearity analysis result.
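The determinant ratio described in S250 and S260 can be sketched in plain Python as follows. For a correlation matrix, this ratio coincides with the k-th diagonal entry of the inverse matrix, which is commonly known as the variance inflation factor; that identification is an observation added here, not a term used in the present application. The helper names and the sample matrix are hypothetical, and the guard against a zero determinant is an assumption of this sketch.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def multicollinearity_values(fused):
    """Ratio det(fused without row/column k) / det(fused) for each feature k."""
    full = determinant(fused)
    if abs(full) < 1e-12:
        raise ValueError("fusion matrix is singular (perfect collinearity)")
    values = []
    for k in range(len(fused)):
        minor = [[v for j, v in enumerate(row) if j != k]
                 for i, row in enumerate(fused) if i != k]
        values.append(determinant(minor) / full)
    return values


# Hypothetical 3 x 3 fusion matrix (two initiator features, one partner feature).
fused = [[1.0, 0.3, 0.5], [0.3, 1.0, 0.2], [0.5, 0.2, 1.0]]
values = multicollinearity_values(fused)
```

The first entries of `values` belong to the initiator's features and the remaining ones to the partner's, following the row order of the fusion matrix.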
S270: According to the feature multicollinearity analysis result, determine model training data from the initiator data set and the partner data set, and train a federated learning model with the model training data.
Here, the federated learning model is used to predict a user's business indicator value.
The technical solution of this embodiment provides a method of performing feature multicollinearity analysis on the initiator and the partner so as to calculate their multicollinearity values. In this solution, the correlation coefficient fusion matrix is determined according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix, and the feature matrix determinant value corresponding to each item of feature data in the fusion matrix is calculated from it. From the feature matrix determinant value of each item of feature data and the complete matrix determinant value of the fusion matrix, an accurate multicollinearity value can be obtained for each item of feature data, which improves the accuracy of the feature multicollinearity analysis result for the initiator and the partner.
Embodiment 3
FIG. 3 is a flowchart of a business indicator prediction method provided in Embodiment 3 of the present application. This embodiment refines the foregoing embodiments and provides an implementation in which the feature data in the initiator data set and the feature data in the partner data set are filtered according to their data contribution. Specifically, as shown in FIG. 3, the method includes:
S310: Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning.
S320: Determine the data contribution of the feature data in the initiator data set and of the feature data in the partner data set.
Here, the data contribution may be determined from the IV (Information Value) of the data. The IV indicates how much the feature data contributes to predicting the target, that is, the predictive power of the feature. In general, the higher the IV, the stronger the predictive power of the feature and the greater its information contribution.
Specifically, the IV of the feature data in the initiator data set and of the feature data in the partner data set may be calculated as a weighted sum of the WOE (Weight of Evidence) values of that feature data, and the IV is used as the data contribution.
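The WOE-weighted sum can be illustrated with the conventional definition of IV for a binned feature and a binary label: for each bin, WOE is the logarithm of the ratio between the bin's share of positive samples and its share of negative samples, and IV sums the difference of the two shares multiplied by the WOE. The present application does not spell out this formula, so the sketch below is an assumption based on the standard definition, computed in plaintext by whichever party holds the labels. The function name, the `eps` smoothing of empty bins, and the sample data are hypothetical.

```python
import math


def information_value(bins, labels, eps=0.5):
    """IV of one binned feature against a binary label (1 = positive)."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    if pos_total == 0 or neg_total == 0:
        raise ValueError("labels must contain both classes")
    # Count positives and negatives per bin.
    counts = {}
    for b, y in zip(bins, labels):
        pos, neg = counts.get(b, (0, 0))
        counts[b] = (pos + y, neg + 1 - y)
    iv = 0.0
    for pos, neg in counts.values():
        # eps replaces empty counts so the logarithm stays finite (assumption).
        pos_rate = max(pos, eps) / pos_total
        neg_rate = max(neg, eps) / neg_total
        woe = math.log(pos_rate / neg_rate)
        iv += (pos_rate - neg_rate) * woe
    return iv


labels = [1, 1, 1, 0, 0, 0, 0, 1]
informative = information_value([0, 0, 0, 0, 1, 1, 1, 1], labels)
uninformative = information_value([0, 0, 1, 1, 0, 0, 1, 1], labels)
```

A feature whose bins separate the two classes yields a large IV, whereas a feature whose bins each contain the same class mix yields an IV of zero.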
For example, WOE is calculated for the feature data in the initiator data set and in the partner data set, and the IV of each item of feature data is calculated from it. Feature collinearity analysis is performed on the WOE results, and the data sets are screened for feature collinearity using the IV values and a feature collinearity threshold. Comparing the contribution of the feature data before and after this screening shows that the contribution changes from negative before screening to positive after screening; feature collinearity analysis therefore affects the interpretability of the model.
S330: Filter the feature data in the initiator data set and the feature data in the partner data set according to the data contribution.
Specifically, a contribution threshold is set according to actual needs. Based on the contribution threshold and the data contribution, the feature data in the initiator data set and in the partner data set are filtered, and feature data whose data contribution is less than the contribution threshold are removed. For example, the contribution threshold may be 0.1.
It should be noted that when the model is trained on all features of the initiator and the partner without filtering, every item of feature data in the initiator data set and in the partner data set receives a weight. When the feature data are filtered so that the model is trained on the feature data with higher contribution in the two data sets, the retained feature data likewise receive weights during training. Comparing the weights of the same features across the two training runs shows that some feature data have a negative weight before filtering and a positive weight after filtering, which indicates that multicollinearity analysis affects the interpretability of the federated learning model. Screening features through federated multicollinearity analysis and filtering out feature data with small IV values from the initiator data set and the partner data set can increase the speed of model training.
S340: Encrypt the initiator data set by the initiator device and send the encrypted initiator data set to the partner device; obtain, from the partner device, the ciphertext correlation coefficient matrix between the initiator and the partner; decrypt the ciphertext correlation coefficient matrix by the initiator device; and determine the plaintext correlation coefficient matrix according to the decryption result.
It should be noted that, in this step, the feature data in the initiator data set and the feature data in the partner data set are the feature data that remain after the filtering described above.
S350: Determine the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.
S360: Perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix.
S370: According to the feature multicollinearity analysis result, determine model training data from the initiator data set and the partner data set, and train a federated learning model with the model training data.
Here, the federated learning model is used to predict a user's business indicator value.
In the technical solution of this embodiment, before the model training data are determined from the initiator data set and the partner data set and used to train the federated learning model, the feature data in the initiator data set and in the partner data set are filtered according to their respective contributions so as to obtain the training data of the federated learning model. This solution can increase the model training speed while maintaining the reliability of the federated learning model.
Embodiment 4
FIG. 4 is a schematic structural diagram of a business indicator prediction apparatus provided in Embodiment 4 of the present application. This embodiment is applicable to training a federated learning model for predicting business indicators according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning. As shown in FIG. 4, the business indicator prediction apparatus includes a data set determination module 410, a data set encryption and transmission module 420, a correlation coefficient matrix determination module 430, a multicollinearity analysis module 440, and a model training module 450.
The data set determination module 410 is configured to determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning;
the data set encryption and transmission module 420 is configured to encrypt the initiator data set by the initiator device and send the encrypted initiator data set to the partner device; and to obtain, from the partner device, the ciphertext correlation coefficient matrix between the initiator and the partner, decrypt the ciphertext correlation coefficient matrix by the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;
the correlation coefficient matrix determination module 430 is configured to determine the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
the multicollinearity analysis module 440 is configured to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix;
the model training module 450 is configured to determine model training data from the initiator data set and the partner data set according to the feature multicollinearity analysis result, and to train a federated learning model with the model training data, the federated learning model being used to predict a user's business indicator value.
In the technical solution provided in this embodiment, an initiator data set and a partner data set are determined according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; the initiator data set is encrypted by the initiator device and the encrypted initiator data set is sent to the partner device; the ciphertext correlation coefficient matrix between the initiator and the partner is obtained from the partner device and decrypted by the initiator device, and the plaintext correlation coefficient matrix is determined according to the decryption result; the initiator correlation coefficient matrix and the partner correlation coefficient matrix are determined according to the initiator data set and the partner data set; feature multicollinearity analysis is performed on the initiator and partner data sets according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix; and, according to the feature multicollinearity analysis result, model training data are determined from the initiator data set and the partner data set and used to train the federated learning model.
This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive iterative training are needed to calculate the goodness of fit, the communication cost and computational complexity are high, and the result depends on the setting of model hyperparameters. The above solution calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is needed during the calculation, and the number of data exchanges required to calculate the feature multicollinearity of the initiator and the partner is reduced. By using a homomorphic encryption algorithm together with the correlation coefficient method, the training data of the federated learning model are determined through distributed computation and the model is trained, which improves the training efficiency of the federated learning model while preserving the interpretability of the linear model in the federated learning model. Using the federated learning model to predict a user's business indicator value can improve the prediction accuracy of that value.
Exemplarily, the data set encryption and transmission module 420 is specifically configured to:
generate, by the initiator device, a key pair using a homomorphic encryption algorithm, and encrypt the initiator data set with the public key of the key pair;
send the encrypted initiator data set to the partner device, so that the partner device calculates, according to the encrypted initiator data set and the multiplication property of the homomorphic encryption algorithm, the ciphertext feature correlation coefficients between the feature data in the initiator data set and the feature data in the partner data set, and determines the ciphertext correlation coefficient matrix between the initiator and the partner according to the ciphertext feature correlation coefficients;
send the ciphertext correlation coefficient matrix to the initiator device, so that the initiator device decrypts the ciphertext correlation coefficient matrix with the private key of the key pair and determines, according to the decryption result, the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
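The three steps above can be sketched end to end with a toy Paillier cryptosystem written in plain Python (Python 3.8 or later, for `pow` with a negative exponent). The present application does not name a specific homomorphic scheme; Paillier is an assumption chosen here because raising a ciphertext to a plaintext power multiplies the hidden value by that plaintext, and multiplying ciphertexts adds the hidden values. The fixed primes, the fixed-point `SCALE`, and all function names are hypothetical, and the key size is far too small for real use.

```python
import math
import random

# Toy parameters: two known Mersenne primes. Insecure, for illustration only.
P, Q = 2**31 - 1, 2**61 - 1
SCALE = 10**6  # fixed-point scale used to encode floats as integers


def keygen():
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # With g = n + 1, g^m mod n^2 equals 1 + m*n.
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # map back to signed integers


def encrypted_dot(n, enc_x, y):
    """Partner side: Enc(sum x_i * y_i) from Enc(x_i) and plaintext y_i."""
    n2 = n * n
    acc = 1  # 1 is a valid encryption of 0
    for c, k in zip(enc_x, y):
        acc = acc * pow(c, k % n, n2) % n2
    return acc


def standardize(col):
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std for v in col]


# One initiator feature x and one partner feature y over four aligned users.
x = standardize([1.0, 2.0, 3.0, 4.0])
y = standardize([2.0, 1.0, 4.0, 3.0])

n, priv = keygen()                                                # initiator
enc_x = [encrypt(n, round(v * SCALE)) for v in x]                 # initiator
enc_dot = encrypted_dot(n, enc_x, [round(v * SCALE) for v in y])  # partner
corr = decrypt(n, priv, enc_dot) / (SCALE**2 * len(x))            # initiator
```

Because both columns are standardized, the decrypted inner product divided by the number of samples is their correlation coefficient. Repeating the partner step for every pair of initiator and partner features would fill the ciphertext correlation coefficient matrix.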
Exemplarily, the multicollinearity analysis module 440 is specifically configured to:
determine a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix, and determine the complete matrix determinant value of the correlation coefficient fusion matrix;
determine the feature matrix determinant value obtained after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column in which the matrix feature data is located;
use the ratio of the feature matrix determinant value to the complete matrix determinant value as the multicollinearity value of the matrix feature data, integrate the multicollinearity values of all matrix feature data to determine the multicollinearity values of the initiator and the partner, and use the multicollinearity values of the initiator and the partner as the feature multicollinearity analysis result.
Exemplarily, the data set determination module 410 is specifically configured to:
determine the users common to the initiator and the partner in vertical federated learning, and determine, based on the identity identifiers of those common users, the data intersection between the initiator business data of the initiator and the partner business data of the partner;
process the initiator business data and the partner business data respectively according to the data intersection to determine the initiator data set and the partner data set.
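The alignment performed by module 410 can be illustrated with a plaintext sketch that keeps only the users present on both sides and orders both data sets identically. How the identifiers are compared without disclosing non-shared users (for example, by a private set intersection protocol) is not described in this passage, so the direct set intersection below is a simplifying assumption. The function name and sample records are hypothetical.

```python
def align_by_user_id(initiator_rows, partner_rows):
    """Keep only users present on both sides, in the same order.

    Each argument maps a user identifier to that party's feature vector.
    """
    shared = sorted(initiator_rows.keys() & partner_rows.keys())
    initiator_set = [initiator_rows[u] for u in shared]
    partner_set = [partner_rows[u] for u in shared]
    return shared, initiator_set, partner_set


# Hypothetical business data keyed by user identifier.
initiator_rows = {"u1": [0.2, 3.0], "u2": [0.5, 1.0], "u4": [0.9, 2.0]}
partner_rows = {"u2": [7.0], "u3": [8.0], "u4": [9.0]}
shared, initiator_set, partner_set = align_by_user_id(initiator_rows, partner_rows)
```

Row i of both resulting data sets then refers to the same user, which is the precondition for computing cross-party correlation coefficients.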
Exemplarily, the correlation coefficient matrix determination module 430 is configured to:
standardize the feature data in the initiator data set and the feature data in the partner data set to determine initiator standardized data and partner standardized data;
determine the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the items of feature data of the initiator standardized data, and determine the partner correlation coefficient matrix of the partner according to the correlation coefficients between the items of feature data of the partner standardized data.
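A minimal sketch of the local computation in module 430 follows, assuming z-score standardization with the population standard deviation and the Pearson correlation coefficient; the present application does not name either choice explicitly. Each party would run this on its own feature columns only. The function names and sample columns are hypothetical, and a constant column (zero standard deviation) is not handled.

```python
import math


def standardize(column):
    """Z-score a column using the population standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def correlation_matrix(columns):
    """Pearson correlation matrix of feature columns held by one party."""
    z = [standardize(c) for c in columns]
    m = len(z[0])
    # For standardized columns, the correlation is the mean of the products.
    return [[sum(a * b for a, b in zip(zi, zj)) / m for zj in z] for zi in z]


# Hypothetical feature columns of one party over four aligned users.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
r = correlation_matrix(columns)
```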
Exemplarily, the model training module 450 is specifically configured to:
screen the feature data in the initiator data set and the feature data in the partner data set according to the feature multicollinearity analysis result and a multicollinearity threshold, determine the feature data in the initiator data set and the partner data set whose multicollinearity value is less than the multicollinearity threshold as the model training data, and train the federated learning model with the model training data.
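The screening rule of module 450 can be sketched as a per-party filter. The default threshold of 10 is a common rule of thumb for variance-inflation-style values and is an assumption here, since the present application does not give a numeric multicollinearity threshold. The function name, feature names, and values are hypothetical.

```python
def select_training_features(values_by_party, threshold=10.0):
    """Keep, per party, the features whose multicollinearity value is below the threshold."""
    return {
        party: [name for name, value in features.items() if value < threshold]
        for party, features in values_by_party.items()
    }


# Hypothetical multicollinearity values per party and feature.
values_by_party = {
    "initiator": {"age": 1.4, "income": 12.7},
    "partner": {"monthly_spend": 11.9, "tenure": 1.1},
}
selected = select_training_features(values_by_party)
```

Each party then contributes only its selected feature columns to the training of the federated learning model.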
Exemplarily, the above business indicator prediction apparatus further includes:
a data contribution determination module, configured to determine the data contribution of the feature data in the initiator data set and of the feature data in the partner data set;
a data filtering module, configured to filter the feature data in the initiator data set and the feature data in the partner data set according to the data contribution.
The business indicator prediction apparatus provided in this embodiment is applicable to the business indicator prediction method provided in any of the above embodiments and has the corresponding functions and beneficial effects.
Embodiment 5
FIG. 5 is a schematic structural diagram of an electronic device 10 that can be used to implement the embodiments of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile apparatus, such as personal digital assistants, cellular phones, smartphones, wearable devices (such as helmets, glasses, and watches), and other similar computing apparatus. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present application described and/or claimed herein.
As shown in FIG. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processing according to the computer program stored in the ROM 12 or the computer program loaded from a storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another through a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; the storage unit 18, such as a magnetic disk or an optical disc; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The processor 11 performs the methods and processing described above, such as the business indicator prediction method.
In some embodiments, the business indicator prediction method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the business indicator prediction method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the business indicator prediction method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the computer programs, when executed by the processor, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of apparatus may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.
计算系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以是云服务器,又称为云计算服务器或云主机,是云计算服务体系中的一项主机产品,以解决了传统物理主机与VPS服务中,存在的管理难度大,业务扩展性弱的缺陷。A computing system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The client and server relationship is generated by computer programs running on the corresponding computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system to solve the defects of difficult management and weak business scalability in traditional physical hosts and VPS services.
应该理解,可以使用上面所示的各种形式的流程,重新排序、增加或删除步骤。例如,本申请中记载的各步骤可以并行地执行也可以顺序地执行也可以不同的次序执行,只要能够实现本申请的技术方案所期望的结果,本文在此不进行限制。 It should be understood that the various forms of processes shown above can be used to reorder, add or delete steps. For example, the steps recorded in this application can be executed in parallel, sequentially or in different orders, as long as the expected results of the technical solution of this application can be achieved, and this document is not limited here.

Claims (10)

  1. A business indicator prediction method, comprising:
    determining an initiator data set and a partner data set according to initiator business data of an initiator and partner business data of a partner in vertical federated learning;
    encrypting the initiator data set by an initiator device, and sending the encrypted initiator data set to a partner device; and obtaining, from the partner device, a ciphertext correlation coefficient matrix between the initiator and the partner, decrypting the ciphertext correlation coefficient matrix by the initiator device, and determining a plaintext correlation coefficient matrix according to the decryption result;
    determining an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
    performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix; and
    determining model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and training a federated learning model with the model training data, wherein the federated learning model is used to predict a business indicator value of a user.
  2. The method according to claim 1, wherein encrypting the initiator data set by the initiator device, sending the encrypted initiator data set to the partner device, obtaining from the partner device the ciphertext correlation coefficient matrix between the initiator and the partner, decrypting the ciphertext correlation coefficient matrix by the initiator device, and determining the plaintext correlation coefficient matrix according to the decryption result comprises:
    generating, by the initiator device, a key pair using a homomorphic encryption algorithm, and encrypting the initiator data set with the public key of the key pair;
    sending the encrypted initiator data set to the partner device, so that the partner device calculates, according to the encrypted initiator data set and the multiplicative property of the homomorphic encryption algorithm, ciphertext feature correlation coefficients between the feature data in the initiator data set and the feature data in the partner data set, and determines, according to the ciphertext feature correlation coefficients, the ciphertext correlation coefficient matrix between the initiator and the partner; and
    sending the ciphertext correlation coefficient matrix to the initiator device, so that the initiator device decrypts the ciphertext correlation coefficient matrix with the private key of the key pair and determines, according to the decryption result, the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
  3. The method according to claim 1, wherein performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix comprises:
    determining a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix, and determining a complete-matrix determinant value of the correlation coefficient fusion matrix;
    determining a feature-matrix determinant value obtained after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column in which a matrix feature data item is located; and
    taking the ratio of the feature-matrix determinant value to the complete-matrix determinant value as the multicollinearity value of that matrix feature data item, combining the multicollinearity values of all matrix feature data items to determine the multicollinearity values of the initiator and the partner, and taking the multicollinearity values of the initiator and the partner as the result of the feature multicollinearity analysis.
  4. The method according to claim 1, wherein determining the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning comprises:
    determining the users common to the initiator and the partner in vertical federated learning, and determining, based on the identity identifiers of the common users, the data intersection between the initiator business data of the initiator and the partner business data of the partner; and
    processing the initiator business data and the partner business data respectively according to the data intersection, to determine the initiator data set and the partner data set.
  5. The method according to claim 1, wherein determining the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set comprises:
    standardizing the feature data in the initiator data set and the feature data in the partner data set, to determine initiator standardized data and partner standardized data; and
    determining the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the feature data items of the initiator standardized data, and determining the partner correlation coefficient matrix of the partner according to the correlation coefficients between the feature data items of the partner standardized data.
  6. The method according to claim 1, wherein determining model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and training the federated learning model with the model training data, comprises:
    screening the feature data in the initiator data set and the feature data in the partner data set according to the result of the feature multicollinearity analysis and a multicollinearity threshold, determining the feature data in the initiator data set and the partner data set whose multicollinearity value is less than the multicollinearity threshold as the model training data, and training the federated learning model with the model training data.
  7. The method according to claim 1, further comprising, after determining the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning:
    determining the data contribution of the feature data in the initiator data set and of the feature data in the partner data set; and
    filtering the feature data in the initiator data set and the feature data in the partner data set according to the data contribution.
  8. A business indicator prediction apparatus, comprising:
    a data set determination module, configured to determine an initiator data set and a partner data set according to initiator business data of an initiator and partner business data of a partner in vertical federated learning;
    a data set encryption and transmission module, configured to encrypt the initiator data set by an initiator device and send the encrypted initiator data set to a partner device; and to obtain, from the partner device, a ciphertext correlation coefficient matrix between the initiator and the partner, decrypt the ciphertext correlation coefficient matrix by the initiator device, and determine a plaintext correlation coefficient matrix according to the decryption result;
    a correlation coefficient matrix determination module, configured to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;
    a multicollinearity analysis module, configured to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix; and
    a model training module, configured to determine model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and to train a federated learning model with the model training data, wherein the federated learning model is used to predict a business indicator value of a user.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor is able to perform the business indicator prediction method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, cause the processor to implement the business indicator prediction method according to any one of claims 1 to 7.
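
The correlation and multicollinearity steps recited in claims 3, 5 and 6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names (`standardize`, `corr_block`, `fuse`, `multicollinearity`, `select`), the data layout (one list per feature column) and the sample values are assumptions made for illustration. In particular, the cross-party block is computed here in plaintext, whereas in the claimed method it is obtained through the homomorphic-encryption exchange of claim 2. The determinant ratio of claim 3 equals the classical variance inflation factor, i.e. the corresponding diagonal entry of the inverse correlation matrix.

```python
from math import sqrt

def standardize(col):
    """Z-score one feature column (assumes the column is not constant)."""
    n = len(col)
    mean = sum(col) / n
    std = sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / std for v in col]

def corr_block(cols_a, cols_b):
    """Correlation coefficients between two groups of standardized columns."""
    n = len(cols_a[0])
    return [[sum(x * y for x, y in zip(a, b)) / n for b in cols_b]
            for a in cols_a]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def fuse(r_aa, r_ab, r_bb):
    """Assemble the fused matrix [[R_AA, R_AB], [R_AB^T, R_BB]]."""
    r_ba = [list(col) for col in zip(*r_ab)]
    top = [ra + rab for ra, rab in zip(r_aa, r_ab)]
    bottom = [rba + rb for rba, rb in zip(r_ba, r_bb)]
    return top + bottom

def multicollinearity(fused):
    """Per-feature ratio det(matrix without row/column j) / det(full matrix)."""
    full = det(fused)
    values = []
    for j in range(len(fused)):
        minor = [[v for c, v in enumerate(row) if c != j]
                 for r, row in enumerate(fused) if r != j]
        # A singular fused matrix means perfect collinearity: report infinity.
        values.append(det(minor) / full if full else float("inf"))
    return values

def select(names, values, threshold):
    """Keep features whose multicollinearity value is below the threshold."""
    return [n for n, v in zip(names, values) if v < threshold]

if __name__ == "__main__":
    # Hypothetical aligned samples: two initiator features, one partner feature.
    a = [standardize(c) for c in ([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])]
    b = [standardize(c) for c in ([1.1, 2.0, 2.9, 4.2, 5.0],)]
    fused = fuse(corr_block(a, a), corr_block(a, b), corr_block(b, b))
    scores = multicollinearity(fused)
    print(select(["a1", "a2", "b1"], scores, threshold=10.0))
```

The threshold of 10.0 in the usage lines is an arbitrary illustrative value; the claims leave the multicollinearity threshold unspecified.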
PCT/CN2023/079369 2022-10-19 2023-03-02 Service index prediction method and apparatus, and device and storage medium WO2024082514A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211278879.7A CN115545216B (en) 2022-10-19 2022-10-19 Service index prediction method, device, equipment and storage medium
CN202211278879.7 2022-10-19

Publications (1)

Publication Number Publication Date
WO2024082514A1 (en) 2024-04-25

Family

ID=84735765

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/079369 WO2024082514A1 (en) 2022-10-19 2023-03-02 Service index prediction method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN115545216B (en)
WO (1) WO2024082514A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545216B (en) * 2022-10-19 2023-06-30 上海零数众合信息科技有限公司 Service index prediction method, device, equipment and storage medium
CN116738196A (en) * 2023-06-19 2023-09-12 上海零数众合信息科技有限公司 Reputation evaluation method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935050A (en) * 2021-09-26 2022-01-14 平安科技(深圳)有限公司 Feature extraction method and device based on federal learning, electronic device and medium
CN114372517A (en) * 2021-12-24 2022-04-19 武汉天喻信息产业股份有限公司 Longitudinal federated learning training and predicting method and device based on tree structure
US20220270590A1 (en) * 2020-07-20 2022-08-25 Google Llc Unsupervised federated learning of machine learning model layers
CN114996749A (en) * 2022-08-05 2022-09-02 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN115017541A (en) * 2022-06-06 2022-09-06 电子科技大学 Cloud-side-end-collaborative ubiquitous intelligent federal learning privacy protection system and method
CN115545216A (en) * 2022-10-19 2022-12-30 上海零数众合信息科技有限公司 Service index prediction method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633805B (en) * 2019-09-26 2024-04-26 深圳前海微众银行股份有限公司 Longitudinal federal learning system optimization method, device, equipment and readable storage medium
CN110956275B (en) * 2019-11-27 2021-04-02 支付宝(杭州)信息技术有限公司 Risk prediction and risk prediction model training method and device and electronic equipment
US11443207B2 (en) * 2020-03-12 2022-09-13 Capital One Services, Llc Aggregated feature importance for finding influential business metrics
CN111695697B (en) * 2020-06-12 2023-09-08 深圳前海微众银行股份有限公司 Multiparty joint decision tree construction method, equipment and readable storage medium
JP2021068456A (en) * 2020-10-15 2021-04-30 雅浩 白井 Calculation technique for eliminating "multicollinearity" or the like in regression analysis, and obtaining partial regression coefficient indicating contribution to proper objective variable of explanatory variable, as management material
CN112270597A (en) * 2020-11-10 2021-01-26 恒安嘉新(北京)科技股份公司 Business processing and credit evaluation model training method, device, equipment and medium
CN114638274A (en) * 2020-12-15 2022-06-17 深圳前海微众银行股份有限公司 Feature selection method, device, readable storage medium and computer program product
CN113095514A (en) * 2021-04-26 2021-07-09 深圳前海微众银行股份有限公司 Data processing method, device, equipment, storage medium and program product
CN114003939B (en) * 2021-11-16 2024-03-15 蓝象智联(杭州)科技有限公司 Multiple collinearity analysis method for longitudinal federal scene
CN114936372A (en) * 2022-04-06 2022-08-23 湘潭大学 Model protection method based on three-party homomorphic encryption longitudinal federal learning
CN114881247A (en) * 2022-06-10 2022-08-09 杭州博盾习言科技有限公司 Longitudinal federal feature derivation method, device and medium based on privacy computation

Also Published As

Publication number Publication date
CN115545216B (en) 2023-06-30
CN115545216A (en) 2022-12-30

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23878518; Country of ref document: EP; Kind code of ref document: A1)