WO2024082514A1

WO2024082514A1 - Service index prediction method and apparatus, and device and storage medium

Info

Publication number: WO2024082514A1
Application number: PCT/CN2023/079369
Authority: WO
Inventors: 孙银银
Original assignee: 上海零数众合信息科技有限公司
Priority date: 2022-10-19
Filing date: 2023-03-02
Publication date: 2024-04-25
Also published as: CN115545216B; CN115545216A

Abstract

Disclosed in the present application is a service index prediction method, comprising: determining an initiator data set of an initiator and a partner data set of a partner in vertical federated learning; encrypting the initiator data set by means of an initiator device, and sending the encrypted initiator data set to a partner device; decrypting a ciphertext correlation coefficient matrix by means of the initiator device, so as to determine a plaintext correlation coefficient matrix; determining an initiator correlation coefficient matrix and a partner correlation coefficient matrix according to the initiator data set and the partner data set; performing feature multi-collinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and determining model training data from the initiator data set and the partner data set according to the analysis result, and training a federated learning model by using the model training data. The training efficiency of a federated learning model is increased, and the feature interpretability of a linear model is ensured.

Description

A business indicator prediction method, device, equipment and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 19, 2022, with application number 202211278879.7, the entire contents of which are incorporated by reference into this application.

Technical Field

The embodiments of the present application relate to the field of computers, and in particular to a business indicator prediction method, apparatus, device and storage medium.

Background

Vertical federated learning is typically used when one participant has too few data dimensions to achieve the modeling goal with its own data alone. It is mostly used for joint modeling across different industries. In vertical federated modeling with multicollinearity analysis, the data sets of the task initiator and the partner share a common sample space but have different feature spaces. Encryption algorithms are required to protect the data privacy of the data user and the data provider. At the same time, the multicollinearity between each feature and the other features must be calculated so that highly collinear features can be removed, improving the efficiency and accuracy of modeling. Federated multicollinearity calculation implemented with the linear model method requires multiple rounds of interactive iterative training to calculate the goodness of fit; its communication overhead and computational complexity are high, and its results depend on the model hyperparameter settings. Therefore, how to improve the training efficiency of the federated learning model while ensuring its accuracy is a problem that needs to be solved.

Summary of the invention

The present application provides a business indicator prediction method, apparatus, device and storage medium, which can improve the computational efficiency and computational accuracy of linear models in vertical federated learning, and improve the prediction accuracy of the federated learning model, so that when predicting the user's business indicators through the federated learning model, the prediction accuracy of the user's business indicators can be improved.

According to one aspect of the present application, a business indicator prediction method is provided, comprising:

Determining an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning;

Encrypting the initiator data set through the initiator device and sending the encrypted initiator data set to the partner device; and obtaining the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypting the ciphertext correlation coefficient matrix through the initiator device, and determining the plaintext correlation coefficient matrix according to the decryption result;

Determining an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;

Performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;

According to the result of feature multicollinearity analysis, model training data is determined from the initiator data set and the partner data set, and the model training data is used to train a federated learning model; the federated learning model is used to predict the business indicator value of the user.

According to another aspect of the present application, a business indicator prediction device is provided, the device comprising:

A data set determination module, used to determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning;

A data set encryption transmission module, used to encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; and obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;

A correlation coefficient matrix determination module, used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;

A multicollinearity analysis module, used for performing characteristic multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;

The model training module is used to determine the model training data from the initiator data set and the partner data set according to the result of feature multicollinearity analysis, and use the model training data to train the federated learning model; the federated learning model is used to predict the user's business indicator value.

According to another aspect of the present application, an electronic device is provided, the electronic device comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the business indicator prediction method described in any embodiment of the present application.

According to another aspect of the present application, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor to implement the business indicator prediction method described in any embodiment of the present application when executed.

The technical solution of the embodiments of the present application determines an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; encrypts the initiator data set through the initiator device and sends the encrypted initiator data set to the partner device; obtains the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypts the ciphertext correlation coefficient matrix through the initiator device, and determines the plaintext correlation coefficient matrix according to the decryption result; determines the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set; performs feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and, according to the result of the feature multicollinearity analysis, determines model training data from the initiator data set and the partner data set and uses the model training data to train the federated learning model. This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive iterative training are needed to calculate the goodness of fit, the communication overhead and computational complexity are high, and the calculation results depend on the model hyperparameter settings. The above scheme calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner. The homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing and to train the federated learning model, which improves the training efficiency of the federated learning model and ensures the interpretability of the linear model in the federated learning model. Using the federated learning model to predict a user's business indicator value can improve the prediction accuracy of the business indicator value.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG1 is a flow chart of a business indicator prediction method provided in Embodiment 1 of the present application;

FIG2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application;

FIG3 is a flow chart of a business indicator prediction method provided in Embodiment 3 of the present application;

FIG4 is a schematic diagram of the structure of a business indicator prediction device provided in Embodiment 4 of the present application;

FIG5 is a schematic diagram of the structure of an electronic device provided in Embodiment 5 of the present application.

Detailed Description

In order to enable those skilled in the art to better understand the scheme of the present application, the technical scheme in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. The described embodiments are only embodiments of a part of the present application, rather than all embodiments.

It should be noted that the terms "object" and the like in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than those illustrated or described herein. In addition, the term "including" and "etc." and any variations thereof are intended to cover non-exclusive inclusions, for example, a process, method, system, product or apparatus comprising a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such process, method, product or apparatus.

Embodiment 1

FIG1 is a flowchart of a business indicator prediction method provided in Embodiment 1 of the present application. This embodiment can be applied to the case where a federated learning model for predicting business indicators is trained based on the initiator business data of the initiator and the partner business data of the partner in vertical federated learning. It is particularly suitable for the case where multicollinearity analysis is performed on the initiator business data of the initiator and the partner business data of the partner in vertical federated learning by the correlation coefficient method, and the training data of the federated learning model is determined based on the multicollinearity analysis results, so that the federated learning model for predicting business indicators is trained with that training data. The method can be executed by a business indicator prediction device, which can be implemented in the form of hardware and/or software and can be configured in an electronic device. As shown in FIG1, the method includes:

S110. Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.

Among them, vertical federated learning is generally applicable to a federated learning scenario composed of participants with the same sample space and different feature spaces on the data set. A federated learning model can be collaboratively trained for different participants through vertical federated learning. In this embodiment, the participants are the initiator and the partner in vertical federated learning. The initiator and the partner can be enterprises in different industries with cooperation needs. The initiator's business data refers to the sample data that can characterize the user's business indicators on the initiator, obtained by the initiator with the user's permission; the partner's business data refers to the sample data that can characterize the user's business indicators on the partner, obtained by the partner with the user's permission. Business indicators refer to indicators used to measure a certain aspect of user behavior. For example, business indicators can be credit indicators or performance indicators of users on the initiator and the partner. The initiator data set refers to a data set containing the initiator's feature data. The partner data set refers to a data set containing the partner's feature data.

Specifically, with the permission of the initiator's users, the initiator business data of the initiator in the vertical federated learning is obtained, the feature data of the initiator business data is extracted, and the initiator data set is determined from that feature data. Likewise, with the permission of the partner's users, the partner business data of the partner in the vertical federated learning is obtained, the feature data of the partner business data is extracted, and the partner data set is determined from that feature data.

Exemplarily, the method for determining the initiator data set and the partner data set may be: determining the users that the initiator and the partner have in common in the vertical federated learning, and determining, based on the identity identifiers of those common users, the data intersection between the initiator business data and the partner business data; and processing the initiator business data and the partner business data respectively according to the data intersection to determine the initiator data set and the partner data set.

The identity identifier refers to data that can represent the identity of the user, and the identity identifier may include an ID number or a mobile phone number.

Specifically, the initiator and the partner of vertical federated learning may have the same user. Therefore, when training the federated learning model, it is necessary to determine the same user of the initiator and the partner in the vertical federated learning, and obtain the identity of the same user with the user's permission. Based on the identity of the same user, determine the data intersection between the initiator's initiator business data and the partner's partner business data. Integrate the data intersection and the feature data of the initiator's business data to determine the initiator data set; integrate the data intersection and the feature data of the partner's business data to determine the partner data set.

It can be understood that based on the same users of the initiator and the partner in the vertical federated learning, the data intersection between the initiator's business data and the partner's business data is determined, and the initiator's data set is determined based on the data intersection and the initiator's business data; the partner data set is determined based on the data intersection and the partner's business data, which can improve the intuitiveness of the association relationship between the initiator's data set and the partner's data set, and facilitate the subsequent calculation of the initiator and partner's feature multicollinearity based on the initiator's data set and the partner's data set.
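The alignment of the two data sets on their common users can be sketched as follows. This is a minimal illustration only: the function name `align_by_identity`, the dictionary layout, and the sample identifiers are assumptions made for the example, and the identifiers are compared in plaintext here, whereas the text above does not specify how the intersection is computed between the two devices.

```python
def align_by_identity(initiator_rows, partner_rows):
    """Keep only users present on both sides, in the same order.

    Both arguments map an identity identifier (hypothetical strings here)
    to that party's list of feature values.
    """
    # Data intersection: identity identifiers held by both parties.
    shared_ids = sorted(initiator_rows.keys() & partner_rows.keys())
    # Each party keeps only its own features for the shared users.
    initiator_set = [initiator_rows[i] for i in shared_ids]
    partner_set = [partner_rows[i] for i in shared_ids]
    return shared_ids, initiator_set, partner_set


initiator_rows = {"u1": [0.2, 5.0], "u2": [0.7, 3.0], "u3": [0.1, 9.0]}
partner_rows = {"u2": [12.0], "u3": [40.0], "u4": [7.0]}
ids, initiator_set, partner_set = align_by_identity(initiator_rows, partner_rows)
```

After alignment, row k of the initiator data set and row k of the partner data set describe the same user, which is what the later per-feature correlation calculations rely on.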

S120. Encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result.

The ciphertext correlation coefficient matrix is specifically the ciphertext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set. The plaintext correlation coefficient matrix is specifically the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set. The initiator device refers to the terminal device corresponding to the initiator, and the partner device refers to the terminal device corresponding to the partner. The correlation coefficient is a statistical indicator used to reflect the closeness of the correlation between variables.

Exemplarily, the following sub-steps may be used to encrypt the initiator data set, decrypt the ciphertext correlation coefficient matrix, and determine the plaintext correlation coefficient matrix based on the decryption result:

S1201. Generate a key pair using a homomorphic encryption algorithm through an initiator device, and encrypt the initiator data set using the public key in the key pair.

Here, a homomorphic encryption algorithm is an encryption algorithm whose ciphertexts support homomorphic operations: after data is homomorphically encrypted, a specific calculation is performed on the ciphertext, and the plaintext obtained by homomorphically decrypting the ciphertext result is equivalent to the result of performing the same calculation directly on the plaintext data, so that the data is "computable but invisible".

Specifically, the initiator device uses a homomorphic encryption algorithm to generate a key pair, which includes a public key and a private key. The initiator device uses the public key in the key pair to encrypt the initiator data set.

S1202. Send the encrypted initiator data set to the partner device, so that the partner device can calculate the ciphertext feature correlation coefficient between the feature data in the initiator data set and the feature data in the partner data set based on the encrypted initiator data set and the multiplication characteristics of the homomorphic encryption algorithm, and determine the ciphertext correlation coefficient matrix between the initiator and the partner based on the ciphertext feature correlation coefficient.

The ciphertext feature correlation coefficient refers to the ciphertext correlation coefficient obtained by calculating the ciphertext between the feature data in the partner data set and the encrypted feature data in the initiator data set.

Specifically, the encrypted initiator data set is sent to the partner device through the initiator device, so that the partner device calculates the ciphertext feature correlation coefficient between each feature data in the encrypted initiator data set and each feature data in the partner data set according to the multiplication characteristics of the homomorphic encryption algorithm, integrates the calculation results, and determines the ciphertext correlation coefficient matrix between the initiator's feature data and the partner's feature data.

S1203. Send the ciphertext correlation coefficient matrix to the initiator device, so that the ciphertext correlation coefficient matrix is decrypted by the initiator device using the private key in the key pair, and the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set is determined according to the decryption result.

Specifically, the ciphertext correlation coefficient matrix is sent to the initiator device through the partner device. The initiator device uses the private key in the key pair to decrypt the ciphertext correlation coefficient matrix, and determines the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set according to the decryption result.

It can be understood that the above scheme transforms the problem of calculating the feature correlation coefficients of the initiator and the partner in vertical federated learning into calculating the ciphertext correlation coefficient matrix between the initiator and the partner using a homomorphic encryption algorithm, thereby reducing the number of data interactions when the regression coefficient method is used to calculate the feature multicollinearity of the initiator and the partner, simplifying the calculation process and improving the calculation efficiency.
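Sub-steps S1201 to S1203 can be sketched with a toy additively homomorphic scheme. Everything here is an assumption for illustration: the text does not name a specific homomorphic encryption algorithm, so a textbook Paillier construction with deliberately tiny primes stands in for it; the fixed-point factor `SCALE`, the function names, and the sample values are hypothetical; and the feature columns are assumed to be Z-score standardized with population variance, so that each correlation coefficient is the mean of element-wise products.

```python
import math
import random

P, Q = 2**31 - 1, 2**61 - 1   # toy Mersenne primes, far too small for real use
SCALE = 10**6                 # assumed fixed-point factor for encoding floats


def keygen():
    """S1201: the initiator generates a key pair (toy Paillier, g = n + 1)."""
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
    return n, (lam, pow(lam, -1, n))   # public modulus, private key


def encrypt(n, value):
    m = round(value * SCALE) % n       # negatives wrap around modulo n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + m * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, private_key, cipher):
    lam, mu = private_key
    m = (pow(cipher, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # signed integer, still scaled


def zscore(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


def partner_cipher_matrix(n, enc_initiator_cols, partner_cols):
    """S1202: the partner combines ciphertexts with its own plaintext values."""
    n2 = n * n
    matrix = []
    for enc_col in enc_initiator_cols:
        row = []
        for col in partner_cols:
            acc = 1                    # trivial encryption of zero
            for c, y in zip(enc_col, col):
                # Raising Enc(x) to k yields Enc(k * x); multiplying
                # ciphertexts adds the underlying plaintexts.
                acc = acc * pow(c, round(y * SCALE) % n, n2) % n2
            row.append(acc)
        matrix.append(row)
    return matrix


def initiator_decrypt_matrix(n, private_key, cipher_matrix, num_samples):
    """S1203: the initiator decrypts and undoes the fixed-point scaling."""
    denom = SCALE * SCALE * num_samples
    return [[decrypt(n, private_key, c) / denom for c in row]
            for row in cipher_matrix]


a_cols = [zscore([1.0, 2.0, 3.0, 4.0]), zscore([2.0, 1.0, 4.0, 3.0])]
b_cols = [zscore([2.0, 4.0, 6.0, 8.5])]
n, private_key = keygen()
enc_a = [[encrypt(n, x) for x in col] for col in a_cols]
cipher = partner_cipher_matrix(n, enc_a, b_cols)
plain = initiator_decrypt_matrix(n, private_key, cipher, 4)
```

In this sketch the partner never sees the initiator's plaintext features and the initiator only receives the aggregated cross-correlation entries, with a single round trip between the two devices.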

S130: Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.

The initiator correlation coefficient matrix refers to a matrix composed of correlation coefficients between characteristic data in the initiator data set, and the partner correlation coefficient matrix refers to a matrix composed of correlation coefficients between characteristic data in the partner data set.

Specifically, the feature data in the initiator data set, that is, the feature data of the initiator, is determined, the correlation coefficients between the feature data of the initiator are calculated, and the initiator correlation coefficient matrix of the initiator is determined based on those correlation coefficients. The feature data in the partner data set, that is, the feature data of the partner, is determined, the correlation coefficients between the feature data of the partner are calculated, and the partner correlation coefficient matrix of the partner is determined based on those correlation coefficients.

Exemplarily, the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner may be determined according to the following sub-steps:

S1301. Standardize the feature data in the initiator data set and the feature data in the partner data set to determine the initiator standardized data and the partner standardized data.

Among them, standardization processing is to convert the original data according to a certain proportion through certain mathematical transformation methods, so that it falls into a small specific interval. The specific interval can be [0,1] or [-1,1], so as to eliminate the differences in characteristic attributes such as properties, dimensions and orders of magnitude between different variables, and convert it into a dimensionless relative value, that is, a standardized value, so that the values of each indicator are at the same order of magnitude, which facilitates the comprehensive analysis and comparison of indicators of different units or orders of magnitude.

Specifically, the characteristic data in the initiator data set and the characteristic data in the partner data set may be standardized by using a Z-score standardization method to determine the initiator standardized data and the partner standardized data.

For example, the method for obtaining the initiator standardized data may be: determining the feature data in the initiator data set and the number of samples corresponding to the feature data; calculating the average value of the feature data in the initiator data set, and determining the feature value variance based on that average value and the number of samples; and determining the initiator standardized data based on the feature value variance, the feature data in the initiator data set and the average value of the feature data in the initiator data set.

S1302. Determine the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the characteristic data of the initiator's standardized data, and determine the partner correlation coefficient matrix of the partner according to the correlation coefficients between the characteristic data of the partner's standardized data.

The characteristic data of the initiator's standardized data refers to the standardized result of the characteristic data in the initiator's data set, and the characteristic data of the partner's standardized data refers to the standardized result of the characteristic data in the partner's data set.

Specifically, the correlation coefficients between the feature data of the initiator standardized data are calculated and integrated to determine the initiator correlation coefficient matrix. The correlation coefficients between the feature data of the partner standardized data are calculated and integrated to determine the partner correlation coefficient matrix.
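Sub-steps S1301 and S1302 can be sketched locally on one party's data as follows. This is a minimal illustration under stated assumptions: the variance is taken as the population variance (divided by the number of samples), every feature column is assumed to have non-zero variance, and the names `zscore` and `correlation_matrix` are hypothetical.

```python
import math


def zscore(column):
    """Z-score standardization of one feature column."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n  # population variance
    std = math.sqrt(variance)  # assumed non-zero; a constant column would fail
    return [(x - mean) / std for x in column]


def correlation_matrix(columns):
    """Correlation coefficients between every pair of one party's features.

    For standardized columns, the correlation coefficient reduces to the
    mean of the element-wise products.
    """
    z = [zscore(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


features = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
r_initiator = correlation_matrix(features)
```

Each party can run this on its own device without any communication, since only its own features are involved.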

Before the initiator correlation coefficient matrix and the partner correlation coefficient matrix are calculated, the initiator data set and the partner data set are standardized, so that the problem of calculating the correlation coefficient matrix between the feature data of the initiator and the feature data of the partner is converted into a homomorphic multiplication problem between the feature data of the initiator and the feature data of the partner, which ensures the data privacy security of the initiator and the partner.

S140. Perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.

Specifically, based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix between the initiator's feature data and the partner's feature data, feature multicollinearity analysis is performed on all features of the initiator and the partner to obtain the multicollinearity values of each feature data of the initiator and the multicollinearity values of each feature data of the partner. The multicollinearity values of each feature data of the initiator are the federated multicollinearity values of each feature data of the initiator in federated learning. The multicollinearity values of each feature data of the partner are the federated multicollinearity values of each feature data of the partner in federated learning.
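The passage above does not state the formula for the multicollinearity value. As one hedged illustration, assume the value is the variance inflation factor (VIF), which for a joint correlation matrix R can be read off the diagonal of its inverse. The block layout of R, the function names, and the choice of VIF itself are assumptions for this sketch rather than the method defined by the application.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (pure Python)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def federated_vif(r_initiator, r_partner, r_cross):
    """VIF per feature from the three correlation matrices.

    r_cross has one row per initiator feature and one column per partner
    feature (the decrypted plaintext correlation coefficient matrix).
    """
    # Assemble the joint matrix [[R_A, R_AB], [R_AB^T, R_B]].
    top = [ra + rab for ra, rab in zip(r_initiator, r_cross)]
    r_cross_t = [list(col) for col in zip(*r_cross)]
    bottom = [rba + rb for rba, rb in zip(r_cross_t, r_partner)]
    inverse = invert(top + bottom)
    return [inverse[i][i] for i in range(len(inverse))]


vif = federated_vif([[1.0]], [[1.0]], [[0.6]])
```

With one feature per party and a cross-correlation of 0.6, both VIF values equal 1 / (1 - 0.36); the first entries of the result belong to the initiator's features and the remaining entries to the partner's.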

S150. According to the result of feature multicollinearity analysis, determine the model training data from the initiator data set and the partner data set, and use the model training data to train the federated learning model.

Among them, the federated learning model is used to predict the user's business indicator value.

Specifically, according to the result of feature multicollinearity analysis, the multicollinearity value of each feature data of the initiator and the multicollinearity value of each feature data of the partner are determined. According to the multicollinearity value of each feature data of the initiator, the multicollinearity value of each feature data of the partner, and the preset model training data screening conditions, the model training data is determined from the feature data in the initiator's data set and the feature data in the partner's data set, and the model training data is used to train the federated learning model. The business data of the user who needs to predict the business indicator value is used as the input data of the trained federated learning model, and the prediction result of the user's business indicator value is determined according to the output data of the trained federated learning model.

Exemplarily, the feature data in the initiator's data set and the feature data in the partner's data set can be screened according to the feature multicollinearity analysis results and the multicollinearity threshold, and the feature data in the initiator's data set and the partner's data set whose multicollinearity values are less than the multicollinearity threshold are determined as model training data, and the federated learning model is trained using the model training data.

It can be understood that screening out feature data with multicollinearity values less than the multicollinearity threshold from the initiator data set and the partner data set as model training data for the federated learning model can improve the model training efficiency and the reliability of the model.
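The threshold screening described above can be sketched as follows. The feature names, the multicollinearity values, and the threshold of 10.0 are hypothetical placeholders chosen only for the example.

```python
def select_training_features(multicollinearity_values, threshold):
    """Keep the features whose multicollinearity value is below the threshold."""
    return [name for name, value in multicollinearity_values.items()
            if value < threshold]


# Hypothetical multicollinearity values for features held by the two parties.
values = {
    "initiator_f1": 1.8,
    "initiator_f2": 14.2,
    "partner_f1": 3.1,
    "partner_f2": 10.0,
}
selected = select_training_features(values, threshold=10.0)
```

Only the columns named in `selected` would then be taken from the initiator data set and the partner data set as model training data.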

The technical solution provided in this embodiment determines an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; encrypts the initiator data set through the initiator device and sends the encrypted initiator data set to the partner device; obtains the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypts the ciphertext correlation coefficient matrix through the initiator device, and determines the plaintext correlation coefficient matrix based on the decryption result; determines the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner based on the initiator data set and the partner data set; performs feature multicollinearity analysis on the initiator and the partner based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and, based on the feature multicollinearity analysis results, determines the model training data from the initiator data set and the partner data set and trains the federated learning model using the model training data. This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive iterative training are needed to calculate the goodness of fit, the communication overhead and computational complexity are high, and the calculation results depend on the model hyperparameter settings. The above scheme calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner. The homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing and to train the federated learning model, which improves the training efficiency of the federated learning model and ensures the interpretability of the linear model in the federated learning model. Using the federated learning model to predict a user's business indicator value can improve the prediction accuracy of the business indicator value.

Embodiment 2

FIG. 2 is a flow chart of a business indicator prediction method provided in Embodiment 2 of the present application. This embodiment is optimized on the basis of the above embodiment, and provides an implementation for performing feature multicollinearity analysis on the initiator and the partner based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix. Specifically, as shown in FIG. 2, the method includes:

S210: Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.

S220: Encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result.

S230: Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.

S240: Determine a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, and determine a complete matrix determinant value of the correlation coefficient fusion matrix.

The correlation coefficient fusion matrix refers to the matrix obtained by combining the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix.

Specifically, the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix are combined to obtain a correlation coefficient fusion matrix. The determinant value of the correlation coefficient fusion matrix is calculated, and the determinant value of the correlation coefficient fusion matrix is used as the complete matrix determinant value of the correlation coefficient fusion matrix.
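The combination step can be sketched as follows. The text only states that the three matrices are combined; the block layout used here, with the initiator and partner matrices on the diagonal and the plaintext cross-party matrix (and its transpose) off the diagonal, is an assumption, as are the function name `fuse` and the sample values.

```python
def fuse(r_a, r_b, r_ab):
    """Assemble [[R_A, R_AB], [R_AB^T, R_B]] from matrices given as nested lists.

    r_a  : initiator correlation matrix (p x p)
    r_b  : partner correlation matrix (q x q)
    r_ab : plaintext cross-party correlation matrix (p x q)
    """
    top = [row_a + row_ab for row_a, row_ab in zip(r_a, r_ab)]
    r_ba = [list(col) for col in zip(*r_ab)]  # transpose of r_ab (q x p)
    bottom = [row_ba + row_b for row_ba, row_b in zip(r_ba, r_b)]
    return top + bottom


# Hypothetical example: two initiator features, one partner feature.
r_a = [[1.0, 0.2], [0.2, 1.0]]
r_b = [[1.0]]
r_ab = [[0.5], [0.1]]
fused = fuse(r_a, r_b, r_ab)
```

Under this layout the fused matrix is symmetric with a unit diagonal, so it can be treated like an ordinary correlation matrix over all features of both parties.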

S250: Determine the feature matrix determinant value obtained after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column where a given piece of matrix feature data is located.

Here, matrix feature data refers to the feature data contained in the correlation coefficient fusion matrix. A feature row is the row of the correlation coefficient fusion matrix in which the matrix feature data is located, and a feature column is the column of the correlation coefficient fusion matrix in which the matrix feature data is located.

Specifically, each piece of matrix feature data constituting the correlation coefficient fusion matrix is determined; for each piece of matrix feature data, its feature row and feature column are deleted from the correlation coefficient fusion matrix to obtain the feature matrix corresponding to that matrix feature data, and the determinant value of that feature matrix is determined.

S260: Use the ratio between the feature matrix determinant value and the complete matrix determinant value as the multicollinearity value of the matrix feature data, integrate the multicollinearity values of all matrix feature data to determine the multicollinearity values of the initiator and the partner, and use the multicollinearity values of the initiator and the partner as the feature multicollinearity analysis result.

Specifically, the ratio between the feature matrix determinant value and the complete matrix determinant value is used as the multicollinearity value of the matrix feature data corresponding to that feature matrix, and the multicollinearity values of all matrix feature data are integrated to obtain the multicollinearity value of each piece of feature data of the initiator and of each piece of feature data of the partner. These multicollinearity values are used as the feature multicollinearity analysis result.
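As an illustration of S250 and S260, the sketch below computes the determinant ratio for each feature of a small fused matrix. The helper names, the Gaussian-elimination determinant and the sample matrix are assumptions made for the example. For a correlation matrix, this ratio equals the corresponding diagonal entry of the inverse matrix, which is the quantity usually called the variance inflation factor.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def multicollinearity_values(fused):
    """Ratio det(fused without row/column i) / det(fused) for every feature i.

    A singular fused matrix (perfect collinearity) is not handled here and
    would raise ZeroDivisionError.
    """
    full = det(fused)
    values = []
    for i in range(len(fused)):
        minor = [
            [v for j, v in enumerate(row) if j != i]
            for k, row in enumerate(fused)
            if k != i
        ]
        values.append(det(minor) / full)
    return values


# Hypothetical fused matrix: features 0 and 1 are strongly correlated,
# feature 2 is uncorrelated with both.
fused = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
values = multicollinearity_values(fused)
```

In this example the two correlated features receive a value of about 5.26 (1 / 0.19), while the independent feature receives a value of 1.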

S270: According to the feature multicollinearity analysis result, determine the model training data from the initiator data set and the partner data set, and use the model training data to train the federated learning model.

The technical solution of this embodiment proposes a method for calculating the multicollinearity values of the initiator and the partner by performing feature multicollinearity analysis on the initiator and the partner. The above scheme determines the correlation coefficient fusion matrix based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, so as to calculate the feature matrix determinant value corresponding to each feature data in the correlation coefficient fusion matrix based on the correlation coefficient fusion matrix. According to the feature matrix determinant value corresponding to each feature data and the complete matrix determinant value of the correlation coefficient fusion matrix, the accurate multicollinearity value of each feature data can be obtained, thereby improving the accuracy of the feature multicollinearity analysis results of the initiator and the partner.

Embodiment 3

FIG. 3 is a flow chart of a business indicator prediction method provided in Embodiment 3 of the present application. This embodiment is optimized on the basis of the above embodiment, and provides an implementation for filtering the feature data in the initiator's data set and the feature data in the partner's data set according to their data contribution. Specifically, as shown in FIG. 3, the method includes:

S310: Determine an initiator data set and a partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning.

S320: Determine the data contribution of the feature data in the initiator's data set and the feature data in the partner's data set.

Here, the data contribution can be determined by the IV (Information Value) of the data. The IV indicates the contribution of feature data to target prediction, that is, the predictive ability of the feature. Generally speaking, the higher the IV, the stronger the predictive ability of the feature and the higher its information contribution.

Specifically, the IV values of the feature data in the initiator's data set and the feature data in the partner's data set can be calculated by performing WOE (Weight of Evidence) weighted summation on the feature data in the initiator's data set and the feature data in the partner's data set, and the IV value can be used as the data contribution.
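A minimal sketch of the WOE-weighted summation is given below, using the conventional definition IV = Σ (p_good − p_bad) · ln(p_good / p_bad) over the bins of one feature. The binning, the good/bad labelling and the sample counts are assumptions for illustration; the text does not specify how the bins are formed or how label information is shared between the parties.

```python
import math


def information_value(bins):
    """Information Value of one feature.

    bins: list of (good_count, bad_count) tuples, one per bin.
    Assumes every bin has non-zero good and bad counts.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        p_good = good / total_good
        p_bad = bad / total_bad
        woe = math.log(p_good / p_bad)   # Weight of Evidence of this bin
        iv += (p_good - p_bad) * woe     # WOE-weighted contribution
    return iv


# Hypothetical bin counts for a single feature.
bins = [(80, 20), (60, 40), (60, 140)]
iv = information_value(bins)
```

A feature whose bins all have the same good/bad proportions has an IV of zero, which is why low-IV features are candidates for removal.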

For example, based on the data contribution, the feature data in the initiator's data set and the feature data in the partner's data set are subjected to WOE calculation, the IV value of each feature data is calculated, and the WOE calculation result is subjected to feature collinearity analysis. The data set is screened for feature collinearity using the IV value and the feature collinearity threshold. By comparing the contribution of the feature data before and after the feature collinearity screening, it can be seen that the contribution of the feature data changes from a negative number before screening to a positive number after screening. Therefore, feature collinearity analysis has an impact on the interpretability of the model.

S330: Filter the feature data in the initiator's data set and the feature data in the partner's data set according to the data contribution.

Specifically, according to actual needs, a contribution threshold is set, and according to the contribution threshold and the data contribution, the feature data in the initiator's data set and the feature data in the partner's data set are filtered and processed, and the feature data whose data contribution is less than the contribution threshold is filtered out from the feature data in the initiator's data set and the feature data in the partner's data set. For example, the contribution threshold may be 0.1.

It should be noted that all the features of the initiator and partner that have not been filtered are used for model training, and all the feature data in the initiator's data set and the feature data in the partner's data set have weights. Filtering is performed to train the model based on the feature data with higher contribution rates in the initiator's data set and the partner's data set. The feature data obtained after filtering will also have corresponding weights in the model training process. By comparing the weights of the same features in the two trainings, it is found that some feature data have negative weights before data filtering and positive weights after data filtering, which shows that multicollinearity analysis has an impact on the interpretability of the federated learning model. After filtering features through federated multicollinearity analysis, feature data with smaller IV values in the initiator's data set and the partner's data set are filtered out, which can improve the speed of model training.

S340: Encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result.

It should be noted that the feature data in the initiator data set and the feature data in the partner data set in this step are the feature data in the initiator data set and the feature data in the partner data set after filtering, respectively.

S350: Determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set.

S360: Perform feature multicollinearity analysis on the initiator and the partners based on the initiator correlation coefficient matrix, the partner correlation coefficient matrix, and the plaintext correlation coefficient matrix.

S370: According to the feature multicollinearity analysis result, determine the model training data from the initiator data set and the partner data set, and use the model training data to train the federated learning model.

The technical solution of this embodiment is to filter the feature data in the initiator's data set and the feature data in the partner's data set according to the contribution of the feature data in the initiator's data set and the contribution of the feature data in the partner's data set before determining the model training data based on the initiator's data set and the partner's data set, so as to obtain the training data of the federated learning model. The above solution can improve the model training speed while ensuring the reliability of the federated learning model.

Embodiment 4

FIG. 4 is a schematic diagram of the structure of a business indicator prediction device provided in Embodiment 4 of the present application. This embodiment can be applied to the case where a federated learning model for predicting business indicators is trained based on the business data of the initiator and the business data of the partner in vertical federated learning. As shown in FIG. 4, the business indicator prediction device includes: a data set determination module 410, a data set encryption transmission module 420, a correlation coefficient matrix determination module 430, a multicollinearity analysis module 440 and a model training module 450.

The data set determination module 410 is used to determine the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning;

The data set encryption transmission module 420 is used to encrypt the initiator data set through the initiator device and send the encrypted initiator data set to the partner device; and obtain the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine the plaintext correlation coefficient matrix according to the decryption result;

A correlation coefficient matrix determination module 430 is used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;

A multicollinearity analysis module 440 is used to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;

The model training module 450 is used to determine the model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and use the model training data to train the federated learning model; the federated learning model is used to predict the user's business indicator value.

The technical solution provided in this embodiment determines the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in vertical federated learning; encrypts the initiator data set through the initiator device and sends the encrypted initiator data set to the partner device; obtains the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypts the ciphertext correlation coefficient matrix through the initiator device, and determines the plaintext correlation coefficient matrix according to the decryption result; determines the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set; performs feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix; and, according to the feature multicollinearity analysis result, determines the model training data from the initiator data set and the partner data set and trains the federated learning model using the model training data. This solves the problem that, when the regression coefficient method is used to analyze the feature multicollinearity of the initiator and the partner in vertical federated learning, multiple rounds of interactive iterative training are required, the communication overhead and computational complexity are high, and the results depend on the model hyperparameter settings. The above solution calculates the feature multicollinearity of the initiator and the partner in vertical federated learning based on the correlation coefficient method. No parameter tuning is required during the calculation, which reduces the number of data interactions needed to calculate the feature multicollinearity of the initiator and the partner. The homomorphic encryption algorithm and the correlation coefficient method are used to determine the training data of the federated learning model through distributed computing, which realizes the training of the federated learning model, improves its training efficiency, and ensures the interpretability of the linear model in the federated learning model. Using the federated learning model to predict the user's business indicator value can improve the prediction accuracy of the business indicator value.

Exemplarily, the data set encryption transmission module 420 is specifically used for:

The initiator device uses a homomorphic encryption algorithm to generate a key pair, and uses the public key in the key pair to encrypt the initiator data set;

The encrypted initiator data set is sent to the partner device, so that the partner device calculates the ciphertext feature correlation coefficient between the feature data in the initiator data set and the feature data in the partner data set according to the encrypted initiator data set and the multiplication characteristics of the homomorphic encryption algorithm, and determines the ciphertext correlation coefficient matrix between the initiator and the partner according to the ciphertext feature correlation coefficient;

The ciphertext correlation coefficient matrix is sent to the initiator device, so that the ciphertext correlation coefficient matrix is decrypted by the initiator device using the private key in the key pair, and the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set is determined according to the decryption result.
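The exchange described above can be sketched for a single pair of features. The text does not name a specific homomorphic scheme; this sketch assumes a Paillier-style additively homomorphic scheme, in which raising a ciphertext to a plaintext power multiplies the encrypted value and multiplying ciphertexts adds the encrypted values. The tiny fixed primes, the fixed-point `SCALE`, the function names and the sample columns are all illustrative assumptions, and the parameters are far too small for real use.

```python
import math
import random

P, Q = 2**31 - 1, 2**61 - 1   # two Mersenne primes; insecure, demo only
SCALE = 10**6                 # fixed-point scale for real-valued features


def keygen():
    """Return the public modulus n (with g = n + 1) and the private key."""
    n = P * Q
    phi = (P - 1) * (Q - 1)
    return n, (phi, pow(phi, -1, n))


def encrypt(n, m):
    """Encrypt integer m; negative values are represented modulo n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    phi, mu = priv
    m = (pow(c, phi, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m   # map the upper half back to negatives


def encrypted_dot(n, enc_x, y):
    """Partner side: ciphertext of sum(x_i * y_i) from Enc(x_i) and plaintext y_i."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, y_i in zip(enc_x, y):
        acc = acc * pow(c, round(y_i * SCALE) % n, n2) % n2
    return acc


# Hypothetical standardized feature columns over the same four users.
x = [-1.3416, -0.4472, 0.4472, 1.3416]   # held by the initiator
y = [-1.3416, 0.4472, -0.4472, 1.3416]   # held by the partner

n, priv = keygen()                                    # initiator
enc_x = [encrypt(n, round(v * SCALE)) for v in x]     # initiator -> partner
enc_dot = encrypted_dot(n, enc_x, y)                  # partner -> initiator
rho = decrypt(n, priv, enc_dot) / SCALE**2 / len(x)   # initiator
```

Repeating the partner-side step for every pair of initiator and partner features would fill the ciphertext correlation coefficient matrix. In this sketch the division by the sample count and the removal of the fixed-point scale are done by the initiator after decryption; the text does not specify which party performs them.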

Exemplarily, the multicollinearity analysis module 440 is specifically used for:

Determine a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, and determine a complete matrix determinant value of the correlation coefficient fusion matrix;

Determine the feature matrix determinant value obtained after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column where the matrix feature data is located;

Use the ratio between the feature matrix determinant value and the complete matrix determinant value as the multicollinearity value of the matrix feature data, integrate the multicollinearity values of all matrix feature data to determine the multicollinearity values of the initiator and the partner, and use the multicollinearity values of the initiator and the partner as the feature multicollinearity analysis result.

Exemplarily, the data set determination module 410 is specifically used for:

Determine the users that the initiator and the partner have in common in the vertical federated learning, and determine the data intersection between the initiator business data of the initiator and the partner business data of the partner based on the identifiers of those common users;

Process the initiator business data and the partner business data separately according to the data intersection to determine the initiator data set and the partner data set.
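The alignment of the two parties' records on their common users can be sketched as follows. This is a plaintext illustration only: the record layout (a dictionary from user identifier to feature list) and the function name are assumptions, and a real deployment would normally find the common identifiers with a privacy-preserving intersection protocol rather than by exchanging identifiers directly.

```python
def align_by_user(initiator_rows, partner_rows):
    """Return the shared user IDs and each party's rows in the same user order."""
    shared = sorted(set(initiator_rows) & set(partner_rows))
    initiator_set = [initiator_rows[user] for user in shared]
    partner_set = [partner_rows[user] for user in shared]
    return shared, initiator_set, partner_set


# Hypothetical business data keyed by user identifier.
initiator_rows = {"u1": [35, 5200.0], "u2": [41, 6100.0], "u4": [29, 3900.0]}
partner_rows = {"u2": [3], "u3": [7], "u4": [1]}

shared, initiator_set, partner_set = align_by_user(initiator_rows, partner_rows)
```

After alignment, row `i` of both data sets refers to the same user, which is what the cross-party correlation calculation requires.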

Exemplarily, the correlation coefficient matrix determination module 430 is used to:

Standardize the feature data in the initiator's data set and the feature data in the partner's data set to determine the initiator's standardized data and the partner's standardized data;

According to the correlation coefficients between the characteristic data of the initiator's standardized data, the initiator correlation coefficient matrix of the initiator is determined, and according to the correlation coefficients between the characteristic data of the partner's standardized data, the partner correlation coefficient matrix of the partner is determined.
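The local calculation each party performs on its own features can be sketched as below. Using z-score standardization with the population standard deviation, so that the correlation coefficient reduces to the mean of the products of standardized values, is an assumption; the function names and sample columns are illustrative.

```python
import math


def standardize(column):
    """Z-score a feature column (population standard deviation).

    A constant column has zero deviation and is not handled here.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


def correlation_matrix(columns):
    """Pearson correlation matrix of feature columns held by one party."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


# Hypothetical feature columns of one party over four users.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 2.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
]
r = correlation_matrix(features)
```

Each party can run this on its own data without any communication; only the cross-party block needs the encrypted exchange.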

Exemplarily, the model training module 450 is specifically used for:

According to the feature multicollinearity analysis results and the multicollinearity threshold, the feature data in the initiator's dataset and the feature data in the partner's dataset are screened, and the feature data in the initiator's dataset and the partner's dataset whose multicollinearity values are less than the multicollinearity threshold are determined as model training data. The federated learning model is trained using the model training data.

Exemplarily, the business indicator prediction device further includes:

A data contribution determination module, used to determine the data contribution of the feature data in the initiator's data set and the feature data in the partner's data set;

The data filtering module is used to filter the feature data in the initiator's data set and the feature data in the partner's data set according to the data contribution.

The business indicator prediction device provided in this embodiment can be applied to the business indicator prediction method provided in any of the above embodiments, and has corresponding functions and beneficial effects.

Embodiment 5

FIG. 5 shows a block diagram of an electronic device 10 that can be used to implement an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (such as helmets, glasses, watches, etc.) and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present application described and/or claimed herein.

As shown in FIG. 5, the electronic device 10 includes at least one processor 11 and a memory connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc. The memory stores a computer program that can be executed by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other through a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The processor 11 executes the various methods and processes described above, such as a business indicator prediction method.

In some embodiments, the business indicator prediction method may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium, such as a storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the business indicator prediction method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the business indicator prediction method in any other appropriate manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include: being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which can be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

The computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that when the computer programs are executed by the processor, the functions/operations specified in the flow charts and/or block diagrams are implemented. The computer programs may be executed entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of this application, a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device. Other types of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes frontend components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.

A computing system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The client and server relationship is generated by computer programs running on the corresponding computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system to solve the defects of difficult management and weak business scalability in traditional physical hosts and VPS services.

It should be understood that steps can be reordered, added or deleted using the various forms of processes shown above. For example, the steps recorded in this application can be executed in parallel, sequentially or in different orders, as long as the expected results of the technical solution of this application can be achieved, and no limitation is imposed herein.

Claims

1. A business indicator prediction method, comprising:

Determining an initiator data set and a partner data set according to initiator business data of an initiator and partner business data of a partner in vertical federated learning;

Encrypting the initiator data set through an initiator device and sending the encrypted initiator data set to a partner device; obtaining a ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypting the ciphertext correlation coefficient matrix through the initiator device, and determining a plaintext correlation coefficient matrix according to the decryption result;

Determining an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;

Performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;

Determining model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and training a federated learning model using the model training data, wherein the federated learning model is used to predict a business indicator value of a user.
2. The method according to claim 1, wherein encrypting the initiator data set through the initiator device and sending the encrypted initiator data set to the partner device; obtaining the ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypting the ciphertext correlation coefficient matrix through the initiator device, and determining the plaintext correlation coefficient matrix according to the decryption result comprises:

Generating a key pair using a homomorphic encryption algorithm through the initiator device, and encrypting the initiator data set using a public key in the key pair;

Sending the encrypted initiator data set to the partner device, so that the partner device calculates ciphertext feature correlation coefficients between the feature data in the initiator data set and the feature data in the partner data set according to the encrypted initiator data set and the multiplication characteristic of the homomorphic encryption algorithm, and determines the ciphertext correlation coefficient matrix between the initiator and the partner according to the ciphertext feature correlation coefficients;

The ciphertext correlation coefficient matrix is sent to the initiator device, so that the initiator device decrypts the ciphertext correlation coefficient matrix using the private key in the key pair and determines, according to the decryption result, the plaintext correlation coefficient matrix between the feature data in the initiator data set and the feature data in the partner data set.
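
The exchange recited above can be illustrated with a small sketch. The claim does not name a particular scheme at this point, so the sketch assumes a toy Paillier cryptosystem and reads "the multiplication characteristic" as multiplying a ciphertext by a plaintext scalar. The key sizes, the fixed-point `SCALE`, the function names and the sample values are all hypothetical, and the parameters are far too small for real use.

```python
import math
import random

# Toy Paillier parameters (two Mersenne primes). Assumption for illustration only.
P, Q = 2**31 - 1, 2**61 - 1
SCALE = 10**6  # fixed-point scale used to encode floats as integers


def keygen(p=P, q=Q):
    """Initiator side: return the public modulus n and the private pair (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt a (possibly negative) integer m under the public modulus n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Decrypt c and map the result back to a signed integer."""
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def encrypted_dot(n, enc_x, y_ints):
    """Partner side: build Enc(sum x_i * y_i) from Enc(x_i) and plaintext y_i."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, y in zip(enc_x, y_ints):
        # Enc(a)^k decrypts to k*a; multiplying ciphertexts adds the plaintexts.
        acc = acc * pow(c, y % n, n2) % n2
    return acc


def to_fixed(values):
    return [round(v * SCALE) for v in values]


n, priv = keygen()
x = [1.2, -0.4, 0.1, -0.9]  # one standardized initiator feature (hypothetical)
y = [0.8, -0.3, 0.5, -1.0]  # one standardized partner feature (hypothetical)

enc_x = [encrypt(n, m) for m in to_fixed(x)]       # initiator encrypts and sends
cipher = encrypted_dot(n, enc_x, to_fixed(y))      # partner computes on ciphertext
dot = decrypt(n, priv, cipher) / SCALE**2          # initiator decrypts
corr_xy = dot / len(x)                             # mean product of standardized values
```

Repeating `encrypted_dot` for every pair of initiator and partner features would fill one entry of the ciphertext correlation coefficient matrix per pair; the partner never sees the initiator's plaintext values.
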
3. The method according to claim 1, wherein performing feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix comprises:

Determining a correlation coefficient fusion matrix according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix, and determining a complete matrix determinant value of the correlation coefficient fusion matrix;

Determining a feature matrix determinant value after deleting, from the correlation coefficient fusion matrix, the feature row and the feature column where matrix feature data is located;

Taking the ratio of the feature matrix determinant value to the complete matrix determinant value as the multicollinearity value of the matrix feature data, integrating the multicollinearity values of all matrix feature data to determine the multicollinearity values of the initiator and the partner, and taking the multicollinearity values of the initiator and the partner as the feature multicollinearity analysis result.
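
The determinant ratio recited above can be sketched as follows. For a correlation matrix, this ratio equals the corresponding diagonal entry of the inverse matrix, which is the variance inflation factor of that feature. The block layout of the fusion matrix (initiator block, cross block, partner block) is an assumption, as are the function names and the numbers.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    d = 1.0
    for i in range(size):
        pivot = max(range(i, size), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d  # a row swap flips the sign of the determinant
        d *= a[i][i]
        for r in range(i + 1, size):
            f = a[r][i] / a[i][i]
            for c in range(i, size):
                a[r][c] -= f * a[i][c]
    return d


def fuse(r_a, r_ab, r_b):
    """Assumed layout: [[R_A, R_AB], [R_AB transposed, R_B]]."""
    top = [ra + rab for ra, rab in zip(r_a, r_ab)]
    bottom = [[r_ab[i][j] for i in range(len(r_a))] + rb
              for j, rb in enumerate(r_b)]
    return top + bottom


def multicollinearity_values(fused):
    """For each feature j: det(fused without row j and column j) / det(fused)."""
    full = det(fused)
    if full == 0.0:
        # Perfectly collinear features make the complete determinant zero.
        return [float("inf")] * len(fused)
    values = []
    for j in range(len(fused)):
        minor = [[v for c, v in enumerate(row) if c != j]
                 for i, row in enumerate(fused) if i != j]
        values.append(det(minor) / full)
    return values


# Hypothetical matrices: two initiator features, one partner feature.
r_a = [[1.0, 0.5], [0.5, 1.0]]   # initiator correlation coefficient matrix
r_ab = [[0.2], [0.1]]            # decrypted cross-party (plaintext) matrix
r_b = [[1.0]]                    # partner correlation coefficient matrix

fused = fuse(r_a, r_ab, r_b)
vifs = multicollinearity_values(fused)
```
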
4. The method according to claim 1, wherein determining the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning comprises:

Determining the users shared by the initiator and the partner in the vertical federated learning, and determining the data intersection between the initiator business data of the initiator and the partner business data of the partner based on the identity identifiers of the shared users;

Processing the initiator business data and the partner business data respectively according to the data intersection to determine the initiator data set and the partner data set.
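
A minimal sketch of the alignment step recited above, assuming each party's business data is keyed by a user identifier. It uses a plain set intersection for clarity; how the identifiers are compared without disclosing non-shared users is not covered here. The function name and the sample records are hypothetical.

```python
def align_by_user_id(initiator_rows, partner_rows):
    """Keep only users present on both sides, in the same order on both sides.

    Each argument maps a user identifier to that party's feature vector.
    """
    shared_ids = sorted(initiator_rows.keys() & partner_rows.keys())
    initiator_set = [initiator_rows[uid] for uid in shared_ids]
    partner_set = [partner_rows[uid] for uid in shared_ids]
    return shared_ids, initiator_set, partner_set


# Hypothetical business data: the two parties hold different features.
initiator = {"u1": [35, 5200.0], "u2": [41, 8100.0], "u4": [29, 3900.0]}
partner = {"u2": [3, 0.12], "u3": [7, 0.40], "u4": [1, 0.05]}

ids, a_set, b_set = align_by_user_id(initiator, partner)
```

Row `i` of `a_set` and row `i` of `b_set` then describe the same user, which is what the later cross-party correlation needs.
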
5. The method according to claim 1, wherein determining the initiator correlation coefficient matrix of the initiator and the partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set comprises:

Standardizing the feature data in the initiator data set and the feature data in the partner data set to determine initiator standardized data and partner standardized data;

Determining the initiator correlation coefficient matrix of the initiator according to the correlation coefficients between the feature data of the initiator standardized data, and determining the partner correlation coefficient matrix of the partner according to the correlation coefficients between the feature data of the partner standardized data.
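
The local computation recited above can be sketched as follows, assuming z-score standardization with the population standard deviation and the Pearson correlation coefficient; the claim does not fix either choice. The function names and the sample data are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each column: subtract the mean, divide by the population std.

    A constant column has zero std and would need to be dropped beforehand.
    """
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out_cols.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*out_cols)]


def correlation_matrix(z_rows):
    """For standardized data, corr(j, k) is the mean of z_j * z_k over samples."""
    m = len(z_rows)
    cols = list(zip(*z_rows))
    return [[sum(a * b for a, b in zip(cj, ck)) / m for ck in cols]
            for cj in cols]


# Hypothetical initiator data set: 4 users, 3 features.
data = [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0], [3.0, 6.0, 8.0], [4.0, 8.0, 5.0]]
r_initiator = correlation_matrix(standardize(data))
```

Each party can run this on its own data set without any exchange; only the cross-party block requires the encrypted computation.
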
6. The method according to claim 1, wherein determining the model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and training the federated learning model using the model training data, comprises:

Screening the feature data in the initiator data set and the feature data in the partner data set according to the feature multicollinearity analysis result and a multicollinearity threshold, so as to determine, as the model training data, the feature data in the initiator data set and the partner data set whose multicollinearity values are less than the multicollinearity threshold; and training the federated learning model using the model training data.
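
The screening step recited above amounts to a threshold filter. The sketch below assumes the multicollinearity values are available per feature; the feature names, the values and the threshold of 10.0 are hypothetical, since the claim does not specify a threshold.

```python
def screen_features(values_by_feature, threshold):
    """Return the features whose multicollinearity value is below the threshold."""
    return [name for name, value in values_by_feature.items() if value < threshold]


# Hypothetical multicollinearity values for features held by the two parties.
values = {
    "initiator.age": 1.4,
    "initiator.income": 12.7,
    "partner.visits": 2.1,
    "partner.spend": 11.9,
}
kept = screen_features(values, threshold=10.0)
```

Only the columns named in `kept` would be passed on as model training data.
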
7. The method according to claim 1, wherein after determining the initiator data set and the partner data set according to the initiator business data of the initiator and the partner business data of the partner in the vertical federated learning, the method further comprises:

Determining the data contribution of the feature data in the initiator data set and of the feature data in the partner data set;

Filtering the feature data in the initiator data set and the feature data in the partner data set according to the data contribution.
8. A business indicator prediction device, comprising:

A data set determination module, used to determine an initiator data set and a partner data set according to initiator business data of an initiator and partner business data of a partner in vertical federated learning;

A data set encryption transmission module, used to encrypt the initiator data set through an initiator device and send the encrypted initiator data set to a partner device; and to obtain a ciphertext correlation coefficient matrix between the initiator and the partner from the partner device, decrypt the ciphertext correlation coefficient matrix through the initiator device, and determine a plaintext correlation coefficient matrix according to the decryption result;

A correlation coefficient matrix determination module, used to determine an initiator correlation coefficient matrix of the initiator and a partner correlation coefficient matrix of the partner according to the initiator data set and the partner data set;

A multicollinearity analysis module, used to perform feature multicollinearity analysis on the initiator and the partner according to the initiator correlation coefficient matrix, the partner correlation coefficient matrix and the plaintext correlation coefficient matrix;

A model training module, used to determine model training data from the initiator data set and the partner data set according to the result of the feature multicollinearity analysis, and to train a federated learning model using the model training data, wherein the federated learning model is used to predict a business indicator value of a user.
9. An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the business indicator prediction method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed, cause a processor to implement the business indicator prediction method according to any one of claims 1 to 7.