CN111079164A - Feature correlation calculation method, device, equipment and computer-readable storage medium - Google Patents

Feature correlation calculation method, device, equipment and computer-readable storage medium

Info

Publication number
CN111079164A
CN111079164A
Authority
CN
China
Prior art keywords
random number
data
feature
correlation
characteristic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911309526.7A
Other languages
Chinese (zh)
Other versions
CN111079164B (en
Inventor
谭明超
范涛
魏文斌
马国强
郑会钿
陈天健
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN201911309526.7A priority Critical patent/CN111079164B/en
Publication of CN111079164A publication Critical patent/CN111079164A/en
Application granted granted Critical
Publication of CN111079164B publication Critical patent/CN111079164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/606Protecting data by securing the transmission between two devices or processes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/58Random or pseudo-random number generators
    • G06F7/588Random number generators, i.e. based on natural stochastic processes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention discloses a feature correlation calculation method, apparatus, device and computer-readable storage medium. The method comprises: obtaining a standby random array for the current federal learning according to a random number protection mechanism, wherein the random arrays obtained according to the random number protection mechanism in multiple rounds of federal learning have different distribution laws; normalizing the first feature data whose correlation is to be calculated, and performing a random number adding operation on the processing result using the standby random array to obtain random number added feature data; and sending the random number added feature data to a second device participating in longitudinal federal learning, so that the second device, after normalizing the second feature data whose correlation is to be calculated, calculates a correlation value from the processing result and the random number added feature data. According to the invention, on the basis of ensuring data privacy, no additional encryption and decryption process is required, and the data transmission and calculation efficiency in calculating the feature correlation is greatly improved.

Description

Feature correlation calculation method, device, equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for calculating a feature correlation.
Background
With the rapid development and wide application of machine learning, feature engineering plays an increasingly important role. Feature engineering refers to the process of converting raw data into training data for a model, and its purpose is to obtain better training features so that model performance improves. Feature engineering generally comprises three parts: feature construction, feature extraction and feature selection, and feature correlation calculation is an important method in the feature selection part. By calculating the correlation among features, redundant features are eliminated and the features that contribute most to model training are retained. For example, to examine which features influence ice cream sales, features such as temperature, season, whether it is hot and whether it is summer are highly correlated with one another; if correlation calculation and feature selection are not performed and all of these features are used as training data, noise is added to the model.
At present, in the application scenario of longitudinal federal learning, the participants in federal learning hold different data features, the correlation of the features needs to be calculated jointly, and features are then selected according to the correlation. To ensure data privacy for the participants, the parties usually cannot transmit raw data directly. In the current scheme, party A encrypts its data and sends the encrypted data to party B, and party B calculates the correlation from the received data and returns it. This encrypted transmission of data greatly reduces the efficiency of data transmission and calculation.
Disclosure of Invention
The invention mainly aims to provide a feature correlation calculation method, apparatus, device and computer-readable storage medium, and aims to solve the technical problem of low data transmission and calculation efficiency in feature correlation calculation in the prior art.
In order to achieve the above object, the present invention provides a feature correlation calculation method applied to a first device participating in longitudinal federal learning, the feature correlation calculation method including:
acquiring standby random arrays of the federal learning according to a random number protection mechanism, wherein the random arrays acquired according to the random number protection mechanism in the multi-time federal learning are different in distribution rule;
normalizing the first characteristic data of the correlation to be calculated, and performing random number adding operation on a processing result by using the standby random number group to obtain random number added characteristic data;
and sending the characteristic data of the added random number to second equipment participating in longitudinal federal learning so that the second equipment can calculate the correlation value of the first characteristic data and the second characteristic data according to the processing result and the characteristic data of the added random number after normalization processing is carried out on the second characteristic data of the correlation to be calculated by the second equipment.
Optionally, the step of obtaining the standby random number group for the federal learning according to a random number protection mechanism includes:
and randomly selecting one group from a plurality of groups of random arrays with different distribution laws generated in advance as a standby random array for the federal study.
Optionally, the step of normalizing the first feature data of the correlation to be calculated includes:
calculating the mean value and the standard deviation of each data in the first characteristic data of the correlation to be calculated;
and subtracting the mean value from each data in the first characteristic data respectively and dividing the result by the standard deviation.
Optionally, the step of performing a random number adding operation on the processing result by using the standby random number group to obtain the random number added feature data includes:
and respectively adding different random numbers in the random number group to be used to each data in the first characteristic data after normalization processing to obtain the characteristic data of the added random numbers.
Optionally, after the step of sending the random number added feature data to the second device participating in longitudinal federal learning, the method further includes:
receiving the correlation value sent by the second device;
and performing feature selection on the first feature data and the second feature data according to the correlation values.
Optionally, when the first feature data is stored in different execution machines of the distributed cluster, the step of normalizing the first feature data of the correlation to be calculated includes:
and performing normalization processing on the first characteristic data of the correlation to be calculated by adopting a distributed calculation mode.
Optionally, before the step of obtaining the standby random number group for the federal learning according to the random number protection mechanism, the method further includes:
performing sample alignment on the local sample data set and the second equipment to obtain an aligned sample data set;
determining the first feature data for which a correlation is to be computed from an alignment sample data set.
Further, to achieve the above object, the present invention also provides a feature correlation calculation apparatus including a memory, a processor, and a feature correlation calculation program stored on the memory and executable on the processor, the feature correlation calculation program implementing the steps of the feature correlation calculation method as described above when executed by the processor.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a feature correlation calculation program which, when executed by a processor, realizes the steps of the feature correlation calculation method as described above.
In the invention, a first device obtains a standby random array for the current federal learning according to a random number protection mechanism, wherein the random arrays obtained according to the random number protection mechanism in multiple rounds of federal learning have different distribution laws; normalizes the first feature data whose correlation is to be calculated, and performs a random number adding operation on the processing result using the standby random array to obtain random number added feature data; and sends the random number added feature data to a second device participating in longitudinal federal learning, so that the second device, after normalizing the second feature data whose correlation is to be calculated, calculates a correlation value of the first feature data and the second feature data from the processing result and the random number added feature data. Because the first device normalizes the first feature data and adds random numbers before sending it to the second device, the second device cannot deduce the original first feature data from the random number added feature data, which ensures the data security of the first device. Moreover, because the first device performs normalization and random number addition on the first feature data rather than an encryption operation, the second device can calculate the correlation value directly from the data sent by the first device, without an additional encryption and decryption process and without additional transmission load, thereby greatly improving data transmission and calculation efficiency when calculating the feature correlation on the basis of ensuring data security. In addition, because the first device obtains the random array through a random number protection mechanism, the distribution laws of the random arrays adopted in multiple rounds of federal learning are different, so the second device cannot crack the distribution law of the first device's random arrays through multiple rounds of federal learning, nor crack the original feature data of the first device; this prevents the privacy of the first device from being revealed to the second device and further improves the data security of the first device.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating a first embodiment of a method for calculating a feature correlation according to the present invention;
FIG. 3 is a schematic diagram illustrating joint computation of feature correlation values for parties A and B according to an embodiment of the present invention;
FIG. 4 is a block diagram of a feature correlation calculation apparatus according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
An embodiment of the present invention provides a feature correlation computing device, and referring to fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 is a schematic structural diagram of a hardware operating environment of a feature correlation computing device. The characteristic correlation computing device of the embodiment of the invention can be a PC, and can also be a terminal device with a display function, such as a smart phone, a smart television, a tablet computer, a portable computer and the like.
As shown in fig. 1, the feature correlation calculation device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the feature correlation computing device may further include a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, a WiFi module, and so forth. Those skilled in the art will appreciate that the feature correlation computing device configuration shown in FIG. 1 does not constitute a limitation of the feature correlation computing device, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a feature correlation calculation program.
In the feature correlation calculation apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting other participating apparatuses participating in longitudinal federal learning, and performing data communication with the other participating apparatuses; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the feature correlation calculation program stored in the memory 1005 and perform the following operations:
acquiring standby random arrays of the federal learning according to a random number protection mechanism, wherein the random arrays acquired according to the random number protection mechanism in the multi-time federal learning are different in distribution rule;
normalizing the first characteristic data of the correlation to be calculated, and performing random number adding operation on a processing result by using the standby random number group to obtain random number added characteristic data;
and sending the characteristic data of the added random number to second equipment participating in longitudinal federal learning so that the second equipment can calculate the correlation value of the first characteristic data and the second characteristic data according to the processing result and the characteristic data of the added random number after normalization processing is carried out on the second characteristic data of the correlation to be calculated by the second equipment.
Further, the step of obtaining the standby random array for the federal learning according to a random number protection mechanism includes:
and randomly selecting one group from a plurality of groups of random arrays with different distribution laws generated in advance as a standby random array for the federal study.
Further, the step of normalizing the first feature data of the correlation to be calculated includes:
calculating the mean value and the standard deviation of each data in the first characteristic data of the correlation to be calculated;
and subtracting the mean value from each data in the first characteristic data respectively and dividing the result by the standard deviation.
Further, the step of performing the random number adding operation on the processing result by using the standby random number group to obtain the random number added characteristic data comprises:
and respectively adding different random numbers in the random number group to be used to each data in the first characteristic data after normalization processing to obtain the characteristic data of the added random numbers.
Further, after the step of sending the random number added feature data to the second device participating in longitudinal federal learning, the processor 1001 may be configured to call the feature correlation calculation program stored in the memory 1005, and further perform the following operations:
receiving the correlation value sent by the second device;
and performing feature selection on the first feature data and the second feature data according to the correlation values.
Further, when the first feature data is stored in different execution machines of the distributed cluster, the step of normalizing the first feature data of the correlation to be calculated includes:
and performing normalization processing on the first characteristic data of the correlation to be calculated by adopting a distributed calculation mode.
Further, before the step of obtaining the standby random array for the current federal learning according to the random number protection mechanism, the processor 1001 may be configured to call the feature correlation calculation program stored in the memory 1005, and further perform the following operations:
performing sample alignment on the local sample data set and the second equipment to obtain an aligned sample data set;
determining the first feature data for which a correlation is to be computed from an alignment sample data set.
Based on the hardware structure, various embodiments of the feature correlation calculation method of the present invention are proposed.
Referring to fig. 2, a first embodiment of the feature correlation calculation method of the present invention provides a feature correlation calculation method. It is noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that here. The feature correlation calculation method is applied to a first device participating in longitudinal federal learning, the first device is in communication connection with a second device participating in longitudinal federal learning, and the first device and the second device may be servers, or terminal devices such as PCs, smart phones, smart televisions, tablet computers and portable computers. The feature correlation calculation method comprises the following steps:
step S10, obtaining standby random arrays of the federal study according to a random number protection mechanism, wherein the distribution rules of the random arrays obtained according to the random number protection mechanism in the multi-time federal study are different;
in this embodiment, each participant in the vertical federal learning locally owns a part of data, and each part of data has less overlap in the feature dimension and more overlap in the user dimension. The following description of the present embodiment is made by using a first device and a second device participating in longitudinal federated learning, and it should be understood that the feature correlation calculation method described in the present embodiment may be generalized to a plurality of participating devices participating in longitudinal federated learning.
Specifically, the first device and the second device may establish a communication connection in advance, determine the two features for which the correlation needs to be calculated, and determine the feature data used to calculate the correlation of the two features; that is, the first device determines the first feature data whose correlation is to be calculated, and the second device determines the second feature data whose correlation is to be calculated. The correlation between features needs to be calculated from feature data; for example, to calculate the correlation between features X5 and X1, the feature data of users U1, U2 and U3 under feature X5 in the first device and the feature data of users U1, U2 and U3 under feature X1 in the second device are needed. The feature data may be represented as a vector, where the elements of the vector are the data of each user under the feature and the dimension of the vector is the number of users shared by the first device and the second device. For example, if the feature whose correlation is to be calculated is age, the feature data are the age value of user U1, the age value of user U2 and the age value of user U3.
The first device obtains the standby random array for the current federal learning according to a random number protection mechanism, where the distribution laws of the random arrays obtained according to the random number protection mechanism in multiple rounds of federal learning are different. That is, the first device acquires the random array according to a random number protection mechanism, and this mechanism ensures that the distribution laws of the random arrays adopted by the first device in multiple rounds of federal learning differ. It should be appreciated that a variety of random number protection mechanisms can achieve this. For example, the random array may be changed relative to the one adopted in the previous round of federal learning, so that the distribution law of the random array used in the current round differs from that of the previous round. Because the distribution laws of the random arrays adopted by the first device in multiple rounds of federal learning are different, the second device cannot crack the distribution law of the first device's random arrays through multiple rounds of federal learning, and therefore cannot crack the original feature data of the first device, which prevents the privacy of the first device from being revealed to the second device and improves the data security of the first device.
Step S20, normalization processing is carried out on the first characteristic data of the correlation to be calculated, and random number adding operation is carried out on the processing result by adopting the standby random number group to obtain random number added characteristic data;
the first device firstly carries out normalization processing on first feature data of the correlation to be calculated, and then carries out random number adding operation on the result after the normalization processing by adopting a standby random number group to obtain random number added feature data. Because the data of the first device and the second device have privacy, the data of the first device cannot be directly sent to the second device, and therefore, after the first device normalizes the first feature data of the correlation to be calculated, the random number is added to the processing result to obtain the feature data of the added random number, so that the original first feature data cannot be deduced according to the feature data of the added random number. The random number adding operation may be to add the random number in the random array to be used to each data in the first feature data after the normalization processing, or to subtract the random number in the random array to be used from each data. For example, the first device: the age value of the user U1, the age value of the user U2 and the age value of the user U3 are normalized to obtain { x1, x2 and x3}, and random numbers r are added to each data to obtain random number added feature data { x1+ r, x2+ r, x3+ r }.
Further, the step of normalizing the first feature data of the correlation to be calculated in step S20 includes:
step S201, calculating the mean value and standard deviation of each data in the first characteristic data of the correlation to be calculated;
step S202, subtracting the mean value from each data in the first feature data, and dividing the result by the standard deviation.
The first device may perform normalization processing on the first feature data in the following manner: calculating the mean and the standard deviation of the data in the first feature data. If the first feature data is {x1, x2, x3}, the mean is μ_x = (x1 + x2 + x3)/3 and the standard deviation is
σ_x = √( [ (x1 − μ_x)² + (x2 − μ_x)² + (x3 − μ_x)² ] / 3 )
The mean is then subtracted from each datum in the first feature data and the result is divided by the standard deviation, i.e. (x1 − μ_x)/σ_x, (x2 − μ_x)/σ_x and (x3 − μ_x)/σ_x are calculated, giving the normalized result for each datum. For example, when the first feature data are the age value of user U1, the age value of user U2 and the age value of user U3, the first device finds the mean and standard deviation of the three age values, subtracts the mean from each of the three age values and divides by the standard deviation, obtaining the normalized results of the three age values.
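A minimal Python sketch of steps S201 and S202, assuming numpy is available; the function and variable names (normalize, x_norm) are illustrative and the population standard deviation is assumed:

    import numpy as np

    def normalize(feature_data):
        # Steps S201-S202: subtract the mean of the feature column and divide by
        # its standard deviation (population standard deviation assumed here)
        data = np.asarray(feature_data, dtype=float)
        return (data - data.mean()) / data.std()

    # e.g. the age values of users U1, U2 and U3 held by the first device
    x_norm = normalize([23.0, 35.0, 47.0])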
Further, in order to enhance the security of the data in the first device, the step S20 of performing the random number adding operation on the processing result by using the standby random number group to obtain the characteristic data of the added random number includes:
step S203, different random numbers in the standby random number group are respectively added to each data in the first characteristic data after normalization processing, and random number added characteristic data are obtained.
The first device adds a different random number from the standby random array to each datum of the normalized first feature data to obtain the random number added feature data. The number of random numbers in the random array is the same as the number of data in the first feature data, and all the random numbers are different from one another. If the random numbers are r1, r2 and r3, different random numbers are added to the normalized first feature data {x1, x2, x3}, giving the random number added feature data {x1+r1, x2+r2, x3+r3}. Because the first device adds a different random number to each datum of the normalized first feature data, the first feature data is protected more securely, which further enhances the security of the data in the first device.
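A minimal sketch of the random number adding operation of step S203, assuming the standby random array has already been obtained and, for illustration, is drawn from a normal distribution; all names are illustrative:

    import numpy as np

    # Normalized first feature data, e.g. the result of steps S201-S202
    x_norm = np.array([-1.2247, 0.0, 1.2247])

    # Standby random array: one distinct random number per datum
    rng = np.random.default_rng()
    r = rng.normal(size=x_norm.shape)

    # Step S203: add a different random number to each datum
    masked_x = x_norm + r   # corresponds to {x1+r1, x2+r2, x3+r3}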
Step S30, sending the random number added feature data to a second device participating in longitudinal federal learning, so that the second device performs normalization processing on second feature data to be correlated, and then calculates a correlation value between the first feature data and the second feature data according to a processing result and the random number added feature data.
After the first device obtains the random number added feature data, it sends the random number added feature data to the second device. The second device normalizes the second feature data whose correlation is to be calculated, where the normalization process is the same as the normalization performed by the first device on the first feature data. The second device then calculates the correlation value of the first feature data and the second feature data from the received random number added feature data and the normalized second feature data. Specifically, the second device may calculate the inner product of the random number added feature data and the normalized second feature data, and the result is the correlation value between the first feature data and the second feature data.
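A sketch of the second device's side of the computation. The description above speaks of an inner product; the sketch uses the mean of the element-wise product, which differs from the raw inner product only by the constant factor 1/n and matches the expectation form used in the derivation below. Names are illustrative:

    import numpy as np

    def correlation_on_second_device(masked_x, second_feature_data):
        # Normalize the local second feature data in the same way as the first device
        y = np.asarray(second_feature_data, dtype=float)
        y_norm = (y - y.mean()) / y.std()
        # Mean of the element-wise product with the received random number added
        # feature data; the random terms vanish in expectation
        return float(np.mean(np.asarray(masked_x, dtype=float) * y_norm))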
The mathematical principle of the joint calculation of the feature correlation by the first device and the second device in the embodiment is derived as follows:
the correlation coefficient is a quantity used to measure the degree of linear correlation between two variables, and is commonly referred to as the pearson coefficient. The formula for the Pearson coefficient is:
P = cov(x, y) / (σ_x·σ_y) = E[(X − μ_x)(Y − μ_y)] / (σ_x·σ_y)   (1)
In equation (1), x and y are the two variables, cov(x, y) denotes their covariance, and σ_x and σ_y are the standard deviations of the two variables. X and Y are the two vectors formed by the values of x and y, μ_x and μ_y are the means of the elements of the two vectors, and E denotes expectation.
As can be seen from equation (1), the correlation between two columns of data is obtained by dividing the covariance of the two columns by the product of their standard deviations. By transformation, the Pearson correlation equals the inner product (the expectation of the element-wise product) of the two feature vectors after each has had its mean subtracted and been divided by its standard deviation. Subtracting the mean and dividing by the standard deviation is exactly the normalization operation, so the Pearson correlation can be regarded as the result of the inner product of the two normalized feature vectors.
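In terms of the normalized vectors X̂ = (X − μ_x)/σ_x and Ŷ = (Y − μ_y)/σ_y (notation introduced here only for convenience), equation (1) can equivalently be written as
P = E[X̂ · Ŷ]
since σ_x and σ_y are constants and can be moved inside the expectation.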
Assuming that the X vector belongs to party a (the first device) and the Y vector belongs to party B (the second device), party a normalizes the X vector, adds the random number R to send to party B, and party B normalizes the Y vector, and calculates the correlation value using the processing result and the result of adding the random number sent by party a, the above formula can be transformed into:
E[((X − μ_x)/σ_x + R)·((Y − μ_y)/σ_y)] = E[((X − μ_x)/σ_x)·((Y − μ_y)/σ_y)] + E[R]·E[(Y − μ_y)/σ_y] = P + 0 = P   (2)
Since R is a random number added by party A itself, it is independent of the normalized Y vector, so the expectation of their product equals the product of their expectations. Meanwhile, the expectation of the Y vector after normalization is 0, so the product of the latter two terms is zero. The final result is therefore exactly the Pearson correlation value P of the X vector and the Y vector.
As shown in fig. 3, a diagram of the joint calculation of the correlation value between the feature X of the party a and the feature Y of the party B is shown for the parties a and B.
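The derivation can also be checked numerically. The following sketch (illustrative only; it simulates both parties in one process and assumes a normally distributed random array) compares the jointly computed value with the Pearson coefficient computed directly on the raw data:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10000
    x = rng.normal(size=n)                 # party A's feature (X vector)
    y = 0.6 * x + rng.normal(size=n)       # party B's feature (Y vector)

    # Party A: normalize and add a random array
    x_norm = (x - x.mean()) / x.std()
    masked_x = x_norm + rng.normal(size=n)

    # Party B: normalize its own feature and take the mean of the element-wise product
    y_norm = (y - y.mean()) / y.std()
    joint_p = float(np.mean(masked_x * y_norm))

    direct_p = float(np.corrcoef(x, y)[0, 1])
    # joint_p approaches direct_p as n grows; the random terms only vanish in expectation
    print(joint_p, direct_p)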
In this embodiment, a standby random array for the federal learning is obtained through a first device according to a random number protection mechanism, wherein the random arrays obtained according to the random number protection mechanism in a plurality of times of federal learning are different in distribution rule; normalizing the first characteristic data of the correlation to be calculated, and performing random number adding operation on a processing result by adopting a standby random number group to obtain random number added characteristic data; and sending the random number added feature data to second equipment participating in longitudinal federal learning so that the second equipment can carry out normalization processing on the second feature data of the correlation to be calculated, and then calculating according to the processing result and the random number added feature data to obtain a correlation value of the first feature data and the second feature data. Because the first device performs normalization processing on the first characteristic data and adds the random number, and then sends the first characteristic data to the second device, the second device cannot deduce the original first characteristic data by adding the random number characteristic data, and thus the data security of the first device is ensured. Moreover, since the first device performs normalization and random number addition operation on the first feature data, but not encryption operation, the second device can directly calculate the correlation value according to the data sent by the first device without performing an additional encryption and decryption process or additionally increasing transmission load, so that the scheme of the embodiment greatly improves the data transmission and calculation efficiency when calculating the feature correlation on the basis of ensuring the privacy of data transmission. In addition, the first device acquires the random array by adopting a random array protection mechanism, so that the distribution rules of the random arrays adopted in the multi-time federal learning are different, the second device cannot break the distribution rule of the random array of the first device through the random array of the multi-time federal learning, and the original characteristic data of the first device cannot be broken, the privacy of the first device is prevented from being revealed to the second device, and the data security of the first device is further improved.
Further, based on the first embodiment, a second embodiment of the feature correlation calculation method of the present invention provides a feature correlation calculation method. In this embodiment, the step S10 includes:
and S101, randomly selecting one random array from a plurality of random arrays with different pre-generated distribution laws to be used as the standby random array for the federal study.
In this embodiment, the random number protection mechanism may be: the first equipment generates multiple groups of random arrays with different distribution laws in advance, and then randomly selects one group from the multiple groups of random arrays as a standby random array for the federal study. The first device can store a plurality of groups of random arrays as random number files respectively, and randomly select one file when the file is needed. By generating a plurality of groups of random arrays in advance for storage, the standby random arrays can be quickly obtained when the random arrays need to be used. And the distribution rules of the multiple groups of random arrays are different, and the random array is randomly selected by the first equipment during each federal learning, so that the distribution rules of the random array adopted by the first equipment for many times of federal learning are different, the problem that the second equipment breaks the distribution rule of the random array of the first equipment through many times of federal learning is avoided, the original characteristic data of the first equipment cannot be broken by the second equipment, and the data security of the first equipment is further improved.
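A minimal sketch of this mechanism, assuming the pre-generated groups are kept as in-memory arrays rather than as random number files; the distributions and names are illustrative assumptions:

    import numpy as np

    def pregenerate_random_arrays(length, seed=None):
        # Several groups of random arrays with different distribution laws,
        # generated in advance by the first device
        rng = np.random.default_rng(seed)
        return [
            rng.normal(loc=0.0, scale=1.0, size=length),
            rng.normal(loc=5.0, scale=3.0, size=length),
            rng.uniform(low=-10.0, high=10.0, size=length),
            rng.exponential(scale=2.0, size=length),
        ]

    def pick_standby_array(groups):
        # Randomly select one group as the standby random array for the current
        # round of federal learning, so the distribution law differs between rounds
        rng = np.random.default_rng()
        return groups[rng.integers(len(groups))]

    groups = pregenerate_random_arrays(length=3)
    standby = pick_standby_array(groups)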
Further, step S30 is followed by:
step S40, receiving the correlation value sent by the second device;
step S50, performing feature selection on the first feature data and the second feature data according to the correlation value.
In this embodiment, after calculating the correlation value between the first feature data and the second feature data, the second device may send the correlation value to the first device. The first device receives the correlation value sent by the second device and uses it to perform feature selection on the first feature data and the second feature data. Specifically, a correlation threshold may be set in the first device; if the correlation value is greater than the correlation threshold, the correlation between the first feature data and the second feature data is high, and the first device may remove one of the two features. For example, if the first feature is season, the second feature is temperature, and the calculated correlation value is greater than the correlation threshold, the first device may remove either the season feature or the temperature feature, that is, the data under that feature is not used in the subsequent federal learning model training process. If the correlation value is not greater than the correlation threshold, both the first feature data and the second feature data may be retained.
It should be noted that the second device may also perform feature selection directly according to the correlation value after the correlation value is calculated.
In this embodiment, the second device sends the correlation value to the first device, and the first device performs feature selection according to the correlation value, so that a higher-quality model can be obtained through subsequent modeling and model training of the federal learning model.
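A sketch of the threshold-based feature selection described above; the threshold value and all names are illustrative assumptions:

    def select_features(correlation_value, first_feature, second_feature, threshold=0.8):
        # If the two features are highly correlated, drop one of them; otherwise
        # keep both for the subsequent federal learning model training
        if abs(correlation_value) > threshold:
            return [first_feature]            # e.g. keep "season", drop "temperature"
        return [first_feature, second_feature]

    kept = select_features(0.93, "season", "temperature")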
Further, when the first feature data is stored in different execution machines of the distributed cluster, the step of performing the normalization processing on the first feature data with the correlation to be calculated in step S20 includes:
and step S204, performing the normalization processing on the first characteristic data of the correlation to be calculated in a distributed calculation mode.
In this embodiment, the first feature data of the first device may be stored in different execution machines of the distributed cluster, and if the different execution machines store data of different users, the first device may perform normalization processing on the first feature data by using a distributed computing method when performing normalization processing on the first feature data. Specifically, different execution machines normalize respective local partial first feature data, and then send the results of the normalization processing to the first device, and the first device summarizes the results of the execution machines and performs the operation of adding random numbers.
Similarly, when the second device performs normalization processing on the second data feature, the normalization processing may also be performed in a distributed computing manner.
In this embodiment, the first device and the second device may perform normalization processing in a distributed computing manner, which reduces the computing resource consumption of the first device and the second device, and increases the speed of normalization processing, thereby increasing the efficiency of the whole longitudinal federal learning modeling process.
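A rough sketch of distributed normalization. The embodiment above lets each execution machine normalize its local part; the sketch instead aggregates global statistics first and then normalizes each partition with them, which is one common way to keep the partition-wise results consistent and is an assumption here rather than the patent's exact procedure:

    import numpy as np

    def distributed_normalize(partitions):
        # Each execution machine holds one partition of the first feature data.
        # Aggregate count, sum and sum of squares to obtain the global mean and
        # standard deviation, then normalize every partition with the global values.
        n = sum(len(p) for p in partitions)
        total = sum(float(np.sum(p)) for p in partitions)
        total_sq = sum(float(np.sum(np.square(p))) for p in partitions)
        mean = total / n
        std = np.sqrt(total_sq / n - mean ** 2)
        return [(np.asarray(p, dtype=float) - mean) / std for p in partitions]

    parts = [np.array([23.0, 35.0]), np.array([47.0])]
    normalized_parts = distributed_normalize(parts)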
Further, before step S10, the method further includes:
step S60, carrying out sample alignment on the local sample data set and the second equipment to obtain an aligned sample data set;
step S70, determining the first feature data of the correlation to be calculated from the alignment sample data set.
The first device may perform sample alignment using the local sample data set and the sample data set in the second device, to obtain an aligned sample data set. And the first equipment and the second equipment carry out sample alignment, determine users shared by the two equipment, and take the data of the shared users as an alignment sample data set. For example, the user dimension in the first device is { U1, U2, U3, U4}, the feature dimension is { X4, X5}, the user dimension in the second device is { U1, U2, U3, U4, U5}, the feature dimension is { X1, X2, X3}, and the data tag Y is also included in the second device; the first device and the second device determine that the common user is { U1, U2, U3} through sample alignment, and the aligned sample data set is a sample data set formed by data of the user { U1, U2, U3} in the first device under the characteristic { X4, X5 }.
The first device and the second device may jointly calculate the correlation between the two different features, for example, the correlations between the features X4 and X5 and the features X1, X2, X3 and Y may be calculated, respectively. After determining the characteristics of the correlation to be calculated, the first device takes the data of each user in the alignment sample data set under the characteristics as the first characteristic data of the correlation to be calculated.
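A simplified sketch of sample alignment using a plain intersection of user identifiers; real federated deployments typically use privacy-preserving set intersection, so this simplification and all names are assumptions for illustration:

    def align_samples(first_data, second_user_ids):
        # first_data: {user_id: {feature_name: value}} held by the first device
        # second_user_ids: user identifiers shared by the second device
        common = sorted(set(first_data) & set(second_user_ids))
        return {uid: first_data[uid] for uid in common}

    first_data = {"U1": {"X4": 1.0, "X5": 23.0},
                  "U2": {"X4": 0.0, "X5": 35.0},
                  "U3": {"X4": 1.0, "X5": 47.0},
                  "U4": {"X4": 0.0, "X5": 52.0}}
    aligned = align_samples(first_data, ["U1", "U2", "U3", "U5"])
    # first feature data whose correlation is to be calculated, e.g. feature X5
    x5 = [aligned[u]["X5"] for u in sorted(aligned)]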
In addition, an embodiment of the present invention further provides a feature correlation calculation apparatus, where the feature correlation calculation apparatus is deployed in a first device participating in longitudinal federal learning, and referring to fig. 4, the feature correlation calculation apparatus includes:
the acquiring module 10 is configured to acquire a standby random array for the federal learning according to a random number protection mechanism, where distribution rules of the random array acquired according to the random number protection mechanism in multiple federal learning are different;
the processing module 20 is configured to perform normalization processing on the first feature data of the correlation to be calculated, and perform random number adding operation on a processing result by using the standby random number group to obtain random number added feature data;
the sending module 30 is configured to send the random number added feature data to a second device participating in longitudinal federated learning, so that the second device performs normalization processing on second feature data to be correlated, and then calculates a correlation value between the first feature data and the second feature data according to a processing result and the random number added feature data.
Further, the obtaining module 10 includes:
and the selecting unit randomly selects one group from a plurality of groups of random arrays with different pre-generated distribution laws as a standby random array for the federal study.
Further, the processing module 20 includes:
the calculation unit is used for calculating the mean value and the standard deviation of each datum in the first feature data of the correlation to be calculated; and subtracting the mean value from each data in the first characteristic data respectively and dividing the result by the standard deviation.
Further, the processing module 20 includes:
and the random number adding unit is used for respectively adding different random numbers in the standby random number group to each data in the first characteristic data after normalization processing to obtain the characteristic data of the added random numbers.
Further, the feature correlation calculation means further includes:
a receiving module, configured to receive the correlation value sent by the second device;
and the characteristic selection module is used for selecting the characteristics of the first characteristic data and the second characteristic data according to the relevant values.
Further, when the first feature data is stored in different execution machines of a distributed cluster, the processing module 20 comprises:
and the distributed computing unit is used for performing normalization processing on the first characteristic data of the correlation to be computed by adopting a distributed computing mode.
Further, the feature correlation calculation means further includes:
the sample alignment module is used for carrying out sample alignment on the local sample data set and the second equipment to obtain an aligned sample data set;
a determining module, configured to determine the first feature data of the correlation to be calculated from the alignment sample data set.
The specific implementation of the feature correlation calculation apparatus of the present invention has basically the same extension as that of each embodiment of the feature correlation calculation method, and is not described herein again.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, on which a feature correlation calculation program is stored, and when the feature correlation calculation program is executed by a processor, the steps of the feature correlation calculation method are implemented.
The specific implementation of the feature correlation calculation device and the computer-readable storage medium of the present invention has substantially the same expansion content as the embodiments of the feature correlation calculation method, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A feature correlation calculation method is applied to first equipment participating in longitudinal federal learning, and comprises the following steps:
acquiring standby random arrays of the federal learning according to a random number protection mechanism, wherein the random arrays acquired according to the random number protection mechanism in the multi-time federal learning are different in distribution rule;
normalizing the first characteristic data of the correlation to be calculated, and performing random number adding operation on a processing result by using the standby random number group to obtain random number added characteristic data;
and sending the characteristic data of the added random number to second equipment participating in longitudinal federal learning so that the second equipment can calculate the correlation value of the first characteristic data and the second characteristic data according to the processing result and the characteristic data of the added random number after normalization processing is carried out on the second characteristic data of the correlation to be calculated by the second equipment.
2. The feature correlation calculation method according to claim 1, wherein the step of obtaining the standby random number group for the current federal learning according to the random number protection mechanism comprises:
and randomly selecting one group from a plurality of groups of random arrays with different distribution laws generated in advance as a standby random array for the federal study.
3. The feature correlation calculation method according to claim 1, wherein the step of normalizing the first feature data of which the correlation is to be calculated includes:
calculating the mean value and the standard deviation of each data in the first characteristic data of the correlation to be calculated;
and subtracting the mean value from each data in the first characteristic data respectively and dividing the result by the standard deviation.
4. The method of claim 1, wherein the step of performing the random number adding operation on the processing result using the standby random number group to obtain the random number added feature data comprises:
and respectively adding different random numbers in the random number group to be used to each data in the first characteristic data after normalization processing to obtain the characteristic data of the added random numbers.
5. The feature correlation calculation method according to claim 1, wherein after the step of sending the random number added feature data to a second device participating in longitudinal federal learning, further comprising:
receiving the correlation value sent by the second device;
and performing feature selection on the first feature data and the second feature data according to the correlation values.
6. The method according to claim 1, wherein the step of normalizing the first feature data of the correlation to be calculated when the first feature data is stored in different execution machines of the distributed cluster comprises:
and performing normalization processing on the first characteristic data of the correlation to be calculated by adopting a distributed calculation mode.
7. The feature correlation calculation method according to any one of claims 1 to 6, wherein before the step of obtaining the standby random number group for the current federal learning according to the random number protection mechanism, the method further comprises:
performing sample alignment on the local sample data set and the second equipment to obtain an aligned sample data set;
determining the first feature data for which a correlation is to be computed from an alignment sample data set.
8. A feature relevance computation apparatus deployed at a first device participating in longitudinal federal learning, the feature relevance computation apparatus comprising:
the acquisition module is used for acquiring the standby random arrays of the federal study according to a random number protection mechanism, wherein the distribution rules of the random arrays acquired according to the random number protection mechanism in the multi-time federal study are different;
the processing module is used for carrying out normalization processing on the first characteristic data of the correlation to be calculated and carrying out random number adding operation on a processing result by adopting the standby random number group to obtain random number added characteristic data;
and the sending module is used for sending the characteristic data added with the random number to second equipment participating in longitudinal federal learning so that the second equipment can calculate the correlation value of the first characteristic data and the second characteristic data according to the processing result and the characteristic data added with the random number after normalization processing is carried out on the second characteristic data to be subjected to correlation calculation by the second equipment.
9. A feature correlation calculation apparatus comprising a memory, a processor and a feature correlation calculation program stored on the memory and executable on the processor, the feature correlation calculation program when executed by the processor implementing the steps of the feature correlation calculation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a feature correlation calculation program is stored thereon, which when executed by a processor implements the steps of the feature correlation calculation method according to any one of claims 1 to 7.
CN201911309526.7A 2019-12-18 2019-12-18 Feature correlation calculation method, device, equipment and computer-readable storage medium Active CN111079164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911309526.7A CN111079164B (en) 2019-12-18 2019-12-18 Feature correlation calculation method, device, equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911309526.7A CN111079164B (en) 2019-12-18 2019-12-18 Feature correlation calculation method, device, equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111079164A true CN111079164A (en) 2020-04-28
CN111079164B CN111079164B (en) 2021-09-07

Family

ID=70315441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911309526.7A Active CN111079164B (en) 2019-12-18 2019-12-18 Feature correlation calculation method, device, equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111079164B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196320A (en) * 2011-04-15 2011-09-21 江苏省现代企业信息化应用支撑软件工程技术研发中心 Image encrypting and decrypting system
CN105320921A (en) * 2014-07-31 2016-02-10 腾讯科技(深圳)有限公司 Binocular positioning method and binocular positioning apparatus
CN109359439A (en) * 2018-10-26 2019-02-19 北京天融信网络安全技术有限公司 Software detecting method, device, equipment and storage medium
CN110135494A (en) * 2019-05-10 2019-08-16 南京工业大学 Feature selection approach based on maximum information coefficient and Geordie index
CN110443378A (en) * 2019-08-02 2019-11-12 深圳前海微众银行股份有限公司 Feature correlation analysis method, device and readable storage medium storing program for executing in federation's study

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508199A (en) * 2020-11-30 2021-03-16 同盾控股有限公司 Feature selection method, device and related equipment for cross-feature federated learning
CN114996749A (en) * 2022-08-05 2022-09-02 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning
CN114996749B (en) * 2022-08-05 2022-11-25 蓝象智联(杭州)科技有限公司 Feature filtering method for federal learning

Also Published As

Publication number Publication date
CN111079164B (en) 2021-09-07

Similar Documents

Publication Publication Date Title
US10728018B2 (en) Secure probabilistic analytics using homomorphic encryption
US11196541B2 (en) Secure machine learning analytics using homomorphic encryption
CN110807528A (en) Feature correlation calculation method, device and computer-readable storage medium
KR101843340B1 (en) Privacy-preserving collaborative filtering
CN110443378B (en) Feature correlation analysis method and device in federal learning and readable storage medium
CN110414567B (en) Data processing method and device and electronic equipment
CN110892672A (en) Key authentication assertion generation to provide device anonymity
CN110704860A (en) Longitudinal federal learning method, device and system for improving safety and storage medium
CN110851869A (en) Sensitive information processing method and device and readable storage medium
CN108270944B (en) Digital image encryption method and device based on fractional order transformation
CN111079164B (en) Feature correlation calculation method, device, equipment and computer-readable storage medium
EP3961458B1 (en) Blockchain-based service processing methods, apparatuses, devices, and storage media
CN109214543B (en) Data processing method and device
CN110750520A (en) Feature data processing method, device and equipment and readable storage medium
US20160042183A1 (en) Generating identifier
KR20150115762A (en) Privacy protection against curious recommenders
CN111368314A (en) Modeling and predicting method, device, equipment and storage medium based on cross features
KR101751971B1 (en) Image processing method and apparatus for encoded image
CN114745178A (en) Identity authentication method, identity authentication device, computer equipment, storage medium and program product
CN110516461B (en) Multichannel image encryption method and device, storage medium and electronic equipment
KR20230029388A (en) Method for privacy preserving using homomorphic encryption with private variables and apparatus theroef
CN113094735A (en) Method for training privacy model
CN112615712B (en) Data processing method, related device and computer program product
CN111199027B (en) User authentication method and device, computer-readable storage medium and electronic equipment
CN112000964B (en) Data encryption method, system, medium and device based on dynamic coordinates and algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant