CN116204773A - Causal feature screening method, apparatus, device and storage medium


Publication number
CN116204773A
Authority
CN
China
Prior art keywords
data, feature, causal, screened, features
Prior art date
Legal status
Pending
Application number
CN202211411484.XA
Other languages
Chinese (zh)
Inventor
张燕
夏正勋
谭锋镭
Current Assignee
Henan Xinghuan Zhongzhi Information Technology Co ltd
Transwarp Technology Shanghai Co Ltd
Original Assignee
Henan Xinghuan Zhongzhi Information Technology Co ltd
Transwarp Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Henan Xinghuan Zhongzhi Information Technology Co Ltd and Transwarp Technology Shanghai Co Ltd
Priority to CN202211411484.XA
Publication of CN116204773A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a causal feature screening method, apparatus, device and storage medium. The method comprises: acquiring data to be screened containing causal features, wherein the data to be screened comprises horizontal federated scenario data and vertical federated scenario data; performing conditional independence determination on the features and the label variable in the data to be screened; and sequentially screening the features in the data to be screened according to the conditional independence determination result to determine a final causal feature set. In the causal feature screening method provided by the invention, differential privacy is combined with causal feature selection, so that federated conditional independence testing of features across multiple participants is realized without any raw data leaving its owner, constraint-based causal feature selection in a federated learning environment is thereby completed, the coordinator is prevented from leaking private information contained in the statistical results, the practical problem that a fully trusted third party cannot be found in engineering implementation is solved, and the conditional independence determination requirements of different federated scenarios can be met.

Description

Causal feature screening method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of causal science and technology, and in particular, to a causal feature screening method, apparatus, device, and storage medium.
Background
Feature selection, as a feature dimensionality reduction technique, is widely applied in high-dimensional data analysis scenarios. However, conventional feature selection algorithms typically filter features based on the correlation between features and the class attribute, and correlation does not represent causality between features and the class attribute; the resulting predictive classification models therefore lack interpretability, actionability and robustness. Causal feature selection discovers the Markov blanket (MB) of the class attribute, a substructure of the Bayesian network that consists of the parents (direct causes), children (direct effects) and spouses (SPs, the other direct causes of the direct effects) of the class attribute. From it, local causal relationships between the class attribute and the features can be explicitly inferred, and interpretable, actionable and robust predictive classification models can be constructed.
In the prior art, in a multi-party federated learning scenario, the fact that raw data cannot be seen by other parties places many obstacles in the way of applying causal feature selection, for example: 1) out of consideration for data privacy and security, the participants do not share their original data and exchange only encrypted intermediate data; 2) in current federated learning scenarios, a trusted third party is generally employed as a coordinator to process intermediate data or to fuse models; however, it is very difficult to find a truly trusted third party in actual production, and the coordinator can obtain the intermediate interaction results by decryption, so there is a risk of data privacy leakage.
Disclosure of Invention
The invention provides a causal feature screening method, apparatus, device and storage medium, so as to realize feature selection in a multi-party federated learning scenario.
According to an aspect of the present invention, there is provided a causal feature screening method, comprising:
acquiring data to be screened containing the causal features, wherein the data to be screened comprises horizontal federated scenario data and vertical federated scenario data;
performing conditional independence determination on features and a label variable in the data to be screened;
and sequentially screening the features in the data to be screened according to the conditional independence determination result to determine a final causal feature set.
Further, acquiring the data to be screened containing the causal features comprises:
acquiring original data containing the causal features;
if the original data belong to the horizontal federated scenario, performing feature alignment, federated feature engineering and feature value desensitization on the original data; if the original data belong to the vertical federated scenario, performing sample identifier alignment, local feature engineering and feature value desensitization on the original data;
and taking the processed original data as the data to be screened.
Further, performing conditional independence determination on the features and the label variable in the data to be screened comprises:
performing contingency table statistics on the horizontal federated scenario data and the vertical federated scenario data respectively;
and performing conditional independence determination on the features and the label variable in the data to be screened according to the contingency table statistics.
Further, performing contingency table statistics on the horizontal federated scenario data comprises:
counting first sample frequencies among the candidate feature, the label variable and the conditional features corresponding to the horizontal federated scenario data to generate a local sample-frequency contingency table;
adding noise to the first sample frequencies to obtain first privacy sample frequencies;
and performing contingency table fusion calculation according to the first privacy sample frequencies to generate a horizontal federated sample-frequency contingency table as the contingency table statistics.
Further, performing contingency table statistics on the vertical federated scenario data comprises:
performing feature binning on the candidate feature, the label variable and the conditional features corresponding to the vertical federated scenario data respectively to obtain corresponding sample identifier sets;
determining, according to the feature binning results, sample intersections under different feature value combinations as second sample frequencies;
and adding noise to the second sample frequencies to obtain second privacy sample frequencies as the contingency table statistics.
Further, the noise satisfies a conditional independence noise constraint.
Further, determining the sample intersections under different feature value combinations according to the feature binning results comprises:
hashing the binning results to obtain hashed sample identifier sets corresponding to the candidate feature, the label variable and the conditional features respectively;
encrypting the feature binning results according to a first random factor corresponding to the candidate feature;
and comparing the encrypted feature binning results in ciphertext, and determining the sample intersections according to the intersections of the sample identifiers under different feature value combinations.
Further, performing conditional independence determination on the features and the label variable in the data to be screened according to the contingency table statistics comprises:
determining a chi-square test value and the degrees of freedom according to the contingency table statistics;
and querying a chi-square distribution table according to the chi-square test value and the degrees of freedom to determine whether the features and the label variable satisfy conditional independence.
Further, sequentially screening the features in the data to be screened according to the conditional independence determination result to determine the final causal feature set comprises:
sequentially screening the features in the data to be screened, and if a feature and the label variable are not conditionally independent, adding the feature to a candidate Markov blanket;
and performing secondary screening on the features in the candidate Markov blanket, and determining the candidate Markov blanket with false-positive features removed as the final causal feature set.
According to another aspect of the present invention, there is provided a causal feature screening apparatus, comprising:
a data-to-be-screened acquisition module, configured to acquire data to be screened containing the causal features, wherein the data to be screened comprises horizontal federated scenario data and vertical federated scenario data;
a conditional independence determination module, configured to perform conditional independence determination on features and a label variable in the data to be screened;
and a causal feature set determination module, configured to sequentially screen the features in the data to be screened according to the conditional independence determination result to determine a final causal feature set.
Optionally, the data-to-be-screened acquisition module is further configured to:
acquire original data containing the causal features;
if the original data belong to the horizontal federated scenario, perform feature alignment, federated feature engineering and feature value desensitization on the original data; if the original data belong to the vertical federated scenario, perform sample identifier alignment, local feature engineering and feature value desensitization on the original data;
and take the processed original data as the data to be screened.
Optionally, the conditional independence determination module is further configured to:
perform contingency table statistics on the horizontal federated scenario data and the vertical federated scenario data respectively;
and perform conditional independence determination on the features and the label variable in the data to be screened according to the contingency table statistics.
Optionally, the conditional independence determination module is further configured to:
count first sample frequencies among the candidate feature, the label variable and the conditional features corresponding to the horizontal federated scenario data to generate a local sample-frequency contingency table;
add noise to the first sample frequencies to obtain first privacy sample frequencies;
and perform contingency table fusion calculation according to the first privacy sample frequencies to generate a horizontal federated sample-frequency contingency table as the contingency table statistics.
Optionally, the conditional independence determination module is further configured to:
perform feature binning on the candidate feature, the label variable and the conditional features corresponding to the vertical federated scenario data respectively to obtain corresponding sample identifier sets;
determine, according to the feature binning results, sample intersections under different feature value combinations as second sample frequencies;
and add noise to the second sample frequencies to obtain second privacy sample frequencies as the contingency table statistics.
Optionally, the noise satisfies a conditional independence noise constraint.
Optionally, the conditional independence determination module is further configured to:
hash the binning results to obtain hashed sample identifier sets corresponding to the candidate feature, the label variable and the conditional features respectively;
encrypt the feature binning results according to a first random factor corresponding to the candidate feature;
and compare the encrypted feature binning results in ciphertext, and determine the sample intersections according to the intersections of the sample identifiers under different feature value combinations.
Optionally, the conditional independence determination module is further configured to:
determine a chi-square test value and the degrees of freedom according to the contingency table statistics;
and query a chi-square distribution table according to the chi-square test value and the degrees of freedom to determine whether the features and the label variable satisfy conditional independence.
Optionally, the causal feature set determination module is further configured to:
sequentially screen the features in the data to be screened, and if a feature and the label variable are not conditionally independent, add the feature to a candidate Markov blanket;
and perform secondary screening on the features in the candidate Markov blanket, and determine the candidate Markov blanket with false-positive features removed as the final causal feature set.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the causal feature screening method of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to execute the causal feature screening method according to any embodiment of the present invention.
The embodiment of the invention provides a causal feature screening method: data to be screened containing the causal features is first acquired, where the data to be screened comprises horizontal federated scenario data and vertical federated scenario data; conditional independence determination is then performed on the features and the label variable in the data to be screened; finally, the features in the data to be screened are screened in turn according to the conditional independence determination result to determine the final causal feature set. In the causal feature screening method provided by the embodiment of the invention, differential privacy is combined with causal feature selection, so that federated conditional independence testing of features across multiple participants is realized without any raw data leaving its owner, constraint-based causal feature selection in a federated learning environment is thereby completed, the coordinator is prevented from leaking private information contained in the statistical results, the practical problem that a fully trusted third party cannot be found in engineering implementation is solved, and the conditional independence determination requirements of different federated scenarios can be met.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for screening causal features according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a method for screening causal features according to a second embodiment of the present invention;
FIG. 3 is a causal feature selection framework in a lateral federal scenario provided according to a second embodiment of the present invention;
FIG. 4 is a causal feature selection framework in a vertical federal scenario provided according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of a screening apparatus for causal features according to a third embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device implementing the causal feature screening method according to the fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a causal feature screening method according to a first embodiment of the present invention. The method may be executed by a causal feature screening apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in fig. 1, the method includes:
S110, acquiring data to be screened containing causal features, wherein the data to be screened comprises horizontal federated scenario data and vertical federated scenario data.
Here, the causal features are the features to be screened for causal relationships with the label variable. Federated machine learning is a machine learning framework that can effectively help multiple institutions use data and build machine learning models while meeting the requirements of user privacy protection, data security and government regulation. Horizontal federated learning, also called sample-partitioned federated learning (Sample-Partitioned Federated Learning), applies to scenarios in which the data sets of the federated learning participants share the same feature space but have different sample spaces; federated learning among participants whose data sets share the same sample space but have different feature spaces is classified as vertical federated learning (Vertical Federated Learning, VFL), which can also be understood as federated learning partitioned by features.
For example, consider cooperation between a bank and an e-commerce company: the bank holds credit labels and the e-commerce company holds consumption data. After the two parties identify a batch of common users, the bank wants to evaluate the value of the e-commerce data, but does not want to disclose its own credit labels, and the e-commerce company likewise does not want to disclose its consumption data; in this case, vertical federated learning can be adopted. Where different subsidiaries of the same bank, or different banks, have different users but essentially the same data features, business cooperation among those subsidiaries or banks can adopt horizontal federated learning. Through an encrypted distributed machine learning framework, federated learning enables enterprises to train models jointly without sharing data, solving the "data island" problem.
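The two partition styles can be pictured with a small illustrative sketch (not part of the patent; the pandas column names and party assignments below are assumptions made only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "age": [25, 40, 31, 52],
    "income": [3000, 8000, 5000, 9000],
    "label": [0, 1, 0, 1],
})

# Horizontal (sample-partitioned) federation: same feature space, disjoint samples.
party_a_horizontal = df.iloc[:2]   # samples held by participant A
party_b_horizontal = df.iloc[2:]   # samples held by participant B

# Vertical (feature-partitioned) federation: same samples, disjoint feature spaces.
party_a_vertical = df[["sample_id", "age", "income"]]   # features held by participant A
party_b_vertical = df[["sample_id", "label"]]           # label held by participant B
```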
In this embodiment, before causal feature screening starts, the data of each participant needs to be preprocessed, and the preprocessed data is used as the data to be screened. The preprocessing may include sample alignment, local and federated feature engineering, data desensitization and other operations, and different preprocessing can be adopted for the horizontal federated scenario data and the vertical federated scenario data according to the data type.
S120, performing conditional independence determination on the features and the label variable in the data to be screened.
This step is mainly concerned with whether one random variable (a feature) and another random variable (the label) are independent of each other. If they are independent, the feature contributes nothing to determining the label variable; that is, the presence or absence of the feature tells us nothing about whether a sample belongs to the class corresponding to the label.
Optionally, the conditional independence determination methods include the contingency-table-based chi-square test, the F test based on linear correlation, conditional independence tests based on mutual information, and the like. The embodiment of the invention is not limited to a specific conditional independence determination method. Taking the chi-square test as an example: the chi-square test is a method commonly used in mathematical statistics to test the independence of two variables, and its most basic idea is to judge whether a hypothesis is correct by observing the deviation between the actual values and the theoretical values. In the specific procedure, the two variables are usually first assumed to be independent, and the deviation between the actual and theoretical values is then observed. If the deviation is small enough, it is regarded as natural sampling error, caused by imprecise measurement or arising by chance, the two variables are accepted as truly independent, and the null hypothesis is accepted; if the deviation is so large that such an error is unlikely to have arisen by chance or from measurement inaccuracy, the two variables are considered not to be independent of each other, i.e., the null hypothesis is rejected and the alternative hypothesis is accepted.
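As a minimal single-machine illustration of the chi-square idea described above (not the federated protocol itself; the counts and the scipy-based helper are illustrative assumptions), the test can be sketched as follows:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows are feature values, columns are label values.
observed = np.array([[30, 10],
                     [20, 40]])

chi2_value, p_value, dof, expected = chi2_contingency(observed)
# A small p-value (e.g. < 0.05) rejects the independence hypothesis: the feature
# carries information about the label; a large p-value accepts independence.
print(chi2_value, p_value, dof)
```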
For other conditional independence determination methods, such as the mutual-information-based method and the correlation-based F test, the calculations differ, but conditional independence testing is likewise realized on the basis of the statistics of the federated data; the difference lies only in the objects being counted.
S130, sequentially screening the features in the data to be screened according to the conditional independence determination result to determine a final causal feature set.
In this embodiment, the coordinator may perform causal feature selection accordingly based on the conditional independence determination result.
Optionally, all the features to be screened may be screened in turn: the current feature is taken as the candidate feature; if the candidate feature is not conditionally independent of the label variable, it is added to the candidate feature set; otherwise the feature is discarded, and the next feature is selected as the candidate feature and the determination continues, until all the features have been screened.
This embodiment of the invention first acquires data to be screened containing causal features, where the data to be screened comprises horizontal federated scenario data and vertical federated scenario data, then performs conditional independence determination on the features and the label variable in the data to be screened, and finally screens the features in the data to be screened in turn according to the conditional independence determination result to determine the final causal feature set. In the causal feature screening method provided by this embodiment of the invention, differential privacy is combined with causal feature selection, so that federated conditional independence testing of features across multiple participants is realized without any raw data leaving its owner, constraint-based causal feature selection in a federated learning environment is thereby completed, the coordinator is prevented from leaking private information contained in the statistical results, the practical problem that a fully trusted third party cannot be found in engineering implementation is solved, and the conditional independence determination requirements of different federated scenarios can be met.
Example 2
Fig. 2 is a flowchart of a causal feature screening method according to a second embodiment of the present invention, where the present embodiment is a refinement of the foregoing embodiment. As shown in fig. 2, the method includes:
s210, acquiring original data containing causal features.
Wherein the original data is the data before preprocessing.
In this embodiment, the original data for causal feature screening may be obtained according to actual requirements, and the original data may be provided by each participant in federated learning.
S220, if the original data belong to the horizontal federated scenario, performing feature alignment, federated feature engineering and feature value desensitization on the original data; and if the original data belong to the vertical federated scenario, performing sample identifier alignment, local feature engineering and feature value desensitization on the original data.
In this embodiment, the preprocessing may be performed in different manners for the horizontal federated scenario and the vertical federated scenario. Suppose there are two participants A and B and one coordinator C, the feature variable set is F, and the label variable is Y. In the horizontal federated scenario, each participant holds the feature set F and the label variable Y; in the vertical federated scenario, participant A holds part of the features FA (FA ⊆ F) and participant B holds the remaining features FB (FB ⊆ F) and the label variable Y.
Specifically, in the horizontal federated scenario, feature alignment, federated feature engineering and feature value desensitization are performed; in the vertical federated scenario, sample identifier (ID) alignment, local feature engineering and feature value desensitization may be performed.
Feature engineering is the most important link in machine learning modeling; the biggest difference between federated feature engineering and traditional feature engineering is that feature processing (and possible supervision steps) relies on encrypted data, and data integration and computation need to be performed in the cloud. Feature alignment ensures that the features held by the participants in federated learning are the same, and removes unaligned features. Sample ID alignment uses private set intersection (Private Set Intersection, PSI), a privacy-preserving technique with which the data holders can compute the intersection of their data sets without exposing any information beyond the intersection. Feature value desensitization encrypts the original data so as to remove or protect its private information.
S230, taking the processed original data as data to be screened.
In this embodiment, the original data after the above processing may be used as the data to be screened.
S240, performing contingency table statistics on the horizontal federated scenario data and the vertical federated scenario data respectively.
A contingency table is a frequency table obtained when observed data are classified by two or more attributes (qualitative variables), i.e., a frequency distribution table cross-classified by two or more variables. The basic problem of contingency table analysis (Contingency Table Analysis) is to determine, through the statistical analysis and inference based on the table, whether there is a correlation between the attributes under investigation, i.e., whether they are independent.
In this embodiment, when causal feature screening is performed on the data to be screened, contingency table statistics are required first, and different statistical procedures are used for the horizontal federated scenario and the vertical federated scenario.
Optionally, the contingency table statistics on the horizontal federated scenario data may be performed as follows: first sample frequencies among the candidate feature, the label variable and the conditional features corresponding to the horizontal federated scenario data are counted to generate a local sample-frequency contingency table; noise is added to the first sample frequencies to obtain first privacy sample frequencies; and contingency table fusion calculation is performed according to the first privacy sample frequencies to generate a horizontal federated sample-frequency contingency table as the contingency table statistics.
In the present embodiment, the above noise satisfies the conditional independence noise constraint.
Specifically, in a horizontal federated learning scenario, the contingency table statistics are realized in three steps: local contingency table statistics, encryption of the statistics with the local differential privacy method based on the conditional independence noise constraint, and federated contingency table fusion. The specific steps are as follows:
1) Local contingency table statistics: each participant locally counts the sample frequencies among the candidate feature, the label variable and the conditional features respectively, generating a local sample-frequency contingency table.
2) Privacy computation under the conditional independence noise constraint: each participant encrypts all sample frequencies in its statistical table using a privacy computation method with a conditional independence noise constraint. Let x denote a sample frequency in the contingency table, r(x) the noise added to that frequency (differential privacy Laplace-mechanism noise or Gaussian-mechanism noise may be adopted), k(x) the contingency-table-based conditional independence detection function, and f(x) the sample frequency after differential privacy processing. The principle of the differential privacy method based on the conditional independence noise constraint is to keep |k(x) − k(x + r(x))| as small as possible, within the maximum allowed variation of the conditional independence detection value at the given significance level (typically 0.05). Assuming the degrees of freedom are n, the chi-square critical value table gives a conditional independence detection value k1 at the 0.05 significance level and k2 at the 0.1 significance level; the maximum allowed change of the detection value at the 0.05 significance level is therefore k1 − k2, namely |k(x) − k(x + r(x))| < k1 − k2. Noise r(x) satisfying this conditional independence noise constraint is added to the sample frequency x, finally yielding the differentially private sample frequency f(x) = x + r(x).
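A minimal sketch of this conditional-independence-constrained noise step might look as follows, assuming Laplace-mechanism noise, the chi-square statistic as the detection function k, and illustrative helper names; it uses a simple rejection loop and is an approximation for illustration, not the patented implementation:

```python
import numpy as np
from scipy.stats import chi2

def chi_square_stat(table):
    """Chi-square statistic of a 2-D contingency table (the detection function k)."""
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    return ((table - expected) ** 2 / (expected + 1e-12)).sum()

def ci_constrained_noise(counts, dof, scale=1.0, alpha1=0.05, alpha2=0.10, max_tries=1000):
    """Add Laplace noise r(x) to the counts x so that |k(x) - k(x + r(x))| < k1 - k2."""
    bound = chi2.ppf(1 - alpha1, dof) - chi2.ppf(1 - alpha2, dof)   # k1 - k2
    k_x = chi_square_stat(counts)
    for _ in range(max_tries):
        noise = np.random.laplace(scale=scale, size=counts.shape)
        noisy = np.clip(counts + noise, 0.0, None)                  # keep frequencies non-negative
        if abs(chi_square_stat(noisy) - k_x) < bound:
            return noisy                                            # f(x) = x + r(x)
    return counts.astype(float)                                     # no admissible draw found

noisy_table = ci_constrained_noise(np.array([[30.0, 10.0], [20.0, 40.0]]), dof=1)
```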
3) Federated contingency table fusion: the desensitized candidate feature, label variable and conditional feature values, together with the sample-frequency contingency table statistics encrypted in step 2), are sent to the coordinator, which performs the contingency table fusion calculation, i.e., adds the sample frequencies under the same feature value combination to generate a federated sample-frequency contingency table.
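The coordinator-side fusion of step 3) amounts to adding, cell by cell, the per-party noise-perturbed frequencies that share the same feature-value combination; a minimal sketch with illustrative key names is:

```python
from collections import Counter

def fuse_contingency_tables(party_tables):
    """party_tables: per-party dicts mapping (x_value, y_value, z_values) -> noisy frequency."""
    fused = Counter()
    for table in party_tables:
        for key, frequency in table.items():
            fused[key] += frequency        # add frequencies under the same value combination
    return dict(fused)

# Toy usage with two parties' encrypted (noise-perturbed) tables:
fused = fuse_contingency_tables([
    {("x0", "y0", "z0"): 12.3, ("x1", "y0", "z0"): 4.9},
    {("x0", "y0", "z0"): 7.8,  ("x1", "y1", "z0"): 6.1},
])
```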
Optionally, the contingency table statistics on the vertical federated scenario data may be performed as follows: feature binning is performed on the candidate feature, the label variable and the conditional features corresponding to the vertical federated scenario data respectively to obtain the corresponding sample identifier sets; according to the feature binning results, the sample intersections under different feature value combinations are determined as second sample frequencies; and noise is added to the second sample frequencies to obtain second privacy sample frequencies as the contingency table statistics.
In the present embodiment, the above noise satisfies the conditional independence noise constraint.
Specifically, in a vertical federated learning scenario, the contingency table statistics are realized in three steps: local feature binning, federated feature-bin intersection calculation, and contingency table statistics encrypted with the privacy computation method based on the conditional independence noise constraint. The specific steps are as follows:
1) Local feature binning: the participants holding the candidate feature X, the label variable Y and the conditional feature Z respectively perform feature binning locally, obtaining all feature values of the candidate feature X and the corresponding sample ID sets ID_X, all label values of the label variable Y and the corresponding sample ID sets ID_Y, and all feature values of the conditional feature Z and the corresponding sample ID sets ID_Z.
2) Federated feature-bin intersection calculation: the sample intersections under different feature value combinations are determined from the feature binning results and used as the second sample frequencies. The label party obtains the encrypted feature binning results of each participant and, through ciphertext comparison, can calculate the intersections of the sample IDs in the feature bins under different feature value combinations, thereby obtaining the federated contingency table statistics under the conditional features.
3) The label party encrypts the federated contingency table statistics and sends them to the coordinator: the label party encrypts all the sample frequencies in the federated contingency table obtained by statistics using the differential privacy method based on the conditional independence noise constraint (the details are the same as step 2) in the horizontal federated learning scenario), and then sends the encrypted contingency table statistics to the coordinator.
Further, the sample intersections under different feature value combinations may be determined from the feature binning results as follows: hashing is performed on the binning results to obtain the hashed sample identifier sets corresponding to the candidate feature, the label variable and the conditional features respectively; the feature binning results are encrypted according to a first random factor corresponding to the candidate feature; and the encrypted feature binning results are compared in ciphertext, and the sample intersections are determined from the intersections of the sample identifiers under different feature value combinations.
Specifically, the federated feature-bin intersection calculation can be implemented as follows:
1) Hash the local feature binning results: each participant locally hashes the feature binning results obtained in step 1) of the vertical federated scenario, obtaining the candidate feature values H(X) and the corresponding sample ID sets H(ID_X), the label values H(Y) and the corresponding sample ID sets H(ID_Y), and the conditional feature values H(Z) and the corresponding sample ID sets H(ID_Z).
2) The candidate feature party generates a random factor k and multiplies all candidate feature values H(X) and the corresponding sample ID sets H(ID_X) by it, obtaining k×H(X) and k×H(ID_X), which are sent to the label party.
3) The label party generates a random factor r and multiplies all label values H(Y) and the corresponding sample ID sets H(ID_Y) by it, obtaining r×H(Y) and r×H(ID_Y), which are sent to the candidate feature party. Each other, non-candidate feature party generates its own random factor p and multiplies the features H(Z) it holds and the corresponding sample ID sets H(ID_Z) by it, obtaining p×H(Z) and p×H(ID_Z), which are sent to the candidate feature party.
4) The candidate feature party multiplies the label values r×H(Y) and the corresponding sample ID sets r×H(ID_Y) sent by the label party by its random factor k, obtaining k×r×H(Y) and k×r×H(ID_Y), which are sent to the label party. At the same time, the candidate feature party multiplies the feature values p×H(Z) and the corresponding sample ID sets p×H(ID_Z) sent by each non-candidate feature party by k, obtaining k×p×H(Z) and k×p×H(ID_Z), which are sent back to the corresponding non-candidate feature party.
5) The label party multiplies k×r×H(Y) and k×r×H(ID_Y) by the inverse r^-1 of its random factor r, obtaining k×H(Y) and k×H(ID_Y); each non-candidate feature party multiplies k×p×H(Z) and k×p×H(ID_Z) by the inverse p^-1 of its random factor p, obtaining k×H(Z) and k×H(ID_Z), which are sent to the label party.
6) The label party thus obtains the feature binning results of every participant encrypted with the same random factor k, and through ciphertext comparison it can calculate the intersections of the sample IDs in the feature bins under different feature value combinations, thereby obtaining the federated contingency table statistics under the conditional features.
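The masking idea of steps 2)–6) can be sketched, in a deliberately simplified and non-hardened form, as multiplicative masking modulo a public prime; the prime, hash function and factor values below are illustrative assumptions, and a real deployment would use a properly analyzed private set intersection protocol:

```python
import hashlib

P = (1 << 127) - 1  # a public prime modulus (illustrative)

def h(value):
    """Hash a sample ID into the group used for masking."""
    return int.from_bytes(hashlib.sha256(str(value).encode()).digest(), "big") % P

def mask(values, factor):
    return {v * factor % P for v in values}

def unmask(values, factor):
    inv = pow(factor, -1, P)            # modular inverse, e.g. r^-1 or p^-1
    return {v * inv % P for v in values}

# Sample-ID sets inside one bin of X (candidate party) and one bin of Y (label party).
ids_x = {h(i) for i in [101, 102, 103]}
ids_y = {h(i) for i in [102, 103, 104]}

k, r = 987654321, 123456789            # private random factors of the two parties

masked_y = mask(ids_y, r)                    # label party -> candidate feature party
double_masked_y = mask(masked_y, k)          # candidate feature party -> label party
y_masked_by_k = unmask(double_masked_y, r)   # label party removes r, keeps the common k
x_masked_by_k = mask(ids_x, k)               # candidate party's own bin masked by k

intersection_size = len(x_masked_by_k & y_masked_by_k)   # one cell of the contingency table
print(intersection_size)               # -> 2
```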
S250, performing conditional independence determination on the features and the label variable in the data to be screened according to the contingency table statistics.
In this embodiment, the conditional independence determination of the features and the label variable may be performed after the contingency table statistics.
Optionally, the conditional independence determination on the features and the label variable in the data to be screened according to the contingency table statistics may be performed as follows: a chi-square test value and the degrees of freedom are determined according to the contingency table statistics; and the chi-square distribution table is queried according to the chi-square test value and the degrees of freedom to determine whether the features and the label variable satisfy conditional independence.
Specifically, based on the federated contingency table statistics, the coordinator may perform the chi-square test calculation, where the chi-square statistic in the conditional independence test is the sum of the chi-square statistics of all contingency tables under the combinations of conditional feature values. The calculation formula is:

χ² = Σ_k Σ_i Σ_j (A_ijk − E_ijk)² / E_ijk

where χ² is the chi-square value, i indexes the i-th value of the candidate feature, j indexes the j-th value of the label variable, k indexes the k-th combination of conditional feature values, A_ijk is the actual frequency in the corresponding contingency table cell, and E_ijk is the expected frequency when the candidate feature and the label variable are independent. After the chi-square statistic and the degrees of freedom have been calculated, the chi-square distribution table is queried to determine whether the candidate feature and the label variable are independent. The degrees of freedom in the conditional independence test are 2^|Z|, where |Z| is the number of variables in the conditional feature set.
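A minimal coordinator-side sketch of this conditional chi-square calculation, operating on a fused table keyed by (candidate-feature value, label value, conditional-feature-value combination) and using the 2^|Z| degrees-of-freedom rule stated above, might look as follows (helper names are illustrative):

```python
from collections import defaultdict
from scipy.stats import chi2

def conditional_chi_square(fused, n_condition_vars, alpha=0.05):
    """fused: dict mapping (x_value, y_value, z_combination) -> frequency."""
    strata = defaultdict(dict)
    for (x, y, z), count in fused.items():
        strata[z][(x, y)] = count

    stat = 0.0
    for table in strata.values():                  # one two-way table per conditional value combination
        xs = sorted({x for x, _ in table})
        ys = sorted({y for _, y in table})
        total = sum(table.values())
        for x in xs:
            for y in ys:
                observed = table.get((x, y), 0.0)
                expected = (sum(table.get((x, v), 0.0) for v in ys)
                            * sum(table.get((u, y), 0.0) for u in xs)) / total
                if expected > 0:
                    stat += (observed - expected) ** 2 / expected

    dof = 2 ** n_condition_vars                    # degrees of freedom as stated in the text
    independent = stat <= chi2.ppf(1 - alpha, dof) # query the chi-square distribution table
    return stat, dof, independent

fused = {("x0", "y0", "z0"): 20.1, ("x0", "y1", "z0"): 5.2,
         ("x1", "y0", "z0"): 6.3,  ("x1", "y1", "z0"): 18.4}
stat, dof, independent = conditional_chi_square(fused, n_condition_vars=1)
```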
S260, sequentially screening the features in the data to be screened, and if a feature and the label variable are not conditionally independent, adding the feature to the candidate Markov blanket.
A Markov blanket (MB) is a minimal feature subset with the following property: conditioned on its Markov blanket, a feature is independent of all other features in the feature domain. Let the Markov blanket of a feature T be MB(T); this can be expressed as P(T | MB(T)) = P(T | Y, MB(T)), where Y denotes all nodes in the feature domain that are not in the Markov blanket.
In this embodiment, the coordinator may perform causal feature selection accordingly based on the conditional independence determination result. If the candidate feature and the label variable are not independent given the conditional features, the candidate feature is added to the candidate MB (Markov blanket); otherwise the feature is discarded, and the next feature is selected as the candidate feature, until all the features have been screened.
S270, performing secondary screening on the features in the candidate Markov blanket, and determining the candidate Markov blanket with the false-positive features removed as the final causal feature set.
In this embodiment, the candidate MB obtained after the first round of screening may contain false-positive features, i.e., features that are in fact conditionally independent of the label variable; the false-positive features in the candidate MB (Markov blanket) can therefore be removed in a second round of screening to obtain the final causal feature set.
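The two-phase screening described in S260-S270 can be condensed into the following sketch, where ci_independent(x, y, z) stands in for the federated conditional independence determination of the previous steps and is assumed rather than implemented:

```python
def select_causal_features(features, label, ci_independent):
    cmb = set()                                   # candidate Markov blanket CMB(Y)
    # Grow phase: add every feature that is NOT conditionally independent of the label.
    for x in list(features):
        if not ci_independent(x, label, frozenset(cmb)):
            cmb.add(x)
    # Shrink phase: re-test each feature against the rest of the blanket and
    # remove false positives that turn out to be conditionally independent.
    removed = True
    while removed:
        removed = False
        for x in list(cmb):
            if ci_independent(x, label, frozenset(cmb - {x})):
                cmb.remove(x)
                removed = True
    return cmb

# Example with a toy oracle that treats only "noise" as independent of the label:
toy_ci = lambda x, y, z: x == "noise"
print(select_causal_features({"age", "income", "noise"}, "label", toy_ci))
```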
Fig. 3 and fig. 4 show the causal feature selection frameworks in the horizontal and vertical federated scenarios, respectively, provided by the embodiment of the present invention. Suppose there are two participants A and B and one coordinator C, the feature variable set is F, and the label variable is Y. In the horizontal federated scenario, each participant holds the feature set F and the label variable Y; in the vertical federated scenario, participant A holds part of the features FA (FA ⊆ F) and participant B holds the remaining features FB (FB ⊆ F) and the label variable Y.
As shown in fig. 3, in the horizontal federated scenario, after feature alignment, federated feature engineering and feature value desensitization, the coordinator initializes CMB(Y) = ∅. The coordinator then takes a feature X from the feature set F as the candidate feature (X ∈ F) and transmits the candidate feature X and the conditional feature set Z (Z = CMB(Y)) to each participant. Each participant locally counts the sample-frequency contingency table between the candidate feature variable X, the label variable Y and the conditional variables Z (Z = CMB(Y)). Each participant encrypts the sample frequencies in its local contingency table statistics using the differential privacy algorithm based on the conditional independence noise constraint. The coordinator receives the encrypted sample frequency statistics sent by each participant and performs the fusion calculation, and then performs the conditional independence determination based on the fused sample-frequency contingency table. The coordinator updates the candidate MB set CMB(Y) and the feature set F according to the conditional independence determination result, specifically: if the feature X is not conditionally independent of the label variable Y, X is added to the candidate MB set CMB(Y) and deleted from the feature set F, namely CMB(Y) = CMB(Y) ∪ {X}, F = F \ {X}. The coordinator continues to select an arbitrary feature from the feature set F and cyclically executes the steps from candidate feature selection to updating CMB(Y) and F according to the conditional independence determination result, until no feature remains in F. The coordinator then arbitrarily selects a feature X from the current CMB(Y) (X ∈ CMB(Y)), lets Z be CMB(Y) with X removed, i.e., Z = CMB(Y) \ {X}, sends the features X and Z to each participant, and executes the steps from contingency table statistics to conditional independence determination. The candidate MB set CMB(Y) is updated according to the conditional independence determination result, specifically: if the feature X is conditionally independent of the label variable Y, X is deleted from the candidate MB set CMB(Y), namely CMB(Y) = CMB(Y) \ {X}, until no feature can be deleted from CMB(Y); the finally output CMB(Y) is the final causal feature set in this embodiment. Taking federated learning of a disease prediction model as an example, suppose that medical institution A and medical institution B in a certain region each have an insufficient amount of data but share the same data features, so data cooperation is possible. Acting as the two participants, medical institution A and medical institution B each locally count the sample-frequency contingency tables between the candidate feature variable X, the label variable Y and the conditional variables Z, encrypt the sample frequencies in the local contingency table statistics using the differential privacy algorithm based on the conditional independence noise constraint, and send the encrypted sample frequency statistics to the third-party coordinator. The coordinator performs the conditional independence determination on the fused sample-frequency contingency table, updates the candidate MB set according to the determination result, and then processes the candidate MB set according to the above steps, finally obtaining the causal feature set. A disease prediction model built jointly on the data of the two institutions improves the diagnosis rate of the disease in actual medical service.
As shown in fig. 4, in the vertical federated scenario, after sample ID alignment, local feature engineering and feature value desensitization, all participants locally perform feature binning on all of their feature variables and label variables and count the sample ID set corresponding to each feature bin. The coordinator initializes CMB(Y) = ∅, takes a feature X from the feature set F as the candidate feature (X ∈ F), and issues the candidate feature X and the conditional feature set Z (Z = CMB(Y)) to the parties holding those features. After the federated contingency table statistics, the label party encrypts all sample frequencies in the resulting sample-frequency contingency table using the differential privacy method based on the conditional independence noise constraint, and then sends the encrypted contingency table statistics to the coordinator. After receiving the sample-frequency contingency table result, the coordinator performs the conditional independence determination, i.e., determines the conditional independence of X and Y given the condition Z. The coordinator updates the candidate MB set CMB(Y) and the feature set F according to the conditional independence detection results for X and Y, specifically: if the feature X is not conditionally independent of the label variable Y, X is added to the candidate MB set CMB(Y) and deleted from the feature set F, namely CMB(Y) = CMB(Y) ∪ {X}, F = F \ {X}. The subsequent steps are similar to those in the horizontal federated scenario and are not repeated here. Taking federated learning for intelligent group-rental analysis as an example: the initiator is an electric power department which, as the initiator of the federated learning task, provides electricity consumption data (including the label of whether a household is a group rental) and defines the model parameters and other information; the participant is a government water affairs department, which provides water consumption data and takes part in the federated modeling (without labels); the coordinator is deployed at the department in charge of group rental housing, acts as the third party for security supervision, and is responsible for providing computing power and distributing keys. In this example, after sample ID alignment, local feature engineering and feature value desensitization, the electric power department and the water affairs department, as participants, locally perform feature binning on all feature variables and label variables and count the sample ID set corresponding to each feature bin. The group-rental housing authority, as the coordinator, takes candidate features from the feature set and sends the candidate features and the conditional feature set to the participants holding those features. After the federated contingency table statistics, the electric power department, as the label party, encrypts all sample frequencies in the resulting sample-frequency contingency table using the differential privacy method based on the conditional independence noise constraint, and then sends the encrypted contingency table statistics to the group-rental housing authority. After receiving the sample-frequency contingency table result, the group-rental housing authority performs the conditional independence determination and finally outputs the causal feature set according to the conditional independence detection results.
This embodiment of the invention first acquires original data containing the causal features; if the original data belong to the horizontal federated scenario, feature alignment, federated feature engineering and feature value desensitization are performed on the original data; if the original data belong to the vertical federated scenario, sample identifier alignment, local feature engineering and feature value desensitization are performed; conditional independence determination is then performed on the features and the label variable in the data to be screened; if a feature and the label variable are not conditionally independent, the feature is added to the candidate Markov blanket; the features in the candidate Markov blanket are then screened a second time, and the candidate Markov blanket with the false-positive features removed is determined as the final causal feature set. In the causal feature screening method provided by this embodiment of the invention, differential privacy is combined with causal feature selection, so that federated conditional independence testing of features across multiple participants is realized without any raw data leaving its owner, constraint-based causal feature selection in a federated learning environment is thereby completed, the coordinator is prevented from leaking private information contained in the statistical results, the practical problem that a fully trusted third party cannot be found in engineering implementation is solved, and the conditional independence determination requirements of different federated scenarios can be met.
Example 3
Fig. 5 is a schematic structural diagram of a causal feature screening apparatus according to a third embodiment of the present invention. As shown in fig. 5, the apparatus includes: a data-to-be-screened acquisition module 310, a conditional independence determination module 320 and a causal feature set determination module 330.
The data-to-be-screened acquisition module 310 is configured to acquire data to be screened containing the causal features, where the data to be screened comprises horizontal federated scenario data and vertical federated scenario data.
The conditional independence determination module 320 is configured to perform conditional independence determination on the features and the label variable in the data to be screened.
The causal feature set determination module 330 is configured to sequentially screen the features in the data to be screened according to the conditional independence determination result to determine a final causal feature set.
Optionally, the data-to-be-screened acquisition module 310 is further configured to:
acquire original data containing the causal features; if the original data belong to the horizontal federated scenario, perform feature alignment, federated feature engineering and feature value desensitization on the original data; if the original data belong to the vertical federated scenario, perform sample identifier alignment, local feature engineering and feature value desensitization on the original data; and take the processed original data as the data to be screened.
Optionally, the condition independence determination module 320 is further configured to:
respectively carrying out a row and column linkage table statistics on the horizontal federation scene data and the vertical federation scene data; and carrying out condition independence judgment on the characteristics and the tag variables in the data to be screened according to the statistical result of the list.
Optionally, the condition independence determination module 320 is further configured to:
counting the first sample frequency of the candidate feature, the tag variable and the condition feature corresponding to the transverse federal scene data to generate a local sample frequency series list; adding noise into the first sample frequency number to obtain a first privacy sample frequency number; and performing a column-link table fusion calculation according to the first privacy sample frequency, and generating a transverse federal sample frequency column-link table as a column-link table statistical result.
Optionally, the condition independence determination module 320 is further configured to:
performing feature binning on the candidate features, the tag variables and the conditional features corresponding to the vertical federation scene data respectively to obtain their corresponding sample identification sets; determining, according to the feature binning result, the sample intersections under different feature value combinations as second sample frequencies; and adding noise to the second sample frequencies to obtain second privacy sample frequencies as the contingency table statistical result.
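The sketch below illustrates the vertical-scene statistic under simplifying assumptions: each party bins its own column and exposes, per bin, only a set of sample identifiers; a cell's second sample frequency is the size of the identifier intersection across parties, perturbed with Laplace noise. Quantile binning, the epsilon value and the function names are assumptions, and the intersection here is computed in the clear (a privacy-preserving comparison is sketched after the hashing and encryption paragraph below).

```python
# Sketch: binning to ID sets and noisy intersection counts (vertical federal scene).
import numpy as np
import pandas as pd


def bin_to_id_sets(series: pd.Series, bins: int = 4):
    # Feature binning for a numeric column: map each bin to its set of sample IDs.
    binned = pd.qcut(series, q=bins, duplicates="drop")
    return {label: set(idx) for label, idx in series.groupby(binned).groups.items()}


def value_to_id_sets(series: pd.Series):
    # For an already-discrete column (e.g. the tag variable), group by value directly.
    return {val: set(idx) for val, idx in series.groupby(series).groups.items()}


def noisy_cell_counts(x_sets, y_sets, z_sets, epsilon: float = 1.0):
    # Second sample frequencies: intersection sizes over all bin combinations,
    # then Laplace noise to obtain the second privacy sample frequencies.
    counts = {}
    for xv, xs in x_sets.items():
        for yv, ys in y_sets.items():
            for zv, zs in z_sets.items():
                n = len(xs & ys & zs)
                counts[(xv, yv, zv)] = max(0.0, n + np.random.laplace(scale=1.0 / epsilon))
    return counts
```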
Optionally, the noise satisfies a conditional independence noise constraint.
Optionally, the condition independence determination module 320 is further configured to:
performing hash processing according to the feature binning result to obtain hashed sample identification sets respectively corresponding to the candidate features, the tag variables and the conditional features; encrypting the feature binning result according to a first random factor corresponding to the candidate feature; and comparing the encrypted feature binning results in ciphertext, and determining the sample intersection according to the intersections of the sample identifiers under different feature value combinations.
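As an illustration of how the intersection can be matched in ciphertext, the sketch below uses a Diffie–Hellman-style blinding: sample identifiers are hashed to integers and raised to each party's secret random factor modulo a prime, and because exponentiation commutes, the doubly-blinded values can be compared without revealing raw identifiers. This is a stand-in under stated assumptions (the prime, the hash and the protocol flow are illustrative), not the embodiment's exact encryption scheme.

```python
# Sketch: DH-style blinded comparison of hashed sample identifiers (assumed scheme).
import hashlib

P = 2**127 - 1  # illustrative prime modulus


def hash_ids(sample_ids):
    # Hash processing: map each sample identifier to an integer mod P.
    return {sid: int(hashlib.sha256(str(sid).encode()).hexdigest(), 16) % P
            for sid in sample_ids}


def blind(hashed: dict, secret: int) -> dict:
    # "Encrypt" by exponentiation with a party's secret random factor.
    return {sid: pow(h, secret, P) for sid, h in hashed.items()}


def private_intersection(ids_a, ids_b, secret_a: int, secret_b: int):
    # Each side blinds its own hashed IDs and then the other side's blinded IDs;
    # matching double-blinded ciphertexts identify the sample intersection.
    a_double = set(blind(blind(hash_ids(ids_a), secret_a), secret_b).values())
    b_double = blind(blind(hash_ids(ids_b), secret_b), secret_a)
    return {sid for sid, c in b_double.items() if c in a_double}
```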
Optionally, the condition independence determination module 320 is further configured to:
determining a chi-square test value and degrees of freedom according to the contingency table statistical result; and querying a chi-square distribution table according to the chi-square test value and the degrees of freedom to determine whether the features and the tag variables satisfy condition independence.
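A minimal sketch of this judgment follows: the chi-square statistic and degrees of freedom are accumulated over the strata of the conditional feature and compared with the critical value of the chi-square distribution; scipy is used here in place of looking up a chi-square distribution table, and the significance level is an assumed parameter.

```python
# Sketch: conditional independence decision from per-stratum contingency tables.
from scipy.stats import chi2, chi2_contingency


def conditionally_independent(tables, alpha: float = 0.05) -> bool:
    # `tables`: one |X| x |Y| frequency table (2-D array / DataFrame) per Z stratum,
    # e.g. the output of contingency_tables() above; each table is assumed to have
    # at least two non-empty rows and columns.
    stat, dof = 0.0, 0
    for table in tables.values():
        s, _, d, _ = chi2_contingency(table)
        stat, dof = stat + s, dof + d
    if dof == 0:
        return True  # nothing to test against
    # Independence is accepted when the statistic stays below the critical value.
    return stat <= chi2.ppf(1.0 - alpha, dof)
```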
Optionally, the causal feature set determination module 330 is further configured to:
sequentially screening the features in the data to be screened, and if a feature and the tag variable satisfy condition independence, adding the corresponding feature to the candidate Markov blanket; and performing secondary screening on the features in the candidate Markov blanket, and determining the candidate Markov blanket with the false positive features removed as the final causal feature set.
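The sketch below illustrates a conventional grow–shrink reading of this two-pass screening: a forward pass adds a feature to the candidate Markov blanket when the federated test finds it dependent on the tag variable given the current blanket, and a backward pass re-tests each member and removes false positives. The add/remove criterion and the `ci_test` callback (assumed to wrap the condition independence judgment above and return True when independence holds) are assumptions about the intended logic, not a verbatim restatement of the embodiment.

```python
# Sketch: grow-shrink construction of the causal feature set (assumed criterion).
def grow_shrink(data, label, features, ci_test):
    candidate_mb = []
    # Forward pass: build the candidate Markov blanket.
    for x in features:
        if not ci_test(data, x, label, candidate_mb):
            candidate_mb.append(x)
    # Backward pass (secondary screening): drop false-positive features.
    for x in list(candidate_mb):
        others = [f for f in candidate_mb if f != x]
        if ci_test(data, x, label, others):
            candidate_mb.remove(x)
    return candidate_mb
```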
The causal feature screening device provided by the embodiment of the invention can execute the causal feature screening method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example IV
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the causal feature screening method.
In some embodiments, the method of screening causal features may be implemented as a computer program, which is tangibly embodied on a computer readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more of the steps of screening for causal features described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the screening method of causal features in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method of screening for causal features, comprising:
acquiring data to be screened containing the causal features, wherein the data to be screened comprises horizontal federation scene data and vertical federation scene data;
performing condition independence judgment on the features and the tag variables in the data to be screened;
and sequentially screening the features in the data to be screened according to the condition independence judgment result to determine a final causal feature set.
2. The method of claim 1, wherein obtaining data to be screened comprising the causal features comprises:
acquiring original data containing the causal features;
if the original data is in the horizontal federal scene, performing feature alignment, federal feature engineering and feature value desensitization on the original data; if the original data is in the vertical federal scene, performing sample identification alignment, local feature engineering and feature value desensitization on the original data;
and taking the processed original data as the data to be screened.
3. The method of claim 1, wherein the performing a condition independence determination on the feature and the tag variable in the data to be screened comprises:
respectively performing contingency table statistics on the horizontal federation scene data and the vertical federation scene data;
and performing condition independence judgment on the features and the tag variables in the data to be screened according to the contingency table statistical result.
4. A method according to claim 3, wherein performing contingency table statistics on the horizontal federation scene data comprises:
counting first sample frequencies of the candidate feature, the tag variable and the conditional feature corresponding to the horizontal federation scene data to generate a local sample frequency contingency table;
adding noise to the first sample frequencies to obtain first privacy sample frequencies;
and performing contingency table fusion calculation according to the first privacy sample frequencies to generate a horizontal federal sample frequency contingency table as a contingency table statistical result.
5. A method according to claim 3, wherein performing contingency table statistics on the vertical federation scene data comprises:
performing feature binning on the candidate features, the tag variables and the conditional features corresponding to the vertical federation scene data respectively to obtain their corresponding sample identification sets;
determining, according to the feature binning result, sample intersections under different feature value combinations as second sample frequencies;
and adding noise to the second sample frequencies to obtain second privacy sample frequencies as a contingency table statistical result.
6. The method of claim 4 or 5, wherein the noise satisfies a conditional independence noise constraint.
7. The method of claim 5, wherein determining the intersection of samples for different combinations of feature values based on the feature binning result comprises:
performing hash processing according to the feature binning result to obtain hashed sample identification sets corresponding respectively to the candidate features, the tag variables and the conditional features;
encrypting the feature binning result according to a first random factor corresponding to the candidate feature;
and comparing the encrypted feature binning results in ciphertext, and determining the sample intersection according to the intersection of the sample identifiers under different feature value combinations.
8. A method according to claim 3, wherein performing the condition independence judgment on the features and the tag variables in the data to be screened according to the contingency table statistics comprises:
determining a chi-square test value and degrees of freedom according to the contingency table statistical result;
and querying a chi-square distribution table according to the chi-square test value and the degrees of freedom to determine whether the features and the tag variables satisfy condition independence.
9. The method of claim 1, wherein sequentially screening the features in the data to be screened according to the condition independence determination result to determine a final causal feature set comprises:
sequentially screening the features in the data to be screened, and if a feature and the tag variable satisfy condition independence, adding the corresponding feature to the candidate Markov blanket;
and performing secondary screening on the characteristics in the candidate Markov blanket, and determining the candidate Markov blanket with the false positive characteristics removed as a final causal characteristic set.
10. A causal feature screening apparatus, comprising:
the data to be screened obtaining module is used for obtaining data to be screened containing the causal features, wherein the data to be screened comprises horizontal federation scene data and vertical federation scene data;
the condition independence judging module is used for judging the condition independence of the characteristics and the tag variables in the data to be screened;
and the causal feature set determining module is used for sequentially screening the features in the data to be screened according to the condition independence judging result to determine a final causal feature set.
11. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the causal feature screening method of any of claims 1-9.
12. A computer readable storage medium storing computer instructions for causing a processor to execute the method of screening for causal features according to any of claims 1-9.
CN202211411484.XA 2022-11-11 2022-11-11 Causal feature screening method, causal feature screening device, causal feature screening equipment and storage medium Pending CN116204773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211411484.XA CN116204773A (en) 2022-11-11 2022-11-11 Causal feature screening method, causal feature screening device, causal feature screening equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116204773A true CN116204773A (en) 2023-06-02

Family

ID=86516197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211411484.XA Pending CN116204773A (en) 2022-11-11 2022-11-11 Causal feature screening method, causal feature screening device, causal feature screening equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116204773A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805039A (en) * 2023-08-21 2023-09-26 腾讯科技(深圳)有限公司 Feature screening method, device, computer equipment and data disturbance method
CN116805039B (en) * 2023-08-21 2023-12-05 腾讯科技(深圳)有限公司 Feature screening method, device, computer equipment and data disturbance method
CN117474119A (en) * 2023-10-20 2024-01-30 上海零数众合信息科技有限公司 Fault prediction method and system based on federal learning
CN117668668A (en) * 2024-01-31 2024-03-08 安徽大学 Causal feature selection method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination