CN114444592A - Feature screening method, device, storage medium, and program product - Google Patents

Feature screening method, device, storage medium, and program product

Info

Publication number
CN114444592A
Authority
CN
China
Prior art keywords
feature
screening
subset
index
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210084862.1A
Other languages
Chinese (zh)
Inventor
范昊
杨恺
王虎
黄志翔
彭南博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202210084862.1A
Publication of CN114444592A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Operations Research (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a feature screening method, device, storage medium, and program product. The method includes: sampling a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets; for each feature subset, calculating a plurality of feature screening indexes of the feature subset; for each feature screening index of each feature subset, calculating a distribution function curve corresponding to the feature screening index and determining, according to the distribution function curve, the subset screening threshold of the feature screening index for that feature subset; determining, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index; and screening the features of each feature subset according to the feature screening thresholds corresponding to the feature screening indexes to obtain a final feature screening result. Embodiments of the present invention can improve the rationality of feature screening.

Description

Feature screening method, device, storage medium, and program product
Technical Field
Embodiments of the present invention relate to the technical field of artificial intelligence, and in particular to a feature screening method, a feature screening device, a storage medium, and a program product.
Background
Federated learning is also known as federated machine learning, joint learning, or alliance learning. Federated learning is a machine learning framework that can effectively help multiple organizations use data and build machine learning models while meeting the requirements of user privacy protection and data security, thereby realizing data sharing. In the business deployment of federated learning, screening out appropriate features is the key to improving the overall effect of the model.
In the prior art, feature screening is usually performed by manually setting feature thresholds. For example, when screening with the Information Value (IV), modeling and manually specifying an IV threshold are required in order to retain the features whose IV values are greater than the threshold.
However, this feature screening approach depends excessively on subjective experience, and the modeling personnel of different participants in federated modeling have different experience with and understanding of feature screening, so the rationality of feature screening cannot be guaranteed, which affects the overall effect of federated learning modeling.
Disclosure of Invention
Embodiments of the present invention provide a feature screening method, device, storage medium, and program product to improve the rationality of feature screening.
In a first aspect, an embodiment of the present invention provides a feature screening method, including:
sampling a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets, where the features do not carry their actual names but only identification information;
for each feature subset, calculating a plurality of feature screening indexes for the feature subset;
for each feature screening index of each feature subset, calculating a distribution function curve corresponding to the feature screening index, and determining, according to the distribution function curve, the subset screening threshold of the feature screening index for the feature subset;
determining the feature screening threshold corresponding to each feature screening index according to the subset screening thresholds; and
screening the features of each feature subset according to the feature screening thresholds corresponding to the feature screening indexes to obtain a final feature screening result.
In a second aspect, an embodiment of the present invention provides a feature screening apparatus, including:
a sampling module, configured to sample a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets;
an index calculation module, configured to calculate, for each feature subset, a plurality of feature screening indexes for the feature subset;
a subset screening threshold determining module, configured to calculate, for each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index, and determine, according to the distribution function curve, the subset screening threshold of the feature screening index for the feature subset;
a feature screening threshold determining module, configured to determine, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index; and
a screening module, configured to screen the features of each feature subset according to the feature screening thresholds corresponding to the feature screening indexes to obtain a final feature screening result.
In a third aspect, an embodiment of the present invention provides a feature screening apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the method as set forth in the first aspect above and in various possible designs of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method according to the first aspect and various possible designs of the first aspect are implemented.
In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program that, when executed by a processor, implements the method as set forth in the first aspect and various possible designs of the first aspect.
The method provided by the embodiments of the present invention samples a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets; calculates, for each feature subset, a plurality of feature screening indexes of the feature subset; calculates, for each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index and determines, according to the distribution function curve, the subset screening threshold of the feature screening index for that feature subset; determines, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index; and screens the features of each feature subset according to the feature screening thresholds corresponding to the feature screening indexes to obtain a final feature screening result. Because the thresholds are selected automatically from the distribution function curves, the method is more objective and reasonable than the traditional scheme, which relies on subjective experience to specify thresholds. In addition, each participant hides the names of the features it provides, so feature screening can be performed while protecting user data privacy and ensuring security, and because the complete feature data set is sampled into feature subsets, feature screening can be performed in parallel over multiple feature subsets and multiple screening indexes at the same time.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a feature screening process provided in the prior art;
FIG. 2 is a schematic flow chart of a feature screening method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a feature screening method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the determination of a subset screening threshold in a feature screening method according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a feature screening method according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a feature screening method according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a feature screening apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the hardware structure of a feature screening apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the rapid development and application of new technologies such as big data and artificial intelligence, problems of data security and user privacy have gradually been exposed. Accordingly, relevant laws and regulations make it clear that the ownership of data belongs to the user, that a data-holding organization only has the right to use the data, and that trading or sharing users' data is strictly prohibited. In this environment, in order to protect user privacy and comply with the regulations, data cannot be directly exchanged or shared between organizations, which creates a "data island" phenomenon in which massive data of different users and dimensions are stored separately by different organizations, trapping the development of big-data technologies in a bottleneck. Federated learning is a technical scheme that emerged and developed against this background; its goal is to realize joint modeling of data from multiple parties on the premise of protecting user data privacy, thereby unlocking the value of social data resources and enhancing the orderly sharing of data.
Although the theory of federated learning has developed well, many problems still need to be solved when federated learning is deployed in business. One key problem is how to select a proper number of features from the multiple organizations participating in federated modeling: more features are not always better, and redundant features not only increase modeling costs such as storage, modeling time, and usage cost, but also affect the overall accuracy and reduce the overall effect of the model. The question is therefore how to reasonably select features from the multiple participants to form a feature set that maintains or improves the model effect while reducing the modeling cost.
Feature selection in an actual business scenario can be divided into two aspects: feature screening before modeling and feature importance analysis after modeling. Feature screening before modeling refers to sorting features according to their relationships with each other and with the labels, subjectively assigning a cut-off threshold according to experience, and finally retaining only the features above the cut-off threshold; examples include the Pearson Correlation Coefficient (PCC) and the Information Value (IV), where the Pearson correlation coefficient eliminates redundant features by measuring the degree of correlation between features, and the IV evaluates the predictive capability of an encoded input variable. Feature importance analysis after modeling refers to measuring the role the features play in the trained model, for example the number of times a feature is used in the nodes of a tree model or the information gain of a feature; this also requires subjectively specifying a cut-off threshold, and the feature set above the threshold is retained as the final result of feature screening for subsequent federated modeling.
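The pre-modeling indicators mentioned above can be computed independently of any model. As a concrete illustration, the sketch below computes the IV of a single feature against a binary label; the equal-frequency binning, the number of bins, and the smoothing constant are illustrative assumptions rather than choices prescribed by this application.

```python
import numpy as np

def information_value(feature, label, bins=5, eps=1e-6):
    """Rough IV of one feature against a binary label (0 = good, 1 = bad)."""
    feature = np.asarray(feature, dtype=float)
    label = np.asarray(label)
    # Equal-frequency bin edges; np.unique guards against duplicate quantiles.
    edges = np.unique(np.quantile(feature, np.linspace(0, 1, bins + 1)))
    bucket = np.digitize(feature, edges[1:-1])
    good_total = max((label == 0).sum(), 1)
    bad_total = max((label == 1).sum(), 1)
    iv = 0.0
    for b in np.unique(bucket):
        in_bin = bucket == b
        good_rate = ((label == 0) & in_bin).sum() / good_total + eps
        bad_rate = ((label == 1) & in_bin).sum() / bad_total + eps
        iv += (good_rate - bad_rate) * np.log(good_rate / bad_rate)  # WOE-weighted difference
    return iv

rng = np.random.default_rng(0)
x_strong = rng.normal(size=1000)                        # informative feature
y = (x_strong + rng.normal(scale=0.5, size=1000) > 0).astype(int)
x_noise = rng.normal(size=1000)                         # uninformative feature
print(information_value(x_strong, y), information_value(x_noise, y))
```

A feature like x_strong yields a much larger IV than x_noise, which is exactly the gap a manually chosen IV threshold in the prior-art scheme tries to exploit.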
FIG. 1 is a schematic diagram of a feature screening process provided in the prior art. As shown in FIG. 1, the N features provided by the participants in federated learning are screened by M screening schemes in series, and the remaining N - X1 - X2 - ... - X(M-1) - XM features are used as the final screening result for subsequent federated modeling. Specifically, each screening scheme corresponds to one screening index whose screening threshold is set manually. In the first screening scheme, X1 features are eliminated from the N input features according to the first manually set threshold, leaving the N - X1 features of the first scheme's result. In the second screening scheme, X2 features are eliminated from the N - X1 input features according to the second manually set threshold, leaving N - X1 - X2 features. Similarly, in the M-th screening scheme, XM features are eliminated from the N - X1 - X2 - ... - X(M-1) input features according to the M-th manually set threshold, giving the final result of N - X1 - X2 - ... - X(M-1) - XM features.
Therefore, in the screening scheme shown in FIG. 1, a screening threshold needs to be set manually. This manual approach depends excessively on subjective experience, which may require long-term accumulation of trial and error or inheritance from experienced personnel, and the modeling personnel of different participants in federated modeling have different experience with and understanding of feature screening, so the rationality of the thresholds cannot be guaranteed. Resetting the screening threshold for each screening scheme also wastes a large amount of labor and seriously reduces modeling efficiency. In addition, because the screening schemes run in series and the N features provided by the participants are eliminated batch by batch, the various feature screening schemes are not truly considered together: for example, a feature eliminated by its IV value before modeling no longer participates in feature screening after modeling, so no post-modeling feature importance result is produced for it. Furthermore, this serial approach easily causes model overfitting.
To solve the above problems, the inventors found through research that a plurality of feature subsets can be obtained by sampling the features provided by the participants in federated learning, that screening with multiple screening indexes can then be performed in parallel, and that the thresholds of the different screening indexes can be determined automatically based on a distribution function. Based on this, the embodiments of the present invention provide a feature screening method that integrates multiple feature screening indexes and automatically determines the feature cut-off thresholds, so that the contribution of each feature to the predictive capability of the model is measured comprehensively, accurately, and efficiently without reducing data security, ensuring the rationality of feature screening.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
FIG. 2 is a schematic flow chart of a feature screening method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
201. A plurality of features provided by the participants in federated learning are sampled to obtain a plurality of feature subsets.
The execution subject of this embodiment may be a device such as a computer or a server, which is not limited by this embodiment.
In this embodiment, in order to improve the generality of the screened features and avoid overfitting, the plurality of features provided by the participants in federated learning are first formed, by sampling in the feature dimension, into a plurality of feature subsets D, and the subsequent feature screening steps are then performed in parallel on these feature subsets. Various sampling methods can be used, such as complete search, heuristic search, or random search.
The exhaustive complete search method is the simplest. A heuristic search may, according to some indicator computed over all the features before federated feature screening, heuristically add features starting from the empty set or remove features from the full feature set; this narrows the search range and reduces the complexity of the problem.
In some embodiments, the features provided by the participants in federated learning may carry identification information that distinguishes the features; for example, the identification information may be a code such as a number, a letter, or an alias. Using such identifiers instead of the actual feature names helps protect user privacy: each participant hides the names of the features it provides, so feature screening can be performed while protecting user data privacy and ensuring security.
In some embodiments, considering that the communication cost of federated modeling is high and the feature dimensions provided by the multiple parties are often large, a random search may be adopted. In that case, sampling the plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets includes: randomly sampling the plurality of features provided by the participants to obtain the plurality of feature subsets. Where the features carry identification information and the identification information is a number, the random sampling includes: sorting the features according to their numbers to obtain a feature sequence; and randomly sampling the feature sequence according to a randomly generated sequence selection algorithm to obtain the plurality of feature subsets. Specifically, different feature subsets are formed from the complete feature set provided by the multiple participants; for example, the features of the multiple participants may be numbered to form a feature sequence, and multiple subsets may then be randomly generated from it with a Random Generation plus Sequential Selection (RGSS) algorithm. Exemplarily, FIG. 3 is a schematic diagram of the principle of the feature screening method according to an embodiment of the present invention; as shown in FIG. 3, the n features provided by the m federated learning participants generate three feature subsets, namely feature subset D1, feature subset D2, and feature subset D3.
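A minimal sketch of this sampling step is given below. The RGSS procedure itself is not spelled out here, so each subset is drawn as a simple random sample (without replacement) from the numbered feature sequence, and the subset count and subset size are illustrative assumptions.

```python
import random

def sample_feature_subsets(feature_ids, num_subsets=3, subset_size=None, seed=0):
    """Randomly draw feature subsets D1..Dk from the numbered feature sequence."""
    rng = random.Random(seed)
    sequence = sorted(feature_ids)                    # sort by feature number
    if subset_size is None:
        subset_size = max(1, len(sequence) // 2)
    return [sorted(rng.sample(sequence, subset_size)) for _ in range(num_subsets)]

# Example: n = 20 features contributed by the participants, identified only by number.
feature_ids = list(range(20))
for i, subset in enumerate(sample_feature_subsets(feature_ids), start=1):
    print(f"D{i}: {subset}")
```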
202. For each feature subset, a plurality of feature screening indexes of the feature subset are calculated.
In this embodiment, after the plurality of feature subsets are prepared, different feature screening indexes need to be calculated for each of them, for example the IV value and the Pearson correlation coefficient before modeling, and post-modeling feature importance indexes such as the average information gain value Gain.
In some embodiments, each feature screening index corresponds to a filter function.
For example, the IV value and the Gain value may be selected for feature screening, the first screening index corresponding to a first filter function F1 and the second screening index to a second filter function F2. The IV value, calculated before modeling, measures the predictive power of each feature with respect to the label y; the Gain value, which takes into account the number of times each feature is used in the model, is the average information gain obtained when the feature is selected as a split point.
As shown in FIG. 3, the value of each filter function Fj is calculated for each feature subset Di. Specifically, for feature subset D1, the feature screening indexes F1 and F2 are calculated to obtain D1F1 and D1F2; for feature subset D2, they are calculated to obtain D2F1 and D2F2; and for feature subset D3, they are calculated to obtain D3F1 and D3F2.
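A small sketch of this computation follows. The two filter functions are simple stand-ins (absolute correlation with the label and feature variance), since actual IV and Gain values would come from binning and from a trained model, respectively; all data and names are hypothetical.

```python
import numpy as np

def f1_abs_corr(x, y):
    """Stand-in for a pre-modeling index such as IV: |corr(feature, label)|."""
    return float(abs(np.corrcoef(x, y)[0, 1]))

def f2_variance(x, y):
    """Stand-in for a second index: feature variance (label unused)."""
    return float(np.var(x))

filter_functions = {"F1": f1_abs_corr, "F2": f2_variance}

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                       # 100 samples, feature ids 0..5
y = (X[:, 1] + X[:, 4] > 0).astype(float)
subsets = {"D1": [0, 1, 2, 4], "D2": [1, 3, 4, 5], "D3": [0, 2, 3, 5]}

# DiFj: for every feature subset Di and filter function Fj, one value per feature.
difj = {
    (d_name, f_name): {fid: f(X[:, fid], y) for fid in fids}
    for d_name, fids in subsets.items()
    for f_name, f in filter_functions.items()
}
print(difj[("D1", "F1")])
```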
203. For each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index is calculated, and the subset screening threshold of the feature screening index for the feature subset is determined according to the distribution function curve.
In this embodiment, the distribution function may be a Cumulative Distribution Function (CDF) or a Complementary Cumulative Distribution Function (CCDF). It may also be chosen according to the type of the variable: if the variable is continuous, a Probability Density Function (PDF) may be used, the probability density function of a continuous random variable being a function that describes the likelihood that the random variable takes values near a given point; if the variable is categorical, a Probability Mass Function (PMF) may be used, the probability mass function giving the probability of each specific value of a discrete random variable.
Specifically, if a feature X is a continuous random variable, its probability density function is denoted fX(x), and the integral of the PDF over an interval characterizes the probability that the random variable falls within that interval. If the feature X is a discrete random variable, its probability mass function is denoted fX(x).
In some embodiments, calculating the distribution function curve corresponding to the feature screening index includes: calculating the filter function values corresponding to the feature screening index; normalizing the filter function values and sorting the normalized filter function values in a preset order; and calculating the CDF curve from the sorted filter function values. In one possible implementation, the normalized filter function values may be sorted in increasing order. Determining, from the CDF curve, the subset screening threshold of the feature screening index for the feature subset includes: locating, according to the gradient of the CDF curve, the first horizontal point above a preset cumulative value; and determining the filter function value corresponding to that first horizontal point as the subset screening threshold of the feature screening index for the feature subset.
For example, this embodiment illustrates how the curve is plotted and how the subset screening threshold is determined, taking the CDF as the distribution function. For each feature subset Di and each of its corresponding filter function values DiFj, the cut-off threshold of the feature screening, i.e. the cut-off rank, needs to be calculated as the subset screening threshold of the feature subset. Unlike the traditional approach of subjective manual specification, in this embodiment the subset screening threshold is determined automatically based on the cumulative distribution function CDF.
The CDF is expressed as:
F_X(x) = P(X ≤ x)    (1)
For the discrete filter function values obtained in step 202, the cumulative distribution function gives the proportion of all values that are less than or equal to a given value x. To explain this step more clearly and concretely, the IV value is taken as an example below.
First, the IV values of all the features in a feature subset are calculated; the IV values are then normalized so that the data range is mapped to the closed interval [0, 1], and the normalized values are sorted in increasing order. The normalization is defined in formula (2):
X' = (X − X_min) / (X_max − X_min)    (2)
where X is a data value before normalization, X' is the corresponding value after normalization, and X_max and X_min are the maximum and minimum values, respectively, in the set formed by all the values of X.
Second, the CDF curve is calculated from the normalized IV values, and the first horizontal point above a preset cumulative value (for example 50%), that is, the first point where the gradient is 0, is located according to the gradient of the CDF curve; the normalized IV values of the features in the feature subset are then mapped back to the feature sequence of the subset to obtain the feature cut-off rank. In practice, to calculate the gradient at any point k on the CDF curve, a forward difference can be used for interior points of the curve and a one-sided difference (forward or backward) for the boundary points, the gradient GT being expressed as:
GT_k = (CDF(t_{k+1}) − CDF(t_k)) / (t_{k+1} − t_k)    (3)
where DiFj denotes the j-th filter function value Fj corresponding to the i-th feature subset Di, n is the number of features in the feature subset Di, k is the index after sorting the normalized values from small to large, and t is the specific filter function; taking the IV value as an example, t_k is the IV value corresponding to the k-th feature. The calculation process is schematically illustrated in FIG. 4, where the lower curve is a scatter plot of the normalized IV values (ranging from 0 to 1) of all the features of the feature subset, and the upper curve is the corresponding CDF curve; the position marked by the vertical dashed line is the first horizontal point above a cumulative value of 50%, i.e. the subset screening threshold obtained by screening with the IV value in this feature subset. The features of the feature subset Di to be removed and to be retained are thereby determined; optionally, the feature at the threshold point can be retained.
Finally, the above operations are repeated for each feature subset: for every feature subset Di, i ∈ 1..I, a corresponding feature cut-off rank Tij is generated in the manner described above, i.e. a subset screening threshold is determined automatically for each feature subset.
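The sketch below pulls these sub-steps together for one feature subset and one screening index. It assumes the empirical CDF evaluated at the sorted normalized values and reads the "first horizontal point" as the first point past the preset cumulative value whose forward-difference gradient drops below a small bound; both readings, like the data, are assumptions made for illustration.

```python
import numpy as np

def subset_screening_threshold(values, cumulative_floor=0.5, flat_gradient=1.0):
    """Locate the subset screening threshold (cut-off value) on the CDF curve.

    `values` are the filter-function values (e.g. IV values) of every feature
    in one feature subset.
    """
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    normalized = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    sorted_vals = np.sort(normalized)                 # increasing order, formula (2) applied above
    n = len(sorted_vals)
    cdf = np.arange(1, n + 1) / n                     # empirical CDF at each sorted value

    for k in range(n - 1):                            # forward difference at interior points
        if cdf[k] < cumulative_floor:
            continue
        dv = sorted_vals[k + 1] - sorted_vals[k]
        gradient = (cdf[k + 1] - cdf[k]) / dv if dv > 0 else np.inf
        if gradient <= flat_gradient:                 # CDF is (nearly) horizontal here
            return sorted_vals[k]                     # normalized cut-off value
    return sorted_vals[-1]                            # fallback: no plateau found

# Hypothetical IV values of one feature subset: a clear gap above the median.
iv_values = [0.02, 0.03, 0.05, 0.06, 0.07, 0.30, 0.45, 0.60]
print(subset_screening_threshold(iv_values))
```

With these numbers the cut falls just before the gap between 0.07 and 0.30, so the three strongest features would be retained.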
204. The feature screening threshold corresponding to each feature screening index is determined according to the subset screening thresholds.
In this embodiment, the feature screening threshold corresponding to each feature screening index may be determined in various ways; for example, the median of the subset screening thresholds may be taken, or their average may be taken. This embodiment does not limit this.
In some embodiments, determining, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index includes: for each feature screening index, calculating the average of the subset screening thresholds corresponding to the respective feature subsets; and determining the average as the feature screening threshold of that feature screening index. Determining the feature screening threshold of a feature screening index by averaging the subset screening thresholds of the respective feature subsets makes the feature screening threshold more reasonable and accurate, thereby improving the rationality of feature screening.
The averaging method is described below as an example. For the j-th feature screening index Fj, the average is calculated over the list of I subset screening thresholds corresponding to the different feature subsets, giving the feature screening threshold corresponding to the j-th feature screening index Fj, i.e. the feature cut-off rank Tj. Note that the I subset screening thresholds being averaged were all obtained with the same feature screening index function, over the different feature subsets. The calculation process is as follows:
T_j = (T_{1j} + T_{2j} + … + T_{Ij}) / I    (4)
After the feature cut-off rank Tj corresponding to the j-th feature screening index Fj is obtained, the same is done for the other indexes: the average of the subset screening thresholds corresponding to each feature screening index is taken as that index's feature screening threshold, giving the list of feature screening thresholds from the 1st to the J-th feature screening index, [T1, …, Tj, …, TJ].
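A short sketch of this averaging (formula (4)) with hypothetical per-subset thresholds:

```python
import numpy as np

# Hypothetical subset screening thresholds T_ij: rows = feature subsets D1..D3,
# columns = feature screening indexes F1, F2 (e.g. IV and Gain).
subset_thresholds = np.array([
    [0.086, 0.12],
    [0.091, 0.10],
    [0.078, 0.15],
])

# Formula (4): average over the I subsets, giving one threshold per screening index.
feature_thresholds = subset_thresholds.mean(axis=0)
print(feature_thresholds)    # [T1, T2]
```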
As shown in FIG. 3, with J = 2, the feature screening thresholds T1 and T2 are calculated based on the CDF according to steps 203 and 204.
205. The features of each feature subset are screened according to the feature screening thresholds corresponding to the respective feature screening indexes to obtain a final feature screening result.
In this embodiment, the final feature screening result may be obtained in various ways. For example, the features of each feature subset may be screened according to the feature screening thresholds corresponding to the respective feature screening indexes and the union of the results determined as the final feature screening result; alternatively, the intersection of the screened results may be determined as the final feature screening result. Of course, the final feature screening result can also be determined by other weighted screening schemes. This embodiment does not limit this.
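As an illustration of the union and intersection options just described, the following sketch screens two hypothetical per-feature index tables against their feature screening thresholds and combines the results both ways; it assumes that a larger index value means a better feature, and all names and numbers are made up.

```python
# Hypothetical per-feature values for two screening indexes over one feature pool.
index_values = {
    "F1": {1: 0.02, 2: 0.31, 3: 0.45, 4: 0.07},   # e.g. IV per feature id
    "F2": {1: 0.11, 2: 0.28, 3: 0.05, 4: 0.40},   # e.g. Gain per feature id
}
feature_thresholds = {"F1": 0.09, "F2": 0.10}      # from step 204

# Screen per index, then combine: the union keeps more features, the intersection fewer.
kept_per_index = {
    name: {fid for fid, v in values.items() if v >= feature_thresholds[name]}
    for name, values in index_values.items()
}
union_result = set.union(*kept_per_index.values())                 # {1, 2, 3, 4}
intersection_result = set.intersection(*kept_per_index.values())   # {2}
print(kept_per_index, union_result, intersection_result)
```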
By automatically selecting thresholds from the distribution function curves, the feature screening method provided by this embodiment is more objective and reasonable than the traditional scheme, which relies on subjective experience to specify thresholds. Because the complete feature data set is sampled into feature subsets, screening can proceed in parallel over multiple feature subsets and multiple screening indexes at the same time; compared with traditional modeling, which requires repeated parameter tuning and modeling to satisfy the feature screening requirements of all participants, the method is more efficient and improves the comprehensiveness and rationality of feature screening. In addition, compared with the traditional scheme of applying multiple feature screening indexes in series, it avoids the overfitting problem caused by too few in-model features.
FIG. 5 is a schematic flow chart of a feature screening method according to another embodiment of the present invention. As shown in FIG. 5, on the basis of the embodiment shown in FIG. 2, this embodiment describes in detail how feature screening is performed according to the feature screening thresholds. The method includes the following steps:
501. A plurality of features provided by the participants in federated learning are sampled to obtain a plurality of feature subsets.
502. For each feature subset, a plurality of feature screening indexes of the feature subset are calculated.
503. For each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index is calculated, and the subset screening threshold of the feature screening index for the feature subset is determined according to the distribution function curve.
504. The feature screening threshold corresponding to each feature screening index is determined according to the subset screening thresholds.
Steps 501 to 504 in this embodiment are similar to steps 201 to 204 in the above embodiment, and are not described again here.
505. For each feature screening index, feature screening is performed on each feature subset according to the feature screening threshold of that index to obtain the new feature subsets corresponding to the index, and the feature collection corresponding to the feature screening index is determined from these new feature subsets.
In this embodiment, the feature collection may be determined in various ways, for example by taking the intersection or the union of the new feature subsets, or by other weighted screening schemes.
In some embodiments, determining the feature collection corresponding to the feature screening index from the new feature subsets corresponding to the index includes: taking the union of the new feature subsets corresponding to the feature screening index as the feature collection corresponding to that index.
Specifically, after the feature cut-off rank Tj corresponding to the j-th feature screening index Fj is obtained in the previous step, Tj may be used to perform feature screening on the feature sets corresponding to all the data subsets [D1, …, Di, …, DI], thereby generating the preliminarily screened new feature subsets {FS1j, …, FSij, …, FSIj}, where FSij is the new feature subset formed by the features of the i-th data subset Di that are retained after screening with the j-th feature screening index Fj.
Then, for the I data subsets, the union of the preliminarily screened feature subsets FSij is taken; the calculation is shown in formula (5), giving the screened feature collection FSj corresponding to the j-th feature screening index. Taking the union integrates the results over the multiple randomly drawn feature subsets and helps avoid overfitting.
FS_j = FS_{1j} ∪ FS_{2j} ∪ … ∪ FS_{Ij}    (5)
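A short sketch of formula (5), with hypothetical retained-feature sets per data subset:

```python
# Hypothetical preliminarily screened subsets FS_ij for one screening index Fj:
# the feature ids retained on each of the I = 3 data subsets D1..D3.
fs_1j = {2, 3, 7}
fs_2j = {3, 7, 9}
fs_3j = {2, 7, 11}

# Formula (5): the feature collection FS_j is the union over the I data subsets.
fs_j = fs_1j | fs_2j | fs_3j
print(sorted(fs_j))   # [2, 3, 7, 9, 11]
```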
FIG. 6 is a schematic diagram illustrating the principle of a feature screening method according to yet another embodiment of the present invention. As shown in FIG. 6, on the basis of FIG. 3, the feature collections FS1 and FS2 are obtained by taking the unions of the new feature subsets obtained with the feature screening thresholds T1 and T2.
506. For each feature, the score of the feature is determined according to the number of times the feature appears in the feature collections, and the final feature screening result is determined according to the scores of the features.
In this embodiment, the score of a feature can be determined in various ways; for example, the numbers of times the feature appears in the feature collections can be summed and the sum determined as the feature score, or a statistic of those counts such as the minimum, maximum, median, or average can be used. This embodiment does not limit this.
In some embodiments, determining, for each feature, the score of the feature according to the number of times the feature appears in the feature collections includes: determining the total number of times the feature appears in the feature collections as the score of the feature.
In some embodiments, determining the final feature screening result according to the scores of the features includes: for each feature, if the difference between the number of feature screening indexes and the score of the feature is less than a preset value, adding the feature to the final feature screening result.
Specifically, based on the feature collections {FS1, …, FSj, …, FSJ} obtained in the previous step by taking unions over the data subsets for the J feature screening indexes, the number of times each feature appears across all the FSj is counted and recorded as the score of that feature. In this way, each of the n features provided by the participants in federated modeling receives a score equal to its cumulative number of occurrences in the set {FS1, …, FSj, …, FSJ}; for example, if the k-th feature appears 2 times in the collections produced by 3 feature screening indexes, then its score is Sk = 2, and by analogy the score set for all n features is obtained. If J = 1, i.e. only one feature screening index is used, FS1 is directly the final output feature result set S. If J ≥ 2, the features whose score is at least J − 1 are selected as the final output feature result set S; that is, a feature enters the final feature result set only if it appears at least J − 1 times among the collections produced by the J feature screening indexes. The calculation is shown in formula (6):
S = { x_k | S_k ≥ J − 1, k = 1, …, n }    (6)
When the final feature screening result, i.e. the final modeling feature set, is obtained, the screening result is synchronized to each federated modeling participant, so that the features in the set S are selected from the feature sets provided by each party for the subsequent federated modeling step.
As shown in FIG. 6, for the feature collections FS1 and FS2, the number of occurrences of each feature in the two collections is counted, each feature is scored based on this count, and the final feature screening result S is determined from the scores. After S is obtained, S is synchronized to the corresponding m participants.
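A short sketch of the scoring rule of formula (6); the feature collections and feature ids are hypothetical, and the J = 1 case simply keeps FS1.

```python
from collections import Counter

# Hypothetical feature collections FS_j, one per screening index (J = 3 here).
feature_collections = [
    {2, 3, 7, 9, 11},   # FS1, e.g. from IV-based screening
    {2, 7, 9, 4},       # FS2, e.g. from Gain-based screening
    {2, 7, 11, 5},      # FS3, e.g. from PCC-based screening
]
J = len(feature_collections)

# Score each feature by how many collections it appears in, then keep features
# appearing at least J - 1 times (formula (6)); with J = 1 this keeps all of FS1.
scores = Counter(fid for fs in feature_collections for fid in fs)
final_result = {fid for fid, s in scores.items() if s >= max(J - 1, 1)}
print(scores, sorted(final_result))   # final set here: {2, 7, 9, 11}
```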
Compared with the traditional way of manually specifying feature screening thresholds, the feature screening method provided by this embodiment can output the final feature screening set automatically, avoiding the subjectivity of screening by experience and resolving the differences in how the modelers of different participants in a federated modeling scenario understand parameter values. In addition, compared with the traditional feature screening approach, which requires repeatedly adjusting parameters and observing the results, this scheme requires no parameter tuning, saving a large amount of labor. Unlike traditional serial screening with multiple screening indexes, this scheme considers the multiple feature screening indexes together in parallel, which avoids overfitting, and it determines the screening thresholds from the distribution function curves of the features (such as CDF curves), taking into account how the features are distributed under a given screening index, which gives the method greater generality.
FIG. 7 is a schematic structural diagram of a feature screening apparatus according to an embodiment of the present invention. As shown in FIG. 7, the feature screening apparatus 70 includes: a sampling module 701, an index calculation module 702, a subset screening threshold determination module 703, a feature screening threshold determination module 704, and a screening module 705.
A sampling module 701, configured to sample a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets;
an index calculation module 702, configured to calculate, for each feature subset, a plurality of feature screening indexes for the feature subset;
a subset screening threshold determination module 703, configured to calculate, for each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index, and determine, according to the distribution function curve, the subset screening threshold of the feature screening index for the feature subset;
a feature screening threshold determination module 704, configured to determine, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index;
and a screening module 705, configured to screen the features of each feature subset according to the feature screening thresholds corresponding to the respective feature screening indexes to obtain a final feature screening result.
In the feature screening apparatus 70 provided by this embodiment, the sampling module 701 samples a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets, where the features carry only identification information rather than their actual names; the index calculation module 702 calculates, for each feature subset, a plurality of feature screening indexes of the feature subset; the subset screening threshold determination module 703 calculates, for each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index and determines, according to the distribution function curve, the subset screening threshold of the feature screening index for the feature subset; the feature screening threshold determination module 704 determines, according to the subset screening thresholds, the feature screening threshold corresponding to each feature screening index; and the screening module 705 screens the features of each feature subset according to the feature screening thresholds corresponding to the feature screening indexes to obtain a final feature screening result. Because the apparatus 70 selects the thresholds automatically from the distribution function curves, it is more objective and reasonable than the traditional scheme, which relies on subjective experience to specify thresholds. Each participant hides the names of the features it provides, so feature screening can be performed while protecting user data privacy and ensuring security, and because the complete feature data set is sampled into feature subsets, feature screening can be performed in parallel over multiple feature subsets and multiple screening indexes at the same time.
In one possible design, the sampling module is specifically configured to:
randomly sample a plurality of features provided by the participants in federated learning to obtain a plurality of feature subsets.
In one possible design, the features include identification information, the identification information is a number, and the sampling module is specifically configured to:
sort the features according to their numbers to obtain a feature sequence; and
randomly sample the feature sequence according to a random generation plus sequential selection algorithm to obtain the plurality of feature subsets.
In one possible design, different feature screening indexes correspond to different filter functions, the distribution function is a cumulative distribution function (CDF), and the subset screening threshold determination module is specifically configured to:
calculate the filter function values corresponding to the feature screening index;
normalize the filter function values and sort the normalized filter function values in a preset order; and
calculate the CDF curve from the sorted filter function values.
In one possible design, the subset screening threshold determination module is specifically configured to:
locate, according to the gradient of the CDF curve, the first horizontal point above a preset cumulative value; and
determine the filter function value corresponding to the first horizontal point as the subset screening threshold of the feature screening index for the feature subset.
In one possible design, the feature screening threshold determination module is specifically configured to:
calculate, for each feature screening index, the average of the subset screening thresholds corresponding to the respective feature subsets; and
determine the average as the feature screening threshold of the feature screening index.
In one possible design, the screening module is specifically configured to:
for each feature screening index, perform feature screening on each feature subset according to the feature screening threshold of that index to obtain the new feature subsets corresponding to the index, and determine the feature collection corresponding to the feature screening index from these new feature subsets;
determine, for each feature, the score of the feature according to the number of times the feature appears in the feature collections; and
determine the final feature screening result according to the scores of the features.
In one possible design, the screening module is specifically configured to:
and taking the union of the new feature subsets corresponding to the feature screening indexes as a feature collection corresponding to the feature screening indexes.
In one possible design, the screening module is specifically configured to:
and determining the total number of times of the feature appearing in each feature collection as the score of the feature.
In one possible design, the screening module is specifically configured to:
and for each feature, if the difference between the number of the feature screening indexes and the score of the feature is smaller than a preset value, adding the feature into a final feature screening result.
The feature screening device provided in the embodiment of the present invention may be used to implement the method embodiments described above, and the implementation principle and the technical effect are similar, which are not described herein again.
FIG. 8 is a schematic diagram of the hardware structure of a feature screening apparatus according to an embodiment of the present invention. As shown in FIG. 8, the feature screening apparatus 80 of this embodiment includes: at least one processor 801 and a memory 802. The feature screening apparatus 80 further includes a communication component 803. The processor 801, the memory 802, and the communication component 803 are connected by a bus 804.
In particular implementations, the at least one processor 801 executes computer-executable instructions stored by the memory 802, causing the at least one processor 801 to perform a feature screening method as performed by the feature screening apparatus 80 described above.
When the data processing of this embodiment is executed by a server, the communication component 803 may transmit the plurality of features provided by the participants in federated learning to the server.
For a specific implementation process of the processor 801, reference may be made to the above method embodiments, which have similar implementation principles and technical effects, and details of this embodiment are not described herein again.
In the embodiment shown in FIG. 8, it should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or executed by a combination of hardware and software modules within the processor.
The memory may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one magnetic disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus in the drawings of the present application is not limited to only one bus or one type of bus.
An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the feature screening method performed by the above feature screening apparatus is implemented.
The present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the feature screening method performed by the above feature screening apparatus is implemented.
The computer-readable storage medium may be any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application-Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (17)

1. A method of feature screening, comprising:
sampling a plurality of features provided by each participant of federated learning to obtain a plurality of feature subsets; for each feature subset, calculating a plurality of feature screening indexes for the feature subset;
for each feature screening index of each feature subset, calculating a distribution function curve corresponding to the feature screening index, and determining, according to the distribution function curve, a subset screening threshold of the feature screening index for the feature subset;
determining a feature screening threshold corresponding to each feature screening index according to the subset screening thresholds;
and screening the features of each feature subset according to the feature screening threshold corresponding to each feature screening index to obtain a final feature screening result.
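By way of illustration only, the following Python sketch walks through the five steps of claim 1 on toy data. The two screening indices (per-feature variance and absolute correlation with a label), the quantile cut that stands in for the CDF-based threshold of claims 4-5, and all variable names are assumptions made for the example, not part of the claims.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                      # 200 samples, 12 pooled features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(float)     # toy label
n_subsets, subset_size, cum_value = 5, 6, 0.8

# Two assumed screening indices; any filter-style index could be plugged in.
indices = {
    "variance": lambda cols: X[:, cols].var(axis=0),
    "abs_corr": lambda cols: np.abs([np.corrcoef(X[:, c], y)[0, 1] for c in cols]),
}

# Step 1: sample the pooled features into several feature subsets.
subsets = [rng.choice(X.shape[1], size=subset_size, replace=False)
           for _ in range(n_subsets)]

# Steps 2-3: per subset and per index, derive a subset screening threshold
# (a plain quantile at the preset cumulative value stands in for the
# CDF-plateau rule of claims 4-5).
subset_thresholds = {name: [] for name in indices}
for cols in subsets:
    for name, fn in indices.items():
        subset_thresholds[name].append(np.quantile(fn(cols), cum_value))

# Step 4: average the subset thresholds into one threshold per index.
feature_thresholds = {name: float(np.mean(t)) for name, t in subset_thresholds.items()}

# Step 5: screen every subset with each index's threshold, pool the survivors
# per index, and keep the features that pass (almost) every index.
scores = np.zeros(X.shape[1], dtype=int)
for name, fn in indices.items():
    kept = set()
    for cols in subsets:
        kept |= {int(c) for c, v in zip(cols, fn(cols)) if v >= feature_thresholds[name]}
    for c in kept:
        scores[c] += 1

final = [c for c in range(X.shape[1]) if len(indices) - scores[c] < 1]
print("final feature screening result:", final)

With the quantile fixed at 0.8, roughly the top 20% of each subset survives each index; the CDF-plateau rule of claims 4-5 would instead adapt the cut to the shape of the value distribution rather than fixing it in advance.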
2. The method of claim 1, wherein sampling a plurality of features provided by each participant of federated learning to obtain a plurality of feature subsets comprises:
randomly sampling the plurality of features provided by each participant of federated learning to obtain the plurality of feature subsets.
3. The method of claim 2, wherein the features include identification information, the identification information being a number, and wherein randomly sampling the plurality of features provided by each participant of federated learning to obtain a plurality of feature subsets comprises:
sorting the features according to their numbers to obtain a feature sequence;
and randomly sampling the feature sequence according to a randomly generated sequence selection algorithm to obtain the plurality of feature subsets.
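A minimal sketch of the sampling described in claims 2-3, assuming the identification information is a plain integer number; NumPy's random choice stands in for the randomly generated sequence selection algorithm mentioned in the claim.

import numpy as np

def sample_feature_subsets(feature_numbers, n_subsets, subset_size, seed=None):
    # Sort the features by their numbers to obtain the feature sequence,
    # then draw random subsets from that sequence (claims 2-3).
    rng = np.random.default_rng(seed)
    sequence = sorted(feature_numbers)
    return [sorted(int(i) for i in rng.choice(sequence, size=subset_size, replace=False))
            for _ in range(n_subsets)]

# Example: 10 numbered features contributed by the federated-learning participants.
print(sample_feature_subsets(range(10), n_subsets=3, subset_size=4, seed=42))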
4. The method according to claim 1, wherein different feature screening indexes correspond to different filter functions, the distribution function is a cumulative distribution function (CDF), and calculating the distribution function curve corresponding to the feature screening index comprises:
calculating filter function values corresponding to the feature screening index;
normalizing the filter function values, and sorting the normalized filter function values in a preset order;
and calculating the CDF curve according to the sorted filter function values.
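One possible reading of claim 4 in Python, assuming min-max normalization and ascending order as the preset sequence; filter_values would come from whichever filter function belongs to the feature screening index (e.g. variance, information value, correlation).

import numpy as np

def cdf_curve(filter_values):
    # Normalize the filter-function values, sort them in a preset (ascending)
    # order, and return the points of the empirical CDF curve (claim 4).
    v = np.asarray(filter_values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)   # assumed min-max normalization
    v_sorted = np.sort(v)
    cumulative = np.arange(1, len(v_sorted) + 1) / len(v_sorted)
    return v_sorted, cumulative

values, cdf = cdf_curve([0.3, 1.2, 0.8, 2.5, 0.9, 0.7])
print(np.round(values, 3), np.round(cdf, 3))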
5. The method of claim 4, wherein determining, according to the CDF curve, the subset screening threshold of the feature screening index for the feature subset comprises:
locating, according to the gradient of the CDF curve, a first horizontal point located at a preset cumulative value;
and determining the filter function value corresponding to the first horizontal point as the subset screening threshold of the feature screening index for the feature subset.
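A sketch of one reading of claim 5: walk the CDF curve and, at or beyond the preset cumulative value, take the first point whose slope is (near-)horizontal. The flatness tolerance is an assumed parameter, and values_sorted/cdf are the outputs of the claim-4 computation sketched above.

import numpy as np

def subset_threshold(values_sorted, cdf, preset_cum=0.8, flat_tol=0.05):
    # Slope of the CDF between neighbouring points; the first (near-)horizontal
    # point at or beyond the preset cumulative value gives the subset
    # screening threshold (claim 5).
    slope = np.diff(cdf) / np.diff(values_sorted)
    for i, s in enumerate(slope):
        if cdf[i] >= preset_cum and s <= flat_tol:
            return float(values_sorted[i])
    return float(values_sorted[-1])       # fallback: no plateau found

# Toy CDF: a dense cluster of small values followed by a sparse tail.
x = np.sort(np.concatenate([np.linspace(0.0, 0.3, 46), [0.95, 1.0]]))
cdf = np.arange(1, len(x) + 1) / len(x)
print(subset_threshold(x, cdf))           # 0.3 -- the value where the curve flattens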
6. The method of claim 1, wherein determining the feature screening threshold corresponding to each feature screening index according to each subset screening threshold comprises:
for each feature screening index, calculating the average of the subset screening thresholds corresponding to the respective feature subsets;
and determining the average as the feature screening threshold of the feature screening index.
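Claim 6 is a plain average over the per-subset thresholds; a small sketch with hypothetical index names and values:

import numpy as np

# Hypothetical subset screening thresholds: index name -> one value per feature subset.
subset_thresholds = {
    "information_value": [0.31, 0.28, 0.35],
    "variance":          [0.12, 0.15, 0.11],
}

# Claim 6: the feature screening threshold of an index is the mean of its
# subset screening thresholds.
feature_thresholds = {name: float(np.mean(t)) for name, t in subset_thresholds.items()}
print(feature_thresholds)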
7. The method according to any one of claims 1 to 6, wherein screening the features of each feature subset according to the feature screening threshold corresponding to each feature screening index to obtain a final feature screening result comprises:
for each feature screening index, performing feature screening on each feature subset according to the feature screening threshold of the feature screening index to obtain new feature subsets corresponding to the feature screening index, and determining a feature set corresponding to the feature screening index according to the new feature subsets corresponding to the feature screening index;
for each feature, determining a score of the feature according to the number of times the feature appears in each feature set;
and determining a final feature screening result according to the scores of the features.
8. The method according to claim 7, wherein determining the feature set corresponding to the feature screening index according to each new feature subset corresponding to the feature screening index comprises:
taking the union of the new feature subsets corresponding to the feature screening index as the feature set corresponding to the feature screening index.
9. The method of claim 7, wherein determining, for each feature, a score for the feature based on a number of times the feature occurs in each feature set comprises:
determining the total number of times the feature appears in the feature sets as the score of the feature.
10. The method of claim 7, wherein determining a final feature screening result according to the score of each feature comprises:
for each feature, if the difference between the number of feature screening indexes and the score of the feature is less than a preset value, adding the feature to the final feature screening result.
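Claims 7-10 taken together: screen every feature subset with each index's threshold, take the union of the survivors as that index's feature set, score each feature by how many of those sets contain it, and keep the features whose score is within a preset margin of the number of indexes. A sketch under assumed in-memory data structures:

def final_selection(subsets, index_values, feature_thresholds, preset_value=1):
    # subsets: list of feature-id lists.
    # index_values: {index name: {feature id: filter-function value}}.
    # feature_thresholds: {index name: feature screening threshold}.
    n_indexes = len(feature_thresholds)
    scores = {}
    for name, threshold in feature_thresholds.items():
        # Claim 7: screen every subset with this index's threshold ...
        survivors = [{f for f in subset if index_values[name][f] >= threshold}
                     for subset in subsets]
        # Claim 8: ... and take the union as the index's feature set.
        feature_set = set().union(*survivors)
        # Claim 9: a feature scores one point per feature set containing it.
        for f in feature_set:
            scores[f] = scores.get(f, 0) + 1
    # Claim 10: keep features whose score is close enough to the index count.
    return sorted(f for f, s in scores.items() if n_indexes - s < preset_value)

# Hypothetical toy inputs: two indexes, two feature subsets, four features.
subsets = [[0, 1, 2], [1, 2, 3]]
index_values = {
    "iv":  {0: 0.05, 1: 0.40, 2: 0.30, 3: 0.02},
    "var": {0: 0.90, 1: 0.80, 2: 0.10, 3: 0.70},
}
thresholds = {"iv": 0.25, "var": 0.50}
print(final_selection(subsets, index_values, thresholds))   # [1]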
11. A feature screening apparatus, comprising:
a sampling module, configured to sample a plurality of features provided by each participant of federated learning to obtain a plurality of feature subsets;
an index calculation module, configured to calculate, for each feature subset, a plurality of feature screening indexes for the feature subset;
a subset screening threshold determination module, configured to calculate, for each feature screening index of each feature subset, a distribution function curve corresponding to the feature screening index, and to determine, according to the distribution function curve, the subset screening threshold of the feature screening index for the feature subset;
a feature screening threshold determination module, configured to determine, according to each subset screening threshold, a feature screening threshold corresponding to each feature screening index;
and a screening module, configured to screen the features of each feature subset according to the feature screening threshold corresponding to each feature screening index to obtain a final feature screening result.
12. The apparatus according to claim 11, wherein different feature screening indexes correspond to different filter functions, the distribution function is a cumulative distribution function (CDF), and the subset screening threshold determination module is specifically configured to:
calculate filter function values corresponding to the feature screening index;
normalize the filter function values, and sort the normalized filter function values in a preset order;
and calculate the CDF curve according to the sorted filter function values.
13. The apparatus of claim 12, wherein the subset screening threshold determination module is specifically configured to:
locate, according to the gradient of the CDF curve, a first horizontal point located at a preset cumulative value;
and determine the filter function value corresponding to the first horizontal point as the subset screening threshold of the feature screening index for the feature subset.
14. The apparatus according to any one of claims 11-13, wherein the screening module is specifically configured to:
for each feature screening index, perform feature screening on each feature subset according to the feature screening threshold of the feature screening index to obtain new feature subsets corresponding to the feature screening index, and determine a feature set corresponding to the feature screening index according to the new feature subsets corresponding to the feature screening index;
for each feature, determine a score of the feature according to the number of times the feature appears in each feature set;
and determine a final feature screening result according to the scores of the features.
15. A feature screening apparatus, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the feature screening method of any of claims 1 to 10.
16. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, implement the feature screening method of any one of claims 1 to 10.
17. A computer program product comprising a computer program, characterized in that the computer program realizes the feature screening method of any one of claims 1 to 10 when executed by a processor.
CN202210084862.1A 2022-01-25 2022-01-25 Feature screening method, device, storage medium, and program product Pending CN114444592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210084862.1A CN114444592A (en) 2022-01-25 2022-01-25 Feature screening method, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210084862.1A CN114444592A (en) 2022-01-25 2022-01-25 Feature screening method, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN114444592A true CN114444592A (en) 2022-05-06

Family

ID=81369831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210084862.1A Pending CN114444592A (en) 2022-01-25 2022-01-25 Feature screening method, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN114444592A (en)

Similar Documents

Publication Publication Date Title
CN107025596B (en) Risk assessment method and system
CN105824813B (en) A kind of method and device for excavating core customer
CN108898476A (en) A kind of loan customer credit-graded approach and device
CN113095927A (en) Method and device for identifying suspicious transactions of anti-money laundering
CN115577152B (en) Online book borrowing management system based on data analysis
WO2021174699A1 (en) User screening method, apparatus and device, and storage medium
CN108197795B (en) Malicious group account identification method, device, terminal and storage medium
CN113240259B (en) Rule policy group generation method and system and electronic equipment
WO2023029065A1 (en) Method and apparatus for evaluating data set quality, computer device, and storage medium
CN114782123A (en) Credit assessment method and system
CN110827036A (en) Method, device, equipment and storage medium for detecting fraudulent transactions
CN113112347A (en) Determination method of hasty collection decision, related device and computer storage medium
CN112035570A (en) Merchant evaluation method and system
CN114444592A (en) Feature screening method, device, storage medium, and program product
CN112241820A (en) Risk identification method and device for key nodes in fund flow and computing equipment
CN107402984B (en) A kind of classification method and device based on theme
CN111723338A (en) Detection method and detection equipment
CN110458707B (en) Behavior evaluation method and device based on classification model and terminal equipment
CN114881761A (en) Determination method of similar sample and determination method of credit limit
CN114626940A (en) Data analysis method and device and electronic equipment
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN113487440A (en) Model generation method, health insurance claim settlement determination method, device, equipment and medium
CN113869460A (en) Model iteration method, device, equipment and storage medium
CN113191877A (en) Data feature acquisition method and system and electronic equipment
CN115082079B (en) Method and device for identifying associated user, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination