WO2023231184A1 - Feature screening method and apparatus, storage medium, and electronic device - Google Patents

Feature screening method and apparatus, storage medium, and electronic device

Info

Publication number
WO2023231184A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
target
sample
training
Prior art date
Application number
PCT/CN2022/113011
Other languages
English (en)
French (fr)
Inventor
成晓亮
张磊
周岳
张伟
郑可嘉
Original Assignee
江苏品生医疗科技集团有限公司
南京品生医疗科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏品生医疗科技集团有限公司, 南京品生医疗科技有限公司 filed Critical 江苏品生医疗科技集团有限公司
Publication of WO2023231184A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to the field of data processing technology, and in particular to a feature screening method, device, storage medium and electronic equipment.
  • Mass spectrometry technology is booming and is widely used in testing projects in many clinical fields, including endocrinology, cardiovascular, oncology, and drug therapy.
  • Mass spectrometry technology is an indispensable tool for achieving precise diagnosis and precision medicine.
  • Based on mass spectrometry technology, various omics big data of clinical samples, such as proteomics, metabolomics, and lipidomics, can be obtained. Accordingly, how to conduct reasonable and effective analysis of the multi-omics data brought by mass spectrometry technology is one of the key points and hot spots of research.
  • the invention provides a feature screening method, device, storage medium and electronic equipment to improve the accuracy of feature screening.
  • a feature screening method including:
  • the sample data is divided into individual groups based on the individuals to which it belongs, to obtain individual sample groups corresponding to different individuals, and cross-validation division is performed based on multiple individual sample groups to determine the training data set and verification data set obtained by the division;
  • the corresponding target data feature group of the processing target is determined based on the training process data of each machine learning model.
  • a feature screening device including:
  • the feature verification subset determination module is used to determine multiple feature verification subsets based on the data characteristics in the sample data
  • a data set division module is used to divide the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and to perform cross-validation division based on multiple individual sample groups to determine the divided training data set and validation data set;
  • the model training module is used to train the machine learning model of the processing target based on the training data set and verification data set corresponding to each of the feature verification subsets;
  • the target data feature group determination module is used to determine the corresponding target data feature group of the processing target based on the training process data of each machine learning model.
  • an electronic device including:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the feature screening method described in any embodiment of the present invention.
  • a computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor, when executed, to implement the feature screening method described in any embodiment of the present invention.
  • In this technical solution, multiple data features in the sample data are verified in the form of feature verification subsets based on a machine learning model, so as to achieve wrapper-style screening of data features and obtain the target data feature group used for processing target prediction.
  • Further, the sample data used to train the machine learning model is divided by individual, with sample data of the same individual divided into the same individual sample group, and cross-validation division is performed based on the individual sample groups, so that sample data of the same individual is not divided into the training data set and the verification data set at the same time, thereby avoiding the impact of individual sample data on the performance of the machine learning model and further improving the accuracy of feature screening.
  • Figure 1 is a flow chart of a feature screening method provided by an embodiment of the present invention.
  • Figure 2 is a flow chart of a feature screening method provided by an embodiment of the present invention.
  • Figure 3 is a flow chart of a feature screening method provided by an embodiment of the present invention.
  • Figure 4 is an example of a data distribution diagram provided by an embodiment of the present invention.
  • Figure 5 is a flow chart of a feature screening method provided by an embodiment of the present invention.
  • Figure 6 is a schematic structural diagram of a feature screening device provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • Figure 1 is a flow chart of a feature screening method provided by an embodiment of the present invention. This embodiment can be applied to the situation of screening data features for predicting processing targets among a large number of data features.
  • This method can be executed by a feature screening device.
  • the feature screening device can be implemented in the form of hardware and/or software, and the feature screening device can be configured in electronic equipment such as computers, servers, etc.
  • the method includes:
  • each group of sample data may include multiple types of data features.
  • the types of data features included in different groups of sample data may be the same, and the data values of each data feature may be different.
  • the sample data may include omics data and/or clinical data.
  • the omics data may be obtained through mass spectrometry technology.
  • the omics data includes but is not limited to proteomics, metabolomics, and lipidomics.
  • The clinical data can be collected through data collection equipment or can be historically collected data; clinical data includes but is not limited to blood pressure, heart rate, respiratory rate, etc.
  • the target data features corresponding to the processing target are only a subset of the data features in the sample data, and the target data features corresponding to different processing targets can be different.
  • the processing target can be classification prediction of the input data in any dimension.
  • for example, the processing target may be prediction of hormone concentration at different time points, prediction of the pathological grade of a certain disease, etc.
  • yi in the above sample data represents the label of xi in the processing target dimension.
  • a data feature set is obtained based on the data features in each group of sample data, and multiple feature verification subsets are randomly determined from the data feature set, where the number of data features in a feature verification subset is random; this number can be greater than 1 and less than the total number of data features, that is, a feature verification subset can include some or all of the data features. Based on the determined number of data features, a corresponding number of data features are randomly selected from the data feature set to form a feature verification subset.
  • determining multiple feature verification subsets based on data features in the sample data includes: determining multiple feature verification subsets among the data features in the sample data based on the number of features in the feature verification subset.
  • the number of features in the feature verification subset can be preset, for example, 8, 10, 15, etc., which can be set according to user needs.
  • the number of features in the feature verification subset can also be determined based on the amount of sample data.
  • the maximum number of features in a feature verification subset is the ratio of the number of samples to a preset value, and the preset value can be 15. It should be noted that the preset value is not limited and can be set according to user needs.
  • Accordingly, the number of features d in a feature verification subset lies between 1 and the ratio of the number of samples to the preset value. Based on the number of features d in a feature verification subset and the total number of data features D in the sample data, the number of feature verification subsets can be determined, for example as the number of combinations of d features selected from the D data features.
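  • As an illustration of the subset-generation step above, the following is a minimal Python sketch that randomly draws feature verification subsets whose size is capped by the number of samples divided by the preset value; the function name, the number of subsets drawn, and the lower bound of two features are assumptions made for the example rather than values fixed by this disclosure.

```python
import random

def build_feature_subsets(feature_names, n_samples, n_subsets=100,
                          preset_value=15, seed=0):
    """Randomly draw candidate feature-verification subsets.

    The subset size is capped at n_samples / preset_value, mirroring the
    "ratio of the number of samples to a preset value (e.g. 15)" rule above.
    """
    rng = random.Random(seed)
    max_size = max(2, min(len(feature_names), n_samples // preset_value))
    subsets = []
    for _ in range(n_subsets):
        d = rng.randint(2, max_size)                     # random subset size
        subsets.append(tuple(sorted(rng.sample(list(feature_names), d))))
    return subsets

# Example: 120 samples and 40 candidate features -> subsets of at most 8 features.
features = [f"feat_{j}" for j in range(40)]
print(build_feature_subsets(features, n_samples=120, n_subsets=3))
```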
  • By performing wrapper-style feature selection on the data features in the feature verification subsets, the target data features for the processing target are screened out, that is, the combination of important markers for predicting the label y.
  • the machine learning model is verified for each feature verification subset, and the accuracy of the feature verification subset is reversely verified through the training results of the machine learning model to obtain the target data feature group corresponding to the processing target.
  • Before each machine learning model is trained based on the sample data, the sample data is divided by cross-validation to obtain a training data set and a verification data set. Combined with the correspondence between sample data and individuals, the cross-validation division of the sample data avoids sample data from the same individual being divided into the training data set and the verification data set at the same time, which would affect the true performance of the machine learning model.
  • For example, N pieces of sample data can come from M individuals, with M ≤ N; the number of individuals M can be equal to the number of samples N, or the sample data can be samples of the M individuals taken at different stages.
  • Optionally, dividing the sample data into individual groups to obtain individual sample groups corresponding to different individuals, and performing cross-validation division based on multiple individual sample groups to determine the divided training data set and verification data set, includes: dividing at least one group of sample data of the same individual into the same individual group to obtain individual sample groups corresponding to different individuals; and performing cross-validation division on the multiple individual sample groups based on at least one preset cross-validation rule to determine the divided training data set and verification data set.
  • Each sample data can carry the identification information of the individual it belongs to.
  • Each sample data can be divided based on the identification information of the individual it belongs to. That is, the sample data carrying the same identification information can be divided into the same individual group to obtain each individual sample group.
  • Each individual sample group is used as a unit data group for cross-validation division, and cross-validation division is performed to obtain a training data set and a validation data set.
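  • As a concrete illustration of the individual-group division described above, the sketch below groups sample records by the individual identification information they carry; the dictionary-based record layout is an assumption made for the example.

```python
from collections import defaultdict

# Each sample carries the identification information of the individual it belongs to.
samples = [
    {"individual_id": "S1", "features": [1.2, 0.4], "label": 0.8},
    {"individual_id": "S1", "features": [1.1, 0.5], "label": 0.9},
    {"individual_id": "S2", "features": [0.3, 2.2], "label": 0.1},
]

individual_groups = defaultdict(list)
for sample in samples:
    individual_groups[sample["individual_id"]].append(sample)  # S_m = {x_i, ...}

for individual, group in individual_groups.items():
    print(individual, "->", len(group), "sample(s)")
```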
  • The implementation of cross-validation division is not limited; any method that can perform cross-validation division with individual sample groups as the division unit may be used.
  • For example, the repeated K-fold cross-validation method (Repeated K-fold), the leave-one-out cross-validation method (LeaveOneOut), and the leave-P-out cross-validation method (LeavePOut) can be used.
  • Individual sample groups can be divided by cross-validation based on any of the above cross-validation division methods. For the repeated K-fold cross-validation method Repeated K-fold, the configuration parameter k is an integer greater than or equal to 2, and the number of repetitions Repeated is an integer greater than or equal to 1.
  • For the leave-P-out cross-validation method LeavePOut, the value range of P is 1 ≤ P ≤ M; the only parameter is P, which means that P individual data groups are used as the test data set and the remaining M−P individuals are used as the training data set.
  • It should be noted that, in the case where the number of individual data groups is less than a quantity threshold (for example, 100), the individual sample groups can be divided by the leave-one-out cross-validation method LeaveOneOut or the leave-P-out cross-validation method LeavePOut, to ensure the stability of the running results of machine learning model training.
  • For the repeated K-fold cross-validation method or the leave-one-out cross-validation method, the results of dividing the data are recorded as pairs of training and test data sets for each fold and repetition, where testset(*) is the test data set and trainset(*) is the training data set.
  • For the leave-P-out cross-validation method, the cross-validation division of the sample data is as follows: P individual data groups are taken from the M individual data groups and combined, the number of combinations being C(M, P); for each of the C(M, P) combinations, the P individual data groups serve as the test data set {S_1, ..., S_P} and the remaining M−P individuals serve as the training data set {S_(P+1), ..., S_M}. The results of dividing the data are recorded in the same way.
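  • For readers who want to reproduce individual-level cross-validation division, the following sketch uses scikit-learn's group-aware splitters (LeaveOneGroupOut and LeavePGroupsOut) with the individual identifier as the group label; this is an illustrative mapping of the scheme above onto an existing library, not the implementation of this disclosure.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, LeavePGroupsOut

# Toy data: 6 samples drawn from 3 individuals (individual IDs act as groups),
# so all samples of one individual always land on the same side of a split.
X = np.arange(12).reshape(6, 2)          # 6 samples, 2 features
y = np.array([0.3, 0.5, 0.1, 0.9, 0.4, 0.7])
groups = np.array(["S1", "S1", "S2", "S2", "S3", "S3"])

# Leave-one-out at the individual level (LeaveOneOut over individual groups).
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    print("train individuals:", set(groups[train_idx]),
          "test individuals:", set(groups[test_idx]))

# Leave-P-out at the individual level: P individuals form the test set,
# the remaining M-P individuals form the training set.
for train_idx, test_idx in LeavePGroupsOut(n_groups=2).split(X, y, groups):
    print("test individuals:", set(groups[test_idx]))
```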
  • The machine learning model is trained for the processing target based on the training data set and the verification data set obtained by cross-validation division. Since the types and quantities of data features included in different feature verification subsets differ, based on the types of data features in each feature verification subset, the corresponding data items are screened from each of the above training data sets and verification data sets to form the training data set and verification data set corresponding to that feature verification subset.
  • The machine learning model is then trained based on the training data set and verification data set corresponding to the feature verification subset, so as to obtain a machine learning model of the feature verification subset for the processing target.
  • the machine learning model may be a logistic regression model.
  • the machine learning model includes but is not limited to a simple linear regression model, a ridge regression model, a lasso regression model, an elastic network regression model, a Bayesian regression model, a k-nearest neighbor regression model, a support vector machine regression model, a random forest regression model, etc.
  • grid search can be used to optimize model parameters during the training process.
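  • As one possible realization of training with grid-search parameter optimization on a single feature verification subset, the hedged sketch below combines scikit-learn's GridSearchCV with an individual-level splitter; the ridge model, the alpha grid, and the synthetic data are assumptions for the example, not the configuration of this disclosure.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.linear_model import Ridge

# Toy data: rows are samples, columns are candidate features; "groups" holds the
# individual each sample comes from, so tuning never mixes one individual across
# the training and validation folds.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=30)
groups = np.repeat(np.arange(15), 2)       # 15 individuals, 2 samples each

subset = [0, 3, 7]                          # one feature-verification subset (column indices)
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},   # grid-searched hyperparameter
    cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
search.fit(X[:, subset], y, groups=groups)
print("best alpha:", search.best_params_, "CV RMSE:", -search.best_score_)
```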
  • the training results of the machine learning model may include, but are not limited to, a first parameter used to characterize training completion, a second parameter used to characterize model accuracy, etc.
  • the training results of the machine learning model may include but are not limited to the prediction evaluation information of the model. The optimal machine learning model is selected through one or more of the above parameters or prediction evaluation information.
  • the feature verification subset corresponding to the optimal machine learning model can be determined as the target data feature group.
  • Optionally, the training results of the machine learning model can be the prediction results for the sample data; the evaluation parameters of the machine learning model, such as prediction errors, can be determined from the labels in the sample data and the prediction results.
  • Based on the evaluation parameters of the machine learning models, each machine learning model can be sorted, or the optimal machine learning model can be screened, to determine the target data feature group corresponding to the processing target.
  • In some embodiments, determining the target data feature group corresponding to the processing target based on the training process data of each machine learning model includes: for any machine learning model, determining training indicators and test indicators respectively based on the training data and verification data in the training process data of the machine learning model; sorting and filtering each machine learning model based on the training indicators and test indicators of each machine learning model; and determining the feature verification subset corresponding to the screened-out machine learning model as the target data feature group of the processing target.
  • the training data is the prediction result obtained by training the machine learning model based on the sample data in the training data set
  • the verification data is the prediction result obtained by training the machine learning model based on the sample data in the verification data set.
  • the number of index types of the training index and the testing index is at least one respectively, and the index types of the training index and the testing index are the same.
  • for example, the training index and the test index each include the root mean square error RMSE and the goodness of fit R².
  • the root mean square error RMSE can be calculated by the following formula: $\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$, where n is the number of predicted samples.
  • the goodness of fit R² can be calculated by the following formula: $R^2 = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}$, where $\hat{y}_i$ is the predicted value, $y_i$ is the real value, that is, the label value in the sample data, and $\bar{y}$ is the average of the true values.
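  • The two indicators can be computed, for example, with scikit-learn's metric functions, as in the minimal sketch below; the toy label and prediction values are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 2.5, 4.1, 5.0])   # label values in the sample data
y_pred = np.array([2.8, 2.7, 3.9, 5.2])   # model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean square error
r2 = r2_score(y_true, y_pred)                        # goodness of fit R^2
print(f"RMSE={rmse:.3f}  R2={r2:.3f}")
```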
  • The target data feature group corresponding to the processing target is determined through the training indicators and test indicators of each machine learning model; that is, for a machine learning model whose training indicators and test indicators meet the filtering conditions, the corresponding feature verification subset is determined as the target data feature group corresponding to the processing target.
  • For example, the evaluation parameter is the difference between the root mean square error RMSE in the training index and the root mean square error RMSE in the test index.
  • This evaluation parameter is negatively related to the performance stability of the machine learning model. The smaller the evaluation parameter, the closer the performance of the training data set model and the test data set model is, that is, the more stable the performance of the machine learning model.
  • the feature verification subset corresponding to the machine learning model whose evaluation parameters are smaller than the first preset value may be determined as the target data feature group corresponding to the processing target.
  • the goodness of fit R 2 in the test indicator is positively related to the performance of the machine learning model.
  • the greater the goodness of fit R 2 in the test indicator the better the performance of the machine learning model.
  • a machine learning model whose goodness of fit R 2 in the test index is greater than the second preset value may be used, and the corresponding feature verification subset is determined as the target data feature group corresponding to the processing target.
  • Alternatively, the performance of the machine learning model can be jointly evaluated based on the evaluation parameter and the goodness of fit R² in the test index; for example, the evaluation parameter and the goodness of fit R² in the test index are weighted by their respective weights to obtain a performance evaluation value of the machine learning model, each machine learning model is sorted based on the performance evaluation value, and the feature verification subset corresponding to the machine learning model whose performance evaluation value meets the performance requirements is determined as the target data feature group corresponding to the processing target.
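  • A possible sketch of this sorting-and-screening step is shown below: candidate subsets are ranked by a weighted combination of the test goodness of fit R² and the training/test RMSE gap described above; the specific weights, the helper name, and the example numbers are assumptions, not values from this disclosure.

```python
def rank_feature_subsets(results, w_stability=0.5, w_fit=0.5):
    """Rank candidate subsets from per-model training/test indicators.

    `results` maps a feature subset (tuple of names) to a dict with keys
    'rmse_train', 'rmse_test', 'r2_test'.  The weights are illustrative.
    """
    scored = []
    for subset, m in results.items():
        stability_gap = abs(m["rmse_train"] - m["rmse_test"])  # smaller = more stable
        score = w_fit * m["r2_test"] - w_stability * stability_gap
        scored.append((score, subset))
    return [s for _, s in sorted(scored, reverse=True)]

results = {
    ("feat_a", "feat_b"): {"rmse_train": 0.30, "rmse_test": 0.34, "r2_test": 0.81},
    ("feat_a", "feat_c"): {"rmse_train": 0.18, "rmse_test": 0.45, "r2_test": 0.62},
}
print(rank_feature_subsets(results))  # best-performing subset first
```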
  • In the technical solution of this embodiment, multiple data features in the sample data are verified in the form of feature verification subsets based on a machine learning model, so as to achieve wrapper-style screening of data features and obtain the target data feature group used for processing target prediction.
  • Further, the sample data used to train the machine learning model is divided by individual, with sample data of the same individual divided into the same individual sample group, and cross-validation division is performed based on the individual sample groups, so that sample data of the same individual is not divided into the training data set and the verification data set at the same time, thereby avoiding the impact of individual sample data on the performance of the machine learning model and further improving the accuracy of feature screening.
  • Figure 2 is a flow chart of a feature screening method provided by an embodiment of the present invention, which is optimized based on the above embodiment.
  • the method further includes: determining the correlation between each data feature in the sample data and the processing target, and screening candidate data features based on the correlation with the processing target; correspondingly, determining multiple feature verification subsets based on the data features in the sample data includes: determining multiple feature verification subsets among the candidate data features.
  • the method includes:
  • the data features in the sample data are initially screened to eliminate data features that have no correlation or weak correlation with the processing target, and obtain the data features that are correlated or have a strong correlation with the processing target.
  • the number of candidate data features is reduced, which reduces the calculation amount of the screening process and improves the screening efficiency.
  • candidate data features are determined based on the correlation between each data feature and the processing target.
  • the correlation between the data characteristics and the processing target can be characterized by numerical values, and the comparison result of the numerical value and the threshold value is used to determine whether the data characteristics and the processing target are related, and the strength of the correlation.
  • the correlation between the data characteristics and the processing target can be determined through at least one correlation determination rule, so as to calculate the correlation between the data characteristics and the processing target from different dimensions and improve the accuracy of the determined candidate data characteristics.
  • For example, multiple rounds of screening of the data features in the sample data can be performed based on the correlations determined by multiple correlation determination rules. For example, based on a first correlation determination rule, the first correlation between each data feature and the processing target is determined, and based on the first correlation corresponding to each data feature, data features that have no correlation or only weak correlation with the processing target are eliminated to obtain first candidate data features.
  • For the first candidate data features, based on a second correlation determination rule, the second correlation between each data feature and the processing target is determined, and based on the second correlation corresponding to each first candidate data feature, data features with no or only weak correlation with the processing target are eliminated to obtain second candidate data features, and so on, until the final candidate data features are obtained.
  • correlation determination rules include but are not limited to univariate linear regression method, mutual information method (Mutual Information), Lasso regression method, etc.
  • correlation calculations are performed on the data features in the sample data based on the above-mentioned correlation determination rules, and candidate data features are screened.
  • For the univariate linear regression method, a linear equation between the data feature and the processing target can be constructed, y = wx + b, where w is the slope and b is the intercept.
  • The absolute value of the slope is positively related to the correlation, and the first correlation P value between the data feature and the processing target can be calculated based on the slope. A small first correlation P value indicates a strong correlation between the data feature and the processing target, and a large first correlation P value indicates a weak correlation. Feature data whose first correlation P value is smaller than a preset correlation threshold is set as a candidate data feature; that is, feature data whose first correlation P value is greater than or equal to the preset correlation threshold is eliminated, where the preset correlation threshold can be 0.1 or 0.05 and can be determined according to the set filtering accuracy.
  • For example, data features whose first correlation P value is greater than or equal to the preset correlation threshold and whose second correlation MI value is zero can be eliminated from all data features in the sample data to obtain the first candidate data features, which can be recorded, for example, as D_filt.
  • The filtered candidate data features can be sorted based on the absolute value of the coefficient β_j.
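  • As an illustrative sketch of this univariate pre-screening, the code below computes the per-feature P value with scipy's linregress and the mutual information with scikit-learn's mutual_info_regression, keeping a feature when its P value passes the threshold or its MI value is non-zero; the 0.05 threshold follows the text, while the function name and synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.feature_selection import mutual_info_regression

def prescreen_features(X, y, p_threshold=0.05):
    """Keep features whose univariate-regression P value passes the threshold
    or whose mutual information with the target is non-zero."""
    mi = mutual_info_regression(X, y, random_state=0)
    keep = []
    for j in range(X.shape[1]):
        p_value = linregress(X[:, j], y).pvalue   # P value of the slope w in y = w*x + b
        if p_value < p_threshold or mi[j] > 0:
            keep.append(j)
    return keep

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=100)
print("kept feature columns:", prescreen_features(X, y))
```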
  • the target data feature group for the processing target is further screened through machine learning.
  • The technical solution of this embodiment preliminarily screens all data features through the correlation between each data feature and the processing target, eliminating irrelevant or weakly relevant features and reducing the number of features to be screened in the machine learning process; further, screening based on the candidate features to obtain the target data feature group corresponding to the processing target reduces the number of data features, screens candidate data features in a targeted manner, reduces the interference of invalid data features, and reduces the computational cost and time cost of screening.
  • Figure 3 is a flow chart of a feature screening method provided by an embodiment of the present invention, which is optimized based on the above embodiment.
  • the method further includes: for any target data feature, drawing a data distribution diagram of the target data feature based on the sample data corresponding to the target data feature, and verifying the target data feature based on the data distribution diagram of the target data feature.
  • the method includes:
  • Although the screened target data feature group can build a model with better performance, if the data distribution of the data features does not match clinical performance, bias is easily introduced in the subsequent research or application of the model, affecting the performance of the model.
  • data distribution verification is performed on the screened out target data features to ensure that the target data features used for prediction processing targets meet the data distribution requirements.
  • the data distribution diagram of each target data feature is drawn, and whether the data distribution diagram meets the data distribution requirements is determined.
  • drawing a data distribution diagram of the target data feature includes: determining the data type of the target data feature; and drawing, based on the sample data corresponding to the target data feature, a data distribution diagram corresponding to the data type.
  • the data types of target data features can include categorical and numerical types, and different data types correspond to different types of data distribution charts.
  • The data content of a categorical data feature for different objects is limited and belongs to a fixed range of data content.
  • The data content of a numerical data feature for different objects is non-fixed and can be any value within the data range, and is not limited to positive numbers.
  • For example, for a categorical data feature, the data content of any object is one of {1, 0}, that is, the data content of any object is 0 or 1, and there is no other data form.
  • For a numerical data feature, the data content of any object can be within (0, 1); the data content of different objects can be any value greater than 0 and less than 1, for example 0.5, 0.33, 0.96, 0.5689, etc.
  • The data type of a target data feature is determined according to the numerical type and the number of data values of the data content corresponding to the target data feature, where the numerical type can be integer or decimal; for example, the numerical type corresponding to a categorical target data feature can be integer, and the numerical type corresponding to a numerical target data feature can include integer and decimal.
  • the number of data values can be the number of non-repeating data values.
  • The number of data values corresponding to a categorical target data feature is limited and small, for example less than a quantity threshold; the number of data values corresponding to a numerical target data feature is large, for example greater than the quantity threshold.
  • Determining the data type of the target data feature includes: performing deduplication processing on the data values of the target data feature to obtain deduplicated data values; in the case where the deduplicated data values are all integers and the number of data values is less than or equal to a preset threshold, determining that the data type of the target data feature is categorical; and in the case where the deduplicated data values are not all integers or the number of data values is greater than the preset threshold, determining that the data type of the target data feature is numerical.
  • By deduplicating each data value of the target data feature in the sample data, duplicate data values are removed to obtain unique data values, yielding a unique data set of the target data feature. The number of data values in this set and the numerical type of each value are counted; if the data values in the set are all integers and the number of data values is less than or equal to the preset threshold, the data type of the target data feature is determined to be categorical; correspondingly, if the data values in the set are not all integers, or the number of data values in the set is greater than the preset threshold, the data type of the target data feature is determined to be numerical.
  • the preset threshold may be 5, which is not limited and can be set according to requirements.
  • different target data characteristics can execute the above determination process synchronously to improve the efficiency of data type determination.
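  • A minimal sketch of this data-type determination (deduplicate, check for integers, compare the count against a preset threshold of 5) might look as follows; the function name is hypothetical.

```python
def infer_feature_type(values, preset_threshold=5):
    """Classify a target data feature as 'categorical' or 'numerical'.

    Deduplicate the values; if every unique value is an integer and there are
    at most `preset_threshold` of them, treat the feature as categorical.
    """
    unique_values = set(values)                      # deduplication
    all_integers = all(float(v).is_integer() for v in unique_values)
    if all_integers and len(unique_values) <= preset_threshold:
        return "categorical"
    return "numerical"

print(infer_feature_type([0, 1, 1, 0, 1]))           # -> categorical
print(infer_feature_type([0.5, 0.33, 0.96, 0.5689])) # -> numerical
```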
  • Based on the data type of the target data feature, the data distribution diagram type of the target data feature is determined. Further, drawing a data distribution diagram corresponding to the data type based on the sample data corresponding to the target data feature includes: if the data type of the target data feature is categorical, drawing a horizontal bar chart of the target data feature and a box plot of the target data feature versus the processing target based on the sample data corresponding to the target data feature; if the data type of the target data feature is numerical, drawing a histogram of the target data feature and a scatter regression plot of the target data feature versus the processing target based on the sample data corresponding to the target data feature.
  • For any target data feature, the data values of the target data feature in the sample data are obtained, and the data distribution diagram of the target data feature is drawn based on those data values according to the data distribution diagram type corresponding to the target data feature.
  • FIG. 4 is an example of a data distribution diagram provided by an embodiment of the present invention.
  • the left picture in Figure 4 is a bar chart
  • the right picture is a scatter regression chart.
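  • As an illustration only, the sketch below draws the two kinds of distribution diagrams with matplotlib: a horizontal bar chart plus box plot for categorical features, and a histogram plus scatter-regression plot for numerical features; it assumes the feature and processing-target values are 1-D NumPy arrays and the function name is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_distribution(feature, target, feature_type, name="feature"):
    """Draw the distribution diagram matching the feature's data type:
    bar chart + box plot for categorical features, histogram +
    scatter-regression plot for numerical features.
    `feature` and `target` are 1-D NumPy arrays of equal length."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    if feature_type == "categorical":
        levels, counts = np.unique(feature, return_counts=True)
        ax1.barh([str(l) for l in levels], counts)                 # horizontal bar chart
        ax2.boxplot([target[feature == l] for l in levels])        # box plot vs. target
        ax2.set_xticklabels([str(l) for l in levels])
    else:
        ax1.hist(feature, bins=20)                                 # histogram
        ax2.scatter(feature, target, s=10)                         # scatter plot
        w, b = np.polyfit(feature, target, deg=1)                  # fitted regression line
        xs = np.linspace(feature.min(), feature.max(), 50)
        ax2.plot(xs, w * xs + b)
    fig.suptitle(name)
    fig.tight_layout()
    return fig
```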
  • verifying the target data features based on the data distribution diagram of the target data features includes: removing the target data feature when the data distribution diagram of the target data feature does not satisfy the distribution rules, or eliminating the target data feature group in which the target data feature is located.
  • different data distribution map types may correspond to different distribution rules, and the data distribution map of the target data feature is verified based on the distribution rules corresponding to the target data feature.
  • different target data features correspond to different distribution rules, and the data distribution diagram of the target data feature can be verified according to the distribution rules corresponding to the target data feature type.
  • If the data distribution diagram of a target data feature does not satisfy the distribution rules, the target data feature is eliminated. Furthermore, multiple target data features in a target data feature group work together to achieve the purpose of predicting the processing target; when any target data feature in the target data feature group does not satisfy the distribution rules, the target data feature group will introduce errors in subsequent analysis and application, and therefore the target data feature group is eliminated.
  • The technical solution provided by this embodiment uses machine learning to screen the target data feature group corresponding to the processing target from multiple feature verification subsets, and then further verifies the data distribution of the target data features to eliminate data features that do not match clinical performance, ensuring that the screened target data features are practical.
  • FIG. 5 is a flow chart of a feature screening method provided by an embodiment of the present invention.
  • Figure 5 provides the system structure for executing the feature screening method.
  • the system structure includes the main module (i.e., the regression learning algorithm module), the cross-validation data set building module and the feature distribution drawing module.
  • The main module can call the cross-validation data set building module and the feature distribution drawing module. After the main module performs univariate screening of the data features in the sample data, it determines multiple feature verification subsets for multivariate screening.
  • Before performing multivariate screening through the regression learning algorithm, the main module calls the cross-validation data set building module to obtain the training and test data sets for multivariate screening.
  • the cross-validation data set building module is used to divide the training data set and the test data set through the cross-validation method based on the correspondence between individuals and samples.
  • the main module executes the regression learning algorithm and obtains the multivariate screening results, that is, the target data feature group.
  • the main module calls the feature distribution drawing module to draw the distribution graph of features in the data, which is used to visually display the distribution status of the target data features and verify the target data feature group.
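  • To tie the modules together, the following self-contained sketch performs wrapper-style screening over small feature verification subsets with individual-level cross-validation and reports the best subset; the synthetic data, the ridge regression model, the exhaustive two-feature subsets, and the RMSE-only criterion are illustrative assumptions rather than the configuration of this disclosure.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic example: 40 samples from 20 individuals (2 samples each),
# 8 candidate features, of which only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.2, size=40)
groups = np.repeat(np.arange(20), 2)           # individual to which each sample belongs

logo = LeaveOneGroupOut()                       # leave one individual out per split
scores = {}
for subset in combinations(range(8), 2):        # all 2-feature verification subsets
    rmse = -cross_val_score(Ridge(alpha=1.0), X[:, list(subset)], y,
                            groups=groups, cv=logo,
                            scoring="neg_root_mean_squared_error").mean()
    scores[subset] = rmse

best = min(scores, key=scores.get)              # subset with the lowest cross-validated RMSE
print("target data feature group:", best, "CV RMSE:", round(scores[best], 3))
```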
  • Figure 6 is a schematic structural diagram of a feature screening device provided by an embodiment of the present invention. As shown in Figure 6, the device includes:
  • the feature verification subset determination module 410 is used to determine multiple feature verification subsets based on the data characteristics in the sample data;
  • the data set division module 420 is used to divide the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and to perform cross-validation division based on multiple individual sample groups to determine the division. training data set and validation data set;
  • the model training module 430 is used to train the machine learning model of the processing target based on the training data set and verification data set corresponding to each feature verification subset;
  • the target data feature group determination module 440 is configured to determine the corresponding target data feature group of the processing target based on the training process data of each machine learning model.
  • The technical solution of this embodiment verifies multiple data features in the sample data in the form of feature verification subsets based on the machine learning model, to achieve wrapper-style screening of data features and obtain the target data feature group used for processing target prediction.
  • Further, the sample data used to train the machine learning model is divided by individual, with sample data of the same individual divided into the same individual sample group, and cross-validation division is performed based on the individual sample groups, so that sample data of the same individual is not divided into the training data set and the verification data set at the same time, thereby avoiding the impact of individual sample data on the performance of the machine learning model and further improving the accuracy of feature screening.
  • the device further includes:
  • a candidate data feature screening module configured to determine the correlation between each data feature in the sample data and the processing target before determining a plurality of feature verification subsets based on the data features in the sample data, and based on the processing target Relevance screening of candidate data features;
  • the feature verification subset determination module 410 is configured to determine multiple feature verification subsets among the candidate data features.
  • the feature verification subset determination module 410 is configured to determine multiple feature verification subsets among the data features or candidate data features in the sample data based on the number of features in the feature verification subset.
  • the data set division module 420 is used to: divide at least one group of sample data of the same individual into the same individual group, to obtain individual sample groups corresponding to different individuals; and perform cross-validation division on multiple individual sample groups based on at least one preset cross-validation rule, to determine the divided training data set and validation data set.
  • the target data feature group determination module 440 is used to: for any machine learning model, determine training indicators and test indicators respectively based on the training data and verification data in the training process data of the machine learning model; sort and filter each machine learning model based on the training indicators and test indicators of each machine learning model; and determine the feature verification subset corresponding to the filtered machine learning model as the target data feature group of the processing target.
  • the training index and the testing index include root mean square error and goodness of fit respectively.
  • the device further includes:
  • a data distribution diagram drawing module used for drawing a data distribution diagram of any target data characteristic based on the sample data corresponding to the target data characteristic
  • a feature verification module configured to verify the target data features based on the data distribution diagram of the target data features.
  • data distribution chart drawing module includes:
  • a data type determination unit used to determine the data type of the target data feature
  • a data distribution diagram drawing unit is configured to draw a data distribution diagram corresponding to the data type based on the sample data corresponding to the target data characteristics.
  • the data type determination unit is used for:
  • in the case where the deduplicated data values are all integers and the number of data values is less than or equal to the preset threshold, it is determined that the data type of the target data feature is categorical; and in the case where the deduplicated data values are not all integers or the number of data values is greater than the preset threshold, it is determined that the data type of the target data feature is numerical.
  • the data distribution chart drawing unit is used for:
  • the data type of the target data feature is a classification type, draw a horizontal bar chart of the target data feature and a box plot of the target data feature and the processing target based on the sample data corresponding to the target data feature;
  • the data type of the target data feature is numeric, draw a histogram of the target data feature and a scatter regression plot of the target data feature and the processing target based on the sample data corresponding to the target data feature.
  • the feature verification module is used for:
  • in the case where the data distribution diagram of the target data feature does not satisfy the distribution rules, the target data feature is eliminated, or the target data feature group in which the target data feature is located is eliminated.
  • the feature screening device provided by the embodiment of the present invention can execute the feature screening method provided by any embodiment of the present invention, and has corresponding functional modules and beneficial effects for executing the method.
  • FIG. 7 is a schematic structural diagram of an electronic device provided in Embodiment 6 of the present invention.
  • Electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the invention described and/or claimed herein.
  • the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, wherein the memory stores a computer program that can be executed by the at least one processor.
  • The processor 11 can perform various appropriate actions and processing according to the computer program stored in the read-only memory (ROM) 12 or loaded from the storage unit 18 into the random access memory (RAM) 13.
  • In the RAM 13, various programs and data required for the operation of the electronic device 10 can also be stored.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to bus 14 .
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a magnetic disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processing processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs various methods and processes described above, such as feature screening methods.
  • the feature filtering method may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18 .
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform the feature filtering method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the feature screening methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • a computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • Embodiment 7 of the present invention provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions.
  • the computer instructions are used to cause the processor to execute a feature screening method.
  • the method includes:
  • the sample data is divided into individual groups to obtain individual sample groups corresponding to different individuals, and cross-validation is performed based on multiple individual sample groups to determine the training data set and verification data set obtained by the division. ;
  • the corresponding target data feature group of the processing target is determined based on the training process data of each machine learning model.
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system and solves the defects of difficult management and weak business scalability in traditional physical host and VPS services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A feature screening method and apparatus, a storage medium, and an electronic device. The method includes: determining multiple feature verification subsets based on data features in sample data; dividing the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and performing cross-validation division based on the multiple individual sample groups to determine a training data set and a verification data set obtained by the division; training a machine learning model for a processing target based on the training data set and verification data set corresponding to each of the feature verification subsets; and determining a target data feature group corresponding to the processing target based on training process data of each machine learning model. In this embodiment, performing cross-validation division on individual sample groups prevents sample data of the same individual from being divided into the training data set and the verification data set at the same time, thereby avoiding the impact of individual sample data on the performance of the machine learning model and further improving the accuracy of feature screening.

Description

Feature screening method and apparatus, storage medium, and electronic device
This application claims priority to the Chinese patent application No. 202210624370.7, filed with the China National Intellectual Property Administration on June 2, 2022 and entitled "Feature screening method and apparatus, storage medium, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of data processing technology, and in particular to a feature screening method and apparatus, a storage medium, and an electronic device.
Background
Mass spectrometry technology is currently booming and is widely used in testing projects in many clinical fields, including endocrinology, cardiovascular disease, oncology, drug therapy, and so on. Mass spectrometry technology is an indispensable tool for achieving precise diagnosis and precision medicine. Based on mass spectrometry technology, various omics big data of clinical samples, such as proteomics, metabolomics, and lipidomics, can be obtained. Accordingly, how to conduct reasonable and effective analysis of the multi-omics data brought by mass spectrometry technology is one of the key points and hot spots of research.
In the process of implementing the present invention, it was found that at least the following technical problems exist in the prior art: there are too many data features, making it difficult to determine effective markers from the massive data features; at the same time, one individual may produce multiple pieces of sample data, and differences between individuals lead to a certain bias in the screening of data features.
Summary of the Invention
The present invention provides a feature screening method and apparatus, a storage medium, and an electronic device, to improve the accuracy of feature screening.
According to an aspect of the present invention, a feature screening method is provided, including:
determining multiple feature verification subsets based on data features in sample data;
dividing the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and performing cross-validation division based on the multiple individual sample groups to determine a training data set and a verification data set obtained by the division;
training a machine learning model for a processing target based on the training data set and verification data set corresponding to each of the feature verification subsets; and
determining a target data feature group corresponding to the processing target based on training process data of each machine learning model.
According to another aspect of the present invention, a feature screening apparatus is provided, including:
a feature verification subset determination module, configured to determine multiple feature verification subsets based on data features in sample data;
a data set division module, configured to divide the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and to perform cross-validation division based on the multiple individual sample groups to determine a training data set and a verification data set obtained by the division;
a model training module, configured to train a machine learning model for a processing target based on the training data set and verification data set corresponding to each of the feature verification subsets; and
a target data feature group determination module, configured to determine a target data feature group corresponding to the processing target based on training process data of each machine learning model.
According to another aspect of the present invention, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the feature screening method described in any embodiment of the present invention.
According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions, the computer instructions being used to cause a processor, when executed, to implement the feature screening method described in any embodiment of the present invention.
In the technical solution provided by this embodiment, multiple data features in the sample data are verified in the form of feature verification subsets based on a machine learning model, so as to achieve wrapper-style screening of data features and obtain the target data feature group used for processing target prediction. Further, the sample data used to train the machine learning model is divided by individual, with sample data of the same individual divided into the same individual sample group, and cross-validation division is performed based on the individual sample groups, so that sample data of the same individual is not divided into the training data set and the verification data set at the same time, thereby avoiding the impact of individual sample data on the performance of the machine learning model and further improving the accuracy of feature screening.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present invention, nor is it intended to limit the scope of the present invention. Other features of the present invention will become easy to understand through the following description.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those of ordinary skill in the art based on these drawings without creative effort.
FIG. 1 is a flow chart of a feature screening method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a feature screening method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a feature screening method provided by an embodiment of the present invention;
FIG. 4 is an example of a data distribution diagram provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a feature screening method provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a feature screening apparatus provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification and claims of the present invention and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to such process, method, product, or device.
Embodiment 1
FIG. 1 is a flow chart of a feature screening method provided by an embodiment of the present invention. This embodiment is applicable to the situation of screening, from a large number of data features, the data features used for predicting a processing target. The method can be executed by a feature screening apparatus, which can be implemented in the form of hardware and/or software and can be configured in an electronic device such as a computer or a server. As shown in FIG. 1, the method includes:
S110: determining multiple feature verification subsets based on data features in sample data.
S120: dividing the sample data into individual groups based on the individuals to which the sample data belongs, to obtain individual sample groups corresponding to different individuals, and performing cross-validation division based on the multiple individual sample groups to determine a training data set and a verification data set obtained by the division.
S130: training a machine learning model for a processing target based on the training data set and verification data set corresponding to each of the feature verification subsets.
S140: determining a target data feature group corresponding to the processing target based on training process data of each machine learning model.
In this embodiment, a large amount of sample data is obtained; each group of sample data may include multiple types of data features, different groups of sample data may include the same types of data features, and the data values of each data feature may differ. Optionally, the sample data may include omics data and/or clinical data. For example, the omics data may be obtained through mass spectrometry technology and includes but is not limited to proteomics, metabolomics, and lipidomics; the clinical data may be collected through data collection equipment or may be historically collected data, and includes but is not limited to blood pressure, heart rate, respiratory rate, etc. Each data feature in the sample data may be recorded in the following manner:
$\{(x_i, y_i)\},\ i = 1, \ldots, N$
where x_i represents the i-th sample feature vector, and N indicates that there are i = 1, ..., N sample feature vectors. The dimensions of x_i are denoted j = 1, ..., D, each dimension x_j represents the j-th feature, and there are D features in total. y_i represents the label of x_i; y_i takes real values, and the feature label y is numerical.
There are many types of data features in the sample data, and only some of the data features affect the processing target; that is, the target data features corresponding to the processing target are only a subset of the data features in the sample data, and the target data features corresponding to different processing targets can be different. It should be noted that the processing target can be classification prediction of the input data in any dimension, for example, prediction of hormone concentration at different time points or prediction of the pathological grade of a certain disease. It should also be noted that, in the above sample data, y_i represents the label of x_i in the dimension of the processing target.
In some embodiments, a data feature set is obtained from the data features of each group of sample data, and multiple feature validation subsets are determined randomly within the data feature set. The number of data features in a feature validation subset is random; it may be greater than 1 and smaller than the total number of data features, i.e., a feature validation subset may contain some or all of the data features. Based on the determined number of data features, the corresponding number of data features are randomly selected from the data feature set to form a feature validation subset.
Optionally, determining multiple feature validation subsets based on the data features in the sample data includes: determining multiple feature validation subsets from the data features in the sample data based on the number of features in a feature validation subset. Optionally, the number of features in a feature validation subset may be preset, for example 8, 10, or 15, and may be set according to user requirements. Optionally, the number of features in a feature validation subset may also be determined according to the amount of sample data. The maximum number of features in a feature validation subset is the ratio of the number of samples to a preset value, which may be 15. It should be noted that the preset value is not limited and may be set according to user requirements. Accordingly, the number of features d in a feature validation subset lies in
$1 < d \le \frac{\text{the number of samples}}{15}$
where "the number of samples" is the number of samples. Based on the number of features d in a feature validation subset and the total number D of data features in the sample data, the number of feature validation subsets can be determined, for example as
$C(D, d)$
Wrapper-style feature selection is performed on the multiple data features in the feature validation subsets to screen out the target data features for the processing target, i.e., the combination of important markers for predicting the label y. Specifically, each feature validation subset is validated with a machine learning model, and the accuracy of the feature validation subset is verified in reverse through the training result of the machine learning model, so as to obtain the target data feature group corresponding to the processing target.
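For illustration only, the subset enumeration described above could be sketched in Python roughly as follows; the function name make_feature_subsets, the lower bound of 2 features, and the cap of one feature per 15 samples are assumptions taken from this description rather than a prescribed implementation, and exhaustively enumerating C(D, d) combinations is only feasible for a modest D.

```python
from itertools import combinations

def make_feature_subsets(feature_names, n_samples, preset_value=15):
    """Enumerate candidate feature-validation subsets.

    The maximum subset size is capped at n_samples / preset_value,
    following the preset value of 15 mentioned above.
    """
    d_max = max(2, n_samples // preset_value)
    subsets = []
    for d in range(2, d_max + 1):
        # all C(D, d) combinations of size d drawn from the D features
        subsets.extend(combinations(feature_names, d))
    return subsets

# Example: 60 samples and 6 candidate features -> subsets of 2 to 4 features
subsets = make_feature_subsets(["f1", "f2", "f3", "f4", "f5", "f6"], n_samples=60)
```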
On the basis of the above embodiment, before the machine learning models are trained on the sample data, the sample data are split by cross-validation to obtain the training data set and the validation data set. The cross-validation split is performed in combination with the correspondence between sample data and individuals, so that sample data from the same individual are not assigned to the training data set and the validation data set at the same time, which would otherwise distort the true performance of the machine learning model. For example, N sample data may come from M individuals, with M ≤ N; the number of individuals M may equal the number of samples N, or the sample data may be samples of the M individuals taken at different stages. If M = N, individuals S_m and samples x_i are in a unique one-to-one correspondence, i.e., each individual corresponds to exactly one sample, m and i denote the same sample, S_m = S_i, and the data are recorded as
$S_m = S_i = \{x_i\}$
If M < N, S_m and x_i are in a one-to-many relationship; in this type of data, one individual corresponds to a set of multiple samples, for example the m-th individual S_m = {x_{i=1}, x_{i=2}, …, x_{i=l}}, meaning that l samples x come from the same individual.
Optionally, dividing the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, performing cross-validation splitting based on the multiple individual sample groups, and determining the resulting training data set and validation data set includes: assigning at least one group of sample data of the same individual to the same individual group to obtain individual sample groups corresponding to different individuals; and performing cross-validation splitting on the multiple individual sample groups based on at least one preset cross-validation rule to determine the resulting training data set and validation data set.
Each sample data item may carry identification information of the individual to which it belongs, and the sample data can be divided based on this identification information; that is, sample data carrying the same identification information are assigned to the same individual group to obtain the individual sample groups. For m = 1, …, M, the sample set corresponding to each individual is found in turn and uniformly denoted S_m = {x_{i=1}, x_{i=2}, …, x_{i=l}}. Each individual sample group is then used as the unit data group of the cross-validation split, so as to obtain the training data set and the validation data set.
In this embodiment, the implementation of the cross-validation split is not limited; any implementation that takes the individual sample groups as the unit of the cross-validation split may be used. For example, repeated K-fold cross-validation (Repeated K-fold), leave-one-out cross-validation (LeaveOneOut), and leave-P-out cross-validation (LeavePOut) may be used, and the individual sample groups may be split by any of these cross-validation methods. For repeated K-fold cross-validation, the configuration parameter k is an integer greater than or equal to 2, and the number of repetitions Repeated is an integer greater than or equal to 1; for example, by default K = 10 and Repeated = 10, and when the number of individual data groups M < 10, by default K = 3 and Repeated = 5. Leave-one-out cross-validation requires no parameters, with default parameters K = M and Repeated = 1. For leave-P-out cross-validation, P ranges over 1 ≤ P ≤ M; the only parameter is P, meaning that P individual data groups are used as the test data set and the remaining M − P individuals as the training data set.
It should be noted that when the number of individual data groups is smaller than a quantity threshold, for example 100, leave-one-out or leave-P-out cross-validation may be used to split the individual sample groups, so as to ensure the stability of the results of machine learning model training.
For example, for repeated K-fold cross-validation or leave-one-out cross-validation, the cross-validation split of the sample data is as follows: the M individual sample groups are randomly partitioned into K folds, denoted {F_1, F_2, …, F_K}; for each fold, that fold is used as the test data set and the remaining K − 1 folds as the training data set. This is denoted CV(r = 1) and repeated Repeated times, and the split results are recorded as
$\mathrm{CV}(r) = \{(\mathrm{trainset}(r, k), \mathrm{testset}(r, k)) \mid k = 1, \ldots, K\}, \quad r = 1, \ldots, \mathrm{Repeated}$
where
$\mathrm{testset}(r, k) = F_k$
and
$\mathrm{trainset}(r, k) = \{F_1, F_2, \ldots, F_K\} \setminus F_k$
where testset(*) is the test data set and trainset(*) is the training data set.
For leave-P-out cross-validation, the cross-validation split of the sample data is as follows: P individual data groups are selected from the M individual data groups, the number of combinations being
$C(M, P) = \frac{M!}{P!\,(M-P)!}$
For each of the C(M, P) combinations, the P individual data groups are used as the test data set {S_1, …, S_P} and the remaining M − P individuals as the training data set {S_{P+1}, …, S_M}. The split results are recorded as
$\mathrm{testset}(c) = \{S_1, \ldots, S_P\}, \quad \mathrm{trainset}(c) = \{S_{P+1}, \ldots, S_M\}, \quad c = 1, \ldots, C(M, P)$
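As a rough sketch of the individual-level splits described above, scikit-learn's group-aware splitters can be used as stand-ins for the K-fold, leave-one-out, and leave-P-out schemes; the helper name split_by_individual is an assumption, and the repetition of K-fold over individuals is not shown here (it would require reshuffling the groups between repetitions).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

def split_by_individual(X, y, groups, method="group_kfold", k=10, p=1):
    """Yield (train_idx, test_idx) pairs in which samples of the same
    individual never appear in both the training and the validation set."""
    if method == "group_kfold":        # analogue of the K-fold scheme above
        splitter = GroupKFold(n_splits=k)
    elif method == "leave_one_out":    # one individual held out per split
        splitter = LeaveOneGroupOut()
    else:                              # leave-P-out over individuals
        splitter = LeavePGroupsOut(n_groups=p)
    yield from splitter.split(X, y, groups=groups)

# groups[i] identifies the individual that sample i came from
X = np.random.rand(12, 5)
y = np.random.rand(12)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
for train_idx, test_idx in split_by_individual(X, y, groups, k=3):
    # no individual appears on both sides of any split
    assert not set(groups[train_idx]) & set(groups[test_idx])
```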
On the basis of the above embodiment, the machine learning models are trained for the processing target using the training data set and validation data set obtained from the cross-validation split. Since different feature validation subsets contain different types and numbers of data features, for each feature validation subset the corresponding data items are selected from each training data set and each validation data set according to the feature types in that subset, forming the training data set and validation data set corresponding to the subset; the machine learning model is then trained on these, yielding, for each feature validation subset, a machine learning model for the processing target.
In this embodiment, the machine learning model may be a regression model; for example, the machine learning models include, but are not limited to, simple linear regression, ridge regression, lasso regression, elastic net regression, Bayesian regression, k-nearest-neighbor regression, support vector machine regression, and random forest regression models. For each feature validation subset, one or more of the above regression models may be used for model training, and grid search may be used during training to optimize the model parameters.
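A minimal sketch of this training step is given below; the candidate model families, the parameter grids, and the RMSE-based scoring are illustrative assumptions rather than values fixed by this description, and cv_splits can be a list of the individual-level (train, test) index pairs produced by the splitter above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# Illustrative candidates and grids; the actual choices are a design decision.
CANDIDATES = {
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1, 1.0]}),
    "elastic_net": (ElasticNet(max_iter=10000),
                    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    "random_forest": (RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [None, 5]}),
}

def train_on_subset(X_subset, y, cv_splits):
    """Grid-search each candidate regressor on one feature-validation subset."""
    fitted = {}
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, cv=cv_splits,
                              scoring="neg_root_mean_squared_error")
        search.fit(X_subset, y)
        fitted[name] = search.best_estimator_
    return fitted
```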
In the process of training the machine learning model for the processing target on each feature validation subset, multiple machine learning models are obtained using the same training scheme, where the same training scheme includes, but is not limited to, the same number of samples, the same loss function, the same learning rate, the same number of iterations, and so on. For a trained machine learning model, optionally, the training result may include, but is not limited to, a first parameter characterizing training completeness, a second parameter characterizing model accuracy, and the like. Optionally, the training result may include, but is not limited to, prediction evaluation information of the model. The optimal machine learning model is screened out by one or more of the above parameters or by the prediction evaluation information, and accordingly the feature validation subset corresponding to the optimal machine learning model may be determined as the target data feature group. Optionally, the training result of a machine learning model may be its prediction results on the sample data; evaluation parameters of the machine learning model, such as the prediction error, can be determined from the labels in the sample data and the prediction results, and the machine learning models can be ranked, or the optimal one selected, according to these evaluation parameters so as to determine the target data feature group corresponding to the processing target.
In some embodiments, determining the target data feature group corresponding to the processing target based on the training process data of each machine learning model includes: for any machine learning model, determining training indicators and test indicators respectively based on the training data and the validation data in the training process data of that machine learning model; ranking and screening the machine learning models based on their training indicators and test indicators; and determining the feature validation subset corresponding to the screened-out machine learning model as the target data feature group of the processing target.
The training data are the prediction results obtained by the machine learning model on the sample data in the training data set, and the validation data are the prediction results obtained by the machine learning model on the sample data in the validation data set. The training indicators and the test indicators each include at least one indicator type, and the indicator types of the training indicators and the test indicators are the same; for example, the training indicators and the test indicators each include the root-mean-square error RMSE and the goodness of fit R^2.
For example, the root-mean-square error RMSE can be calculated by the following formula:
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
The goodness of fit R^2 can be calculated by the following formula:
$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$
where ŷ_i is the predicted value, y_i is the true value, i.e., the label value in the sample data, and ȳ is the mean of the true values.
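A direct Python transcription of these two indicators, assuming the predictions and labels are plain numeric arrays, might look like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-square error as defined above
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    # Goodness of fit: R^2 = 1 - SS_res / SS_tot
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Training indicators use predictions on the training set;
# test indicators use predictions on the validation set.
```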
Based on the correspondence among the machine learning models trained for the processing target, their training indicators and test indicators, and the feature validation subsets corresponding to the models, the target data feature group corresponding to the processing target is determined from the training indicators and test indicators; that is, the feature validation subset corresponding to a machine learning model whose training and test indicators satisfy the screening condition is determined as the target data feature group of the processing target.
Optionally, an evaluation parameter is determined from the RMSE in the training indicators and the RMSE in the test indicators, namely the absolute value of the difference between the two. This evaluation parameter is negatively correlated with the performance stability of the machine learning model: the smaller it is, the closer the performance of the model on the training data set and on the test data set, i.e., the more stable the model. In some embodiments, the feature validation subset corresponding to a machine learning model whose evaluation parameter is smaller than a first preset value may be determined as the target data feature group of the processing target.
The goodness of fit R^2 in the test indicators is positively correlated with model performance: the larger the test R^2, the better the model. In some embodiments, the feature validation subset corresponding to a machine learning model whose test R^2 is greater than a second preset value may be determined as the target data feature group of the processing target.
In some embodiments, model performance may be evaluated jointly from the evaluation parameter and the test R^2; for example, the evaluation parameter and the test R^2 are weighted by their respective weights to obtain a performance evaluation value of the machine learning model, the machine learning models are ranked by this value, and the feature validation subset corresponding to a machine learning model whose performance evaluation value meets the performance requirement is determined as the target data feature group of the processing target.
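One way the joint ranking described above could be sketched is shown below; the equal weights, the composite score, and the dictionary layout of results are illustrative assumptions, not values prescribed by this description.

```python
def rank_models(results, rmse_gap_weight=0.5, r2_weight=0.5):
    """Rank trained models by stability and goodness of fit.

    `results` maps a model identifier to a dict with the keys
    "rmse_train", "rmse_test" and "r2_test".
    """
    scored = []
    for model_id, m in results.items():
        gap = abs(m["rmse_train"] - m["rmse_test"])   # smaller -> more stable
        score = r2_weight * m["r2_test"] - rmse_gap_weight * gap
        scored.append((score, model_id))
    # Higher composite score first; the top models point to the
    # feature-validation subsets kept as target data feature groups.
    return [model_id for score, model_id in sorted(scored, reverse=True)]
```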
In the technical solution provided by this embodiment, the multiple data features in the sample data are validated, in the form of feature validation subsets, by means of machine learning models, so that wrapper-style screening of the data features is achieved and a target data feature group for predicting the processing target is obtained. Further, the sample data used for training the machine learning models are partitioned by individual, so that sample data of the same individual are placed in the same individual sample group, and cross-validation splitting is performed on the individual sample groups. This prevents sample data of the same individual from being assigned to the training data set and the validation data set at the same time, removes the influence of individual-specific sample data on model performance, and further improves the accuracy of feature screening.
Embodiment 2
FIG. 2 is a flowchart of a feature screening method provided by an embodiment of the present invention, optimized on the basis of the above embodiment. Optionally, before determining the multiple feature validation subsets based on the data features in the sample data, the method further includes: determining the correlation between each data feature in the sample data and the processing target, and screening candidate data features based on the correlation with the processing target; correspondingly, determining multiple feature validation subsets based on the data features in the sample data includes: determining multiple feature validation subsets from the candidate data features.
As shown in FIG. 2, the method includes:
S210. Determine the correlation between each data feature in the sample data and the processing target, screen candidate data features based on the correlation with the processing target, and determine multiple feature validation subsets from the candidate data features.
S220. Divide the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, perform cross-validation splitting based on the multiple individual sample groups, and determine the resulting training data set and validation data set.
S230. Train machine learning models for the processing target based on the training data set and validation data set corresponding to each feature validation subset.
S240. Determine the target data feature group corresponding to the processing target based on the training process data of each machine learning model.
Because the number of data features in the sample data is large, the number of feature validation subsets is correspondingly large, and the screening of the target data feature group is computationally expensive and time-consuming. In this embodiment, before the multiple feature validation subsets are determined, the data features in the sample data are preliminarily screened: data features that are unrelated or only weakly related to the processing target are removed, yielding candidate data features that are related or strongly related to the processing target. Compared with all the data features in the sample data, the number of candidate data features is reduced, which reduces the amount of computation in the screening process and improves screening efficiency.
In this embodiment, the candidate data features are determined based on the correlation between each data feature and the processing target. The correlation between a data feature and the processing target can be characterized by a numerical value, and whether the data feature is related to the processing target, and how strongly, is determined by comparing this value with a threshold.
The correlation between a data feature and the processing target may be determined by at least one correlation determination rule, so that the correlation is computed from different perspectives and the accuracy of the determined candidate data features is improved. In some embodiments, the data features in the sample data may be screened multiple times based on correlations determined by multiple correlation determination rules. For example, a first correlation between each data feature and the processing target is determined with a first correlation determination rule, and data features unrelated or weakly related to the processing target are removed based on the first correlations of the data features, yielding first candidate data features. For the first candidate data features, a second correlation with the processing target is determined with a second correlation determination rule, and data features unrelated or weakly related to the processing target are removed based on the second correlations, yielding second candidate data features, and so on, until the final candidate data features are obtained.
Optionally, the correlation determination rules include, but are not limited to, the univariate linear regression method, the mutual information (Mutual Information) method, and the lasso regression method. In some embodiments, the correlations of the data features in the sample data are computed successively with the above correlation determination rules, and the candidate data features are screened accordingly.
For the univariate linear regression method, a linear equation between the data feature and the processing target, wx + b, can be constructed, where w is the slope and b is the intercept; the absolute value of the slope is positively correlated with the correlation, and a first correlation P value between the data feature and the processing target can be computed from the slope. A small first correlation value indicates a strong correlation between the data feature and the processing target, and a large first correlation value indicates a weak correlation. Accordingly, if the first correlation of a data feature is smaller than a preset correlation threshold, the data feature is set as a candidate data feature; that is, data features whose first correlation P value is greater than or equal to the preset correlation threshold are removed, where the preset correlation threshold may be 0.1 or 0.05 and may be determined according to the required screening precision.
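A small sketch of this univariate filter, assuming scipy.stats.linregress as the significance test and numeric feature columns (the helper name and the default threshold are assumptions):

```python
import numpy as np
from scipy.stats import linregress

def filter_by_pvalue(X, y, feature_names, p_threshold=0.05):
    """Keep features whose univariate regression on the target has a
    p-value below the preset correlation threshold (0.05 or 0.1 above)."""
    X = np.asarray(X, dtype=float)
    kept = []
    for j, name in enumerate(feature_names):
        result = linregress(X[:, j], y)   # slope, intercept, r, p-value, ...
        if result.pvalue < p_threshold:
            kept.append(name)
    return kept
```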
For the mutual information method, the second correlation MI between a data feature and the label y (i.e., the processing target) can be computed by the following formula:
$\mathrm{MI}(x^{(j)}, y) = \sum_{x^{(j)}}\sum_{y} P(x^{(j)}, y)\,\log\frac{P(x^{(j)}, y)}{P(x^{(j)})\,P(y)}$
where P denotes a probability value. If MI(x^{(j)}, y) = 0 for the two variables, the j-th data feature has no correlation with the processing target y. Accordingly, if the second correlation of a data feature is not zero, the data feature is set as a candidate data feature; that is, data features whose second correlation MI value is zero are removed.
In some embodiments, data features whose first correlation P value is greater than or equal to the preset correlation threshold and whose second correlation MI value is zero may be removed from all the data features in the sample data to obtain the first candidate data features, which may be denoted D_filter1.
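A corresponding mutual-information filter could be sketched as follows; note that scikit-learn's mutual_info_regression uses a nearest-neighbour estimator rather than the discrete sum above, so treating scores at (or very near) zero as "no correlation" is an approximation.

```python
from sklearn.feature_selection import mutual_info_regression

def filter_by_mutual_information(X, y, feature_names, eps=1e-12):
    """Drop features whose estimated mutual information with the target is zero."""
    mi = mutual_info_regression(X, y, random_state=0)
    return [name for name, score in zip(feature_names, mi) if score > eps]
```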
On the basis of the above embodiment, the candidate data features D_filter1 obtained above are further screened with the lasso regression method. A regression model of the screened candidate data features D_filter1 is constructed, for example
$y = \sum_{j=1}^{D_{filter1}} \beta_j x^{(j)}$
$\hat{\beta} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \sum_{j=1}^{D_{filter1}} \beta_j x_i^{(j)}\Big)^2 + \lambda \sum_{j=1}^{D_{filter1}} |\beta_j|\right\}$
where λ is the penalty factor and β_j is the coefficient value; if the j-th feature x^{(j)} in the model has no correlation with y, then β_j = 0. The important features with β_j ≠ 0 are screened out from the D_filter1 data features, i.e., the data features with β_j = 0 are removed, and the final candidate data features are obtained, denoted j = 1, …, D_filter2, i.e., D_filter2 data features in total.
Optionally, the screened candidate data features may be ranked by the absolute value of β_j, where a larger absolute value of β_j indicates a stronger correlation between the data feature and the processing target.
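The lasso step could be sketched as below; the fixed penalty alpha stands in for the factor λ above and would normally be tuned (e.g. with LassoCV), so its value here is only an illustrative assumption.

```python
from sklearn.linear_model import Lasso

def filter_by_lasso(X, y, feature_names, alpha=0.01):
    """Keep features with non-zero lasso coefficients, sorted by |beta|."""
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    kept = [(name, abs(b)) for name, b in zip(feature_names, model.coef_) if b != 0.0]
    kept.sort(key=lambda item: item[1], reverse=True)   # larger |beta| first
    return [name for name, _ in kept]
```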
Multiple feature validation subsets are determined from the candidate data features, so that the target data feature group of the processing target is further screened by machine learning.
In the technical solution of this embodiment, all the data features are preliminarily screened according to the correlation between each data feature and the processing target, and features that are unrelated or weakly related are removed, which reduces the number of features to be screened in the machine learning stage. Further screening is then performed on the candidate data features to obtain the target data feature group corresponding to the processing target. This reduces the number of data features so that the candidate data features can be screened in a targeted manner, reduces the interference of invalid data features, and lowers the computational and time cost of screening.
Embodiment 3
FIG. 3 is a flowchart of a feature screening method provided by an embodiment of the present invention, optimized on the basis of the above embodiments. Optionally, after the target data feature group is determined, the method further includes: for any target data feature, drawing a data distribution chart of the target data feature based on the sample data corresponding to that feature; and verifying the target data feature based on its data distribution chart. As shown in FIG. 3, the method includes:
S310. Determine multiple feature validation subsets based on data features in sample data.
S320. Divide the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, perform cross-validation splitting based on the multiple individual sample groups, and determine the resulting training data set and validation data set.
S330. Train machine learning models for the processing target based on the training data set and validation data set corresponding to each feature validation subset.
S340. Determine the target data feature group corresponding to the processing target based on the training process data of each machine learning model.
S350. For any target data feature, draw a data distribution chart of the target data feature based on the sample data corresponding to that feature.
S360. Verify the target data feature based on its data distribution chart.
In the analysis of massive numbers of data features, the distribution of an individual data feature in the data set is easily overlooked. Although the screened-out target data feature group can be used to build a model with good performance, if the distribution of a data feature in the data does not match its clinical presentation, bias is easily introduced in subsequent research on or application of the model, affecting model performance.
To prevent the data features in the screened-out target data feature group from having this problem, the data distribution of each screened-out target data feature is verified, so as to ensure that the target data features used to predict the processing target meet the data distribution requirements. In this embodiment, a data distribution chart is drawn for each target data feature, and it is judged whether the chart meets the distribution requirements.
Optionally, drawing the data distribution chart of the target data feature based on the sample data corresponding to the target data feature includes: determining the data type of the target data feature; and drawing, based on the sample data corresponding to the target data feature, a data distribution chart of the type corresponding to that data type.
The data type of a target data feature may be categorical or numerical, and different data types correspond to different types of data distribution charts. For a categorical data feature, the data content of different objects is limited and belongs to a fixed range of data content. For a numerical data feature, the data of different objects are not fixed and may be any value within a data range, not restricted to positive numbers. For example, for a categorical target data feature, the data content of any object is one of {1, 0}, i.e., 0 or 1, and no other data form exists. For a numerical target data feature, the data content of any object may lie in (0, 1); accordingly, the data content of different objects may be any value greater than 0 and smaller than 1, for example 0.5, 0.33, 0.96, 0.5689, and so on.
In this embodiment, the data type of a target data feature is determined from the numeric type and the number of data values of its data content, where the numeric type may be integer or decimal; for example, the numeric type corresponding to a categorical target data feature may be integer, while the numeric type corresponding to a numerical target data feature may include integer and decimal. The number of data values may be the number of non-duplicate data values: the number of data values of a categorical target data feature is limited and relatively small, for example smaller than a quantity threshold, while the number of data values of a numerical target data feature is relatively large, or greater than the quantity threshold.
Determining the data type of the target data feature includes: de-duplicating the data values of the target data feature to obtain de-duplicated data values; and determining that the data type of the target data feature is categorical when the de-duplicated data values are all integers and the number of data values is smaller than or equal to a preset threshold, and determining that the data type of the target data feature is numerical when the de-duplicated data values are not all integers or the number of data values is greater than or equal to the preset threshold.
The data values of the target data feature in the sample data are de-duplicated to remove repeated values and obtain unique data values, yielding a unique-value set for the target data feature, which may be denoted
$s_1 = \mathrm{unique}\big(x^{(1)}\big) = \{v_1, v_2, \ldots, v_n\}$
The number of data values in this set and the numeric type of each value are counted. If the data values in the set are all integers and the number of data values is smaller than or equal to the preset threshold, the data type of the target data feature is determined to be categorical; correspondingly, if the data values in the set are not integers, or the number of data values is greater than the preset threshold, the data type is determined to be numerical. The preset threshold may be 5, which is not limited here and may be set as required. For example, if every element of s_1 is an integer and n ≤ 5, x^{(1)} is recorded as categorical data 0_1, otherwise as numerical data 1_1, and the judgment result is stored in the vector s = (a_1), where a is 0 or 1, a = 0 representing categorical and a = 1 representing numerical. The data types of the other target data features are determined through the same judgment process, yielding the data type vector s = (a_1, a_2, …, a_d) of the target data features in the initial clinical data, with a being 0 or 1. Further, the above judgment process may be executed for different target data features in parallel to improve the efficiency of data type determination.
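The type judgment described above can be transcribed roughly as follows, assuming the feature values are numeric and using 5 as the preset threshold:

```python
import numpy as np

def infer_feature_type(values, max_categories=5):
    """Return "categorical" or "numerical" for one target data feature."""
    unique_values = np.unique(np.asarray(values, dtype=float))
    all_integers = bool(np.all(np.mod(unique_values, 1) == 0))
    if all_integers and unique_values.size <= max_categories:
        return "categorical"   # e.g. {0, 1} labels
    return "numerical"         # e.g. continuous values in (0, 1)

assert infer_feature_type([0, 1, 1, 0]) == "categorical"
assert infer_feature_type([0.5, 0.33, 0.96]) == "numerical"
```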
The type of the data distribution chart of the target data feature is determined according to its data type. Further, drawing, based on the sample data corresponding to the target data feature, a data distribution chart of the type corresponding to that data type includes: if the data type of the target data feature is categorical, drawing a horizontal bar chart of the target data feature based on the corresponding sample data, together with a box plot of the target data feature against the processing target; and if the data type of the target data feature is numerical, drawing a histogram of the target data feature based on the corresponding sample data, together with a scatter regression plot of the target data feature against the processing target.
For any target data feature, the data values of that feature in the sample data are obtained, and the data distribution chart of the feature is drawn from these values according to the chart type corresponding to the feature. For example, see FIG. 4, which is an example of a data distribution chart provided by an embodiment of the present invention; the left panel of FIG. 4 is a bar chart and the right panel is a scatter regression plot.
Drawing the data distribution chart of the target data feature itself serves to verify whether the target data feature itself meets the data distribution requirements, and drawing the data distribution chart of the correspondence between the target data feature and the processing target serves to verify whether that correspondence meets the data distribution requirements.
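A plotting sketch of the four chart types named above, using matplotlib and assuming that values and target are NumPy arrays (the straight regression line is fitted with a simple least-squares polyfit, which is an illustrative choice):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_feature_distribution(values, target, feature_type, name="feature"):
    """Draw the distribution charts for one target data feature."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    if feature_type == "categorical":
        levels, counts = np.unique(values, return_counts=True)
        labels = [str(l) for l in levels]
        ax1.barh(labels, counts)                           # horizontal bar chart
        ax1.set_title(f"{name}: category counts")
        ax2.boxplot([target[values == l] for l in levels], labels=labels)
        ax2.set_title(f"{name} vs. target (box plot)")
    else:
        ax1.hist(values, bins=20)                          # histogram
        ax1.set_title(f"{name}: histogram")
        ax2.scatter(values, target, s=10)                  # scatter regression plot
        slope, intercept = np.polyfit(values, target, 1)
        xs = np.linspace(values.min(), values.max(), 100)
        ax2.plot(xs, slope * xs + intercept)
        ax2.set_title(f"{name} vs. target (scatter regression)")
    fig.tight_layout()
    return fig
```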
Optionally, verifying the target data feature based on its data distribution chart includes: removing the target data feature, or removing the target data feature group that contains it, when the data distribution chart of the target data feature does not satisfy the distribution rule.
In some embodiments, different chart types may correspond to different distribution rules, and the data distribution chart of a target data feature is verified against the distribution rule corresponding to that feature. In some embodiments, different target data features correspond to different distribution rules, and the chart of a target data feature may be verified against the distribution rule corresponding to the feature type.
For a target data feature whose data distribution chart does not satisfy the distribution rule, the feature is removed so as to avoid introducing errors in subsequent analysis and application. Further, since the multiple target data features in a target data feature group act jointly to predict the processing target, if any target data feature in the group does not satisfy the distribution rule, the group would introduce errors in subsequent analysis and application; the target data feature group is therefore removed.
In the technical solution provided by this embodiment, after the target data feature group corresponding to the processing target is screened out from the multiple feature validation subsets by machine learning, the data distribution of the target data features is further verified so as to remove data features that do not match the clinical presentation, ensuring that the screened-out target data features are of practical use.
Embodiment 4
On the basis of the above embodiments, an embodiment of the present invention further provides a preferred example of the feature screening method. See FIG. 5, which is a flowchart of a feature screening method provided by an embodiment of the present invention. FIG. 5 shows a system structure for executing the feature screening method, which includes a main module (i.e., a regression learning algorithm module), a cross-validation data set construction module, and a feature distribution plotting module; the main module can call the cross-validation data set construction module and the feature distribution plotting module. After performing univariate screening on the data features in the sample data, the main module determines the multiple feature validation subsets used for multivariate screening. Before performing multivariate screening with the regression learning algorithm, it calls the cross-validation data set construction module to obtain the training data set and test data set used for multivariate screening. The cross-validation data set construction module splits the training data set and test data set by cross-validation based on the correspondence between individuals and samples. The main module executes the regression learning algorithm to obtain the multivariate screening result, i.e., the target data feature group. The main module then calls the feature distribution plotting module to draw the distribution charts of the features in the data, which intuitively show the distribution state of the target data features and are used to verify the target data feature group.
Embodiment 5
FIG. 6 is a schematic structural diagram of a feature screening apparatus provided by an embodiment of the present invention. As shown in FIG. 6, the apparatus includes:
a feature validation subset determination module 410, configured to determine multiple feature validation subsets based on data features in sample data;
a data set splitting module 420, configured to divide the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, perform cross-validation splitting based on the multiple individual sample groups, and determine the resulting training data set and validation data set;
a model training module 430, configured to train machine learning models for a processing target based on the training data set and validation data set corresponding to each feature validation subset; and
a target data feature group determination module 440, configured to determine a target data feature group corresponding to the processing target based on the training process data of each machine learning model.
In the technical solution of this embodiment, the multiple data features in the sample data are validated, in the form of feature validation subsets, by means of machine learning models, so that wrapper-style screening of the data features is achieved and a target data feature group for predicting the processing target is obtained. Further, the sample data used for training the machine learning models are partitioned by individual, so that sample data of the same individual are placed in the same individual sample group, and cross-validation splitting is performed on the individual sample groups. This prevents sample data of the same individual from being assigned to the training data set and the validation data set at the same time, removes the influence of individual-specific sample data on model performance, and further improves the accuracy of feature screening.
On the basis of the above embodiment, optionally, the apparatus further includes:
a candidate data feature screening module, configured to, before the multiple feature validation subsets are determined based on the data features in the sample data, determine the correlation between each data feature in the sample data and the processing target and screen candidate data features based on the correlation with the processing target;
correspondingly, the feature validation subset determination module 410 is configured to determine multiple feature validation subsets from the candidate data features.
On the basis of the above embodiment, optionally, the feature validation subset determination module 410 is configured to determine multiple feature validation subsets from the data features in the sample data or from the candidate data features, based on the number of features in a feature validation subset.
On the basis of the above embodiment, optionally, the data set splitting module 420 is configured to:
assign at least one group of sample data of the same individual to one individual group to obtain individual sample groups corresponding to different individuals; and
perform cross-validation splitting on the multiple individual sample groups based on at least one preset cross-validation rule to determine the resulting training data set and validation data set.
On the basis of the above embodiment, optionally, the target data feature group determination module 440 is configured to:
for any machine learning model, determine training indicators and test indicators respectively based on the training data and validation data in the training process data of that machine learning model;
rank and screen the machine learning models based on their training indicators and test indicators; and
determine the feature validation subset corresponding to the screened-out machine learning model as the target data feature group of the processing target.
Optionally, the training indicators and the test indicators each include the root-mean-square error and the goodness of fit.
On the basis of the above embodiment, optionally, the apparatus further includes:
a data distribution chart plotting module, configured to, for any target data feature, draw a data distribution chart of the target data feature based on the sample data corresponding to that feature; and
a feature verification module, configured to verify the target data feature based on its data distribution chart.
Optionally, the data distribution chart plotting module includes:
a data type determination unit, configured to determine the data type of the target data feature; and
a data distribution chart plotting unit, configured to draw, based on the sample data corresponding to the target data feature, a data distribution chart of the type corresponding to that data type.
Optionally, the data type determination unit is configured to:
de-duplicate the data values of the target data feature to obtain de-duplicated data values; and
determine that the data type of the target data feature is categorical when the de-duplicated data values are all integers and the number of data values is smaller than or equal to a preset threshold, and determine that the data type of the target data feature is numerical when the de-duplicated data values are not all integers or the number of data values is greater than or equal to the preset threshold.
Optionally, the data distribution chart plotting unit is configured to:
if the data type of the target data feature is categorical, draw a horizontal bar chart of the target data feature based on the corresponding sample data, together with a box plot of the target data feature against the processing target; and
if the data type of the target data feature is numerical, draw a histogram of the target data feature based on the corresponding sample data, together with a scatter regression plot of the target data feature against the processing target.
Optionally, the feature verification module is configured to:
remove the target data feature, or remove the target data feature group that contains it, when the data distribution chart of the target data feature does not satisfy the distribution rule.
The feature screening apparatus provided by the embodiments of the present invention can execute the feature screening method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment 6
FIG. 7 is a schematic structural diagram of an electronic device provided by Embodiment 6 of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown here, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present invention described and/or claimed herein.
As shown in FIG. 7, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, where the memory stores a computer program executable by the at least one processor. The processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from a storage unit 18 into the RAM 13. Various programs and data required for the operation of the electronic device 10 may also be stored in the RAM 13. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14, and an input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; the storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any general-purpose and/or special-purpose processing component with processing and computing capability. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The processor 11 executes the methods and processes described above, such as the feature screening method.
In some embodiments, the feature screening method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the feature screening method described above can be executed. Alternatively, in other embodiments, the processor 11 may be configured to execute the feature screening method in any other suitable way (for example, by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the feature screening method of the present invention may be written in any combination of one or more programming languages. The computer program may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the functions/operations specified in the flowcharts and/or block diagrams are implemented when the computer program is executed by the processor. The computer program may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
Embodiment 7
Embodiment 7 of the present invention provides a computer-readable storage medium storing computer instructions, the computer instructions being used to cause a processor to execute a feature screening method, the method including:
determining multiple feature validation subsets based on data features in sample data;
dividing the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, performing cross-validation splitting based on the multiple individual sample groups, and determining the resulting training data set and validation data set;
training machine learning models for a processing target based on the training data set and validation data set corresponding to each feature validation subset; and
determining a target data feature group corresponding to the processing target based on the training process data of each machine learning model.
In the context of the present invention, the computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device. Other kinds of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or web browser through which the user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.
The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of the present invention can be achieved, and no limitation is imposed here.
The above specific implementations do not limit the protection scope of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A feature screening method, comprising:
    determining multiple feature validation subsets based on data features in sample data;
    dividing the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, performing cross-validation splitting based on the multiple individual sample groups, and determining the resulting training data set and validation data set;
    training machine learning models for a processing target based on the training data set and validation data set corresponding to each feature validation subset; and
    determining a target data feature group corresponding to the processing target based on the training process data of each machine learning model.
  2. The method according to claim 1, wherein before determining multiple feature validation subsets based on the data features in the sample data, the method further comprises:
    determining the correlation between each data feature in the sample data and the processing target, and screening candidate data features based on the correlation with the processing target;
    correspondingly, determining multiple feature validation subsets based on the data features in the sample data comprises: determining multiple feature validation subsets from the candidate data features.
  3. The method according to any one of claims 1-2, wherein determining multiple feature validation subsets based on the data features in the sample data comprises:
    determining multiple feature validation subsets from the data features in the sample data or from the candidate data features, based on the number of features in a feature validation subset.
  4. The method according to claim 1, wherein dividing the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, performing cross-validation splitting based on the multiple individual sample groups, and determining the resulting training data set and validation data set comprises:
    assigning at least one group of sample data of the same individual to one individual group to obtain individual sample groups corresponding to different individuals; and performing cross-validation splitting on the multiple individual sample groups based on at least one preset cross-validation rule to determine the resulting training data set and validation data set;
    and/or,
    determining the target data feature group corresponding to the processing target based on the training process data of each machine learning model comprises:
    for any machine learning model, determining training indicators and test indicators respectively based on the training data and validation data in the training process data of that machine learning model; ranking and screening the machine learning models based on their training indicators and test indicators; and determining the feature validation subset corresponding to the screened-out machine learning model as the target data feature group of the processing target,
    wherein the training indicators and the test indicators each comprise a root-mean-square error and a goodness of fit.
  5. The method according to claim 1, wherein after the target data feature group is determined, the method further comprises:
    for any target data feature, drawing a data distribution chart of the target data feature based on the sample data corresponding to the target data feature; and
    verifying the target data feature based on the data distribution chart of the target data feature.
  6. The method according to claim 5, wherein drawing the data distribution chart of the target data feature based on the sample data corresponding to the target data feature comprises:
    determining the data type of the target data feature; and drawing, based on the sample data corresponding to the target data feature, a data distribution chart of the type corresponding to that data type;
    and/or,
    verifying the target data feature based on the data distribution chart of the target data feature comprises: removing the target data feature, or removing the target data feature group that contains the target data feature, when the data distribution chart of the target data feature does not satisfy a distribution rule.
  7. The method according to claim 6, wherein determining the data type of the target data feature comprises:
    de-duplicating the data values of the target data feature to obtain de-duplicated data values; determining that the data type of the target data feature is categorical when the de-duplicated data values are all integers and the number of data values is smaller than or equal to a preset threshold; and determining that the data type of the target data feature is numerical when the de-duplicated data values are not all integers or the number of data values is greater than or equal to the preset threshold;
    and/or,
    drawing, based on the sample data corresponding to the target data feature, a data distribution chart of the type corresponding to the data type comprises:
    if the data type of the target data feature is categorical, drawing a horizontal bar chart of the target data feature based on the sample data corresponding to the target data feature, together with a box plot of the target data feature against the processing target; and if the data type of the target data feature is numerical, drawing a histogram of the target data feature based on the sample data corresponding to the target data feature, together with a scatter regression plot of the target data feature against the processing target.
  8. A feature screening apparatus, comprising:
    a feature validation subset determination module, configured to determine multiple feature validation subsets based on data features in sample data;
    a data set splitting module, configured to divide the sample data into individual groups according to the individuals to which the sample data belong to obtain individual sample groups corresponding to different individuals, perform cross-validation splitting based on the multiple individual sample groups, and determine the resulting training data set and validation data set;
    a model training module, configured to train machine learning models for a processing target based on the training data set and validation data set corresponding to each feature validation subset; and
    a target data feature group determination module, configured to determine a target data feature group corresponding to the processing target based on the training process data of each machine learning model.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the feature screening method according to any one of claims 1-7.
  10. A computer-readable storage medium, storing computer instructions, the computer instructions being used to cause a processor, when executing them, to implement the feature screening method according to any one of claims 1-7.
PCT/CN2022/113011 2022-06-02 2022-08-17 Feature screening method and apparatus, storage medium, and electronic device WO2023231184A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210624370.7 2022-06-02
CN202210624370.7A CN114936205A (zh) Feature screening method and apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2023231184A1 true WO2023231184A1 (zh) 2023-12-07

Family

ID=82866696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113011 WO2023231184A1 (zh) Feature screening method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN114936205A (zh)
WO (1) WO2023231184A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180365525A1 (en) * 2016-02-26 2018-12-20 Alibaba Group Holding Limited Multi-sampling model training method and device
US20190258904A1 (en) * 2018-02-18 2019-08-22 Sas Institute Inc. Analytic system for machine learning prediction model selection
US20190392351A1 (en) * 2018-06-22 2019-12-26 Amadeus S.A.S. System and method for evaluating and deploying unsupervised or semi-supervised machine learning models
US20210150415A1 (en) * 2018-10-24 2021-05-20 Advanced New Technologies Co., Ltd. Feature selection method, device and apparatus for constructing machine learning model
US20210319366A1 (en) * 2020-12-22 2021-10-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus and device for generating model and storage medium
WO2021212737A1 (zh) * 2020-04-23 2021-10-28 苏州浪潮智能科技有限公司 一种行人重识别方法、系统、设备及计算机可读存储介质

Also Published As

Publication number Publication date
CN114936205A (zh) 2022-08-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944495

Country of ref document: EP

Kind code of ref document: A1