WO2021139115A1 - Feature selection method, apparatus, device, and storage medium - Google Patents

Feature selection method, apparatus, device, and storage medium

Info

Publication number
WO2021139115A1
WO2021139115A1 (PCT/CN2020/099553, CN2020099553W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
sets
importance
score
Prior art date
Application number
PCT/CN2020/099553
Other languages
English (en)
French (fr)
Inventor
刘小双
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021139115A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a feature selection method, apparatus, device, and storage medium.
  • Feature selection is a dimensionality-reduction technique that can find the features most relevant to a problem, remove redundant features, and improve data storage and processing efficiency. Using these most-relevant features for later model construction avoids the curse of dimensionality.
  • In the medical field, when processing medical data, feature selection can screen out important features that are highly correlated with outcomes in medical production. For example, feature selection can find the features most correlated with sleep quality, so that a wearable device that detects health data can focus its detection on the important features obtained by the screening.
  • The current feature selection method selects by filtering, that is, by looping to continually remove features of low importance. The filtering usually follows two operating principles: if deleting a feature degrades model performance, the feature is considered important; if deleting a feature leaves model performance unchanged, the feature is considered unimportant.
  • The inventor realized that in actual training, if model performance does not change after a certain feature is deleted, this does not sufficiently show that the feature is unimportant. In most cases, if the feature dimensionality is very large and features A, B, and C are correlated,
  • then once the model, by chance, selects features A and B as important and puts them in the model, the importance of feature C becomes 0 and feature C is filtered out, even though feature C is also strongly correlated with the outcome. In other words, this feature selection method inevitably removes one of two or three mutually strongly correlated features.
  • When the purpose of feature selection is to mine important features rather than to obtain the best model, the high correlation between features and the mutual interference of the information they carry mean that important features cannot be selected, or some important features are filtered out.
  • The main purpose of this application is to solve the prior-art problem that deleting one of several mutually strongly correlated features makes it impossible to select important features, or filters some important features out.
  • The first aspect of this application provides a feature selection method, including: obtaining original medical data and performing characterization processing on it to obtain a feature group to be selected; copying the feature group multiple times and randomly shuffling each copy to obtain multiple random sets; splicing the feature group with the random sets into a feature matrix and dividing the matrix into n training sets; selecting n-1 of the n training sets to build tree models, obtaining n tree models, and computing each model's feature importance set; computing representative scores of the features to be selected; obtaining and recording the feature with the highest representative score and removing it and its corresponding random features from the feature matrix; and judging whether the number of selected features has reached a preset number.
  • If not, the feature matrix obtained after feature-removal processing is segmented again, and feature selection continues.
  • the second aspect of the present application provides a feature selection device, including:
  • a characterization module, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
  • a copy module, used to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
  • a splicing and segmentation module, used to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
  • a tree model construction module, used to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
  • a calculation module, configured to compute the representative scores of the features to be selected according to the feature importance sets;
  • a recording module, which obtains the feature to be selected corresponding to the highest representative score, records its score, and removes the obtained feature and its corresponding random features from the feature matrix;
  • a judging module, used to judge whether the number of selected features is greater than or equal to the preset number of features;
  • an output module, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
  • a loop module, used to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
  • A third aspect of the present application provides a feature selection device, including a memory and at least one processor; the memory stores instructions, and the memory and the at least one processor are interconnected by wires. The at least one processor calls the instructions in the memory so that the feature selection device executes the steps of the feature selection method described in the first aspect, from obtaining the original medical data and performing characterization processing on it through to segmenting the feature matrix obtained after feature removal and continuing feature selection.
  • The fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the steps of the feature selection method described in the first aspect, from obtaining the original medical data and performing characterization processing on it through to segmenting the feature matrix obtained after feature removal and continuing feature selection.
  • Feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. Because this method avoids filtering out one of two or three mutually strongly correlated features during feature selection, it meets the need of mining important medical features.
  • FIG. 1 is a schematic diagram of an embodiment of a feature selection method in an embodiment of this application
  • FIG. 2 is a schematic diagram of another embodiment of the feature selection method in the embodiment of the application.
  • FIG. 3 is a schematic diagram of an embodiment of a feature selection device in an embodiment of this application.
  • FIG. 4 is a schematic diagram of another embodiment of the feature selection device in the embodiment of the application.
  • Fig. 5 is a schematic diagram of an embodiment of a feature selection device in an embodiment of the application.
  • The embodiment of the application provides a feature selection method.
  • The specific implementation is as follows: feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
  • An embodiment of the feature selection method in the embodiment of the present application includes:
  • The characterization processing maps the original medical data to the features to be selected.
  • There are many kinds of raw data in the medical process. For example, in blood information data, "18.5*10^9/L" and "20*10^9/L" correspond to an individual's white blood cell count; "71.80%", "72.50%", and "73.67%" correspond to the lymphocyte percentage; and "31.19pg" and "32.50pg" correspond to the mean hemoglobin content.
  • Each person's original medical data is characterized so that it corresponds to a feature to be selected; for example, the feature to be selected corresponding to "32.50pg" is the mean hemoglobin content.
  • The obtained features to be selected are collected into one set, giving the feature group to be selected.
  • Copying the feature group to be selected yields identical sets.
  • A random seed is chosen and the features to be selected in the feature group are shuffled; choosing random seeds makes the feature order truly random rather than pseudo-random. The copying is repeated multiple times, and a different random seed is chosen each time to shuffle the feature order, yielding multiple different random sets.
  • The purpose of constructing the random sets is to sever the association between each feature and the outcome, which removes the bias that chance and randomness introduce into the evaluation of feature importance.
  • n is a preset value and a positive integer greater than 1.
  • The feature matrix is divided by stratified sampling of its samples, with outcome labels applied to the sample data in advance. For example, when producing certain blood-disease drugs, sample data must be obtained; if a collected personal sample is known from the individual's personal data to have a certain blood disease, that medical sample data is given the outcome label "affected". Stratified sampling is then performed according to the outcome labels of the data.
  • For example, if the preset number of training-set groups is 5, and the outcome labels of 10 feature data are "affected" while those of 20 feature data are "healthy", then 2 "affected" and 4 "healthy" samples are drawn each time.
  • The 6 samples obtained are treated as 1 group, and stratified sampling yields 5 groups.
  • A fixed random seed can be chosen for each stratified sampling to ensure that repeated experiments give the same stratified-sampling results.
  • The purpose of stratified sampling, and of dividing into n groups to build n tree models, is to ensure that all samples participate in training when the tree models are built, avoiding random bias from the samples.
  • The model training algorithm can be one or more of the random forest algorithm, AdaBoost, GBDT, XGBoost, and LightGBM, and the tree models are built by training with these algorithms. GBDT (gradient-boosted decision trees) mainly averages a feature's importance over the individual trees, while XGBoost computes it from the sum of the feature's split counts over the trees: for example, if a feature splits once in the first tree, twice in the second tree, and so on, its score is (1 + 2 + ...). Which tree-based algorithm model to use can be chosen from a practical standpoint, such as the application domain and the characteristics of the features.
  • This embodiment mainly uses the random forest algorithm. After the n tree models are obtained, the feature importance set corresponding to each tree model must be computed.
  • The feature importance set contains values for how strongly the features to be selected and the random features influence the tree model.
  • Random forests compute feature importance mainly by judging how much each feature contributes to each tree in the forest and then taking the average; there are generally two approaches, out-of-bag (OOB) error-rate evaluation and the Gini index.
  • In this embodiment, OOB error-rate evaluation is chosen to compute the feature importance of the tree models.
  • Each feature importance set includes the feature importance of the feature to be selected and the feature importance of its corresponding random features.
  • From these, the Z-score of each feature to be selected and the Z-score of its corresponding random features are computed, and the representative score of each feature to be selected is computed
  • from its Z-score and the Z-score of its corresponding random features.
  • The features to be selected can then be sorted by representative score, and, according to a preset ratio, at least one feature corresponding to the highest representative score is removed and its representative score is recorded.
  • Continually removing the most important feature in this way eliminates the interference between features and removes their mutual interference during the selection process.
  • After each round of feature selection, it is necessary to judge whether to continue; otherwise feature selection would simply continue, increasing the amount of computation and reducing operating efficiency, while the features selected later are also of relatively little importance.
  • In this embodiment this is done by judging whether the number of selected features is greater than or equal to the preset number of features.
  • It is also possible to compute a model evaluation metric (AUC) of the tree models after each round of feature selection, and to judge whether the metric is below a preset value to decide whether to continue; the metric lies in [0.5, 1], and the larger the value, the better the model.
  • Feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order.
  • The original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. This approach avoids removing one of two or three mutually strongly correlated features during feature selection and so meets the need of mining important medical features.
  • Referring to FIG. 2, another embodiment of the feature selection method in the embodiment of the present application includes the following. After the feature group to be selected is copied multiple times,
  • the method further includes: applying a feature label to the copied features obtained after copying, the feature label being used to match each feature to be selected with its random features, wherein the feature label is used when segmenting the feature matrix.
  • Otherwise, a random feature could not be matched with its feature to be selected in the subsequent computation, and the representative score of the feature to be selected could not be computed.
  • Through the feature label, the random features corresponding to a feature to be selected can be found, for example the random feature pairs A-S1, A-S2, etc. corresponding to feature A to be selected.
  • The data of different features can be divided into at least one class of samples according to the oriented outcome, with outcome labels applied to the sample data in advance.
  • The outcome labels are applied to the sample data in advance.
  • For example, if the outcome labels of 10 feature data are "affected" and those of 20 feature data are "healthy", then 2 samples labeled "affected" and 4 samples labeled "healthy" are drawn each time, and the 2 "affected" samples and 4 "healthy" samples drawn are taken as one group.
  • Stratified sampling is performed to obtain n groups of samples, and n-1 of them are taken each time as training-set samples, so that a total of n different training sets is obtained.
  • For example, when the preset number of training sets is 3:
  • the second and third groups of samples form one training set;
  • the first and third groups of samples form one training set;
  • the first and second groups of samples form one training set.
  • Stratified sampling of the samples ensures that individuals are drawn from every stratum of the population.
  • Building n training sets guards against the chance error caused by the randomness of the samples.
  • n-1 of the n training sets are selected each time to build a tree model based on a model training algorithm, giving n tree models. The feature importance is computed as I = (1/n) Σ (E_I - E_i), summed over the n tree models, where:
  • I is the feature importance;
  • n is the number of tree models;
  • E_I is the second classification error count;
  • E_i is the first classification error count.
  • The tree model computes a feature importance for each input feature.
  • The tree model computes the feature importance of each feature to be selected and of the random features corresponding to it.
  • Subtracting the Z-score of the corresponding random features from the Z-score of the feature to be selected
  • gives the representative score of the feature to be selected. The Z-score is computed as Z-score = Ī / σ_I, where:
  • Ī is the mean of a feature's importance over the n tree models;
  • σ_I is the standard deviation of the feature's importance over the tree models.
  • One of the first importance score and the second importance score is selected as a reference score, and the reference score is used for subsequent feature analysis.
  • The importance scores of all the selected important features still need to be computed.
  • The importance scores obtained make it convenient to compare the importance of the features later. This is because, after each round of feature selection and elimination, the selected feature loses its association with the other features; therefore, after the needed important features have been selected, the importance scores of all selected features must be computed.
  • If too many features were selected when feature selection began and the importance scores of some of them are too low, the subset of features with the highest importance scores can be chosen for subsequent analysis.
  • The first feature importance of a feature a is computed as Score_a = (Z-score_a - Z-score_min) / (Z-score_max - Z-score_min), where:
  • Z-score_a is the Z-score of feature a, and Z-score_max is the maximum Z-score among all selected features;
  • Z-score_min is the minimum Z-score among all selected features.
  • Feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order, and the original feature set
  • is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. This approach avoids removing one of two or three mutually strongly correlated features during feature selection and so meets the need of mining important medical features.
  • An embodiment of the feature selection device in the embodiment of the present application includes:
  • the characterization module 301, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
  • the copy module 302, configured to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
  • the splicing and segmentation module 303, configured to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
  • the tree model construction module 304, configured to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
  • the calculation module 305, configured to compute the representative scores of the features to be selected according to the feature importance sets;
  • the recording module 306, which obtains the feature to be selected corresponding to the highest representative score, records its score, and removes the obtained feature and its corresponding random features from the feature matrix;
  • the judging module 307, used to judge whether the number of selected features is greater than or equal to the preset number of features;
  • the output module 308, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
  • the loop module 309, configured to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
  • Feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order, and the original feature set
  • is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. This approach avoids removing one of two or three mutually strongly correlated features during feature selection and so meets the need of mining important medical features.
  • Referring to FIG. 4, another embodiment of the feature selection apparatus in the embodiment of the present application includes:
  • the characterization module 301, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
  • the copy module 302, configured to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
  • the splicing and segmentation module 303, configured to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
  • the tree model construction module 304, configured to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
  • the calculation module 305, configured to compute the representative scores of the features to be selected according to the feature importance sets;
  • the recording module 306, configured to obtain the feature to be selected corresponding to the highest representative score, record its score, and remove the obtained feature and its corresponding random features from the feature matrix;
  • the judging module 307, used to judge whether the number of selected features is greater than or equal to the preset number of features;
  • the output module 308, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
  • the loop module 309, configured to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
  • Optionally, the feature selection apparatus further includes a label module 310, used to apply feature labels to the copied features obtained after copying, the feature labels being used to match each feature to be selected with its random features, wherein the feature labels are used when segmenting the feature matrix.
  • Optionally, the splicing and segmentation module 303 includes: a dividing unit 3031 and a stratified sampling unit 3032;
  • the dividing unit 3031 is configured to divide the data of each feature in the feature matrix into at least one class of samples according to the feature labels;
  • the stratified sampling unit 3032 is used to perform stratified sampling on the samples to obtain n training sets, the training sets being used to build tree models through the model training algorithm.
  • Optionally, the tree model construction module 304 includes: a first calculation unit 3041, a second calculation unit 3042, and a feature importance calculation unit 3043;
  • the first calculation unit 3041 computes the first classification error count of each tree model in the random forest on its out-of-bag data;
  • the second calculation unit 3042 randomly perturbs the values of a feature in the tree model's out-of-bag data and computes the second classification error count;
  • the feature importance calculation unit 3043 computes the feature importance of each feature from the first classification error count and the second classification error count.
  • Optionally, the calculation module 305 includes: a Z-score unit 3051 and a representative score unit 3052;
  • the Z-score unit 3051 is used to compute the Z-score of each feature according to its feature importance;
  • the representative score unit 3052 is configured to compute the representative score of each feature to be selected from its Z-score and the Z-score of its corresponding random features.
  • the recording module 306 is configured to obtain the feature to be selected corresponding to the highest representative score, record its score, and remove the obtained feature and its corresponding random features from the feature matrix;
  • the judging module 307 is used to judge whether the number of selected features is greater than or equal to the preset number of features;
  • the output module 308 is used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
  • the loop module 309 is configured to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
  • Optionally, the feature selection apparatus further includes an analysis module 311, used to obtain the selected features and the Z-scores corresponding to the selected features; compute the first importance scores of all selected features according to the corresponding Z-scores;
  • re-input the selected features into the tree model, compute the feature importance of the selected features, and take that feature importance as the second importance score; and select one of
  • the first importance score and the second importance score as a reference score, the reference score being used for subsequent feature analysis.
  • Feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order.
  • The original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. This approach avoids removing one of two or three mutually strongly correlated features during feature selection and so meets the need of mining important medical features.
  • FIG. 5 is a schematic structural diagram of a feature selection device provided by an embodiment of the present application.
  • The feature selection device 500 may differ considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 510 (for example, one or more processors), memory 520, and one or more storage media 530 (for example, one or more mass-storage devices) storing application programs 533 or data 532.
  • The memory 520 and the storage medium 530 may be transient or persistent storage.
  • The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the feature selection device 500.
  • Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute on the feature selection device 500 the series of instruction operations in the storage medium 530, implementing the steps of the feature selection method in the foregoing embodiments.
  • The feature selection device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • The structure shown in FIG. 5 does not limit the feature selection device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
  • The present application also provides a feature selection device comprising a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the feature selection method in the foregoing embodiments.
  • The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • The computer-readable storage medium stores a computer program (that is, instructions) which, when run on a computer, causes the computer to execute the steps of the feature selection method; optionally, the computer program is executed by a processor on the computer.
  • If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, or all or part of it, may be embodied as a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and cryptographic algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block.
  • A blockchain may include the underlying blockchain platform, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

A feature selection method, apparatus, device, and storage medium, relating to the field of artificial intelligence. Feature values are extracted from original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. A feature importance set is then computed, the true score of each feature is computed from it, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating the above steps until the number of removed features reaches a preset number. Blockchain technology is also involved: the original medical data may be stored in blockchain nodes.

Description

Feature selection method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application No. 202010453796.1, entitled "Feature selection method, apparatus, device, and storage medium", filed with the Chinese Patent Office on May 26, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a feature selection method, apparatus, device, and storage medium.
Background
In recent years, with the spread of the Internet, many enterprises and organizations have generated large amounts of data; large data volumes and extremely high dimensionality have become the main obstacles to subsequent analysis, so redundancy must be screened out of the surplus information to find the relevant information. Feature selection is a dimensionality-reduction technique that can find the features most relevant to a problem, remove redundant features, and improve data storage and processing efficiency; using these most-relevant features for later model construction avoids the curse of dimensionality.
In the medical field, when processing medical data, feature selection can screen out important features that are highly correlated with outcomes in medical production. For example, feature selection can find the features most correlated with sleep quality, so that a wearable device that detects health data can focus its detection on the important features obtained by the screening.
The current feature selection method selects by filtering, that is, by looping to continually remove features of low importance. The filtering usually follows two operating principles: first, if deleting a feature degrades model performance, the feature is considered important; second, if deleting a feature leaves model performance unchanged, the feature is considered unimportant. However, the inventor realized that in actual training, unchanged model performance after a feature is deleted does not sufficiently show that the feature is unimportant. In most cases, if the feature dimensionality is very large and features A, B, and C are correlated, then once the model, by chance, selects A and B as important features and puts them in the model, the importance of feature C becomes 0 and feature C is filtered out, even though feature C is also strongly correlated with the outcome. In other words, this feature selection method inevitably removes one of two or three mutually strongly correlated features. When the purpose of feature selection is to mine important features rather than to obtain the best model, the high correlation between features and the mutual interference of the information they carry mean that important features cannot be selected, or some important features are filtered out.
Summary
The main purpose of this application is to solve the prior-art problem that deleting one of several mutually strongly correlated features makes it impossible to select important features, or filters some important features out.
A first aspect of this application provides a feature selection method, comprising:
obtaining original medical data, and performing characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data;
copying the feature group to be selected multiple times, and randomly shuffling the set after each copy to obtain multiple random sets;
splicing the feature group to be selected with the multiple random sets to obtain a feature matrix, and dividing the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
computing representative scores of the features to be selected according to the feature importance sets;
obtaining the feature to be selected corresponding to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
judging whether the number of selected features is greater than or equal to a preset number of features; if so, ending feature selection and outputting the selected features as important medical features;
if not, segmenting the feature matrix obtained after feature-removal processing and continuing feature selection.
A second aspect of this application provides a feature selection apparatus, comprising:
a characterization module, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
a copy module, used to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
a splicing and segmentation module, used to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
a tree model construction module, used to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
a calculation module, used to compute the representative scores of the features to be selected according to the feature importance sets;
a recording module, which obtains the feature to be selected corresponding to the highest representative score, records its score, and removes the obtained feature and its corresponding random features from the feature matrix;
a judging module, used to judge whether the number of selected features is greater than or equal to the preset number of features;
an output module, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
a loop module, used to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
A third aspect of this application provides a feature selection device, comprising a memory and at least one processor, the memory storing instructions and the memory and the at least one processor being interconnected by wires; the at least one processor calls the instructions in the memory so that the feature selection device executes the following steps of the feature selection method: obtaining original medical data, and performing characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
copying the feature group to be selected multiple times, and randomly shuffling the set after each copy to obtain multiple random sets;
splicing the feature group to be selected with the multiple random sets to obtain a feature matrix, and dividing the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
computing representative scores of the features to be selected according to the feature importance sets;
obtaining the feature to be selected corresponding to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
judging whether the number of selected features is greater than or equal to a preset number of features;
if so, ending feature selection and outputting the selected features as important medical features;
if not, segmenting the feature matrix obtained after feature-removal processing and continuing feature selection.
A fourth aspect of this application provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the following steps of the feature selection method: obtaining original medical data, and performing characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
copying the feature group to be selected multiple times, and randomly shuffling the set after each copy to obtain multiple random sets;
splicing the feature group to be selected with the multiple random sets to obtain a feature matrix, and dividing the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
computing representative scores of the features to be selected according to the feature importance sets;
obtaining the feature to be selected corresponding to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
judging whether the number of selected features is greater than or equal to a preset number of features;
if so, ending feature selection and outputting the selected features as important medical features;
if not, segmenting the feature matrix obtained after feature-removal processing and continuing feature selection.
In the technical solution provided by this application, feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating the above steps until the number of removed features reaches a preset number. Because this method avoids filtering out one of two or three mutually strongly correlated features during feature selection, it meets the need of mining important medical features.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of one embodiment of the feature selection method in an embodiment of this application;
FIG. 2 is a schematic diagram of another embodiment of the feature selection method in an embodiment of this application;
FIG. 3 is a schematic diagram of one embodiment of the feature selection apparatus in an embodiment of this application;
FIG. 4 is a schematic diagram of another embodiment of the feature selection apparatus in an embodiment of this application;
FIG. 5 is a schematic diagram of one embodiment of the feature selection device in an embodiment of this application.
Detailed Description
The embodiment of this application provides a feature selection method. The specific implementation is as follows: feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of this application are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
For ease of understanding, the specific flow of the embodiment of this application is described below. Referring to FIG. 1, one embodiment of the feature selection method in the embodiment of this application includes:
101. Obtain original medical data, and perform characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data.
In practical applications, the characterization processing maps the original medical data to features to be selected. The medical process produces many kinds of raw data; for example, in blood information data, "18.5*10^9/L" and "20*10^9/L" correspond to an individual's white blood cell count, "71.80%", "72.50%", and "73.67%" correspond to the lymphocyte percentage, "31.19pg" and "32.50pg" correspond to the mean hemoglobin content, and so on. Each person's original medical data is characterized so that it corresponds to a feature to be selected; for example, the feature to be selected corresponding to "32.50pg" is the mean hemoglobin content. The obtained features to be selected are collected into one set, giving the feature group to be selected.
102. Copy the feature group to be selected multiple times, and randomly shuffle the set after each copy to obtain multiple random sets.
In this step, copying the feature group to be selected yields identical sets. In this application, a random seed is chosen and the features to be selected in the feature group are shuffled; choosing random seeds makes the feature order truly random rather than pseudo-random. The copying is repeated multiple times, and a different random seed is chosen each time to shuffle the feature order, yielding multiple different random sets. The purpose of constructing the random sets is to sever the association between each feature and the outcome, which removes the bias that chance and randomness introduce into the evaluation of feature importance.
103. Splice the feature group to be selected with the multiple random sets to obtain a feature matrix, and divide the feature matrix into n training sets.
In this step, n is a preset value and a positive integer greater than 1. The feature matrix is divided by stratified sampling of its samples, with outcome labels applied to the sample data in advance. For example, when producing certain blood-disease drugs, sample data must be obtained; if a collected personal sample is known from the individual's personal data to have a certain blood disease, the medical sample data is given the outcome label "affected". Stratified sampling is performed according to the outcome labels of the data: if the outcome labels of 10 feature data are "affected" and those of 20 feature data are "healthy", and the preset number of training-set groups is 5, then 2 are drawn each time from the 10 labeled "affected" and 4 each time from the 20 labeled "healthy", and the 6 samples obtained are taken as 1 group, stratified sampling yielding 5 groups. A fixed random seed can be chosen for each stratified sampling to ensure that repeated experiments give the same stratified-sampling results. The purpose of stratified sampling, and of dividing into n groups to build n tree models, is to ensure that all samples participate in training when the tree models are built, avoiding random bias from the samples.
104. Select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model.
In this step, the model training algorithm can be one or more of the random forest algorithm, AdaBoost, GBDT, XGBoost, and LightGBM, and the tree models are built by training with these algorithms. GBDT (gradient-boosted decision trees) mainly averages a feature's importance over the individual trees, while XGBoost computes it from the sum of the feature's split counts in each tree; for example, if the feature splits once in the first tree, twice in the second tree, and so on, then its score is (1 + 2 + ...). Which tree-based algorithm model to use can be chosen from a practical standpoint, such as the application domain and the characteristics of the features.
This embodiment mainly uses the random forest algorithm. After the n tree models are obtained, the feature importance set corresponding to each tree model must be computed; the set contains values for how strongly the features to be selected and the random features influence the tree model. In practical applications, random forests compute feature importance mainly by judging how much each feature contributes to each tree in the forest and then taking the average; there are generally two approaches, out-of-bag (OOB) error-rate evaluation and the Gini index. In this embodiment, OOB error-rate evaluation is chosen to compute the feature importance of the tree models.
105. Compute the representative scores of the features to be selected according to the feature importance sets.
106. Obtain the feature to be selected corresponding to the highest representative score, record the score of that feature, and remove the obtained feature and its corresponding random features from the feature matrix.
In this step, after the feature importance sets are obtained, different tree models correspond to different feature importance sets, and each set includes the feature importance of the features to be selected and of the corresponding random features. From these, the Z-score of each feature to be selected and the Z-score of its corresponding random features are computed, and the representative score of each feature to be selected is computed from the two Z-scores.
After the representative score of every feature to be selected has been computed, the features can be sorted by representative score, and, according to a preset ratio, at least one feature corresponding to the highest representative score is removed, its representative score being recorded. Continually removing the most important feature in this way eliminates the mutual interference between features, removes the problem of missing important features caused by inter-feature associations during selection, and screens out the outcome-related features more comprehensively.
107. Judge whether the number of selected features is greater than or equal to the preset number of features.
108. If so, end feature selection and output the selected features as important medical features.
109. If not, segment the feature matrix obtained after feature-removal processing and continue feature selection.
In this step, after each round of feature selection it is necessary to judge whether to continue; otherwise selection would simply continue, increasing computation, reducing operating efficiency, and yielding later features of relatively little importance. In this embodiment, this is done by judging whether the number of selected features is greater than or equal to the preset number of features. In practical applications, one can also compute a model evaluation metric (AUC) of the tree models after each round of feature selection and judge whether the metric is below a preset value to decide whether to continue; the metric lies in [0.5, 1], and the larger the value, the better the model.
In the embodiment of this application, feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
Referring to FIG. 2, another embodiment of the feature selection method in the embodiment of this application includes:
201. Obtain original medical data, and perform characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected.
202. Copy the feature group to be selected multiple times, and randomly shuffle the set after each copy to obtain multiple random sets.
In this step, after the feature group to be selected is copied multiple times, the method further includes:
applying a feature label to the copied features obtained after copying, the feature label being used to match each feature to be selected with its random features, wherein the feature label is used when segmenting the feature matrix.
In this step, because the features are subsequently shuffled to obtain random features, a random feature could otherwise not be matched with its feature to be selected in the subsequent computation, and the representative score of the feature to be selected could not be computed. To avoid this, in this application the names of the copied features obtained by copying the feature group to be selected are labeled so that each feature to be selected corresponds to its random features. Then, when the representative scores are computed after the shuffling, the random features corresponding to a feature to be selected can be found through the feature label, for example the random feature pairs A-S1, A-S2, etc. corresponding to feature A to be selected.
203. Splice the feature group to be selected with the multiple random sets to obtain a feature matrix.
204. According to the feature labels, divide the data of each feature in the feature matrix into at least one class of samples.
205. Perform stratified sampling on the samples to obtain n training sets, the training sets being used to build tree models through the model training algorithm.
In this step, the data of different features can be divided into at least one class of samples according to the oriented outcome, with outcome labels applied to the sample data in advance. For example, when producing certain blood-disease drugs, sample data must be obtained; if a collected personal sample is known from the individual's personal data to already have a blood disease, the sample data is given the outcome label "affected". Stratified sampling is performed according to the samples' outcome labels: if the outcome labels of 10 feature data are "affected" and those of 20 feature data are "healthy", and the preset number of training sets is 5, then 2 are drawn each time from those labeled "affected" and 4 each time from those labeled "healthy", the 2 "affected" samples and 4 "healthy" samples drawn forming one group. After stratified sampling yields n groups of samples, n-1 of them are taken each time as training-set samples, so that a total of n different training sets is obtained. For example, when the preset number of training sets is 3, after 3 groups of samples are obtained, 2 of them are taken each time as training-set samples: the second and third groups form one training set, the first and third groups form one training set, and the first and second groups form one training set. Stratified sampling of the samples ensures that individuals are drawn from every stratum of the population, while building n training sets guards against the chance error caused by the randomness of the samples.
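With n = 3 this leave-one-group-out construction yields exactly the three training sets listed above (groups 2+3, 1+3, and 1+2); a one-line sketch:

```python
import numpy as np

def leave_one_group_out(groups: list[np.ndarray]) -> list[np.ndarray]:
    """Each training set is the union of n-1 of the n stratified groups."""
    return [np.concatenate(groups[:i] + groups[i + 1:]) for i in range(len(groups))]
```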
206. Select n-1 of the n training sets to build tree models based on the model training algorithm, obtaining n tree models.
207. Compute the first classification error count of each tree model in the random forest on its out-of-bag data.
208. Randomly perturb the values of a feature in the tree model's out-of-bag data, and compute the second classification error count.
209. Compute the feature importance of each feature from the first classification error count and the second classification error count.
In practical applications, there are two ways for a random forest to compute the feature importance of a feature: by computing the out-of-bag error rate, or by computing the Gini index. This embodiment uses the out-of-bag error rate, and the feature importance of each feature is computed as
I = (1/n) Σ (E_I - E_i), summed over the n tree models,
where I is the feature importance, n is the number of tree models, E_I is the second classification error count, and E_i is the first classification error count.
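A self-contained sketch of steps 207-209 and the formula above; the bootstrap bookkeeping is done by hand here rather than through scikit-learn's private out-of-bag helpers, and X and y are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, n_trees=100, seed=0):
    """I = (1/n) * sum over trees of (E_I - E_i): permuted-OOB minus plain-OOB errors."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)
    for _ in range(n_trees):
        boot = rng.integers(0, n_samples, size=n_samples)   # bootstrap sample
        oob = np.setdiff1d(np.arange(n_samples), boot)      # out-of-bag rows
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        e_i = np.sum(tree.predict(X[oob]) != y[oob])        # first error count E_i
        for j in range(n_features):
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])    # perturb feature j
            e_I = np.sum(tree.predict(X_perm) != y[oob])    # second error count E_I
            importances[j] += (e_I - e_i)
    return importances / n_trees                            # average over the n trees
```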
210. Compute the Z-score of each feature according to its feature importance.
211. Compute the representative score of each feature to be selected from its Z-score and the Z-score of its corresponding random features.
In this application, the tree model computes a feature importance for every input feature. In the preceding steps, after the tree models have produced the feature importance of each feature to be selected and of the random features corresponding to it, the Z-score of the feature to be selected and the Z-score of its corresponding random features are computed; subtracting the latter from the former gives the representative score of the feature to be selected. The Z-score is computed as
Z-score = Ī / σ_I,
where Ī denotes the mean of a feature's importance over the n tree models, and σ_I denotes the standard deviation of the feature's importance over the tree models.
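A sketch of steps 210-211, given a matrix of per-model importances with one row per tree model and one column per feature; averaging the shadow copies' Z-scores is an assumption, since the text does not say how several random copies are combined:

```python
import numpy as np

def representative_scores_from_importances(importances, columns, originals):
    """Z = mean/std of each column's importance over the n models;
    representative score = Z(feature) - Z(its shadow copies)."""
    z = importances.mean(axis=0) / (importances.std(axis=0) + 1e-12)  # Z-score = I_bar / sigma_I
    z_by_name = dict(zip(columns, z))
    scores = {}
    for name in originals:
        shadow_z = [v for k, v in z_by_name.items() if k.startswith(f"{name}-S")]
        scores[name] = z_by_name[name] - float(np.mean(shadow_z))
    return scores
```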
212. Obtain the feature to be selected corresponding to the highest representative score, record the score of that feature, and remove the obtained feature and its corresponding random features from the feature matrix.
213. Judge whether the number of selected features is greater than or equal to the preset number of features.
214. If so, end feature selection and output the selected features as important medical features.
In this step, after the selected features are output as important medical features, the method further includes:
obtaining the selected features and the Z-scores corresponding to the selected features;
computing the first importance scores of all selected features according to the corresponding Z-scores;
re-inputting the selected features into the tree model, computing the feature importance of the selected features, and taking that feature importance as the second importance score;
selecting one of the first importance score and the second importance score as a reference score, the reference score being used for subsequent feature analysis.
In this step, after the important features have been obtained through feature selection, the importance scores of all the selected features must still be computed; the scores obtained make it convenient to compare the importance of the features later. This is because, after each round of feature selection and elimination, the selected feature loses its association with the other features, so once the needed important features have been selected, the importance scores of all selected features must be computed. Moreover, if too many features were selected when feature selection began and the importance scores of some of them are too low, the subset of features with the highest importance scores can be chosen for subsequent analysis. The first feature importance of a feature a is computed as
Score_a = (Z-score_a - Z-score_min) / (Z-score_max - Z-score_min),
where Z-score_a is the Z-score of feature a, Z-score_max is the maximum Z-score among all selected features, and Z-score_min is the minimum Z-score among all selected features.
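The first importance score is thus a min-max normalization of the recorded Z-scores; a minimal sketch:

```python
def first_importance_scores(z_scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize the selected features' Z-scores into [0, 1]."""
    z_min, z_max = min(z_scores.values()), max(z_scores.values())
    span = (z_max - z_min) or 1.0  # guard against all-equal Z-scores
    return {name: (z - z_min) / span for name, z in z_scores.items()}
```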
215. If not, segment the feature matrix obtained after feature-removal processing and continue feature selection.
In the embodiment of this application, feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
The feature selection method in the embodiments of this application has been described above; the feature selection apparatus in the embodiments of this application is described below. Referring to FIG. 3, one embodiment of the feature selection apparatus in the embodiment of this application includes:
the characterization module 301, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
the copy module 302, used to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
the splicing and segmentation module 303, used to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
the tree model construction module 304, used to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
the calculation module 305, used to compute the representative scores of the features to be selected according to the feature importance sets;
the recording module 306, which obtains the feature to be selected corresponding to the highest representative score, records its score, and removes the obtained feature and its corresponding random features from the feature matrix;
the judging module 307, used to judge whether the number of selected features is greater than or equal to the preset number of features;
the output module 308, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
the loop module 309, used to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
In the embodiment of this application, feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
Referring to FIG. 4, another embodiment of the feature selection apparatus in the embodiment of this application includes:
the characterization module 301, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
the copy module 302, used to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
the splicing and segmentation module 303, used to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
the tree model construction module 304, used to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
the calculation module 305, used to compute the representative scores of the features to be selected according to the feature importance sets;
the recording module 306, used to obtain the feature to be selected corresponding to the highest representative score, record its score, and remove the obtained feature and its corresponding random features from the feature matrix;
the judging module 307, used to judge whether the number of selected features is greater than or equal to the preset number of features;
the output module 308, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
the loop module 309, used to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
Optionally, the feature selection apparatus further includes a label module 310, used to apply feature labels to the copied features obtained after copying, the feature labels being used to match each feature to be selected with its random features, wherein the feature labels are used when segmenting the feature matrix.
Optionally, the splicing and segmentation module 303 includes: a dividing unit 3031 and a stratified sampling unit 3032;
the dividing unit 3031 is used to divide the data of each feature in the feature matrix into at least one class of samples according to the feature labels;
the stratified sampling unit 3032 is used to perform stratified sampling on the samples to obtain n training sets, the training sets being used to build tree models through the model training algorithm.
Optionally, the tree model construction module 304 includes: a first calculation unit 3041, a second calculation unit 3042, and a feature importance calculation unit 3043;
the first calculation unit 3041 computes the first classification error count of each tree model in the random forest on its out-of-bag data;
the second calculation unit 3042 randomly perturbs the values of a feature in the tree model's out-of-bag data and computes the second classification error count;
the feature importance calculation unit 3043 computes the feature importance of each feature from the first classification error count and the second classification error count.
Optionally, the calculation module 305 includes: a Z-score unit 3051 and a representative score unit 3052;
the Z-score unit 3051 is used to compute the Z-score of each feature according to its feature importance;
the representative score unit 3052 is used to compute the representative score of each feature to be selected from its Z-score and the Z-score of its corresponding random features.
Optionally, the feature selection apparatus further includes an analysis module 311, used to obtain the selected features and the Z-scores corresponding to the selected features; compute the first importance scores of all selected features according to the corresponding Z-scores; re-input the selected features into the tree model, compute the feature importance of the selected features, and take that feature importance as the second importance score; and select one of the first importance score and the second importance score as a reference score, the reference score being used for subsequent feature analysis.
In the embodiment of this application, feature values are extracted from the original medical data; all feature values are taken as a feature set and copied multiple times, each copy being shuffled to give feature sets in random order; and the original feature set is spliced with the random-order feature sets into a feature matrix. All samples are cut into n groups; n-1 groups are taken each time to build a tree model, and this is repeated n times. The feature importance sets are then computed, the true score of each feature is computed from them, the feature with the highest score among all features to be selected is removed, and segmentation continues with the reduced feature matrix, repeating these steps until the number of removed features reaches a preset number. In this way, removing one of two or three mutually strongly correlated features during feature selection is avoided, meeting the need of mining important medical features.
FIGS. 3 and 4 above describe the feature selection apparatus in the embodiments of this application in detail from the perspective of modular functional entities; the feature selection device in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a feature selection device provided by an embodiment of this application. The feature selection device 500 may differ considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 510 (for example, one or more processors), memory 520, and one or more storage media 530 (for example, one or more mass-storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations on the feature selection device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute on the feature selection device 500 the series of instruction operations in the storage medium 530, implementing the steps of the feature selection method in the foregoing embodiments.
The feature selection device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the device structure shown in FIG. 5 does not limit the feature selection device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
This application also provides a feature selection device comprising a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the feature selection method in the foregoing embodiments.
This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, storing a computer program (that is, instructions) which, when run on a computer, causes the computer to execute the steps of the feature selection method; optionally, the computer program is executed by a processor on the computer.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, or all or part of it, may be embodied as a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and cryptographic algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block. A blockchain may include the underlying blockchain platform, the platform product service layer, the application service layer, and so on.
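As a minimal, generic illustration of how each block commits to its information and to the previous block (a hash-chain sketch, not the storage design of this application), applied here to hypothetical batches of original medical data:

```python
import hashlib
import json

def add_block(chain: list[dict], records: list[dict]) -> None:
    """Append a block whose hash commits to its records and the previous block."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"prev": prev_hash, "records": records}, sort_keys=True)
    chain.append({"prev": prev_hash, "records": records,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

chain: list[dict] = []
add_block(chain, [{"sample": 1, "mch": "32.50pg"}])
add_block(chain, [{"sample": 2, "mch": "31.19pg"}])
# Tampering with block 0's records would change its recomputed hash and break
# the "prev" link stored in block 1, which is how validity is verified.
```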
The above embodiments are only intended to illustrate the technical solution of this application, not to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (20)

  1. A feature selection method, wherein the feature selection method comprises:
    obtaining original medical data, and performing characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
    copying the feature group to be selected multiple times, and randomly shuffling the set after each copy to obtain multiple random sets;
    splicing the feature group to be selected with the multiple random sets to obtain a feature matrix, and dividing the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
    selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
    computing representative scores of the features to be selected according to the feature importance sets;
    obtaining the feature to be selected corresponding to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
    judging whether the number of selected features is greater than or equal to a preset number of features;
    if so, ending feature selection and outputting the selected features as important medical features;
    if not, segmenting the feature matrix obtained after feature-removal processing and continuing feature selection.
  2. The feature selection method according to claim 1, wherein, after copying the feature group to be selected multiple times, the method further comprises:
    applying a feature label to the copied features obtained after copying, the feature label being used to match each feature to be selected with its random features, wherein the feature label is used when segmenting the feature matrix.
  3. The feature selection method according to claim 2, wherein dividing the feature matrix into n training sets comprises:
    dividing the data of each feature in the feature matrix into at least one class of samples according to the feature labels;
    performing stratified sampling on the samples to obtain n training sets, the training sets being used to build tree models through the model training algorithm.
  4. The feature selection method according to any one of claims 1-3, wherein the model training algorithm comprises any one of the random forest algorithm, AdaBoost, GBDT, XGBoost, and LightGBM.
  5. The feature selection method according to claim 4, wherein, when the model training algorithm is the random forest algorithm, selecting n-1 of the n training sets to build tree models based on the model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model comprises:
    computing the first classification error count of each tree model in the random forest on its out-of-bag data;
    randomly perturbing the values of a feature in the tree model's out-of-bag data and computing the second classification error count;
    computing the feature importance of each feature from the first classification error count and the second classification error count.
  6. The feature selection method according to claim 1, wherein computing the representative scores of the features to be selected according to the feature importance sets comprises:
    computing the Z-score of each feature according to its feature importance;
    computing the representative score of each feature to be selected from its Z-score and the Z-score of its corresponding random features.
  7. The feature selection method according to claim 6, wherein, after ending feature selection and outputting the selected features as important medical features, the method further comprises:
    obtaining the selected features and the Z-scores corresponding to the selected features;
    computing the first importance scores of all selected features according to the corresponding Z-scores;
    re-inputting the selected features into the tree model, computing the feature importance of the selected features, and taking that feature importance as the second importance score;
    selecting one of the first importance score and the second importance score as a reference score, the reference score being used for subsequent feature analysis.
  8. A feature selection apparatus, wherein the feature selection apparatus comprises:
    a characterization module, used to obtain original medical data and perform characterization processing on it to obtain the feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
    a copy module, used to copy the feature group to be selected multiple times and randomly shuffle the set after each copy to obtain multiple random sets;
    a splicing and segmentation module, used to splice the feature group to be selected with the multiple random sets to obtain a feature matrix and divide the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
    a tree model construction module, used to select n-1 of the n training sets to build tree models based on the model training algorithm, obtain n tree models, and compute the feature importance set corresponding to each tree model;
    a calculation module, used to compute the representative scores of the features to be selected according to the feature importance sets;
    a recording module, which obtains the feature to be selected corresponding to the highest representative score, records its score, and removes the obtained feature and its corresponding random features from the feature matrix;
    a judging module, used to judge whether the number of selected features is greater than or equal to the preset number of features;
    an output module, used to end feature selection when the number of selected features is greater than or equal to the preset number, and output the selected features as important medical features;
    a loop module, used to segment the feature matrix obtained after feature-removal processing and continue feature selection when the number of selected features is less than the preset number.
  9. A feature selection device, wherein the feature selection device comprises: a memory and at least one processor, the memory storing instructions, and the memory and the at least one processor being interconnected by wires;
    the at least one processor calls the instructions in the memory so that the feature selection device executes the following steps of the feature selection method:
    obtaining original medical data, and performing characterization processing on the original medical data to obtain a feature group to be selected corresponding to the original medical data, wherein the characterization processing maps the original medical data to features to be selected;
    copying the feature group to be selected multiple times, and randomly shuffling the set after each copy to obtain multiple random sets;
    splicing the feature group to be selected with the multiple random sets to obtain a feature matrix, and dividing the feature matrix into n training sets, where n is a preset value and a positive integer greater than 1;
    selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
    computing representative scores of the features to be selected according to the feature importance sets;
    obtaining the feature to be selected corresponding to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
    judging whether the number of selected features is greater than or equal to a preset number of features;
    if so, ending feature selection and outputting the selected features as important medical features;
    if not, segmenting the feature matrix obtained after feature-removal processing and continuing feature selection.
  10. The feature selection device according to claim 9, wherein, after the processor executes the step of copying the feature group to be selected multiple times, the following step is also executed:
    applying a feature label to the copied features obtained after copying, the feature label being used to match each feature to be selected with its random features, wherein the feature label is used when segmenting the feature matrix.
  11. The feature selection device according to claim 10, wherein, when the processor executes the step of dividing the feature matrix into n training sets, the following steps are also executed:
    dividing the data of each feature in the feature matrix into at least one class of samples according to the feature labels;
    performing stratified sampling on the samples to obtain n training sets, the training sets being used to build tree models through the model training algorithm.
  12. The feature selection device according to any one of claims 9-11, wherein the model training algorithm executed by the processor comprises any one of the random forest algorithm, AdaBoost, GBDT, XGBoost, and LightGBM.
  13. The feature selection device according to claim 12, wherein, when the model training algorithm is the random forest algorithm and the processor executes the step of selecting n-1 of the n training sets to build tree models based on the model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model, the following steps are also executed:
    computing the first classification error count of each tree model in the random forest on its out-of-bag data;
    randomly perturbing the values of a feature in the tree model's out-of-bag data and computing the second classification error count;
    computing the feature importance of each feature from the first classification error count and the second classification error count.
  14. The feature selection device according to claim 9, wherein, when the processor executes the step of computing the representative scores of the features to be selected according to the feature importance sets, the following steps are also executed:
    computing the Z-score of each feature according to its feature importance;
    computing the representative score of each feature to be selected from its Z-score and the Z-score of its corresponding random features.
  15. The feature selection device according to claim 14, wherein, after the processor executes the step of ending feature selection and outputting the selected features as important medical features, the following steps are also executed:
    obtaining the selected features and the Z-scores corresponding to the selected features;
    computing the first importance scores of all selected features according to the corresponding Z-scores;
    re-inputting the selected features into the tree model, computing the feature importance of the selected features, and taking that feature importance as the second importance score;
    selecting one of the first importance score and the second importance score as a reference score, the reference score being used for subsequent feature analysis.
  16. A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the following steps of the feature selection method are implemented: obtaining raw medical data and performing featurization on the raw medical data to obtain a group of features to be selected corresponding to the raw medical data, the featurization being a mapping between the raw medical data and the features to be selected;
    copying the group of features to be selected multiple times and randomly shuffling the set obtained from each copy, obtaining multiple random sets;
    concatenating the group of features to be selected with the multiple random sets to obtain a feature matrix, and segmenting the feature matrix into n training sets, n being a preset positive integer greater than 1;
    selecting n-1 of the n training sets to build tree models based on a model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model;
    computing representative scores of the plurality of features to be selected from the feature importance sets;
    obtaining the feature to be selected that corresponds to the highest representative score, recording the score of that feature, and removing the obtained feature and its corresponding random features from the feature matrix;
    judging whether the number of selected features is greater than or equal to a preset number of features;
    if so, ending feature selection and outputting the selected features as important medical features;
    if not, segmenting the feature matrix obtained after the feature removal and continuing feature selection.
  17. The computer-readable storage medium according to claim 16, wherein, after the processor executes the step of copying the group of features to be selected multiple times, the following step is further executed:
    labeling the copied features obtained from the copying with feature labels, the feature labels being used to put the features to be selected in correspondence with the random features, wherein the feature labels are used in the segmentation of the feature matrix.
  18. The computer-readable storage medium according to claim 17, wherein, when the processor executes the step of segmenting the feature matrix into n training sets, the following steps are further executed:
    dividing, according to the feature labels, the data of each feature in the feature matrix into at least one class of samples;
    performing stratified sampling on the samples to obtain the n training sets, the training sets being used to build tree models through a model training algorithm.
  19. The computer-readable storage medium according to any one of claims 16-18, wherein the model training algorithm executed by the processor comprises any one of the random forest algorithm, AdaBoost, GBDT, XGBoost, and LightGBM.
  20. The computer-readable storage medium according to claim 19, wherein, when the model training algorithm is the random forest algorithm and the processor executes the step of selecting n-1 of the n training sets to build tree models based on the model training algorithm, obtaining n tree models, and computing the feature importance set corresponding to each tree model, the following steps are further executed:
    computing a first classification error count of a tree model in the random forest on its out-of-bag data;
    randomly perturbing the values of a feature in the out-of-bag data of the tree model and computing a second classification error count;
    computing the feature importance of each feature from the first classification error count and the second classification error count.
PCT/CN2020/099553 2020-05-26 2020-06-30 Feature selection method, apparatus, device, and storage medium WO2021139115A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010453796.1 2020-05-26
CN202010453796.1A CN111738297A (zh) 2020-05-26 2020-05-26 Feature selection method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021139115A1 (zh)

Family

ID=72647700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099553 WO2021139115A1 (zh) 2020-05-26 2020-06-30 Feature selection method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111738297A (zh)
WO (1) WO2021139115A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283484A (zh) * 2021-05-14 2021-08-20 中国邮政储蓄银行股份有限公司 Improved feature selection method, apparatus, and storage medium
CN113554527A (zh) * 2021-07-28 2021-10-26 广东电网有限责任公司 Electricity fee data processing method, apparatus, terminal device, and storage medium
CN114511037A (zh) * 2022-02-18 2022-05-17 创新奇智(青岛)科技有限公司 Automated feature screening method, apparatus, electronic device, and storage medium
CN115271067B (zh) * 2022-08-25 2024-02-23 天津大学 Android adversarial-example attack method based on feature relationship evaluation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341801A1 (en) * 2016-01-18 2018-11-29 Alibaba Group Holding Limited Feature data processing method and device
CN108021984A * 2016-11-01 2018-05-11 第四范式(北京)技术有限公司 Method and system for determining feature importance of machine learning samples
CN108960436A * 2018-07-09 2018-12-07 上海应用技术大学 Feature selection method
CN109460825A * 2018-10-24 2019-03-12 阿里巴巴集团控股有限公司 Feature selection method, apparatus, and device for building machine learning models
CN109543747A * 2018-11-20 2019-03-29 厦门大学 Data feature selection method and apparatus based on hierarchical random forest
CN110135494A * 2019-05-10 2019-08-16 南京工业大学 Feature selection method based on maximal information coefficient and Gini index

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657499B (zh) * 2021-08-17 2023-08-11 中国平安财产保险股份有限公司 Feature-selection-based equity allocation method, apparatus, electronic device, and medium
CN113657499A (zh) * 2021-08-17 2021-11-16 中国平安财产保险股份有限公司 Feature-selection-based equity allocation method, apparatus, electronic device, and medium
CN113688923A (zh) * 2021-08-31 2021-11-23 中国平安财产保险股份有限公司 Intelligent order anomaly detection method, apparatus, electronic device, and storage medium
CN113688923B (zh) * 2021-08-31 2024-04-05 中国平安财产保险股份有限公司 Intelligent order anomaly detection method, apparatus, electronic device, and storage medium
CN113988152A (zh) * 2021-09-23 2022-01-28 北京达佳互联信息技术有限公司 User type prediction model training method, resource allocation method, medium, and apparatus
CN113933334B (zh) * 2021-10-13 2024-03-26 北京工商大学 Acacia honey authenticity identification method based on feature selection and machine learning algorithms
CN113933334A (zh) * 2021-10-13 2022-01-14 北京工商大学 Acacia honey authenticity identification method based on feature selection and machine learning algorithms
CN115423600B (zh) * 2022-08-22 2023-08-04 前海飞算云创数据科技(深圳)有限公司 Data screening method, apparatus, medium, and electronic device
CN115423600A (zh) * 2022-08-22 2022-12-02 前海飞算云创数据科技(深圳)有限公司 Data screening method, apparatus, medium, and electronic device
CN116883414A (zh) * 2023-09-08 2023-10-13 国网上海市电力公司 Multi-system data selection method and system for power transmission line operation and maintenance
CN116883414B (zh) * 2023-09-08 2024-01-26 国网上海市电力公司 Multi-system data selection method and system for power transmission line operation and maintenance
CN116912919A (zh) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Training method and apparatus for an image recognition model
CN116912919B (zh) * 2023-09-12 2024-03-15 深圳须弥云图空间科技有限公司 Training method and apparatus for an image recognition model
CN117112857A (zh) * 2023-10-23 2023-11-24 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Machining path recommendation method for intelligent industrial manufacturing
CN117112857B (zh) * 2023-10-23 2024-01-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Machining path recommendation method for intelligent industrial manufacturing

Also Published As

Publication number Publication date
CN111738297A (zh) 2020-10-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20912894; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20912894; Country of ref document: EP; Kind code of ref document: A1)