CN114881158A - Missing value filling method and device based on random forest, and computer equipment - Google Patents

Missing value filling method and device based on random forest, and computer equipment

Info

Publication number
CN114881158A
CN114881158A
Authority
CN
China
Prior art keywords
missing, nested, sample, current, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210536238.0A
Other languages
Chinese (zh)
Inventor
王可
蔡志平
罗孟华
周桐庆
李雄略
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210536238.0A priority Critical patent/CN114881158A/en
Publication of CN114881158A publication Critical patent/CN114881158A/en
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In this random-forest-based missing value filling method, device, and computer equipment, the missing sample set of a target sample set is first partitioned by feature-value missing type into several nested missing sample groups. Each group contains missing samples ordered by the number of missing feature values, where each sample's set of missing feature types contains that of the previous sample. A nested random forest is then constructed from each nested missing sample group, with each nested layer corresponding to one missing sample. Finally, the random forest at the current nested layer is trained on the non-missing sample set corresponding to that layer, and the missing feature values of the corresponding missing sample are filled in. The invention builds fine-grained, step-by-step nested random forests that take the overall distribution of the known data into account while fully exploiting the individual and local characteristics of the data, achieving refined, progressive missing value filling.

Description

Missing value filling method and device based on random forest, and computer equipment
Technical Field
The application relates to the technical field of data processing, and in particular to a random-forest-based missing value filling method, device, and computer equipment.
Background
With the development of data science, missing data has become a basic research problem; it is an unavoidable phenomenon in both industrial production and scientific research. Because of technical limitations and the cost of data acquisition, some data loss is inevitable, and even when complete data are collected, values can still be lost during storage and application through technical faults, design flaws, and operator error. In the field of data science, the missing data problem is therefore both common and important. In recent years, driven by advances in data science and artificial intelligence, data modeling and analysis methods dominated by machine learning have developed rapidly. As part of data preprocessing, research on missing data touches many application fields, such as medicine, financial systems, biological science, climate research, and social studies.
In many studies, missing data are handled simply by deleting the affected records. When data are relatively plentiful this is easy to do, but from a data-science perspective the deleted records may contain key information, and filling them effectively would clearly benefit the subsequent analysis. Traditional approaches to filling missing data rely mainly on the existing data to model the overall data distribution and then use some statistical index as the filling rule; existing machine learning algorithms for missing data likewise focus on modeling the data distribution. For example, the existing random forest data filling method is a coarse-grained method that, in order to avoid overfitting, does not make full use of each existing value.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a fine-grained random-forest-based missing value filling method, apparatus, and computer device that fully consider the influence of each existing value on the missing values.
A random-forest-based missing value filling method, the method comprising:
acquiring a target sample set; the target sample set comprises a non-missing sample set whose samples have no missing feature values and a missing sample set whose samples have missing feature values;
dividing the missing sample set according to feature-value missing type to obtain a plurality of nested missing sample groups; each nested missing sample group comprises a plurality of missing samples arranged by the number of missing feature values, and the feature-value missing type of each missing sample in a group contains the feature-value missing type of the previous missing sample;
constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
training the random forest at the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, and then filling the missing feature values of the missing sample corresponding to the current nested layer; the non-missing sample set corresponding to the current nested layer consists of the sample filled at the previous nested layer of the nested random forest together with the non-missing sample set corresponding to that previous layer.
In one embodiment, the constructing a nested random forest according to the nested missing sample groups includes:
when two or more missing samples in a nested missing sample group have the same number and type of missing feature values, constructing two or more corresponding mutually independent random forests at the corresponding nested layer.
In one embodiment, after training a current nested layer of the nested random forest by using a non-missing sample set corresponding to the current nested layer as a training set, filling missing characteristic values in a missing sample group corresponding to the current nested layer includes:
when only a single sample exists at the current nested layer, training the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, and then filling the missing feature values of that single sample.
In one embodiment, after training the current nested layer of the nested random forest by using a non-missing sample set corresponding to the current nested layer as a training set, filling missing characteristic values in a missing sample group corresponding to the current nested layer, further includes:
when two or more samples exist at the current nested layer, using the non-missing sample set corresponding to the current nested layer as a training set to separately train the two or more mutually independent random forests of the current nested layer, and separately filling the missing feature values of the two or more corresponding samples.
In one embodiment, after training the current nested level of the nested random forest by using a non-missing sample set corresponding to the current nested level as a training set, and after filling missing feature values in the missing samples corresponding to the current nested level, the method further includes:
taking the sample obtained by filling the feature values of the sample corresponding to the current nested layer of each nested random forest, together with the non-missing sample set corresponding to the current nested layer, as the non-missing sample set corresponding to the next nested layer of each nested random forest.
In one embodiment, the training of the current nested level of the nested random forest by using the non-missing sample set corresponding to the current nested level as a training set includes:
resampling the non-missing sample set corresponding to the current nested layer to obtain a plurality of non-missing sample subsets;
and randomly selecting a plurality of characteristics in each non-missing sample subset, selecting the optimal decision characteristics from the characteristics to construct each decision tree of the random forest of the current nested layer, and respectively training each decision tree of each random forest in the current nested layer by adopting each non-missing sample subset.
A random-forest-based missing value filling apparatus, the apparatus comprising:
an acquisition module for acquiring a target sample set; the target sample set comprises a non-missing sample set whose samples have no missing feature values and a missing sample set whose samples have missing feature values;
a dividing module for dividing the missing sample set according to feature-value missing type to obtain a plurality of nested missing sample groups; each nested missing sample group comprises a plurality of missing samples arranged by the number of missing feature values, and the feature-value missing type of each missing sample in a group contains the feature-value missing type of the previous missing sample;
the construction module is used for constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
a filling module for training the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, and then filling the missing feature values of the missing sample corresponding to the current nested layer; the non-missing sample set corresponding to the current nested layer consists of the sample filled at the previous nested layer of the nested random forest together with the non-missing sample set corresponding to that previous layer.
In one embodiment, the build module is further configured to:
when two or more missing samples in a nested missing sample group have the same number and type of missing feature values, construct two or more corresponding mutually independent random forests at the corresponding nested layer.
In one embodiment, the filling module is further configured to:
when only a single sample exists at the current nested layer, train the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, and then fill the missing feature values of that single sample;
and when two or more samples exist at the current nested layer, use the non-missing sample set corresponding to the current nested layer as a training set to separately train the two or more mutually independent random forests of the current nested layer, and separately fill the missing feature values of the two or more corresponding samples.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a target sample set; the target sample set comprises a non-missing sample set whose samples have no missing feature values and a missing sample set whose samples have missing feature values;
dividing the missing sample set according to feature-value missing type to obtain a plurality of nested missing sample groups; each nested missing sample group comprises a plurality of missing samples arranged by the number of missing feature values, and the feature-value missing type of each missing sample in a group contains the feature-value missing type of the previous missing sample;
constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
training the random forest at the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, and then filling the missing feature values of the missing sample corresponding to the current nested layer; the non-missing sample set corresponding to the current nested layer consists of the sample filled at the previous nested layer of the nested random forest together with the non-missing sample set corresponding to that previous layer.
According to the random-forest-based missing value filling method, device, and computer equipment, the missing sample set of the target sample set is partitioned by feature-value missing type into several nested missing sample groups, each containing missing samples ordered by the number of missing feature values, where each sample's missing-type set contains that of the previous sample; a nested random forest is then constructed from each nested missing sample group, with each nested layer corresponding to one missing sample; finally, the random forest at the current nested layer is trained on the non-missing sample set corresponding to that layer, and the missing feature values of the corresponding missing sample are filled in. By repeatedly partitioning the missing sample set according to feature-value missing type, constructing a nested random forest for each resulting group, and filling missing values step by step, the method takes the overall distribution of the known data into account while fully exploiting the individual and local characteristics of the data, achieving refined missing value filling.
Drawings
FIG. 1 is a schematic flow chart of a random forest based missing value filling method in an embodiment;
FIG. 2 is a schematic diagram of a ladder structure of nested missing sample sets in one embodiment;
FIG. 3 is a diagram illustrating exemplary relationships between missing data in one embodiment;
FIG. 4 is a diagram of an example database nested random forest in one embodiment;
FIG. 5 is a block diagram of a random-forest-based missing value filling apparatus according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a random forest-based missing value filling method, including the following steps:
step 102, a target sample set is obtained.
The target sample set is a sample set that suffers from missing data and needs to be filled. It may be, for example, a radar data sample set in which certain characteristic parameters are missing from the acquired data because of interference or measurement errors in the detection equipment; the present method can be applied in such cases. The target sample set may also be a social-network user sample set, where one sample comprises several attributes of one user and the corresponding attribute values. The target sample set may be presented as a table or a matrix.
The target sample set includes a non-missing sample set, whose samples have no missing feature values, and a missing sample set, whose samples have missing feature values. A sample may include feature attributes of multiple dimensions: if a sample has a value for every feature attribute, it belongs to the non-missing sample set; correspondingly, if it lacks a value for even one feature attribute, it belongs to the missing sample set.
Step 104: divide the missing sample set according to feature-value missing type to obtain a plurality of nested missing sample groups.
Each nested missing sample group comprises a plurality of missing samples arranged by the number of missing feature values, and the feature-value missing type of each missing sample in a group contains the feature-value missing type of the previous missing sample.
A nested missing sample group can be viewed as a ladder structure in which different steps represent different numbers of missing feature values: samples on higher steps have fewer missing values. The height difference between steps need not be uniform; that is, a step may hold two or more missing samples whose feature-value missing types are exactly the same. For example, in Table 1 the 7th and 8th data items each have 4 missing feature values, and their missing types are identical.
TABLE 1 examples of deficiency of characteristic values
Data 1: 1 missing feature value
Data 2: 1 missing feature value
Data 3: 2 missing feature values
Data 4: 2 missing feature values
Data 5: 2 missing feature values
Data 6: 3 missing feature values
Data 7: 4 missing feature values
Data 8: 4 missing feature values
(In the original table, × marks each missing entry; the column positions of the × marks are not reproduced here.)
When samples have the same number of missing feature values but not exactly the same missing types, they are divided into different nested sample groups. For example, Data 1 and Data 2 in Table 1 each have 1 missing feature value but of different types, and Data 3, Data 4, and Data 5 each have 2 missing feature values whose types overlap but do not coincide. Fig. 2 shows the ladder structure of one nested missing sample group obtained after division, {Data 1, Data 3, Data 6, Data 7, Data 8}: the feature-value missing type of each missing sample covers that of the sample on the previous step. Similarly, the nested missing sample group {Data 2, Data 4, Data 6, Data 7, Data 8} can be obtained. Missing samples may be shared between groups; for example, Data 6, Data 7, and Data 8 appear in both of these groups.
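The grouping rule above (each sample's missing-feature set must contain the previous sample's) can be sketched as a greedy pass over the missingness mask. This is an illustrative reconstruction, not the patent's prescribed algorithm; the function name and the subset test are assumptions.

```python
import numpy as np

def nested_missing_groups(mask):
    """Group rows with missing values into nested chains.

    mask: (n_samples, n_features) boolean array, True where a value is missing.
    Returns chains of row indices ordered by ascending missing count, where
    each row's missing-feature set contains the previous row's. A row is
    appended to every chain it extends (chains may overlap, as in the patent's
    example). Rows with identical patterns land in one chain in this sketch,
    whereas the patent gives them independent forests on one shared layer.
    """
    rows = [i for i in range(mask.shape[0]) if mask[i].any()]
    rows.sort(key=lambda i: int(mask[i].sum()))      # fewest missing first
    chains = []
    for i in rows:
        extended = False
        for chain in chains:
            last = chain[-1]
            # is missing(last) a subset of missing(i)?
            if np.all(~mask[last] | mask[i]):
                chain.append(i)
                extended = True
        if not extended:
            chains.append([i])
    return chains
```

Run on a mask shaped like Table 1 (Data 1..8 as rows 0..7), this yields the two overlapping chains named in the text plus a third starting from Data 5.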
Step 106: construct a nested random forest from each nested missing sample group.
Each nested missing sample group has a corresponding nested random forest; each nested layer of the nested random forest corresponds to one missing sample, and samples at the same nested layer of the same nested random forest have the same number and type of missing feature values.
Step 108: train the current nested layer of the nested random forest using the non-missing sample set corresponding to the current nested layer as a training set, then fill the missing feature values of the missing sample corresponding to the current nested layer.
The non-missing sample set used for training at the current nested layer consists of the sample filled at the previous nested layer together with the non-missing sample set corresponding to that previous layer. That is, once a nested layer has filled in the missing values, the corresponding missing sample no longer has any missing value; the filled sample and the non-missing sample set used in training that layer together form the training sample set for the random forest at the next nested layer, and so on. It follows that the deeper the nested layer, the more complete and abundant the training samples, and the more effective the training of the random forest at that layer.
According to this random-forest-based missing value filling method, the missing sample set of the target sample set is partitioned by feature-value missing type into several nested missing sample groups, each containing missing samples ordered by the number of missing feature values, where each sample's missing-type set contains that of the previous sample; a nested random forest is then constructed from each group, with each nested layer corresponding to one missing sample; finally, the random forest at the current nested layer is trained on the non-missing sample set corresponding to that layer, and the missing feature values of the corresponding missing sample are filled in. By repeatedly partitioning the missing sample set according to feature-value missing type, constructing a nested random forest for each resulting group, and filling missing values step by step, the method takes the overall distribution of the known data into account while fully exploiting the individual and local characteristics of the data, achieving refined missing value filling.
A common data filling method estimates the true overall distribution from the distribution of the known data and then fills missing entries with some overall index or parameter, without considering the individual characteristics of each data item. For example, the existing random forest data filling method estimates missing data reasonably well, but the algorithm aims to avoid overfitting rather than to make full use of each existing value. In the present invention, such an approach is regarded as coarse-grained filling.
The personalized data analysis of the present invention takes the real data as the only standard: it does not consider the intended application of the data, makes no classification or prediction hypotheses about the data, and, most importantly, never discards the influence of any known value on an individual data item. It restores the individual characteristics of the data as far as possible and treats overall distribution indices as an important reference rather than as the deciding basis.
In one embodiment, constructing a nested random forest from nested missing sample groups comprises:
when two or more missing samples in a nested missing sample group have the same number and type of missing feature values, two or more corresponding mutually independent random forests are constructed at the corresponding nested layer.
Mutual independence among the random forests means that the training and missing-value filling of the random forests for different samples in a missing sample group do not affect one another. In practice, to save time, the random forests for different samples at the same nested layer may be trained simultaneously, but the training processes remain independent. Likewise, the filling processes do not affect one another; that is, samples at the same nested layer are not used as reference values when another sample's missing values are filled. For example, in Table 1, Data 7 and Data 8 lie in the same nested missing sample group, but their training and filling processes are mutually independent, and neither is treated as existing data when the other's missing values are filled.
In one embodiment, after training a current nested layer of a nested random forest by using a non-missing sample set corresponding to the current nested layer as a training set, filling missing characteristic values in a missing sample group corresponding to the current nested layer includes:
When only a single sample exists at the current nested layer, the current nested layer of the nested random forest is trained using the non-missing sample set corresponding to the current nested layer as a training set, after which the missing feature values of that single sample are filled.
When two or more samples exist at the current nested layer, the non-missing sample set corresponding to the current nested layer is used as a training set to separately train the two or more mutually independent random forests of the current nested layer, and the missing feature values of the two or more corresponding samples are filled separately.
The sample obtained by filling the feature values of the sample corresponding to the current nested layer of each nested random forest, together with the non-missing sample set corresponding to the current nested layer, is taken as the non-missing sample set for the next nested layer of each nested random forest.
That is to say, the computations of the several nested random forests can proceed in parallel, and the training set (the non-missing sample set) used at corresponding nested layers of the different nested random forests is the same and is updated as missing values are filled. For example, for the two nested missing sample groups obtained from Table 1, {Data 1, Data 3, Data 6, Data 7, Data 8} and {Data 2, Data 4, Data 6, Data 7, Data 8}, the first nested layers of the two corresponding nested random forests fill the missing feature values of Data 1 and Data 2 respectively; the complete Data 1' and Data 2' obtained after filling then join the training set used by the first layer to form the training set for the random forest at the next nested layer of each nested random forest.
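The layer-by-layer cascade for one chain can be sketched as follows, using scikit-learn's `RandomForestRegressor` as the per-layer forest. The patent does not name a library or prescribe an implementation; the function name, parameters, and the multi-output regression step are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fill_chain(X, mask, chain, complete_rows, seed=0):
    """Fill one nested chain layer by layer (illustrative sketch).

    X: (n, d) float array with np.nan at missing entries; mask: its boolean
    missing mask; chain: row indices ordered by ascending missing count;
    complete_rows: rows with no missing values. The training pool starts as
    the complete samples and grows with each sample filled at the previous
    layer, matching the cascade described above.
    """
    X = X.copy()
    pool = list(complete_rows)
    for i in chain:
        miss = np.where(mask[i])[0]
        obs = np.where(~mask[i])[0]
        train = X[pool]
        # one (multi-output) forest per layer: observed features -> missing ones
        rf = RandomForestRegressor(n_estimators=50, random_state=seed)
        rf.fit(train[:, obs], train[:, miss])
        X[i, miss] = rf.predict(X[i, obs].reshape(1, -1))[0]
        pool.append(i)  # the filled sample joins the next layer's training set
    return X
```

Because a forest's prediction is an average of training targets, each filled value stays within the range of the observed values for that feature, consistent with the distribution-aware filling described above.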
In one embodiment, training a current nested level of a nested random forest by using a non-missing sample set corresponding to the current nested level as a training set, includes:
and resampling a non-missing sample set corresponding to the current nested layer to obtain a plurality of non-missing sample subsets, randomly selecting a plurality of characteristics in each non-missing sample subset, selecting the best decision characteristics from the characteristics to construct each decision tree of the random forest of the current nested layer, and respectively training each decision tree of each random forest in the current nested layer by adopting each non-missing sample subset.
The specific implementation of the present invention will be described below in conjunction with a simple sample library.
TABLE 1 example sample library
(Table rendered as an image in the original publication; the values are not reproduced here.)
For ease of understanding, a linear transformation of the example samples shown in Table 1 yields the sample library of Table 2.
TABLE 2 sample library
(Table rendered as an image in the original publication; the values are not reproduced here.)
Take Data 1, Data 3, and Data 5 as the non-missing data and consider filling Data 2. With a common filling method, the filling value of feature 4 is 15.0 and that of feature 6 is 2.8. From the perspective of personalized data estimation this is rough: it neither fully considers the characteristics of each existing data item nor makes full use of the value of each existing entry.
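The "common filling method" contrasted here is column-mean imputation: each missing entry is replaced by the mean of the observed values in its column. A tiny illustration with made-up numbers (not the values of the patent's Table 2):

```python
import numpy as np

# Coarse-grained baseline: fill every missing entry with its column mean.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 14.0],
])
col_mean = np.nanmean(X, axis=0)             # per-feature mean, ignoring NaN
filled = np.where(np.isnan(X), col_mean, X)  # broadcast means into the gaps
```

Every missing entry in a column receives the same value regardless of the rest of that row, which is exactly the loss of individual characteristics the patent argues against.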
The important basis for personalized filling is the value 7.5 of feature 3 and the value 11.9 of feature 5, given that features 1, 2, and 5 are stable across the non-missing data (Data 1, Data 3, Data 5). Analyzing the numerical relationships among features 3, 4, and 6 shows that 17.0 and 2.2 are better suited to fill features 4 and 6 of Data 2 than the 15.0 and 2.8 obtained by ordinary averaging. When the missing values of Data 2 are filled, the value 11.9 of its feature 5 may seem worthless because of the missing data, but when feature 5 of Data 4 is considered, it tips the filling value of feature 5 of Data 4 toward 11.9 rather than 12.0.
Similarly, a value near 7.x may fill feature 3 of data 7 better than one near 8.x, and 3.1 is more probable than 2.6 when feature 1 of data 6 is filled. In other words, data 2 can be regarded as one possible fill result of data 4, data 4 as one possible fill result of data 7, and all of them as special cases of data 6. As shown in fig. 2, the missing data, divided according to how many feature values they miss, stand in a special-case relationship to one another.
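The contrast drawn above between a plain column average and a personalized fill that conditions on the row's observed values can be reproduced on synthetic data. Since the values of table 2 are only available as an image, this toy example assumes a simple linear relation between two features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 10.0, size=200)            # observed feature
b = 2.0 * a + rng.normal(0.0, 0.1, size=200)    # feature to be filled tracks a

# Column-mean filling ignores the row's observed value entirely.
mean_fill = b.mean()

# A model-based fill conditions on the row's observed feature.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(a.reshape(-1, 1), b)
row_fill = model.predict(np.array([[9.0]]))[0]  # row with a = 9.0, true b near 18.0
```

For a row whose observed feature is 9.0, the conditional fill lands near the true 18.0, while the column mean stays near 10, illustrating why exploiting within-row relationships matters.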
As can be seen from table 2, data 6 is a record whose feature values are completely missing, so it is reasonable to fill it according to the overall distribution of each feature; an ordinary averaging method might yield a fill value of 3.0, and such a fill is generally accepted because it lies within the allowable error range. That gap, however, is exactly what personalized filling targets. Without considering the semantics of the data features, judging whether an error is acceptable from the numerical value alone is itself a judgment with large error. The method emphasizes that the existence of every real datum is meaningful: a true record reflects the state of the data subject at the moment it was generated, and that state is not simply determined by the distribution of the whole data set but depends more on the subject itself. The diversity of the overall data and the characteristics of each individual record should be considered together. The present invention recognizes that when data are generated they are not produced feature by feature; rather, the record as a whole expresses the state of the subject. Data recorded in any form are always incomplete, for there is always something not considered or not recorded, which is also why data modeling cannot solve problems as exactly as mathematical modeling. Therefore, in the present invention, every real value is treated as a reflection of the data subject and given due weight. This is precisely the filling concept for missing data that brings the filled data genuinely close to the original data.
A random forest can be regarded as a combination of multiple decision trees. The nested random forest proposed by the method treats one forest as a set of smaller nested random forests organized by feature-value missing type: the sub-forests participate in the training and missing-value filling of the larger forest, and the layers are nested step by step according to the actual situation of the target database, forming a nested random forest with fine-grained attribution.
According to the method, data 7 in table 2 is regarded as a special case of data 6; that is, data 6 is a special-case model nested under the data-filling model of data 7, and the filling model of data 7 can serve as a reference for the filling process of data 6. The method therefore takes the filling model of data 7 as the nested model of data 6, i.e., the whole random forest of data 7 is regarded as the nested random forest of data 6.
Thus, a nested random forest for the missing data set {data 2, data 4, data 7, data 6} in table 2 can be constructed according to how many feature values each record misses. Since the data in table 2 are relatively simple, the corresponding nested random forest is easily obtained, as shown in fig. 3.
Each existing datum is a specific data point, and each missing datum is a range determined by its own type, the ranges being strictly nested. The whole model can thus make full use of every objective, valid value to help complete the data. In principle, the random forest of data 2 is trained first, the model of data 2 is then integrated into the voting process of the data 4 model, and so on. In practice, because the tree structures of the forest model are loosely coupled, part of the training can run in parallel, which raises the actual degree of parallelism of the computation and saves a large amount of training time.
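The layer-by-layer train-then-fill loop described above can be sketched as follows. This is a simplified serial version built on scikit-learn forests, not the patent's exact algorithm; it assumes numeric features, at least one fully observed row, and the hypothetical helper name nested_fill:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def nested_fill(X):
    """Fill NaNs layer by layer: rows with fewer missing features first,
    each filled row joining the training pool for later rows."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    complete = ~miss.any(axis=1)          # rows currently usable as training data
    order = np.argsort(miss.sum(axis=1))  # fewest-missing rows first
    for i in order:
        cols = np.where(miss[i])[0]
        if cols.size == 0:
            continue
        obs = np.where(~miss[i])[0]
        train = X[complete]
        if obs.size == 0:
            # fully missing row: fall back to the pool's column means
            X[i, cols] = train[:, cols].mean(axis=0)
        else:
            for c in cols:
                model = RandomForestRegressor(n_estimators=30, random_state=0)
                model.fit(train[:, obs], train[:, c])
                X[i, c] = model.predict(X[i, obs].reshape(1, -1))[0]
        complete[i] = True                # the filled row joins the pool
    return X
```

Because each layer's forests depend only on the pool accumulated so far, forests within one layer could also be trained in parallel, as the text notes.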
The nested random forest classifies the missing data into different types and makes full use of every non-missing value according to the specific situation. While the known overall distribution of the data is taken into account, the individual and local features of each record are fully exploited, achieving refined missing-value filling and a better result than existing methods.
In one embodiment, for an m × n data matrix, linear translation yields an m′ × n′ matrix containing missing values, where n′ ≤ n and m′ ≤ m. (1) When m′ = m, every record has missing features, no complete record is available as a reference, and the model's error is large. (2) In theory, if records with the same missing features are grouped into one filling type, there are at most 2^n′ types, and an NP problem appears as n′ → ∞. In an actual dataset, however, reaching this incomputability also requires enough missing records, i.e., m′ > 2^n′. Both conditions must hold simultaneously before the computational problem arises. Although this is a low-probability event, the solution should still be efficient and feasible: even in this special case, the parallel feasibility of the sub-forests discussed above allows acceleration by expanding computational resources. In fact, the missing-feature patterns of real data usually overlap, so the actual number of missing-data types k ≪ 2^n′. In practical application, if the time complexity of a random forest model is O(mn log m), the time complexity of the nested random forest without parallel computation is k · O(mn log m), where k is a constant.
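The claim that the number k of missingness types actually present stays far below the 2^n′ worst case can be checked directly by counting distinct patterns; missing_pattern_count is an illustrative helper name:

```python
import numpy as np

def missing_pattern_count(X):
    # k = number of distinct missingness patterns among rows that miss something;
    # the worst case would be 2 ** X.shape[1].
    miss = np.isnan(X)
    rows = miss[miss.any(axis=1)]
    return len({tuple(r) for r in rows})
```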
To verify the applicability of the method, public machine learning data sets from the University of California, Irvine (UCI) repository were selected, and tests were run on real data sets from different industries, including medical, financial and transportation. The Haberman's Survival (HM) dataset contains 4-dimensional data for 306 lung cancer patients. The Spambase (SB) dataset contains 268079 records of 57-dimensional spam data. The method readjusts the missing-data proportion of each data set's training data for better comparison. Missing-value ratios of 10%, 30%, 50%, 70% and 90% were set, with the row and column positions of the missing values chosen at random each time until the required total number of missing values was reached.
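The masking protocol described above (randomly chosen cell positions until the target ratio is met) might be implemented like this; mask_at_random is a hypothetical name:

```python
import numpy as np

def mask_at_random(X, ratio, seed=0):
    """Blank out a given fraction of cells at uniformly random positions."""
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    n_missing = int(round(ratio * Xm.size))
    # draw flat indices without replacement, then map back to (row, col)
    flat = rng.choice(Xm.size, size=n_missing, replace=False)
    Xm[np.unravel_index(flat, Xm.shape)] = np.nan
    return Xm
```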
To verify the stability of the method, Mean Absolute Error (MAE) and Mean Squared Error (MSE) are used together to judge the filling error, comparing the data matrix filled by each algorithm against the true complete data, as shown in formulas (1) and (2).
MAE = (1/N) * sum_i |filling-value_i - true-value_i|        (1)
MSE = (1/N) * sum_i (filling-value_i - true-value_i)^2      (2)
Here filling-value is the filled value, true-value is the true value, and the sum runs over all N filled entries. Because the evaluation focuses on comparing filling errors, MAE emphasizes the average filling effect while MSE emphasizes individual outliers: if two methods have the same MAE but one has a larger MSE, that method is less stable.
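A direct reading of formulas (1) and (2), restricted to the cells that were masked and then filled, is:

```python
import numpy as np

def fill_errors(filled, truth, miss_mask):
    """MAE and MSE of formulas (1) and (2), computed over the filled cells only."""
    diff = filled[miss_mask] - truth[miss_mask]
    return np.mean(np.abs(diff)), np.mean(diff ** 2)
```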
The method is compared against 5 existing missing-value filling methods taken as baselines. The baseline algorithms and the present method are as follows:
(1) Common statistical algorithms: Mean filling (Mean), Random filling (Random), K-nearest-neighbor filling (KNN), and Expectation Maximization (EM);
(2) Generative adversarial network (GAN) series algorithm: GAIN;
(3) The present method: nested random forest (NRF).
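For reference, the statistical baselines map naturally onto scikit-learn's imputers (GAIN and EM have no scikit-learn counterpart and are omitted); random_fill sketches one plausible reading of "random filling", drawing each missing entry from the column's observed values:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

def mean_fill(X):
    return SimpleImputer(strategy="mean").fit_transform(X)

def knn_fill(X, k=3):
    return KNNImputer(n_neighbors=k).fit_transform(X)

def random_fill(X, seed=0):
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]                 # view: edits write back into X
        nan = np.isnan(col)
        if nan.any():
            col[nan] = rng.choice(col[~nan], size=int(nan.sum()))
    return X
```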
The average error values over 500 random-extraction experiments are presented in the tables below: tables 3 and 4 give the MAE and MSE error values for the HM data set, and tables 5 and 6 give the MAE and MSE error values for the SB data set. Missing-Num in the tables denotes the number of missing feature values.
TABLE 3 HM data set MAE error value
[Table 3 appears only as an image in the source publication; its values are not reproduced here.]
TABLE 4 HM data set MSE error values
[Table 4 appears only as an image in the source publication; its values are not reproduced here.]
TABLE 5 SB dataset MAE error values
[Table 5 appears only as an image in the source publication; its values are not reproduced here.]
TABLE 6 SB dataset MSE error values
[Table 6 appears only as an image in the source publication; its values are not reproduced here.]
As can be seen from tables 5 and 6, the amount of data in the SB dataset is significantly larger, and the GAIN and NRF methods show clear advantages, with the NRF method of the present invention performing best. Table 5 shows that GAIN and NRF each have strengths, while table 6 shows that NRF is more stable than GAIN, with smaller error fluctuation.
The error values from the above experiments were averaged and ranked after aggregation, as shown in table 7. The NRF method shows a clear advantage in the combined ranking.
TABLE 7 data aggregation error mean ranking
[Table 7 appears only as an image in the source publication; its values are not reproduced here.]
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a random forest-based missing value padding apparatus, including: the device comprises an acquisition module, a division module, a construction module and a filling module, wherein:
and the acquisition module is used for acquiring the target sample set.
The target sample set includes a non-missing sample set that does not miss a feature value and a missing sample set that misses a feature value.
And the dividing module is used for dividing the missing sample set according to the feature-value missing type to obtain a plurality of nested missing sample groups.
The nested missing sample group comprises a plurality of missing samples arranged according to their number of missing feature values, and the feature-value missing type of each missing sample in a nested missing sample group includes the feature-value missing type of the previous missing sample;
and the dividing module is used for constructing the nested random forest according to the nested missing sample group.
Each nested level of the nested random forest corresponds to a missing sample.
And the filling module is used for filling the missing feature values of the missing samples corresponding to the current nested layer after the current nested layer of the nested random forest has been trained with the non-missing sample set corresponding to that layer as the training set.
And the non-missing sample set corresponding to the current nested layer consists of the samples obtained by filling at the previous nested layer of each nested random forest, together with the non-missing sample set corresponding to the previous nested layer.
In one embodiment, the building module is further configured to build, when there are more than 2 missing samples with the same number and type of missing eigenvalues in the nested missing sample group, more than 2 corresponding mutually independent random forests at the corresponding nested level.
In one embodiment, the filling module is further configured to, when only a single sample exists in the current nested layer, fill the missing feature values of that single sample after training the current nested layer of the nested random forest with the non-missing sample set corresponding to the current nested layer as the training set.
In one embodiment, the fill module is further to:
when only a single sample exists in the current nested layer, a non-missing sample set corresponding to the current nested layer is adopted as a training set to train the current nested layer of the nested forest, and then the missing characteristic value of the single sample corresponding to the current nested layer is filled;
when more than 2 samples exist in the current nested layer, the non-missing sample set corresponding to the current nested layer is adopted as a training set to train more than 2 mutually independent random forests of the current nested layer respectively, and missing characteristic values are filled in more than 2 samples corresponding to the current nested layer respectively.
For specific definition of the missing value padding apparatus based on random forest, reference may be made to the above definition of the missing value padding method based on random forest, and details are not described here. The modules in the random forest based missing value filling device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing target data to be filled. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a random forest based miss value population method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of part of the structure associated with the disclosed solution and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:
acquiring a target sample set; the target sample set comprises a non-missing sample set with no missing feature values and a missing sample set with missing feature values;
dividing the missing sample set according to the characteristic value missing type to obtain a plurality of nested missing sample groups; the nested missing sample group comprises a plurality of missing samples which are arranged according to the missing number of the characteristic values; the characteristic value missing type of the missing sample in each nested missing sample group comprises the characteristic value missing type of the last missing sample;
constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
after a non-missing sample set corresponding to the current nested layer is adopted as a training set to train the current nested layer of the nested random forest, filling missing characteristic values in the missing samples corresponding to the current nested layer; and the non-missing sample set corresponding to the current nested layer is a sample obtained by filling the last nested layer of each nested random forest and a non-missing sample set corresponding to the last nested layer.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a target sample set; the target sample set comprises a non-missing sample set with no missing feature values and a missing sample set with missing feature values;
dividing the missing sample set according to the characteristic value missing type to obtain a plurality of nested missing sample groups; the nested missing sample group comprises a plurality of missing samples which are arranged according to the missing number of the characteristic values; the characteristic value missing type of the missing sample in each nested missing sample group comprises the characteristic value missing type of the last missing sample;
constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
after a non-missing sample set corresponding to a current nested layer is adopted as a training set to train the current nested layer of the nested random forest, filling missing characteristic values in the missing samples corresponding to the current nested layer; and the non-missing sample set corresponding to the current nested layer is a sample obtained by filling the last nested layer of each nested random forest and a non-missing sample set corresponding to the last nested layer.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A random forest based missing value filling method is characterized by comprising the following steps:
acquiring a target sample set; the target sample set comprises a non-missing sample set with no missing feature values and a missing sample set with missing feature values;
dividing the missing sample set according to the missing type of the characteristic value to obtain a plurality of nested missing sample groups; the nested missing sample group comprises a plurality of missing samples which are arranged according to the missing number of the characteristic values; the characteristic value missing type of the missing sample in each nested missing sample group comprises the characteristic value missing type of the last missing sample;
constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
after a non-missing sample set corresponding to a current nested layer is adopted as a training set to train the current nested layer of the nested random forest, filling missing characteristic values in the missing samples corresponding to the current nested layer; and the non-missing sample set corresponding to the current nested layer is a sample obtained by filling the last nested layer of each nested random forest and a non-missing sample set corresponding to the last nested layer.
2. The method of claim 1, wherein constructing a nested random forest from the set of nested missing samples comprises:
and when more than 2 missing samples with the same characteristic value missing quantity and type exist in the nested missing sample group, constructing more than 2 corresponding mutually independent random forests on the corresponding nested layer.
3. The method of claim 1, wherein after training a current nested level of the nested random forest by using a non-missing sample set corresponding to the current nested level as a training set, filling missing feature values in a missing sample group corresponding to the current nested level comprises:
and when only a single sample exists in the current nested layer, adopting a non-missing sample set corresponding to the current nested layer as a training set to train the current nested layer of the nested forest, and filling missing characteristic values in the single sample corresponding to the current nested layer.
4. The method of claim 1, wherein after training a current nested level of the nested random forest using a non-missing sample set corresponding to the current nested level as a training set, filling missing feature values in a missing sample group corresponding to the current nested level, further comprising:
when more than 2 samples exist in the current nested layer, respectively training more than 2 mutually independent random forests of the current nested layer by adopting a non-missing sample set corresponding to the current nested layer as a training set, and respectively filling missing characteristic values in more than 2 samples corresponding to the current nested layer.
5. The method of claim 1, wherein after training a current nested level of the nested random forest using a non-missing sample set corresponding to the current nested level as a training set, and after filling missing feature values in missing samples corresponding to the current nested level, the method further comprises:
and filling a characteristic value of a sample corresponding to the current nested layer of each nested random forest to obtain a sample and a non-missing sample set corresponding to the current nested layer, and taking the sample and the non-missing sample set corresponding to the next nested layer of each nested forest as the non-missing sample set corresponding to the next nested layer of each nested forest.
6. The method of claim 1, wherein training the current nested level of the nested random forest using the non-missing sample set corresponding to the current nested level as a training set comprises:
resampling the non-missing sample set corresponding to the current nested layer to obtain a plurality of non-missing sample subsets;
and randomly selecting a plurality of characteristics in each non-missing sample subset, selecting the optimal decision characteristics from the characteristics to construct each decision tree of the random forest of the current nested layer, and respectively training each decision tree of each random forest in the current nested layer by adopting each non-missing sample subset.
7. An apparatus for filling missing values based on a random forest, the apparatus comprising:
the acquisition module is used for acquiring a target sample set; the target sample set comprises a non-missing sample set with no missing feature values and a missing sample set with missing feature values;
the dividing module is used for dividing the missing sample set according to the missing type of the characteristic value to obtain a plurality of nested missing sample groups; the nested missing sample group comprises a plurality of missing samples which are arranged according to the missing number of the characteristic values; the characteristic value missing type of the missing sample in each nested missing sample group comprises the characteristic value missing type of the last missing sample;
the construction module is used for constructing a nested random forest according to the nested missing sample group; each nested layer of the nested random forest corresponds to a missing sample;
the filling module is used for filling missing characteristic values of the missing samples corresponding to the current nested layer after the current nested layer of the nested random forest is trained by adopting a non-missing sample set corresponding to the current nested layer as a training set; and the non-missing sample set corresponding to the current nested layer is a sample obtained by filling the last nested layer of each nested random forest and a non-missing sample set corresponding to the last nested layer.
8. The apparatus of claim 7, wherein the build module is further configured to:
and when more than 2 missing samples with the same characteristic value missing quantity and type exist in the nested missing sample group, constructing more than 2 corresponding mutually independent random forests on the corresponding nested layer.
9. The apparatus of claim 7, wherein the fill module is further configured to:
when only a single sample exists in the current nested layer, adopting a non-missing sample set corresponding to the current nested layer as a training set to train the current nested layer of the nested forest, and then filling missing characteristic values in the single sample corresponding to the current nested layer;
when more than 2 samples exist in the current nested layer, respectively training more than 2 mutually independent random forests of the current nested layer by adopting a non-missing sample set corresponding to the current nested layer as a training set, and respectively filling missing characteristic values in more than 2 samples corresponding to the current nested layer.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
CN202210536238.0A 2022-05-17 2022-05-17 Defect value filling method and device based on random forest and computer equipment Pending CN114881158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210536238.0A CN114881158A (en) 2022-05-17 2022-05-17 Defect value filling method and device based on random forest and computer equipment

Publications (1)

Publication Number Publication Date
CN114881158A true CN114881158A (en) 2022-08-09

Family

ID=82676023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210536238.0A Pending CN114881158A (en) 2022-05-17 2022-05-17 Defect value filling method and device based on random forest and computer equipment

Country Status (1)

Country Link
CN (1) CN114881158A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115454988A (en) * 2022-09-27 2022-12-09 哈尔滨工业大学 Satellite power supply system missing data completion method based on random forest network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination