
CN111984637B - Missing value processing method and device in data modeling, equipment and storage medium

Info

Publication number: CN111984637B
Application number: CN202010641389.3A
Authority: CN
Inventors: 王建刚
Original assignee: Suzhou Yanshu Information Technology Co ltd
Current assignee: Suzhou Yanshu Information Technology Co ltd
Priority date: 2020-07-06
Filing date: 2020-07-06
Publication date: 2023-04-18
Anticipated expiration: 2040-07-06
Also published as: CN111984637A

Abstract

The application discloses a missing value processing method in data modeling. The method comprises: acquiring a sample data set, replacing missing values in the sample data set with a preset value, and constructing a plurality of variables based on the data in the sample data set; segmenting the data in each variable to obtain a plurality of data segments, with the missing values assigned to the same data segment; and calculating the information value of each variable, selecting a first number of variables from the plurality of variables, and building a model based on the selected first number of variables. Because the missing values are not substantively changed, the authenticity and accuracy of the data are preserved. The missing values are treated as ordinary attribute values and take part in the modeling calculation together with the other attribute values, so the trend relationship between a variable's missing values and the modeling target is expressed more clearly, the classification capability of the model is improved, and the model is easier to interpret during later model evaluation.

Description

Missing value processing method and device in data modeling, equipment and storage medium

Technical Field

The present disclosure relates to the field of data modeling, and in particular, to a missing value processing method and apparatus, a device, and a storage medium in data modeling.

Background

In data modeling, some variables in the sample data have missing values, and these missing values conceal historical characteristics of the sample data. The missing data therefore needs to be processed in the data preprocessing stage, so that more features can be extracted from it, its relationship to the modeling target can be found and analyzed, and the requirements of the modeling program can be met.

In the prior art, the main methods for processing missing values are: deletion, special value filling, mean filling, nearest-neighbor completion, cluster filling, filling with all possible values, combined completion, and regression interpolation.

In the prior art, whether data is deleted or filled, the data is changed to some degree, which affects the final model. A missing value is the result of various causes. If the cause is unclear and the missing value is filled artificially based on processing experience and business understanding, the actual state of the missing value is altered. This hides the history behind the missing value and may even artificially strengthen or change the relationship between the filled value and the other, normal values, thereby distorting the variable's relationship to the modeling target.

Disclosure of Invention

In view of this, the present disclosure provides a missing value processing method in data modeling, including:

acquiring a sample data set, replacing missing values in the sample data set with preset values, and constructing a plurality of variables based on each data in the sample data set; wherein each variable comprises a plurality of data;

segmenting the data in each variable to obtain a plurality of data segments; wherein the missing values are divided into the same data segment;

calculating an information value of each variable, selecting a first number of variables from the plurality of variables, and building a model based on the selected first number of variables.

In one possible implementation, constructing a plurality of variables based on each data in the sample data set includes:

acquiring the data in the sample data set and the preset variable names of the variables;

attributing each piece of data to the corresponding variable according to the attribute of each piece of data and the name of each variable;

wherein the attribute of the data corresponds to the variable name.

In one possible implementation manner, segmenting the data in each of the variables to obtain a plurality of data segments includes:

segmenting data contained in each variable according to preset conditions to obtain a plurality of initial data segments;

and merging or reserving the initial data segments according to the similarity between any two initial data segments in the same variable to obtain the data segments.

In a possible implementation manner, merging or retaining each initial data segment according to a similarity between any two initial data segments in the same variable includes:

acquiring, for each initial data segment under the same variable, a weight value in another variable; wherein the weight value is either the mean or the mode of the data in the other variable that correspond to the data in the initial data segment;

if the difference between the weight values of two initial data segments is less than or equal to a set value, merging the two initial data segments;

and if the difference between the weight values of the two initial data segments is greater than the set value, retaining the two initial data segments.

In a possible implementation manner, obtaining a weight value corresponding to each of the initial data segments in another variable under the same variable includes:

acquiring data corresponding to each data in the initial data segments in another variable;

and calculating based on the data corresponding to the data in the initial data segment in another variable to obtain the weight value corresponding to the initial data segment in another variable.

In one possible implementation, selecting a first number of variables from the plurality of variables comprises:

sorting the variables according to the information values to obtain variable sorting results;

and selecting the first quantity of variables according to the variable sorting result.

In one possible implementation, when a first number of variables is selected from the plurality of variables and a model is built based on the first number of variables, the variables are selected using a recursive algorithm.

According to another aspect of the present disclosure, a missing value processing apparatus in data modeling is provided, which is characterized by comprising a variable construction module, a variable segmentation module and a modeling variable selection module;

the variable construction module is configured to acquire a sample data set, replace missing values in the sample data set with preset values, and construct a plurality of variables based on each data in the sample data set; wherein each variable comprises a plurality of data;

the variable segmentation module is configured to segment the data in each variable to obtain a plurality of data segments; wherein the missing values are divided into the same data segment;

the modeling variable selection module is configured to calculate information values of the variables, select a first number of variables from the variables, and build a model based on the selected first number of variables.

According to another aspect of the present disclosure, there is provided a missing value processing apparatus in data modeling, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to execute the executable instructions to implement any of the methods described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of the preceding.

In the method, a sample data set is acquired, missing values in the sample data set are replaced with a preset value, and a plurality of variables are constructed based on the data in the sample data set; the data in each variable is segmented to obtain a plurality of data segments, with the missing values assigned to the same data segment; and the information value of each variable is calculated, a first number of variables is selected from the plurality of variables, and a model is built based on the selected first number of variables. Because the missing values are not substantively changed, the authenticity and accuracy of the data are preserved. At the same time, the missing values receive special handling during modeling so that the modeling calculation proceeds smoothly: they are treated as ordinary attribute values and take part in the modeling calculation together with the other attribute values. As a result, the trend relationship between a variable's missing values and the modeling target is expressed more clearly, the classification capability of the model is improved, and the model is easier to interpret during later model evaluation.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates a flow chart of a missing value handling method in data modeling of the present disclosure;

FIG. 2 illustrates a missing value schematic of the missing value processing method in data modeling of the present disclosure;

FIG. 3 illustrates a data segmentation schematic of a missing value handling method in data modeling of the present disclosure;

FIG. 4 illustrates an information value diagram of a missing value handling method in data modeling of the present disclosure;

FIG. 5 illustrates a trend relationship diagram of a missing value processing method in data modeling of the present disclosure;

FIG. 6 illustrates a block diagram of a missing value processing apparatus in data modeling of the present disclosure;

FIG. 7 illustrates a block diagram of a missing value processing device in the data modeling of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 illustrates a flowchart of a missing value processing method in data modeling according to an embodiment of the present disclosure. As shown in fig. 1, the missing value processing method in data modeling includes:

S100: acquiring a sample data set, replacing missing values in the sample data set with preset values, and constructing a plurality of variables based on each data in the sample data set;

S200: segmenting the data in each variable to obtain a plurality of data segments, wherein the missing values are divided into the same data segment;

S300: calculating information values of each variable, selecting a first number of variables from the plurality of variables, and establishing a model based on the selected first number of variables.

Specifically, referring to fig. 1, first, step S100 is executed to obtain a sample data set, replace missing values in the sample data set with preset values, and construct a plurality of variables based on each data in the sample data set; wherein each variable comprises a plurality of data.

In one possible implementation, a sample data set is obtained first. If the data of a variable contains a missing value, the missing value is filled with the set value. For example, referring to fig. 2, if the "buyamount" variable has a missing value, that is, a blank cell in the table, "NONE" is filled in at that position. A plurality of variables is then constructed based on the data in the sample data set. Constructing the variables includes: acquiring the data in the data group and the variable names of the variables, and associating selected data from the data group with the corresponding variable names. For example, data on purchased products is stored on a hard disk. Referring to fig. 2, the variables include "deadline", "buyamount", "buytime_new", and "weekd"; these strings are the variable names of the variables. Each variable name has its corresponding data listed below it, and that data is associated with the variable name, that is, the value of each data can be assigned to the current variable, which completes the variable construction.

It should be noted that the embodiments of the present disclosure do not limit the value used to fill in missing values, as long as the required function is achieved.
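Step S100 can be illustrated with a minimal Python sketch. The row contents below are hypothetical (the actual table of fig. 2 is not reproduced here), `None` stands in for a blank cell, and the helper name `build_variables` is an assumption introduced only for illustration.

```python
# Hypothetical sample rows; None marks a missing value (a blank cell in the table).
PRESET_VALUE = "NONE"  # placeholder used to fill missing values, as in the example above

rows = [
    {"deadline": 30, "buyamount": 1000.0, "buytime_new": "2020-01-03", "weekd": 5},
    {"deadline": 90, "buyamount": None, "buytime_new": "2020-01-04", "weekd": 6},
    {"deadline": None, "buyamount": 2500.0, "buytime_new": "2020-01-06", "weekd": 1},
]
variable_names = ["deadline", "buyamount", "buytime_new", "weekd"]


def build_variables(rows, variable_names, preset=PRESET_VALUE):
    """Return {variable name: list of data}, filling missing values with `preset`."""
    variables = {name: [] for name in variable_names}
    for row in rows:
        for name in variable_names:
            value = row.get(name)  # the attribute of the data matches the variable name
            variables[name].append(preset if value is None else value)
    return variables


variables = build_variables(rows, variable_names)
print(variables["buyamount"])  # [1000.0, 'NONE', 2500.0]
```

The filled value is only a marker: it is never used in arithmetic, which is why any value that cannot collide with real data would serve equally well.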

Further, referring to fig. 1, next, step S200 is executed to segment the data in each variable to obtain a plurality of data segments, where the missing value is divided into the same data segment.

In a possible implementation manner, the data included in each variable is segmented according to a preset condition to obtain a plurality of initial data segments, and the initial data segments are then merged or retained according to the similarity between any two initial data segments in the same variable to obtain the data segments.

For example, referring to fig. 3, take the QR_BUYAMOUT_04Q variable. The missing values are placed in a data segment of their own, which is labeled "MISSING". The other segments are a data segment of 0 (zero), a data segment of 0 to 50000, a data segment of 50000 to 150000, and a data segment of more than 150000; the segments do not all have the same width. Suppose this variable represents fourth-quarter sales and the sales volumes of all commodities form the data group. The sales volumes are first divided into initial data segments at a preset interval of 50000: commodities with a sales volume of zero fall into the zero data segment, commodities with a sales volume of 0 to 50000 into the 0 to 50000 initial data segment, commodities with a sales volume of 50000 to 100000 into the 50000 to 100000 initial data segment, commodities with a sales volume of 100000 to 150000 into the 100000 to 150000 initial data segment, and commodities with a sales volume of more than 150000 into the initial data segment of more than 150000. Suppose further that the price of the commodities with zero sales is 20000 yuan or more, the price of the commodities with a sales volume of 0 to 50000 is 1500 yuan to 1600 yuan, the price of the commodities with a sales volume of 50000 to 100000 is 100 yuan to 120 yuan, the price of the commodities with a sales volume of 100000 to 150000 is 80 yuan to 95 yuan, and the price of the commodities with a sales volume of more than 150000 is 500 yuan to 700 yuan.

The initial data segments are then merged or retained according to the similarity between any two initial data segments in the same variable, which comprises: obtaining, for each initial data segment under the same variable, a weight value in another variable, where the weight value is either the mean or the mode of the data in the other variable that correspond to the data in the initial data segment; if the difference between the weight values of two initial data segments is less than or equal to a set value, merging the two initial data segments; and if the difference between the weight values of the two initial data segments is greater than the set value, retaining the two initial data segments.
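The initial segmentation at a preset interval can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the function name `initial_segment`, the label strings, and the choice of bins that are open on the left and closed on the right are all hypothetical, since the text does not say to which segment a boundary value such as 50000 belongs.

```python
def initial_segment(value, width=50000, cap=150000, preset="NONE"):
    """Map one sales-volume value to the label of its initial data segment."""
    if value == preset:
        return "MISSING"  # all missing values share one segment
    if value == 0:
        return "0"  # zero sales form a segment of their own
    if value > cap:
        return f">{cap}"
    # Assumed boundary convention: bins of the form (lo, hi].
    hi = -(-value // width) * width  # round up to the next multiple of width
    lo = hi - width
    return f"({int(lo)}, {int(hi)}]"


sales = ["NONE", 0, 100, 1300, 75000, 120000, 200000]
print([initial_segment(v) for v in sales])
```

Negative values are not handled, on the assumption that a sales volume is never negative.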

Wherein, obtaining the weight value corresponding to each initial data segment in another variable under the same variable comprises: and acquiring data corresponding to each data in each initial data segment in another variable, and calculating based on the data corresponding to each data in the initial data segment in another variable to obtain a weight value corresponding to the initial data segment in another variable. Here, it should be noted that the weight refers to a reference value when each initial data segment obtained in the process of segmenting data in each variable is merged or retained.

In a possible implementation manner, the weight may be represented by various statistics such as a mean, a mode, and a variance.

For example, under the QR_BUYAMOUT_04Q variable, the following initial data segments are obtained: a "MISSING" data segment, a data segment of 0 (zero), a data segment from 0 to 50000, a data segment from 50000 to 100000, a data segment from 100000 to 150000, and a data segment greater than 150000.

Then the initial data segments other than the "MISSING" segment, namely the data segment of 0 (zero), the data segment from 0 to 50000, the data segment from 50000 to 100000, the data segment from 100000 to 150000, and the data segment greater than 150000, are segmented again. For this second segmentation, the weight of each initial data segment under the amount variable is calculated.

For instance, when the weight value is represented by the mean, the data under the amount variable that correspond to each data in the initial data segment are determined first. The initial data segment of 0 (zero) includes one data, 0, and the data corresponding to 0 under the amount variable is 20000. The weight of the initial data segment of 0 (zero) under the amount variable, obtained by the mean calculation, is therefore 20000. The initial data segment from 0 to 50000 includes four data, 100, 1300, 20000 and 40000, whose corresponding data under the amount variable are 1500, 1520, 1580 and 1600, so the weight of the initial data segment from 0 to 50000 under the amount variable is 1550. The weights of the data segment from 50000 to 100000, the data segment from 100000 to 150000 and the data segment greater than 150000 are calculated in the same way and are 110, 90 and 600 respectively. Comparing the weights shows that the data segment from 50000 to 100000 and the data segment from 100000 to 150000 differ by only 20 and are adjacent, so these two initial data segments are merged into a data segment from 50000 to 150000. Besides the "MISSING" segment, four data segments are finally obtained.
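The weight comparison in this example can be reproduced with a short Python sketch. The amount data for the segments of 0 and 0 to 50000 are taken from the text; the amount data for the other three segments are hypothetical values chosen only so that their means equal the stated weights of 110, 90 and 600. The set value of 20 and the helper name `merge_adjacent` are likewise assumptions.

```python
from statistics import mean

# (segment label, data under the amount variable corresponding to the segment's data)
segments = [
    ("0", [20000]),
    ("0-50000", [1500, 1520, 1580, 1600]),
    ("50000-100000", [100, 120]),   # hypothetical data, mean 110
    ("100000-150000", [85, 95]),    # hypothetical data, mean 90
    (">150000", [500, 700]),        # hypothetical data, mean 600
]
SET_VALUE = 20  # assumed threshold: merge when the weight difference is <= SET_VALUE


def merge_adjacent(segments, set_value, weight=mean):
    """Merge neighbouring segments whose weights differ by at most set_value."""
    merged = [segments[0]]
    for label, data in segments[1:]:
        prev_label, prev_data = merged[-1]
        if abs(weight(prev_data) - weight(data)) <= set_value:
            # Pool the data so the merged segment's weight is recomputed on all of it.
            merged[-1] = (f"{prev_label}+{label}", prev_data + data)
        else:
            merged.append((label, data))
    return merged


result = merge_adjacent(segments, SET_VALUE)
print([label for label, _ in result])
```

Only adjacent segments are compared here, which matches the example, where the two merged segments are neighbours. Passing `statistics.mode` as `weight` would give the mode-based variant mentioned in the text.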

Further, referring to fig. 1, next, step S300 is performed to calculate information values of the variables, select a first number of variables from the variables, and build a model based on the selected first number of variables.

In one possible implementation, information values of the variables are calculated, and the information values (IV values) are mainly used for encoding and prediction capability evaluation of the input variables. The magnitude of the information value indicates the strength of the variable prediction capability. The information value calculation steps are as follows:

After grouping, the WOE of the i-th segment is calculated. WOE stands for "weight of evidence". Intuitively, WOE is a form of encoding of the original variable. To WOE-encode a variable, the variable must first be grouped (also called binning or discretization), commonly by equal-width grouping, equal-height grouping, or decision tree grouping. The WOE of the i-th segment is calculated as follows:

$$WOE_i = \ln\left(\frac{P_{y_i}}{P_{n_i}}\right) = \ln\left(\frac{y_i / y_s}{n_i / n_s}\right)$$

where WOE expresses the difference between "the proportion of responding clients in the current segment to all responding clients" (i.e., $P_{y_i}$) and "the proportion of unresponsive clients in the current segment to all unresponsive clients" (i.e., $P_{n_i}$), measured as the logarithm of their ratio. Here $y_i$ is the number of responding clients in the current segment, $y_s$ is the number of all responding clients, $n_i$ is the number of clients that do not respond in the current segment, and $n_s$ is the number of all clients that do not respond.

For segment i, the corresponding information value is calculated using the following formula:

$$IV_i = (P_{y_i} - P_{n_i}) \times WOE_i$$

After the information values of all segments of a variable have been calculated, the information value of the whole variable is obtained as follows, where n is the number of segments:

$$IV = \sum_{i=1}^{n} IV_i$$

That is, the information values of all segments in the same variable are added to obtain the information value of this variable.
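As a hedged illustration, the calculation can be sketched in Python using the common definitions WOE_i = ln(P_yi / P_ni) and IV_i = (P_yi - P_ni) * WOE_i, with the variable's information value being the sum over its segments. The function name `woe_iv` and the client counts are hypothetical. The sketch assumes every segment contains at least one responding and one non-responding client, because the logarithm is undefined otherwise.

```python
import math


def woe_iv(segment_counts):
    """segment_counts maps a segment label to (y_i, n_i):
    responding and non-responding clients in that segment."""
    y_s = sum(y for y, _ in segment_counts.values())  # all responding clients
    n_s = sum(n for _, n in segment_counts.values())  # all non-responding clients
    table = {}
    for label, (y_i, n_i) in segment_counts.items():
        p_yi, p_ni = y_i / y_s, n_i / n_s
        woe = math.log(p_yi / p_ni)
        table[label] = (woe, (p_yi - p_ni) * woe)  # (WOE_i, IV_i)
    iv = sum(iv_i for _, iv_i in table.values())
    return table, iv


# Hypothetical counts; "MISSING" is treated like any other segment.
counts = {"MISSING": (30, 70), "0": (10, 90), "0-50000": (40, 160), ">50000": (20, 180)}
table, iv = woe_iv(counts)
for label, (woe, iv_i) in table.items():
    print(f"{label:>8}  WOE={woe:+.4f}  IV={iv_i:.4f}")
print(f"variable IV = {iv:.4f}")
```

Because the "MISSING" segment gets its own WOE, the sign and size of that value show directly how missing data relates to the target.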

Referring to fig. 4, after the information value has been calculated for all variables, each variable (Characteristic) corresponds to one information value (Information Value).

In a possible implementation manner, referring to fig. 4, the variables are sorted in descending order of information value, and a part of the variables is selected by a recursive algorithm to build the model. For example, suppose the information values of all variables have been obtained and there are 100 variables. The 100 variables are first sorted in descending order of information value and the first 30 are selected. A recursive algorithm then selects a subset of these 30 variables according to the correlation among them; for example, 10 variables are finally screened out of the 30, and these 10 variables are put into the model to build it.
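The text does not specify the recursive algorithm, so the following Python sketch shows only one plausible reading, stated here as an assumption: after keeping the variables with the highest information values, the lower-IV member of the most strongly correlated pair is dropped recursively until the target count remains. The function names, the information values and the correlation table are all hypothetical.

```python
def select_top_by_iv(iv_by_variable, first_number):
    """Sort variables by information value, descending, and keep the first ones."""
    ranked = sorted(iv_by_variable, key=iv_by_variable.get, reverse=True)
    return ranked[:first_number]


def prune_by_correlation(candidates, iv_by_variable, corr, target):
    """Assumed recursion: drop the lower-IV variable of the most correlated pair
    until only `target` variables remain."""
    if len(candidates) <= max(target, 1):
        return candidates
    pairs = [
        (abs(corr[frozenset((a, b))]), a, b)
        for i, a in enumerate(candidates)
        for b in candidates[i + 1:]
    ]
    _, a, b = max(pairs)
    drop = a if iv_by_variable[a] < iv_by_variable[b] else b
    remaining = [v for v in candidates if v != drop]
    return prune_by_correlation(remaining, iv_by_variable, corr, target)


# Hypothetical information values and pairwise correlations.
iv_by_variable = {"v1": 0.9, "v2": 0.7, "v3": 0.5, "v4": 0.1}
corr = {
    frozenset(("v1", "v2")): 0.95,
    frozenset(("v1", "v3")): 0.10,
    frozenset(("v2", "v3")): 0.20,
}
top = select_top_by_iv(iv_by_variable, 3)
selected = prune_by_correlation(top, iv_by_variable, corr, 2)
print(top, selected)
```

In the example from the text, the same two stages would be applied with 100 variables, a first cut to 30, and a final selection of 10.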

After the model is established, it can be verified and applied to statistical analysis of data. That is, referring to fig. 5, a model established in any of the above manners outputs a trend relationship for researchers to inspect. When the model result is interpreted and verified and the trend relationship between each variable and the analysis target is studied, the variables in the verification set (i.e., all variables) are scored directly according to the earlier variable segmentation results, which yields a result from which the trend relationship between each segment and the target variable can be studied. Throughout the process the missing values are treated as a separate segment labeled "MISSING", which helps analysts and business personnel understand and apply the model more clearly.
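One way to inspect such a trend relationship is sketched below, under the assumption that the trend is summarized as the share of responding samples in each segment; the text does not fix the statistic. The segment boundaries follow the final segments of the sales-volume example, while the function names and the verification data are hypothetical.

```python
from collections import defaultdict


def segment_of(value, preset="NONE"):
    """Assign a value to one of the final segments from the earlier example."""
    if value == preset:
        return "MISSING"
    if value == 0:
        return "0"
    if value <= 50000:
        return "0-50000"
    if value <= 150000:
        return "50000-150000"
    return ">150000"


def trend_by_segment(values, targets):
    """Return {segment label: share of samples with target == 1}."""
    totals = defaultdict(lambda: [0, 0])  # label -> [responders, samples]
    for value, target in zip(values, targets):
        counts = totals[segment_of(value)]
        counts[0] += target
        counts[1] += 1
    return {label: hits / n for label, (hits, n) in totals.items()}


# Hypothetical verification-set data: sales volume and a binary target.
values = ["NONE", 0, 100, "NONE", 60000, 200000]
targets = [1, 0, 1, 0, 0, 1]
trend = trend_by_segment(values, targets)
print(trend)
```

Since "MISSING" appears as its own row in the result, its relationship to the target can be read off alongside the ordinary segments.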

It should be noted that, although the missing value processing method in data modeling of the present disclosure is described above by taking the above steps as examples, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set the missing value processing method in the data modeling according to personal preference and/or actual application scenes as long as the required functions are achieved.

In this way, a sample data set is acquired, missing values in the sample data set are replaced with a preset value, and a plurality of variables are constructed based on the data in the sample data set; the data in each variable is segmented to obtain a plurality of data segments, with the missing values assigned to the same data segment; and the information value of each variable is calculated, a first number of variables is selected from the plurality of variables, and a model is built based on the selected first number of variables. Because the missing values are not substantively changed, the authenticity and accuracy of the data are preserved. At the same time, the missing values receive special handling during modeling so that the modeling calculation proceeds smoothly: they are treated as ordinary attribute values and take part in the modeling calculation together with the other attribute values. As a result, the trend relationship between a variable's missing values and the modeling target is expressed more clearly, the classification capability of the model is improved, and the model is easier to interpret during later model evaluation.

Further, according to another aspect of the present disclosure, there is also provided a missing value processing apparatus 100 in data modeling. Since the working principle of the missing value processing apparatus 100 in the data modeling of the embodiment of the present disclosure is the same as or similar to the principle of the missing value processing method in the data modeling of the embodiment of the present disclosure, repeated descriptions are omitted. Referring to fig. 6, a missing value processing apparatus 100 in data modeling of the embodiment of the present disclosure includes a variable construction module 110, a variable segmentation module 120, and a modeling variable selection module 130;

the variable constructing module 110 is configured to acquire a sample data set, replace missing values in the sample data set with preset values, and construct a plurality of variables based on each data in the sample data set; each variable comprises a plurality of data;

a variable segmentation module 120 configured to segment data in each variable to obtain a plurality of data segments; wherein, the missing values are divided into the same data segment;

a modeling variable selection module 130 configured to calculate information values of the variables, select a first number of variables from the plurality of variables, and build a model based on the selected first number of variables.

Still further, according to another aspect of the present disclosure, there is also provided a missing value processing apparatus 200 in data modeling. Referring to fig. 7, a missing value processing apparatus 200 in data modeling according to an embodiment of the present disclosure includes a processor 210 and a memory 220 for storing instructions executable by the processor 210. Wherein the processor 210 is configured to execute the executable instructions to implement any of the above-described missing value handling methods in data modeling.

Here, it should be noted that the number of the processors 210 may be one or more. Meanwhile, in the missing value processing apparatus 200 in data modeling of the embodiment of the present disclosure, an input device 230 and an output device 240 may be further included. The processor 210, the memory 220, the input device 230, and the output device 240 may be connected via a bus, or may be connected via other methods, which is not limited in detail herein.

The memory 220, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as: the program or the module corresponding to the missing value processing method in the data modeling of the embodiment of the disclosure. The processor 210 performs various functional applications and data processing of the missing value processing apparatus 200 in data modeling by running software programs or modules stored in the memory 220.

The input device 230 may be used to receive an input number or signal. Wherein the signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device 240 may include a display device such as a display screen.

According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by the processor 210, implement the missing value processing method in data modeling as described in any of the preceding.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A missing value processing method in data modeling is characterized by being used for carrying out sales data statistical analysis after a model is built, and comprising the following steps of:

acquiring a sample data set, replacing missing values in the sample data set with preset values, and constructing a plurality of variables based on each data in the sample data set; wherein each variable comprises a plurality of data; the sample data set comprises data of purchased products stored in a hard disk, variables comprise 'deadline', 'buyamount', 'buytime _ new' and 'weekd', the characters are variable names of the variables, corresponding data are arranged below each group, the data below each group are associated with the corresponding variable names, namely, the value of each data can be assigned to the current variable, and then variable construction is completed;

calculating an information value of each of the variables, selecting a first number of variables from the plurality of variables, and modeling based on the selected first number of variables; wherein the established model is used for statistical analysis of sales data;

wherein constructing a plurality of variables based on each data in the sample data set comprises:

acquiring each data in the sample data set and a preset variable name of each variable;

wherein the attribute of the data corresponds to the variable name;

segmenting the data in each variable to obtain a plurality of data segments, including:

segmenting data contained in each variable according to preset conditions to obtain a plurality of initial data segments; wherein the initial data segment includes sales volume;

2. The method of claim 1, wherein merging or retaining the initial data segments according to the similarity between any two initial data segments in the same variable comprises:

acquiring a weight value corresponding to each initial data segment in another variable under the same variable; the weight value is any one of the mean value and the mode of the data corresponding to the other variable of each data in each initial data section;

3. The method of claim 2, wherein obtaining a weight value corresponding to each of the initial data segments in another variable under the same variable comprises:

acquiring data corresponding to each data in each initial data segment in another variable;

and calculating based on the data corresponding to each data in the initial data segment in another variable to obtain the weight value corresponding to the initial data segment in another variable.

4. The method of claim 1, wherein selecting a first number of variables from the plurality of variables comprises:

5. The method of claim 1, wherein selecting a first number of variables from the plurality of variables and modeling based on the first number of variables uses a recursive algorithm to select the variables.

6. A missing value processing device in data modeling is characterized by being used for carrying out sales data statistical analysis after a model is built and comprising a variable construction module, a variable segmentation module and a modeling variable selection module;

the variable construction module is configured to acquire a sample data set, replace missing values in the sample data set with preset values, and construct a plurality of variables based on each data in the sample data set; wherein each variable comprises a plurality of data; the sample data set comprises data of purchased products stored in a hard disk, variables comprise 'deadline', 'buyamount', 'buytime _ new' and 'weekd', the characters are variable names of the variables, corresponding data are arranged below each group, the data below each group are associated with the corresponding variable names, namely, the value of each data can be assigned to the current variable, and then variable construction is completed;

the modeling variable selection module is configured to calculate information values of the variables, select a first number of variables from the variables, and build a model based on the selected first number of variables; wherein the established model is used for statistical analysis of sales data;

wherein the attribute of the data corresponds to the variable name;

7. A missing value processing apparatus in data modeling, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any one of claims 1 to 5 when executing the executable instructions.

8. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1 to 5.