CN112783884A - Data optimization method based on normal distribution

Info

Publication number: CN112783884A
Application number: CN202110123387.XA
Authority: CN
Inventors: 姜振荣; 王国良; 黄少军; 邱实; 张鹏
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-05-11

Abstract

The invention relates to a data optimization method based on normal distribution. The method collects data information, converts it into numerical values, and stores the values in a database for later use; checks the consistency of the data and handles invalid values and missing values; and takes the cleaned data as a variable x, from which the normal distribution curve f(x) of x is obtained according to the probability density function of the normal distribution. The area enclosed by the normal distribution curve and the x-axis is set to 1, and, given the proportion A of the preferred data among all data, the optimal value x₀ is obtained through the inverse of the normal cumulative distribution function. Because the method takes both the dispersion and the distribution of the data into account, it avoids the unfairness that conventional selection strategies tend to cause, makes the selection result more reasonable and more representative of the global data, and improves the accuracy of data optimization.

Description

Data optimization method based on normal distribution

Technical Field

The invention relates to the technical field of data processing, in particular to a data optimization method based on normal distribution.

Background

Data optimization refers to selecting relatively excellent individuals from a population, and a common method is to sort the individuals according to observed attributes and then select the individuals with the highest ranking.

For example, when studying a commodity, it is often necessary to determine in which markets the commodity sells well and in which it sells poorly, and to establish an incentive system according to these differences. Typically, technicians rank the markets by sales status from best to worst and treat the top few as good. This approach has a clear disadvantage. Suppose there are 10 markets whose sales-status scores are, in order, 99, 95, 93, 92, 85, 83, 81, 78, 76, and 71, and the top three (99, 95, and 93) are rated excellent. The fourth-ranked score, 92, is only one point below the third-ranked score, 93, yet it is excluded; the method evidently ignores the distribution of the global data. Moreover, the data selected in this way cannot reflect the actual situation of the global data, so technicians can easily misjudge the overall situation and reach wrong conclusions.

To address this defect, the invention provides a data optimization method based on normal distribution.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention provides a simple and efficient data optimization method based on normal distribution.

The invention is realized by the following technical scheme:

A data optimization method based on normal distribution, characterized in that the method comprises the following steps:

First, data acquisition

Collecting data information about the research object, converting the collected data information into numerical values, and storing the numerical values in a database for later use;

Second, data cleaning to remove abnormal values

Checking the consistency of the data and whether the data meet the requirements; identifying data that fall outside the normal range, are logically unreasonable, or contradict one another; ensuring that every value lies within a reasonable range and is consistent with related values; and handling invalid values and missing values;

Third, establishing a model

Taking the cleaned data as a variable x, computing the mean μ and the standard deviation σ of x, and obtaining the normal distribution curve f(x) of x according to the probability density function of the normal distribution, wherein the probability density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

wherein μ is the mean and σ is the standard deviation;

Fourth, result output

Setting the area enclosed by the normal distribution curve and the x-axis to 1, and, according to the proportion A of the preferred data among all data, obtaining the optimal value x₀ through the inverse of the normal cumulative distribution function.
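The model of the third step can be sketched as follows. This is a minimal illustration, assuming Python's standard statistics module and a hypothetical list of cleaned values; the use of the sample standard deviation is an assumption, since the method does not state which estimator is intended. The sketch evaluates the normal probability density directly from its standard formula, compares it with the library implementation, and checks that the total area under the curve is 1.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_pdf(x, mu, sigma):
    """Evaluate the normal probability density function f(x)."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical cleaned data (illustrative values only).
cleaned = [62.0, 70.5, 74.0, 77.5, 80.0, 83.5, 88.0, 91.0]

mu = mean(cleaned)       # mean of the variable x
sigma = stdev(cleaned)   # sample standard deviation (assumed estimator)
dist = NormalDist(mu, sigma)

# The hand-written formula and the library density should agree.
pdf_by_formula = normal_pdf(80.0, mu, sigma)
pdf_by_library = dist.pdf(80.0)

# The area under f(x) is 1: the cumulative function approaches 1 far to the right.
total_area = dist.cdf(mu + 10 * sigma)
```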

In the second step, when few data are missing and the missing-data mechanism is clear, data cleaning uses a regression algorithm to estimate each missing or invalid value and fills in the missing or invalid value with the estimate.

In the second step, all selected continuous variables are set as independent variables and the variable with missing or invalid values is set as the dependent variable to establish a regression equation; once the regression equation is obtained, it is used to fill in the corresponding missing or invalid values of the dependent variable.

In order to make the filling value closer to the actual situation, the filling value of the dependent variable is the sum of the regression prediction value and the regression residual error.
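The regression filling described above can be sketched as follows. This is a simplified illustration rather than the patented implementation: it assumes a single independent variable (the text allows several), hypothetical data, and that the residual added to each prediction is drawn at random from the residuals of the observed cases, which the source does not specify.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def regression_impute(xs, ys, rng):
    """Fill None entries of ys with the regression prediction plus a residual."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    # Residuals of the observed cases around the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            # Assumed scheme: prediction + residual resampled from observed cases.
            y = intercept + slope * x + rng.choice(residuals)
        filled.append(y)
    return filled, (intercept, slope), residuals

# Hypothetical data: xs is the independent variable, ys has missing values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, None, 11.8]
filled, (intercept, slope), residuals = regression_impute(xs, ys, random.Random(0))
```

Adding a residual keeps the filled values from lying exactly on the regression line, which would otherwise understate the dispersion that the later normal-distribution model relies on.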

In the second step, when many data are missing and the variables are curvilinearly related, data cleaning uses the EM (Expectation-Maximization) algorithm to estimate each missing or invalid value and fills in the missing or invalid value with the estimate.
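The source does not say which statistical model the EM algorithm is applied to. As one possible illustration only, the sketch below assumes two variables that follow a bivariate normal distribution, with x fully observed and y partly missing; the function name em_impute and the data are hypothetical. The E-step replaces each missing y by its conditional expectation given x, and the M-step re-estimates the mean and covariance terms from the completed data.

```python
def em_impute(xs, ys, iterations=200):
    """EM for a bivariate normal model; returns ys with None entries filled."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n   # x is fully observed
    observed = [y for y in ys if y is not None]
    mu_y = sum(observed) / len(observed)          # starting values
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy            # Var(y | x)
        # E-step: expected y and y^2 for every case.
        ey = [y if y is not None else mu_y + beta * (x - mu_x)
              for x, y in zip(xs, ys)]
        ey2 = [y * y if y is not None else e * e + resid_var
               for y, e in zip(ys, ey)]
        # M-step: update the parameters from the expected statistics.
        mu_y = sum(ey) / n
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y
        s_yy = sum(ey2) / n - mu_y ** 2
    beta = s_xy / s_xx
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]

# Hypothetical data in which the observed cases satisfy y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, None, 9.0, None, 13.0]
imputed = em_impute(xs, ys)
```

A curvilinear relationship would need a different model (for example, transformed variables); the linear-normal case is shown only because it keeps the two EM steps easy to follow.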

In the second step, when the degree of association among the data is low, variable deletion or whole-case (listwise) deletion is applied directly to the variables with missing or invalid values;

and when the data consist of paired variables and the missing or invalid value belongs to a paired variable, pairwise deletion is applied to the variables with missing or invalid values.
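The three deletion operations can be sketched as follows, using their usual statistical meaning, which is an interpretation of the translated text. The record layout and the field names (sales, visits, rating) are hypothetical.

```python
def listwise_delete(records):
    """Whole-case deletion: drop every record with at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_variables(records, names):
    """Variable deletion: remove whole variables (columns) from every record."""
    return [{k: v for k, v in r.items() if k not in names} for r in records]

def pairwise_complete(records, a, b):
    """Pairwise deletion: keep (a, b) pairs only where both values are present."""
    return [(r[a], r[b]) for r in records
            if r[a] is not None and r[b] is not None]

# Hypothetical records; None marks a missing or invalid value.
records = [
    {"sales": 120.0, "visits": 30.0, "rating": 4.5},
    {"sales": None, "visits": 25.0, "rating": 4.0},
    {"sales": 90.0, "visits": None, "rating": 3.5},
    {"sales": 150.0, "visits": 40.0, "rating": None},
]
complete_cases = listwise_delete(records)
without_rating = drop_variables(records, {"rating"})
sales_visits = pairwise_complete(records, "sales", "visits")
```

Pairwise deletion keeps more data than whole-case deletion here: two pairs survive for sales and visits, while only one record is complete on all three variables.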

In the fourth step, when data optimization is to select the larger values among all data, the area enclosed by the line x = x₀, the normal distribution curve, and the x-axis over x ∈ (x₀, +∞) is set as A, where A ∈ (0, 1);

the user specifies the percentage of all values that the preferred values should account for, the value of A is obtained from the specified percentage, and x₀ is obtained through the inverse of the normal cumulative distribution function.
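For the larger-value case, the area to the right of x₀ equals A, so x₀ is the inverse cumulative function evaluated at 1 − A. The sketch below is a minimal illustration that applies this to the ten market scores from the Background section; the helper name upper_threshold and the choice of the sample standard deviation are assumptions.

```python
from statistics import NormalDist, mean, stdev

def upper_threshold(values, proportion):
    """Return x0 such that the fitted normal area over (x0, +inf) equals A."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion A must lie in (0, 1)")
    dist = NormalDist(mean(values), stdev(values))
    # Area to the right of x0 is A, so the area to the left is 1 - A.
    return dist.inv_cdf(1.0 - proportion)

# Sales-status scores of the ten markets from the Background section.
scores = [99, 95, 93, 92, 85, 83, 81, 78, 76, 71]
x0 = upper_threshold(scores, 0.3)   # user-specified percentage: 30%
preferred = [s for s in scores if s > x0]
```

Under these assumptions the threshold lands slightly above 90, so the market scoring 92 is selected together with 99, 95, and 93 instead of being cut off by a fixed top-three rule.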

In the fourth step, when data optimization is to select the smaller values among all data, the area enclosed by the line x = x₀, the normal distribution curve, and the x-axis over x ∈ (−∞, x₀) is set as A, where A ∈ (0, 1).
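For the smaller-value case, the area to the left of x₀ equals A, so x₀ is the inverse cumulative function evaluated at A itself. A minimal sketch, assuming a hypothetical mean of 80 and standard deviation of 10:

```python
from statistics import NormalDist

def lower_threshold(mu, sigma, proportion):
    """Return x0 such that the normal area over (-inf, x0) equals A."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion A must lie in (0, 1)")
    return NormalDist(mu, sigma).inv_cdf(proportion)

mu, sigma = 80.0, 10.0                            # hypothetical parameters
x0_low = lower_threshold(mu, sigma, 0.2)          # smallest 20% lie below x0_low
area_left = NormalDist(mu, sigma).cdf(x0_low)     # recovers A

# By symmetry of the normal curve, the cut for the largest 20% mirrors it about mu.
x0_high = NormalDist(mu, sigma).inv_cdf(1.0 - 0.2)
```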

The invention has the beneficial effects that: according to the data optimization method based on normal distribution, the dispersion degree and the distribution condition of data are comprehensively considered, so that the unfairness problem easily caused by a conventional optimization strategy is avoided, the data optimization result is more reasonable, the actual situation of global data can be reflected, and the accuracy of data optimization is improved.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The data optimization method based on normal distribution comprises the following steps:

First, data acquisition.

Second, data cleaning to remove abnormal values.

Third, establishing a model based on the probability density function of the normal distribution, wherein μ is the mean and σ is the standard deviation.

Fourth, result output.

Example 1

Take the automatic sorting of watermelon as an example. A batch of watermelons are automatically sorted and divided into first-class products, second-class products and third-class products, and the first-class products, the second-class products and the third-class products are sent to markets for sale according to different prices.

First, data acquisition is performed using a data acquisition device such as a camera or a microphone. Data are collected from outside the system and fed into an interface inside the system. In this embodiment, a camera installed on the assembly line captures the color, rind pattern, stem shape, navel size, and other attributes of each melon; these are converted into digital signals and stored in the database for the subsequent grading.
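The conversion and storage step might look like the following sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table layout, the attribute names, and the numeric codes assigned to each category are all hypothetical, since the source does not describe the database schema or the encoding.

```python
import sqlite3

# Hypothetical numeric encodings for categorical attributes from the camera.
COLOR_CODES = {"light green": 1, "green": 2, "dark green": 3}
PATTERN_CODES = {"blurred": 1, "clear": 2}

def to_numeric(raw):
    """Convert one raw observation into a row of numerical values."""
    return (raw["melon_id"],
            COLOR_CODES[raw["color"]],
            PATTERN_CODES[raw["pattern"]],
            float(raw["navel_mm"]))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE melon ("
    "melon_id INTEGER PRIMARY KEY, color INTEGER, pattern INTEGER, navel_mm REAL)"
)

raw_observations = [
    {"melon_id": 1, "color": "dark green", "pattern": "clear", "navel_mm": "12.5"},
    {"melon_id": 2, "color": "light green", "pattern": "blurred", "navel_mm": "20"},
]
conn.executemany(
    "INSERT INTO melon VALUES (?, ?, ?, ?)",
    [to_numeric(r) for r in raw_observations],
)
conn.commit()

# Read the stored values back for the later grading steps.
rows = conn.execute(
    "SELECT melon_id, color, pattern, navel_mm FROM melon ORDER BY melon_id"
).fetchall()
```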

Then, data cleaning is performed to remove abnormal values. Because of survey, coding, and data-entry errors, the data may contain invalid values and missing values, which need appropriate handling. In this embodiment, the variables with invalid values or missing values are deleted directly.
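A minimal sketch of this cleaning step is given below. It assumes each melon is a record of numeric attributes, treats None and NaN as missing or invalid, and drops any record with a value outside a valid range; the ranges and field names are hypothetical, and dropping whole records is one reading of "deleted directly".

```python
import math

# Hypothetical valid ranges for each numeric attribute.
VALID_RANGES = {"color": (1, 3), "pattern": (1, 2), "navel_mm": (5.0, 40.0)}

def is_valid(value, low, high):
    """True when the value is present, is not NaN, and lies inside [low, high]."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return low <= value <= high

def clean(records):
    """Split records into those kept and those dropped by the range checks."""
    kept, dropped = [], []
    for record in records:
        ok = all(is_valid(record.get(name), low, high)
                 for name, (low, high) in VALID_RANGES.items())
        (kept if ok else dropped).append(record)
    return kept, dropped

records = [
    {"melon_id": 1, "color": 3, "pattern": 2, "navel_mm": 12.5},
    {"melon_id": 2, "color": None, "pattern": 1, "navel_mm": 20.0},        # missing value
    {"melon_id": 3, "color": 2, "pattern": 2, "navel_mm": 400.0},          # out of range
    {"melon_id": 4, "color": 1, "pattern": 1, "navel_mm": float("nan")},   # invalid reading
]
kept, dropped = clean(records)
```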

Next, a model is established. Each watermelon is scored on its attributes and then graded according to its score. A weight is applied to each variable when computing the score, and all scores are computed using the entropy method. The mean and the standard deviation of the scores are then computed, and the normal distribution curve f(x) of the variable x is obtained according to the probability density function of the normal distribution, wherein the probability density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where μ is the mean and σ is the standard deviation.
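The source names the entropy method without giving its details. The sketch below follows the commonly used formulation of entropy weighting, which is an assumption here: each attribute column is converted to proportions, its entropy is computed, and attributes whose values vary more across melons receive larger weights. The attribute scores are hypothetical and assumed to be positive and already on a common 0 to 100 scale.

```python
import math

def entropy_weights(matrix):
    """Entropy-method weights for the columns of a positive-valued matrix."""
    n = len(matrix)
    k = 1.0 / math.log(n)            # normalizes entropy into [0, 1]
    divergences = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0.0:              # 0 * log(0) is taken as 0
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    total_divergence = sum(divergences)
    return [d / total_divergence for d in divergences]

def weighted_scores(matrix, weights):
    """Overall score of each row as the weighted sum of its attributes."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]

# Hypothetical attribute scores for four melons: color, pattern, navel.
attributes = [
    [90.0, 80.0, 70.0],
    [60.0, 80.0, 85.0],
    [75.0, 80.0, 40.0],
    [85.0, 80.0, 95.0],
]
weights = entropy_weights(attributes)
scores = weighted_scores(attributes, weights)
```

In this example the pattern column is identical for every melon, so it carries no information for ranking and its weight is effectively zero.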

And finally, outputting the result.

If the batch is to be divided into two grades, with first-grade melons making up 30%, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (x₀, +∞) is set to 0.3, i.e., A = 0.3 (assuming that the scores lie between 0 and 100). x₀ is then obtained through the inverse of the normal cumulative distribution function, and watermelons scoring above x₀ are first-grade.
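Written out, with Φ denoting the standard normal cumulative distribution function (notation introduced here for illustration, not used in the source), the first-grade threshold for A = 0.3 follows from the area condition:

```latex
\int_{x_0}^{+\infty} f(x)\,dx
  = 1 - \Phi\!\left(\frac{x_0 - \mu}{\sigma}\right) = 0.3
\quad\Longrightarrow\quad
x_0 = \mu + \sigma\,\Phi^{-1}(0.7) \approx \mu + 0.524\,\sigma
```

In other words, a melon is first-grade when its score exceeds the mean by a little more than half a standard deviation.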

Example 2

If the batch of melons is to be divided into four grades of 25% each, the thresholds x1 > x2 > x3 can be calculated in the same way: first-grade melons score above x1, second-grade melons score between x2 and x1, third-grade melons score between x3 and x2, and the remaining melons, scoring below x3, are fourth-grade.
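The multi-grade case can be sketched by accumulating the grade proportions from the top and applying the inverse cumulative function at each cut. The mean, standard deviation, and helper names below are hypothetical.

```python
from statistics import NormalDist

def grade_thresholds(mu, sigma, proportions):
    """Cut points x1 > x2 > ... for grade shares listed from best to worst."""
    dist = NormalDist(mu, sigma)
    cuts, upper_area = [], 0.0
    for share in proportions[:-1]:   # the last grade takes whatever remains
        upper_area += share
        cuts.append(dist.inv_cdf(1.0 - upper_area))
    return cuts

def grade_of(score, cuts):
    """Return 1 for first grade, 2 for second grade, and so on."""
    for grade, cut in enumerate(cuts, start=1):
        if score > cut:
            return grade
    return len(cuts) + 1

mu, sigma = 70.0, 12.0               # hypothetical mean and standard deviation
cuts = grade_thresholds(mu, sigma, [0.25, 0.25, 0.25, 0.25])
x1, x2, x3 = cuts
grades = [grade_of(s, cuts) for s in (95.0, 72.0, 65.0, 40.0)]
```

With four equal shares, x2 coincides with the mean, and x1 and x3 sit symmetrically on either side of it.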

Example 3

If the batch is to be divided into two grades in order to eliminate inferior melons, and the proportion of inferior melons to be eliminated is set to 20%, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (−∞, x₀) is set to 0.2, i.e., A = 0.2. x₀ is then obtained through the inverse of the normal cumulative distribution function, and watermelons scoring below x₀ are inferior.

The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A data optimization method based on normal distribution is characterized in that: the method comprises the following steps:

first, data acquisition;

second, data cleaning to remove abnormal values;

third, establishing a model based on the probability density function of the normal distribution, wherein μ is the mean and σ is the standard deviation;

fourth, result output:

the area enclosed by the normal distribution curve and the x-axis is set to 1, and, according to the proportion A of the preferred data among all data, the optimal value x₀ is obtained through the inverse of the normal cumulative distribution function.

2. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when few data are missing and the missing-data mechanism is clear, data cleaning uses a regression algorithm to estimate each missing or invalid value and fills in the missing or invalid value with the estimate.

3. The normal distribution-based data optimization method according to claim 2, wherein: in the second step, all selected continuous variables are set as independent variables and the variable with missing or invalid values is set as the dependent variable to establish a regression equation; once the regression equation is obtained, it is used to fill in the corresponding missing or invalid values of the dependent variable.

4. The normal distribution-based data optimization method according to claim 3, wherein: the filled value of the dependent variable is the sum of the regression prediction and the regression residual.

5. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when many data are missing and the variables are curvilinearly related, data cleaning uses the EM algorithm to estimate each missing or invalid value and fills in the missing or invalid value with the estimate.

6. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when the degree of association among the data is low, variable deletion or whole-case (listwise) deletion is applied directly to the variables with missing or invalid values.

7. The normal distribution-based data optimization method according to claim 1, wherein: in the fourth step, when data optimization is to select the larger values among all data, the area enclosed by the line x = x₀, the normal distribution curve, and the x-axis over x ∈ (x₀, +∞) is set as A, where A ∈ (0, 1).

8. The normal distribution-based data optimization method according to claim 1, wherein: in the fourth step, when data optimization is to select the smaller values among all data, the area enclosed by the line x = x₀, the normal distribution curve, and the x-axis over x ∈ (−∞, x₀) is set as A, where A ∈ (0, 1).