CN112783884A - Data optimization method based on normal distribution - Google Patents

Info

Publication number
CN112783884A
Authority
CN
China
Prior art keywords
data
value
normal distribution
missing
optimization method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110123387.XA
Other languages
Chinese (zh)
Inventor
姜振荣
王国良
黄少军
邱实
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN202110123387.XA
Publication of CN112783884A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a data optimization method based on normal distribution. The method collects data, converts the collected data into numerical values and stores them in a database for later use; checks the consistency of the data and handles invalid and missing values; and takes the cleaned data as a variable x, whose normal distribution curve f(x) is obtained from the probability density function of the normal distribution. The area enclosed by the normal distribution curve and the x-axis is set to 1, and the optimal (cutoff) value x0 is obtained from the inverse of the normal cumulative distribution function according to the proportion A that the preferred data make up of all the data. Because the method takes both the dispersion and the distribution of the data into account, it avoids the unfairness that conventional selection strategies tend to cause, makes the result more reasonable and more representative of the data as a whole, and improves the accuracy of data optimization.

Description

Data optimization method based on normal distribution
Technical Field
The invention relates to the technical field of data processing, in particular to a data optimization method based on normal distribution.
Background
Data optimization means selecting the relatively excellent individuals from a population. A common approach is to sort the individuals by an observed attribute and then select the highest-ranked ones.
For example, when a commodity is studied, it is often necessary to determine in which markets it sells well and in which it sells poorly, and to set up an incentive scheme based on these differences. Typically, the markets are sorted from best to worst by sales performance and the first few are rated as good. This has a clear drawback. Suppose there are 10 markets whose sales scores are, in order, 99, 95, 93, 92, 85, 83, 81, 78, 76 and 71. The top three (99, 95 and 93) are rated as excellent, yet the fourth-placed 92 differs from the third-placed 93 by only one point. The ranking approach clearly ignores the distribution of the data as a whole. The data selected in this way also fail to reflect the overall situation, so technicians can easily misread the overall picture and reach a wrong judgment.
To address this drawback, the invention provides a data optimization method based on normal distribution.
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a simple and efficient data optimization method based on normal distribution.
The invention is realized by the following technical scheme:
A data optimization method based on normal distribution, characterized in that the method comprises the following steps:
First step: data acquisition
Collect data on the research object, convert the collected data into numerical values and store them in a database for later use.
Second step: data cleaning to remove abnormal values
Check the consistency of the data and whether the data meet the requirements; identify data that fall outside the normal range or are logically unreasonable or contradictory, so that every value lies within a reasonable range and is consistent with related values; and handle invalid and missing values.
Third step: model establishment
Take the cleaned data as a variable x, compute the mean μ and the standard deviation σ of x, and obtain the normal distribution curve f(x) of x from the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ is the standard deviation.
Fourth step: result output
Set the area enclosed by the normal distribution curve and the x-axis to 1; then, according to the proportion A that the preferred data make up of all the data, obtain the optimal value x0 from the inverse of the normal cumulative distribution function.
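As a rough illustration of the third and fourth steps, the sketch below fits a normal distribution to the ten sales scores quoted in the Background and finds the cutoff for the top share of markets. It rests on assumptions that the patent does not state: Python's standard-library `statistics.NormalDist` is used, σ is estimated with the sample standard deviation, and A = 0.3 is an arbitrary illustrative proportion.

```python
from statistics import NormalDist

# Sales scores of the ten markets quoted in the Background.
scores = [99, 95, 93, 92, 85, 83, 81, 78, 76, 71]

# Third step: estimate mu and sigma and obtain the normal curve f(x).
# from_samples() uses the sample standard deviation (an assumption here).
dist = NormalDist.from_samples(scores)
mu, sigma = dist.mean, dist.stdev
f_at_mean = dist.pdf(mu)  # height of f(x) at x = mu

# Fourth step: the total area under f(x) is 1, so the top share A lies to
# the right of x0, where x0 is the inverse CDF evaluated at 1 - A.
A = 0.3  # illustrative proportion, not taken from the patent
x0 = dist.inv_cdf(1 - A)

selected = [s for s in scores if s > x0]
print(f"mu={mu:.1f}, sigma={sigma:.2f}, x0={x0:.2f}, selected={selected}")
```

With these scores the cutoff falls between 85 and 92, so the market scoring 92 is selected together with 93, 95 and 99 instead of being separated from them by rank alone.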
In the second step, when few values are missing and the missing-data mechanism is clear, data cleaning uses a regression algorithm to estimate each missing or invalid value and fills it in with the estimate.
In the second step, all selected continuous variables are set as independent variables and the variable containing missing or invalid values is set as the dependent variable to establish a regression equation; once obtained, the regression equation is used to fill in the corresponding missing or invalid values of the dependent variable.
To bring the filled value closer to the actual situation, the filled value of the dependent variable is the sum of the regression prediction and the regression residual.
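The regression filling described above can be sketched as follows. This is a minimal illustration under assumptions of its own: a single independent variable is used (the patent allows several), the line is fitted by ordinary least squares on the complete cases, and the residual added to each prediction is drawn at random from the observed residuals. The function name `regression_impute` and the data are hypothetical.

```python
import random

def regression_impute(x, y, seed=0):
    """Fill None entries of y with prediction + a randomly drawn residual."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n

    # Ordinary least squares on the complete cases.
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals of the observed cases; one is added to each filled value.
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    return [
        yi if yi is not None else intercept + slope * xi + rng.choice(residuals)
        for xi, yi in zip(x, y)
    ]

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, None, 8.2, 9.8]  # hypothetical data with one missing value
filled = regression_impute(x, y)
```

Adding a residual keeps the filled values from lying exactly on the regression line, which would otherwise understate the spread of the variable.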
In the second step, when many values are missing and the variables are curvilinearly related, data cleaning uses the EM (Expectation-Maximization) algorithm to estimate each missing or invalid value and fills it in with the estimate.
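The patent does not say which model the EM algorithm is applied to. Purely to show the alternation of E-step and M-step, the sketch below assumes the simplest textbook case: two variables that are jointly normal, with x fully observed and some y values missing. This linear model is a stand-in chosen for brevity, not the curvilinear setting named above, and the function name `em_impute` and the data are hypothetical.

```python
def em_impute(x, y, iterations=200):
    """Fill None entries of y, assuming (x, y) is bivariate normal and x is complete."""
    n = len(x)
    observed = [i for i in range(n) if y[i] is not None]

    # x is fully observed, so its mean and variance stay fixed.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n

    # Starting values taken from the complete cases.
    mu_y = sum(y[i] for i in observed) / len(observed)
    s_yy = sum((y[i] - mu_y) ** 2 for i in observed) / len(observed)
    s_xy = sum((x[i] - mu_x) * (y[i] - mu_y) for i in observed) / len(observed)

    for _ in range(iterations):
        beta = s_xy / s_xx
        cond_var = max(s_yy - beta * s_xy, 0.0)  # Var(y | x)

        # E-step: expected y and y^2 for the missing entries.
        ey = [y[i] if y[i] is not None else mu_y + beta * (x[i] - mu_x)
              for i in range(n)]
        ey2 = [ey[i] ** 2 + (cond_var if y[i] is None else 0.0)
               for i in range(n)]

        # M-step: re-estimate the parameters from the expected statistics.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(a * b for a, b in zip(x, ey)) / n - mu_x * mu_y

    beta = s_xy / s_xx
    return [y[i] if y[i] is not None else mu_y + beta * (x[i] - mu_x)
            for i in range(n)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 5.0, 7.0, 9.0, None, None]  # hypothetical; observed part follows y = 2x + 1
filled = em_impute(x, y)
```

Because the observed pairs in this toy data lie exactly on y = 2x + 1, the filled values converge to the points on that line; real data would converge to the maximum-likelihood estimates under the assumed model.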
In the second step, when the degree of association in the data is small, variable deletion or whole-case deletion is applied directly to the variables containing missing or invalid values;
when the data consist of paired variables and the missing or invalid values occur in the paired variables, pairwise deletion is applied to the variables containing the missing or invalid values.
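The three deletion strategies named above can be sketched on a small table. The record layout (a list of dictionaries with `None` marking a missing or invalid value), the helper names and the data are all assumptions made for this illustration.

```python
rows = [  # hypothetical records; None marks a missing or invalid value
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": None, "b": 2.5, "c": 3.5},
    {"a": 1.5, "b": None, "c": 4.0},
    {"a": 2.0, "b": 3.0, "c": 4.5},
]

def drop_variable(rows, name):
    """Variable deletion: remove one variable (column) from every record."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def listwise_delete(rows):
    """Whole-case deletion: keep only records with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_values(rows, u, v):
    """Pairwise deletion: for the pair (u, v), keep records where both are present."""
    return [(r[u], r[v]) for r in rows if r[u] is not None and r[v] is not None]

complete = listwise_delete(rows)
pairs_ac = pairwise_values(rows, "a", "c")
pairs_bc = pairwise_values(rows, "b", "c")
```

Pairwise deletion keeps three records for each pair here, whereas whole-case deletion keeps only two, which is why it is preferred when an analysis only ever looks at two variables at a time.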
In the fourth step, when the preferred data are the larger values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (x0, +∞) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
In the fourth step, when the preferred data are the smaller values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (−∞, x0) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
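The two cases differ only in which tail of the curve has area A. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` and a hypothetical helper named `threshold`:

```python
from statistics import NormalDist

def threshold(mu, sigma, a, tail="upper"):
    """Return x0 such that the chosen tail of N(mu, sigma) has area a."""
    if not 0 < a < 1:
        raise ValueError("A must lie in (0, 1)")
    dist = NormalDist(mu, sigma)
    if tail == "upper":
        # Area over (x0, +inf) equals a, so CDF(x0) = 1 - a.
        return dist.inv_cdf(1 - a)
    # Area over (-inf, x0) equals a, so CDF(x0) = a.
    return dist.inv_cdf(a)

# Illustrative parameters only: mu = 80, sigma = 10, A = 0.3.
upper_x0 = threshold(80, 10, 0.3, tail="upper")
lower_x0 = threshold(80, 10, 0.3, tail="lower")
```

For the same A the two cutoffs are mirror images of each other about the mean, since the normal curve is symmetric.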
The invention has the beneficial effects that: according to the data optimization method based on normal distribution, the dispersion degree and the distribution condition of data are comprehensively considered, so that the unfairness problem easily caused by a conventional optimization strategy is avoided, the data optimization result is more reasonable, the actual situation of global data can be reflected, and the accuracy of data optimization is improved.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data optimization method based on normal distribution comprises the following steps:
First step: data acquisition
Collect data on the research object, convert the collected data into numerical values and store them in a database for later use.
Second step: data cleaning to remove abnormal values
Check the consistency of the data and whether the data meet the requirements; identify data that fall outside the normal range or are logically unreasonable or contradictory, so that every value lies within a reasonable range and is consistent with related values; and handle invalid and missing values.
Third step: model establishment
Take the cleaned data as a variable x, compute the mean μ and the standard deviation σ of x, and obtain the normal distribution curve f(x) of x from the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ is the standard deviation.
Fourth step: result output
Set the area enclosed by the normal distribution curve and the x-axis to 1; then, according to the proportion A that the preferred data make up of all the data, obtain the optimal value x0 from the inverse of the normal cumulative distribution function.
In the second step, when few values are missing and the missing-data mechanism is clear, data cleaning uses a regression algorithm to estimate each missing or invalid value and fills it in with the estimate.
In the second step, all selected continuous variables are set as independent variables and the variable containing missing or invalid values is set as the dependent variable to establish a regression equation; once obtained, the regression equation is used to fill in the corresponding missing or invalid values of the dependent variable.
To bring the filled value closer to the actual situation, the filled value of the dependent variable is the sum of the regression prediction and the regression residual.
In the second step, when many values are missing and the variables are curvilinearly related, data cleaning uses the EM (Expectation-Maximization) algorithm to estimate each missing or invalid value and fills it in with the estimate.
In the second step, when the degree of association in the data is small, variable deletion or whole-case deletion is applied directly to the variables containing missing or invalid values;
when the data consist of paired variables and the missing or invalid values occur in the paired variables, pairwise deletion is applied to the variables containing the missing or invalid values.
In the fourth step, when the preferred data are the larger values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (x0, +∞) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
In the fourth step, when the preferred data are the smaller values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (−∞, x0) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
Example 1
Take the automatic sorting of watermelon as an example. A batch of watermelons are automatically sorted and divided into first-class products, second-class products and third-class products, and the first-class products, the second-class products and the third-class products are sent to markets for sale according to different prices.
First, data are acquired using a data acquisition device such as a camera or a microphone. Data are collected from outside the system and fed to an interface inside the system. In this embodiment, cameras installed on the assembly line capture the colour, pattern, stalk shape, navel size and other attributes of each melon; these are converted into digital signals and stored in the database for subsequent grading.
Next, the data are cleaned to remove abnormal values. Because of survey, coding and data-entry errors, the data may contain invalid and missing values, which need to be handled appropriately. In this embodiment, variables with invalid or missing values are deleted directly.
Then the model is established. Each watermelon is scored on its attributes and graded according to its score. A weight is applied to each variable when calculating the score, and all scores are calculated using the entropy method. The mean and standard deviation of the scores are then computed, and the normal distribution curve f(x) of the variable x is obtained from the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ is the standard deviation.
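The patent names the entropy method but gives no formulas for it. The sketch below uses the commonly taught entropy weight method as an assumption: each attribute column is normalised to proportions, its entropy is computed, and attributes that vary more across melons receive larger weights. The attribute values, the function names and the premise that every attribute is already a non-negative, larger-is-better rating are hypothetical.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method: rows are items, columns are attributes."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [v / total for v in column]
        # Entropy scaled to [0, 1]; a constant column has entropy 1.
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n)
        divergences.append(1 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]

def weighted_scores(matrix, weights):
    """Score of each item as the weighted sum of its attribute ratings."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]

# Hypothetical ratings (0-100) of three melons on two attributes.
ratings = [[80, 50], [80, 70], [80, 90]]
weights = entropy_weights(ratings)
scores = weighted_scores(ratings, weights)
```

In this toy table the first attribute is identical for every melon, so it carries no information and its weight is effectively zero; the scores are then driven by the second attribute.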
Finally, the result is output.
If the batch is to be divided into two grades, with first-grade melons making up 30%, the area enclosed by the normal distribution curve, the x-axis and x ∈ (x0, +∞) is set to 0.3, that is, A = 0.3 (assuming that the scores lie between 0 and 100). x0 is obtained from the inverse of the normal cumulative distribution function, and melons scoring above x0 are first-grade melons.
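For reference, the cutoff in this example can be written in closed form. The worked equation below assumes that Φ denotes the standard normal cumulative distribution function, a symbol the patent itself does not use, and relies on the standard quantile Φ⁻¹(0.7) ≈ 0.524:

```latex
\int_{x_0}^{+\infty} f(x)\,dx = 0.3
\;\Longrightarrow\;
x_0 = \mu + \sigma\,\Phi^{-1}(1 - 0.3)
    = \mu + \sigma\,\Phi^{-1}(0.7)
    \approx \mu + 0.524\,\sigma
```

In other words, first-grade melons are those scoring a little more than half a standard deviation above the mean score.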
Example 2
If the batch of melons is to be divided into four grades, the cutoffs can be calculated in the same way: first-grade melons (25%) score above x1, second-grade melons (25%) score between x2 and x1, third-grade melons (25%) score between x3 and x2, and the remaining melons (25%), scoring below x3, are fourth-grade.
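The four-grade split can be sketched by accumulating the grade shares from the top and inverting the cumulative distribution at each step. The helper names, the parameters μ = 70 and σ = 10, and the rule that a score exactly equal to a cutoff falls into the lower grade are assumptions made for illustration.

```python
from statistics import NormalDist

def grade_cutoffs(mu, sigma, shares):
    """Return cutoffs x1 > x2 > ... for grade shares listed from best to worst."""
    dist = NormalDist(mu, sigma)
    cutoffs, upper_tail = [], 0.0
    for share in shares[:-1]:  # the last grade takes whatever remains
        upper_tail += share
        cutoffs.append(dist.inv_cdf(1 - upper_tail))
    return cutoffs

def grade(score, cutoffs):
    """Return 1 for the best grade; ties at a cutoff go to the lower grade."""
    for k, cutoff in enumerate(cutoffs, start=1):
        if score > cutoff:
            return k
    return len(cutoffs) + 1

# Illustrative parameters only: mean score 70, standard deviation 10.
x1, x2, x3 = grade_cutoffs(70, 10, [0.25, 0.25, 0.25, 0.25])
grades = [grade(s, [x1, x2, x3]) for s in (90, 72, 68, 50)]
```

With equal 25% shares, x2 coincides with the mean and x1 and x3 lie symmetrically on either side of it; unequal shares can be passed in the same way.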
Example 3
If the batch is to be divided into two grades in order to eliminate inferior melons, and the proportion of inferior melons to be eliminated is set to 20%, the area enclosed by the normal distribution curve, the x-axis and x ∈ (−∞, x0) is set to 0.2, that is, A = 0.2. x0 is obtained from the inverse of the normal cumulative distribution function, and melons scoring below x0 are inferior melons.
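Since the cutoff comes from the fitted curve and not from the ranks, the share of melons actually falling below x0 need not equal 20% exactly. The sketch below computes both figures for a set of hypothetical scores, assuming the standard-library `statistics.NormalDist` and a sample standard deviation.

```python
from statistics import NormalDist

# Hypothetical melon scores, for illustration only.
scores = [62, 68, 71, 74, 75, 78, 80, 83, 88, 91]

dist = NormalDist.from_samples(scores)  # sample standard deviation (assumption)
A = 0.2                                 # proportion of inferior melons to eliminate
x0 = dist.inv_cdf(A)                    # area over (-inf, x0) equals A

inferior = [s for s in scores if s < x0]
realised_share = len(inferior) / len(scores)
print(f"x0={x0:.2f}, inferior={inferior}, realised share={realised_share:.0%}")
```

Here the realised share happens to match the nominal 20%; with skewed or clustered scores the two can differ, which is the intended behaviour of a distribution-based cutoff.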
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A data optimization method based on normal distribution, characterized in that the method comprises the following steps:
First step: data acquisition
Collect data on the research object, convert the collected data into numerical values and store them in a database for later use.
Second step: data cleaning to remove abnormal values
Check the consistency of the data and whether the data meet the requirements; identify data that fall outside the normal range or are logically unreasonable or contradictory, so that every value lies within a reasonable range and is consistent with related values; and handle invalid and missing values.
Third step: model establishment
Take the cleaned data as a variable x, compute the mean μ and the standard deviation σ of x, and obtain the normal distribution curve f(x) of x from the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ is the standard deviation.
Fourth step: result output
Set the area enclosed by the normal distribution curve and the x-axis to 1; then, according to the proportion A that the preferred data make up of all the data, obtain the optimal value x0 from the inverse of the normal cumulative distribution function.
2. The normal distribution-based data optimization method according to claim 1, wherein in the second step, when few values are missing and the missing-data mechanism is clear, data cleaning uses a regression algorithm to estimate each missing or invalid value and fills it in with the estimate.
3. The normal distribution-based data optimization method according to claim 2, wherein in the second step, all selected continuous variables are set as independent variables and the variable containing missing or invalid values is set as the dependent variable to establish a regression equation; once obtained, the regression equation is used to fill in the corresponding missing or invalid values of the dependent variable.
4. The normal distribution-based data optimization method according to claim 3, wherein the filled value of the dependent variable is the sum of the regression prediction and the regression residual.
5. The normal distribution-based data optimization method according to claim 1, wherein in the second step, when many values are missing and the variables are curvilinearly related, data cleaning uses the EM algorithm to estimate each missing or invalid value and fills it in with the estimate.
6. The normal distribution-based data optimization method according to claim 1, wherein in the second step, when the degree of association in the data is small, variable deletion or whole-case deletion is applied directly to the variables containing missing or invalid values;
when the data consist of paired variables and the missing or invalid values occur in the paired variables, pairwise deletion is applied to the variables containing the missing or invalid values.
7. The normal distribution-based data optimization method according to claim 1, wherein in the fourth step, when the preferred data are the larger values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (x0, +∞) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
8. The normal distribution-based data optimization method according to claim 1, wherein in the fourth step, when the preferred data are the smaller values among all the data, let x = x0 and set the area enclosed by the normal distribution curve, the x-axis and x ∈ (−∞, x0) to A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values make up, A is obtained from this percentage, and x0 is obtained from the inverse of the normal cumulative distribution function.
CN202110123387.XA 2021-01-29 2021-01-29 Data optimization method based on normal distribution Pending CN112783884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110123387.XA CN112783884A (en) 2021-01-29 2021-01-29 Data optimization method based on normal distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110123387.XA CN112783884A (en) 2021-01-29 2021-01-29 Data optimization method based on normal distribution

Publications (1)

Publication Number Publication Date
CN112783884A (en) 2021-05-11

Family

ID=75759629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110123387.XA Pending CN112783884A (en) 2021-01-29 2021-01-29 Data optimization method based on normal distribution

Country Status (1)

Country Link
CN (1) CN112783884A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969356A (en) * 2019-12-03 2020-04-07 浪潮软件股份有限公司 Method and system for setting index threshold based on normal distribution
CN111668845A (en) * 2020-06-16 2020-09-15 广东工业大学 Probability load flow calculation method considering photovoltaic correlation
CN112017025A (en) * 2020-08-26 2020-12-01 天元大数据信用管理有限公司 Enterprise credit assessment method based on fusion of deep learning and logistic regression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张文彤等: "《SPSS统计分析高级教程》", 30 September 2004 *
李双其等: "《大数据侦查实践》", 30 September 2019 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510525A (en) * 2022-04-18 2022-05-17 深圳丰尚智慧农牧科技有限公司 Data format conversion method and device, computer equipment and storage medium
CN114510525B (en) * 2022-04-18 2022-08-30 深圳丰尚智慧农牧科技有限公司 Data format conversion method and device, computer equipment and storage medium
CN114936208A (en) * 2022-07-26 2022-08-23 广州天维信息技术股份有限公司 Information analysis system based on data cleaning
CN114936208B (en) * 2022-07-26 2022-09-23 广州天维信息技术股份有限公司 Information analysis system based on data cleaning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 271000 Langchao science and Technology Park, 527 Dongyue street, Tai'an City, Shandong Province

Applicant after: INSPUR SOFTWARE Co.,Ltd.

Address before: No. 1036, Shandong high tech Zone wave road, Ji'nan, Shandong

Applicant before: INSPUR SOFTWARE Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210511