CN112783884A - Data optimization method based on normal distribution - Google Patents
- Publication number
- CN112783884A (application number CN202110123387.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- value
- normal distribution
- missing
- optimization method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Optimization (AREA)
- Evolutionary Computation (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Algebra (AREA)
- Quality & Reliability (AREA)
- Probability & Statistics with Applications (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a data optimization method based on normal distribution. The method collects data information, converts it into numerical values, and stores the values in a database for later use; checks the consistency of the data and processes invalid values and missing values; and takes the cleaned data as a variable x, from which the normal distribution curve f(x) of x is obtained according to the probability density function of the normal distribution. With the area enclosed by the normal distribution curve and the x-axis set to 1, the preferred value x0 is obtained through the inverse of the normal cumulative distribution function according to the proportion A of the preferred data in all data. Because the method takes both the dispersion and the distribution of the data into account, it avoids the unfairness that conventional selection strategies tend to cause, makes the result more reasonable and more representative of the global data, and improves the accuracy of data optimization.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a data optimization method based on normal distribution.
Background
Data optimization refers to selecting relatively excellent individuals from a population, and a common method is to sort the individuals according to observed attributes and then select the individuals with the highest ranking.
For example, when a commodity is studied, it is often necessary to determine in which markets it sells well and in which it sells poorly, and to set up an incentive scheme based on the difference. Typically, technicians rank the markets from best to worst by sales performance and treat the first few as good. This has a clear disadvantage. Suppose there are 10 markets whose sales scores are, in order: 99, 95, 93, 92, 85, 83, 81, 78, 76, 71. The top three (99, 95, 93) are rated excellent, yet the fourth score, 92, is only one point below the third, 93. The method plainly ignores the distribution of the global data. Moreover, data selected this way does not reflect the actual state of the global data, so technicians can easily misread the overall situation and make wrong judgments.
To address this defect, the invention provides a data optimization method based on normal distribution.
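The contrast can be illustrated with the ten scores from the background example. The sketch below is not part of the patent: it assumes a 30% preferred share (to mirror "top three of ten") and uses Python's `statistics.NormalDist`, whose `from_samples` estimates the sample standard deviation; the patent does not say which estimator it intends.

```python
from statistics import NormalDist

# Sales scores of the 10 markets from the background example.
scores = [99, 95, 93, 92, 85, 83, 81, 78, 76, 71]

# Conventional strategy: take a fixed number of top-ranked markets.
top_three = sorted(scores, reverse=True)[:3]

# Distribution-aware alternative (assumed share A = 0.30): fit a normal
# curve and cut at the score above which 30% of the fitted area lies.
dist = NormalDist.from_samples(scores)  # sample mean and sample std dev
cutoff = dist.inv_cdf(1 - 0.30)
above_cutoff = [s for s in scores if s > cutoff]

print(top_three)               # fixed-rank selection
print(round(cutoff, 2), above_cutoff)
```

With these scores the fitted cutoff falls near 90, so the market scoring 92 is selected together with 93 rather than being separated from it by an arbitrary rank boundary.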
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a simple and efficient data optimization method based on normal distribution.
The invention is realized by the following technical scheme:
a data optimization method based on normal distribution is characterized in that: the method comprises the following steps:
first, data acquisition
Collecting data information aiming at a research object, converting the collected data information into a numerical value and storing the numerical value in a database for later use;
second, cleaning data to remove abnormal value
Checking the consistency of the data, checking whether the data meets the requirements, searching out the data which exceeds a normal range and is logically unreasonable or contradictory, ensuring that each data is in a reasonable value range and a mutual relation, and processing invalid values and missing values;
thirdly, establishing a model
Taking the cleaned data as a variable x, computing the mean μ and the standard deviation σ of x, and obtaining the normal distribution curve f(x) of x according to the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation;
the fourth step, result output
Setting the area enclosed by the normal distribution curve and the x-axis to 1, and obtaining the preferred value x0 through the inverse of the normal cumulative distribution function according to the proportion A of the preferred data in all data.
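The fourth step can be written out explicitly. The relations below are a standard consequence of the normal model rather than text from the patent; Φ denotes the standard normal cumulative distribution function and F the cumulative distribution function of the fitted curve, both notations introduced here for illustration.

$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \Phi\!\left(\frac{x-\mu}{\sigma}\right), \qquad \int_{-\infty}^{+\infty} f(t)\,dt = 1$$

$$\text{larger values preferred:}\quad 1 - F(x_0) = A \;\Longrightarrow\; x_0 = \mu + \sigma\,\Phi^{-1}(1 - A)$$

$$\text{smaller values preferred:}\quad F(x_0) = A \;\Longrightarrow\; x_0 = \mu + \sigma\,\Phi^{-1}(A)$$

By the symmetry of the normal curve, the two thresholds for the same A lie at equal distances on either side of μ.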
In the second step, when little data is missing and the missing-data mechanism is clear, the data cleaning uses a regression algorithm to estimate the missing or invalid values and fills them in according to the estimates.
In the second step, all selected continuous variables are set as independent variables and the variable with missing or invalid values is set as the dependent variable to establish a regression equation; once the regression equation is obtained, it is used to fill in the corresponding missing or invalid values of the dependent variable.
To bring the filled values closer to the actual situation, the filled value of the dependent variable is the sum of the regression prediction and a regression residual.
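A minimal sketch of regression filling with a residual term follows. It assumes a single independent variable, ordinary least squares, and a residual drawn at random from the residuals of the complete rows; the patent does not specify how the residual is chosen, so that part, along with the helper names `fit_line` and `impute_with_residual`, is hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def impute_with_residual(rows, rng):
    """Fill y=None entries with the regression prediction plus a residual.

    Assumption: the residual is sampled from the residuals observed on the
    complete rows, which keeps the filled values from being artificially
    smooth.
    """
    complete = [(x, y) for x, y in rows if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    residuals = [y - (intercept + slope * x) for x, y in complete]
    filled = []
    for x, y in rows:
        if y is None:
            y = intercept + slope * x + rng.choice(residuals)
        filled.append((x, y))
    return filled

# Illustrative data: the dependent value of the third row is missing.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8)]
filled = impute_with_residual(rows, random.Random(0))
print(filled[2])
```

Passing a seeded `random.Random` keeps the fill reproducible; with several independent variables the same idea applies with a multiple regression in place of `fit_line`.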
In the second step, when much data is missing and the variables are curvilinearly related, the data cleaning uses the EM (Expectation-Maximization) algorithm to estimate the missing or invalid values and fills them in according to the estimates.
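The patent names the EM algorithm without giving a model. As one possible illustration only, the sketch below assumes two variables that are jointly normal, with x fully observed and some y values missing. Note that this model imputes along a straight line, so data with a curvilinear relationship would need a different model or a transformation first. The function name `em_impute` is hypothetical.

```python
from statistics import mean

def em_impute(xs, ys, iterations=50):
    """EM for a bivariate normal where some ys are None; returns filled ys."""
    observed = [y for y in ys if y is not None]
    mu_x = mean(xs)
    s_xx = mean([(x - mu_x) ** 2 for x in xs])
    # Initial guesses from the observed y values only.
    mu_y = mean(observed)
    s_yy = mean([(y - mu_y) ** 2 for y in observed])
    s_xy = 0.0
    for _ in range(iterations):
        # E-step: expected y and y^2 for each row under current parameters.
        beta = s_xy / s_xx
        cond_var = s_yy - beta * s_xy
        ey = [y if y is not None else mu_y + beta * (x - mu_x)
              for x, y in zip(xs, ys)]
        ey2 = [e * e + (cond_var if y is None else 0.0)
               for e, y in zip(ey, ys)]
        # M-step: re-estimate the moments from the expected statistics.
        mu_y = mean(ey)
        s_xy = mean([x * e for x, e in zip(xs, ey)]) - mu_x * mu_y
        s_yy = mean(ey2) - mu_y ** 2
    beta = s_xy / s_xx
    # Fill each missing y with its conditional mean given x.
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]

# Illustrative data: the observed pairs lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, None, 8.0, None, 12.0]
filled = em_impute(xs, ys)
print([round(v, 3) for v in filled])
```

A fixed iteration count is used for brevity; a production version would normally stop when the parameter change falls below a tolerance.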
In the second step, when the degree of association in the data is low, the variables with missing or invalid values are directly removed by variable deletion or whole-case deletion;
and when the data consists of paired variables and the missing or invalid value belongs to a paired variable, pairwise deletion is performed on the variables with missing or invalid values.
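The three deletion strategies can be sketched as follows. The record layout and the field names (`color`, `pattern`, `stem`) are hypothetical and chosen only to echo the watermelon attributes used later in the description.

```python
# Illustrative records; None marks a missing or invalid value.
records = [
    {"color": 8.0, "pattern": 7.5, "stem": 6.0},
    {"color": None, "pattern": 6.5, "stem": 7.0},
    {"color": 9.0, "pattern": None, "stem": 8.0},
    {"color": 7.0, "pattern": 8.0, "stem": 9.0},
]

def drop_variable(rows, name):
    """Variable deletion: remove one variable from every record."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def listwise_delete(rows):
    """Whole-case deletion: drop every record with any missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_delete(rows, a, b):
    """Pairwise deletion: keep (a, b) pairs where both values are present."""
    return [(r[a], r[b]) for r in rows
            if r[a] is not None and r[b] is not None]

print(len(listwise_delete(records)))
print(pairwise_delete(records, "pattern", "stem"))
```

Pairwise deletion keeps more data than whole-case deletion because a record is excluded only from the analyses that involve its missing variable.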
In the fourth step, when the preferred data are the larger values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (x0, +∞) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
In the fourth step, when the preferred data are the smaller values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (−∞, x0) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
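Both cases of the fourth step can be covered by one small function. This is a sketch under assumptions: the name `preferred_threshold` and its `prefer` parameter are hypothetical, and `NormalDist.from_samples` estimates σ with the sample standard deviation, which the patent does not specify.

```python
from statistics import NormalDist

def preferred_threshold(values, proportion, prefer="larger"):
    """Return x0 so that the fitted normal area on the preferred side is A.

    proportion is A in (0, 1); prefer selects which tail is preferred.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion A must lie in (0, 1)")
    dist = NormalDist.from_samples(values)  # fits mean and std dev
    if prefer == "larger":
        # Area over (x0, +inf) equals A, so the CDF at x0 equals 1 - A.
        return dist.inv_cdf(1 - proportion)
    if prefer == "smaller":
        # Area over (-inf, x0) equals A, so the CDF at x0 equals A.
        return dist.inv_cdf(proportion)
    raise ValueError("prefer must be 'larger' or 'smaller'")

values = [60, 70, 80, 90, 100]
upper = preferred_threshold(values, 0.30, prefer="larger")
lower = preferred_threshold(values, 0.30, prefer="smaller")
print(round(upper, 2), round(lower, 2))
```

For the same A the two thresholds sit symmetrically around the mean, which gives a quick sanity check on any implementation.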
The invention has the beneficial effects that: according to the data optimization method based on normal distribution, the dispersion degree and the distribution condition of data are comprehensively considered, so that the unfairness problem easily caused by a conventional optimization strategy is avoided, the data optimization result is more reasonable, the actual situation of global data can be reflected, and the accuracy of data optimization is improved.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data optimization method based on normal distribution comprises the following steps:
first, data acquisition
Collecting data information aiming at a research object, converting the collected data information into a numerical value and storing the numerical value in a database for later use;
second, cleaning data to remove abnormal value
Checking the consistency of the data, checking whether the data meets the requirements, searching out the data which exceeds a normal range and is logically unreasonable or contradictory, ensuring that each data is in a reasonable value range and a mutual relation, and processing invalid values and missing values;
thirdly, establishing a model
Taking the cleaned data as a variable x, computing the mean μ and the standard deviation σ of x, and obtaining the normal distribution curve f(x) of x according to the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation;
the fourth step, result output
Setting the area enclosed by the normal distribution curve and the x-axis to 1, and obtaining the preferred value x0 through the inverse of the normal cumulative distribution function according to the proportion A of the preferred data in all data.
In the second step, when little data is missing and the missing-data mechanism is clear, the data cleaning uses a regression algorithm to estimate the missing or invalid values and fills them in according to the estimates.
In the second step, all selected continuous variables are set as independent variables and the variable with missing or invalid values is set as the dependent variable to establish a regression equation; once the regression equation is obtained, it is used to fill in the corresponding missing or invalid values of the dependent variable.
To bring the filled values closer to the actual situation, the filled value of the dependent variable is the sum of the regression prediction and a regression residual.
In the second step, when much data is missing and the variables are curvilinearly related, the data cleaning uses the EM (Expectation-Maximization) algorithm to estimate the missing or invalid values and fills them in according to the estimates.
In the second step, when the degree of association in the data is low, the variables with missing or invalid values are directly removed by variable deletion or whole-case deletion;
and when the data consists of paired variables and the missing or invalid value belongs to a paired variable, pairwise deletion is performed on the variables with missing or invalid values.
In the fourth step, when the preferred data are the larger values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (x0, +∞) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
In the fourth step, when the preferred data are the smaller values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (−∞, x0) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
Example 1
Take the automatic sorting of watermelon as an example. A batch of watermelons are automatically sorted and divided into first-class products, second-class products and third-class products, and the first-class products, the second-class products and the third-class products are sent to markets for sale according to different prices.
First, data acquisition is performed using a data acquisition device such as a camera or a microphone. Data is collected from outside the system and fed to an interface inside the system. In this embodiment, a camera installed on the assembly line captures the color, pattern, stem shape, navel size, and so on of each melon, and these are converted into digital signals and stored in the database for the subsequent grading.
Then, data cleaning is performed to remove abnormal values. Because of errors in investigation, encoding, and logging, the data may contain invalid and missing values, which need to be checked for and handled appropriately. In this embodiment, the variables with invalid or missing values are directly deleted.
Next, the model is established. Each watermelon is scored on its attributes and then graded according to its score. When the scores are calculated, a weight is assigned to each variable and all scores are computed using the entropy method; the mean and standard deviation of the scores are then computed, and the normal distribution curve f(x) of the variable x is obtained according to the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation.
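The patent does not spell out its entropy calculation. The sketch below shows the commonly used entropy weight method as one plausible reading: attributes whose values vary more across melons carry more information and receive larger weights. The attribute matrix and the function names are hypothetical, and all attribute values are assumed to be positive.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method; rows are melons, columns are attributes.

    Assumption: every attribute value is positive and at least one
    attribute is not perfectly uniform across melons.
    """
    n = len(matrix)
    divergences = []
    for column in zip(*matrix):
        total = sum(column)
        shares = [v / total for v in column]
        # Normalized Shannon entropy of the column, in [0, 1].
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]

def weighted_scores(matrix, weights):
    """Score of each melon as the weighted sum of its attribute values."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]

# Illustrative attribute scores: column 0 is nearly uniform, column 1 varies.
matrix = [[8.0, 2.0], [7.0, 9.0], [9.0, 1.0]]
weights = entropy_weights(matrix)
scores = weighted_scores(matrix, weights)
print([round(w, 3) for w in weights])
```

The resulting scores are the variable x whose mean and standard deviation are then used to fit the normal curve.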
Finally, the result is output.
If the batch is to be divided into two grades, with the first grade making up 30%, then the area enclosed by the normal distribution curve, the x-axis, and x ∈ (x0, +∞) is set to 0.3, that is, A = 0.3 (assuming scores lie between 0 and 100). x0 is obtained through the inverse of the normal cumulative distribution function, and watermelons scoring above x0 are first-grade watermelons.
Example 2
If the batch of melons is to be classified into four grades, the thresholds can be calculated in the same way: first-grade melons (25%) score above x1, second-grade melons (25%) score between x2 and x1, third-grade melons (25%) score between x3 and x2, and the remaining fourth-grade melons (25%) score below x3.
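A multi-grade split of this kind can be sketched by accumulating the grade shares from the top of the fitted curve. The function names and the sample scores are hypothetical; how ties at a threshold are graded is not stated in the patent, so the sketch assigns a score equal to a cutoff to the lower grade.

```python
from statistics import NormalDist

def grade_cutoffs(scores, shares):
    """Descending cutoffs that split the fitted normal curve by share.

    shares lists the proportion of each grade from best to worst and is
    assumed to sum to 1; the last grade needs no cutoff of its own.
    """
    dist = NormalDist.from_samples(scores)
    cutoffs, cumulative = [], 0.0
    for share in shares[:-1]:
        cumulative += share
        cutoffs.append(dist.inv_cdf(1 - cumulative))
    return cutoffs

def grade(score, cutoffs):
    """Grade 1 is best; scores not above the last cutoff get the last grade."""
    for level, cutoff in enumerate(cutoffs, start=1):
        if score > cutoff:
            return level
    return len(cutoffs) + 1

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]
x1, x2, x3 = grade_cutoffs(scores, [0.25, 0.25, 0.25, 0.25])
print(round(x1, 2), round(x2, 2), round(x3, 2))
print([grade(s, [x1, x2, x3]) for s in scores])
```

With four equal shares the middle cutoff x2 coincides with the mean of the scores, and x1 and x3 lie symmetrically above and below it.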
Example 3
If the batch is to be divided into two grades in order to eliminate inferior melons, and the proportion of inferior melons to be eliminated is set to 20%, then the area enclosed by the normal distribution curve, the x-axis, and x ∈ (−∞, x0) is set to 0.2, that is, A = 0.2. x0 is obtained through the inverse of the normal cumulative distribution function, and watermelons scoring below x0 are inferior watermelons.
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.
Claims (8)
1. A data optimization method based on normal distribution is characterized in that: the method comprises the following steps:
first, data acquisition
Collecting data information aiming at a research object, converting the collected data information into a numerical value and storing the numerical value in a database for later use;
second, cleaning data to remove abnormal value
Checking the consistency of the data, checking whether the data meets the requirements, searching out the data which exceeds a normal range and is logically unreasonable or contradictory, ensuring that each data is in a reasonable value range and a mutual relation, and processing invalid values and missing values;
thirdly, establishing a model
Taking the cleaned data as a variable x, computing the mean μ and the standard deviation σ of x, and obtaining the normal distribution curve f(x) of x according to the probability density function of the normal distribution:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation;
the fourth step, result output
Setting the area enclosed by the normal distribution curve and the x-axis to 1, and obtaining the preferred value x0 through the inverse of the normal cumulative distribution function according to the proportion A of the preferred data in all data.
2. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when little data is missing and the missing-data mechanism is clear, the data cleaning uses a regression algorithm to estimate the missing or invalid values and fills them in according to the estimates.
3. The normal distribution-based data optimization method according to claim 2, wherein: in the second step, all selected continuous variables are set as independent variables and the variable with missing or invalid values is set as the dependent variable to establish a regression equation; once the regression equation is obtained, it is used to fill in the corresponding missing or invalid values of the dependent variable.
4. The normal distribution-based data optimization method according to claim 3, wherein: the filled value of the dependent variable is the sum of the regression prediction and a regression residual.
5. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when much data is missing and the variables are curvilinearly related, the data cleaning uses the EM algorithm to estimate the missing or invalid values and fills them in according to the estimates.
6. The normal distribution-based data optimization method according to claim 1, wherein: in the second step, when the degree of association in the data is low, the variables with missing or invalid values are directly removed by variable deletion or whole-case deletion;
and when the data consists of paired variables and the missing or invalid value belongs to a paired variable, pairwise deletion is performed on the variables with missing or invalid values.
7. The normal distribution-based data optimization method according to claim 1, wherein: in the fourth step, when the preferred data are the larger values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (x0, +∞) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
8. The normal distribution-based data optimization method according to claim 1, wherein: in the fourth step, when the preferred data are the smaller values among all data, the area enclosed by the normal distribution curve, the x-axis, and x ∈ (−∞, x0) is set as A, where A ∈ (0, 1);
the user specifies the percentage of all values that the preferred values account for, the value A is obtained from that percentage, and x0 is obtained through the inverse of the normal cumulative distribution function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110123387.XA CN112783884A (en) | 2021-01-29 | 2021-01-29 | Data optimization method based on normal distribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110123387.XA CN112783884A (en) | 2021-01-29 | 2021-01-29 | Data optimization method based on normal distribution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112783884A true CN112783884A (en) | 2021-05-11 |
Family
ID=75759629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110123387.XA Pending CN112783884A (en) | 2021-01-29 | 2021-01-29 | Data optimization method based on normal distribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112783884A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969356A (en) * | 2019-12-03 | 2020-04-07 | 浪潮软件股份有限公司 | Method and system for setting index threshold based on normal distribution |
CN111668845A (en) * | 2020-06-16 | 2020-09-15 | 广东工业大学 | Probability load flow calculation method considering photovoltaic correlation |
CN112017025A (en) * | 2020-08-26 | 2020-12-01 | 天元大数据信用管理有限公司 | Enterprise credit assessment method based on fusion of deep learning and logistic regression |
Non-Patent Citations (2)
Title |
---|
张文彤等: "《SPSS统计分析高级教程》", 30 September 2004 * |
李双其等: "《大数据侦查实践》", 30 September 2019 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114510525A (en) * | 2022-04-18 | 2022-05-17 | 深圳丰尚智慧农牧科技有限公司 | Data format conversion method and device, computer equipment and storage medium |
CN114510525B (en) * | 2022-04-18 | 2022-08-30 | 深圳丰尚智慧农牧科技有限公司 | Data format conversion method and device, computer equipment and storage medium |
CN114936208A (en) * | 2022-07-26 | 2022-08-23 | 广州天维信息技术股份有限公司 | Information analysis system based on data cleaning |
CN114936208B (en) * | 2022-07-26 | 2022-09-23 | 广州天维信息技术股份有限公司 | Information analysis system based on data cleaning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 271000 Langchao science and Technology Park, 527 Dongyue street, Tai'an City, Shandong Province; Applicant after: INSPUR SOFTWARE Co.,Ltd.; Address before: No. 1036, Shandong high tech Zone wave road, Ji'nan, Shandong; Applicant before: INSPUR SOFTWARE Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210511 |