CN115455359A

CN115455359A - Automatic correction and distribution fitting method for small-batch error data

Info

Publication number: CN115455359A
Application number: CN202210876577.3A
Authority: CN
Inventors: 曾静文; 李晓蕊; 杨扬; 邓晓春; 郭双明; 樊娜娜; 陈氖华
Original assignee: Chengdu Aircraft Industrial Group Co Ltd
Current assignee: Chengdu Aircraft Industrial Group Co Ltd
Priority date: 2022-07-25
Filing date: 2022-07-25
Publication date: 2022-12-09

Abstract

The invention belongs to the field of aviation production and in particular relates to a method for automatic correction and distribution fitting of small-batch error data based on the Anderson-Darling test and cyclic estimation. The method comprises the following steps: reading annual error data for the same feature of the product from the record table; removing abnormal data from the error data; constructing Anderson-Darling test statistics under four continuous distributions; constructing the p-value of the test statistic under each continuous distribution; automatically correcting each data point by cyclic estimation; and repeating the correction with different compensation values to find the optimal compensation value. The method automatically searches for the optimal correction value, quickly and effectively compensates for data deviation caused by improper measurement methods or changes of operator, brings the data set closer to the true distribution of the data, and provides a basis for subsequent statistical process control and control chart construction.

Description

Automatic correction and distribution fitting method for small-batch error data

Technical Field

The invention belongs to the field of aviation production and in particular relates to a method for automatic correction and distribution fitting of small-batch error data based on the Anderson-Darling test and cyclic estimation.

Background

As the pinnacle of the industrial system, the aviation industry controls product quality strictly. The distribution characteristics of the error between the measured value and the theoretical value of a product reflect quality information about the manufacturing process and are therefore a basic foundation for statistical process control, optimization and production management. However, because of improper measurement methods and changes of operator, the recorded values of error data tend to deviate from the true values, which makes statistical distribution inference especially difficult for small batches of error data. Accurate automatic correction of small-batch error data and analysis of its statistical distribution characteristics are therefore of great significance.

At present, most of the literature, for example "Application of multivariate statistical process control in the reverse flotation production process", "State evaluation of wind turbine gearboxes integrating SCADA data", "Research and practice of on-line monitoring and statistical process control of the leveling process" and "Comparative analysis of data preprocessing methods based on typical data sets", removes abnormal data by the quartile method before statistical process control is carried out, i.e. data lying beyond the upper and lower quartiles are removed from the sample. This approach suits large samples. For a small-batch process, further reducing the sample size is detrimental to distribution fitting; a more reasonable approach is to recover the true values of the data by data correction so that the statistical distribution characteristics can be obtained accurately. However, current data preprocessing methods are limited to normality transformation, standardization and normalization. Normality transformations, such as the Box-Cox transformation and the Johnson transformation, improve the normality and symmetry of the data; standardization and normalization make the data dimensionless by mathematical operations. These methods apply only to data that follow a normal distribution, whereas real manufacturing error data may also follow a truncated normal distribution, a gamma distribution, a t distribution, and so on.

Disclosure of Invention

In order to overcome the problems in the prior art, the invention provides a method for automatic correction and distribution fitting of small-batch error data based on the Anderson-Darling test and cyclic estimation. Anderson-Darling test statistics are constructed under different continuous distributions (normal distribution, truncated normal distribution, gamma distribution and t distribution), and the statistical distribution type of the error data is determined from the p-values of the test statistics under these distributions. Based on that distribution type, cyclic estimation is applied: the data set is randomly divided into a history set and an observation set, the correction with the highest overall distribution-fitting p-value is selected for each data point in the observation set, and different data are cyclically selected as the observation set until every data point is optimally corrected or the p-value converges, which completes the automatic correction of the data.

In order to realize the invention, the technical scheme is as follows:

an automatic correction and distribution fitting method for small batch error data,

the method specifically comprises the following steps:

Step 1: Read annual error data for the same feature of a small-batch production product from the record table;

Step 2: Remove abnormal data from the error data to obtain the initial data set D = {x_i, i = 1, …, n};

Step 3: Construct Anderson-Darling test statistics under four continuous distributions: normal distribution, truncated normal distribution, gamma distribution and t distribution;

Step 4: According to the limiting distribution of the Anderson-Darling test statistic A², construct the p-values of the Anderson-Darling test statistics under the four continuous distributions (normal, truncated normal, gamma and t), and determine the statistical distribution type of the error data by comparing the p-values. The higher the p-value, the higher the goodness of fit, i.e. the determined distribution type is j* = argmax_{j=1,2,3,4} p_j;

Step 5: Based on the obtained distribution type j*, automatically correct each data point by cyclic estimation: preset a compensation value δ, randomly shuffle the data set D and divide it into a history set D₁ and an observation set D₂, specify the data correction strategies, and iterate until the final corrected data are obtained;

Step 6: Set different compensation values δ and repeat step 5 to find the optimal compensation value; under that compensation value, obtain the optimally corrected data set D' and solve for the parameters of distribution j* by maximum likelihood estimation.

Further, in step 3, in order to measure goodness of fit between the real data distribution and the theoretical distribution, the Anderson-Darling test statistics under four continuous distributions of normal distribution, truncated normal distribution, gamma distribution and t distribution are constructed as follows:

in the formula:

A_j² denotes the Anderson-Darling test statistic of the j-th hypothesized distribution and measures the difference between the hypothesized distribution and the true distribution of the data; the smaller A_j² is, the closer the true data distribution is to the hypothesized distribution; n is the number of samples, and F_D(x) is the distribution function of the sample;

the normal distribution, truncated normal distribution, gamma distribution and t distribution are the four distribution types that best fit error data in the field of aviation manufacturing; F_j(x) is the theoretical distribution function of the j-th hypothesized distribution:

in the formula: Γ denotes the gamma function, and μ, σ, a, b, α, β and v denote the parameters of the respective distributions.
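For orientation, the textbook form of the Anderson-Darling statistic can be written in the notation used here (n, F_D, F_j), with the statistic of the j-th hypothesized distribution written as A_j^2. The second line is the usual computing form for the ordered sample x_(1) ≤ … ≤ x_(n). This is the standard definition and is only assumed to correspond to the formula referred to in this passage; it is not a transcription of it.

```latex
A_j^2 = n \int_{-\infty}^{\infty}
        \frac{\bigl(F_D(x) - F_j(x)\bigr)^2}{F_j(x)\,\bigl(1 - F_j(x)\bigr)}
        \,\mathrm{d}F_j(x)
      = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1)
        \Bigl[ \ln F_j\bigl(x_{(i)}\bigr)
             + \ln\bigl(1 - F_j\bigl(x_{(n+1-i)}\bigr)\bigr) \Bigr]
```

The weight 1 / (F_j (1 - F_j)) emphasizes the tails of the distribution, which is why the statistic is sensitive to the kind of skew that distinguishes a gamma fit from a normal fit.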

Further, the specific method of step 4 is as follows:

According to the limiting distribution of the Anderson-Darling test statistic A², the p-values for the four distributions (j = 1, 2, 3, 4) are constructed as follows:

p_j denotes the p-value of the Anderson-Darling test statistic of the j-th hypothesized distribution. The p-value is a probability between 0 and 1 that expresses the goodness of fit between the true data distribution and the theoretical distribution qualitatively and intuitively: the larger p_j is, the higher the goodness of fit. The statistical distribution type of the error data can therefore be determined by comparing the p-values, i.e. the determined distribution type is j* = argmax_{j=1,2,3,4} p_j.
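The selection rule (pick the candidate distribution with the largest p-value) can be sketched as follows. The p-value construction used in the method is not reproduced in this text, so the sketch swaps in a parametric bootstrap of A², which is a different technique with the same purpose. To stay within the Python standard library it uses only two stand-in candidates, a normal distribution and an exponential distribution (a gamma distribution with shape 1), rather than the four distributions of the method. All function names and the demo sample are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ad_statistic(sample, cdf):
    """Anderson-Darling A^2 of `sample` against the distribution function `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # keep log() finite when the CDF returns exactly 0 or 1
    f = [min(max(cdf(x), eps), 1.0 - eps) for x in xs]
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
            for i in range(n))
    return -n - s / n


def fit_normal(sample):
    """Return (cdf, sampler) for a normal distribution fitted to `sample`."""
    dist = NormalDist(mean(sample), stdev(sample))
    return dist.cdf, lambda rng: rng.gauss(dist.mean, dist.stdev)


def fit_exponential(sample):
    """Return (cdf, sampler) for an exponential distribution fitted to `sample`."""
    scale = mean(sample)
    cdf = lambda x: 1.0 - math.exp(-x / scale) if x > 0 else 0.0
    return cdf, lambda rng: rng.expovariate(1.0 / scale)


def bootstrap_p_value(sample, fit, n_sim=200, seed=0):
    """Parametric-bootstrap p-value of A^2 (assumption: not the method's formula)."""
    rng = random.Random(seed)
    cdf, sampler = fit(sample)
    observed = ad_statistic(sample, cdf)
    exceed = 0
    for _ in range(n_sim):
        sim = [sampler(rng) for _ in sample]
        sim_cdf, _unused = fit(sim)  # refit, mirroring the estimation step
        if ad_statistic(sim, sim_cdf) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


def select_distribution(sample, candidates, **kwargs):
    """Return (best_name, p_values): the candidate with the largest p-value."""
    p_values = {name: bootstrap_p_value(sample, fit, **kwargs)
                for name, fit in candidates.items()}
    return max(p_values, key=p_values.get), p_values


# Hypothetical, deterministic right-skewed sample (exponential quantiles).
n = 60
demo = [-0.008 * math.log(1.0 - (i + 0.5) / n) for i in range(n)]
best, p_values = select_distribution(
    demo, {"normal": fit_normal, "exponential": fit_exponential})
print(best, p_values)
```

Adding the truncated normal, gamma and t distributions would only require further `fit_*` functions that return a fitted distribution function and a sampler.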

Further, in step 5, based on the obtained distribution type j*, each data point is automatically corrected by cyclic estimation.

Still further, the step 5 specifically includes the following steps:

Step 501: Preset a compensation value δ; it is recommended that the compensation value be set to an integer multiple of the data recording precision;

Step 502: Randomly shuffle the data set D^(r-1) obtained after the (r-1)-th cycle of correction and divide it, according to the proportion of 8, into a history set and an observation set;

Step 503: There are three data correction strategies: subtracting the compensation value (x - δ), keeping the value unchanged (x), and adding the compensation value (x + δ). Each data point x in the observation set is merged with the history set into a new data set, and the p-values of the three correction strategies under distribution type j* are calculated and recorded. The correction with the highest p-value is selected to correct x; for example, if subtracting the compensation value gives the highest p-value, x is corrected to x - δ. This step is repeated for all other data in the observation set to obtain the corrected observation set;

Step 504: Denote the data set after the r-th cycle of correction as D^r and its p-value as p^r;

Step 505: Compare p^r and p^(r-1). If the difference is negligible (p^r - p^(r-1) < 0.001), the correction ends; otherwise let r = r + 1 and repeat steps 502-504.
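Steps 501-505, together with the outer search over compensation values in step 6, can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: the distribution type j* is taken to be normal, the p-value comes from the published D'Agostino-Stephens approximation for the Anderson-Darling normality test (standing in for the p-value construction of step 4), the history set is assumed to hold 80% of the data, candidates are rounded to 9 decimals to suppress floating-point noise, and ties favour leaving a value unchanged. All names and the demo data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ad_normal_p_value(data):
    """Approximate Anderson-Darling p-value for normality (parameters estimated)."""
    xs = sorted(data)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    eps = 1e-12  # keep log() finite at the extremes
    f = [min(max(dist.cdf(x), eps), 1.0 - eps) for x in xs]
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
            for i in range(n))
    a = (-n - s / n) * (1.0 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment
    if a < 0.2:
        return 1.0 - math.exp(-13.436 + 101.14 * a - 223.73 * a * a)
    if a < 0.34:
        return 1.0 - math.exp(-8.318 + 42.796 * a - 59.938 * a * a)
    if a < 0.6:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a * a)
    if a < 10.0:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a * a)
    return 0.0  # far beyond the range of the approximation


def cyclic_correction(data, delta, p_value, history_frac=0.8, tol=1e-3,
                      max_cycles=50, seed=0):
    """Steps 502-505: shuffle, split, correct the observation set, repeat."""
    rng = random.Random(seed)
    current = list(data)
    p_prev = p_value(current)
    for _ in range(max_cycles):
        shuffled = current[:]
        rng.shuffle(shuffled)                      # step 502: shuffle ...
        cut = round(history_frac * len(shuffled))  # ... and split (assumed 8:2)
        history, observed = shuffled[:cut], shuffled[cut:]
        corrected = []
        for x in observed:                         # step 503: try three strategies
            options = (x, round(x - delta, 9), round(x + delta, 9))
            corrected.append(max(options, key=lambda c: p_value(history + [c])))
        current = history + corrected              # step 504: data set of cycle r
        p_now = p_value(current)
        if p_now - p_prev < tol:                   # step 505: convergence check
            break
        p_prev = p_now
    return current, p_now


def search_delta(data, deltas, p_value, **kwargs):
    """Step 6: run the correction for every delta and keep the highest p-value."""
    results = {d: cyclic_correction(data, d, p_value, **kwargs) for d in deltas}
    best = max(results, key=lambda d: results[d][1])
    return best, results[best]


# Hypothetical demo data recorded to a precision of 0.001.
demo_rng = random.Random(1)
raw = [round(demo_rng.gauss(0.010, 0.004), 3) for _ in range(40)]
best_delta, (corrected, p_best) = search_delta(
    raw, [0.001, 0.002, 0.003], ad_normal_p_value)
print(best_delta, round(p_best, 3))
```

Passing the p-value function as a parameter keeps the loop independent of the distribution type, so a gamma or t based p-value could be substituted without changing `cyclic_correction`.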

Compared with the prior art, the invention has the following advantages:

The method automatically searches for the optimal correction value and is suitable for the various small-batch production processes in the field of aviation manufacturing. It quickly and effectively compensates for data deviation caused by improper measurement methods or changes of operator, brings the data set closer to the true distribution of the data, and provides a basis for subsequent statistical process control and control chart construction.

Drawings

Fig. 1 is a block diagram of the automatic correction process.

Fig. 2 is a fit of the raw data under four distributions.

Fig. 3 is a graph of the p-value of the gamma distribution of the corrected data set at different compensation values.

Fig. 4 shows the gamma distribution fit of the corrected data set at the optimal compensation value (δ = 0.006).

Fig. 5 is a maximum likelihood estimation of gamma distribution parameters.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are for explaining the present invention and not for limiting the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The following description of the embodiments of the present invention is made with reference to the accompanying drawings and examples, and the present invention is not limited to the embodiments.

Example 1

As shown in fig. 1, a method for automatic correction and distribution fitting of small batches of error data,

the method specifically comprises the following steps:

Step 2: Remove abnormal data using prior information about the production process, i.e. remove the data that do not meet the technical requirements of the product feature (corresponding to nonconforming products), to obtain the initial data set D = {x_i, i = 1, …, n};

Step 3: According to basic statistics, many random variables follow a normal distribution, for example measurement error, product weight and human height. In production, error data are therefore generally assumed by default to be normally distributed and are analysed on that basis. However, because of improper measurement methods, changes of operator and other factors, the recorded values of error data often deviate from the true values, and the resulting data do not necessarily follow a normal distribution. Verification on field data shows that error data most probably follow one of four distributions: normal, truncated normal, gamma or t. To determine the actual distribution of the data more accurately, Anderson-Darling test statistics are constructed under these four continuous distributions;

Step 4: According to the limiting distribution of the Anderson-Darling test statistic A², construct the p-values of the Anderson-Darling test statistics under the four continuous distributions (normal, truncated normal, gamma and t), and determine the statistical distribution type of the error data by comparing the p-values. The higher the p-value, the higher the goodness of fit, i.e. the determined distribution type is j* = argmax_{j=1,2,3,4} p_j;

Step 5: Based on the obtained distribution type j*, automatically correct each data point by cyclic estimation: preset a compensation value δ, randomly shuffle the data set D and divide it into a history set D₁ and an observation set D₂, specify the data correction strategies, and iterate until the final corrected data are obtained;

Step 6: Set different compensation values δ and repeat step 5 to find the optimal compensation value; under that compensation value, obtain the optimally corrected data set D' and solve for the parameters of distribution j* by maximum likelihood estimation.

Further, in the step 3, in order to measure the goodness of fit of the real data distribution and the theoretical distribution, the Anderson-Darling test statistic under four continuous distributions, namely normal distribution, truncated normal distribution, gamma distribution and t distribution, is constructed as follows:

in the formula:

A_j² denotes the Anderson-Darling test statistic of the j-th hypothesized distribution and measures the difference between the hypothesized distribution and the true distribution of the data; the smaller A_j² is, the closer the true data distribution is to the hypothesized distribution; n is the number of samples, and F_D(x) is the distribution function of the sample;

in the formula: Γ denotes the gamma function, and μ, σ, a, b, α, β and v denote the parameters of the respective distributions.

Further, the specific method of step 4 is as follows:

The p-values for the four distributions are constructed as follows:

The larger p_j is, the higher the goodness of fit of the distribution; the statistical distribution type of the error data can therefore be determined by comparing the p-values, i.e. the determined distribution type is j* = argmax_{j=1,2,3,4} p_j.

Further, in step 5, based on the obtained distribution type j*, each data point is automatically corrected by cyclic estimation. This quickly and effectively compensates for deviations caused by improper measurement methods or changes of operator and brings the data set closer to the true distribution of the data. The method is suitable for the various small-batch production processes in the field of aviation manufacturing and can accurately and automatically correct small-batch error data, whereas existing methods are better suited to large samples.

Still further, the step 5 specifically includes the following steps:

Step 502: Randomly shuffle the data set D^(r-1) obtained after the (r-1)-th cycle of correction and divide it, according to the proportion of 8, into a history set and an observation set;

Step 503: There are three data correction strategies: subtracting the compensation value (x - δ), keeping the value unchanged (x), and adding the compensation value (x + δ). The correction with the highest p-value is selected to correct x;

Step 504: Denote the data set after the r-th cycle of correction as D^r and its p-value as p^r;

Example 2

This example provides a method for automatic correction and distribution fitting of small-batch error data based on the Anderson-Darling test and cyclic estimation. Anderson-Darling test statistics are constructed under the four distribution types that best fit error data from small-batch production processes in the field of aviation manufacturing (normal, truncated normal, gamma and t distributions), and the distribution type of the data is determined by the goodness-of-fit test. Each data point is then automatically corrected by cyclic estimation, which quickly and effectively compensates for data deviation caused by improper measurement methods or changes of operator, brings the data set closer to the true distribution of the data, and provides a basis for subsequent statistical process control and control chart construction.

Anderson-Darling test statistics are constructed under the error-data distribution types best suited to small-batch production processes in aviation manufacturing (normal, truncated normal, gamma and t distributions). Verification on field data shows that error data most probably follow one of these four distributions, yet error data are often blindly assumed, on the basis of basic statistics alone, to be normally distributed and are analysed accordingly. In practice, factors such as improper measurement methods and changes of operator on the production floor cause the recorded values of error data to deviate from the true values, so the resulting data do not always follow a normal distribution.

The method is suited to small-batch production processes in the field of aviation manufacturing: the true values of the data are recovered by the cyclic-estimation data correction method, so that the statistical distribution characteristics of the data are obtained accurately. At present, most of the literature removes abnormal data by the quartile method, i.e. data lying beyond the upper and lower quartiles are removed from the sample, which suits large samples. For the various small-batch processes in aviation manufacturing, further reducing the sample size is detrimental to distribution fitting; a more reasonable approach is to recover the true values of the data by data correction so that the statistical distribution characteristics can be obtained accurately.

Each data point is automatically corrected by cyclic estimation. A compensation value δ is given in advance, and the data set D is randomly shuffled and divided into a history set D₁ and an observation set D₂. Three correction strategies are specified, and the procedure is iterated until the final corrected data are obtained. This quickly and effectively compensates for deviations caused by improper measurement methods or changes of operator and brings the data set closer to the true distribution of the data.

A flow framework of the method for automatically correcting and fitting the distribution of the small batch of error data is shown in figure 1, and the method comprises the following steps:

Step 1: Read annual error data for the same feature of a small-batch production product from the record table;

Step 2: Remove abnormal data from the error data using prior information about the production process to obtain the initial data set D = {x_i, i = 1, …, n}. For example, to avoid scrapping parts, certain machining processes allow only oversize dimensions, i.e. the error data are required to be positive; in that case the very small number of values recorded as negative can be deleted;

Step 3: The Anderson-Darling test statistics under four continuous distributions (normal, truncated normal, gamma and t) are constructed as follows:

in the formula:

A_j² denotes the Anderson-Darling test statistic of the j-th hypothesized distribution and measures the difference between the hypothesized distribution and the true data distribution; n is the number of samples, F_D(x) is the distribution function of the sample, and F_j(x) is the theoretical distribution function of the j-th hypothesized distribution, as follows:

Step 4: According to the limiting distribution of the Anderson-Darling test statistic A², the p-values for the four distributions are calculated as follows:

The higher the p-value, the higher the goodness of fit of the distribution, so the statistical distribution type of the error data can be determined by comparing the p-values;

Step 5: Based on the distribution type, each data point is automatically corrected by cyclic estimation. This quickly and effectively compensates for deviations caused by improper measurement methods or changes of operator and brings the data set closer to the true distribution of the data.

The step 5 comprises the following steps:

Step 501: A compensation value δ is given in advance; it is recommended that the compensation value be set to an integer multiple of the data recording precision.

Step 502: The data set obtained after the previous cycle of correction is randomly shuffled and divided into a history set and an observation set.

Step 503: There are three data correction strategies: subtracting the compensation value (x - δ), keeping the value unchanged (x), and adding the compensation value (x + δ). The correction with the highest p-value is selected to correct x.

Step 504: Denote the data set after the r-th cycle of correction as D^r and its p-value as p^r.

Step 6: Set different compensation values δ and repeat step 5 to find the optimal compensation value. Under that compensation value, obtain the optimally corrected data set D' and solve for the distribution parameters by maximum likelihood estimation.

Worked example:

Measured error data for an outer-circle quality feature machined on a numerically controlled lathe at an aviation enterprise in a city in Sichuan Province are selected; the small-batch data set contains 60 values. Because the error of the outer circle cannot be negative, the two negative values are removed first, giving the data set D = {x_i, i = 1, …, 58}. The recorded data are shown in Table 1 below.

Table 1. Original data set D

-0.010  -0.004  0       0       0       0       0       0       0.001   0.001
0.001   0.001   0.001   0.002   0.002   0.002   0.002   0.002   0.002   0.002
0.002   0.003   0.003   0.003   0.003   0.004   0.004   0.004   0.004   0.004
0.005   0.005   0.005   0.005   0.005   0.005   0.006   0.007   0.007   0.009
0.010   0.010   0.010   0.010   0.010   0.010   0.011   0.012   0.012   0.013
0.015   0.015   0.015   0.015   0.016   0.020   0.020   0.020   0.020   0.030
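The removal of nonconforming records in step 2 amounts to a simple filter on the values of Table 1. The sketch below assumes the only technical requirement is that the error be non-negative, as stated for this example; the function name and its `lower` parameter are hypothetical.

```python
# Table 1 as recorded: 60 measured errors of the outer-circle feature.
table_1 = [
    -0.010, -0.004, 0, 0, 0, 0, 0, 0, 0.001, 0.001,
    0.001, 0.001, 0.001, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002,
    0.002, 0.003, 0.003, 0.003, 0.003, 0.004, 0.004, 0.004, 0.004, 0.004,
    0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.006, 0.007, 0.007, 0.009,
    0.010, 0.010, 0.010, 0.010, 0.010, 0.010, 0.011, 0.012, 0.012, 0.013,
    0.015, 0.015, 0.015, 0.015, 0.016, 0.020, 0.020, 0.020, 0.020, 0.030,
]


def remove_nonconforming(values, lower=0.0):
    """Drop values below the technical lower limit (here: negative errors)."""
    return [v for v in values if v >= lower]


D = remove_nonconforming(table_1)
print(len(table_1), len(D))  # 60 58
```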

D is fitted with the four continuous distributions (normal, truncated normal, gamma and t); the result is shown in Fig. 2. The Anderson-Darling test statistics under the four distributions are constructed and their p-values calculated, as shown in Table 2 below. The p-value corresponding to the gamma distribution, p₃ = 0.326, is the largest, so the data set is considered to follow the gamma distribution.

Table 2. p-values of the Anderson-Darling test statistics under the four distributions

Based on the gamma distribution, the compensation values δ = 0.0001a, a = 1, …, 10, are set as multiples of the error-data precision of 0.0001. For each compensation value, the data are automatically corrected by cyclic estimation: the data set is randomly divided into a history set and an observation set, the correction with the highest overall distribution-fitting p-value is selected for each data point in the observation set, and different data are cyclically selected as the observation set until every data point is optimally corrected or the p-value converges, which completes the automatic correction. The p-values of the gamma distribution for the corrected data sets at the different compensation values are shown in Fig. 3: the p-value is stable for compensation values between 0.002 and 0.006 and is highest at δ = 0.006, so the optimal compensation value is 0.006. Based on this compensation value, the corrected data set D' is obtained as shown in Table 3 below. In total, the add-compensation strategy was applied to 4 samples, the keep-unchanged strategy to 20 samples and the subtract-compensation strategy to 24 samples.

Table 3. Corrected data set D' at δ = 0.006

Deleted Deleted 0       0       0       0.001   0.001   0.001   0.002   0.002
0.002   0.002   0.003   0.003   0.003   0.004   0.004   0.004   0.004   0.005
0.005   0.005   0.005   0.005   0.005   0.006   0.006   0.006   0.006   0.007
0.007   0.007   0.007   0.007   0.008   0.009   0.009   0.010   0.010   0.010
0.010   0.010   0.010   0.010   0.011   0.012   0.013   0.015   0.015   0.015
0.015   0.015   0.015   0.017   0.017   0.020   0.020   0.025   0.025   0.035

The corrected data set D' is fitted with a gamma distribution; the result is shown in Fig. 4, where the p-value increases from the original 0.326 to 0.9133. Finally, the gamma distribution parameters are estimated by maximum likelihood; the result is shown in Fig. 5, i.e. D' = {x' | x' ~ Gamma(1.257, 0.0063)}. This statistical distribution can be used for subsequent statistical process control, quality monitoring, and the like.
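Maximum likelihood estimation of gamma parameters can be sketched with the standard library as follows. The sketch assumes a (shape, scale) parameterization and strictly positive data, because the gamma log-likelihood is undefined at zero. Table 3 contains zeros and the text does not say how they were treated, so the example runs on simulated data rather than attempting to reproduce the reported estimate. The digamma function is approximated by a numerical derivative of `math.lgamma`; all names are hypothetical.

```python
import math
import random


def digamma(a, rel_step=1e-5):
    """Numerical digamma: central difference of math.lgamma (an approximation)."""
    h = rel_step * a
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)


def gamma_mle(data):
    """Maximum likelihood (shape, scale) of a gamma distribution; needs data > 0."""
    if min(data) <= 0:
        raise ValueError("gamma MLE requires strictly positive data")
    n = len(data)
    m = sum(data) / n
    s = math.log(m) - sum(math.log(x) for x in data) / n
    if s <= 0:
        raise ValueError("data are (numerically) constant; shape is unbounded")
    # The shape solves ln(a) - digamma(a) = s; the left side decreases in a,
    # so bisect on a logarithmic scale between generous bounds.
    lo, hi = 1e-3, 1e4
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    shape = math.sqrt(lo * hi)
    return shape, m / shape  # the ML scale is the sample mean divided by shape


# Simulated check: recover shape 2 and scale 3 from 5000 draws.
rng = random.Random(0)
sample = [rng.gammavariate(2.0, 3.0) for _ in range(5000)]
shape, scale = gamma_mle(sample)
print(round(shape, 3), round(scale, 3))
```

One common workaround for zero-valued records is to shift or censor them before fitting, but which treatment (if any) was used for D' is not stated here.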

Claims

1. The automatic correction and distribution fitting method of the small-batch error data is characterized by comprising the following steps of:

the method specifically comprises the following steps:

Step 2: Remove abnormal data from the error data to obtain the initial data set D = {x_i, i = 1, …, n};

Step 5: Based on the obtained distribution type j*, automatically correct each data point by cyclic estimation: preset a compensation value δ, randomly shuffle the data set D and divide it into a history set D₁ and an observation set D₂, specify the data correction strategies, and iterate until the final corrected data are obtained;

Step 6: Set different compensation values δ and repeat step 5 to find the optimal compensation value; under that compensation value, obtain the optimally corrected data set D' and solve for the parameters of distribution j* by maximum likelihood estimation.

2. The method of claim 1, wherein the method further comprises: in the step 3, in order to measure the goodness of fit of the real data distribution and the theoretical distribution, the Anderson-Darling test statistic under four continuous distributions of normal distribution, truncated normal distribution, gamma distribution and t distribution is constructed as follows:

in the formula:

the smaller A_j² is, the closer the true data distribution is to the hypothesized distribution; n is the number of samples, and F_D(x) is the distribution function of the sample;

in the formula: Γ denotes the gamma function, and μ, σ, a, b, α, β and v denote the parameters of the respective distributions.

3. The method for automatically correcting and fitting a distribution of error data in a small batch of data as recited in claim 2, wherein: the specific method of the step 4 is as follows:

According to the limiting distribution of the Anderson-Darling test statistic A² (j = 1, 2, 3, 4), the p-values for the four distributions are constructed as follows:

4. The method for automatically correcting and fitting a distribution of error data in a small batch of data according to claim 3, wherein: in step 5, based on the obtained distribution type j*, each data point is automatically corrected by cyclic estimation.

5. The method of claim 4, wherein the method further comprises: the step 5 specifically comprises the following steps:

Step 502: Randomly shuffle the data set D^(r-1) obtained after the (r-1)-th cycle of correction and divide it, according to the proportion of 8, into a history set and an observation set;

Step 503: There are three data correction strategies: subtracting the compensation value (x - δ), keeping the value unchanged (x), and adding the compensation value (x + δ). The correction with the highest p-value is selected to correct x, and this step is repeated for all other data in the observation set to obtain the corrected observation set;

Step 504: Denote the data set after the r-th cycle of correction as D^r and its p-value as p^r;

Step 505: Compare p^r and p^(r-1). If the difference is negligible (p^r - p^(r-1) < 0.001), the correction ends; otherwise let r = r + 1 and repeat steps 502-504.