CN113537280A

CN113537280A - Intelligent manufacturing industry big data analysis method based on feature selection

Info

Publication number: CN113537280A
Application number: CN202110559197.2A
Authority: CN
Inventors: 吴志生; 曾敬其; 李倩倩
Original assignee: Beijing University of Chinese Medicine
Current assignee: Beijing University of Chinese Medicine
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2021-10-22

Abstract

The invention provides an intelligent manufacturing industrial big data analysis method based on feature selection, and belongs to the field of intelligent manufacturing. The method comprises the following steps: obtaining a representative subset of the original data by a feature selection method, and determining the membership relation between samples in the representative subset and samples in the industrial big data; based on this membership relation, replacing the samples in the industrial big data with the corresponding samples in the representative subset to obtain reconstructed industrial big data; and realizing feature-extraction-based analysis of the industrial big data through the reconstructed industrial big data. The invention adopts a self-organizing neural network to perform feature selection on the industrial big data, and realizes factor analysis, process monitoring and intelligent decision making for intelligent manufacturing industrial big data through the representative subset obtained by feature selection.

Description

Intelligent manufacturing industry big data analysis method based on feature selection

Technical Field

The invention belongs to the field of intelligent manufacturing, relates to an intelligent manufacturing industry big data analysis method, and particularly relates to an intelligent manufacturing industry big data analysis method based on feature selection.

Background

Being data-driven is a typical feature of intelligent manufacturing, and industrial big data analysis is its core. Industrial big data from manufacturing processes is often multi-source and heterogeneous: variables originate from multiple manufacturing units, and their distribution structures differ greatly. Industrial big data analysis usually adopts feature extraction, which discovers data features through correlations among variables, for example by extracting feature variables to replace the original variables; however, compressing the variables also changes the original feature space of the data. Moreover, noise in industrial big data, differences in distribution characteristics among variables, and outliers within variables complicate data feature extraction. A scientific and effective industrial big data analysis method therefore remains a technical difficulty for data-driven intelligent manufacturing.

The invention creatively introduces feature selection into industrial big data analysis and establishes an intelligent manufacturing industrial big data analysis method based on feature selection. In contrast to feature extraction, feature selection reduces the interference of noise on data feature discovery by deriving a representative subset from the original sample set, and does not change the original feature space of the data. In addition, the invention creatively adopts a self-organizing neural network to perform feature selection on industrial big data, and realizes factor analysis, process monitoring and intelligent decision making for intelligent manufacturing industrial big data through the representative subset obtained by feature selection.

Disclosure of Invention

The invention aims to provide an industrial big data analysis method based on feature selection.

Another object of the invention is to provide an application of the method in smart manufacturing, the application content specifically including factor analysis, process monitoring and intelligent decision making.

In order to achieve the above object, in one aspect, the present invention provides a method for analyzing industrial big data based on feature selection, the method comprising the steps of:

step 1: obtaining a representative subset of the original data by a feature selection method, and determining the membership relation between samples in the representative subset and samples in the industrial big data;

step 2: based on the membership relation between the samples in the representative subset and the samples in the industrial big data, replacing the samples in the industrial big data with the samples in the representative subset to obtain reconstructed industrial big data;

step 3: realizing the analysis of the industrial big data based on feature extraction through the reconstructed industrial big data.
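Step 2 above can be illustrated with a minimal sketch. It assumes the membership relation is stored as a list that gives, for each original sample, the index of its representative; the names `reconstruct`, `membership` and `representatives` and the toy values are hypothetical and do not come from the original disclosure.

```python
from typing import List, Sequence


def reconstruct(membership: Sequence[int],
                representatives: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace every original sample with the representative it belongs to."""
    return [list(representatives[k]) for k in membership]


# Hypothetical toy data: 5 original samples assigned to 2 representatives.
representatives = [[0.1, 0.9], [0.8, 0.2]]
membership = [0, 0, 1, 0, 1]  # membership[q] = index of the representative of sample q
reconstructed = reconstruct(membership, representatives)
```

The reconstructed data keeps the original number of samples, so representatives that stand for many original samples appear correspondingly often.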

According to some embodiments of the invention, the feature selection method comprises the following steps:

step one: normalizing the variables in the industrial big data;

step two: setting the number of output-layer neurons and the distance calculation method of the self-organizing feature mapping neural network;

step three: determining the membership relation between the samples in the industrial big data and the output-layer neurons through iterative training of the self-organizing feature mapping neural network;

step four: extracting the weight vectors of all output-layer neurons to form the representative subset, wherein each neuron is a sample, thereby realizing feature selection of the industrial big data.
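The normalization in step one is not specified further in the text. The sketch below assumes column-wise min-max scaling to [0, 1], which is one common choice; the function name `min_max_normalize` is hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each variable (column) to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in data
    ]


# Hypothetical data: 3 samples, 2 variables on very different scales.
normalized = min_max_normalize([[1, 10], [2, 20], [3, 30]])
```

Scaling every variable to the same range keeps variables with large numeric values from dominating the distance calculation in step two.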

According to some embodiments of the invention, the iterative training process of the self-organizing feature mapping neural network is as follows:

(1) Initialization: define the topology a × b of the output-layer neurons of the self-organizing neural network, where the total number of output-layer neurons is K = a × b. Assign random values in the interval [0, 1] to the weights connecting each output-layer neuron to the n input-layer nodes, and normalize them to obtain the output-layer neuron weight matrix w_jp, j = 1, 2, ..., K, p = 1, 2, ..., n.

The initial learning rate η (0) is defined, generally taking a value of 0.2, and the maximum value does not exceed 0.5.

An initial neighborhood size N(0) is defined, typically 1/3 to 1/2 of the larger of the output-layer array dimensions a and b.

(2) Sample input: normalize the training set samples to obtain X_qp, where q = 1, 2, ... indexes the training samples and p = 1, 2, ..., n indexes the input variables.

(3) Finding the winning neuron: calculate the distance between the input training set sample X_q and each neuron weight vector w_j, and determine the neuron with the shortest distance as the winning neuron j^*.

The distance between X_q and w_j may be calculated as the Euclidean distance, Mahalanobis distance, link distance, or the like.
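The winner search in step (3) can be sketched as follows. Two distance functions are shown: the Euclidean distance, and the Chebyshev distance, which Example 1 of this document equates with the connection distance. The function names are hypothetical and the Mahalanobis distance is omitted for brevity.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(x: Vector, w: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))


def chebyshev(x: Vector, w: Vector) -> float:
    return max(abs(a - b) for a, b in zip(x, w))


def find_winner(x: Vector, weights: Sequence[Vector],
                distance: Callable[[Vector, Vector], float] = euclidean) -> int:
    """Return the index j* of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: distance(x, weights[j]))


# Hypothetical weights of three neurons and one normalized sample.
weights = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.4]]
winner = find_winner([0.6, 0.5], weights)
```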

(4) Updating the weight of the winning neuron: adjust the weight vector of the winning neuron j^*:

w_{j^*}(t+1) = w_{j^*}(t) + η(t)[X_q - w_{j^*}(t)]

where w_{j^*}(t) is the weight vector of the winning neuron j^* at the t-th iteration and η(t) is the learning rate at the t-th iteration. The learning rate decreases as the number of iterations increases; a commonly used decay function is η(t) = η(0)(1 - t/T), where T is the set total number of iterations.

(5) Updating the neighborhood weights of the winning neuron: adjust the weight vectors w_j of the other neurons in the square neighborhood of radius N(t) centered on the winning neuron j^*:

w_j(t+1) = w_j(t) + η(t, d)[X_q - w_j(t)]

N(t) = int[N(0)(1 - t/T)]

η(t, d) = η(t)e^{-d}

where int denotes rounding to an integer, and the neighborhood radius decreases as the number of iterations increases. d is the distance between a neuron in the neighborhood and the winning neuron; the closer a neuron is to the winning neuron, the larger its weight adjustment.

(6) End of training: once all training set samples have been input, set t = t + 1 and repeat the training from step (3) until t = T.
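Steps (1) to (6) can be combined into a compact training loop. The following is a minimal sketch under stated assumptions, not the inventors' implementation: it uses a rectangular a × b grid, the Euclidean distance for the winner search, the standard update w ← w + η(t, d)(X − w) with d = 0 for the winner, and it skips normalization of the initial weights. The function name `train_som` and all default values are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_som(samples: Sequence[Sequence[float]], a: int, b: int,
              eta0: float = 0.2, total_iters: int = 100,
              seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Train a small SOM; return (neuron weights, winner index per sample)."""
    rng = random.Random(seed)
    n = len(samples[0])
    k = a * b
    # (1) Initialization: random weights in [0, 1) for K = a * b neurons.
    weights = [[rng.random() for _ in range(n)] for _ in range(k)]
    grid = [(j // b, j % b) for j in range(k)]  # (row, col) of neuron j
    n0 = max(a, b) / 2  # N(0): half of the larger grid dimension

    for t in range(total_iters):
        eta_t = eta0 * (1 - t / total_iters)      # eta(t) = eta(0)(1 - t/T)
        radius = int(n0 * (1 - t / total_iters))  # N(t) = int[N(0)(1 - t/T)]
        for x in samples:  # (2) samples are assumed to be normalized already
            # (3) Winner: neuron with the smallest distance to the sample.
            win = min(range(k), key=lambda j: math.dist(x, weights[j]))
            wr, wc = grid[win]
            # (4) + (5) Update the winner (d = 0) and its square neighborhood.
            for j, (r, c) in enumerate(grid):
                d = max(abs(r - wr), abs(c - wc))
                if d <= radius:
                    rate = eta_t * math.exp(-d)   # eta(t, d) = eta(t) e^(-d)
                    weights[j] = [w + rate * (xi - w)
                                  for w, xi in zip(weights[j], x)]

    # Membership: each sample belongs to its nearest neuron after training.
    membership = [min(range(k), key=lambda j: math.dist(x, weights[j]))
                  for x in samples]
    return weights, membership


# Hypothetical normalized data with two loose groups of samples.
samples = [[0.10, 0.10], [0.12, 0.08], [0.90, 0.90], [0.88, 0.93]]
weights, membership = train_som(samples, a=2, b=2, total_iters=50)
```

The returned `weights` play the role of the representative subset and `membership` records which neuron each original sample belongs to.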

According to some embodiments of the invention, the initial learning rate of the self-organizing feature mapping neural network is 0.2 to 0.5, and the initial neighborhood size is 1/2 to 1/3 of the number of neurons in the output layer.

According to some embodiments of the invention, the number of neurons in the output layer of the self-organizing feature-mapped neural network is greater than 3 × 3.

According to some embodiments of the invention, the distance calculation method for the self-organizing feature mapping neural network comprises Euclidean distance and connection distance.

According to some embodiments of the present invention, the consistency between the representative subset and the feature space of the original data is evaluated by the correlation between their loading matrices of the first and second principal components.

According to some embodiments of the present invention, the correlation coefficient r between two loading matrices A and B is calculated as follows:

r = Σ_m Σ_n (A_mn - Ā)(B_mn - B̄) / sqrt{[Σ_m Σ_n (A_mn - Ā)^2][Σ_m Σ_n (B_mn - B̄)^2]}

where Ā is the mean of matrix A and B̄ is the mean of matrix B. The closer the correlation coefficient r is to ±1, the stronger the correlation between the two matrices.
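A correlation coefficient between two loading matrices can be sketched as below. This assumes the element-wise Pearson form suggested by the use of the matrix means of A and B; the function name `matrix_correlation` and the toy matrices are hypothetical.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def matrix_correlation(a: Matrix, b: Matrix) -> float:
    """Pearson correlation computed over all elements of two same-shape matrices."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same number of elements")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in flat_a)
                    * sum((y - mean_b) ** 2 for y in flat_b))
    return num / den


# Hypothetical 2 x 2 loading matrices that differ only by a scale factor.
r = matrix_correlation([[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.0]])
```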

In summary, the invention creatively adopts a feature selection method to obtain the representative subset of the industrial big data, and the analysis of the industrial big data is realized through the representative subset.

On the other hand, the invention also provides application of the method in intelligent manufacturing industry big data analysis, and the application content specifically comprises factor analysis, process monitoring and intelligent decision.

According to some embodiments of the present invention, the intelligent manufacturing factor analysis is implemented by analyzing correlation coefficients of factors in the reconstructed industrial big data.

According to some embodiments of the present invention, intelligent manufacturing process monitoring is performed by taking the confidence interval of a variable in the reconstructed industrial big data as the standard control limit of that variable, and using the process capability index to monitor the industrial big data process.

According to some embodiments of the invention, the process capability index (C_P) is calculated as follows:

when the technical standard requires control on both sides, C_P = (T_u - T_l)/6σ;

when the technical standard only requires a lower control limit, C_Pl = (μ - T_l)/3σ;

when the technical standard only requires an upper control limit, C_Pu = (T_u - μ)/3σ.

Here T_u is the upper control limit of the technical standard, T_l is the lower control limit of the technical standard, and μ is the mean of the manufacturing process samples. In addition, σ is the population standard deviation of the sample distribution, which can be estimated from the sample standard deviation S according to the regulations of the American Society for Testing and Materials (ASTM). For example, when the number of samples is 25, σ = 1.0105 S.
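The three process capability formulas and the σ estimate can be sketched as follows. The sketch assumes that the correction of S uses the standard bias-correction constant c4(n) for the sample standard deviation, which gives 1/c4 ≈ 1.0105 for 25 samples and so matches the factor quoted in the text; all function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def c4(n: int) -> float:
    """Bias-correction constant for the sample standard deviation of n values."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0))


def estimate_sigma(values: Sequence[float]) -> float:
    """Estimate the population standard deviation as S / c4(n)."""
    return statistics.stdev(values) / c4(len(values))


def cp_two_sided(t_u: float, t_l: float, sigma: float) -> float:
    return (t_u - t_l) / (6.0 * sigma)


def cp_lower_only(mu: float, t_l: float, sigma: float) -> float:
    return (mu - t_l) / (3.0 * sigma)


def cp_upper_only(t_u: float, mu: float, sigma: float) -> float:
    return (t_u - mu) / (3.0 * sigma)


# Correction factor applied to S when the cycle is 25 batches.
factor_25 = 1.0 / c4(25)
```

For a variable where only a lower limit matters, `cp_lower_only(statistics.mean(values), t_l, estimate_sigma(values))` gives the index for one cycle of batches.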

According to some embodiments of the invention, the process capability index is graded as follows: C_P < 0.67, process capability is seriously insufficient; 0.67 < C_P < 1.00, process capability is insufficient; 1.00 < C_P < 1.33, process capability is sufficient; 1.33 < C_P < 1.67, process capability is sufficient; 1.67 < C_P, process capability is too high.
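The grading can be expressed as a small lookup. The labels follow the wording of the text, including the two adjacent bands that are both described as sufficient. The text uses strict inequalities, so the handling of values exactly on a boundary (assigned here to the higher band) is an assumption, and the function name is hypothetical.

```python
def grade_process_capability(cp: float) -> str:
    """Map a process capability index to the grade wording used in the text."""
    bands = [
        (0.67, "seriously insufficient"),
        (1.00, "insufficient"),
        (1.33, "sufficient"),
        (1.67, "sufficient"),
    ]
    for upper, label in bands:
        if cp < upper:
            return label
    return "too high"


grade = grade_process_capability(0.8)
```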

According to some embodiments of the invention, the intelligent manufacturing decision making is performed by independent sample T-test analysis of variables in the reconstructed industrial big data.
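The text does not say which variant of the independent sample T-test is used. As a hedged illustration, the sketch below computes the test statistic and degrees of freedom of Welch's unequal-variance variant; the P value would then be read from a t distribution, which is not included here. The function name `welch_t` and the sample values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float],
            sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na  # squared standard error of mean a
    vb = statistics.variance(sample_b) / nb  # squared standard error of mean b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical values of one variable for two groups of batches.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```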

In conclusion, the invention provides an intelligent manufacturing industry big data analysis method based on feature selection. The method of the invention has the following advantages:

The invention introduces feature selection into industrial big data analysis and establishes an intelligent manufacturing industrial big data analysis method based on feature selection. In contrast to feature extraction, feature selection reduces the interference of noise on data feature discovery by deriving a representative subset from the original sample set, and does not change the original feature space of the data. The invention adopts a self-organizing neural network to perform feature selection on industrial big data, and realizes factor analysis, process monitoring and intelligent decision making for intelligent manufacturing industrial big data through the representative subset obtained by feature selection.

Drawings

Fig. 1 shows the industrial big data of the coated tablet manufacturing process: A. E-R model; B. change of yield with production time.

Fig. 2 shows the correlation analysis results of variables in the industrial big data of coated tablet manufacturing: A. correlation coefficients between variables; B. grey correlation degree between variables; C. distribution of the yield data.

Fig. 3 shows the SOM feature selection results for the industrial big data of coated tablet manufacturing: A. assignment of samples to neurons; B. connection distance between neurons; C. distribution of neuron weights.

Fig. 4 shows the consistency evaluation between the representative subset and the feature space of the raw data, and its influencing factors: A. PCA result of the raw data; B. PCA result of the representative subset; C. influence of SOM neural network model parameters on the correlation coefficient of the loading matrices.

Fig. 5 shows the product yield factor analysis results based on industrial big data feature selection: A. change of the mean and standard deviation of product yield with production time; B. correlation coefficients with product yield; C. linear regression of product yield.

Fig. 6 shows the dissolution process monitoring results based on industrial big data feature selection: A. paired T test results for dissolution in the raw data and the representative subset; B. change of the process capability index of granule and tablet dissolution with production time.

Fig. 7 shows the intelligent decision results for raw material manufacturers based on industrial big data feature selection: A. distribution of raw material manufacturers over production time; B. differences in product yield and dissolution among manufacturers.

Detailed Description

The following detailed description is provided for the purpose of illustrating the embodiments and the advantageous effects thereof, and is not intended to limit the scope of the present disclosure.

Example 1: feature selection for coated tablet industry big data

(1) Industrial big data description of coated tablets

The data comprise 907 batches of industrial big data for a certain coated tablet, produced from November 2013 to June 2018. The coated tablet is composed of L, S, H and A, and the variables of the industrial big data comprise 16 quality attributes of the coated tablet manufacturing process, covering four manufacturing units: granulation, tabletting, coating and packaging. The E-R model of the coated tablet industrial big data is shown in Fig. 1A. In addition, each batch in the industrial big data has an output of 5 million, 2 batches are produced per day, 10 to 30 batches are produced per month, and the production months vary from year to year. Notably, the change in yield with production time (Fig. 1B) shows that the product yield of the coated tablet decreases continuously; from January 2018 to June 2018, although the granule yield and the coating yield increased significantly, the decline in product yield was still not resolved. Therefore, industrial big data analysis is needed to identify the key cause of the declining product yield, so as to improve the production efficiency of the coated tablet and save costs.

(2) Feature extraction of industrial big data

The industrial big data of the coated tablet manufacturing process is multi-source and heterogeneous: variables originate from multiple manufacturing units, and their distribution structures differ greatly. For example, the RSD of the yields ranges from 0.63% to 1.53%, that of the material balances from 0.60% to 0.78%, that of the granule dissolution rates from 2.08% to 2.27%, and that of the tablet dissolution rates from 5.67% to 6.22%. In addition, the granule moisture content and the coating weight gain vary widely, with RSDs of 10.17% and 15.01%, respectively. The statistical results of the variables in the industrial big data are shown in Table 1. Therefore, variable compression methods such as principal component analysis (PCA), locality preserving projection (LPP) and Laplacian eigenmaps (LE) should be used with caution for feature extraction of industrial big data, because the variables cannot simply be given equal weight.
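The relative standard deviation (RSD) quoted for each variable is the sample standard deviation expressed as a percentage of the mean. A minimal sketch with hypothetical values:

```python
import statistics
from typing import Sequence


def rsd_percent(values: Sequence[float]) -> float:
    """Relative standard deviation in percent: 100 * S / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical measurements of one variable over three batches.
rsd = rsd_percent([9.0, 10.0, 11.0])
```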

Unlike data collected in a laboratory, industrial big data is collected over a long period and is subject to many interfering factors during sample collection, so data noise can mask the characteristics of, and the degree of correlation among, variables. Taking the yield data as an example, the correlation with product yield (see Fig. 2A) ranks as tabletting yield > granule yield > coating yield (0.72 > 0.08 > 0.02); the correlation coefficients between variables are low, and so is the reliability of the factor analysis results.
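The correlation coefficients compared above are assumed here to be Pearson coefficients; the text does not name the coefficient explicitly. A minimal sketch with hypothetical yield values:

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mean_x) ** 2 for a in x)
                           * sum((b - mean_y) ** 2 for b in y))


# Hypothetical unit yield and product yield (%) for four batches.
unit_yield = [98.0, 97.5, 96.0, 95.0]
product_yield = [96.0, 95.5, 94.0, 93.0]
r_yield = pearson(unit_yield, product_yield)
```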

In terms of the relation among samples, the samples of industrial big data are dynamically associated in a time series, so the grey correlation degree can be used instead of the correlation coefficient to evaluate the degree of association between variables (see Fig. 2B). However, when the grey correlation degree is used, the distribution characteristics and outliers of the variables must be considered. The yields of granules and tablets clearly do not follow a normal distribution, and the samples contain many outliers (Fig. 2C), which affects the calculated grey correlation degree. In summary, the results reveal the multi-source, heterogeneous nature of the industrial big data of the coated tablet manufacturing process, while data noise, differences in distribution characteristics among variables and outliers within variables make data feature extraction complex.

Table 1. Statistical results of variables in the industrial big data of the coated tablet manufacturing process

(3) Feature selection for industrial big data

The invention uses the SOM algorithm to perform feature selection on the industrial big data. The output-layer neural network has a 6 × 6 hexagonal topology, the initial learning rate η(0) is 0.02, the initial neighborhood size N(0) is 1/2 of the larger dimension of the output-layer array, the number of iterations T is 1000, and the distance between a sample and a neuron is calculated as the connection distance, namely the Chebyshev distance. Based on the minimum-distance principle, the 907 samples are assigned to 36 neurons (Fig. 3A); each neuron contains 16 weights, corresponding to the 16 variables in the samples. The sample set formed by the 36 neurons is the representative subset of the industrial big data and contains the characteristic information of the original data. The difference between adjacent neurons can be characterized by the connection distance (Fig. 3B). Owing to the neighborhood weight update strategy, the weight differences between adjacent neurons are small (Fig. 3C), so abnormal samples can be identified quickly. For example, the product yield of the sample at the top right is much lower than that of the other samples, and the tabletting yield shows a similar result.

Example 2: effect of model parameters on feature selection results

The industrial big data analysis method based on feature selection provided by the invention presupposes that the representative subset is consistent with the feature space of the original data, where consistency covers both the variable types in the feature space and the association relations among variables. Here, PCA is used to perform feature extraction on the raw data and on the representative subset separately, and the association among variables is characterized by the loading matrices of the first and second principal components, as shown in Fig. 4A and Fig. 4B; the similarity of the two loading matrices is 0.9955. The results show that the representative subset is consistent with the feature space of the original data.

It should be noted that the representative subset is extracted by the SOM neural network model, and improper model parameters may lead to insufficient extraction of the features of the raw data. The number of output-layer neurons determines the number of samples in the representative subset, and the distance calculation method determines how the samples in the representative subset are selected. The results show that when the number of output-layer neurons is 3 × 3, the correlation coefficient of the loading matrices is significantly lower, and that when the Mahalanobis distance is used, the correlation coefficient of the loading matrices is significantly lower than with the other two distance calculation methods, as shown in Fig. 4C. Therefore, when the SOM neural network model is used for feature selection, the number of output-layer neurons should be greater than 3 × 3, and the Mahalanobis distance should be avoided as the distance calculation method.

Example 3: application of feature selection in factor analysis

After feature selection on the industrial big data, a representative subset of 36 samples is obtained; the complexity and noise of the data are reduced, and the reliability of factor analysis is greatly improved. Taking the product yield as an example, the mean and standard deviation of the product yield over every 30 consecutive batches, plotted against production time (Fig. 5A), show that the product yield of the coated tablets declines continuously, with a cliff-like drop between January 2018 and June 2018. First, correlation coefficient analysis was performed on the variables (Fig. 5B): the correlation with product yield ranks as tabletting yield > granule yield > coating yield (0.88 > 0.56 > 0.27), and the correlation between tabletting yield and product yield is much higher than that of the other manufacturing units. In addition, product yield and tabletting yield are positively correlated (Fig. 5C), and the regression fits well (P < 0.001). In conclusion, feature selection improves the reliability of factor correlation and regression analysis, and the results reveal that the decline in tabletting yield during the coated tablet manufacturing process is the key factor behind the decline in product yield.

Example 4: application of feature selection in process monitoring

To further explore the advantages of feature selection in industrial big data analysis, an application of feature selection to process monitoring was developed. The process capability index (C_P) evaluates quality stability by calculating the ratio between the population standard deviation σ of the samples and the control limits of the technical standard, and is widely used for monitoring industrial manufacturing processes. However, for factors in the manufacturing process that lack technical-standard control limits, a method for monitoring quality stability still needs to be established. Here, we propose using the confidence interval of a factor in the representative subset as the standard control limit for that factor, enabling C_P-based process monitoring of quality stability. It should be noted that although the samples in the representative subset retain the characteristics of the original data, the standard control limit must be calculated by a weighting method that takes into account the frequency of each sample in the original data.
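The weighting idea can be sketched as follows. The exact confidence interval formula is not recoverable from the text, so this sketch assumes a normal-theory interval, mean ± z × standard deviation with z = 1.96, where the mean and standard deviation of the representative samples are weighted by how many original samples each one represents. The function names, the value of z and the toy data are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def weighted_mean_std(values: Sequence[float],
                      counts: Sequence[int]) -> Tuple[float, float]:
    """Mean and standard deviation of values, weighted by sample frequency."""
    total = sum(counts)
    mean = sum(v * c for v, c in zip(values, counts)) / total
    var = sum(c * (v - mean) ** 2 for v, c in zip(values, counts)) / total
    return mean, math.sqrt(var)


def control_limits(values: Sequence[float], counts: Sequence[int],
                   z: float = 1.96) -> Tuple[float, float]:
    """Assumed limits: weighted mean -/+ z * weighted standard deviation."""
    mean, std = weighted_mean_std(values, counts)
    return mean - z * std, mean + z * std


# Hypothetical neuron weights for one variable and the number of original
# samples assigned to each neuron.
neuron_values = [1.0, 3.0]
neuron_counts = [1, 1]
lower, upper = control_limits(neuron_values, neuron_counts)
```

When higher values are better, only `lower` would be used as the control limit in the one-sided process capability formula.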

Taking the dissolution rates of granules and plain tablets as an example, the dissolution values become smaller after feature selection, but the difference is not significant (paired T test, P > 0.1), indicating that feature selection does not alter the data distribution of the dissolution rates; the paired T test results for dissolution are shown in Fig. 6A. Since higher dissolution is better for both granules and plain tablets, only the lower control limit of dissolution was used to calculate the C_P of the manufacturing process, with a cycle of 25 batches (see Fig. 6B). The results show that the process capability for granule dissolution is insufficient (C_P < 1.00) and that its quality control method needs further improvement; the C_P of the dissolution of granule H shows an overall upward trend and can serve as an entry point for improving the quality control of granule dissolution. In addition, the quality control method for plain tablet dissolution also needs improvement; the dissolution of plain tablet A had sufficient process capability from May 2017 to December 2017 (C_P > 1.00) and can serve as an entry point for improving the quality control of plain tablet dissolution.

Example 5: application of feature selection in intelligent decision making

The core of industrial big data analysis is to realize data-driven intelligent decision making for the manufacturing process. Feature selection reduces the complexity of the data without changing its feature space, thereby improving the reliability of intelligent decisions in the manufacturing process. Take the decision analysis of the raw material manufacturers for the coated tablet as an example. Raw material L has four manufacturers, of which L1 and L2 are the two main ones, while raw material S and raw material H each have two manufacturers; the distribution of raw material manufacturers over production time is shown in Fig. 7A.

The differences in product yield and dissolution rate between manufacturers were analyzed by independent sample T test, see Fig. 7B. The results show that the product yield of L2 is significantly higher than that of L1 (P < 0.01); in addition, although L3 is used less frequently, its product yield is also significantly higher than that of L1 (P < 0.01), so raw materials from manufacturers L2 and L3 help improve product yield. However, the granule dissolution rate of L3 is significantly lower than that of L1 (P < 0.01), so L2 is recommended as the manufacturer of raw material L. In addition, the product yield of H2 is significantly higher than that of H1 (P < 0.01), and its granule dissolution rate is also higher (P < 0.01); although the plain tablet dissolution rates do not differ significantly, H2 is selected as the manufacturer of raw material H to improve product yield.

Claims

1. The industrial big data analysis method based on feature selection is characterized in that a representative subset of industrial big data is obtained by the feature selection method, and industrial big data analysis is realized through the representative subset, and the method comprises the following steps:

2. The method of claim 1, wherein the feature selection method in step 1 comprises the steps of:

3. The method of claim 2, wherein in step two the self-organizing feature mapping neural network has an initial learning rate of 0.2 to 0.5 and an initial neighborhood size of 1/2 to 1/3 of the number of neurons in the output layer.

4. The method of claim 2, wherein in step two the number of neurons in the output layer of the self-organizing feature mapping neural network is greater than 3 × 3.

5. The method of claim 2, wherein in step two the distance calculation method of the self-organizing feature mapping neural network comprises Euclidean distance and connection distance.

6. Use of the method of any of claims 1 or 2 in intelligent manufacturing industry big data analytics, including in particular factor analysis, process monitoring and intelligent decision making.

7. The application of claim 6, wherein the application in factor analysis is realized by analyzing correlation coefficients of factors in the reconstructed industrial big data.

8. The application of claim 6, wherein in the process monitoring, the standard control of the variable is set through the confidence interval of the variable in the reconstructed industrial big data, and the process monitoring of the industrial big data is realized by adopting the process capability index.

9. The application of the method according to claim 6, wherein the application of the method in intelligent decision making is realized by independent sample T-test analysis of variables in the reconstructed industrial big data.

10. The use of claim 6, comprising pharmaceutical manufacturing process big data containing more than ten thousand data points.