CN116049157B - Quality data analysis method and system - Google Patents

Quality data analysis method and system

Info

Publication number
CN116049157B
Authority
CN
China
Prior art keywords
data
analyzed
indexes
ppm
quality data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310007166.5A
Other languages
Chinese (zh)
Other versions
CN116049157A (en)
Inventor
邓大伟
张彤
洪保成
胡彦
薛铸鑫
王亚
姚帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jinghang Computing Communication Research Institute
Original Assignee
Beijing Jinghang Computing Communication Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jinghang Computing Communication Research Institute
Priority to CN202310007166.5A
Publication of CN116049157A
Application granted
Publication of CN116049157B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Factory Administration (AREA)

Abstract

The invention relates to a quality data analysis method and system, belongs to the technical field of data analysis, and solves the prior-art problem of inaccurate quality analysis caused by redundant inspection characteristic indexes and abnormal data. The method comprises: acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed; removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed; and calculating each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with the corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range. Accurate quality data analysis is thereby achieved.

Description

Quality data analysis method and system
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a quality data analysis method and system.
Background
With the continuous improvement of informatization, weaponry generates large and complex quality data during development and production, including design, process, inspection, production, test, and use. The data consists of structured data in business systems, semi-structured data in detection tools, and unstructured data carried by paper or electronic files. Quality management is closely coupled with product design, production, and management; equipment quality problems and the inspection and transmission of quality data are highly discrete and heterogeneous; and quality management shares many data associations with different information systems. The data contains a large amount of redundancy, missing values, and anomalies, and because of noise the useful information in it cannot be mined effectively. In addition, the quality data is scattered, which makes collecting and sharing quality data resources difficult.
Most quality data is distributed across the computers of individual managers, developers, and technicians, or resides in production test equipment, so acquiring and sharing quality data resources is difficult. Because data resources are hard to fuse and share and tools for data acquisition and analysis are lacking, data utilization is low and quantitative analysis is insufficient, and decisions lack accurate data analysis and statistics as a basis. The data urgently needs to be mined and processed to find the rules behind it and to grasp the quality status of equipment during development, thereby helping quality management, design, and technical personnel make scientific decisions and supporting comprehensive online quality control.
A large amount of test data is formed during equipment development, but quantitative analysis and problem mining of these test data resources for equipment quality are currently lacking. Meanwhile, the indexes covered by the various characteristics involved in equipment quality analysis differ widely, and the index parameters interact with one another, which makes analysis difficult. Accurate quality analysis requires selecting enough indexes to express the target analysis characteristics while keeping these parameters from influencing each other. Discovering and analyzing test problems and abnormal data, so as to predict and give early warning during equipment development and to support improvement of quality weak points, is a key problem to be solved in making quality management and control refined and intelligent.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a quality data analysis method and system to solve the problem of inaccurate quality analysis caused by redundant inspection characteristic indexes and abnormal data.
In one aspect, an embodiment of the present invention provides a quality data analysis method, including the following steps:
Acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed;
Removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed;
Calculating each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with the corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range.
In a further improvement of the method, before the comparison with the corresponding PPM threshold range, the method further comprises: if the amount of data to be analyzed is less than or equal to a quantity threshold, obtaining a fluctuation threshold by constructing a confidence level from the t distribution, and evaluating whether the difference between each PPM value and the ideal PPM value is less than the fluctuation threshold; if so, the data to be analyzed used to calculate the PPM value is retained.
In a further improvement of the method, removing redundant inspection characteristic indexes according to the correlation coefficients to obtain the inspection characteristic indexes to be analyzed comprises: dividing all N inspection characteristic indexes into a plurality of paired combinations by traversal and recursion, wherein the first group of each paired combination has i indexes and the second group has the remaining N-i indexes; and, from the paired combinations satisfying the following conditions, taking any one of the second groups with the fewest indexes as the inspection characteristic indexes to be analyzed: the basic condition is that the correlation coefficient between each index in the first group and all indexes in the second group is greater than a correlation threshold, and the basic condition is no longer met after any index is taken from the second group and added to the first group.
In a further improvement of the method, the correlation coefficient between each index in the first group and all indexes in the second group is obtained by forming a linear combination of the quality data corresponding to each of the two groups of indexes and maximizing the Pearson correlation coefficient between the two linear combinations.
In a further improvement of the method, removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and the variational autoencoder to obtain the data to be analyzed comprises:
based on statistical analysis, applying z-score standardization to the quality data of each inspection characteristic index to be analyzed, and removing as abnormal data the quality data whose standardized value is greater than an abnormal threshold;
feeding the standardized quality data of each inspection characteristic index to be analyzed into a trained variational autoencoder, comparing the resulting output with the input, and removing as abnormal data the quality data whose difference is greater than a difference threshold;
the remaining quality data is used as the data to be analyzed.
In a further improvement of the method, the loss function of the variational autoencoder comprises a reconstruction term and a KL divergence regularization term, and a weight parameter is placed before the KL divergence regularization term to reduce its weight.
In a further improvement of the method, calculating each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed comprises the following steps:
Collecting the number of defects and the severity coefficient of each process to obtain the total number of defects of each process; obtaining, according to the inspection characteristic indexes to be analyzed and the data to be analyzed, the amount of data to be analyzed corresponding to the inspection characteristic indexes of each process as the total number of inspection characteristics of that process; and obtaining the PPM value of each process from the total number of defects of the process and the corresponding total number of inspection characteristics;
summarizing the total number of defects and the total number of inspection characteristics of the processes involved in each product in the production process to obtain the total number of defects and the total number of inspection characteristics of each product; and obtaining the PPM value of each product from the total number of defects of the product and the corresponding total number of inspection characteristics;
summarizing the total number of defects and the total number of inspection characteristics of the products belonging to each model to obtain the total number of defects and the total number of inspection characteristics of each model; and obtaining the PPM value of each model from the total number of defects of the model and the corresponding total number of inspection characteristics.
In a further improvement of the method, performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range comprises: taking the data to be analyzed that is not within the PPM threshold range as sample data; acquiring successful data and calculating a confidence interval of the successful data, the range of the confidence interval indicating whether the sample data lies within the confidence interval; obtaining an envelope upper limit and an envelope lower limit of the successful data according to a preset confidence level, the envelope limits indicating whether the sample data is enveloped; obtaining a qualified upper limit and a qualified lower limit according to a preset tolerance value, the qualified limits indicating whether the sample data is qualified; and generating a sample data analysis result based on whether the sample data is enveloped, whether it is qualified, and whether it lies within the confidence interval.
In a further improvement of the method, acquiring successful data and calculating a confidence interval of the successful data comprises: counting the successful data of each index separately according to the inspection characteristic index to be analyzed corresponding to the sample data; if the amount of successful data is greater than a quantity threshold, constructing the confidence interval with a Gaussian mixture model (GMM) algorithm; otherwise, constructing the confidence interval with the t distribution.
In another aspect, an embodiment of the present invention provides a quality data analysis system, including: an inspection characteristic index acquisition module, configured to acquire quality data and inspection characteristic indexes in the production process, and to remove redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed;
a to-be-analyzed data acquisition module, configured to remove abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed;
and a quality data analysis module, configured to calculate each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed, compare each PPM value with the corresponding PPM threshold range, and perform data envelope analysis on the data to be analyzed that is not within the PPM threshold range.
Compared with the prior art, the invention has at least one of the following beneficial effects: based on the collected equipment quality data, techniques such as data correlation analysis, data anomaly analysis, and small-sample data analysis are used to detect and remove redundancy in the test quality data, to evaluate the confidence of test data under small-sample conditions, and to analyze intelligently whether product data falls within the envelope range, so that hidden quality risks or weak links in the production process are discovered in advance and quality management and control become refined and intelligent.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a flow chart of a quality data analysis method in embodiment 1 of the present invention.
Detailed Description
The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.
Example 1
In one embodiment of the present invention, a quality data analysis method is disclosed which, as shown in Fig. 1, comprises the following steps:
S11: Acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed.
It should be noted that the quality data in the production process in this embodiment is the data in the product test data that affects quality. For example, for inertial navigation gyroscopes, temperature, weight, pressure, and gravity data are acquired as quality data.
Quality data is acquired through structured and unstructured data acquisition and processing, followed by preliminary data cleaning and preprocessing, which includes: detecting missing values and filling the missing data items by Newton interpolation; detecting and eliminating outliers with data mining and state estimation methods; and detecting and deleting duplicate values according to similarity.
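The missing-value step above can be sketched as follows. This is a minimal illustration, assuming that measurements are indexed by position and that a small window of neighbouring known values is used as interpolation nodes; the function names and the `window` parameter are hypothetical and do not come from the patent.

```python
def newton_coefficients(xs, ys):
    """Divided-difference coefficients of the Newton interpolating polynomial."""
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        # Update in place from the end so lower-order differences stay available.
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate the Newton-form polynomial at x with Horner-like nesting."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result


def fill_missing(values, window=2):
    """Fill None entries from up to `window` known neighbours on each side (assumed policy)."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        nodes = [k for k in known if k < i][-window:] + [k for k in known if k > i][:window]
        if not nodes:
            continue  # nothing to interpolate from
        coef = newton_coefficients(nodes, [values[k] for k in nodes])
        filled[i] = newton_eval(nodes, coef, i)
    return filled
```

Limiting the nodes to a local window keeps the polynomial degree low, which avoids the oscillation that high-degree interpolation over a whole series can produce.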
An inspection characteristic index is a quantified inspection characteristic. In general, Pearson bivariate correlation analysis considers only the correlation coefficient between two indexes and cannot fully mine the inherent associations among multiple inspection characteristic indexes. Therefore, this embodiment mines the potential correlation between a redundant inspection characteristic index and the remaining inspection characteristic indexes, screens out the redundant indexes step by step, and improves the ability of the indexes to evaluate product quality.
Specifically, removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed comprises: dividing all N inspection characteristic indexes into a plurality of paired combinations by traversal and recursion, wherein the first group of each paired combination has i indexes and the second group has the remaining N-i indexes; and, from the paired combinations satisfying the following conditions, taking any one of the second groups with the fewest indexes as the inspection characteristic indexes to be analyzed: the basic condition is that the correlation coefficient between each index in the first group and all indexes in the second group is greater than a correlation threshold, and the basic condition is no longer met after any index is taken from the second group and added to the first group.
In the analysis of the inspection characteristic indexes, i may be traversed starting from 1 to obtain the paired combinations, or starting from any number smaller than N, as long as a paired combination satisfying the above conditions is obtained.
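The enumeration of paired combinations and the basic condition can be sketched as below. This is only an illustration: `group_corr` is a hypothetical callable standing in for the group-to-group correlation coefficient, and the default threshold of 0.7 is taken from the worked example in this embodiment.

```python
from itertools import combinations


def paired_combinations(indexes, i):
    """Yield (first_group, second_group) splits with i indexes in the first group."""
    indexes = list(indexes)
    for first in combinations(indexes, i):
        second = [k for k in indexes if k not in first]
        yield list(first), second


def meets_basic_condition(first, second, group_corr, threshold=0.7):
    """True if every index of the first group correlates with the whole second group above threshold."""
    return all(group_corr([b], second) > threshold for b in first)
```

For N = 7 and i = 1, `paired_combinations(range(1, 8), 1)` yields seven splits, one for each index placed alone in the first group.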
It should be noted that the correlation coefficient between each index in the first group and all indexes in the second group is obtained by forming a linear combination of the quality data corresponding to each of the two groups of indexes such that the Pearson correlation coefficient between the two linear combinations is maximized, as expressed by the following formula:
where w_1 and w_2 are the weight vectors of the linear combination of the first group of quality data and of the second group of quality data, respectively, Σ_12 is the cross-covariance matrix between the first group and the second group, Σ_11 is the covariance matrix of the first group, and Σ_22 is the covariance matrix of the second group.
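The formula itself is not reproduced in this text. A form consistent with the symbols defined above, assuming the standard canonical correlation objective, would be:

```latex
\rho = \max_{w_1,\, w_2}
\frac{w_1^{\top} \Sigma_{12}\, w_2}
     {\sqrt{w_1^{\top} \Sigma_{11}\, w_1}\;\sqrt{w_2^{\top} \Sigma_{22}\, w_2}}
```

Here \rho is the maximized Pearson correlation between the two linear combinations; the symbol \rho is introduced only for this illustration.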
As an illustration, take the production test of a twist needle. The correlations among 7 inspection characteristic indexes (including needle body length, needle body fatness after closing, coaxiality, empty needle, reverse direction, and loose wire) are calculated from the quality data of the twist needle, and the indexes are numbered 1, 2, 3, 4, 5, 6 and 7. The correlation threshold is set to 0.7: a correlation coefficient greater than 0.7 indicates a high degree of correlation, and the index concerned is treated as a redundant inspection characteristic index.
Let B denote the first group and A the second group. The first traversal starts from i=1 and decomposes the indexes into: t1={A=[2,3,4,5,6,7],B=[1]}, t2={A=[1,3,4,5,6,7],B=[2]}, t3={A=[1,2,4,5,6,7],B=[3]}, t4={A=[1,2,3,5,6,7],B=[4]}, t5={A=[1,2,3,4,6,7],B=[5]}, t6={A=[1,2,3,4,5,7],B=[6]}, t7={A=[1,2,3,4,5,6],B=[7]}. For each combination, the correlation coefficient between the index in B and all indexes in A is calculated. The correlations of t1, t2, t3, t5 and t6 are greater than 0.7, so A in these 5 paired combinations continues to be decomposed. Indexes 4 and 7 are not redundant indexes and need not be considered as candidates for the first group in the subsequent decomposition, which speeds up the calculation.
For the second traversal, taking t1 as an example, the decomposition gives: t1_1={A=[3,4,5,6,7],B=[1,2]}, t1_2={A=[2,4,5,6,7],B=[1,3]}, t1_3={A=[2,3,4,6,7],B=[1,5]}, t1_4={A=[2,3,4,5,7],B=[1,6]}. The correlation coefficient between each index in B and all indexes in A is calculated for each combination. If t1_1 and t1_2 satisfy the condition that these coefficients are all greater than 0.7, the A groups in t1_1 and t1_2 continue to be decomposed.
For the third traversal, in each paired combination decomposed from t1_1 and t1_2, A has 4 indexes and B has 3 indexes, but the correlation coefficient between each index in B and all indexes in A does not satisfy the condition of being greater than 0.7. That is, once any index is taken from the second group and added to the first group, the basic condition is no longer met. This indicates that at least 5 effective inspection characteristic indexes are required, and the A group of either t1_1 or t1_2 can be used as the inspection characteristic indexes to be analyzed.
Compared with the prior art, the method extends the traditional correlation calculation between two inspection characteristics, screens out highly correlated inspection characteristic indexes, and obtains the smallest set of inspection characteristic indexes that satisfies the correlation condition, so that the detection results of the redundant indexes outside the set can be predicted from the detection results of the indexes inside the set.
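The group-to-group correlation used in S11 can be sketched with NumPy as the first canonical correlation between two blocks of quality data. This is an assumption-laden sketch, not the patented implementation: the function name, the `eps` regularization, and the whitening-plus-SVD route are illustrative choices.

```python
import numpy as np


def first_canonical_correlation(X, Y, eps=1e-10):
    """Largest Pearson correlation achievable between linear combinations of X and of Y.

    X and Y are (samples, indexes) arrays holding the quality data of the two groups.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    s11 = Xc.T @ Xc / (n - 1)   # covariance of the first group
    s22 = Yc.T @ Yc / (n - 1)   # covariance of the second group
    s12 = Xc.T @ Yc / (n - 1)   # cross-covariance between the groups

    def inv_sqrt(s):
        # Symmetric inverse square root; eigenvalues are floored at eps (assumed safeguard).
        vals, vecs = np.linalg.eigh(s)
        vals = np.clip(vals, eps, None)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    m = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The singular values of the whitened cross-covariance are the canonical correlations.
    return float(np.linalg.svd(m, compute_uv=False)[0])
```

When the first group holds a single index, the returned value equals the multiple correlation coefficient of that index on the second group, which matches the one-versus-rest comparisons in the example above.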
S12: Removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed.
The quality data of this embodiment includes abnormal data caused by local variability and by global variability. Local variability is mainly caused by abrupt changes in the local detection environment and can be identified by numerically comparing the quality data of a single test problem. Global variability is caused by overall changes in factors such as the process detection environment and the detection method, and cannot be identified from numerical anomalies in the quality data of a single test problem. Thus, this embodiment uses statistical analysis and a variational autoencoder to identify, respectively, abnormal quality data within a single test problem and abnormal quality data that differs from the overall distribution of the quality data.
Specifically, removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and the variational autoencoder to obtain the data to be analyzed comprises:
Based on statistical analysis, applying z-score standardization to the quality data of each inspection characteristic index to be analyzed, and removing as abnormal data the quality data whose standardized value is greater than an abnormal threshold;
Feeding the remaining standardized quality data of each inspection characteristic index to be analyzed into a trained variational autoencoder, comparing the resulting output with the input, and removing as abnormal data the quality data whose difference is greater than a difference threshold;
the remaining quality data is used as the data to be analyzed.
The z-score standardization method measures how abnormal a quality data item is within a single inspection characteristic index by its standardized distance from the mean: the greater the standardized distance, the more abnormal the quality data item is for that inspection characteristic.
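The statistical step can be sketched as follows, assuming the population standard deviation and an abnormal threshold of 3.0; both are illustrative choices, since the embodiment does not state a threshold value, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def remove_zscore_outliers(values, abnormal_threshold=3.0):
    """Split values into (kept, removed) by absolute z-score against the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values), []  # all values identical: nothing can be abnormal
    kept, removed = [], []
    for v in values:
        z = abs(v - mu) / sigma  # standardized distance from the mean
        (removed if z > abnormal_threshold else kept).append(v)
    return kept, removed
```

Because a single extreme value also inflates the standard deviation, small samples may never exceed a threshold of 3.0, so the threshold would normally be tuned to the sample size.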
The distribution of the quality data is fitted by a variational autoencoder, which learns the data distribution characteristics of the inspection characteristics and thereby identifies abnormal quality data that differs globally. The variational autoencoder comprises two network structures, an encoder E(x) and a decoder D(z). The encoder E(x) extracts features from the inspection characteristics and maps them into a structural feature space. The decoder D(z) decodes the feature distribution in the structural feature space. Using the variational principle with KL divergence regularization, the model extracts the effective features of the quality data and outputs the data distribution of the original quality data.
Specifically, the encoder is obtained as the approximate posterior q(z|x, φ) and the decoder is obtained by the maximum likelihood p(x|z, θ), where φ and θ are the parameters of the encoder and the decoder, respectively, and a neural network is constructed to learn these parameters. The loss function of the variational autoencoder comprises a reconstruction term and a KL divergence regularization term, and a weight parameter is placed before the KL divergence regularization term to reduce its weight. That is, the variational autoencoder in this embodiment is obtained by solving the following optimization problem:
where α is a preset weight parameter, parameter optimization is performed during training, and D_KL(·) denotes the Kullback-Leibler divergence.
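The optimization problem is not reproduced in this text. A form consistent with the description above, assuming the usual weighted evidence lower bound with a prior p(z) over the latent variable, would be:

```latex
\min_{\phi,\, \theta}\;
-\,\mathbb{E}_{q(z \mid x, \phi)}\!\left[\log p(x \mid z, \theta)\right]
\;+\; \alpha\, D_{KL}\!\left(q(z \mid x, \phi)\,\big\|\,p(z)\right)
```

The first term is the reconstruction term and the second is the KL divergence regularization term weighted by \alpha; the prior p(z) is an assumption of this sketch and is commonly taken to be a standard normal distribution.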
In the KL-regularized variational autoencoder of this embodiment, the autoencoder fits the distribution of the quality data, and the outlier degree of each data item is measured by the distance between the autoencoder output and the original quality data. For a given quality data item, the higher the outlier degree, the more it differs globally from the other quality data, and it is removed as abnormal data.
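The weighted loss and the reconstruction-distance check can be sketched without a deep learning framework. This sketch assumes a diagonal Gaussian posterior, a standard normal prior, a squared-error reconstruction term, and Euclidean distance for the outlier degree; none of these choices is stated in the embodiment, and the function names and the default `alpha` are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def weighted_vae_loss(x, x_recon, mu, log_var, alpha=0.5):
    """Reconstruction term plus the KL regularization term scaled by alpha."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + alpha * kl_to_standard_normal(mu, log_var)


def is_abnormal(x, x_recon, difference_threshold):
    """Flag a sample whose reconstruction lies farther than the threshold from its input."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_recon)))
    return distance > difference_threshold
```

Choosing `alpha` below 1 lets the reconstruction term dominate, which is one way to read the stated goal of reducing the weight of the KL divergence regularization term.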
S13: Calculating each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with the corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range.
It should be noted that the production of equipment products involves typical models and main products. In this embodiment, PPM values (PPM stands for parts per million and represents the number of defects per million) are calculated in three dimensions: process, product, and model. Calculating each PPM value in the production process according to the inspection characteristic indexes to be analyzed and the data to be analyzed comprises the following steps:
Collecting the number of defects and the severity coefficient of each process to obtain the total number of defects of each process; obtaining, according to the inspection characteristic indexes to be analyzed and the data to be analyzed, the amount of data to be analyzed corresponding to the inspection characteristic indexes of each process as the total number of inspection characteristics of that process; and obtaining the PPM value of each process from the total number of defects of the process and the corresponding total number of inspection characteristics, as expressed by the following formula:
where P_i is the number of defects in process i, K_i is the severity coefficient of process i, G_i is the number of inspection characteristic indexes to be analyzed in process i, and n_r is the amount of data to be analyzed corresponding to the r-th inspection characteristic index in process i.
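The process-level formula is not reproduced in this text. One reading consistent with the symbols defined above, assuming the total number of defects of a process is the defect count weighted by its severity coefficient, would be:

```latex
\mathrm{PPM}_i = \frac{P_i \, K_i}{\sum_{r=1}^{G_i} n_r} \times 10^{6}
```

The product P_i K_i as the weighted defect total is an assumption of this sketch; the denominator is the total number of inspection characteristics of process i as described above.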
The total number of process defects and the total number of process inspection characteristics of the processes involved in each product during production are summed to obtain the total number of defects and the total number of inspection characteristics of each product; the PPM value of each product is obtained from its total number of defects and its total number of inspection characteristics.
The total number of product defects and the total number of product inspection characteristics of the products belonging to each model are summed to obtain the total number of defects and the total number of inspection characteristics of each model; the PPM value of each model is obtained from its total number of defects and its total number of inspection characteristics.
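The three-level roll-up can be illustrated with a minimal Python sketch. It assumes that the total number of defects of a process is the defect count multiplied by its severity coefficient; the `Process` record, the field names and the sample figures are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Process:
    product: str      # product the process belongs to (hypothetical field)
    model: str        # model the product belongs to (hypothetical field)
    defects: int      # P_i: number of defects found in the process
    severity: float   # K_i: severity coefficient of the process
    counts: list      # n_r: amount of data for each inspection characteristic index


def ppm(defect_total, feature_total):
    """Defects per million inspection characteristics."""
    return 1e6 * defect_total / feature_total if feature_total else 0.0


def ppm_by_level(processes, key):
    """Sum defect and characteristic totals per group, then convert to PPM."""
    totals = {}
    for p in processes:
        d, f = totals.get(key(p), (0.0, 0))
        # Assumption: weighted defect total of a process is P_i * K_i.
        totals[key(p)] = (d + p.defects * p.severity, f + sum(p.counts))
    return {k: ppm(d, f) for k, (d, f) in totals.items()}


processes = [
    Process("P1", "M1", defects=2, severity=1.0, counts=[400, 600]),
    Process("P1", "M1", defects=1, severity=2.0, counts=[1000]),
    Process("P2", "M1", defects=0, severity=1.0, counts=[2000]),
]

process_ppm = [ppm(p.defects * p.severity, sum(p.counts)) for p in processes]
product_ppm = ppm_by_level(processes, key=lambda p: p.product)
model_ppm = ppm_by_level(processes, key=lambda p: p.model)
```

Product and model values are computed from summed totals rather than by averaging the lower-level PPM values, which matches the summarizing steps described above.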
Preferably, when the amount of data to be analyzed, i.e. the quality data used to calculate PPM, is small and cannot meet the million-level data volume required by the traditional PPM calculation method, it is necessary to evaluate the difference between the PPM value calculated by normalizing the quality data to the million level and the true PPM value that would be calculated from million-level quality data. If the difference is small, the normalized PPM value is reliable and can directly represent the PPM value under million-level quality data. Otherwise, the normalized PPM value may deviate considerably from the PPM value under million-level quality data, and further optimization is required.
Thus, before the PPM values are compared with the PPM threshold ranges, the method further comprises: if the amount of data to be analyzed is less than or equal to a quantity threshold, obtaining a fluctuation threshold from the confidence level of a constructed t distribution, evaluating whether the difference between each PPM value and the ideal PPM value is smaller than the fluctuation threshold, and if so, retaining the data to be analyzed for calculating the PPM value. That is, a t distribution of the data to be analyzed is constructed from the mean and variance of the data, and the confidence level of the t distribution is used to estimate the difference from the true PPM value calculated from million-level quality data.
Specifically, the data approximately follow a t distribution with n-1 degrees of freedom, and the ideal PPM result based on million-level quality data is approximately $10^{6}\mu$, where $\mu$ is the mean of the probability distribution.
The fluctuation interval of the PPM value under million-level quality data, obtained from the t distribution, is:

$$\left[\,10^{6}\left(\bar{x}-t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right),\ 10^{6}\left(\bar{x}+t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right)\right] \tag{4}$$

where $\bar{x}$ is the data mean, $S$ is the data standard deviation, $n$ is the number of data points, $\alpha$ is the confidence level, and $t_{\alpha/2}(n-1)$ is the threshold determined from the confidence level.
Thus, when the difference between each PPM value and the ideal PPM value is smaller than $10^{6}\,t_{\alpha/2}(n-1)\,S/\sqrt{n}$, the PPM value calculated by normalizing the data to be analyzed to the million level can effectively estimate the true PPM value that would be obtained from million-level quality data. Otherwise, more quality data needs to be collected and the PPM values of the existing equipment data need further optimization.
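The small-sample check can be sketched in Python as follows. The sketch assumes the fluctuation threshold is the half-width of a standard t interval scaled by 10^6, that the data are per-batch defect rates, and that the caller supplies the critical value `t_crit` (for example from a t table); the function names, the sample values and the reference PPM values are hypothetical.

```python
import math
import statistics


def ppm_fluctuation_threshold(samples, t_crit):
    """Half-width of the PPM interval: 1e6 * t_crit * S / sqrt(n)."""
    n = len(samples)
    s = statistics.stdev(samples)  # sample standard deviation (n - 1 divisor)
    return 1e6 * t_crit * s / math.sqrt(n)


def small_sample_ppm_is_usable(samples, ideal_ppm, t_crit):
    """True when the normalized PPM lies within the threshold of ideal_ppm."""
    ppm_estimate = 1e6 * statistics.mean(samples)
    return abs(ppm_estimate - ideal_ppm) < ppm_fluctuation_threshold(samples, t_crit)


# Hypothetical per-batch defect rates; 2.776 is the two-sided 95% critical
# value of the t distribution with n - 1 = 4 degrees of freedom.
rates = [0.0010, 0.0012, 0.0008, 0.0011, 0.0009]
threshold = ppm_fluctuation_threshold(rates, t_crit=2.776)
usable_near = small_sample_ppm_is_usable(rates, ideal_ppm=1100.0, t_crit=2.776)
usable_far = small_sample_ppm_is_usable(rates, ideal_ppm=1300.0, t_crit=2.776)
```

With these figures the threshold is a little under 200 PPM, so a reference value of 1100 PPM is accepted and one of 1300 PPM is rejected.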
Once it is determined that the calculated PPM values can be used for evaluation, comparing each PPM value with its corresponding PPM threshold range and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range comprises: taking the data to be analyzed that is not within the PPM threshold range as sample data; acquiring success data and calculating a confidence interval of the success data, the range of which is used to judge whether the sample data is in the confidence interval; acquiring an envelope upper limit and an envelope lower limit of the success data according to a preset confidence level, which are used to indicate whether the sample data is enveloped; acquiring a qualified upper limit and a qualified lower limit according to a preset tolerance value, which are used to indicate whether the sample data is qualified; and generating an analysis result for the sample data based on whether it is enveloped, whether it is qualified, and whether it is in the confidence interval.
Success data refers to product data that has been verified in experiments or in historical use as successful, i.e. not having failed. Calculating the confidence interval of the success data comprises: counting the success data of each index separately according to the inspection characteristic indexes to be analyzed that correspond to the sample data; if the amount of success data is greater than a quantity threshold, constructing the confidence interval with a Gaussian mixture model (GMM) algorithm; otherwise, constructing the confidence interval with a t distribution.
Specifically, when the confidence interval is constructed with the GMM algorithm, the parameters of the Gaussian density functions are estimated by the EM algorithm, and the posterior probability distribution of the estimated parameters is calculated according to the Bayesian formula, from which the confidence interval is obtained.
The EM algorithm is mapped to the parameter estimation in the gaussian mixture density function as follows:
where $\mu_k$, $\Sigma_k$ and $\pi_k$ are, respectively, the mean and variance of the Gaussian density function corresponding to the quality data of the k-th inspection characteristic index and the proportion of the k-th inspection characteristic index, and $n$ is the number of samples.
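As an illustration only, the textbook EM updates for a one-dimensional Gaussian mixture can be sketched in Python. This is the standard algorithm and is not confirmed to be the exact formulation used in the patent; the function names, the initial values and the sample data are hypothetical, and the derivation of a confidence interval from the fitted mixture is not shown.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with the standard EM updates."""
    means, variances, weights = list(means), list(variances), list(weights)
    n, k = len(data), len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate pi_k, mu_k and Sigma_k from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-9)  # guard against a collapsed component
    return means, variances, weights


# Hypothetical success data drawn from two operating regimes.
data = [9.8, 9.9, 10.0, 10.1, 10.2, 11.8, 11.9, 12.0, 12.1, 12.2]
fit_means, fit_vars, fit_weights = em_gmm_1d(
    data, means=[9.5, 12.5], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this well-separated data the two component means settle near 10 and 12 with roughly equal proportions.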
The confidence interval is constructed with the t distribution in the same way as in the evaluation of the PPM value: removing the factor $10^{6}$ from formula (4) gives the expression of the confidence interval.
Further, this embodiment generates the envelope upper limit and envelope lower limit at the 99.73% confidence level (corresponding to 3σ) defined for equipment production data. A qualified upper limit and a qualified lower limit are acquired according to a preset tolerance value, and the interval they form is the tolerance zone. When the design tolerance, used as the upper and lower limits of the product qualification criterion, coincides with the envelope, the influence of an individual quality datum on the task is fully understood and the decision risk is very small. In practice, however, the tolerance zone often does not coincide with the envelope. This embodiment therefore considers the envelope zone, the tolerance zone and the confidence interval together when generating the analysis result for the sample data, which supports more accurate risk analysis and assessment.
The analysis results include: qualified and enveloped, qualified but not enveloped, unqualified but enveloped, and unqualified and not enveloped, each accompanied by whether the sample data is in the confidence interval.
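A minimal Python sketch of the combined judgment is given below. It assumes the envelope limits are the mean of the success data plus or minus three sample standard deviations, and that the tolerance zone and confidence interval are supplied as (lower, upper) pairs; the function names and all numeric values are hypothetical.

```python
import statistics

# The four outcomes keyed by (qualified, enveloped).
LABELS = {
    (True, True): "qualified and enveloped",
    (True, False): "qualified but not enveloped",
    (False, True): "unqualified but enveloped",
    (False, False): "unqualified and not enveloped",
}


def envelope_limits(success_data, k_sigma=3.0):
    """Envelope lower and upper limits as mean -/+ k_sigma sample std devs."""
    mean = statistics.mean(success_data)
    sd = statistics.stdev(success_data)
    return mean - k_sigma * sd, mean + k_sigma * sd


def analyze_sample(value, envelope, tolerance, confidence):
    """Classify one sample against the envelope, tolerance zone and interval."""
    def inside(bounds):
        lower, upper = bounds
        return lower <= value <= upper

    return {
        "result": LABELS[(inside(tolerance), inside(envelope))],
        "in_confidence_interval": inside(confidence),
    }


success = [10.0, 10.2, 9.8, 10.1, 9.9]   # hypothetical success data
envelope = envelope_limits(success)      # roughly (9.53, 10.47)
tolerance = (9.7, 10.3)                  # hypothetical nominal 10.0 +/- 0.3
confidence = (9.8, 10.2)                 # hypothetical confidence interval

report_a = analyze_sample(10.4, envelope, tolerance, confidence)
report_b = analyze_sample(10.0, envelope, tolerance, confidence)
report_c = analyze_sample(9.5, envelope, tolerance, confidence)
```

The value 10.4 lies outside the tolerance zone but inside the envelope, which is the case where the tolerance zone and the envelope disagree and the extra confidence-interval flag is most informative.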
Compared with the prior art, the quality data analysis method and system provided by this embodiment, based on the collected equipment quality data, use techniques such as data correlation analysis, data anomaly analysis and small-sample data analysis to detect and remove redundancy in the inspection quality data, evaluate the confidence of the inspection data under small-sample conditions, and intelligently analyze whether product data falls within the envelope range. Quality hazards or weak links in the production process can thus be discovered in advance, making quality management and control more refined and intelligent.
Example 2
Another embodiment of the present invention discloses a quality data analysis system that implements the quality data analysis method of embodiment 1. For the specific implementation of each module, refer to the corresponding description in embodiment 1. The system comprises:
an inspection characteristic index acquisition module, used for acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed;
a to-be-analyzed data acquisition module, used for removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed;
and a quality data analysis module, used for calculating each PPM value in the production process from the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with its corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range.
Since the relevant parts of the quality data analysis system and the quality data analysis method of this embodiment can be referred to each other, they are not described again here to avoid repetition. The principle of the system embodiment is the same as that of the method embodiment, so the system embodiment also has the corresponding technical effects of the method embodiment.
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by way of a computer program to instruct associated hardware, where the program may be stored on a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (8)

1. A quality data analysis method, comprising the following steps:
acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed, which comprises: dividing all N inspection characteristic indexes into a plurality of paired combinations by traversal and recursion, wherein the first group of each paired combination has i indexes and the second group has the remaining N-i indexes; and, from the paired combinations meeting the following conditions, taking the indexes of any second group having the smallest number of indexes as the inspection characteristic indexes to be analyzed: as a basic condition, the correlation coefficient between each index in the first group and all indexes in the second group is larger than a correlation threshold, and the basic condition is no longer met after any index is taken out of the second group and added to the first group;
removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed;
calculating each PPM value in the production process from the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with its corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range, which comprises: taking the data to be analyzed that is not within the PPM threshold range as sample data; acquiring success data and calculating a confidence interval of the success data, the range of the confidence interval being used to indicate whether the sample data is in the confidence interval; acquiring an envelope upper limit and an envelope lower limit of the success data according to a preset confidence level, which are used to indicate whether the sample data is enveloped; acquiring a qualified upper limit and a qualified lower limit according to a preset tolerance value, which are used to indicate whether the sample data is qualified; and generating a sample data analysis result based on whether the sample data is enveloped, whether it is qualified, and whether it is in the confidence interval.
2. The quality data analysis method according to claim 1, further comprising, before the comparison with the corresponding PPM threshold ranges: if the amount of data to be analyzed is less than or equal to a quantity threshold, obtaining a fluctuation threshold from the confidence level of a constructed t distribution, evaluating whether the difference between each PPM value and the ideal PPM value is smaller than the fluctuation threshold, and if so, retaining the data to be analyzed for calculating the PPM value.
3. The quality data analysis method according to claim 1, wherein the correlation coefficient between each index in the first group and all indexes in the second group is obtained by forming a linear combination of the quality data corresponding to each of the two groups of indexes and maximizing the Pearson correlation coefficient of the two linear combinations.
4. The quality data analysis method according to claim 1, wherein removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and the variational autoencoder to obtain the data to be analyzed comprises:
based on statistical analysis, applying z-score standardization to the quality data of each inspection characteristic index to be analyzed, and removing as abnormal data the quality data that is larger than an abnormality threshold;
passing the standardized quality data of each inspection characteristic index to be analyzed into a trained variational autoencoder, comparing the output with the input, and removing as abnormal data the quality data whose difference is larger than a difference threshold;
taking the remaining quality data as the data to be analyzed.
5. The quality data analysis method according to claim 4, wherein the loss function of the variational autoencoder includes a reconstruction term and a KL-divergence regularization term, and a weight parameter is placed before the KL-divergence regularization term to reduce the weight of the KL-divergence regularization term.
6. The quality data analysis method according to claim 1, wherein the calculating of each PPM value in the production process based on the inspection characteristic index to be analyzed and the data to be analyzed includes:
collecting the number of defects and the severity coefficient of each process to obtain the total number of defects of each process; taking, according to the inspection characteristic indexes to be analyzed and the data to be analyzed, the amount of data to be analyzed corresponding to the inspection characteristic indexes of each process as the total number of inspection characteristics of that process; obtaining the PPM value of each process from its total number of defects and its total number of inspection characteristics;
summing the total number of process defects and the total number of process inspection characteristics of the processes involved in each product during production to obtain the total number of defects and the total number of inspection characteristics of each product; obtaining the PPM value of each product from its total number of defects and its total number of inspection characteristics;
summing the total number of product defects and the total number of product inspection characteristics of the products belonging to each model to obtain the total number of defects and the total number of inspection characteristics of each model; and obtaining the PPM value of each model from its total number of defects and its total number of inspection characteristics.
7. The quality data analysis method according to claim 1, wherein acquiring success data and calculating a confidence interval of the success data comprises: counting the success data of each index separately according to the inspection characteristic indexes to be analyzed that correspond to the sample data; if the amount of success data is greater than a quantity threshold, constructing the confidence interval with a Gaussian mixture model (GMM) algorithm; otherwise, constructing the confidence interval with a t distribution.
8. A quality data analysis system, comprising:
an inspection characteristic index acquisition module, used for acquiring quality data and inspection characteristic indexes in the production process, and removing redundant inspection characteristic indexes according to the quality data and correlation coefficients to obtain the inspection characteristic indexes to be analyzed, which comprises: dividing all N inspection characteristic indexes into a plurality of paired combinations by traversal and recursion, wherein the first group of each paired combination has i indexes and the second group has the remaining N-i indexes; and, from the paired combinations meeting the following conditions, taking the indexes of any second group having the smallest number of indexes as the inspection characteristic indexes to be analyzed: as a basic condition, the correlation coefficient between each index in the first group and all indexes in the second group is larger than a correlation threshold, and the basic condition is no longer met after any index is taken out of the second group and added to the first group;
a to-be-analyzed data acquisition module, used for removing abnormal data from the quality data of each inspection characteristic index to be analyzed according to statistical analysis and a variational autoencoder to obtain the data to be analyzed;
a quality data analysis module, used for calculating each PPM value in the production process from the inspection characteristic indexes to be analyzed and the data to be analyzed, comparing each PPM value with its corresponding PPM threshold range, and performing data envelope analysis on the data to be analyzed that is not within the PPM threshold range, which comprises: taking the data to be analyzed that is not within the PPM threshold range as sample data; acquiring success data and calculating a confidence interval of the success data, the range of the confidence interval being used to indicate whether the sample data is in the confidence interval; acquiring an envelope upper limit and an envelope lower limit of the success data according to a preset confidence level, which are used to indicate whether the sample data is enveloped; acquiring a qualified upper limit and a qualified lower limit according to a preset tolerance value, which are used to indicate whether the sample data is qualified; and generating a sample data analysis result based on whether the sample data is enveloped, whether it is qualified, and whether it is in the confidence interval.
CN202310007166.5A 2023-01-04 2023-01-04 Quality data analysis method and system Active CN116049157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310007166.5A CN116049157B (en) 2023-01-04 2023-01-04 Quality data analysis method and system

Publications (2)

Publication Number Publication Date
CN116049157A CN116049157A (en) 2023-05-02
CN116049157B true CN116049157B (en) 2024-05-07

Family

ID=86128997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310007166.5A Active CN116049157B (en) 2023-01-04 2023-01-04 Quality data analysis method and system

Country Status (1)

Country Link
CN (1) CN116049157B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116777292B (en) * 2023-06-30 2024-04-16 北京京航计算通讯研究所 Defect rate index correction method based on multi-batch small sample space product

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105303311A (en) * 2015-10-21 2016-02-03 中国人民解放军装甲兵工程学院 Assessment index selection method and device based on data envelopment analysis
CN108304350A (en) * 2017-12-25 2018-07-20 明阳智慧能源集团股份公司 Wind turbine index prediction based on large data sets neighbour's strategy and fault early warning method
CN109101632A (en) * 2018-08-15 2018-12-28 中国人民解放军海军航空大学 Product quality abnormal data retrospective analysis method based on manufacture big data
CN110807605A (en) * 2019-11-14 2020-02-18 北京京航计算通讯研究所 Key inspection characteristic defect rate statistical method
CN112149860A (en) * 2019-06-28 2020-12-29 中国电力科学研究院有限公司 Automatic anomaly detection method and system
CN112258689A (en) * 2020-10-26 2021-01-22 上海船舶研究设计院(中国船舶工业集团公司第六0四研究院) Ship data processing method and device and ship data quality management platform
WO2021189904A1 (en) * 2020-10-09 2021-09-30 平安科技(深圳)有限公司 Data anomaly detection method and apparatus, and electronic device and storage medium
CN113609698A (en) * 2021-08-17 2021-11-05 北京无线电测量研究所 Process reliability analysis method and system based on process fault database
CN114036724A (en) * 2021-10-19 2022-02-11 北京轩宇信息技术有限公司 Method and device for analyzing technical index success envelope of aerospace product
WO2022243764A1 (en) * 2021-05-18 2022-11-24 LEONARDO S.p.A Method and system for detecting anomalies relating to components of a transmission system of an aircraft, in particular a helicopter

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG11201804355UA (en) * 2015-11-26 2018-06-28 Human Metabolome Tech Inc Data analysis apparatus, method, and program
CA2989617A1 (en) * 2016-12-19 2018-06-19 Capital One Services, Llc Systems and methods for providing data quality management
CN113704243A (en) * 2020-05-20 2021-11-26 富泰华工业(深圳)有限公司 Data analysis method, data analysis device, computer device, and storage medium
US11984334B2 (en) * 2021-04-13 2024-05-14 Accenture Global Solutions Limited Anomaly detection method and system for manufacturing processes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimizing the Uncertainty of PPM on Small Batch of Quality Data;J. Wang, T. Zhang, C. Wang and X. Shi,;2021 IEEE 6th International Conference on Smart Cloud;20211231;107-110 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant