CN111724048A

CN111724048A - Characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering

Info

Publication number: CN111724048A
Application number: CN202010494916.2A
Authority: CN
Inventors: 潘佰林; 许小双; 乐欢; 郭妙贞
Original assignee: China Tobacco Zhejiang Industrial Co Ltd
Current assignee: China Tobacco Zhejiang Industrial Co Ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-09-29

Abstract

The invention discloses a characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering, which comprises the following steps: 1) anticipating, from experience, the fault scenarios of the scheduling subsystem of the finished product warehouse logistics system, analyzing how the data behave in each fault scenario, and selecting the corresponding indexes accordingly; 2) collecting the selected index data at equal time intervals, and cleaning and preprocessing the data to obtain a data set for feature extraction; 3) extracting features from the data set, and amplifying and displaying the features through an excitation function. The method extracts and amplifies relatively subtle features, finds a suitable feature detector for each KPI, and identifies the key features of complex data so that operation and maintenance personnel can inspect them conveniently. Little information is lost, the patterns contained in the original data are retained, and the uncertain factors in the original data are effectively reduced.

Description

Characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering

Technical Field

The invention relates to the field of logistics equipment monitoring management, in particular to a characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering.

Background

The scanning, sorting, and backflow fault of finished cigarette pieces is a common fault on the logistics scheduling production line of a cigarette factory and is caused by problems in the production PLC transmission mechanism. At present, most PLC industrial control equipment used in cigarette production is insufficiently monitored, so sufficient data for analysis cannot be acquired. Because the production environment is complex, the specific causes of the fault vary, for example a firewall performance bottleneck, a database cluster heartbeat timeout, or storage disk IO delay. When the fault occurs, finished cigarette pieces flow back during code scanning and sorting, and a large number of finished cigarette pieces leave the production line, which causes economic loss. Therefore, taking the scanning, sorting, and backflow fault of finished cigarette pieces as the entry point, research on methods for acquiring environment data related to the PLC equipment and extracting features from those data can lay a data foundation for correlating the fault with environment and application data and for giving early warning of it. The scheduling subsystem of the finished product warehouse logistics system of a cigarette factory generates a large amount of application performance data during operation, such as CPU utilization, memory utilization, swap area utilization, disk IO rate, IO read-write frequency, average disk latency, and network port rate. Before feature extraction, this massive and disordered information is usually difficult for algorithms to use directly. Whether it relies on machine learning, deep learning, or statistical methods, any intelligent system needs valid data to support it. How to turn the original data into qualified input data has therefore been a difficult problem that has troubled equipment operation and maintenance personnel for many years.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide a feature extraction method for performance data of a finished product library scheduling system based on feature engineering. The method extracts and amplifies relatively subtle features, finds a suitable feature detector for each KPI, and identifies the key features of complex data so that operation and maintenance personnel can inspect them conveniently. Little information is lost, the patterns contained in the original data are retained, and the uncertain factors in the original data are effectively reduced.

Based on the above purpose, the invention provides a characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering, which comprises the following steps:

1) pre-judging the fault scene of the finished product warehouse logistics system scheduling subsystem according to experience, analyzing the data performance in the fault scene, and pertinently selecting corresponding indexes;

2) collecting selected index data at equal time intervals, cleaning and preprocessing the data to obtain a data set for feature extraction;

3) extracting the characteristics of the data set, and amplifying and displaying the characteristics through an excitation function.

Preferably, the characteristic extraction of the data set comprises extracting performance data of a finished product library scheduling system, and checking the continuity and integrity of the performance data of the finished product library scheduling system to remove the interference of CPU utilization rate, memory utilization rate and network port rate;

after the check, the first 1/2 of the data set is used for training a feature selection model, the next 1/3 of the data set, taken from the remaining part, is used for auxiliary parameter tuning during training, and the final 1/6 of the data set is used for verifying the effect of the model.

Preferably, the extraction of the feature points is performed by using a chi-square test feature point extraction algorithm.

Preferably, the integrity detection comprises detecting the extraction content, the extraction speed, the description of the coincidence condition and the description of the coincidence matching speed of the feature points.

Preferably, a regression analysis method is adopted to check the continuity of the performance data of the finished product library scheduling system.

Preferably, the specific method for checking the integrity of the performance data of the finished product library scheduling system is as follows: a number of points around each time point are selected along the time dimension to form a set, and whether the performance data of the finished product library scheduling system are complete is judged according to the kernel density of the set.

Preferably, the cleaning of the index data includes removing abnormal data in the operation data of the logistics sorting machine.

Preferably, preprocessing the index data includes: checking the consistency of the remaining data, and, after the cleaned data enter the message bus, performing ETL (extract, transform, load) processing, filtering, splitting, and expanding on the data.

Compared with the prior art, the invention has the beneficial effects that:

the method extracts and amplifies the relatively fine features, and the KPI finds a proper feature detector and finds out the key features of the complex data so as to facilitate checking by operation and maintenance personnel, so that the information loss is less, the rules contained in the original data are still kept, and the uncertain factors in the original data can be effectively reduced.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flowchart of a feature extraction method for performance data of a finished product library scheduling system based on feature engineering according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a feature extraction method for performance data of a finished product library scheduling system based on feature engineering according to an embodiment of the present invention;

FIG. 3 is a graph comparing the performance of three feature extraction algorithms: chi-square test, stability selection, and recursive feature elimination;

FIG. 4 is a schematic illustration of feature extraction on data at a granularity of two hours in an embodiment of the invention;

FIG. 5 is a schematic diagram of the present invention employing Chi-Square testing for the number of different performance feature extractions;

FIG. 6 is a schematic diagram in which the CPU utilization index feature of the finished product library scheduling system is flattened and extracted to generate a linear feature curve in the embodiment of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, elements, and/or combinations thereof.

The embodiment provides a feature extraction method for performance data of a finished product library scheduling system based on feature engineering, as shown in fig. 1 and 2, the method includes the following steps:

As a preferred embodiment, cleaning the index data includes removing abnormal data from the operation data of the logistics sorting machine. The abnormal data comprise outliers and missing data contained in the production data; data affected by known external factors, such as abnormal working conditions, are also screened out and excluded according to actual production experience. Missing values are handled by discarding the entire record that contains the missing value. Outliers are detected with a statistics-based outlier detection method, which is suited to mining univariate numerical data.

As a preferred embodiment, preprocessing the index data includes: checking the consistency of the remaining data, and, after the cleaned data enter the message bus, performing ETL (extract, transform, load) processing, filtering, splitting, and expanding on the data. Specifically, the data preprocessing includes operations such as feature selection, data normalization, and data standardization. Standardization here means z-score standardization, after which the processed data have a mean of 0 and a standard deviation of 1. The processing formula is: $x' = (x - \mu) / \sigma$. In this formula, $x'$ is the standardized feature, $x$ is the raw feature value, $\mu$ is the sample mean, and $\sigma$ is the sample standard deviation. Both can be estimated from the existing samples, and the estimates are relatively stable when enough samples are available. In addition, the processed data set, covering about 6 months, is divided into a training set, a validation set, and a test set. The training set is used for training the model, the validation set is used for assisting parameter tuning during training, and the test set is used for the final verification of the model's effect.
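The z-score standardization described above can be illustrated with a minimal sketch. The function name `z_score` and the sample values are hypothetical and chosen only for illustration; the population standard deviation is assumed here.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Return (x - mean) / std for every x, using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical CPU utilization samples (percent).
cpu_utilization = [35.0, 40.0, 38.0, 90.0, 37.0]
standardized = z_score(cpu_utilization)
```

After the transformation, the mean of `standardized` is 0 and its standard deviation is 1, up to floating-point rounding.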

After the data preprocessing is finished, different model combinations are reasonably selected for training and generating corresponding models according to the characteristics of the time sequence, the allocation of computing resources and the time of the data. Based on the above, the characteristic extraction of the data set comprises the steps of extracting performance data of the finished product library scheduling system, and checking the continuity and the integrity of the performance data of the finished product library scheduling system to remove the interference of the CPU utilization rate, the memory utilization rate and the network port rate;

As a preferred embodiment, a chi-square test feature point extraction algorithm is used to extract the feature points. The classical chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Assuming that the independent variable takes N values and the dependent variable takes M values, the test considers, for every pair (i, j), the difference between the observed and the expected frequency of samples in which the independent variable equals i and the dependent variable equals j. Put simply, the statistic measures the correlation between the independent variable and the dependent variable. The K features with the highest chi-square values are selected as the final features. The chi-square statistic is calculated as follows:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is the observed frequency (the count observed in a cell) and $f_e$ is the expected frequency if there is no relationship between the variables. As the equation shows, the chi-square statistic is based on the difference between the values actually observed in the data and the values expected if there were no relationship between the variables.
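As a rough illustration of the chi-square scoring and top-K selection described above, the following sketch computes the statistic from a contingency table of two categorical sequences. The feature names and the binary data are hypothetical; the features are assumed to have been discretized beforehand.

```python
from collections import Counter


def chi_square(feature, label):
    """Chi-square statistic of the contingency table of two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for i, row_total in feature_totals.items():
        for j, col_total in label_totals.items():
            expected = row_total * col_total / n  # f_e if the variables are unrelated
            stat += (observed[(i, j)] - expected) ** 2 / expected
    return stat


def select_k_best(features, label, k):
    """Return the names of the k features with the highest chi-square value."""
    scores = {name: chi_square(column, label) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k], scores


# Hypothetical discretized indexes (1 = "high") and a fault label (1 = fault).
fault = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "disk_io_high": [0, 0, 0, 0, 1, 1, 1, 1],
    "cpu_high": [0, 1, 0, 1, 0, 1, 0, 1],
    "mem_high": [0, 0, 0, 1, 0, 1, 1, 1],
}
selected, scores = select_k_best(features, fault, k=2)
```

In this toy data, `disk_io_high` matches the label exactly and gets the largest score, while `cpu_high` is unrelated to the label and scores 0.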

Preferably, the extraction of the feature points may also be performed by a method of stability selection or recursive feature elimination, specifically:

Stability selection: stability selection is a relatively new method that combines subsampling with a selection algorithm, which may be regression, SVM, or a similar method. The main idea is to run a feature selection algorithm on different subsets of the data and of the features, repeat this many times, and finally aggregate the selection results. For example, one can count how often a feature is judged important (the number of times it is selected as important divided by the number of times a subset containing it is tested). Ideally, the score of an important feature is close to 100%, the score of a slightly weaker feature is some nonzero value, and the score of the least useful feature is close to 0.
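The subsampling idea can be sketched as follows. This is a simplified illustration, not the implementation used in the embodiment: the base selector is assumed to be "largest absolute Pearson correlation with the target", and the synthetic data, the feature names, and all parameter values are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def stability_selection(features, target, rounds=200, subset_size=2, seed=0):
    """Selection frequency per feature: times selected / times its subset was tested."""
    rng = random.Random(seed)
    names = list(features)
    n = len(target)
    selected = dict.fromkeys(names, 0)
    tested = dict.fromkeys(names, 0)
    for _ in range(rounds):
        rows = rng.sample(range(n), n // 2)     # random half of the samples
        cols = rng.sample(names, subset_size)   # random subset of the features
        y = [target[r] for r in rows]
        scores = {c: abs_corr([features[c][r] for r in rows], y) for c in cols}
        winner = max(cols, key=scores.get)      # base selector keeps one feature
        for c in cols:
            tested[c] += 1
        selected[winner] += 1
    return {c: selected[c] / tested[c] if tested[c] else 0.0 for c in names}


# Synthetic data: the target depends strongly on "signal", weakly on "weak".
data_rng = random.Random(42)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
weak = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
target = [2 * s + w for s, w in zip(signal, weak)]
frequency = stability_selection({"signal": signal, "weak": weak, "noise": noise}, target)
```

With this setup the strong feature is selected in almost every round in which it is tested, the weaker one only when it is not competing with the strong one, and the unrelated one almost never.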

Recursive feature elimination: the main idea of recursive feature elimination is to repeatedly build a model (for example, an SVM or a regression model), select the best (or worst) feature (for example, according to the coefficients), set the selected feature aside, and then repeat the process on the remaining features until all features have been traversed. The order in which features are eliminated is the ranking of the features. This is therefore a greedy algorithm for finding the optimal feature subset. FIG. 3 compares the performance of the three feature extraction algorithms: chi-square test, stability selection, and recursive feature elimination. According to the test results for the various performance measures, the extraction time of the chi-square test feature point extraction algorithm is 370 ms. Although it has no advantage in extraction speed, as shown in FIG. 5, the chi-square test feature point extraction algorithm has clear advantages in descriptor extraction, matching speed, and number of matched points.
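The recursive feature elimination loop described above can be sketched as below. The model is assumed to be a linear regression fitted by plain gradient descent without an intercept, on features that are already standardized; the function names, the synthetic data, and the step settings are hypothetical.

```python
import random


def fit_linear(columns, y, steps=300, lr=0.1):
    """Least-squares coefficients by gradient descent (no intercept, standardized inputs assumed)."""
    n = len(y)
    w = [0.0] * len(columns)
    for _ in range(steps):
        residual = [sum(wj * col[i] for wj, col in zip(w, columns)) - y[i] for i in range(n)]
        grad = [sum(r * col[i] for i, r in enumerate(residual)) / n for col in columns]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


def recursive_feature_elimination(features, y):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    remaining = dict(features)
    eliminated = []
    while remaining:
        names = list(remaining)
        coefs = fit_linear([remaining[name] for name in names], y)
        weakest = min(zip(names, coefs), key=lambda pair: abs(pair[1]))[0]
        eliminated.append(weakest)
        del remaining[weakest]
    return eliminated[::-1]  # most important feature first


# Synthetic data: y depends strongly on "signal", weakly on "weak", not on "noise".
rng = random.Random(7)
signal = [rng.gauss(0, 1) for _ in range(200)]
weak = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
y = [2 * s + w for s, w in zip(signal, weak)]
ranking = recursive_feature_elimination({"signal": signal, "weak": weak, "noise": noise}, y)
```

The reversed elimination order is returned as the ranking, so the feature that survives longest comes first.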

After feature selection is completed, the model can be trained directly, but an overly large feature matrix leads to a heavy computational load and a long training time, so reducing the dimensionality of the feature matrix is also indispensable. The dimensionality reduction method adopted is principal component analysis (PCA). PCA essentially maps the original samples into a lower-dimensional sample space in such a way that the mapped samples have maximum variance; it is an unsupervised dimensionality reduction method.
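To make the mapping concrete, the sketch below finds the first principal component by power iteration on the covariance matrix and projects the samples onto it. It is a minimal illustration under stated assumptions: a production system would more likely rely on a numerical library, the fixed start vector must not be orthogonal to the leading component, and the two-dimensional sample data are hypothetical.

```python
def first_principal_component(rows, iterations=100):
    """Direction of maximum variance and the projection of each row onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] + [0.0] * (d - 1)  # start vector for the power iteration
    for _ in range(iterations):
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# Hypothetical samples lying close to the line y = x.
rows = [[t + 0.1 * (-1) ** k, t - 0.1 * (-1) ** k] for k, t in enumerate(range(-5, 6))]
component, reduced = first_principal_component(rows)
```

Each two-dimensional sample is reduced to a single coordinate in `reduced`, measured along the direction stored in `component`.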

As a preferred embodiment, the integrity detection includes detecting the extraction content of the feature points, extracting speed, describing the matching condition and describing the matching speed.

As a preferred embodiment, a regression analysis method is adopted to check the continuity of the performance data of the finished product library scheduling system.

As a preferred embodiment, the specific method for checking the integrity of the performance data of the finished product library scheduling system is as follows: a number of points around each time point are selected along the time dimension to form a set, and whether the performance data of the finished product library scheduling system are complete is judged according to the kernel density of the set.

In addition, in step 3), a KPI is normal most of the time: it shows no large fluctuations, only random noise. Fluctuations occur only when the service is affected, so fluctuation data are far scarcer than normal data. To attenuate the effect of noise, a modified excitation function is used: the greater the fluctuation of a KPI, the more strongly the fluctuation feature is amplified, so that the fluctuation feature becomes more distinctive and contributes more to the final correlation judgment.
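The text does not give the form of the excitation function. Purely as an assumption, the sketch below uses a signed exponential, which leaves values near the noise level almost unchanged and amplifies larger fluctuations progressively more. The name `excite`, the `noise_scale` parameter, and the overflow cap are all hypothetical.

```python
import math


def excite(error, noise_scale=1.0, cap=50.0):
    """Signed exponential amplification: small values stay small, large ones grow fast."""
    magnitude = min(abs(error) / noise_scale, cap)  # cap guards against overflow
    return math.copysign(math.expm1(magnitude), error)


# Gain (output / input) for a noise-sized value and for a large fluctuation.
small_gain = excite(0.1) / 0.1
large_gain = excite(3.0) / 3.0
```

Because `large_gain` exceeds `small_gain`, a larger fluctuation is amplified more strongly than noise, which is the property the paragraph above asks of the function.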

Extracting fluctuation characteristics of KPIs

For a time series $S = [s_1, s_2, \dots, s_m]$, $s_i$ is the value of KPI $S$ at time $i$, and $m$ is the length of the KPI. For a single KPI, the time interval between data at adjacent time instants is required to be the same during data acquisition and preprocessing. For two KPIs with different time intervals, the least common multiple of the two intervals can be taken as the common time interval. The prediction sequence of KPI $S$ is $P = [p_1, p_2, \dots, p_m]$, where $p_i$ is the predicted value of $s_i$. The prediction error sequence of the KPI is therefore $F = [f_1, f_2, \dots, f_m]$ with $f_i = s_i - p_i$. For a KPI, the normal part is relatively easy to predict accurately, whereas abnormal fluctuations are usually caused by unpredictable burst factors and are difficult to predict. The prediction error therefore represents the fluctuation characteristics of the KPI well, and the KPI prediction error sequence is used to represent them.
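A small sketch of the prediction error sequence follows. The predictor is not specified in the text; a moving average of the preceding values is assumed here only for illustration, and the function names, window length, and sample series are hypothetical.

```python
from math import gcd
from statistics import fmean


def common_interval(a, b):
    """Least common multiple of two sampling intervals (in seconds)."""
    return a * b // gcd(a, b)


def prediction_errors(series, window=3):
    """f_i = s_i - p_i, where p_i is the mean of up to `window` preceding values."""
    errors = []
    for i, s_i in enumerate(series):
        history = series[max(0, i - window):i]
        p_i = fmean(history) if history else s_i  # no history: predict the value itself
        errors.append(s_i - p_i)
    return errors


# Hypothetical KPI with one burst at index 4.
kpi = [10.0, 10.0, 10.0, 10.0, 30.0, 10.0, 10.0]
errors = prediction_errors(kpi)
interval = common_interval(20, 30)
```

The error is zero on the flat part of the series and jumps at the burst, which is why the error sequence serves as a fluctuation feature.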

After the model is built and data have accumulated to a certain degree, online detection in the actual environment can begin. Online detection uses the key feature generation algorithm corresponding to the trained model to generate the features of each new time point, and uses the trained model to score how anomalous the new time point is. During online detection, the following practical problems need to be handled:

Missing points: there is no data at a fixed acquisition time point.

Out-of-order arrival: a later point reaches the anomaly detection algorithm first while an earlier point is still in the queue.

Characteristic change: the characteristics of the time series differ from before, for example because of a new deployment.
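The first two problems can be handled with a small reordering buffer, sketched below as one possible approach rather than the one used in the embodiment. Points are held until they are older than an allowed delay, released in time order, and any acquisition slot that never received data is emitted with the value `None`. The function name and all parameters are hypothetical.

```python
import heapq


def reorder_and_fill(arrivals, start, interval, max_delay):
    """Return (timestamp, value) pairs in time order; gaps appear as (timestamp, None)."""
    heap, ordered = [], []
    expected, latest = start, start

    def release(limit):
        nonlocal expected
        while expected <= limit:
            while heap and heap[0][0] < expected:
                heapq.heappop(heap)  # drop duplicates or points that arrived too late
            if heap and heap[0][0] == expected:
                ordered.append(heapq.heappop(heap))
            else:
                ordered.append((expected, None))  # missing acquisition point
            expected += interval

    for ts, value in arrivals:
        heapq.heappush(heap, (ts, value))
        latest = max(latest, ts)
        release(latest - max_delay)  # only release points that are old enough
    release(latest)                  # flush the remainder at the end of the stream
    return ordered


# The point for t=10 arrives after t=20, and t=30 never arrives.
arrivals = [(0, 1.0), (20, 3.0), (10, 2.0), (40, 5.0)]
ordered = reorder_and_fill(arrivals, start=0, interval=10, max_delay=20)
```

The choice of `max_delay` trades detection latency against tolerance for late points: a point that arrives later than this delay is treated as missing.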

The algorithm assigns an anomaly score to the value at each time point and, using the default anomaly detection threshold, decides whether the point is anomalous. However, time series generated in a production environment carry different meanings, and the expected detection behavior can differ even for series of the same kind. The algorithm is therefore adjusted automatically toward the expected behavior according to labeled feedback on missed anomalies (false negatives) and false alarms on normal points (false positives).
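How the feedback changes the algorithm is not detailed in the text. One simple, hypothetical reading is a threshold that moves down after a missed anomaly and up after a false alarm, as sketched below; the step size, the score range of 0 to 1, and the function name are assumptions.

```python
def adjust_threshold(threshold, labeled_points, step=0.05):
    """Move the anomaly threshold according to labeled (score, is_anomaly) feedback."""
    for score, is_anomaly in labeled_points:
        flagged = score >= threshold
        if is_anomaly and not flagged:
            # Missed anomaly: make the detector more sensitive.
            threshold = max(0.0, threshold - step)
        elif flagged and not is_anomaly:
            # False alarm: make the detector less sensitive.
            threshold = min(1.0, threshold + step)
    return threshold


# Two missed anomalies lower the threshold; the normal point needs no change.
feedback = [(0.45, True), (0.42, True), (0.20, False)]
new_threshold = adjust_threshold(0.5, feedback)
```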

In the above process, a first version of the model can be trained on the curves that have already been labeled and used to predict the curves that have not been labeled. The newly predicted curves and their predicted probability values are then used, together with the original clusters, to readjust the optimization direction of the model. This iteration is repeated until the predicted values of the unlabeled curves no longer change or the specified number of iterations is reached.

To illustrate the method of the present invention, a feature extraction process is further described as an example below:

The experimental environment includes: an Intel dual-core processor (clock frequency 2.6 GHz) with 4 GB of memory; the software environment comprises the Windows Server 2008 operating system and the finished product library scheduling client software; the test data are collected by Zabbix and APM, sent to a database server, gathered through a message bus, and stored in the time series database of the cloud platform.

When selecting features, the operation and maintenance personnel of the finished product library scheduling system anticipate, from experience, the fault scenarios that may occur in the scheduling subsystem of the finished product warehouse logistics system, analyze how faults manifest in those scenarios, and select the corresponding indexes accordingly. To reflect how the operation behaviors of production (code scanning, bar code transmission, execution of sorting actions, and so on) are distributed over a day, the KPIs corresponding to a specific abnormal scenario are aggregated not only per day but also at a granularity of two hours, as shown in FIG. 4. This is equivalent to decomposing each daily feature into 12 two-hour features.
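The decomposition of one daily statistic into 12 two-hour features can be sketched as follows. Event counts are assumed as the statistic, and the event timestamps are hypothetical.

```python
from datetime import datetime


def two_hour_counts(event_times):
    """Count events of one day in 12 buckets: 00:00-02:00, 02:00-04:00, and so on."""
    counts = [0] * 12
    for t in event_times:
        counts[t.hour // 2] += 1
    return counts


# Hypothetical code-scanning events on one production day.
scans = [
    datetime(2020, 6, 3, 0, 15),
    datetime(2020, 6, 3, 1, 59),
    datetime(2020, 6, 3, 8, 0),
    datetime(2020, 6, 3, 23, 30),
]
daily_features = two_hour_counts(scans)
```

The sum of the 12 bucket counts equals the daily count, so the two-hour features refine the daily feature without losing it.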

The selected index data are collected; abnormal working condition data are removed from all index data by manual inspection, and every record containing a missing value is discarded. Data consistency is then checked, the cleaned data enter the message bus, and ETL (extract, transform, load) processing, filtering, splitting, expanding, and so on are performed on the data.

Six months of cleaned and preprocessed historical data are used for feature extraction: 3 months of data are used for training the feature selection model, 2 months of data for assisting parameter tuning during training, and 1 month of data for the final verification of the model's effect.
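The 3/2/1 month split corresponds to fractions of 1/2, 1/3, and 1/6 taken in time order. A minimal sketch, assuming the samples are already sorted by time and using a hypothetical series of 180 daily samples:

```python
def chronological_split(samples):
    """Split time-ordered samples into training (1/2), validation (1/3), and test (1/6)."""
    n = len(samples)
    train_end = n // 2
    validation_end = train_end + n // 3
    return samples[:train_end], samples[train_end:validation_end], samples[validation_end:]


# Hypothetical data set: one sample per day for about six months.
days = list(range(180))
train, validation, test = chronological_split(days)
```

Splitting in time order rather than at random keeps future data out of the training set, which matters for time series.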

Feature extraction includes checking the continuity and integrity of the performance data of the finished product library scheduling system. The continuity and integrity analysis mainly addresses various interferences, such as CPU utilization, memory utilization, and network port rate. The check mainly covers the extraction content of the feature points, the extraction speed, the matching condition of the descriptions, and the matching speed of the descriptions.

Continuity check of feature extraction:

Operation and maintenance anomalies are analyzed through monitoring data, and the continuity of the monitoring data directly affects the final anomaly result. Regression analysis is a statistical method for determining the quantitative interdependence between two or more variables; it studies how a dependent variable depends on independent variables, with the aim of estimating or predicting the mean of the dependent variable from given values of the independent variables. It can be used for prediction, time series modeling, and discovering causal relationships between variables. Unlike classification, which deals with discrete targets, regression deals with continuous target data, so the goal of regression is to predict the value of a numerical target variable.
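The text does not spell out how regression is applied to the continuity check. One hypothetical reading is to fit a line to recent samples and test whether the next sample stays close to the predicted value, as in the sketch below; the function names, the tolerance, and the sample data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_continuous(xs, ys, next_x, next_y, tolerance):
    """True if the next sample lies within `tolerance` of the regression prediction."""
    slope, intercept = fit_line(xs, ys)
    return abs(next_y - (slope * next_x + intercept)) <= tolerance


# Hypothetical recent samples of a steadily rising index.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```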

Integrity check of feature extraction:

In cigarette production, the IT systems run 24 hours a day, 7 days a week. The data volume of machines and applications is not constant, however, because the data vary with the production volume. In addition, shutdown maintenance is carried out during holidays, and when the system is shut down, the service indexes of the cigarette production systems are all zero. The shutdown phase and the business peak phase are clearly distinct, and an ordinary algorithm will almost certainly raise false alarms at the transitions into and out of these two phases.

Therefore, taking the day as the dimension, a number of points around each time point of each day are selected to form a set, kernel density analysis is performed on the set, and all points in a day are then combined to obtain the final model of the normal data distribution. To improve the effect, some noise can also be deliberately added to the training data. During actual detection, the last short segment of the simulated curve obtained by encoding and decoding the test data is compared with the actual data to judge whether a serious deviation has occurred.
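The kernel density step can be sketched as follows, assuming a Gaussian kernel with a fixed bandwidth. The helper names, the bandwidth, and the three days of sample data are hypothetical; neither the kernel nor the bandwidth is specified in the text.

```python
import math


def neighborhood(days, slot, radius):
    """Collect the values within `radius` slots of `slot` from every day."""
    values = []
    for day in days:
        lo = max(0, slot - radius)
        hi = min(len(day), slot + radius + 1)
        values.extend(day[lo:hi])
    return values


def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


# Three hypothetical days with five time slots each; slots 1 to 3 are the busy period.
days = [
    [10, 48, 50, 52, 10],
    [11, 49, 51, 50, 9],
    [10, 50, 50, 51, 11],
]
around_slot = neighborhood(days, slot=2, radius=1)
typical = kernel_density(50.0, around_slot, bandwidth=2.0)
unusual = kernel_density(10.0, around_slot, bandwidth=2.0)
```

A value of 10 is normal at the edges of the day but has almost zero density around slot 2, so the same value can be judged differently depending on the time point.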

This model is somewhat analogous to a mountain range on a 3D map, in which numerous normal distributions are piled up together. A value arriving at the corresponding time during detection is clearly abnormal if it falls in the plain region. Similarly, because the index is a very simple curve, the curve is cut into short segments with a sliding window, the segments are combined into a feature matrix, and the feature matrix is fed into multi-layer encoding and decoding, iterating repeatedly to obtain the best model.
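Cutting a curve into short segments with a sliding window and stacking them into a feature matrix can be sketched as below. The window width and stride are hypothetical, and the encoding and decoding stage is not shown.

```python
def sliding_windows(series, width, stride=1):
    """Cut a series into overlapping segments; each segment is one row of the matrix."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, stride)]


# A short hypothetical index curve.
curve = [0, 1, 2, 3, 4, 5]
feature_matrix = sliding_windows(curve, width=3)
```

Each row of `feature_matrix` is one segment, so a curve of length m yields m - width + 1 rows when the stride is 1.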

In addition, in order to strengthen the processing of the time characteristics, according to the dimension of the day, a plurality of points around the time point are selected for each time point of each day to form a set.

Feature engineering "flattens" logs or multi-system data into features that the model can use and applies various transformations to the features to generate a curve. FIG. 6 shows how the CPU utilization index feature of the finished product library scheduling system is "flattened" and extracted to generate a linear feature curve.

Although the embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and those skilled in the art can make changes, modifications, substitutions and alterations to the above embodiments without departing from the principle and spirit of the present invention, and any simple modification, equivalent change and modification made to the above embodiments according to the technical spirit of the present invention still fall within the technical scope of the present invention.

Claims

1. A characteristic extraction method for finished product library scheduling system performance data based on characteristic engineering is characterized by comprising the following steps:

2. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 1, wherein the feature extraction of the data set comprises extracting the performance data of the finished product library scheduling system and checking the continuity and integrity of the performance data of the finished product library scheduling system;

3. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 2, wherein the extraction of the feature points is performed by using a chi-square test feature point extraction algorithm.

4. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 2, wherein the integrity detection includes detecting the extraction content, the extraction speed, the description coincidence condition and the description coincidence matching speed of the feature points.

5. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 2, wherein a regression analysis method is adopted to check the continuity of the performance data of the finished product library scheduling system.

6. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 2, wherein the specific method for checking the integrity of the performance data of the finished product library scheduling system is as follows: a number of points around each time point are selected along the time dimension to form a set, and whether the performance data of the finished product library scheduling system are complete is judged according to the kernel density of the set.

7. The feature extraction method for the performance data of the finished product warehouse dispatching system based on the feature engineering as claimed in claim 1, wherein the cleaning of the index data comprises removing abnormal data in the operation data of the logistics sorting machine.

8. The feature extraction method for the performance data of the finished product library scheduling system based on the feature engineering as claimed in claim 1, wherein the preprocessing of the index data comprises: checking the consistency of the remaining data, and, after the cleaned data enter the message bus, performing ETL (extract, transform, load) processing, filtering, splitting, and expanding on the data.