CN112131388B - Abnormal data detection method containing text data types

Info

Publication number: CN112131388B
Application number: CN202011037634.6A
Authority: CN
Inventors: 范馨月; 魏斐斐; 杜逆索; 沈齐
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2024-02-06
Anticipated expiration: 2040-09-28
Also published as: CN112131388A

Abstract

The invention discloses a method for detecting abnormal data that includes text data types. The method comprises the following steps: S1, determining whether the read data is purely numerical data, pure text data, or combined numerical-text data; S2, for purely numerical data, performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result. The invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better ensuring data quality.

Description

Abnormal data detection method containing text data types

Technical Field

The invention relates to the technical field of data processing, in particular to an abnormal data detection method containing text data types.

Background

Abnormal data detection refers to identifying unusual data in a given dataset. In different fields, such data may also be called noise, outliers, and so on. These isolated points differ significantly from the remaining points, which raises the suspicion that they are not random deviations but were produced by a different mechanism. With the arrival of the big data era, data volume and dimensionality have exploded, and data is no longer only numerical. To better ensure data quality, detecting abnormal values effectively has become a primary task of data analysis and processing. Among existing anomaly detection algorithms, methods for numerical data are plentiful, and many models such as I-Forest and VAE have appeared in machine learning and deep learning; however, anomaly detection for pure text data or combined numerical-text data is rarely addressed in existing research, and a breakthrough in such methods is still needed.

Disclosure of Invention

The invention aims to provide a method for detecting abnormal data that includes text data types. The invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better ensuring data quality at a given detection accuracy.

The technical scheme of the invention is as follows: a method for detecting abnormal data that includes text data types, comprising the steps of:

S1, determining whether the read data is purely numerical data, pure text data, or combined numerical-text data;

S2, for purely numerical data, performing anomaly detection based on deep learning and machine learning algorithms, and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on deep learning and machine learning algorithms, and outputting the anomaly detection result.

In step S1 of the foregoing method for detecting abnormal data, the method for determining the data type of the read data is as follows:

replacing all digits in each data item with a number a;

replacing all Chinese characters in each data item with a number b;

replacing all letters in each data item with a number c;

replacing all other characters with a number d;

and concatenating all the resulting lists and deduplicating them with a Python dictionary to obtain the data types contained in each column.
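The determination steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the concrete codes 1 to 4 for a, b, c and d, the helper names `char_class`, `column_classes` and `column_type`, the Unicode range used to recognise Chinese characters, and the rule that maps the set of codes to one of the three data types are all assumptions.

```python
# Hypothetical numeric codes standing in for the patent's a, b, c and d.
DIGIT, CHINESE, LETTER, OTHER = 1, 2, 3, 4


def char_class(ch):
    """Map one character to its class code."""
    if ch.isascii() and ch.isdigit():
        return DIGIT
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs, basic block only
        return CHINESE
    if ch.isascii() and ch.isalpha():
        return LETTER
    return OTHER


def column_classes(values):
    """Class codes present in a column, deduplicated via dict keys (order kept)."""
    codes = [char_class(ch) for value in values for ch in str(value)]
    return list(dict.fromkeys(codes))


def column_type(values):
    """Assumed mapping from the class codes to the three data types of step S1."""
    classes = set(column_classes(values))
    if classes == {DIGIT}:
        return "numeric"
    if DIGIT in classes:
        return "mixed"
    return "text"  # an empty column also falls through to this branch


phone_type = column_type(["13800001111", "13912345678"])
plate_type = column_type(["贵A12345", "贵B67890"])
name_type = column_type(["张三", "Alice"])
```

Under these assumptions a column containing a decimal point or a date separator would be classified as mixed, because such characters fall into the "other" class; a real system would need its own rule for those cases.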

In step S2 described in the foregoing method for detecting abnormal data, the method for detecting abnormal data of a pure numerical value is as follows:

performing preliminary anomaly detection on the data by using an LSTM-VAE model, and storing the data detected as normal;

and performing anomaly detection again, by using a Gaussian mixture clustering algorithm, on the data that passed the preliminary detection, and outputting the anomaly detection result.

In step S2, the anomaly detection method for plain text data or numerical text combined data is as follows:

replacing the text values in the data that passed preliminary detection with numerical values;

and performing anomaly detection on the replaced data by using a density clustering algorithm, and outputting the anomaly detection result.
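The two-stage flow described above (a preliminary check, then a second check only on the records that passed) can be outlined as below. This is only a structural sketch: `cascade_detect` is a hypothetical helper, and the two detectors passed to it are toy placeholders, not the LSTM-VAE, Gaussian mixture or density clustering models named in the text. Reporting the union of both stages as the final result is also an assumption.

```python
from collections import Counter


def cascade_detect(records, preliminary, secondary, encode=None):
    """Return the records flagged by either stage.

    preliminary: callable(record) -> bool, True if the record is abnormal.
    secondary:   callable(list of features) -> list of bool, one per survivor.
    encode:      optional callable turning a surviving record into features.
    """
    flagged = [r for r in records if preliminary(r)]
    survivors = [r for r in records if not preliminary(r)]
    features = [encode(r) for r in survivors] if encode else survivors
    second_flags = secondary(features)
    flagged += [r for r, bad in zip(survivors, second_flags) if bad]
    return flagged


def wrong_length(record):
    """Toy placeholder for the first stage: an 11-character length check."""
    return len(record) != 11


def rare_prefix(values):
    """Toy placeholder for the second stage: flag an uncommon first character."""
    if not values:
        return []
    common = Counter(v[0] for v in values).most_common(1)[0][0]
    return [v[0] != common for v in values]


phones = ["13800001111", "13912345678", "1380000", "93800001111", "13700002222"]
abnormal = cascade_detect(phones, wrong_length, rare_prefix)
```

For text or mixed data, the `encode` argument is where the replacement of text values by numerical values would take place before the second stage runs.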

The beneficial effects are as follows: compared with the prior art, the method can not only detect anomalies in purely numerical data, but can also detect anomalies in pure text data or combined numerical-text data by replacing the text values with numerical values and then applying deep learning and machine learning algorithms. The invention can therefore detect anomalies in text data while maintaining high detection accuracy on purely numerical data, so that more data types can be checked and data quality is better ensured. During their research the inventors found that when a single algorithm model is used for anomaly detection, the detection accuracy is limited; and because anomaly detection for text data or combined text-numerical data has rarely been studied, such data is usually skipped in conventional detection. To solve these technical problems, when detecting anomalies in text or combined text-numerical data, the inventors replace the digits in each data item with the number a, the Chinese characters with the number b, the letters with the number c, and all other characters with the number d; through this processing the text data is converted entirely into numerical values, after which anomaly detection can be performed on it.
In addition, to ensure detection accuracy, the inventors arrived at the following procedure after multiple rounds of experiments: first, an LSTM-VAE model performs preliminary anomaly detection on the data, and the data detected as normal is stored. For numerical data, a Gaussian mixture clustering algorithm then performs anomaly detection again, which raised the accuracy of numerical anomaly detection to 100% in the tests described below. For text or combined text-numerical data, the text values in the data that passed preliminary detection are replaced with numerical values, and a density clustering algorithm then performs anomaly detection on the replaced data. This step-by-step approach effectively improves the accuracy of anomaly detection.

In summary, the invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better ensuring data quality at a given detection accuracy.

Drawings

FIG. 1 is a flow chart of anomaly detection in accordance with the present invention;

FIG. 2 is a data type determination flow chart;

FIG. 3 is a flow chart of purely numerical data anomaly detection;

FIG. 4 is an LSTM model diagram;

FIG. 5 is a VAE model diagram;

FIG. 6 is GMM model pseudocode;

FIG. 7 is a partial screenshot of csv data used for anomaly detection verification of purely numerical data;

FIG. 8 is an anomaly detection result for the data of FIG. 7;

FIG. 9 is a partial screenshot of the data of FIG. 7 after the amount of abnormal data has been increased;

FIG. 10 is an anomaly detection result for the data of FIG. 9;

FIG. 11 is a telephone number GMM cluster map;

FIG. 12 is an outlier output of FIG. 11;

FIG. 13 is a partial screenshot of license plate number data for use in anomaly detection verification of numeric text combined data;

FIG. 14 is a calculation result of the DBSCAN density clustering algorithm of the data of FIG. 13;

FIG. 15 is an outlier output result of FIG. 14.

Detailed Description

The invention is further illustrated by the following figures and examples, which are not intended to be limiting.

Example 1. A method for detecting abnormal data that includes text data types, see FIG. 1, comprising the steps of:

S1, determining whether the read data is purely numerical data, pure text data, or combined numerical-text data; the data is mainly read from a database and saved to a local CSV file;

S2, for purely numerical data, performing anomaly detection based on a deep learning algorithm, and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on a deep learning algorithm, and outputting the anomaly detection result.

In the foregoing step S1, the method for determining the data type of the read data (see FIG. 2) is specifically as follows:

replacing all digits in each data item with a number a;

replacing all Chinese characters in each data item with a number b;

replacing all letters in each data item with a number c;

replacing all other characters with a number d;

In the foregoing step S2, the anomaly detection method for purely numerical data (see FIG. 3) is as follows:

In the invention, anomaly detection is first performed on the number of digits of the numerical data, using an LSTM-VAE model. LSTM is a type of recurrent neural network (RNN) with a stronger memory capacity than a basic RNN, and is better suited to processing time-series data. The variational autoencoder (VAE) adds variational inference to the autoencoder (AE), turning the whole neural network from a discriminative model into a generative model; on the premise that the prior distribution is a standard normal distribution, the VAE model can detect abnormal data well according to the anomaly score. Combining the LSTM model with the VAE model improves the memory of the neural network and strengthens its anomaly detection capability. Model diagrams of the LSTM and VAE models are shown in FIGS. 4 and 5, respectively.
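For reference, the textbook training objective of a VAE with a standard normal prior (the evidence lower bound) can be written as below. This is the standard form and is given here only as an assumption about what such a model optimises; the text does not state the exact loss or how the anomaly score is derived from it. The symbols $q_\phi(z \mid x)$ (encoder) and $p_\theta(x \mid z)$ (decoder) are notation introduced for this illustration.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
\qquad p(z) = \mathcal{N}(0, I)
```

A common choice, again an assumption here, is to score each sample by its reconstruction error or by the negative of this bound, and to flag samples whose score exceeds a threshold.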

The Gaussian mixture clustering algorithm (GMM) is specifically as follows:

The Gaussian mixture distribution is defined as:

$$p_M(x) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \mu_i, \Sigma_i) \qquad (4\text{-}1)$$

The distribution consists of $k$ mixture components, each corresponding to a Gaussian distribution, where $\mu_i$ and $\Sigma_i$ are the parameters of the $i$-th Gaussian mixture component, and $\alpha_i > 0$ is the corresponding mixing coefficient, with $\sum_{i=1}^{k} \alpha_i = 1$.

Assume that the samples are generated by the Gaussian mixture distribution as follows: first, a Gaussian mixture component is selected according to the prior distribution defined by $\alpha_1, \alpha_2, \ldots, \alpha_k$, where $\alpha_i$ is the probability of selecting the $i$-th mixture component; then a sample is drawn according to the probability density function of the selected component.

If the training set $D = \{x_1, x_2, \ldots, x_m\}$ is generated by the above process, let the random variable $z_j \in \{1, 2, \ldots, k\}$ denote the Gaussian mixture component that generated sample $x_j$; its value is unknown. The prior probability $P(z_j = i)$ corresponds to $\alpha_i$ $(i = 1, 2, \ldots, k)$. According to Bayes' theorem, the posterior distribution of $z_j$ is:

$$p_M(z_j = i \mid x_j) = \frac{P(z_j = i)\, p_M(x_j \mid z_j = i)}{p_M(x_j)} = \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} \qquad (4\text{-}2)$$

From formula (4-2), $p_M(z_j = i \mid x_j)$ gives the posterior probability that sample $x_j$ was generated by the $i$-th Gaussian mixture component, denoted $\gamma_{ji}$. When the Gaussian mixture distribution (4-1) is known, Gaussian mixture clustering divides the sample set $D$ into $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$, and the cluster label $\lambda_j$ of each sample $x_j$ is determined as follows:

$$\lambda_j = \underset{i \in \{1, 2, \ldots, k\}}{\arg\max}\ \gamma_{ji}$$

Based on the above, pseudocode for Gaussian mixture clustering is shown in FIG. 6.
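As a small worked illustration of the posterior in formula (4-2) and the cluster labelling rule, the sketch below evaluates both for a one-dimensional mixture whose parameters are assumed to be already known. The parameter values, the sample points and the function names are hypothetical, and the fitting of the parameters (the EM iterations of the pseudocode in FIG. 6) is not shown.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def posteriors(x, alphas, mus, sigma2s):
    """Posterior probability of each component for one sample x, as in (4-2)."""
    weighted = [a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigma2s)]
    total = sum(weighted)  # the mixture density p_M(x); assumed not to underflow
    return [w / total for w in weighted]


def cluster_label(x, alphas, mus, sigma2s):
    """Index of the component with the largest posterior probability."""
    gammas = posteriors(x, alphas, mus, sigma2s)
    return max(range(len(gammas)), key=gammas.__getitem__)


# Hypothetical two-component mixture with equal weights and unit variances.
ALPHAS = [0.5, 0.5]
MUS = [0.0, 10.0]
SIGMA2S = [1.0, 1.0]

labels = [cluster_label(x, ALPHAS, MUS, SIGMA2S) for x in [-0.3, 0.4, 9.5, 10.8]]
midpoint = posteriors(5.0, ALPHAS, MUS, SIGMA2S)
```

A point exactly halfway between the two means, such as 5.0 here, receives equal posterior probability from both components.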

Combining the LSTM-VAE model with the GMM model gives good anomaly detection results on purely numerical data, as the following tests show:

(1) The test data comes from CSV data exported from the database to a local file; a partial screenshot is shown in FIG. 7.

The data set comprises five groups of data, including time, ID card number, telephone number and license plate number, with 10000 records in total, of which 216 are abnormal. The results of preliminary anomaly detection using the LSTM-VAE are shown in FIG. 8. As can be seen from FIG. 8, the detection accuracy is close to 100%.

(2) The amount of abnormal data in the data of FIG. 7 was then increased to 300 records; a partial screenshot is shown in FIG. 9. The data in FIG. 9 was also subjected to anomaly detection using the LSTM-VAE, and the results are shown in FIG. 10. As can be seen from FIG. 10, the accuracy is 0.9941, which is still high, and all abnormal data is output.

(3) For purely numerical data, taking telephone numbers as an example, GMM clustering was applied after the preliminary check by the LSTM-VAE model; the result is shown in FIG. 11. FIG. 11 shows that outliers exist, and all of them are output as shown in FIG. 12.

Checking against the original data confirms that all abnormal values were found; the accuracy is 100% and the detection effect is good.

In the aforementioned step S2, the anomaly detection method for pure text data or combined numerical-text data is as follows:

replacing the text values in the data that passed preliminary detection with numerical values;

Anomaly detection of combined numerical-text data is illustrated with license plate numbers; a partial screenshot of the data is shown in FIG. 13. After the number of characters of the license plate numbers has been preliminarily checked by the LSTM-VAE model, the DBSCAN density clustering algorithm is applied to the data in which the license plate text has been replaced with numerical values; the result is shown in FIG. 14. The outliers in FIG. 14 are output as shown in FIG. 15. Comparison with the original data shows that all outliers of the original data were found.
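The license plate example can be sketched as follows: each plate is turned into a numerical vector, and a plain DBSCAN marks the points that belong to no dense region as outliers. Several things here are assumptions made for illustration: the per-character encoding with codes 1 to 4 for a, b, c and d, the sample plates, the parameters `eps` and `min_pts`, and the premise that all plates already have the same length (as they would after a length check). The DBSCAN below is a compact textbook version, not the implementation used for FIG. 14.

```python
def encode_plate(plate):
    """Hypothetical encoding: one class code per character (a=1, b=2, c=3, d=4)."""
    codes = []
    for ch in plate:
        if ch.isascii() and ch.isdigit():
            codes.append(1)
        elif "\u4e00" <= ch <= "\u9fff":
            codes.append(2)
        elif ch.isascii() and ch.isalpha():
            codes.append(3)
        else:
            codes.append(4)
    return codes


def region_query(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i]."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)  # None means not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


plates = ["贵A12345", "贵B67890", "贵C24680", "贵D13579", "贵A12#45", "ABCDEFG"]
labels = dbscan([encode_plate(p) for p in plates], eps=0.5, min_pts=3)
outliers = [p for p, label in zip(plates, labels) if label == -1]
```

With this encoding the four well-formed plates map to the same vector and form one cluster, while the two malformed plates have no close neighbours and are reported as noise.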

Claims

1. A method for detecting abnormal data including a text type data type, comprising the steps of: s1, judging whether the data type of the read data is pure numerical data, pure text data or numerical text combined data;

S2, for purely numerical data, performing data anomaly detection based on deep learning and machine learning algorithms, and outputting an anomaly detection result; wherein the data anomaly detection based on deep learning and machine learning algorithms comprises: performing preliminary anomaly detection on the data by using an LSTM-VAE model, and storing the data detected as normal; and performing anomaly detection again, by using a Gaussian mixture clustering algorithm, on the data that passed the preliminary detection;

for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing data anomaly detection based on deep learning and machine learning algorithms, and outputting an anomaly detection result; wherein the data anomaly detection based on deep learning and machine learning algorithms comprises: performing preliminary anomaly detection on the data by using an LSTM-VAE model, and storing the data detected as normal; replacing the text values in the data detected as normal with numerical values; and performing anomaly detection on the replaced data by using a density clustering algorithm.

2. The method for detecting abnormal data according to claim 1, wherein: in step S1, the method for determining the data type of the read data is as follows:

replacing all digits in each data item with a number a;

replacing all Chinese characters in each data item with a number b;

replacing all letters in each data item with a number c;

replacing all other characters with a number d;