CN112131388B - Abnormal data detection method containing text data types - Google Patents

Abnormal data detection method containing text data types

Info

Publication number
CN112131388B
CN112131388B CN202011037634.6A CN202011037634A CN112131388B CN 112131388 B CN112131388 B CN 112131388B CN 202011037634 A CN202011037634 A CN 202011037634A CN 112131388 B CN112131388 B CN 112131388B
Authority
CN
China
Prior art keywords
data
text
anomaly detection
numerical
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011037634.6A
Other languages
Chinese (zh)
Other versions
CN112131388A (en)
Inventor
范馨月
魏斐斐
杜逆索
沈齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention discloses an abnormal data detection method for data that includes text data types. The method comprises the following steps: S1, determining whether the read data is pure numerical data, pure text data, or combined numerical-text data; S2, for pure numerical data, performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result. The invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better safeguarding data quality.

Description

Abnormal data detection method containing text data types
Technical Field
The invention relates to the technical field of data processing, in particular to an abnormal data detection method containing text data types.
Background
Abnormal data detection refers to identifying unusual data in a given dataset. In different fields, such data may also be called noise, outliers, and so on. These isolated points differ significantly from the remaining points, which raises the suspicion that they result not from random deviation but from a different generating mechanism. With the arrival of the big data era, data volume and dimensionality have grown explosively, and data is no longer limited to numerical types. To better safeguard data quality, effectively detecting abnormal values has become a primary task of data analysis and processing. Among existing anomaly detection algorithms, many target numerical data, and machine learning and deep learning have produced numerous models such as Isolation Forest (iForest) and VAE; however, anomaly detection for pure text data or combined numerical-text data is rarely addressed in existing research, and a breakthrough in such methods is still needed.
Disclosure of Invention
The invention aims to provide an abnormal data detection method for data that includes text data types. The invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better safeguarding data quality at a given detection accuracy.
The technical scheme of the invention is as follows: an abnormal data detection method for data that includes text data types, comprising the following steps:
S1, determining whether the read data is pure numerical data, pure text data, or combined numerical-text data;
S2, for pure numerical data, performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result.
In step S1 of the foregoing abnormal data detection method, the data type of the read data is determined as follows:
replacing all digits in each data item with the number a;
replacing all Chinese characters in each data item with the number b;
replacing all letters in each data item with the number c;
replacing all other characters with the number d;
and concatenating all the resulting data lists and deduplicating them with a Python dictionary to obtain the data types contained in each column.
In step S2 of the foregoing abnormal data detection method, anomaly detection for pure numerical data is performed as follows:
performing preliminary anomaly detection on the data using an LSTM-VAE model, and saving the data detected as normal;
and performing anomaly detection, using a Gaussian mixture clustering algorithm, on the data that passed the preliminary anomaly detection, and outputting the anomaly detection result.
In step S2, anomaly detection for pure text data or combined numerical-text data is performed as follows:
performing preliminary anomaly detection on the data using an LSTM-VAE model, and saving the data detected as normal;
replacing the text values in the data detected as normal with numerical values;
and performing anomaly detection on the replaced data using a density clustering algorithm, and outputting the anomaly detection result.
Beneficial effects: compared with the prior art, the method not only detects anomalies in pure numerical data but also, after replacing the text values in pure text data or combined numerical-text data with numerical values, detects anomalies in that data based on deep learning and machine learning algorithms. The invention can therefore detect anomalies in text data while maintaining high detection accuracy on pure numerical data, so that more data types can be checked and data quality is better safeguarded. During their research the inventors found that detection accuracy is limited when a single algorithmic model is used for anomaly detection, and that, because anomaly detection for text data or combined text-numerical data is rarely addressed, such data is usually left unchecked in routine detection. To solve these technical problems, when detecting anomalies in text data or combined text-numerical data, the inventors replace the digits in each data item with the number a, Chinese characters with the number b, letters with the number c, and other characters with the number d. This processing converts the text data entirely into numerical values, so that anomaly detection can be performed on text data or combined text-numerical data.
In addition, to ensure detection accuracy, the inventors settled on the following procedure after multiple rounds of experiments: first, preliminary anomaly detection is performed on the data using an LSTM-VAE model, and the data detected as normal is saved. For numerical data, a Gaussian mixture clustering algorithm is then used for anomaly detection, which raised the accuracy of numerical anomaly detection to 100% in the inventors' tests. For text data or combined text-numerical data, the text values in the data detected as normal are replaced with numerical values, and anomaly detection is then performed on the replaced data using a density clustering algorithm. This step-by-step approach effectively improves the accuracy of anomaly detection.
In summary, the invention can detect anomalies in text data and in mixed numerical-text data while maintaining good detection accuracy on numerical data, thereby enabling anomaly detection across multiple data types, broadening the range of data types that can be checked, and better safeguarding data quality at a given detection accuracy.
Drawings
FIG. 1 is a flow chart of anomaly detection in accordance with the present invention;
FIG. 2 is a flow chart of data type determination;
FIG. 3 is a flow chart of anomaly detection for pure numerical data;
FIG. 4 is an LSTM model diagram;
FIG. 5 is a VAE model diagram;
FIG. 6 is pseudocode of the GMM model;
FIG. 7 is a partial screenshot of the CSV data used to verify anomaly detection for pure numerical data;
FIG. 8 is the anomaly detection result for the data of FIG. 7;
FIG. 9 is a partial screenshot of the data of FIG. 7 after the amount of abnormal data has been increased;
FIG. 10 is the anomaly detection result for the data of FIG. 9;
FIG. 11 is a GMM clustering plot of the telephone numbers;
FIG. 12 is the outlier output for FIG. 11;
FIG. 13 is a partial screenshot of the license plate number data used to verify anomaly detection for combined numerical-text data;
FIG. 14 is the result of applying the DBSCAN density clustering algorithm to the data of FIG. 13;
FIG. 15 is the outlier output for FIG. 14.
Detailed Description
The invention is further illustrated by the following figures and examples, which are not intended to be limiting.
Example 1. An abnormal data detection method for data that includes text data types, shown in FIG. 1, comprising the following steps:
S1, determining whether the read data is pure numerical data, pure text data, or combined numerical-text data; the data is mainly read from a database and stored in a local CSV file;
S2, for pure numerical data, performing anomaly detection based on a deep learning algorithm and outputting the anomaly detection result; for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on a deep learning algorithm and outputting the anomaly detection result.
In the foregoing step S1, the data type of the read data is determined as shown in FIG. 2, specifically as follows:
replacing all digits in each data item with the number a;
replacing all Chinese characters in each data item with the number b;
replacing all letters in each data item with the number c;
replacing all other characters with the number d;
and concatenating all the resulting data lists and deduplicating them with a Python dictionary to obtain the data types contained in each column.
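The type-determination steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the marker values "1" to "4" standing in for a, b, c and d, the function names, the use of the CJK Unified Ideographs range to recognise Chinese characters, and the rule that maps the deduplicated classes onto the three data types are all assumptions.

```python
def char_class_pattern(value, a="1", b="2", c="3", d="4"):
    """Replace every character of one data item with its class marker."""
    out = []
    for ch in str(value):
        if "0" <= ch <= "9":
            out.append(a)  # digit
        elif "\u4e00" <= ch <= "\u9fff":
            out.append(b)  # Chinese character (CJK Unified Ideographs block)
        elif ch.isascii() and ch.isalpha():
            out.append(c)  # Latin letter
        else:
            out.append(d)  # any other character
    return "".join(out)


def column_classes(values, **markers):
    """Concatenate the patterns of a column and deduplicate via dict keys."""
    joined = "".join(char_class_pattern(v, **markers) for v in values)
    return list(dict.fromkeys(joined))  # dict keys keep first-seen order


def column_type(values, a="1"):
    """Assumed mapping from the character classes found to a data type."""
    classes = set(column_classes(values, a=a))
    if classes == {a}:
        return "numeric"   # only digits were seen
    if a in classes:
        return "combined"  # digits mixed with other classes
    return "text"          # no digits at all


print(char_class_pattern("贵A12345"))              # 2311111
print(column_type(["13800138000", "13912345678"]))  # numeric
```

Under this assumed rule a value such as "3.14" would count as combined because of the decimal point; a real implementation would need to decide how separators and signs are treated.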
In the foregoing step S2, anomaly detection for pure numerical data is performed as shown in FIG. 3, as follows:
performing preliminary anomaly detection on the data using an LSTM-VAE model, and saving the data detected as normal;
and performing anomaly detection, using a Gaussian mixture clustering algorithm, on the data that passed the preliminary anomaly detection, and outputting the anomaly detection result.
In the invention, anomaly detection is first performed on the number of digits of the numerical data, using an LSTM-VAE model. LSTM is a type of recurrent neural network (RNN) that has stronger memory than a basic RNN and is better suited to processing time-series data. A variational autoencoder (VAE) adds variational inference to an autoencoder (AE), turning the network from a discriminative model into a generative model; on the premise that the prior distribution is a standard normal distribution, the VAE model can detect abnormal data well according to the anomaly detection score. Combining the LSTM model with the VAE model improves the memory of the neural network and strengthens its anomaly detection capability. Model diagrams of the LSTM and VAE models are shown in FIG. 4 and FIG. 5, respectively.
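For orientation, the standard VAE training objective with a standard normal prior can be written as below. This is the textbook formulation rather than a formula given in the patent; the symbols $q_\phi$ (encoder), $p_\theta$ (decoder) and $\mathcal{L}$ are introduced here only for illustration.

```latex
\mathcal{L}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
               - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big),
\qquad p(z) = \mathcal{N}(0, I)
```

A sample with an unusually low $\mathcal{L}(x)$ is poorly explained by the trained model and could be flagged as abnormal; the patent does not state which score or threshold it actually uses.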
The Gaussian mixture clustering algorithm (GMM) is specifically as follows:
The Gaussian mixture distribution is defined as:
$$p_M(x) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \mu_i, \Sigma_i) \tag{4-1}$$
The distribution consists of $k$ mixture components, each corresponding to a Gaussian distribution, where $\mu_i$ and $\Sigma_i$ are the parameters of the $i$-th Gaussian mixture component, and $\alpha_i > 0$ is the corresponding mixing coefficient, with $\sum_{i=1}^{k} \alpha_i = 1$.
Assume that the samples are generated by the Gaussian mixture distribution as follows: first, a Gaussian mixture component is selected according to the prior distribution defined by $\alpha_1, \alpha_2, \ldots, \alpha_k$, where $\alpha_i$ is the probability of selecting the $i$-th mixture component; then a sample is drawn according to the probability density function of the selected component.
If the training set $D = \{x_1, x_2, \ldots, x_m\}$ is generated by the above process, let the random variable $z_j \in \{1, 2, \ldots, k\}$ denote the Gaussian mixture component that generated sample $x_j$; its value is unknown. The prior probability $P(z_j = i)$ corresponds to $\alpha_i$ $(i = 1, 2, \ldots, k)$. By Bayes' theorem, the posterior distribution of $z_j$ is
$$p_M(z_j = i \mid x_j) = \frac{P(z_j = i)\, p_M(x_j \mid z_j = i)}{p_M(x_j)} = \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} \tag{4-2}$$
By formula (4-2), $p_M(z_j = i \mid x_j)$ gives the posterior probability that sample $x_j$ was generated by the $i$-th Gaussian mixture component, denoted $\gamma_{ji}$. When the Gaussian mixture distribution (4-1) is known, Gaussian mixture clustering divides the sample set $D$ into $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$, and the cluster label $\lambda_j$ of each sample $x_j$ is determined as:
$$\lambda_j = \arg\max_{i \in \{1, 2, \ldots, k\}} \gamma_{ji}$$
From the above, pseudocode for Gaussian mixture clustering is shown in FIG. 6.
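Since the pseudocode of FIG. 6 is not reproduced in the text, the following is a minimal Python sketch of Gaussian mixture clustering fitted with the usual EM iteration. It is an illustration under stated assumptions, not the patent's code: it handles only one-dimensional data (so each component has a scalar variance instead of a covariance matrix), and the function names, the quantile-based initialisation, the fixed iteration count and the sample values are all hypothetical.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, alphas, mus, variances):
    """Posterior probability of each component for x (Bayes' theorem)."""
    weighted = [a * gaussian_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]
    total = sum(weighted)
    if total == 0.0:  # x is far from every component: fall back to the prior
        return list(alphas)
    return [w / total for w in weighted]


def fit_gmm_1d(xs, k, n_iter=100, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM."""
    m = len(xs)
    ordered = sorted(xs)
    # Deterministic initialisation (assumption): means at evenly spaced quantiles.
    mus = [ordered[int((i + 0.5) * m / k)] for i in range(k)]
    mean = sum(xs) / m
    var0 = max(sum((x - mean) ** 2 for x in xs) / m, min_var)
    variances = [var0] * k
    alphas = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior of every component for every sample.
        gammas = [posteriors(x, alphas, mus, variances) for x in xs]
        # M-step: re-estimate mixing coefficients, means and variances.
        for i in range(k):
            n_i = sum(g[i] for g in gammas)
            alphas[i] = n_i / m
            if n_i < 1e-12:
                continue  # empty component: keep its previous mean and variance
            mus[i] = sum(g[i] * x for g, x in zip(gammas, xs)) / n_i
            spread = sum(g[i] * (x - mus[i]) ** 2 for g, x in zip(gammas, xs)) / n_i
            variances[i] = max(spread, min_var)
    return alphas, mus, variances


def cluster_label(x, alphas, mus, variances):
    """Cluster label: index of the component with the largest posterior."""
    gamma = posteriors(x, alphas, mus, variances)
    return max(range(len(gamma)), key=gamma.__getitem__)


samples = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 9.9, 10.0, 10.1, 10.0, 9.95, 10.05]
alphas, mus, variances = fit_gmm_1d(samples, k=2)
print([cluster_label(x, alphas, mus, variances) for x in samples])
```

The patent does not say how outliers are read off the clustering result; one common choice, offered here only as a possibility, is to flag samples whose mixture density falls below a threshold.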
Combining the LSTM-VAE model with the GMM model to detect anomalies in pure numerical data gives good results, as shown by the following tests:
(1) The test data is CSV data exported from the database to a local file; a partial screenshot is shown in FIG. 7.
The data set comprises five groups of data, including time, ID card number, telephone number and license plate number, with 10000 records in total, of which 216 are abnormal. The result of the preliminary anomaly detection using LSTM-VAE is shown in FIG. 8. As can be seen from FIG. 8, the judgment accuracy is close to 100%.
(2) The amount of abnormal data in the data shown in FIG. 7 was increased further, to 300 abnormal records; a partial screenshot is shown in FIG. 9. Anomaly detection using LSTM-VAE was likewise performed on the data in FIG. 9, and the result is shown in FIG. 10. As can be seen from FIG. 10, the accuracy is 0.9941, which is also high, and all abnormal data is output.
(3) For pure numerical data, taking telephone numbers as an example, GMM clustering is applied after the preliminary check by the LSTM-VAE model; the result is shown in FIG. 11. FIG. 11 shows that outliers exist, and all of them are output as shown in FIG. 12.
Checking against the original data confirms that all abnormal values were found; the accuracy is 100%, and the detection effect is good.
In the foregoing step S2, anomaly detection for pure text data or combined numerical-text data is performed as follows:
performing preliminary anomaly detection on the data using an LSTM-VAE model, and saving the data detected as normal;
replacing the text values in the data detected as normal with numerical values;
and performing anomaly detection on the replaced data using a density clustering algorithm, and outputting the anomaly detection result.
Anomaly detection for combined numerical-text data is illustrated with license plate numbers; a partial screenshot of the data is shown in FIG. 13. After the number of digits of the license plate numbers is preliminarily checked by the LSTM-VAE model, the DBSCAN density clustering algorithm is applied to the data in which the license plate text has been replaced with numerical values; the result is shown in FIG. 14. The outliers in FIG. 14 are output as shown in FIG. 15. Comparison with the original data shows that all outliers in the original data were found.
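The density clustering step can be illustrated with a compact pure-Python DBSCAN. This is a sketch of the general algorithm, not the patent's implementation: the parameter names `eps` and `min_pts`, their values, and the sample points are hypothetical, and the patent does not specify how the replaced license plate values are turned into the points that are clustered.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for noise (outlier)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        near = neighbors(i)
        if len(near) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in near if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            near_j = neighbors(j)
            if len(near_j) >= min_pts:
                queue.extend(near_j)  # j is a core point: keep expanding
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print(labels)    # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(outliers)  # [(50, 50)]
```

Points labelled -1 belong to no dense region, which is what makes density clustering usable for outlier output of the kind shown in FIG. 15.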

Claims (2)

1. An abnormal data detection method for data that includes text data types, comprising the following steps: S1, determining whether the read data is pure numerical data, pure text data, or combined numerical-text data;
S2, for pure numerical data, performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result; wherein said anomaly detection comprises: performing preliminary anomaly detection on the data using an LSTM-VAE model and saving the data detected as normal; and performing anomaly detection again, using a Gaussian mixture clustering algorithm, on the data that passed the preliminary anomaly detection;
for pure text data or combined numerical-text data, replacing the text values in the data with numerical values, then performing anomaly detection based on deep learning and machine learning algorithms and outputting the anomaly detection result; wherein said anomaly detection comprises: performing preliminary anomaly detection on the data using an LSTM-VAE model and saving the data detected as normal; replacing the text values in the data detected as normal with numerical values; and performing anomaly detection on the replaced data using a density clustering algorithm.
2. The abnormal data detection method according to claim 1, wherein in step S1 the data type of the read data is determined as follows:
replacing all digits in each data item with the number a;
replacing all Chinese characters in each data item with the number b;
replacing all letters in each data item with the number c;
replacing all other characters with the number d;
and concatenating all the resulting data lists and deduplicating them with a Python dictionary to obtain the data types contained in each column.
CN202011037634.6A 2020-09-28 2020-09-28 Abnormal data detection method containing text data types Active CN112131388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011037634.6A CN112131388B (en) 2020-09-28 2020-09-28 Abnormal data detection method containing text data types

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011037634.6A CN112131388B (en) 2020-09-28 2020-09-28 Abnormal data detection method containing text data types

Publications (2)

Publication Number Publication Date
CN112131388A CN112131388A (en) 2020-12-25
CN112131388B true CN112131388B (en) 2024-02-06

Family

ID=73840358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011037634.6A Active CN112131388B (en) 2020-09-28 2020-09-28 Abnormal data detection method containing text data types

Country Status (1)

Country Link
CN (1) CN112131388B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391443A (en) * 2017-06-28 2017-11-24 北京航空航天大学 A kind of sparse data method for detecting abnormality and device
CN107506451A (en) * 2017-08-28 2017-12-22 泰康保险集团股份有限公司 abnormal information monitoring method and device for data interaction
JP2019067069A (en) * 2017-09-29 2019-04-25 アンリツ株式会社 Abnormality detecting apparatus, abnormality detecting method and abnormality detecting program
CN110086829A (en) * 2019-05-14 2019-08-02 四川长虹电器股份有限公司 A method of Internet of Things unusual checking is carried out based on machine learning techniques
CN111344721A (en) * 2017-11-13 2020-06-26 国际商业机器公司 Anomaly detection using cognitive computation
CN111695792A (en) * 2020-05-29 2020-09-22 北方国际合作股份有限公司 Subway illumination system abnormal energy consumption analysis method based on multi-attribute clustering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150271030A1 (en) * 2014-03-18 2015-09-24 Vmware, Inc. Methods and systems for detection of data anomalies
US20200273570A1 (en) * 2019-02-22 2020-08-27 Accenture Global Solutions Limited Predictive analysis platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
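Anomaly detection for time series using VAE-LSTM hybrid model; Shuyu Lin et al.; 2020 IEEE International Conference on Acoustics, Speech and Signal Processing; 1-21 *
Research and Application of Data Quality Based on Unsupervised Anomaly Detection Algorithms; 魏斐斐; China Master's Theses Full-text Database, Information Science and Technology; I138-1072 *
Research on Anomaly Detection Technology for Cloud Computing Environments; 李沛原; China Master's Theses Full-text Database, Information Science and Technology; I139-194 *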

Also Published As

Publication number Publication date
CN112131388A (en) 2020-12-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant