CN112163624A - Data abnormity judgment method and system based on deep learning and extreme value theory - Google Patents


Info

Publication number
CN112163624A
Authority
CN
China
Prior art keywords
data
abnormal
score
threshold
scoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011060903.0A
Other languages
Chinese (zh)
Inventor
金耀辉
何浩
黄宗源
李龙元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention provides a data anomaly judgment method and system based on deep learning and extreme value theory. An anomaly scoring model is constructed from the data samples in the current data set and iteratively optimized so that it approaches an optimization target; an anomaly score is obtained for each data sample through the anomaly scoring model; the parameters of an extreme value distribution formula are estimated from the extreme values among the anomaly scores of the data samples, and an anomaly score threshold is calculated using a threshold calculation formula; the anomaly scoring model is then used to obtain the anomaly score of the data to be judged in the current data set, this score is compared with the anomaly score threshold, and abnormal data are labeled. The invention optimizes the anomaly score end to end, which helps make full use of the data and of the representation learning capability; at the same time, the anomaly score threshold is determined from the actual data set, which avoids the complexity and subjectivity of setting the threshold manually and improves the transferability and anomaly recognition capability of the method.

Description

Data abnormity judgment method and system based on deep learning and extreme value theory
Technical Field
The invention relates to the technical field of data processing, in particular to a data abnormity judgment method and system based on deep learning and an extreme value theory.
Background
Within a data set, a small amount of data that deviates significantly from the majority of the data is called anomalous data. Anomaly detection aims to discover such data and has important application value in many fields, such as fraudulent transaction detection in financial activities and network attack identification in network security. The general flow of anomaly detection is as follows: first, an anomaly score is defined according to the detection technique adopted (for example, a distance-based method defines the anomaly score according to the distance between a data point and ordinary data); then the detection technique is used to evaluate an anomaly score for each data sample; finally, a threshold is set for the anomaly score, and data scoring above the threshold are regarded as anomalous. The choice of threshold determines whether a data point is judged anomalous, and therefore affects the accuracy of anomaly detection. In most existing anomaly detection methods the threshold is set manually: the personnel involved usually have to study the characteristics of the data thoroughly and identify anomalous data among the data with high anomaly scores before a suitable score can be chosen to define an anomaly, so threshold selection is difficult and costly in time and labor. Moreover, a manually set threshold can hardly be guaranteed to be objective and reasonable, and cannot be justified scientifically. In addition, once the data set changes, it must be re-analyzed to find a suitable threshold.
Anomaly detection methods fall into two major categories: traditional methods and neural network methods. Traditional anomaly detection methods are based on conventional machine learning models, such as clustering models, distance models and tree models, which have limited ability to handle nonlinear feature relationships, are prone to the curse of dimensionality, and cannot process high-dimensional data efficiently. The emergence of neural networks and deep learning offers a new approach to anomaly detection. Existing unsupervised anomaly detection methods based on deep learning generally comprise two steps: the first step uses a representation learning (characterization learning) method such as an autoencoder to represent the data in a new characterization space, and the second step defines an anomaly score in that space based on the reconstruction error or a distance relation. These representation learning methods aim to improve the expressive power of the characterization; they neither optimize the anomaly score directly nor further guide the data distribution in the characterization space, so the data are not fully utilized and the anomaly scores obtained in the second step are of low quality.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a data abnormity judgment method and system based on deep learning and an extreme value theory.
The invention is realized by the following technical scheme.
According to one aspect of the invention, a data anomaly judgment method based on deep learning and extreme value theory is provided, and comprises the following steps:
constructing an abnormal scoring model according to data samples in the current data set, and performing iterative optimization on the abnormal scoring model to enable the abnormal scoring model to approach an optimization target;
obtaining an abnormal score value of the data sample through an abnormal scoring model;
estimating parameters of an extreme value distribution formula according to extreme values in the abnormal score values of the acquired data samples, and calculating an abnormal score threshold by using a threshold calculation formula;
and obtaining the abnormal score of the data to be judged in the current data set by using the abnormal score model, comparing the abnormal score of the data to be judged with an abnormal score threshold value, calibrating the data of which the abnormal score exceeds the abnormal score threshold value into abnormal data, and finishing the abnormal judgment of the data.
Preferably, the constructing an abnormal scoring model according to the data samples and performing iterative optimization on the abnormal scoring model to make the abnormal scoring model approach to the optimization goal includes:
training and constructing the characterization learner of the anomaly scoring model from the data samples using a neural network technique; mapping the data with the characterization learner to obtain the representation of the data in the characterization space, namely the data characterization;
training and constructing the anomaly scorer of the anomaly scoring model from the obtained data characterization using a neural network technique; scoring the data characterization with the anomaly scorer to obtain the anomaly score corresponding to each data sample;
setting prior distribution of data, generating data meeting the prior distribution through a prior generator, and calculating prior distribution observation parameters of the data meeting the prior distribution through a reference arithmetic unit;
calculating the deviation of the abnormal fraction and the prior distribution observation parameters by using a deviation arithmetic unit, and further calculating a loss function;
and updating the abnormal scoring model parameters through multiple iterations by using an optimization iterator, reducing a loss function, and finally obtaining an optimized abnormal scoring model.
Preferably, the method of scoring a data characterization using an anomaly scorer comprises:
the anomaly scorer is $score = \psi(\eta; \theta_s)$, which scores the data characterization to obtain the anomaly score $score$ of the data;
wherein the formula shown in Figure BDA0002712338700000021 is the characterization learner, which implements the mapping $x_i \to \eta_i$; $\eta_i$ is the data characterization corresponding to $x_i$; and $\theta_r$ and $\theta_s$ are the network structure parameters of the characterization learner and the anomaly scorer, respectively.
Preferably, the method for calculating the prior distribution observation parameters of the data satisfying the prior distribution by the reference operator comprises:
selecting data distribution meeting the current application requirement as prior distribution, generating a certain amount of data according to the prior distribution, and calculating the characteristic parameter observation result of the distribution through the data.
Preferably, the method of calculating a loss function comprises:
the loss function is expressed as $Loss = L(\psi(x; \theta), \theta_f)$ and guides the optimization of the anomaly scoring model;
wherein $\psi(x; \theta)$ is the anomaly score obtained by the anomaly scoring model, with $\theta = (\theta_r, \theta_s)$ as set forth in claim 3; $\theta_f$ denotes the prior distribution parameters as specified in claim 4, e.g. $\theta_f = (\mu, \sigma)$ for a Gaussian distribution; and the function $L$ measures the difference between the anomaly score and the prior distribution, and takes different forms depending on the prior distribution and the task type.
Preferably, the estimating extreme value distribution formula parameters according to extreme values in the anomaly score values of the acquired data samples, and calculating the anomaly score threshold value by using a threshold value calculation formula includes:
dividing the tail end part of the abnormal score distribution into extreme parts by using a parameter estimator according to the abnormal scores of the data samples, wherein the abnormal scores of the extreme parts are extreme values; estimating extreme value distribution formula parameters by using extreme values;
and substituting the extreme value distribution formula parameters into a threshold calculation formula of the threshold generator by using a threshold generator to calculate the abnormal score threshold suitable for the current data set.
Preferably, the method for estimating the extreme value distribution formula parameter by using the extreme value comprises the following steps:
the extreme values are expressed as $J = \{score_i \mid score_i > t,\ score_i \in S\}$, where $t$ is the tail division point and $S$ is the set of anomaly scores corresponding to all the data;
according to the formula shown in Figure BDA0002712338700000032, $\gamma$ and $\beta$ in the formula are estimated by maximum likelihood estimation, and the corresponding estimates (Figure BDA0002712338700000033) are the extreme value distribution formula parameters.
Preferably, the substituting the extreme value distribution formula parameter into the threshold calculation formula of the threshold generator includes:
in the threshold generator, the estimates (Figure BDA0002712338700000034) are substituted into the threshold calculation formula (Figure BDA0002712338700000035) to obtain the anomaly score threshold $th$ suitable for the current data set;
where $q$ is the desired probability and $|J|$ is the number of values in the extreme value set.
According to another aspect of the present invention, there is provided a data anomaly determination system based on deep learning and extremum theory, including:
the abnormal scoring module is used for training an abnormal scoring model and generating an abnormal score for each datum by using the trained abnormal scoring model;
the reference scoring module is used for providing a reference scoring index for the deviation optimization module according to the prior distribution;
the deviation optimization module measures the target deviation degree of the abnormal score by using the reference scoring index, and realizes the optimization and iteration of the abnormal scoring model;
the threshold value judging module is used for obtaining the abnormal score threshold value of the current data set by applying extreme value theoretical calculation;
and the abnormality calibration module is used for performing abnormality calibration on the data according to the abnormality score obtained by the abnormality scoring module and the abnormality score threshold obtained by the threshold judgment module.
Preferably, the reference scoring module, the deviation optimization module, the threshold determination module and the abnormality calibration module are sequentially connected, and the abnormality scoring module is respectively connected with the reference scoring module, the deviation optimization module, the threshold determination module and the abnormality calibration module.
Preferably, the system further comprises any one or more of:
-the anomaly scoring module comprising a characterization learner and an anomaly scorer, wherein the characterization learner is configured to learn a characterization of the data, enabling a mapping from a data space to a characterization space; the anomaly scoring device is used for scoring the characterization data to obtain an anomaly score of each data;
the reference scoring module comprises a priori generator and a reference operator, wherein the priori generator generates data meeting the requirement of the prior distribution according to the prior distribution, and the reference operator calculates an observation parameter of the prior distribution according to the input data meeting the requirement of the prior distribution, so as to serve as a reference scoring index of an optimization target;
the deviation optimization module comprises a deviation operator and an optimization iterator, wherein the deviation operator is used for calculating the deviation degree between the current abnormal score and the reference scoring index, and the optimization iterator guides the abnormal score model to perform optimization iteration according to the deviation degree calculated by the deviation operator to reduce the deviation degree of an optimization target;
the threshold decision module comprises a parameter estimator and a threshold generator, wherein the parameter estimator estimates extreme value distribution formula parameters according to extreme values of the abnormal scores obtained by the abnormal scoring module, and the threshold generator calculates an abnormal score threshold of the current data set by using a threshold calculation formula according to the extreme value distribution formula parameters;
the abnormity calibration module comprises an abnormity calibrator, wherein the abnormity calibrator obtains an abnormity score given by the abnormity scoring module and an abnormity score threshold given by the threshold determination module, calibrates data with an abnormity score exceeding the threshold as abnormal, and calibrates data with an abnormity score within the threshold as normal.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
the data anomaly judgment method and system based on deep learning and extreme value theory provided by the invention have strong feature processing capability, can effectively process the nonlinear feature relation, and are suitable for high-dimensional data. According to the method and the system, the abnormal score is obtained by using the end-to-end deep neural network model, the abnormal score is directly optimized, data and characterization learning capacity can be fully utilized, and the abnormal score with better abnormal depicting capacity is obtained. In addition, an extreme value theory is fully applied, so that the model can judge the abnormal score threshold value according to the actual data set, the complexity and subjectivity of manual judgment of the threshold value are effectively avoided, and the migration capability and the abnormal identification capability of the method are improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a flowchart of a data anomaly determination method based on deep learning and extremum theory in a preferred embodiment of the present invention.
Fig. 2 is a schematic block diagram of a data anomaly determination system based on deep learning and extremum theory according to a preferred embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are given. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
The embodiment of the invention provides a data anomaly judgment method based on deep learning and an extreme value theory, which is based on a deep neural network representation learning method, a probability method and an extreme value theory thereof, has strong feature processing capability, can effectively process a nonlinear feature relationship and is suitable for high-dimensional data; the end-to-end optimization of the abnormal score is beneficial to fully utilizing data and representing learning capacity, and the describing capacity of the abnormal score on the abnormity is improved; meanwhile, the abnormal score threshold can be judged according to the actual data set, the complexity and subjectivity of manually judging the threshold are effectively avoided, and the migration capability and the abnormal recognition capability of the method are improved.
The data anomaly determination method based on deep learning and extreme value theory provided by this embodiment, as shown in fig. 1, includes the following steps:
step S1, constructing an abnormal scoring model according to the data samples in the current data set, and performing iterative optimization on the abnormal scoring model to enable the abnormal scoring model to approach an optimization target;
step S2, obtaining an abnormal score value of the data sample through an abnormal score model;
step S3, estimating extreme value distribution formula parameters according to extreme values in the abnormal score values of the acquired data samples, and calculating an abnormal score threshold value by using a threshold value calculation formula;
step S4, obtaining the abnormal score of the data to be judged in the current data set by using the abnormal score model, comparing the abnormal score of the data to be judged with the abnormal score threshold value, calibrating the data of which the abnormal score exceeds the abnormal score threshold value into abnormal data, and finishing the abnormal judgment of the data.
As a preferred embodiment, step S1 includes:
step S101, according to data samples, training and constructing a characterization learning device of an abnormal scoring model by using a neural network technology; mapping the data to the representation by using a representation learner, so as to obtain the representation of the data in a representation space, namely the data representation;
step S102, according to data representation, training and constructing an abnormal scoring device of an abnormal scoring model by using a neural network technology; grading the data representation by using an anomaly grading device to obtain an anomaly score corresponding to the data sample;
step S103, setting prior distribution (such as Gaussian distribution) of data, generating data of a certain data quantity meeting the prior distribution through a prior generator, and calculating prior distribution observation parameters (such as mean value and standard deviation) of the data meeting the prior distribution through a reference arithmetic unit;
step S104, calculating the deviation between the abnormal score and the prior distribution observation parameter by using a deviation arithmetic unit, and further calculating a loss function, wherein the loss function is used for measuring the performance of the abnormal score model;
and S105, reducing a loss function by using an optimization iterator, specifically, updating the abnormal scoring model parameters through multiple iterations to make the loss function as small as possible, and obtaining an optimized abnormal scoring model when the iteration is carried out for a certain number of times.
And determining an abnormal score value of the data sample by using the optimized abnormal score model.
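Steps S101 to S105 can be illustrated with a deliberately small sketch. It assumes a linear scorer in place of the neural network, a standard Gaussian prior, the loss $|dev|$ with $dev = (score - \mu_r)/\sigma_r$, and plain (sub)gradient descent; all of these, along with the names `reference_stats`, `mean_abs_dev` and `train_scorer`, are illustrative assumptions rather than details given in the text.

```python
import random
import statistics


def reference_stats(rng, size=2000, mu=0.0, sigma=1.0):
    """Prior generator + reference operator: sample the prior, observe mean/std."""
    ref = [rng.gauss(mu, sigma) for _ in range(size)]
    return statistics.fmean(ref), statistics.pstdev(ref)


def mean_abs_dev(data, w, b, mu_r, sigma_r):
    """Average |dev| over the data for a linear scorer score = w.x + b."""
    total = 0.0
    for x in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += abs((score - mu_r) / sigma_r)
    return total / len(data)


def train_scorer(data, epochs=50, lr=0.1, seed=0):
    """Hypothetical optimization iterator; reference statistics are redrawn each epoch."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    b = 2.0  # deliberately poor starting offset (assumption for the demo)
    history = []
    n = len(data)
    for _ in range(epochs):
        mu_r, sigma_r = reference_stats(rng)
        history.append(mean_abs_dev(data, w, b, mu_r, sigma_r))
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            dev = (score - mu_r) / sigma_r
            sign = (dev > 0) - (dev < 0)  # subgradient of |dev|
            for j in range(dim):
                grad_w[j] += sign * x[j] / sigma_r
            grad_b += sign / sigma_r
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, history


rng = random.Random(1)
samples = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
w, b, history = train_scorer(samples)
```

`history` records the mean absolute deviation measured at the start of each epoch, so comparing its first and last entries shows whether the loss was reduced. The sketch uses only a deviation term; the loss forms described for step S104 add further terms on top of it.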
As a preferred embodiment, in step S101, the characterization learner is selected according to the type of the input data; for example, a convolutional neural network may be used to process image data and a recurrent neural network may be used to process sequence data.
As a preferred embodiment, in step S102, an anomaly scorer is used to score the data characterization, specifically:
the anomaly scorer is $score = \psi(\eta; \theta_s)$, which scores the data characterization to obtain the anomaly score $score$ of the data;
wherein the formula shown in Figure BDA0002712338700000061 is the characterization learner, which implements the mapping $x_i \to \eta_i$; $\eta_i$ is the data characterization corresponding to $x_i$; and $\theta_s$ is the network structure parameter of the anomaly scorer.
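As a rough illustration of how the two components compose, the sketch below chains a characterization learner and an anomaly scorer in plain Python. The learner's function name `phi`, its single ReLU layer, and all weight values are hypothetical; only the composition $x_i \to \eta_i \to score$ and the parameter names $\theta_r$ and $\theta_s$ come from the text.

```python
def phi(x, theta_r):
    """Hypothetical characterization learner: one dense layer with ReLU."""
    weights, biases = theta_r
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def psi(eta, theta_s):
    """Anomaly scorer: a single linear unit mapping the characterization to a scalar."""
    weights, bias = theta_s
    return sum(w * e for w, e in zip(weights, eta)) + bias


# Illustrative parameters: 3 input features -> 2-dimensional characterization.
theta_r = ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0])
theta_s = ([2.0, -1.0], 0.1)

x_i = [3.0, 1.0, 1.0]
eta_i = phi(x_i, theta_r)      # data characterization of x_i
score_i = psi(eta_i, theta_s)  # anomaly score of x_i
```

In a real model both parameter sets would be learned jointly, which is what makes the scoring end to end.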
As a preferred embodiment, in step S103, the reference calculator calculates a priori distribution observation parameter of the data satisfying the priori distribution, specifically:
selecting data distribution meeting the current application requirement as prior distribution, generating a certain amount of data according to the prior distribution, and calculating the characteristic parameter observation result of the distribution through the data.
In this step, observed values of the characteristic parameters are used instead of preset values, which introduces randomness and improves the robustness of the model. Taking a Gaussian prior as an example, the distribution is written $N(\mu, \sigma^2)$. Let the amount of reference data generated by the prior generator be $l$, i.e. the reference data are $r_1, r_2, \ldots, r_l \sim N(\mu, \sigma^2)$. The characteristic parameters of the Gaussian distribution are the mean $\mu$ and the standard deviation $\sigma$, so the observed parameters of the distribution can be calculated from the observed data; for example, the mean can be calculated as $\mu_r = \frac{1}{l}\sum_{i=1}^{l} r_i$, and the standard deviation $\sigma_r$ is likewise calculated from its definition.
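The prior generator and reference operator for a Gaussian prior might look like the following minimal sketch. The function name `reference_parameters`, the default $\mu = 0$, $\sigma = 1$, and the use of the population form of the standard deviation are assumptions; the text only says that $\sigma_r$ is computed "by definition".

```python
import random
import statistics


def reference_parameters(l, mu=0.0, sigma=1.0, seed=None):
    """Draw l reference values r_1..r_l ~ N(mu, sigma^2) and return (mu_r, sigma_r)."""
    rng = random.Random(seed)
    r = [rng.gauss(mu, sigma) for _ in range(l)]
    mu_r = sum(r) / l                     # observed mean
    sigma_r = statistics.pstdev(r, mu_r)  # observed (population) standard deviation
    return mu_r, sigma_r


mu_r, sigma_r = reference_parameters(l=5000, seed=42)
```

Because the values are observed rather than preset, `mu_r` and `sigma_r` differ slightly from call to call when no seed is fixed, which is the source of the randomness mentioned above.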
As a preferred embodiment, in step S104, the method for calculating the loss function specifically includes:
the loss function is expressed as $Loss = L(\psi(x; \theta), \theta_f)$ and guides the optimization of the anomaly scoring model;
where $\psi(x; \theta)$ is the anomaly score obtained by the anomaly scoring model, with $\theta = (\theta_r, \theta_s)$ as described in step S102; $\theta_f$ denotes the prior distribution parameters described in step S103, e.g. $\theta_f = (\mu, \sigma)$ for a Gaussian distribution; and the function $L$ measures the difference between the anomaly score and the prior distribution, and takes different forms depending on the prior distribution and the task type. In particular, assuming that a Gaussian distribution is used as the prior: in an unsupervised modeling environment, the loss function may be defined as $L = L_{rec} + \lambda L_{dev}$, where $L_{rec}$ is the reconstruction error, which can be expressed as $L_{rec} = \|x - \psi\|^2$, and $L_{dev}$ is the deviation error, which can be expressed as $L_{dev} = f(|dev|)$, with $dev$ given by the formula shown in Figure BDA0002712338700000072; $f$ may be a linear function or another function, chosen according to the usage scenario. In a supervised or weakly supervised modeling environment, the loss function may be expressed as $L = y\,|dev| + (1 - y)\max(0, a - dev)$, where $y$ is the label ($y = 1$ when the training sample is normal data and $y = 0$ when it is abnormal data) and $a$ is a confidence parameter selected according to the values of $dev$.
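The two loss forms can be written out as a short sketch. The definition of $dev$ is not reproduced in the text, so the z-score form $dev = (score - \mu_r)/\sigma_r$ used here is an assumption, as are the function names, the default $\lambda = 1$, the identity choice for $f$, and the default confidence parameter $a = 5$.

```python
def deviation(score, mu_r, sigma_r):
    """Assumed form of dev: z-score of the anomaly score against the reference statistics."""
    return (score - mu_r) / sigma_r


def unsupervised_loss(rec_error, dev, lam=1.0, f=lambda v: v):
    """L = L_rec + lambda * L_dev, with L_dev = f(|dev|); f defaults to the identity."""
    return rec_error + lam * f(abs(dev))


def supervised_loss(dev, y, a=5.0):
    """L = y*|dev| + (1 - y)*max(0, a - dev); y = 1 for normal data, y = 0 for abnormal data."""
    return y * abs(dev) + (1 - y) * max(0.0, a - dev)


dev_normal = deviation(score=0.5, mu_r=0.0, sigma_r=1.0)
dev_abnormal = deviation(score=6.0, mu_r=0.0, sigma_r=1.0)
loss_normal = supervised_loss(dev_normal, y=1)      # penalizes any deviation from the prior
loss_abnormal = supervised_loss(dev_abnormal, y=0)  # zero once dev is at least a
```

Under these assumptions, normal samples are pulled toward the reference mean while abnormal samples are pushed at least $a$ standard deviations above it.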
As a preferred embodiment, step S2 includes:
step S201, dividing the tail end part of abnormal score distribution into extreme parts by using a parameter estimator according to the abnormal scores of the data samples, wherein the abnormal scores of the extreme parts are extreme values; estimating parameters of an extreme value distribution formula by using extreme values;
and S202, substituting the extreme value distribution formula parameters into a threshold calculation formula by using a threshold generator to calculate the abnormal score threshold suitable for the current data set.
As a preferred embodiment, in step S201, the method for estimating the extremum distribution formula parameter using the extreme values includes:
the extreme values are expressed as $J = \{score_i \mid score_i > t,\ score_i \in S\}$, where $t$ is the tail division point and $S$ is the set of anomaly scores corresponding to all the data;
according to the formula shown in Figure BDA0002712338700000073, $\gamma$ and $\beta$ in the formula are estimated by maximum likelihood estimation, and the corresponding estimates (Figure BDA0002712338700000074) are the extreme value distribution formula parameters.
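The parameter estimator can be sketched as follows. The extreme value distribution formula itself is not reproduced in the text; this sketch assumes the generalized Pareto distribution commonly used for excesses over a threshold, with shape $\gamma$ and scale $\beta$, and replaces a proper numerical optimizer with a coarse grid search over the log-likelihood. The names `split_tail`, `gpd_log_likelihood` and `fit_gpd`, the tail fraction, and the grid ranges are all hypothetical.

```python
import math
import random


def split_tail(scores, tail_fraction=0.05):
    """Choose the tail division point t and return (t, excesses of the set J over t)."""
    ordered = sorted(scores)
    k = min(int(len(ordered) * (1.0 - tail_fraction)), len(ordered) - 1)
    t = ordered[k]
    return t, [s - t for s in scores if s > t]


def gpd_log_likelihood(excesses, gamma, beta):
    """Log-likelihood of generalized Pareto excesses (exponential limit at gamma = 0)."""
    n = len(excesses)
    if abs(gamma) < 1e-9:
        return -n * math.log(beta) - sum(excesses) / beta
    total = 0.0
    for y in excesses:
        z = 1.0 + gamma * y / beta
        if z <= 0.0:
            return -math.inf  # parameters outside the support
        total += math.log(z)
    return -n * math.log(beta) - (1.0 + 1.0 / gamma) * total


def fit_gpd(excesses, beta_steps=60):
    """Approximate maximum likelihood estimates (gamma_hat, beta_hat) by grid search."""
    mean_excess = sum(excesses) / len(excesses)
    gamma_grid = [g / 20.0 for g in range(-10, 21)]  # -0.5 ... 1.0
    beta_grid = [mean_excess * (0.1 + 0.05 * k) for k in range(beta_steps)]
    best = (-math.inf, 0.0, mean_excess)
    for gamma in gamma_grid:
        for beta in beta_grid:
            ll = gpd_log_likelihood(excesses, gamma, beta)
            if ll > best[0]:
                best = (ll, gamma, beta)
    return best[1], best[2]


rng = random.Random(0)
scores = [rng.expovariate(0.5) for _ in range(20000)]  # synthetic scores, mean 2
t, excesses = split_tail(scores)
gamma_hat, beta_hat = fit_gpd(excesses)
```

For the synthetic exponential scores used here, the fitted shape should come out close to 0 and the scale close to 2. A production implementation would more likely use a dedicated optimizer than a grid.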
As a preferred embodiment, substituting the extreme value distribution formula parameter into the threshold value calculation formula of the threshold value generator in step S202 includes:
in the threshold generator, the estimates (Figure BDA0002712338700000081) are substituted into the threshold calculation formula (Figure BDA0002712338700000082) to obtain the anomaly score threshold $th$ suitable for the current data set;
where $q$ is the desired probability and $|J|$ is the number of values in the extreme value set.
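Since the threshold calculation formula is not reproduced in the text, the sketch below substitutes the standard peaks-over-threshold quantile expression from extreme value theory, $th = t + \frac{\hat\beta}{\hat\gamma}\left(\left(\frac{q\,n}{|J|}\right)^{-\hat\gamma} - 1\right)$, where $n$ is the total number of scores. This choice, the function name `anomaly_threshold`, and the example values are assumptions and may differ from the formula in the figure.

```python
import math


def anomaly_threshold(t, gamma_hat, beta_hat, q, n_total, n_extreme):
    """Assumed POT quantile: score level exceeded with probability about q."""
    ratio = q * n_total / n_extreme
    if abs(gamma_hat) < 1e-9:
        # Limit of the expression as gamma_hat -> 0 (exponential tail).
        return t - beta_hat * math.log(ratio)
    return t + (beta_hat / gamma_hat) * (ratio ** (-gamma_hat) - 1.0)


# Illustrative values only: 10000 scores, 200 of them above the tail point t = 3.0.
th = anomaly_threshold(t=3.0, gamma_hat=0.1, beta_hat=0.5,
                       q=1e-3, n_total=10000, n_extreme=200)
```

With this form, choosing $q = |J|/n$ returns $t$ itself, and smaller values of $q$ push the threshold further into the tail.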
As a preferred embodiment, in step S3, the part of the data with the abnormal score exceeding the threshold is calibrated as abnormal data, and the part of the data with the abnormal score within the threshold is calibrated as normal data.
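The calibration rule reduces to a comparison per score, sketched below. The function name `calibrate` and the string labels are illustrative; a score exactly equal to the threshold is treated as within it, and therefore normal.

```python
def calibrate(scores, threshold):
    """Label each score: above the threshold is abnormal, otherwise normal."""
    return ["abnormal" if s > threshold else "normal" for s in scores]


labels = calibrate([0.2, 1.7, 4.9, 0.8, 4.0], threshold=4.0)
```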
Another embodiment of the present invention provides a data anomaly determination system based on deep learning and extremum theory which, as shown in fig. 2, comprises an anomaly scoring module, a reference scoring module, a deviation optimization module, a threshold determination module and an anomaly calibration module; wherein:
the abnormal scoring module is used for training an abnormal scoring model and generating an abnormal score for each datum by using the trained abnormal scoring model;
the reference scoring module is used for providing a reference scoring index for the deviation optimization module according to the prior distribution;
the deviation optimization module measures the target deviation degree of the abnormal score by using the reference scoring index, and realizes the optimization and iteration of the abnormal scoring model;
the threshold value judging module is used for obtaining the abnormal score threshold value of the current data set by applying extreme value theoretical calculation;
and the abnormality calibration module is used for performing abnormality calibration on the data according to the abnormality score obtained by the abnormality scoring module and the abnormality score threshold obtained by the threshold judgment module.
The anomaly scoring module comprises a characterization learner and an anomaly scorer, wherein the characterization learner learns a characterization of the data and realizes the mapping from the data space to the characterization space, and the anomaly scorer scores the characterized data to obtain an anomaly score for each piece of data.
As a preferred embodiment, the reference scoring module comprises a prior generator and a reference operator, wherein the prior generator generates data satisfying the prior distribution, and the reference operator calculates the observed parameters of the prior distribution from that data to serve as the reference scoring index of the optimization target.
As a preferred embodiment, the deviation optimization module comprises a deviation operator and an optimization iterator, wherein the deviation operator calculates the degree of deviation between the current anomaly score and the reference scoring index, and the optimization iterator guides the anomaly scoring model through optimization iterations according to that deviation, so as to reduce the deviation from the optimization target.
As a preferred embodiment, the threshold determination module comprises a parameter estimator and a threshold generator, wherein the parameter estimator estimates the parameters of the extreme value distribution formula from the extreme values of the anomaly scores obtained by the anomaly scoring module, and the threshold generator calculates the anomaly score threshold of the current data set from those parameters using the threshold calculation formula.
As a preferred embodiment, the anomaly calibration module comprises an anomaly calibrator, wherein the anomaly calibrator takes the anomaly score given by the anomaly scoring module and the anomaly score threshold given by the threshold determination module, calibrates data whose anomaly score exceeds the threshold as abnormal, and calibrates data whose anomaly score is within the threshold as normal.
As a preferred embodiment, among the five modules, the reference scoring module, the deviation optimization module, the threshold determination module, and the anomaly calibration module are connected in sequence, and the anomaly scoring module is connected to each of these four modules.
The technical solutions provided by the above embodiments of the present invention are further detailed below with reference to a specific application example.
Credit card transaction fraud detection is taken as the specific application scenario. Each data item represents one transaction in vector form; assuming the vector is N-dimensional, it describes N features of the transaction. Expressed mathematically, assume the overall data set has size m, i.e. $D = \{x_1, x_2, \dots, x_m\}$, where each piece of data is represented by an N-dimensional vector $x_i \in \mathbb{R}^N$.
Corresponding to steps S101 and S102, in the anomaly scoring module an anomaly scoring model is constructed with a neural network: an N-dimensional feature vector is input to the model and a one-dimensional score is output. For the characterization learner, a multi-layer perceptron structure can be used, chosen according to the characteristics of the current data; for the anomaly scorer, a single-layer feed-forward neural network can be used. After each piece of data passes through the anomaly scoring model, the corresponding anomaly score is obtained. Expressed mathematically, the characterization learner
Figure BDA0002712338700000092
implements the mapping $x_i \to \eta_i$, where $\eta_i$ is the data characterization corresponding to $x_i$ and $\theta_r$ denotes the network structure parameters of the characterization learner. The anomaly scorer, $\mathrm{score} = \psi(\eta; \theta_s)$, scores the data characterization to obtain the anomaly score of the data, where $\theta_s$ denotes the network structure parameters of the anomaly scorer.
In practice, the choice of characterization learner can vary with the type of input data: for example, image data can be processed with a convolutional neural network and sequence data with a recurrent neural network. Different implementations are selected according to the practical application.
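The two-part scoring model described above can be illustrated with a minimal sketch. It assumes a multi-layer perceptron with ReLU activations as the characterization learner and a single linear layer as the anomaly scorer; the layer sizes, the initialisation, and all function names are hypothetical choices for illustration and are not taken from the patent.

```python
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(x, weights, bias):
    """Affine map; weights holds one row of coefficients per output unit."""
    return [sum(w * a for w, a in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def characterize(x, theta_r):
    """Characterization learner: maps x_i to eta_i through stacked ReLU layers."""
    eta = x
    for weights, bias in theta_r:
        eta = relu(linear(eta, weights, bias))
    return eta

def anomaly_score(eta, theta_s):
    """Anomaly scorer psi(eta; theta_s): one linear layer with scalar output."""
    weights, bias = theta_s
    return linear(eta, weights, bias)[0]

rng = random.Random(0)
N = 8                                         # assumed feature dimension
theta_r = [init_layer(N, 16, rng), init_layer(16, 4, rng)]
theta_s = init_layer(4, 1, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(N)]   # one synthetic transaction vector
eta = characterize(x, theta_r)                # data characterization
score = anomaly_score(eta, theta_s)           # one-dimensional anomaly score
```

For image or sequence inputs, only `characterize` would need to change; the scorer keeps the same scalar-output interface.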
Corresponding to step S103, in the reference scoring module the prior distribution of the data is assumed to be Gaussian, i.e. $N(\mu, \sigma^2)$, and the prior generator is assumed to generate $l$ reference data points, i.e. $r_1, r_2, \dots, r_l \sim N(\mu, \sigma^2)$. The reference operator may select the observed mean and standard deviation as the reference index; for example, the mean can be calculated by the formula
$$\mu_r = \frac{1}{l} \sum_{i=1}^{l} r_i$$
and the standard deviation $\sigma_r$ is likewise calculated from the reference data.
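As a rough illustration of the prior generator and reference operator, the sketch below draws reference data from a Gaussian prior and computes the observed mean and standard deviation. The function name `reference_index`, the seed, and the sample size are assumptions made for the example.

```python
import random
import statistics

def reference_index(mu, sigma, l, seed=None):
    """Draw l reference points from N(mu, sigma^2) and return (mu_r, sigma_r)."""
    rng = random.Random(seed)
    reference = [rng.gauss(mu, sigma) for _ in range(l)]   # prior generator
    mu_r = statistics.fmean(reference)                     # observed mean
    sigma_r = statistics.stdev(reference)                  # observed (sample) std
    return mu_r, sigma_r

mu_r, sigma_r = reference_index(mu=0.0, sigma=1.0, l=5000, seed=42)
```

Because the reference index is computed from samples rather than taken directly from $(\mu, \sigma)$, it fluctuates slightly from draw to draw; a larger $l$ reduces that fluctuation.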
Corresponding to steps S104 and S105, in the deviation optimization module the deviation operator calculates the deviation between the anomaly score and the prior distribution; the formula
Figure BDA0002712338700000094
can be used to express the deviation. The optimization iterator then calculates the loss function from this deviation expression. With the goal of reducing the loss function, the anomaly scoring model parameters $\theta_r$ and $\theta_s$ are updated iteratively until the loss function converges or the iteration requirement is met, which completes the iteration of the anomaly scoring model.
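The deviation formula itself is given only as an image in the source, so the following is a sketch under stated assumptions rather than the patent's formula. It assumes the deviation is the standardized distance of a score from the reference mean, and it borrows a contrastive deviation loss of the kind used in deviation-network style models, which presumes that some labels (0 for normal, 1 for known anomaly) are available. The `margin` value and both function names are hypothetical.

```python
def deviation(score, mu_r, sigma_r):
    """Assumed form: standardized deviation of a score from the reference index."""
    return (score - mu_r) / sigma_r

def deviation_loss(scores, labels, mu_r, sigma_r, margin=5.0):
    """Assumed contrastive loss: pull normal scores toward the distribution
    center and push labeled anomalies at least `margin` deviations above it."""
    total = 0.0
    for s, y in zip(scores, labels):
        dev = deviation(s, mu_r, sigma_r)
        if y == 1:
            total += max(0.0, margin - dev)   # anomaly: penalize small deviation
        else:
            total += abs(dev)                 # normal: penalize any deviation
    return total / len(scores)

scores = [0.1, -0.2, 6.0, 1.5]
labels = [0, 0, 1, 1]
loss = deviation_loss(scores, labels, mu_r=0.0, sigma_r=1.0)
```

In a full implementation the optimization iterator would differentiate this loss with respect to $\theta_r$ and $\theta_s$ and apply gradient updates until convergence; that part is omitted here.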
Corresponding to step S106, after the iteration of the abnormal score model is completed, the abnormal score corresponding to the data can be obtained by using the model.
Step S3 can be implemented as in the following example.
In the threshold determination module, the parameter estimator divides off the tail of the anomaly score distribution as the extreme part according to the anomaly scores of the data; the anomaly scores in the extreme part are the extreme values, and they are used to estimate the parameters of the extreme value distribution formula. Described mathematically, the set of extreme values can be expressed as $J = \{\mathrm{score}_i \mid \mathrm{score}_i > t,\ \mathrm{score}_i \in S\}$, where $t$ is the tail division point and $S$ is the set of anomaly scores corresponding to all the data. The extreme value distribution formula may be the generalized Pareto distribution formula of extreme value theory, i.e.
Figure BDA0002712338700000101
Maximum likelihood estimation can be used to estimate $\gamma$ and $\beta$ in the formula, giving the estimates $\hat{\gamma}$ and $\hat{\beta}$.
In the threshold generator, the estimates are substituted into the threshold calculation formula, i.e.
Figure BDA0002712338700000103
from which the anomaly distribution threshold $th$ applicable to the current data set can be calculated. In the above formula, $q$ is the desired probability and $|J|$ is the size of the extreme value set.
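The threshold step can be sketched as follows, with several assumptions made explicit. The patent's distribution and threshold formulas are images that are not reproduced in the text, so this sketch uses the standard peaks-over-threshold quantile formula $th = t + \frac{\hat{\beta}}{\hat{\gamma}}\left[\left(\frac{q\,m}{|J|}\right)^{-\hat{\gamma}} - 1\right]$, where $m$ is the total number of scores. To stay within the standard library it also swaps maximum likelihood estimation for a method-of-moments fit. The choice of $t$ as an empirical quantile, the default values, and the function names are hypothetical.

```python
import math
import random
import statistics

def fit_gpd_moments(excesses):
    """Method-of-moments estimates (gamma, beta) of a generalized Pareto fit.
    Note: the source describes maximum likelihood; this is a simpler stand-in."""
    m = statistics.fmean(excesses)
    v = statistics.variance(excesses)
    ratio = m * m / v
    gamma = 0.5 * (1.0 - ratio)        # shape estimate
    beta = 0.5 * m * (1.0 + ratio)     # scale estimate
    return gamma, beta

def evt_threshold(scores, tail_quantile=0.98, q=1e-3):
    """Anomaly score threshold from the tail of the score distribution."""
    ordered = sorted(scores)
    t = ordered[int(tail_quantile * len(ordered))]   # tail division point t
    J = [s for s in scores if s > t]                 # extreme value set
    gamma, beta = fit_gpd_moments([s - t for s in J])
    r = q * len(scores) / len(J)
    if abs(gamma) < 1e-8:                            # exponential-tail limit
        return t - beta * math.log(r)
    return t + (beta / gamma) * (r ** (-gamma) - 1.0)

rng = random.Random(1)
scores = [rng.expovariate(1.0) for _ in range(20000)]  # synthetic anomaly scores
th = evt_threshold(scores)
```

Here `q` plays the role of the desired probability that a score exceeds the threshold; a smaller `q` pushes `th` further into the tail.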
For step S4, data whose anomaly score exceeds the threshold are calibrated as abnormal data, and data whose anomaly score is within the threshold are calibrated as normal data. Mathematically, the abnormal data set and the normal data set can be expressed as:
Figure BDA0002712338700000104
and
Figure BDA0002712338700000105
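The calibration rule of step S4 amounts to a split of the data by the threshold, as in the short sketch below. The function name `calibrate` and the sample transaction identifiers are illustrative only.

```python
def calibrate(data, scores, th):
    """Split data into (abnormal, normal) by comparing each score with th."""
    abnormal = [x for x, s in zip(data, scores) if s > th]    # exceeds threshold
    normal = [x for x, s in zip(data, scores) if s <= th]     # within threshold
    return abnormal, normal

data = ["tx-1", "tx-2", "tx-3", "tx-4"]
scores = [0.3, 7.2, 1.1, 9.5]
abnormal, normal = calibrate(data, scores, th=6.9)
```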
In the data anomaly determination method and system based on deep learning and extreme value theory provided by the above embodiments of the present invention, an end-to-end deep neural network is used to train the anomaly scoring model so that the distribution of anomaly scores produced by the model tends toward the selected prior distribution; specifically, the anomaly scores of normal data approach the center of the distribution while the anomaly scores of abnormal data lie far from it, which separates normal data from abnormal data. Then, based on the extreme value distribution formula of extreme value theory, parameters are estimated from the extreme values among the anomaly scores, an anomaly score threshold matching the current data is determined, and data anomalies are calibrated according to the anomaly scores and the anomaly score threshold. The method and system provided by the embodiments of the present invention have strong feature processing capability, can effectively handle nonlinear feature relationships, and are suitable for high-dimensional data. End-to-end optimization of the anomaly score makes full use of the data and of the representation learning capability, and improves how well the anomaly score describes anomalies. Meanwhile, the anomaly score threshold is determined from the actual data set, which avoids the complexity and subjectivity of setting the threshold manually and improves the transferability and anomaly recognition capability of the method.
It should be noted that the steps of the method provided by the present invention can be implemented with the corresponding modules, devices, and units of the system, and those skilled in the art can implement the step flow of the method by referring to the technical scheme of the system; that is, the embodiments of the system can be understood as preferred examples of implementing the method, and details are not repeated here.
The above specific embodiments are only used to illustrate the technical solutions of the embodiments of the present invention and are not intended to limit them; those skilled in the art may modify the foregoing detailed description, or substitute equivalents for some or all of its features, and such modifications or substitutions do not depart from the spirit and scope of the present invention.

Claims (10)

1. A data abnormity judgment method based on deep learning and extreme value theory is characterized by comprising the following steps:
constructing an abnormal scoring model according to data samples in the current data set, and performing iterative optimization on the abnormal scoring model to enable the abnormal scoring model to approach an optimization target;
obtaining an abnormal score value of the data sample through an abnormal scoring model;
estimating parameters of an extreme value distribution formula according to extreme values in the abnormal score values of the acquired data samples, and calculating an abnormal score threshold by using a threshold calculation formula;
and obtaining the abnormal score of the data to be judged in the current data set by using the abnormal score model, comparing the abnormal score of the data to be judged with an abnormal score threshold value, calibrating the data of which the abnormal score exceeds the abnormal score threshold value into abnormal data, and finishing the abnormal judgment of the data.
2. The method for judging the data abnormality based on the deep learning and extremum theory according to claim 1, wherein the steps of constructing an abnormality scoring model according to the data samples and performing iterative optimization on the abnormality scoring model to enable the abnormality scoring model to approach an optimization goal comprise:
training and constructing the characterization learner of the anomaly scoring model by using neural network technology according to the data samples; mapping the data to the characterization space with the characterization learner, thereby obtaining the representation of the data in the characterization space, namely the data characterization;
training and constructing the anomaly scorer of the anomaly scoring model by using neural network technology according to the obtained data characterization; scoring the data characterization with the anomaly scorer to obtain the anomaly score corresponding to each data sample;
setting the prior distribution of the data, generating data satisfying the prior distribution through a prior generator, and calculating the observed parameters of the prior distribution from that data through a reference operator;
calculating the deviation between the anomaly score and the observed parameters of the prior distribution by using a deviation operator, and further calculating a loss function;
and updating the parameters of the anomaly scoring model over multiple iterations by using an optimization iterator so as to reduce the loss function, finally obtaining the optimized anomaly scoring model.
3. The method for judging data abnormality based on deep learning and extreme value theory according to claim 2, wherein the method for scoring the data characterization by using the abnormality scorer comprises the following steps:
the anomaly scorer is expressed as $\mathrm{score} = \psi(\eta; \theta_s)$ and scores the data characterization to obtain the anomaly score of the data;
wherein
Figure FDA0002712338690000011
is the characterization learner, which implements $x_i \to \eta_i$; $\eta_i$ is the data characterization corresponding to $x_i$; and $\theta_r$ and $\theta_s$ are the network structure parameters of the characterization learner and the anomaly scorer, respectively.
4. The method for judging data abnormality based on deep learning and extreme value theory according to claim 2, wherein the method for calculating the prior distribution observation parameters of the data satisfying the prior distribution by the reference operator comprises:
selecting data distribution meeting the current application requirement as prior distribution, generating a certain amount of data according to the prior distribution, and calculating the characteristic parameter observation result of the distribution through the data.
5. The method for judging data abnormality based on deep learning and extreme value theory according to claim 2, wherein the method for calculating the loss function comprises:
the loss function is expressed as $\mathrm{Loss} = L(\psi(x; \theta), \theta_f)$ and guides the optimization of the anomaly scoring model;
where $\psi(x; \theta)$ is the anomaly score obtained by the anomaly scoring model as set forth in claim 3, with $\theta = (\theta_r, \theta_s)$; $\theta_f$ denotes the parameters of the prior distribution specified in claim 4, e.g. $\theta_f = (\mu, \sigma)$ for a Gaussian distribution; and the function $L$ measures the difference between the anomaly score and the prior distribution, taking different forms for different prior distributions and task types.
6. The method for judging data abnormality based on deep learning and extreme value theory according to claim 1, wherein the estimating extreme value distribution formula parameters according to extreme values in the abnormality score values of the acquired data samples, and calculating the abnormality score threshold value by using a threshold calculation formula comprises:
dividing the tail end part of the abnormal score distribution into extreme parts by using a parameter estimator according to the abnormal scores of the data samples, wherein the abnormal scores of the extreme parts are extreme values; estimating extreme value distribution formula parameters by using extreme values;
and substituting the extreme value distribution formula parameters into a threshold calculation formula of the threshold generator by using a threshold generator to calculate the abnormal score threshold suitable for the current data set.
7. The method for determining data abnormality based on deep learning and extreme value theory according to claim 6, wherein the method for estimating extreme value distribution formula parameters by using extreme values comprises:
the extreme values are expressed as $J = \{\mathrm{score}_i \mid \mathrm{score}_i > t,\ \mathrm{score}_i \in S\}$, where $t$ is the tail division point and $S$ is the set of anomaly scores corresponding to all the data;
according to the formula
Figure FDA0002712338690000021
$\gamma$ and $\beta$ in the formula are estimated by maximum likelihood estimation, and the corresponding estimates $\hat{\gamma}$ and $\hat{\beta}$ are the extreme value distribution formula parameters;
the substituting the extreme value distribution formula parameters into the threshold calculation formula of the threshold generator comprises:
in the threshold generator, the estimates $\hat{\gamma}$ and $\hat{\beta}$ are substituted into the threshold calculation formula, i.e.
Figure FDA0002712338690000024
to calculate the anomaly distribution threshold $th$ applicable to the current data set;
where $q$ is the desired probability and $|J|$ is the size of the extreme value set.
8. A data anomaly determination system based on deep learning and extreme value theory is characterized by comprising:
the abnormal scoring module is used for training an abnormal scoring model and generating an abnormal score for each datum by using the trained abnormal scoring model;
the reference scoring module is used for providing a reference scoring index for the deviation optimization module according to the prior distribution;
the deviation optimization module measures the target deviation degree of the abnormal score by using the reference scoring index, and realizes the optimization and iteration of the abnormal scoring model;
the threshold value judging module is used for obtaining the abnormal score threshold value of the current data set by applying extreme value theoretical calculation;
and the abnormality calibration module is used for performing abnormality calibration on the data according to the abnormality score obtained by the abnormality scoring module and the abnormality score threshold obtained by the threshold judgment module.
9. The deep learning and extremum theory-based data anomaly determination system according to claim 8, wherein the reference scoring module, the deviation optimization module, the threshold determination module and the anomaly calibration module are sequentially connected, and the anomaly scoring module is respectively connected with the reference scoring module, the deviation optimization module, the threshold determination module and the anomaly calibration module.
10. The deep learning and extreme value theory based data anomaly determination system according to claim 8 or 9, characterized in that the system further comprises any one or more of the following items:
- the anomaly scoring module comprises a characterization learner and an anomaly scorer, wherein the characterization learner is configured to learn a characterization of the data, enabling a mapping from the data space to the characterization space, and the anomaly scorer is used for scoring the characterized data to obtain an anomaly score for each piece of data;
- the reference scoring module comprises a prior generator and a reference operator, wherein the prior generator generates data satisfying the prior distribution, and the reference operator calculates the observed parameters of the prior distribution from that data to serve as the reference scoring index of the optimization target;
- the deviation optimization module comprises a deviation operator and an optimization iterator, wherein the deviation operator is used for calculating the degree of deviation between the current anomaly score and the reference scoring index, and the optimization iterator guides the anomaly scoring model through optimization iterations according to the deviation calculated by the deviation operator, so as to reduce the deviation from the optimization target;
- the threshold determination module comprises a parameter estimator and a threshold generator, wherein the parameter estimator estimates the extreme value distribution formula parameters from the extreme values of the anomaly scores obtained by the anomaly scoring module, and the threshold generator calculates the anomaly score threshold of the current data set from those parameters using the threshold calculation formula;
- the anomaly calibration module comprises an anomaly calibrator, wherein the anomaly calibrator takes the anomaly score given by the anomaly scoring module and the anomaly score threshold given by the threshold determination module, calibrates data whose anomaly score exceeds the threshold as abnormal, and calibrates data whose anomaly score is within the threshold as normal.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060903.0A CN112163624A (en) 2020-09-30 2020-09-30 Data abnormity judgment method and system based on deep learning and extreme value theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060903.0A CN112163624A (en) 2020-09-30 2020-09-30 Data abnormity judgment method and system based on deep learning and extreme value theory

Publications (1)

Publication Number Publication Date
CN112163624A true CN112163624A (en) 2021-01-01

Family

ID=73860846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060903.0A Pending CN112163624A (en) 2020-09-30 2020-09-30 Data abnormity judgment method and system based on deep learning and extreme value theory

Country Status (1)

Country Link
CN (1) CN112163624A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112882954A (en) * 2021-03-25 2021-06-01 浪潮云信息技术股份公司 Distributed database operation and maintenance dynamic threshold value warning method and device
CN113204590A (en) * 2021-05-31 2021-08-03 中国人民解放军国防科技大学 Unsupervised KPI (Key performance indicator) anomaly detection method based on serialization self-encoder
CN113554128A (en) * 2021-09-22 2021-10-26 中国光大银行股份有限公司 Unconventional anomaly detection method and system and storage medium
CN113780138A (en) * 2021-08-31 2021-12-10 中国科学技术大学先进技术研究院 Self-adaptive robustness VOCs gas leakage detection method, system and storage medium
CN115001997A (en) * 2022-04-11 2022-09-02 北京邮电大学 Extreme value theory-based smart city network equipment performance abnormity threshold evaluation method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112882954A (en) * 2021-03-25 2021-06-01 浪潮云信息技术股份公司 Distributed database operation and maintenance dynamic threshold value warning method and device
CN112882954B (en) * 2021-03-25 2024-01-30 浪潮云信息技术股份公司 Distributed database operation and maintenance dynamic threshold alarming method and device
CN113204590A (en) * 2021-05-31 2021-08-03 中国人民解放军国防科技大学 Unsupervised KPI (Key performance indicator) anomaly detection method based on serialization self-encoder
CN113204590B (en) * 2021-05-31 2021-11-23 中国人民解放军国防科技大学 Unsupervised KPI (Key performance indicator) anomaly detection method based on serialization self-encoder
CN113780138A (en) * 2021-08-31 2021-12-10 中国科学技术大学先进技术研究院 Self-adaptive robustness VOCs gas leakage detection method, system and storage medium
CN113554128A (en) * 2021-09-22 2021-10-26 中国光大银行股份有限公司 Unconventional anomaly detection method and system and storage medium
CN115001997A (en) * 2022-04-11 2022-09-02 北京邮电大学 Extreme value theory-based smart city network equipment performance abnormity threshold evaluation method
CN115001997B (en) * 2022-04-11 2024-02-09 北京邮电大学 Extreme value theory-based smart city network equipment performance abnormal threshold evaluation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210101