CN113723555A - Abnormal data detection method and device, storage medium and terminal - Google Patents

Abnormal data detection method and device, storage medium and terminal

Info

Publication number
CN113723555A
Authority
CN
China
Prior art keywords
training sample
sample data
label
data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111047033.8A
Other languages
Chinese (zh)
Inventor
刘胜
魏国富
夏玉明
周晓勇
马影
殷钱安
梁淑云
余贤喆
陶景龙
王启凡
徐�明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN202111047033.8A priority Critical patent/CN113723555A/en
Publication of CN113723555A publication Critical patent/CN113723555A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for detecting abnormal data, a storage medium and a terminal, relates to the technical field of data processing, and mainly aims to solve the problem that existing approaches detect abnormal data with low accuracy. The method comprises the following steps: acquiring at least one group of training sample data in a training sample data set to be subjected to model training; respectively screening the training sample data based on the feature classification, the label attribute and the time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data; and if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determining the training sample data as abnormal data. The method is mainly used for detecting abnormal data.

Description

Abnormal data detection method and device, storage medium and terminal
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for detecting abnormal data, a storage medium, and a terminal.
Background
With the rapid development of machine learning, machine learning algorithms have become an important data processing step in more and more fields; in the artificial intelligence industry in particular, they serve as an important algorithm-layer processing means. A machine learning algorithm must undergo model training before it can meet the business requirements of different artificial intelligence applications, and the usefulness of the resulting data model depends on how accurately the training data trains the model. Abnormal data in the training data can therefore greatly reduce the processing precision of the model.
At present, abnormal data in training data cannot be detected on its own. The loss of model precision caused by abnormal data is mitigated only by frequently replacing the training data set and training the model multiple times. Repeated training greatly increases the resources and time consumed by model training, and it cannot fundamentally eliminate the effect of abnormal data on model precision. As a result, the detection accuracy for abnormal data is low and model training is ineffective. A method for detecting abnormal data is therefore urgently needed to solve these problems.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for detecting abnormal data, a storage medium, and a terminal, and mainly aims to solve the problem of low accuracy in detecting the existing abnormal data.
According to an aspect of the present invention, there is provided a method for detecting abnormal data, including:
acquiring at least one group of training sample data in a training sample data set to be subjected to model training;
respectively screening the training sample data based on feature classification, label attributes and a time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data, wherein the feature classification may comprise a feature dispersion threshold and a data-to-noise ratio threshold, the label attributes comprise a label concentration threshold and a label coverage threshold, and the time dimension comprises time-diversity span data and non-time-diversity span data;
if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determining the training sample data as abnormal data;
wherein respectively screening the training sample data based on the feature classification, the label attributes and the time dimension to obtain the feature classification result, the label attribute result and the time dimension result of the training sample data includes:
performing feature extraction on the training sample data based on a feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining the feature classification result;
performing label clustering on the labeled training sample data based on a label clustering algorithm, analyzing a second matching state between the sample clustering labels obtained by the label clustering and the label attributes, and determining the label attribute result;
and determining the time identifier of the training sample data, analyzing a third matching state between the training sample data with the time identifier and the time dimension, and determining the time dimension result.
Further, the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, and performing feature extraction on the training sample data based on the feature extraction model, analyzing the first matching state between the extracted sample features and the feature classification, and determining the feature classification result includes:
performing feature extraction on the training sample data based on a feature extraction model that has completed model training to obtain sample features of the training sample data;
calculating the feature dispersion of the training sample data based on the standard deviation and the average value of the sample features, and calculating the data-to-noise ratio of the training sample data based on the classification probability of the sample features;
if the feature dispersion is greater than the feature dispersion threshold, determining the training sample data as feature dispersion data; and/or
if the data-to-noise ratio is greater than the data-to-noise ratio threshold, determining the training sample data as feature noise data.
Further, the label attributes include a label concentration threshold and a label coverage threshold, and performing label clustering on the labeled training sample data based on the label clustering algorithm, analyzing the second matching state between the sample clustering labels obtained by the label clustering and the label attributes, and determining the label attribute result includes:
obtaining a label of the training sample data, and performing label clustering on the training sample data corresponding to the label based on a label clustering algorithm that has completed model training to obtain sample clustering labels;
calculating a label concentration ratio of the training sample data based on the variance of the sample clustering labels, and calculating a label coverage ratio of the training sample data based on the number of the sample clustering labels and the required number of labels;
if the label concentration ratio is smaller than the label concentration threshold, determining the training sample data as label discrete data; and/or
if the label coverage ratio is smaller than the label coverage threshold, determining the training sample data as label offset data.
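The label screening above can be sketched as follows. This is only a minimal illustration under stated assumptions: the source says the label concentration ratio is calculated "based on the variance of the sample clustering labels" but gives no formula, so the mapping `1 / (1 + variance)` used here is a hypothetical choice that merely decreases as the variance grows. The coverage ratio is assumed to be the number of distinct sample clustering labels divided by the required number of labels. All function names are illustrative and do not come from the patent.

```python
import statistics


def label_concentration_ratio(cluster_labels):
    """Map the variance of numeric cluster labels to a ratio in (0, 1].

    ASSUMPTION: the source gives no formula; 1 / (1 + variance) is a
    hypothetical mapping that shrinks as the labels become more spread out.
    """
    return 1.0 / (1.0 + statistics.pvariance(cluster_labels))


def label_coverage_ratio(cluster_labels, required_label_count):
    """Distinct cluster labels divided by the required number of labels (assumed)."""
    if required_label_count <= 0:
        raise ValueError("required_label_count must be positive")
    return len(set(cluster_labels)) / required_label_count


def screen_labels(cluster_labels, concentration_threshold,
                  coverage_threshold, required_label_count):
    """Return the label findings for one group of training sample data."""
    findings = []
    if label_concentration_ratio(cluster_labels) < concentration_threshold:
        findings.append("label discrete data")
    if label_coverage_ratio(cluster_labels, required_label_count) < coverage_threshold:
        findings.append("label offset data")
    return findings
```

For example, `screen_labels([0, 0, 1, 1], 0.5, 0.75, 4)` reports only label offset data, because two distinct cluster labels cover half of the four required labels.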
Further, the analyzing the third matching state of the training sample data with the time identifier and the time dimension, and the determining the time dimension result includes:
determining a time length and a time span of the training sample data based on the time identification;
and if the ratio of the time length to the time span is greater than the time dimension, determining the training sample data as time diversity span data.
Further, after determining that the training sample data is abnormal data, the method further includes:
searching an abnormal target from the training sample data based on the feature classification, the label attribute and the time dimension, and deleting the abnormal target;
and re-screening the training sample data with the abnormal target deleted to obtain normal training sample data for model training.
Further, the acquiring at least one set of training sample data in the training sample data set to be subjected to model training includes:
determining a business requirement to be subjected to model training, wherein the business requirement is used for representing business content expected to be processed by using a model;
determining the group number of training sample data matched with the service requirement based on a preset service requirement proportional relation, wherein the preset service requirement proportional relation is used for representing the corresponding relation between different service requirements and the group number of different training sample data;
and randomly acquiring training sample data corresponding to the group number.
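The acquisition steps above might be sketched as below. The mapping `PRESET_GROUP_COUNTS`, its keys and its values are hypothetical placeholders for the "preset service requirement proportional relation"; the source does not specify its contents.

```python
import random

# ASSUMPTION: hypothetical preset relation between a business requirement
# and the number of training sample groups to acquire.
PRESET_GROUP_COUNTS = {"classification": 3, "prediction": 2, "regression": 1}


def acquire_sample_groups(sample_groups, business_requirement, rng=None):
    """Randomly pick the number of groups matched to the business requirement."""
    rng = rng or random.Random()
    group_count = PRESET_GROUP_COUNTS[business_requirement]
    if group_count > len(sample_groups):
        raise ValueError("not enough training sample groups in the data set")
    # random.Random.sample draws without replacement, so no group repeats.
    return rng.sample(sample_groups, group_count)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can help when a detection run needs to be repeated.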
According to another aspect of the present invention, there is provided an apparatus for detecting abnormal data, including:
the acquisition module is used for acquiring at least one group of training sample data in a training sample data set to be subjected to model training;
the processing module is used for respectively screening the training sample data based on feature classification, label attributes and a time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data, wherein the feature classification may comprise a feature dispersion threshold and a data-to-noise ratio threshold, the label attributes comprise a label concentration threshold and a label coverage threshold, and the time dimension comprises time-diversity span data and non-time-diversity span data;
a determining module, configured to determine that the training sample data is abnormal data if at least one of the feature classification result, the label attribute result, and the time dimension result matches a preset abnormal state;
wherein the processing module comprises:
the first determining unit is used for performing feature extraction on the training sample data based on a feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining the feature classification result;
a second determining unit, configured to perform label clustering on the labeled training sample data based on a label clustering algorithm, analyze a second matching state between the sample clustering labels obtained by the label clustering and the label attributes, and determine the label attribute result;
and the third determining unit is used for determining the time identifier of the training sample data, analyzing a third matching state between the training sample data with the time identifier and the time dimension, and determining the time dimension result.
Further, the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, and the first determining unit includes:
the extraction subunit is used for performing feature extraction on the training sample data based on the feature extraction model that has completed model training to obtain the sample features of the training sample data;
a first calculating subunit, configured to calculate a feature dispersion of the training sample data based on a standard deviation and an average of the sample features, and calculate a data-to-noise ratio of the training sample data based on a classification probability of the sample features;
a first determining subunit, configured to determine the training sample data as feature dispersion data if the feature dispersion is greater than the feature dispersion threshold; and/or
a second determining subunit, configured to determine the training sample data as feature noise data if the data-to-noise ratio is greater than the data-to-noise ratio threshold.
Further, the label attributes include a label concentration threshold and a label coverage threshold, and the second determining unit includes:
the acquisition subunit is used for acquiring the label of the training sample data and performing label clustering on the training sample data corresponding to the label based on a label clustering algorithm that has completed model training to obtain sample clustering labels;
the second calculating subunit is used for calculating the label concentration ratio of the training sample data based on the variance of the sample clustering labels, and calculating the label coverage ratio of the training sample data based on the number of the sample clustering labels and the required number of labels;
a third determining subunit, configured to determine the training sample data as label discrete data if the label concentration ratio is smaller than the label concentration threshold; and/or
a fourth determining subunit, configured to determine the training sample data as label offset data if the label coverage ratio is smaller than the label coverage threshold.
Further, the third determination unit includes:
a fifth determining subunit, configured to determine a time length and a time span of the training sample data based on the time identifier;
a sixth determining subunit, configured to determine, if a ratio of the time length to the time span is greater than the time dimension, the training sample data as time diversity span data.
Further, the apparatus further comprises a searching module:
the searching module is used for searching an abnormal target from the training sample data based on the feature classification, the label attribute and the time dimension, and deleting the abnormal target;
and the processing module is also used for carrying out screening processing again on the training sample data with the abnormal target deleted so as to obtain normal training sample data for model training.
Further, the obtaining module comprises:
a fourth determining unit, configured to determine a service requirement to be subjected to model training, where the service requirement is used to represent service content expected to be processed by using a model;
a fifth determining unit, configured to determine, based on a preset service demand proportional relationship, a group number of training sample data matched with the service demand, where the preset service demand proportional relationship is used to represent a corresponding relationship between different service demands and the group number of different training sample data;
and the acquisition unit is used for randomly acquiring the training sample data corresponding to the group number.
According to another aspect of the present invention, there is provided a storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the above-mentioned abnormal data detection method.
According to still another aspect of the present invention, there is provided a terminal including: a processor, a memory, a communication interface, and a communication bus through which the processor, the memory, and the communication interface communicate;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the abnormal data detection method.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
the invention provides a method and a device for detecting abnormal data, a storage medium and a terminal. Compared with the prior art, the embodiment of the invention obtains at least one group of training sample data in a training sample data set to be subjected to model training; respectively screens the training sample data based on the feature classification, the label attribute and the time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data; and, if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determines the training sample data as abnormal data. This ensures the accuracy of model training, greatly accelerates model training, and fundamentally avoids the reduction in training accuracy caused by abnormal data, thereby making model training efficient.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for detecting abnormal data according to an embodiment of the present invention;
FIG. 2 is a flow chart of another abnormal data detection method provided by the embodiment of the invention;
FIG. 3 is a schematic diagram illustrating an abnormal data detection engine according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for detecting abnormal data according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An embodiment of the present invention provides a method for detecting abnormal data, as shown in fig. 1, the method includes:
101. Acquire at least one group of training sample data in a training sample data set to be subjected to model training.
In the embodiment of the invention, the training sample data set of the model training comprises a plurality of arrays consisting of different training sample data so as to be used as an acquisition object for each abnormal data detection. When training sample data to be detected is acquired, abnormal data detection may be performed on a group of training sample data, or abnormal data detection may be performed on multiple groups of training sample data at the same time.
It should be noted that, the model training in the embodiment of the present invention is applicable to machine learning models with different processing requirements established in different service scenarios, where the service scenarios include, but are not limited to, network security, artificial intelligence, information transaction, product application, and the like, the processing requirements include, but are not limited to, classification, prediction, regression, and the like, and the machine learning models include, but are not limited to, a neural network model, a support vector machine model, a decision tree model, and the like, and the embodiment of the present invention is not particularly limited. In addition, the training sample data set for model training needs to detect abnormal data before performing model training, so as to avoid the accuracy reduction of model training caused by the existence of abnormal data.
102. Respectively screen the training sample data based on the feature classification, the label attribute and the time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data.
The feature classification is used for representing classification contents determined for the training sample data based on the distribution of data feature attributes, the label attribute is used for representing attribute contents determined for the training sample data based on the distribution of data labels, and the time dimension is used for representing how the training sample data spans dimensions according to the different time identifiers recorded. In addition, since the training sample data is screened in the embodiment of the present invention, the feature classification, the label attribute, and the time dimension are all predetermined thresholds that serve as screening bases. That is, the feature classification may include a feature dispersion threshold and a data-to-noise ratio threshold, the label attribute includes a label concentration threshold and a label coverage threshold, and the time dimension covers whether the training sample data is time-diversity span data, so as to obtain a feature classification result, a label attribute result, and a time dimension result of the training sample data.
It should be noted that, in the process of performing the screening processing based on the feature classification, the label attribute, and the time dimension, the feature dispersion, the data-to-noise ratio, the label concentration ratio, the label coverage ratio, the time length, and the time span of each data item in the training sample data need to be determined. The screening processing is then completed by comparing these values against the feature classification, the label attribute, and the time dimension, and the feature classification result, the label attribute result, and the time dimension result are obtained.
103. If at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determine the training sample data as abnormal data.
In the embodiment of the invention, in order to accurately determine the data belonging to the abnormal condition in the training sample data, if at least one of the feature classification result, the label attribute result and the time dimension result matches the preset abnormal state, the training sample data is identified as abnormal data. The preset abnormal state is a pre-configured condition under which the feature classification result, the label attribute result, or the time dimension result reaches an abnormal count, an abnormal value, or an abnormal value ratio. For example, the preset abnormal state may be configured with a feature noise data count of 10; if the number of feature noise data in the feature classification result is greater than 10, the preset abnormal state is matched and the training sample data is determined to be abnormal data. This is not specifically limited in the embodiment of the present invention.
It should be noted that, in the embodiment of the present invention, in order to strengthen anomaly detection on the training sample data, the training sample data is determined to be abnormal data as long as any one or more of the feature classification result, the label attribute result, and the time dimension result matches the preset abnormal state. The whole group of training sample data in which abnormal data is detected is determined as abnormal data, so that the user can avoid using that group of training sample data, or can further extract the target abnormal data from it.
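The "any one or more" rule in step 103 can be sketched as follows. This is a minimal illustration that assumes each screening result is summarized as a dictionary of counts and that the preset abnormal state is a dictionary of count limits; both representations, and the key names, are hypothetical.

```python
def matched_abnormal_items(result_counts, preset_limits):
    """Return the result items whose count exceeds the preset limit."""
    return [
        name
        for name, count in result_counts.items()
        if name in preset_limits and count > preset_limits[name]
    ]


def is_abnormal(feature_result, label_result, time_result, preset_limits):
    """A group is abnormal if any of the three results matches the preset state."""
    return any(
        matched_abnormal_items(result, preset_limits)
        for result in (feature_result, label_result, time_result)
    )
```

With `preset_limits = {"feature_noise_data": 10}`, a feature classification result containing 11 feature noise data items marks the group as abnormal, while a result containing exactly 10 does not, mirroring the "greater than 10" example in the text.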
In this embodiment of the present invention, for further limitation and explanation, as shown in fig. 2, the step 102 of respectively performing a screening process on the training sample data based on the feature classification, the label attribute, and the time dimension to obtain a feature classification result, a label attribute result, and a time dimension result of the training sample data includes: 1021. performing feature extraction on the training sample data based on a feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining a feature classification result; 1022. performing label clustering on the labeled training sample data based on a label clustering algorithm, analyzing a second matching state between the sample clustering labels obtained by the label clustering and the label attributes, and determining a label attribute result; 1023. determining the time identifier of the training sample data, analyzing a third matching state between the training sample data with the time identifier and the time dimension, and determining a time dimension result.
In order to perform screening processing on training sample data according to the feature classification, the label attribute and the time dimension, the training sample data is respectively subjected to feature extraction, label clustering and time identifier determination so as to obtain the states respectively matched with the feature classification, the label attribute and the time dimension. The feature extraction model is a neural network model for extracting features of training sample data, including but not limited to a deep residual neural network model ResNet, a convolutional neural network model VGG and the like. A feature extraction model capable of directly extracting features is obtained by performing model training in advance on different labeled feature data; after feature extraction is completed through the feature extraction model, a first matching state between the extracted sample features and the feature classification is analyzed, so that a feature classification result is determined. The label clustering algorithm is a clustering algorithm for performing label clustering on training sample data, including but not limited to a K-Means clustering algorithm, a Gaussian mixture clustering algorithm and the like. The label clustering algorithm is trained in advance, label clustering is completed through the trained algorithm, and a second matching state between the resulting sample clustering labels and the label attribute is analyzed, so that a label attribute result is determined. In addition, for the screening of the time dimension, each training sample data item carries in advance the time identifier of when the data was acquired, so that a third matching state with the time dimension can be determined through the time identifier of each training sample data item, and a time dimension result is determined.
It should be noted that the matching state is used to characterize whether the determined sample features, sample clustering labels, and time identifiers respectively satisfy the conditions of the feature classification, the label attribute, and the time dimension, and thus serves as a basis for determining the feature classification result, the label attribute result, and the time dimension result.
In an embodiment of the present invention, for further limitation and description, the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, and step 1021 of performing feature extraction on the training sample data based on a feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining a feature classification result includes: performing feature extraction on the training sample data based on a feature extraction model that has completed model training to obtain sample features of the training sample data; calculating the feature dispersion of the training sample data based on the standard deviation and the average value of the sample features, and calculating the data-to-noise ratio of the training sample data based on the classification probability of the sample features; if the feature dispersion is greater than the feature dispersion threshold, determining the training sample data as feature dispersion data; and/or determining the training sample data as feature noise data if the data-to-noise ratio is greater than the data-to-noise ratio threshold.
Specifically, to improve the detection accuracy of feature classification in abnormal data, and since the feature extraction model includes, but is not limited to, the deep residual neural network ResNet and the convolutional neural network VGG, feature extraction is performed on the training sample data only through the trained feature extraction model, so that the sample features of the training sample data are obtained. In the embodiment of the invention, because the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, after the sample features are obtained, the feature dispersion of the training sample data is calculated based on the standard deviation and the average value of the sample features, and the data-to-noise ratio of the training sample data is calculated based on the classification probability of the sample features. The feature dispersion is the coefficient of variation of the features. For different training sample data, the feature dispersion is the ratio of the standard deviation to the average of a group of data, and the formula is $V = S / Y$, where $S$ is the standard deviation of the sample features and $Y$ is the average value of the sample features, which is not specifically limited in the embodiments of the present invention.
In addition, the data-to-noise ratio is calculated from the classification probability of the sample features. That is, the sample features are classified by a classifier, the classification probability is determined from the number of classified sample features, and the ratio of the classification probability to a preset classification threshold gives the data-to-noise ratio, where the classification features and the preset classification threshold may be pre-configured, and embodiments of the present invention are not particularly limited. After the feature dispersion and the data-to-noise ratio are calculated, they are compared with the feature dispersion threshold and the data-to-noise ratio threshold respectively. When the feature dispersion is greater than the feature dispersion threshold, the training sample data is determined as feature discrete data, and when the data-to-noise ratio is greater than the data-to-noise ratio threshold, the training sample data is determined as feature noise data. That is, it is determined whether the feature classification result includes feature discrete data and feature noise data, which is not specifically limited in the embodiment of the present invention.
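The feature screening described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of the population standard deviation, and the reading of the classification probability as the share of classified features among all features are assumptions made for the example, and the extracted features are assumed to be plain positive numbers.

```python
from statistics import mean, pstdev


def feature_dispersion(features):
    """Coefficient of variation: standard deviation S divided by the average Y."""
    average = mean(features)
    if average == 0:
        raise ValueError("average of the sample features is zero")
    # Assumption: population standard deviation; the source does not say which.
    return pstdev(features) / average


def data_noise_ratio(classified_count, total_count, classification_threshold):
    """Classification probability divided by a preset classification threshold."""
    # Assumption: probability = classified features / all features.
    probability = classified_count / total_count
    return probability / classification_threshold


def screen_features(features, classified_count, total_count,
                    classification_threshold,
                    dispersion_threshold, noise_ratio_threshold):
    """Return the set of feature classification findings for one sample group."""
    findings = set()
    if feature_dispersion(features) > dispersion_threshold:
        findings.add("feature_discrete")
    ratio = data_noise_ratio(classified_count, total_count,
                             classification_threshold)
    if ratio > noise_ratio_threshold:
        findings.add("feature_noise")
    return findings


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise both checks.
    print(screen_features([2, 4, 4, 4, 5, 5, 7, 9], 30, 100, 0.2, 0.3, 1.0))
```

Both checks are independent, matching the "and/or" in the description: a sample group may be flagged as feature discrete data, feature noise data, both, or neither.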
In this embodiment of the present invention, for further limitation and description, the label attribute includes a label concentration threshold and a label coverage threshold. In step 1022, performing label clustering on the training sample data with labels based on a label clustering algorithm, analyzing a second matching state between the sample clustering labels obtained by label clustering and the label attribute, and determining a label attribute result includes: obtaining the labels of the training sample data, and performing label clustering on the training sample data corresponding to the labels based on the trained label clustering algorithm to obtain sample clustering labels; calculating a label concentration ratio of the training sample data based on the variance of the sample clustering labels, and calculating a label coverage ratio of the training sample data based on the number of the sample clustering labels and the required number of labels; if the label concentration ratio is smaller than the label concentration threshold, determining the training sample data as label discrete data; and/or, if the label coverage ratio is smaller than the label coverage threshold, determining the training sample data as label offset data.
Specifically, to improve the detection accuracy of the label attribute in abnormal data, and since the clustering algorithm includes, but is not limited to, the K-Means clustering algorithm and the Gaussian mixture clustering algorithm, label clustering is performed on the labeled training sample data only through the trained label clustering algorithm, so as to obtain the sample clustering labels of the training sample data. Because the label attribute includes a label concentration threshold and a label coverage threshold, after the sample clustering labels are obtained, the label concentration ratio of the training sample data is calculated based on the variance of the sample clustering labels, and the label coverage ratio of the training sample data is calculated based on the number of the sample clustering labels and the required number of labels. The label concentration ratio represents the concentration trend of the sample labels and is calculated as the ratio of the variance of the sample clustering labels to the variance of the preset classification labels. The label coverage ratio represents how comprehensively the training samples cover the classes and is calculated as the ratio of the number of sample clustering labels to the required number of labels. The preset classification label variance is determined based on the number and type of the predetermined classification labels, and the required number of labels is determined based on the preset training data requirements.
After the label concentration ratio and the label coverage ratio are calculated, they are compared with the label concentration threshold and the label coverage threshold respectively. When the label concentration ratio is smaller than the label concentration threshold, the training sample data is determined as label discrete data, and when the label coverage ratio is smaller than the label coverage threshold, the training sample data is determined as label offset data. That is, it is determined whether the label attribute result includes label discrete data and label offset data.
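The two label ratios can be illustrated with a short sketch. It assumes that the sample clustering labels are integer cluster ids, that the variance is the population variance, and that "the number of sample clustering labels" means the number of distinct labels. These choices, together with all function names and numeric values, are assumptions for the example and are not stated in the source.

```python
from statistics import pvariance


def label_concentration_ratio(cluster_labels, preset_label_variance):
    """Variance of the sample clustering labels over the preset label variance."""
    return pvariance(cluster_labels) / preset_label_variance


def label_coverage_ratio(cluster_labels, required_label_count):
    """Number of distinct clustering labels over the required number of labels."""
    return len(set(cluster_labels)) / required_label_count


def screen_labels(cluster_labels, preset_label_variance, required_label_count,
                  concentration_threshold, coverage_threshold):
    """Return the set of label attribute findings for one sample group."""
    findings = set()
    concentration = label_concentration_ratio(cluster_labels,
                                              preset_label_variance)
    if concentration < concentration_threshold:
        findings.add("label_discrete")
    coverage = label_coverage_ratio(cluster_labels, required_label_count)
    if coverage < coverage_threshold:
        findings.add("label_offset")
    return findings


if __name__ == "__main__":
    # Hypothetical cluster ids: only two of the four required classes appear.
    print(screen_labels([0, 0, 1, 1], 0.5, 4, 0.8, 0.75))
```

Note that both comparisons use "smaller than", unlike the feature checks, which flag values greater than their thresholds.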
In an embodiment of the present invention, for further limitation and description, in step 1023, analyzing a third matching state between the training sample data with a time identifier and the time dimension and determining the time dimension result includes: determining a time length and a time span of the training sample data based on the time identifier; and if the ratio of the time length to the time span is greater than the time dimension, determining the training sample data as time diversity span data.
Specifically, each training sample data corresponds to a time identifier when being acquired or generated, and in order to improve the detection accuracy of the time dimension in the abnormal data, the time length and the time span of the training sample data may be determined based on the time identifier, for example, the time length is 30 hours, the time span is 2 days, and the like. Furthermore, a ratio of the time length to the time span is calculated and used as a comparison basis with a time dimension to determine whether the training sample data is time-diversity span data, wherein the time dimension is a preset dimension threshold, which is not specifically limited in the embodiments of the present invention.
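The time dimension check reduces to one ratio and one comparison. The sketch below uses the example values from the text (a time length of 30 hours and a time span of 2 days); the function names and the threshold value 0.5 are hypothetical, since the source only says the time dimension is a preset threshold.

```python
from datetime import timedelta


def time_ratio(time_length, time_span):
    """Ratio of the time length to the time span of a sample group."""
    if time_span <= timedelta(0):
        raise ValueError("time span must be positive")
    # Dividing one timedelta by another yields a plain float.
    return time_length / time_span


def is_time_diversity_span(time_length, time_span, time_dimension_threshold):
    """True if the ratio exceeds the preset time dimension threshold."""
    return time_ratio(time_length, time_span) > time_dimension_threshold


if __name__ == "__main__":
    length = timedelta(hours=30)  # example value from the description
    span = timedelta(days=2)      # example value from the description
    print(time_ratio(length, span))
    print(is_time_diversity_span(length, span, 0.5))  # 0.5 is hypothetical
```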
It should be noted that, as shown in the schematic structural diagram of the abnormal data detection engine in fig. 3, training sample data is fed into the detection engine. After the feature dispersion, the data-to-noise ratio, the label concentration ratio, the label coverage ratio and the time span of the training sample data are calculated, screening is performed based on the preset feature classification, label attribute and time dimension to obtain a feature classification result, a label attribute result and a time dimension result. These three results are then matched against a preset abnormal state, thereby determining whether the training sample data is abnormal data.
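The final matching step can be sketched as a set intersection. The state names below are illustrative labels for the findings named in the description (feature discrete data, feature noise data, label discrete data, label offset data, time diversity span data); representing each result as a set of such names is an assumption of this example, not something the source specifies.

```python
# Hypothetical names for the preset abnormal states.
PRESET_ABNORMAL_STATES = frozenset({
    "feature_discrete",
    "feature_noise",
    "label_discrete",
    "label_offset",
    "time_diversity_span",
})


def is_abnormal(feature_result, label_result, time_result,
                abnormal_states=PRESET_ABNORMAL_STATES):
    """True if at least one of the three results matches a preset abnormal state."""
    findings = set(feature_result) | set(label_result) | set(time_result)
    return not findings.isdisjoint(abnormal_states)


if __name__ == "__main__":
    print(is_abnormal({"feature_noise"}, set(), set()))
    print(is_abnormal(set(), set(), set()))
```

A single match in any of the three results is enough to mark the group as abnormal, which mirrors the "at least one of" condition.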
In an embodiment of the present invention, for further limitation and description, after determining that the training sample data is abnormal data, the method further includes: searching an abnormal target from the training sample data based on the feature classification, the label attribute and the time dimension, and deleting the abnormal target; and re-screening the training sample data with the abnormal target deleted to obtain normal training sample data for model training.
After the training sample data is determined to be abnormal data, in order to reuse the training sample data and improve the accuracy of model training, the abnormal target in the abnormal data needs to be found and deleted, so the abnormal target is searched for in the training sample data based on the feature classification, the label attribute and the time dimension. Specifically, because the feature classification and label attribute measures are ratios calculated from the variance or the average value, the abnormal target for the feature classification is the data with the largest influence on the feature dispersion and the data-to-noise ratio. The feature dispersion and the data-to-noise ratio can be recalculated while iteratively deleting each item in a group of training sample data in turn, and the item whose deletion has the greatest influence on the feature dispersion and the data-to-noise ratio is selected as the abnormal target. Similarly, for the abnormal target of the label attribute and the time dimension, the label concentration ratio, the label coverage ratio, and the ratio of the time length to the time span are recalculated while iteratively deleting each item, and the item whose deletion has the greatest influence on these ratios is selected as the abnormal target. After the abnormal target is found, it is deleted, and the training sample data with the abnormal target deleted is screened again until no abnormal data can be detected in the training sample data, so that it can be used as normal training sample data for model training.
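One way to read the iterative deletion is as a leave-one-out search, sketched below for the feature dispersion only. The selection rule used here, removing the item whose deletion lowers the dispersion the most, is an interpretation of "largest influence" and not a rule confirmed by the source. The function names and the stopping condition of at least two remaining items are likewise assumptions, and the values are assumed to be positive so that the average is never zero.

```python
from statistics import mean, pstdev


def dispersion(values):
    """Coefficient of variation of a list of positive numbers."""
    return pstdev(values) / mean(values)


def remove_abnormal_targets(values, dispersion_threshold):
    """Delete items one at a time until the dispersion passes the threshold.

    Returns (remaining, removed). In each round every item is left out in
    turn, the dispersion is recalculated, and the item whose removal gives
    the lowest dispersion is treated as the abnormal target.
    """
    remaining = list(values)
    removed = []
    while len(remaining) > 2 and dispersion(remaining) > dispersion_threshold:
        target_index = min(
            range(len(remaining)),
            key=lambda i: dispersion(remaining[:i] + remaining[i + 1:]),
        )
        removed.append(remaining.pop(target_index))
    return remaining, removed


if __name__ == "__main__":
    # Hypothetical feature values with one obvious outlier.
    print(remove_abnormal_targets([10, 11, 9, 10, 50], 0.2))
```

The loop re-screens after every deletion, which corresponds to screening the cleaned data again until no abnormal data is detected. The same pattern could be applied to the label and time ratios by swapping the measure.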
In the embodiment of the present invention, for further limitation and description, in step 101, acquiring at least one group of training sample data in a training sample data set to be subjected to model training includes: determining a business requirement to be subjected to model training; determining the group number of training sample data matched with the business requirement based on a preset business requirement proportional relationship; and randomly acquiring training sample data corresponding to the group number.
The acquired training sample data is at least one group in the training sample data set. To improve the efficiency of abnormal data detection, the business requirement of the model training is determined first, the group number of training sample data matched with the business requirement is then determined based on the preset business requirement proportional relationship, and the training sample data is acquired. In the embodiment of the present invention, since different models are trained for different business requirements, the business requirement of the model corresponding to the training sample data set, for example a classification requirement or a prediction requirement, is determined first. The group number of training sample data to be acquired is then determined based on the preset business requirement proportional relationship, which represents the correspondence between different business requirements and the group numbers of training sample data, so that after the group number is determined, that number of groups of training sample data is randomly acquired.
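The acquisition step can be sketched as a lookup followed by random sampling. The mapping values below are invented for illustration, since the source does not give the actual proportional relationship, and the function name and the cap at the number of available groups are assumptions of the example.

```python
import random

# Hypothetical preset relationship: business requirement -> number of groups.
GROUPS_PER_REQUIREMENT = {
    "classification": 3,
    "prediction": 5,
}


def acquire_sample_groups(sample_groups, business_requirement,
                          relationship=GROUPS_PER_REQUIREMENT, rng=None):
    """Randomly pick the number of groups that matches the business requirement."""
    rng = rng or random.Random()
    group_count = relationship[business_requirement]
    # Never ask for more groups than the data set contains.
    group_count = min(group_count, len(sample_groups))
    return rng.sample(sample_groups, group_count)


if __name__ == "__main__":
    # Ten hypothetical groups of training sample data.
    groups = [[i, i + 1, i + 2] for i in range(10)]
    picked = acquire_sample_groups(groups, "classification",
                                   rng=random.Random(0))
    print(len(picked))
```

Passing a seeded `random.Random` instance makes the selection reproducible for testing, while the default remains an unseeded random draw.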
Compared with the prior art, the embodiment of the invention provides a method for detecting abnormal data, which includes: acquiring at least one group of training sample data in a training sample data set to be subjected to model training; screening the training sample data based on the feature classification, the label attribute and the time dimension respectively to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data; and, if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determining the training sample data as abnormal data. This ensures the accuracy of model training, greatly accelerates model training, and fundamentally avoids the loss of model training accuracy caused by abnormal data, thereby making model training efficient.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides an apparatus for detecting abnormal data, as shown in fig. 4, where the apparatus includes:
an obtaining module 21, configured to obtain at least one set of training sample data in a training sample data set to be subjected to model training;
a processing module 22, configured to perform screening processing on the training sample data based on a feature classification, a tag attribute, and a time dimension, to obtain a feature classification result, a tag attribute result, and a time dimension result of the training sample data, where the feature classification may include a feature discrete threshold and a data-to-noise ratio threshold, the tag attribute includes a tag concentration threshold and a tag coverage threshold, and the time dimension includes time-diversity span data and non-time-diversity span data;
a determining module 23, configured to determine that the training sample data is abnormal data if at least one of the feature classification result, the tag attribute result, and the time dimension result matches a preset abnormal state;
wherein the processing module 22 comprises:
the first determining unit is used for extracting the features of the training sample data based on a feature extraction model, analyzing the sample features after the features are extracted and the first matching state of the feature classification, and determining a feature classification result;
a second determining unit, configured to perform label clustering on the training sample data with labels based on a label clustering algorithm, and analyze a second matching state between the label-classified sample clustering labels and the label attributes, and determine a label attribute result;
and the third determining unit is used for determining the time identifier of the training sample data, analyzing the training sample data with the time identifier and a third matching state of the time dimension, and determining a time dimension result.
Further, the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, and the first determining unit includes:
an extraction subunit, configured to perform feature extraction on the training sample data based on the trained feature extraction model to obtain the sample features of the training sample data;
a first calculating subunit, configured to calculate a feature dispersion of the training sample data based on a standard deviation and an average of the sample features, and calculate a data-to-noise ratio of the training sample data based on a classification probability of the sample features;
a first determining subunit, configured to determine the training sample data as feature discrete data if the feature dispersion is greater than the feature dispersion threshold; and/or,
a second determining subunit, configured to determine the training sample data as feature noise data if the data-to-noise ratio is greater than the data-to-noise ratio threshold.
Further, the label attribute includes a label concentration threshold and a label coverage threshold, and the second determining unit includes:
an acquisition subunit, configured to obtain the labels of the training sample data and perform label clustering on the training sample data corresponding to the labels based on the trained label clustering algorithm to obtain sample clustering labels;
a second calculating subunit, configured to calculate the label concentration ratio of the training sample data based on the variance of the sample clustering labels, and calculate the label coverage ratio of the training sample data based on the number of the sample clustering labels and the required number of labels;
a third determining subunit, configured to determine the training sample data as label discrete data if the label concentration ratio is smaller than the label concentration threshold; and/or,
a fourth determining subunit, configured to determine the training sample data as label offset data if the label coverage ratio is smaller than the label coverage threshold.
Further, the third determination unit includes:
a fifth determining subunit, configured to determine a time length and a time span of the training sample data based on the time identifier;
a sixth determining subunit, configured to determine, if a ratio of the time length to the time span is greater than the time dimension, the training sample data as time diversity span data.
Further, the apparatus further comprises a search module:
the searching module is used for searching an abnormal target from the training sample data based on the feature classification, the label attribute and the time dimension, and deleting the abnormal target;
and the processing module is also used for carrying out screening processing again on the training sample data with the abnormal target deleted so as to obtain normal training sample data for model training.
Further, the obtaining module comprises:
a fourth determining unit, configured to determine a service requirement to be subjected to model training, where the service requirement is used to represent service content expected to be processed by using a model;
a fifth determining unit, configured to determine, based on a preset service demand proportional relationship, a group number of training sample data matched with the service demand, where the preset service demand proportional relationship is used to represent a corresponding relationship between different service demands and the group number of different training sample data;
and the acquisition unit is used for randomly acquiring the training sample data corresponding to the group number.
Compared with the prior art, the embodiment of the invention provides a device for detecting abnormal data, which acquires at least one group of training sample data in a training sample data set to be subjected to model training; screens the training sample data based on the feature classification, the label attribute and the time dimension respectively to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data; and, if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determines the training sample data as abnormal data. This ensures the accuracy of model training, greatly accelerates model training, and fundamentally avoids the loss of model training accuracy caused by abnormal data, thereby making model training efficient.
According to an embodiment of the present invention, a storage medium is provided, in which at least one executable instruction is stored, and the executable instruction can execute the method for detecting abnormal data in any of the above method embodiments.
Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the terminal.
As shown in fig. 5, the terminal may include: a processor 302, a communication interface 304, a memory 306, and a communication bus 308.
Wherein: the processor 302, communication interface 304, and memory 306 communicate with each other via a communication bus 308.
A communication interface 304 for communicating with network elements of other devices, such as clients or other servers.
The processor 302 is configured to execute the program 310, and may specifically perform relevant steps in the above-described method for detecting abnormal data.
In particular, program 310 may include program code comprising computer operating instructions.
The processor 302 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement an embodiment of the present invention. The terminal comprises one or more processors, which can be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.
And a memory 306 for storing a program 310. Memory 306 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 310 may specifically be configured to cause the processor 302 to perform the following operations:
acquiring at least one group of training sample data in a training sample data set to be subjected to model training;
respectively screening the training sample data based on feature classification, label attributes and a time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data, wherein the feature classification can comprise a feature discrete threshold and a data-to-noise ratio threshold, the label attributes comprise a label concentration threshold and a label coverage threshold, and the time dimension comprises time-diversity span data and non-time-diversity span data;
if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determining the training sample data as abnormal data;
the obtaining of the feature classification result, the label attribute result, and the time dimension result of the training sample data by respectively performing screening processing on the training sample data based on the feature classification, the label attribute, and the time dimension includes:
performing feature extraction on the training sample data based on a feature extraction model, analyzing the sample features after the feature extraction and a first matching state of the feature classification, and determining a feature classification result;
performing label clustering on the training sample data with the labels based on a label clustering algorithm, analyzing a second matching state of the label clustering labels and the label attributes of the samples after label classification, and determining a label attribute result;
and determining the time identification of the training sample data, analyzing the training sample data with the time identification and a third matching state of the time dimension, and determining a time dimension result.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for detecting anomalous data, comprising:
acquiring at least one group of training sample data in a training sample data set to be subjected to model training;
respectively screening the training sample data based on feature classification, label attributes and a time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data, wherein the feature classification can comprise a feature discrete threshold and a data-to-noise ratio threshold, the label attributes comprise a label concentration threshold and a label coverage threshold, and the time dimension comprises time-diversity span data and non-time-diversity span data;
if at least one of the feature classification result, the label attribute result and the time dimension result matches a preset abnormal state, determining the training sample data as abnormal data;
the obtaining of the feature classification result, the label attribute result, and the time dimension result of the training sample data by respectively performing screening processing on the training sample data based on the feature classification, the label attribute, and the time dimension includes:
performing feature extraction on the training sample data based on a feature extraction model, analyzing the sample features after the feature extraction and a first matching state of the feature classification, and determining a feature classification result;
performing label clustering on the training sample data with the labels based on a label clustering algorithm, analyzing a second matching state of the label clustering labels and the label attributes of the samples after label classification, and determining a label attribute result;
and determining the time identification of the training sample data, analyzing the training sample data with the time identification and a third matching state of the time dimension, and determining a time dimension result.
2. The method according to claim 1, wherein the feature classification includes a feature dispersion threshold and a data-to-noise ratio threshold, and the performing feature extraction on the training sample data based on the feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining the feature classification result includes:
performing feature extraction on the training sample data based on the trained feature extraction model to obtain sample features of the training sample data;
calculating the feature dispersion of the training sample data based on the standard deviation and the average value of the sample features, and calculating the data-to-noise ratio of the training sample data based on the classification probability of the sample features;
if the feature dispersion is greater than the feature dispersion threshold, determining the training sample data as feature discrete data; and/or,
if the data-to-noise ratio is greater than the data-to-noise ratio threshold, determining the training sample data as feature noise data.
3. The method of claim 1, wherein the label attribute comprises a label concentration threshold and a label coverage threshold, and the performing label clustering on the training sample data with labels based on a label clustering algorithm, analyzing a second matching state between the sample clustering labels obtained by label clustering and the label attribute, and determining the label attribute result comprises:
obtaining the labels of the training sample data, and performing label clustering on the training sample data corresponding to the labels based on the trained label clustering algorithm to obtain sample clustering labels;
calculating a label concentration ratio of the training sample data based on the variance of the sample clustering labels, and calculating a label coverage ratio of the training sample data based on the number of the sample clustering labels and the required number of labels;
if the label concentration ratio is smaller than the label concentration threshold, determining the training sample data as label discrete data; and/or,
if the label coverage ratio is smaller than the label coverage threshold, determining the training sample data as label offset data.
4. The method according to claim 1, wherein said parsing the training sample data with time identification to a third matching state of the time dimension, determining a time dimension result comprises:
determining a time length and a time span of the training sample data based on the time identification;
and if the ratio of the time length to the time span is greater than the time dimension, determining the training sample data as time diversity span data.
5. The method according to claim 1, wherein after determining that the training sample data is abnormal data, the method further comprises:
searching an abnormal target from the training sample data based on the feature classification, the label attribute and the time dimension, and deleting the abnormal target;
and re-screening the training sample data with the abnormal target deleted to obtain normal training sample data for model training.
6. The method according to any of claims 1-5, wherein said obtaining at least one set of training sample data in a set of training sample data to be model trained comprises:
determining a business requirement to be subjected to model training, wherein the business requirement is used for representing business content expected to be processed by using a model;
determining the group number of training sample data matched with the service requirement based on a preset service requirement proportional relation, wherein the preset service requirement proportional relation is used for representing the corresponding relation between different service requirements and the group number of different training sample data;
and randomly acquiring training sample data corresponding to the group number.
7. An apparatus for detecting abnormal data, comprising:
the acquisition module is used for acquiring at least one group of training sample data in a training sample data set to be subjected to model training;
the processing module is used for respectively screening the training sample data based on feature classification, label attributes and a time dimension to obtain a feature classification result, a label attribute result and a time dimension result of the training sample data, wherein the feature classification can comprise a feature discrete threshold and a data-to-noise ratio threshold, the label attributes comprise a label concentration threshold and a label coverage threshold, and the time dimension comprises time-diversity span data and non-time-diversity span data;
a determining module, configured to determine that the training sample data is abnormal data if at least one of the feature classification result, the label attribute result, and the time dimension result matches a preset abnormal state;
wherein the processing module comprises:
the first determining unit is used for extracting the features of the training sample data based on a feature extraction model, analyzing a first matching state between the extracted sample features and the feature classification, and determining a feature classification result;
a second determining unit, configured to perform label clustering on the labeled training sample data based on a label clustering algorithm, analyze a second matching state between the clustered sample labels and the label attributes, and determine a label attribute result;
and the third determining unit is used for determining the time identifier of the training sample data, analyzing a third matching state between the training sample data carrying the time identifier and the time dimension, and determining a time dimension result.
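The decision rule of the determining module, together with one possible label attribute screen, might look like the sketch below. It is an assumption-laden illustration rather than the claimed apparatus: the claims name a label concentration threshold and a label coverage threshold but do not define them here, so the sketch assumes that concentration is the share of the most frequent label among labeled samples and that coverage is the share of samples that carry a label. The threshold values are likewise hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScreeningResults:
    """One flag per screen; True means the result matches a preset abnormal state."""
    feature_abnormal: bool
    label_abnormal: bool
    time_abnormal: bool


def label_screen(
    labels: List[Optional[str]],
    max_concentration: float = 0.9,  # hypothetical label concentration threshold
    min_coverage: float = 0.5,       # hypothetical label coverage threshold
) -> bool:
    """Return True when the labels look abnormal under the assumed definitions."""
    labeled = [x for x in labels if x is not None]
    if not labeled:
        return True  # nothing is labeled, so coverage is zero
    coverage = len(labeled) / len(labels)
    concentration = Counter(labeled).most_common(1)[0][1] / len(labeled)
    return concentration > max_concentration or coverage < min_coverage


def is_abnormal(results: ScreeningResults) -> bool:
    """Abnormal if at least one of the three results matches an abnormal state."""
    return results.feature_abnormal or results.label_abnormal or results.time_abnormal


results = ScreeningResults(
    feature_abnormal=False,
    label_abnormal=label_screen(["a"] * 10),  # a single label dominates
    time_abnormal=False,
)
verdict = is_abnormal(results)
```

The feature and time screens are left as plain flags because their matching logic depends on the feature extraction model and time identifiers described elsewhere in the patent.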
8. A storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the method for detecting abnormal data according to any one of claims 1 to 6.
9. A terminal, comprising: a processor, a memory, a communication interface, and a communication bus through which the processor, the memory, and the communication interface communicate;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the abnormal data detection method according to any one of claims 1-6.
CN202111047033.8A 2021-09-07 2021-09-07 Abnormal data detection method and device, storage medium and terminal Pending CN113723555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111047033.8A CN113723555A (en) 2021-09-07 2021-09-07 Abnormal data detection method and device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111047033.8A CN113723555A (en) 2021-09-07 2021-09-07 Abnormal data detection method and device, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN113723555A true CN113723555A (en) 2021-11-30

Family

ID=78682372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111047033.8A Pending CN113723555A (en) 2021-09-07 2021-09-07 Abnormal data detection method and device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN113723555A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392812A (en) * 2022-10-31 2022-11-25 成都飞机工业(集团)有限责任公司 Abnormal root cause positioning method, device, equipment and medium
CN117633706A (en) * 2023-11-30 2024-03-01 众悦(威海)信息技术有限公司 Data processing method for information system data fusion

Similar Documents

Publication Publication Date Title
WO2019222462A1 (en) Identification of sensitive data using machine learning
US11636387B2 (en) System and method for improving machine learning models based on confusion error evaluation
CN111026653B (en) Abnormal program behavior detection method and device, electronic equipment and storage medium
CN113723555A (en) Abnormal data detection method and device, storage medium and terminal
CN111368289B (en) Malicious software detection method and device
CN112733146B (en) Penetration testing method, device and equipment based on machine learning and storage medium
US20180032917A1 (en) Hierarchical classifiers
CN109933502B (en) Electronic device, user operation record processing method and storage medium
CN111563074A (en) Data quality detection method and system based on multi-dimensional label
CN111931809A (en) Data processing method and device, storage medium and electronic equipment
CN113626241A (en) Application program exception handling method, device, equipment and storage medium
CN114818643A (en) Log template extraction method for reserving specific service information
CN115758183A (en) Training method and device for log anomaly detection model
CN116107834A (en) Log abnormality detection method, device, equipment and storage medium
CN112632000B (en) Log file clustering method, device, electronic equipment and readable storage medium
CN111240942A (en) Log abnormity detection method and device
CN116956026A (en) Training method and system for network asset identification model
CN115130110B (en) Vulnerability discovery method, device, equipment and medium based on parallel integrated learning
CN110795308A (en) Server inspection method, device, equipment and storage medium
CN114116811B (en) Log processing method, device, equipment and storage medium
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN113535458B (en) Abnormal false alarm processing method and device, storage medium and terminal
CN115098679A (en) Method, device, equipment and medium for detecting abnormality of text classification labeling sample
CN111931229B (en) Data identification method, device and storage medium
CN114595136A (en) Log analysis method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination