CN113435464B - Abnormal data detection method and device, electronic equipment and computer storage medium - Google Patents


Info

Publication number
CN113435464B
Authority
CN
China
Prior art keywords
data
abnormal data
abnormal
distance
width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010154535.XA
Other languages
Chinese (zh)
Other versions
CN113435464A (en)
Inventor
欧阳昭暐
谢峰
田赟
龙欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010154535.XA
Publication of CN113435464A
Application granted
Publication of CN113435464B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The embodiment of the invention discloses an abnormal data detection method, an abnormal data detection device, electronic equipment and a computer storage medium. The method comprises the following steps: acquiring data samples in a preset time period, and dividing the data samples into different types of data sample sets; acquiring data to be detected, and comparing the data to be detected with the data sample sets to determine suspected abnormal data; and performing segmentation distance comparison on the suspected abnormal data, and determining the suspected abnormal data meeting a preset condition as abnormal data. The technical scheme is not limited by the preconditions that a single detection algorithm imposes on the data, so it can detect abnormal data comprehensively, improve the accuracy and effectiveness of abnormal data detection, and provide reliable data support for the subsequent prediction of future data.

Description

Abnormal data detection method and device, electronic equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to an abnormal data detection method and device, electronic equipment and a computer storage medium.
Background
With the development of data technology, data is applied more and more widely. For example, in many scenarios such as machine learning and artificial intelligence, future data must be predicted by analyzing historical data. In such scenarios, the accuracy of the prediction depends to a great extent on the validity of the historical data. In practice, however, not all historical data are valid data or useful information: some data points or data segments may be abnormal data introduced by random, low-probability events. Such abnormal data reduce the validity of the historical data and introduce abnormal factors into subsequent data prediction, which lowers prediction accuracy, so they need to be detected effectively. However, most existing abnormal data detection methods only detect abnormal data and do not consider how much influence the abnormal data would have on the prediction result if they remained in the historical data. Moreover, current abnormal data detection methods are basically limited to a single detection mode and cannot detect abnormal data when the data do not satisfy the assumptions or requirements of the algorithm, which brings great inconvenience to the processing of abnormal data and the prediction of future data.
Disclosure of Invention
The embodiment of the invention provides an abnormal data detection method and device, electronic equipment and a computer storage medium.
In a first aspect, an embodiment of the present invention provides an abnormal data detection method.
Specifically, the abnormal data detection method includes:
acquiring data samples in a preset time period, and dividing the data samples into different types of data sample sets;
acquiring data to be detected, and comparing the data to be detected with the data sample set to determine suspected abnormal data;
and comparing the segmentation distance of the suspected abnormal data, and determining the suspected abnormal data meeting the preset conditions as abnormal data.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining data samples within a preset time period and dividing the data samples into different types of data sample sets includes:
acquiring a data sample in a preset time period;
clustering the data samples according to the similarity among the data samples to obtain data sample sets of different categories;
and training to obtain data regression models corresponding to the different types of data sample sets, and extracting data baselines of the different types of data sample sets.
With reference to the first aspect and the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the obtaining data to be detected, comparing the data to be detected with the data sample set, and determining suspected abnormal data includes:
acquiring data to be detected;
calculating the distance between the data to be detected and the data base lines of the different types of data sample sets;
and determining, as the suspected abnormal data, the data whose minimum distance to the data baselines of the different classes of data sample sets exceeds a first preset distance threshold.
With reference to the first aspect, the first implementation manner of the first aspect, and the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the performing a segmentation distance comparison on the suspected abnormal data, and determining the suspected abnormal data meeting a preset condition as abnormal data includes:
carrying out sectional processing on the suspected abnormal data to obtain two or more suspected abnormal data sections;
calculating the distance between the suspected abnormal data segments to generate a distance matrix of the suspected abnormal data segments;
and determining, as abnormal data, each suspected abnormal data segment whose maximum distance from the other suspected abnormal data segments exceeds a second preset distance threshold.
With reference to the first implementation manner of the first aspect, the second implementation manner of the first aspect, and the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the present disclosure further includes:
and performing action evaluation on the abnormal data.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, and the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the performing action evaluation on the abnormal data includes:
acquiring attribute information of the abnormal data, wherein the attribute information of the abnormal data comprises one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and performing action evaluation on the abnormal data according to the attribute information of the abnormal data.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, and the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the performing, according to the attribute information of the abnormal data, action evaluation on the abnormal data includes:
calculating according to the height of the abnormal data to obtain a height evaluation value;
calculating according to the width of the abnormal data to obtain a width evaluation value;
calculating according to the distance proportion of the abnormal data to obtain a distance evaluation value;
determining a height weight value, a width weight value and a distance weight value, and calculating to obtain a total weight value according to the height weight value, the width weight value and the distance weight value;
and acquiring an uncertainty factor evaluation value, and calculating to obtain an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertainty factor evaluation value.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, the fifth implementation manner of the first aspect, and the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the present disclosure further includes:
and performing preset processing on the abnormal data according to the action evaluation value.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, the fifth implementation manner of the first aspect, the sixth implementation manner of the first aspect, and the seventh implementation manner of the first aspect, in an eighth implementation manner of the first aspect, the performing preset processing on the abnormal data according to the action evaluation value includes:
and when the action evaluation value exceeds a preset evaluation threshold value, filtering the abnormal data.
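The action evaluation and filtering steps above can be illustrated with a short sketch. The text names the inputs (height, width and distance evaluation values, a total weight value and an uncertainty factor evaluation value) but this passage does not give the formula that combines them, so the weighted average plus an additive uncertainty term used below is purely an assumption, as are the function names and the choice to use the distance proportion directly as the distance evaluation value.

```python
def distance_proportion(anomaly_start, data_start, total_length):
    """Offset of the anomaly from the start of the data to be detected,
    expressed as a fraction of the total data length."""
    return (anomaly_start - data_start) / total_length


def action_evaluation(height_eval, width_eval, distance_eval,
                      height_w, width_w, distance_w, uncertainty_eval):
    """ASSUMED combination (not given in the text): weighted average of the
    three evaluation values, plus the uncertainty factor evaluation value."""
    total_w = height_w + width_w + distance_w  # total weight value
    weighted = (height_eval * height_w
                + width_eval * width_w
                + distance_eval * distance_w)
    return weighted / total_w + uncertainty_eval


def should_filter(score, evaluation_threshold):
    """Filter the abnormal data when the score exceeds the preset threshold."""
    return score > evaluation_threshold


# Hypothetical anomaly starting 60 points into a 100-point series.
proportion = distance_proportion(anomaly_start=60, data_start=0, total_length=100)
# ASSUMPTION: the proportion is used directly as the distance evaluation value.
score = action_evaluation(0.8, 0.5, proportion, 2.0, 1.0, 1.0, 0.05)
print(proportion, score, should_filter(score, 0.7))
```

Any monotone mapping from height, width and distance proportion to their evaluation values could be substituted without changing the structure of the sketch.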
In a second aspect, an embodiment of the present invention provides an abnormal data detection apparatus.
Specifically, the abnormal data detection device includes:
an obtaining module configured to acquire data samples in a preset time period and divide the data samples into different classes of data sample sets;
the comparison module is configured to acquire data to be detected, compare the data to be detected with the data sample set and determine suspected abnormal data;
and the determining module is configured to perform segmentation distance comparison on the suspected abnormal data and determine the suspected abnormal data meeting the preset conditions as abnormal data.
With reference to the second aspect, in a first implementation manner of the second aspect, the obtaining module includes:
the first obtaining submodule is configured to obtain data samples within a preset time period;
the clustering submodule is configured to perform clustering processing on the data samples according to the similarity between the data samples to obtain data sample sets of different categories;
and the extraction submodule is configured to train to obtain data regression models corresponding to the different types of data sample sets, and extract the data base lines of the different types of data sample sets.
With reference to the second aspect and the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the comparing module includes:
the second acquisition submodule is configured to acquire data to be detected;
the first calculation submodule is configured to calculate the distance between the data to be detected and the data baseline of the data sample set of different classes;
a first determining sub-module configured to determine data having a minimum distance from the data baseline of the different classes of data sample sets exceeding a first preset distance threshold as suspected abnormal data.
With reference to the second aspect, the first implementation manner of the second aspect, and the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the determining module includes:
the segmentation sub-module is configured to perform segmentation processing on the suspected abnormal data to obtain two or more suspected abnormal data segments;
the second calculation submodule is configured to calculate the distance between the suspected abnormal data segments and generate a suspected abnormal data segment distance matrix;
and the second determining submodule is configured to determine the suspected abnormal data segments with the maximum distance from other suspected abnormal data segments exceeding a second preset distance threshold value as abnormal data.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, and the third implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the present disclosure further includes:
an evaluation module configured to perform an action evaluation on the anomaly data.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, and the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the evaluation module includes:
a third obtaining sub-module configured to obtain attribute information of the abnormal data, wherein the attribute information of the abnormal data includes one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and the evaluation sub-module is configured to perform action evaluation on the abnormal data according to the attribute information of the abnormal data.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, the fourth implementation manner of the second aspect, and the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the evaluation sub-module includes:
a third calculation submodule configured to calculate a height evaluation value according to the height of the abnormal data;
a fourth calculation submodule configured to calculate a width evaluation value from a width of the abnormal data;
a fifth calculation submodule configured to calculate a distance evaluation value according to a distance proportion of the abnormal data;
the sixth calculating submodule is configured to determine a height weight value, a width weight value and a distance weight value, and calculate a total weight value according to the height weight value, the width weight value and the distance weight value;
and the seventh calculating submodule is configured to acquire an uncertainty factor evaluation value and calculate an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertainty factor evaluation value.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, the fourth implementation manner of the second aspect, the fifth implementation manner of the second aspect, and the sixth implementation manner of the second aspect, in a seventh implementation manner of the second aspect, the present disclosure further includes:
and the processing module is configured to perform preset processing on the abnormal data according to the action evaluation value.
With reference to the second aspect, the first implementation manner of the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, the fourth implementation manner of the second aspect, the fifth implementation manner of the second aspect, the sixth implementation manner of the second aspect, and the seventh implementation manner of the second aspect, in an eighth implementation manner of the second aspect, the processing module is configured to:
and when the action evaluation value exceeds a preset evaluation threshold value, filtering the abnormal data.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory and a processor, where the memory is used to store one or more computer instructions that support an abnormal data detection apparatus to execute the abnormal data detection method in the first aspect, and the processor is configured to execute the computer instructions stored in the memory. The abnormal data detection apparatus may further include a communication interface for the abnormal data detection apparatus to communicate with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer instructions used by an abnormal data detection apparatus, where the computer instructions include the computer instructions involved when the abnormal data detection apparatus executes the abnormal data detection method in the first aspect.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the technical scheme detects and determines abnormal data by combining the classification of data samples with the comparison of distances between data segments. That is, data sample classification first reduces the complexity of abnormal data screening and distance comparison, and then, for the data that do not fit any data sample class, abnormal data are finally determined through distance comparison between data segments. The technical scheme is not limited by the preconditions that a single detection algorithm imposes on the data, so it can detect abnormal data comprehensively, improve the accuracy and effectiveness of abnormal data detection, and provide reliable data support for the subsequent prediction of future data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the invention.
Drawings
Other features, objects and advantages of embodiments of the invention will become more apparent from the following detailed description of non-limiting embodiments thereof, when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 illustrates a flow diagram of an abnormal data detection method according to an embodiment of the present invention;
FIG. 2 illustrates a flow diagram of an abnormal data detection method according to another embodiment of the present invention;
FIG. 3 shows an example graph of CPU load time series data;
FIG. 4 shows an example diagram of another time series data;
FIG. 5 shows a sliding window segmentation schematic;
FIG. 6 illustrates a flow diagram of an abnormal data detection method according to still another embodiment of the present invention;
FIG. 7 is a block diagram showing the structure of an abnormal data detecting apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram showing the structure of an abnormal data detecting apparatus according to another embodiment of the present invention;
FIG. 9 is a block diagram showing the configuration of an abnormal data detecting apparatus according to still another embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a computer system suitable for implementing an abnormal data detecting method according to an embodiment of the present invention.
Detailed Description
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the embodiments of the present invention, it is to be understood that terms such as "including" or "having", etc., are intended to indicate the presence of the features, numbers, steps, actions, components, parts, or combinations thereof disclosed in the present specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, actions, components, parts, or combinations thereof may be present or added.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The technical scheme provided by the embodiment of the invention detects and determines abnormal data by combining the classification of data samples with the comparison of distances between data segments. That is, data sample classification first reduces the complexity of abnormal data screening and distance comparison, and then, for the data that do not fit any data sample class, abnormal data are finally determined through distance comparison between data segments. The technical scheme is not limited by the preconditions that a single detection algorithm imposes on the data, so it can detect abnormal data comprehensively, improve the accuracy and effectiveness of abnormal data detection, and provide reliable data support for the subsequent prediction of future data.
FIG. 1 shows a flowchart of an abnormal data detection method according to an embodiment of the present invention. As shown in FIG. 1, the abnormal data detection method includes the following steps S101 to S103:
in step S101, obtaining data samples within a preset time period, and dividing the data samples into data sample sets of different categories;
in step S102, data to be detected is acquired, and the data to be detected is compared with the data sample set to determine suspected abnormal data;
in step S103, a segmentation distance comparison is performed on the suspected abnormal data, and the suspected abnormal data meeting a preset condition is determined as abnormal data.
As mentioned above, with the development of data technology, data is applied more and more widely. For example, in many scenarios such as machine learning and artificial intelligence, future data must be predicted by analyzing historical data. In such scenarios, the accuracy of the prediction depends to a great extent on the validity of the historical data. In practice, however, not all historical data are valid data or useful information: some data points or data segments may be abnormal data introduced by random, low-probability events. Such abnormal data reduce the validity of the historical data and introduce abnormal factors into subsequent data prediction, which lowers prediction accuracy, so they need to be detected effectively. However, most existing abnormal data detection methods only detect abnormal data and do not consider how much influence the abnormal data would have on the prediction result if they remained in the historical data. Moreover, current abnormal data detection methods are basically limited to a single detection mode and cannot detect abnormal data when the data do not satisfy the assumptions or requirements of the algorithm, which brings great inconvenience to the processing of abnormal data and the prediction of future data.
In view of the above problem, this embodiment proposes an abnormal data detection method that detects and determines abnormal data by combining the classification of data samples with the comparison of distances between data segments. That is, data sample classification first reduces the complexity of abnormal data screening and distance comparison, and then, for the data that do not fit any data sample class, abnormal data are finally determined through distance comparison between data segments. The technical scheme is not limited by the preconditions that a single detection algorithm imposes on the data, so it can detect abnormal data comprehensively, improve the accuracy and effectiveness of abnormal data detection, and provide reliable data support for the subsequent prediction of future data.
In an embodiment of the present invention, the data to be detected refers to data that may include both normal data and abnormal data and that is examined to determine whether abnormal data exist in it, for example, time series data containing both normal and abnormal data. The content of the data to be detected differs according to the application scenario.
In an embodiment of the present invention, the suspected abnormal data refers to data that the preliminary detection based on data sample classification identifies as possibly abnormal. These data are subsequently subjected to distance comparison between data segments to finally determine whether they are actually abnormal data. Combining data sample classification with data segment distance comparison means that the detection is not limited by the preconditions that a single detection algorithm imposes on the data. It avoids the situations in which some data types cannot be covered because the set of data feature types is incomplete, or cannot be effectively detected because the preconditions are not satisfied, and thus realizes comprehensive detection of abnormal data.
In an embodiment of the present invention, the preset time period may be set according to a requirement of an actual application and a characteristic of a data sample, and is not specifically limited in this disclosure, and the preset time period may be a historical time period or a current or future time period, as long as the data sample can be obtained in the time period.
In an embodiment of the present invention, the step S101, namely, the step of acquiring data samples within a preset time period and dividing the data samples into different types of data sample sets, includes the following steps:
acquiring a data sample in a preset time period;
clustering the data samples according to the similarity among the data samples to obtain data sample sets of different categories;
and training to obtain data regression models corresponding to the different types of data sample sets, and extracting data baselines of the different types of data sample sets.
In order to classify the data samples effectively and accurately, this embodiment uses a clustering method to divide the data samples into data sample sets of different categories. Specifically, the data samples within a preset time period are obtained first. The data samples are then clustered according to the similarity between them to obtain data sample sets of different categories; the similarity can be measured by the distance between data samples, although other similarity measures can also be used, which is not specifically limited by the present disclosure. Finally, data regression models corresponding to the different categories of data sample sets are trained, and the data baselines of those data sample sets are extracted. A data baseline represents the data distribution characteristics of a data sample set and can be used as a comparison reference for classifying real-time data.
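The clustering and baseline extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: Euclidean distance stands in for the similarity measure, a greedy threshold-based grouping stands in for the clustering method, and the pointwise mean of each cluster stands in for the baseline that the text obtains from a trained regression model. The names `cluster_samples`, `baseline` and `radius` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_samples(samples, radius):
    """Greedy grouping (an assumed stand-in for the clustering step):
    a sample joins the first cluster whose first member lies within
    `radius`; otherwise it starts a new cluster."""
    clusters = []
    for sample in samples:
        for members in clusters:
            if euclidean(sample, members[0]) <= radius:
                members.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def baseline(members):
    """Pointwise mean of a cluster (an assumed stand-in for the baseline
    extracted from a trained regression model)."""
    count = len(members)
    return [sum(column) / count for column in zip(*members)]


# Four hypothetical samples collected in the preset time period.
samples = [
    [1.0, 1.0, 1.0, 1.0],
    [1.2, 0.8, 1.0, 1.0],
    [5.0, 6.0, 5.0, 6.0],
    [5.0, 6.0, 5.0, 6.0],
]
clusters = cluster_samples(samples, radius=1.0)
baselines = [baseline(members) for members in clusters]
print(len(clusters), baselines)
```

Each entry of `baselines` then serves as the comparison reference for one category of data sample set.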
In an embodiment of the present invention, the step S102 of acquiring data to be detected, comparing the data to be detected with the data sample set, and determining suspected abnormal data includes the following steps:
acquiring data to be detected;
calculating the distance between the data to be detected and the data base lines of the different types of data sample sets;
and determining, as the suspected abnormal data, the data whose minimum distance to the data baselines of the different classes of data sample sets exceeds a first preset distance threshold.
In this embodiment, when the acquired data to be detected are classified, the distances between the data to be detected and the data baselines of the different categories of data sample sets are calculated first. If the minimum of these distances is less than or equal to a first preset distance threshold, the data can be considered to belong to the data category corresponding to the minimum distance. If the minimum distance is greater than the first preset distance threshold, the data do not belong to any data category. Such data may be abnormal data, but they may also be normal data that do not meet the data characteristics or type requirements of the data clustering method. They are therefore treated as suspected abnormal data and are further detected and identified by other means.
The first preset distance threshold may be determined according to the requirements of actual applications, different application scenarios, and data characteristics, which are not specifically limited by the present disclosure.
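The comparison against the data baselines can be sketched as follows. The Euclidean distance and the function name `classify_or_suspect` are assumptions for illustration; the embodiment does not fix the distance measure or the threshold value.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_or_suspect(data, baselines, first_threshold):
    """Return (category_index, min_distance).

    category_index is None when the minimum distance to every baseline
    exceeds the first preset distance threshold, i.e. the data is
    suspected abnormal data.
    """
    distances = [euclidean(data, baseline) for baseline in baselines]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    if distances[nearest] <= first_threshold:
        return nearest, distances[nearest]
    return None, distances[nearest]

baselines = [[1.0, 1.0, 1.0], [10.0, 10.0, 10.0]]
normal_result = classify_or_suspect([1.1, 0.9, 1.0], baselines, first_threshold=1.0)
suspect_result = classify_or_suspect([5.0, 5.0, 5.0], baselines, first_threshold=1.0)
```

In this sketch the first input is assigned to category 0, while the second is returned as suspected abnormal data and would be passed on to the segment distance comparison.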
In an embodiment of the present invention, the step S103 of comparing the segmentation distances of the suspected abnormal data and determining the suspected abnormal data meeting the preset condition as abnormal data includes the following steps:
carrying out sectional processing on the suspected abnormal data to obtain two or more suspected abnormal data sections;
calculating the distance between the suspected abnormal data segments to generate a distance matrix of the suspected abnormal data segments;
and determining, as abnormal data, a suspected abnormal data segment whose maximum distance from the other suspected abnormal data segments exceeds a second preset distance threshold.
In this embodiment, distance comparison between data segments is used to further detect and identify the suspected abnormal data. Specifically, the suspected abnormal data is first segmented to obtain two or more suspected abnormal data segments. The distances between the suspected abnormal data segments are then calculated to generate a distance matrix of the suspected abnormal data segments; for time series data, the DTW (Dynamic Time Warping) distance may be used as the distance measure. Next, the distance between each suspected abnormal data segment and the other suspected abnormal data segments is compared, and the maximum distance between each suspected abnormal data segment and the other suspected abnormal data segments is obtained by iterative calculation. If the maximum distance is less than or equal to a second preset distance threshold, the segment and the other suspected abnormal data segments belong to the same category and the segment is normal data; if the maximum distance is greater than the second preset distance threshold, the segment does not belong to the same category as the other suspected abnormal data segments and is determined to be abnormal data.
The second preset distance threshold may be determined according to the requirements of practical applications, different application scenarios, and data characteristics, and is not specifically limited by the present disclosure.
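The segment distance comparison can be sketched as below, under stated assumptions: segments of fixed length `segment_length`, a textbook DTW with absolute difference as the local cost, and a literal reading of the rule "maximum distance to the other segments exceeds the second threshold". The function names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as local cost."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]

def split_segments(data, segment_length):
    """Cut the suspected abnormal data into consecutive segments."""
    return [data[i:i + segment_length] for i in range(0, len(data), segment_length)]

def distance_matrix(segments):
    """Symmetric matrix of pairwise DTW distances between segments."""
    size = len(segments)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i][j] = matrix[j][i] = dtw_distance(segments[i], segments[j])
    return matrix

def abnormal_segment_indices(segments, second_threshold):
    """Indices of segments whose maximum distance to the others exceeds the threshold."""
    matrix = distance_matrix(segments)
    return [i for i, row in enumerate(matrix) if max(row) > second_threshold]

regular = split_segments([1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], 4)
spiked = split_segments([1, 2, 1, 2, 1, 9, 9, 2, 1, 2, 1, 2], 4)
regular_flags = abnormal_segment_indices(regular, second_threshold=5.0)
spiked_flags = abnormal_segment_indices(spiked, second_threshold=5.0)
```

Under this literal reading, a segment that is close to most of its peers is still flagged when a single peer is far away, so the spiked example flags every segment. The source does not say how such cases are resolved; a practical implementation would have to choose an aggregate or tie-breaking rule, which is outside what the text specifies.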
In an embodiment of the present invention, the method further includes a step of evaluating the effect of the abnormal data, that is, as shown in fig. 2, the abnormal data detection method includes the following steps S201 to S204:
in step S201, obtaining data samples within a preset time period, and dividing the data samples into different types of data sample sets;
in step S202, data to be detected is acquired, and the data to be detected is compared with the data sample set to determine suspected abnormal data;
in step S203, comparing the segmentation distances of the suspected abnormal data, and determining the suspected abnormal data meeting a preset condition as abnormal data;
in step S204, action evaluation is performed on the abnormal data.
As mentioned above, most conventional abnormal data detection methods only detect abnormal data and do not consider how strongly the abnormal data, if present in the historical data, will affect the prediction result. In the course of creating the invention, however, the inventors of the present application found that abnormal data with different degrees of influence on the prediction result should be processed in different ways. Therefore, in this embodiment, after the abnormal data is determined, action evaluation is also performed on the abnormal data to determine how the abnormal data is subsequently processed, where action evaluation refers to evaluating the influence that the abnormal data would have on a data prediction result if it exists in the historical data.
In an embodiment of the present invention, the step S204 of evaluating the effect of the abnormal data includes the following steps:
acquiring attribute information of the abnormal data, wherein the attribute information of the abnormal data comprises one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and performing action evaluation on the abnormal data according to the attribute information of the abnormal data.
In this embodiment, when evaluating the effect of the abnormal data, first obtaining attribute information of the abnormal data; and then, evaluating the action of the abnormal data according to the attribute information of the abnormal data.
In an embodiment of the present invention, the attribute information of the abnormal data may include one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected.
Taking time series data as an example, the width of the abnormal data refers to the span of an abnormal data segment in a time dimension; the height of the anomaly data refers to the maximum value of the data within the segment of anomaly data.
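The three attributes can be computed directly once the start and end positions of an abnormal data segment are known. The sketch below assumes index-based positions on an evenly sampled series; the function name and the dictionary keys are hypothetical.

```python
def abnormal_attributes(data, start, end):
    """Attributes of the abnormal segment data[start..end] (inclusive).

    width: span of the segment along the time dimension (in samples)
    height: maximum value inside the segment
    distance_ratio: offset of the segment start from the start of the
                    data to be detected, divided by the total length
    """
    segment = data[start:end + 1]
    return {
        "width": end - start + 1,
        "height": max(segment),
        "distance_ratio": start / len(data),
    }

series = [1, 1, 1, 1, 9, 8, 9, 1, 1, 1]
attributes = abnormal_attributes(series, start=4, end=6)
```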
In one embodiment of the invention, the width of the anomaly data is determined by an adaptive sliding window method.
Next, taking the CPU load time series data shown in fig. 3 as an example, the determination of the width of the abnormal data by the adaptive sliding window method is explained in detail.
Firstly, setting an attribute of an initial sliding window, performing sliding window segmentation on the data to be detected by using the set initial sliding window, and determining a first abnormal sliding window, wherein the attribute of the sliding window at least comprises information such as the width of the sliding window;
in particular, the width $W_{SlidingWindow}$ of the initial sliding window may be determined from empirical values, and sliding window segmentation is performed on the data to be detected with the width of the initial sliding window as the sliding step, $S = W_{SlidingWindow}$. The maximum value and the minimum value of the data in each sliding window are compared with a preset numerical threshold to judge whether abnormal data exists in that sliding window, and the first batch of sliding windows containing abnormal data, namely the first abnormal sliding windows, is screened out. The preset numerical threshold may be determined from the data baseline value or from the mean value of the data to be detected.
Then, judging whether the first abnormal sliding window is continuous or not and recording the continuous times N of the first abnormal sliding window;
as shown in the time series data of fig. 3, there are two abnormal data segments. Assuming an initial sliding window of width $W_{SlidingWindow}$, the two abnormal data segments can in theory be screened out by sliding the window along the data and comparing in turn. The actual situation is usually more complicated: it is difficult to capture a complete abnormal data segment with only one or a few sliding window operations, and it is also difficult to make the width of the sliding window equal to the width $W_{Outlier}$ of the abnormal data segment, as with the abnormal data segment shown on the right side of fig. 3. As another example, fig. 4 shows other time series data; if the width of the initial sliding window is likewise set to $W_{SlidingWindow} = 5$, the first abnormal data segment spans 2 sliding window widths. The sliding window may also be too wide, so that two abnormal data segments fall within one sliding window width, as with the abnormal data segments shown on the right side of fig. 4.
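The initial segmentation and the counting of consecutive abnormal sliding windows can be sketched as follows. The upper and lower thresholds, the sample series, and the function names are assumptions for illustration; the series is constructed to mimic the situation described for fig. 4, not taken from it.

```python
def flag_windows(data, window_width, upper, lower):
    """Split data into non-overlapping windows (step == width) and flag
    each window whose maximum exceeds `upper` or whose minimum is below `lower`."""
    flags = []
    for start in range(0, len(data), window_width):
        window = data[start:start + window_width]
        flags.append(max(window) > upper or min(window) < lower)
    return flags

def consecutive_runs(flags):
    """Return (first_window_index, N) for every run of consecutive abnormal windows."""
    runs = []
    index = 0
    while index < len(flags):
        if flags[index]:
            end = index
            while end < len(flags) and flags[end]:
                end += 1
            runs.append((index, end - index))
            index = end
        else:
            index += 1
    return runs

# One segment straddling two windows, then two short segments inside one window.
series = [1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 8, 1]
flags = flag_windows(series, window_width=5, upper=5, lower=0)
runs = consecutive_runs(flags)
```

With a window width of 5, the first abnormal segment produces a run with N = 2, while the last window hides two separate abnormal segments behind a single run with N = 1, which is why the window width needs to be adapted.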
Therefore, in order to capture the information of the abnormal data segment accurately, it is necessary to adaptively adjust the size of the initial sliding window, that is, to adaptively adjust the width of the initial sliding window, and to confirm the start point of the abnormal data segment.
In an embodiment of the present invention, the number N of consecutive occurrences of the first abnormal sliding window, the width $W_{SlidingWindow}$ of the initial sliding window, and the abnormal data segment width $W_{Outlier}$ satisfy the following relationships:
when $N = 1$: $W_{Outlier} < W_{SlidingWindow}$
when $N = 2$: $2 \le W_{Outlier} < 2 \cdot W_{SlidingWindow}$
when $N > 2$: $(N-2) \cdot W_{SlidingWindow} \le W_{Outlier} < N \cdot W_{SlidingWindow}$
From the above, when the starting point of the abnormal data segment cannot be determined, determining the width $W_{Outlier}$ of the abnormal data segment by adaptively changing the sliding window size $W_{SlidingWindow}$ is complicated and inaccurate, so the position of the starting point of the abnormal data segment must be confirmed first.
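The three relationships above can be written as a small consistency check. This is only a transcription of the stated inequalities into code; the function name is hypothetical and the check assumes widths measured in samples.

```python
def width_consistent(n, window_width, outlier_width):
    """Check whether an abnormal segment width is consistent with N
    consecutive abnormal windows of the given width, per the relations above."""
    if n == 1:
        return outlier_width < window_width
    if n == 2:
        return 2 <= outlier_width < 2 * window_width
    if n > 2:
        return (n - 2) * window_width <= outlier_width < n * window_width
    raise ValueError("N must be a positive integer")
```

For example, with a window width of 5 and N = 3, any width from 5 up to but not including 15 is consistent, which shows how loose the bound is when the starting point of the segment is unknown.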
When the starting point of the abnormal data segment is confirmed, for each abnormal sliding window with abnormal data, the step of reducing the width of the sliding window to segment the sliding window and deleting the non-abnormal sliding window is repeated until the width of the sliding window meets the precision requirement, and the starting point of the first abnormal sliding window is determined as the starting point of the abnormal data segment;
specifically: for each abnormal sliding window containing abnormal data, the width of the sliding window is first reduced, for example to $W_{SlidingWindow} = W_{SlidingWindow}/2$; sliding window segmentation is then performed again with the reduced-width sliding window, starting from the initial position of the abnormal sliding window. Assume that the required width measurement precision is
Figure BDA0002403611170000141
Then the width of the sliding window has not yet reached the accuracy
Figure BDA0002403611170000142
When the size of the sliding window is small enough, non-abnormal sliding windows, that is, sliding windows that do not contain the abnormal data segment, appear when the sliding window segmentation is performed again. For example, fig. 5 shows the first group of abnormal sliding windows in fig. 4 after being segmented again: a non-abnormal sliding window appears and is filtered out. The width of the sliding window is then reduced further and the remaining abnormal sliding windows are segmented again, so that the starting point of the abnormal data segment is approached step by step until the width of the sliding window meets the precision requirement
Figure BDA0002403611170000143
At this time, the starting point of the first abnormal sliding window is the starting point of the abnormal data segment.
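The start-point search can be sketched as repeated halving of the window inside the first abnormal sliding window. The thresholds, the integer halving, and the `precision` parameter (in samples) are assumptions; the precision requirement of the source is given only as a formula image and is not reproduced here.

```python
def is_abnormal(window, upper, lower):
    """A window is abnormal if its maximum or minimum crosses a threshold."""
    return max(window) > upper or min(window) < lower

def refine_start(data, window_start, window_width, upper, lower, precision=1):
    """Approach the start of an abnormal segment inside an abnormal window.

    The window is re-segmented with a halved width; leading non-abnormal
    sub-windows are discarded, and the search continues inside the first
    abnormal sub-window until the width reaches `precision`.
    """
    start = window_start
    end = min(window_start + window_width, len(data))
    width = window_width
    while width > precision:
        width = max(precision, width // 2)
        for sub_start in range(start, end, width):
            sub_end = min(sub_start + width, end)
            if is_abnormal(data[sub_start:sub_end], upper, lower):
                start, end = sub_start, sub_end
                break
    return start

series = [1, 1, 1, 1, 9, 8, 9, 1, 1, 1]
segment_start = refine_start(series, window_start=0, window_width=5, upper=5, lower=0)
```

Here the first abnormal window covers indices 0 to 4, and the refinement narrows the starting point of the abnormal segment down to index 4.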
And then determining the width of the abnormal data segment, carrying out sliding window segmentation from the starting point of the abnormal data segment, carrying out self-adaptive adjustment on the width of the sliding window until the continuous times N of the abnormal sliding window change, determining the width of the abnormal data segment, and further obtaining the end point of the abnormal data segment.
Specifically: the sliding window segmentation starts from the found starting point of the abnormal data segment. In this case, whether the abnormal sliding windows are consecutive depends on the sliding window width $W_{SlidingWindow}$ and the abnormal data width $W_{Outlier}$ in only two ways:
when $N = 1$: $W_{Outlier} < W_{SlidingWindow}$
when $N > 1$: $W_{Outlier} > W_{SlidingWindow}$
Different adaptive processing can be applied to the sliding window in the two cases. When the sliding window is too wide, its width can be reduced by different amounts; for example, the width can be halved directly, $W_{SlidingWindow} = W_{SlidingWindow}/2$, or a decrement operation can be performed on the width. When the sliding window is too narrow, its width can be increased by different amounts; for example, the width can be doubled directly, $W_{SlidingWindow} = W_{SlidingWindow} \cdot 2$, or an increment operation can be performed on the width. During the adaptive sliding window processing, if N changes from 1 to 2, the width of the abnormal data segment is the width of the previous sliding window, $W_{Outlier} = \mathrm{last}(W_{SlidingWindow})$; if N changes from $N > 1$ to $N = 1$, the width of the abnormal data segment is the width of the current sliding window, $W_{Outlier} = W_{SlidingWindow}$. The width of the abnormal data segment is thereby determined, and the end point of the abnormal data segment is obtained from it.
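The width determination can be sketched with the decrement and increment variants mentioned above, using a step of one sample (an assumption; halving and doubling would converge faster but less precisely). Function names and thresholds are hypothetical, and the sketch assumes the abnormal segment contains no normal samples.

```python
def is_abnormal(window, upper, lower):
    """A window is abnormal if its maximum or minimum crosses a threshold."""
    return max(window) > upper or min(window) < lower

def consecutive_count(data, start, width, upper, lower):
    """Number N of consecutive abnormal windows when segmenting from `start`."""
    count, position = 0, start
    while position < len(data) and is_abnormal(data[position:position + width], upper, lower):
        count += 1
        position += width
    return count

def outlier_width(data, start, width, upper, lower):
    """Adapt the window width from the known segment start until N changes."""
    n = consecutive_count(data, start, width, upper, lower)
    if n == 0:
        raise ValueError("start must be the first sample of an abnormal segment")
    if n == 1:
        # Window too wide: shrink until N becomes 2, then keep the previous width.
        while width > 1:
            if consecutive_count(data, start, width - 1, upper, lower) > 1:
                return width
            width -= 1
        return 1
    # Window too narrow: grow until N becomes 1, then keep the current width.
    while consecutive_count(data, start, width, upper, lower) > 1:
        width += 1
    return width

series = [1, 1, 1, 1, 9, 8, 9, 1, 1, 1]
width_from_wide = outlier_width(series, start=4, width=5, upper=5, lower=0)
width_from_narrow = outlier_width(series, start=4, width=1, upper=5, lower=0)
segment_end = 4 + width_from_wide - 1
```

Both starting widths converge on the same segment width, and the end point follows from the start point and the width.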
By means of the self-adaptive sliding window width adjusting method, the detection speed of the abnormal data segment can be improved, and the attribute information of the abnormal data segment can be effectively and accurately determined on the premise of greatly reducing the calculation cost.
In an embodiment of the present invention, the step of evaluating the effect of the abnormal data according to the attribute information of the abnormal data includes the steps of:
calculating according to the height of the abnormal data to obtain a height evaluation value;
calculating according to the width of the abnormal data to obtain a width evaluation value;
calculating according to the distance proportion of the abnormal data to obtain a distance evaluation value;
determining a height weight value, a width weight value and a distance weight value, and calculating to obtain a total weight value according to the height weight value, the width weight value and the distance weight value;
and acquiring an uncertainty factor evaluation value, and calculating to obtain an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertainty factor evaluation value.
In one embodiment of the present invention, a height evaluation value is calculated from the height of the abnormal data based on the following formula:
$I_{height} = (c_i \cdot h + c_j)^m$
wherein $I_{height}$ is the height evaluation value, $h$ is the height of the abnormal data, $c_i$ and $c_j$ are constants, and $m$ is a real number.
In an embodiment of the present invention, a width evaluation value is calculated from the width of the abnormal data based on the following formula:
$I_{width} = (c_p \cdot w + c_q)^n$
wherein $I_{width}$ is the width evaluation value, $w$ is the width of the abnormal data, $c_p$ and $c_q$ are constants, and $n$ is a real number.
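The height and width evaluation values can be computed as in the sketch below. It reads the trailing $m$ and $n$ of the two formulas as exponents, which is an assumption about the original typesetting, and all default constant values are hypothetical placeholders rather than values given by the source.

```python
def height_evaluation(h, c_i=1.0, c_j=0.0, m=1.0):
    """I_height = (c_i * h + c_j) ** m, with placeholder constants."""
    return (c_i * h + c_j) ** m

def width_evaluation(w, c_p=1.0, c_q=0.0, n=1.0):
    """I_width = (c_p * w + c_q) ** n, with placeholder constants."""
    return (c_p * w + c_q) ** n

i_height = height_evaluation(3.0, c_i=2.0, c_j=1.0, m=2.0)
i_width = width_evaluation(4.0, n=0.5)
```

Note that a negative base combined with a non-integer exponent yields a complex number in Python, so the constants would need to keep the base non-negative in practice.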
In an embodiment of the present invention, a distance evaluation value is calculated from a distance ratio of the abnormal data based on the following formula:
Figure BDA0002403611170000163
wherein $I_{dist}$ is the distance evaluation value, $d$ is the distance ratio of the abnormal data, and $c_u$, $c_k$, $c_h$ are constants.
In one embodiment of the present invention, a total weight value is calculated according to the height weight value, the width weight value and the distance weight value based on the following formula:
Figure BDA0002403611170000161
wherein $\alpha$ is the total weight value, $w_{height}$ is the height weight value, $w_{width}$ is the width weight value, $w_{dist}$ is the distance weight value, and $c_t$ is a constant.
In an embodiment of the present invention, the action evaluation value of the abnormal data is calculated from the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value, and the uncertainty factor evaluation value based on the following formula:
Figure BDA0002403611170000162
wherein $\alpha$ is the total weight value, $I_{height}$ is the height evaluation value, $I_{width}$ is the width evaluation value, $I_{dist}$ is the distance evaluation value, and $S_{\{h,w,v,\ldots\}}$ and $\gamma$ together form the uncertainty factor evaluation value; $S_{\{h,w,v,\ldots\}}$ is the shape evaluation value, where $\{h, w, v, \ldots\}$ characterizes the influence of the height $h$, width $w$, and variance $v$ of the abnormal data on the shape, and $\gamma$ is the evaluation value of other uncertainty factors.
The constants $c_i$, $c_j$, $c_p$, $c_q$, $c_u$, $c_k$, $c_h$, $c_t$, the real numbers $m$ and $n$, the height weight value $w_{height}$, the width weight value $w_{width}$, the distance weight value $w_{dist}$, the shape evaluation value $S_{\{h,w,v,\ldots\}}$, and the other uncertainty factor evaluation value $\gamma$ can be determined according to the requirements of the practical application and the characteristics of the data, or according to empirical values; the present disclosure does not specifically limit the values of these parameters.
After the action evaluation value of the abnormal data is obtained through calculation according to the embodiment, the influence of the abnormal data on the prediction result can be quantized, and different processing means can be intuitively adopted for different abnormal data sections according to the influence degree.
In an embodiment of the present invention, the method further includes a step of performing preset processing on the abnormal data according to the action evaluation value, that is, as shown in fig. 6, the abnormal data detection method includes the following steps S601 to S605:
in step S601, obtaining data samples within a preset time period, and dividing the data samples into data sample sets of different categories;
in step S602, data to be detected is acquired, and the data to be detected is compared with the data sample set to determine suspected abnormal data;
in step S603, comparing the segmentation distances of the suspected abnormal data, and determining the suspected abnormal data meeting a preset condition as abnormal data;
in step S604, evaluating the effect of the abnormal data;
in step S605, preset processing is performed on the abnormal data according to the action evaluation value.
Wherein the preset processing may include one or more of the following processes: filtering processing, deleting processing, enlarging processing, reducing processing, and the like. Those skilled in the art can select an appropriate processing method according to the requirements of practical application and the characteristics of data, and the disclosure does not limit the processing method specifically.
In an embodiment of the present invention, the step S605, namely, the step of performing the preset processing on the abnormal data according to the action evaluation value, includes the following steps:
when the action evaluation value exceeds a preset evaluation threshold value, the abnormal data is considered to have a large adverse effect on a prediction result, and at the moment, the abnormal data is required to be filtered;
when the action evaluation value does not exceed the preset evaluation threshold value, that is, is less than or equal to the preset evaluation threshold value, it may be determined that the abnormal data does not have a large adverse effect on the prediction result, and at this time, a retention measure may be taken for the abnormal data without performing filtering processing.
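The threshold decision can be sketched as follows. Interpreting "filtering" as removing the abnormal segment from the series is an assumption, as are the function name and the index-based segment bounds.

```python
def process_abnormal_segment(data, start, end, action_value, evaluation_threshold):
    """Filter out data[start..end] when its action evaluation value exceeds
    the preset evaluation threshold; otherwise retain the data unchanged."""
    if action_value > evaluation_threshold:
        return data[:start] + data[end + 1:]
    return list(data)

series = [1, 1, 1, 1, 9, 8, 9, 1, 1, 1]
filtered = process_abnormal_segment(series, 4, 6, action_value=0.9, evaluation_threshold=0.5)
retained = process_abnormal_segment(series, 4, 6, action_value=0.5, evaluation_threshold=0.5)
```

An action evaluation value equal to the threshold falls in the retained branch, matching the "less than or equal to" condition above.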
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention.
Fig. 7 is a block diagram showing the structure of an abnormal data detecting apparatus according to an embodiment of the present invention, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 7, the abnormal data detecting apparatus includes:
an obtaining module 701, configured to obtain data samples within a preset time period, and divide the data samples into different classes of data sample sets;
a comparison module 702 configured to obtain data to be detected, compare the data to be detected with the data sample set, and determine suspected abnormal data;
a determining module 703 configured to perform a segmentation distance comparison on the suspected abnormal data, and determine the suspected abnormal data meeting a preset condition as abnormal data.
As mentioned above, as data technology has developed, the use of data has become more widespread. For example, in many scenarios such as machine learning and artificial intelligence, it is necessary to predict future data by analyzing historical data. Obviously, in this scenario, the accuracy of future data prediction depends on the validity of the historical data to a great extent, but actually, not all the historical data are valid data or useful information, some data points or data segments may be abnormal data introduced due to the occurrence of random small-probability events, and these abnormal data tend to affect the validity of the historical data to a certain extent, and further introduce abnormal factors for subsequent data prediction, which affects the accuracy of data prediction, so that these abnormal data need to be effectively detected. However, most of the existing abnormal data detection methods only achieve simple detection of abnormal data, and do not consider how much influence the abnormal data will have on the prediction result if the abnormal data exists in the historical data. Moreover, the current abnormal data detection method is basically limited to a single-mode abnormal data detection, and cannot detect the abnormal data when the abnormal data does not meet the assumed value or the algorithm requirement, which brings great inconvenience to the processing of the abnormal data and the prediction of future data.
In view of the above problem, in this embodiment, an abnormal data detection apparatus is proposed, which detects and determines abnormal data by combining classification of data samples and comparison of distances between data segments, that is, complexity of abnormal data screening and distance comparison is simplified by data sample classification, and then abnormal data is finally determined by data segment distance comparison for data that does not meet the data sample classification requirement. The technical scheme can break through the limitation of the detection method on the premise of algorithm requirements, realize the comprehensive detection of abnormal data, improve the accuracy of the effectiveness of the abnormal data detection, and further provide reliable data support for the prediction of subsequent future data.
In an embodiment of the present invention, the data to be detected refers to data that may include both normal data and abnormal data, and is detected to determine whether abnormal data exists therein, for example, time series data that may include both normal data and abnormal data. And according to different application scenes, the data content of the data to be detected is correspondingly different.
In an embodiment of the present invention, the suspected abnormal data refers to data that is possibly abnormal data obtained through preliminary detection of data sample classification, and these data are subsequently subjected to comparison of distances between data segments to finally determine whether the data are actually abnormal data. The combination judgment and detection of the data sample classification and the data segment distance comparison can break through the limitation of the detection method on the premise of algorithm requirements, avoid the situation that some data types cannot be covered due to incomplete data characteristic types or cannot be effectively detected due to unsatisfied preconditions, and further realize the comprehensive detection of abnormal data.
In an embodiment of the present invention, the preset time period may be set according to a requirement of an actual application and a characteristic of a data sample, and is not specifically limited in this disclosure, and the preset time period may be a historical time period or a current or future time period, as long as the data sample can be obtained in the time period.
In an embodiment of the present invention, the obtaining module 701 includes:
the first acquisition submodule is configured to acquire a data sample in a preset time period;
the clustering submodule is configured to perform clustering processing on the data samples according to the similarity between the data samples to obtain data sample sets of different categories;
and the extraction submodule is configured to train to obtain data regression models corresponding to the different types of data sample sets, and extract the data base lines of the different types of data sample sets.
To classify the data samples effectively and accurately, in this embodiment the clustering submodule uses a clustering method to divide the data samples acquired by the first acquisition submodule into data sample sets of different categories. Specifically, the first acquisition submodule acquires the data samples within a preset time period. The clustering submodule then clusters the data samples according to the similarity between them to obtain data sample sets of different categories; the similarity may be measured by the distance between data samples or by another similarity measure, which the present disclosure does not specifically limit. The extraction submodule obtains, by training, data regression models corresponding to the data sample sets of the different categories and extracts the data baselines of those data sample sets. A data baseline characterizes the data distribution of a data sample set and can serve as a comparison reference value for classifying real-time data.
In an embodiment of the present invention, the comparing module 702 includes:
the second acquisition submodule is configured to acquire data to be detected;
a first calculation submodule configured to calculate a distance between the data to be detected and the data base lines of the different classes of data sample sets;
a first determining sub-module configured to determine, as suspected abnormal data, the data whose minimum distance to the data baselines of the different classes of data sample sets exceeds a first preset distance threshold.
In this embodiment, when the acquired data to be detected is classified, the first calculation submodule first calculates the distances between the data to be detected and the data baselines of the data sample sets of the different categories. If the minimum of these distances is less than or equal to a first preset distance threshold, the data may be considered to belong to the data category corresponding to the minimum distance. If the minimum distance is greater than the first preset distance threshold, the data is considered not to belong to any data category: it may be abnormal data, but it may also be normal data that does not satisfy the data characteristics or type requirements of the data clustering method. Such data is suspected abnormal data and needs to be further detected and identified by other means.
The first preset distance threshold may be determined according to the requirements of actual applications, different application scenarios, and data characteristics, which are not specifically limited by the present disclosure.
In an embodiment of the present invention, the determining module 703 includes:
the segmentation submodule is configured to perform segmentation processing on the suspected abnormal data to obtain two or more suspected abnormal data segments;
the second calculation submodule is configured to calculate the distance between the suspected abnormal data segments and generate a suspected abnormal data segment distance matrix;
and the second determining submodule is configured to determine, as abnormal data, a suspected abnormal data segment whose maximum distance from the other suspected abnormal data segments exceeds a second preset distance threshold.
In this embodiment, distance comparison between data segments is used to further detect and identify the suspected abnormal data. Specifically, the segmentation submodule segments the suspected abnormal data to obtain two or more suspected abnormal data segments. The second calculation submodule calculates the distances between the suspected abnormal data segments and generates a distance matrix of the suspected abnormal data segments; for time series data, the DTW (Dynamic Time Warping) distance may be used as the distance measure. The second determining submodule compares the distance between each suspected abnormal data segment and the other suspected abnormal data segments and obtains, by iterative calculation, the maximum distance between each suspected abnormal data segment and the other suspected abnormal data segments. If the maximum distance is less than or equal to a second preset distance threshold, the segment and the other suspected abnormal data segments belong to the same category and the segment is normal data; if the maximum distance is greater than the second preset distance threshold, the segment does not belong to the same category as the other suspected abnormal data segments and is determined to be abnormal data.
The second preset distance threshold may be determined according to requirements of actual applications, different application scenarios, and data characteristics, which are not specifically limited by the present disclosure.
In an embodiment of the present invention, the apparatus further includes a part for evaluating the effect of the abnormal data, that is, as shown in fig. 8, the abnormal data detecting apparatus includes:
an obtaining module 801 configured to obtain data samples within a preset time period and divide the data samples into different categories of data sample sets;
a comparison module 802 configured to obtain data to be detected, compare the data to be detected with the data sample set, and determine suspected abnormal data;
a determining module 803, configured to perform a segmentation distance comparison on the suspected abnormal data, and determine the suspected abnormal data meeting a preset condition as abnormal data;
an evaluation module 804 configured to perform an action evaluation on the anomaly data.
As mentioned above, most of the conventional abnormal data detection devices only achieve simple detection of abnormal data, and do not consider how much the abnormal data will affect the prediction result if the abnormal data exists in the historical data. Therefore, in this implementation, after determining the abnormal data, action evaluation is also performed on the abnormal data to determine a processing manner of subsequent abnormal data, where the action evaluation refers to evaluation on an influence that would be exerted on a data prediction result if the abnormal data exists in the historical data.
In an embodiment of the present invention, the evaluation module 804 includes:
a third obtaining sub-module configured to obtain attribute information of the abnormal data, wherein the attribute information of the abnormal data includes one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and the evaluation sub-module is configured to evaluate the effect of the abnormal data according to the attribute information of the abnormal data.
In this embodiment, when performing action evaluation on the abnormal data, the third obtaining sub-module first obtains attribute information of the abnormal data; and the evaluation sub-module evaluates the action of the abnormal data according to the attribute information of the abnormal data.
In an embodiment of the present invention, the attribute information of the abnormal data may include one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected.
Taking time series data as an example, the width of the abnormal data refers to the span of an abnormal data segment in a time dimension; the height of the anomaly data refers to the maximum value of the data within the segment of anomaly data.
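As a minimal sketch of these three attributes, assume the data to be detected is a Python list sampled at unit spacing and the abnormal segment is given as a half-open index range. The names `AnomalyAttributes` and `anomaly_attributes` are illustrative and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class AnomalyAttributes:
    width: int             # span of the abnormal segment along the time axis
    height: float          # maximum value inside the abnormal segment
    distance_ratio: float  # offset of the segment start / total data length


def anomaly_attributes(series, start, end):
    """Attributes of the abnormal segment series[start:end] (end exclusive)."""
    if not 0 <= start < end <= len(series):
        raise ValueError("segment must be a non-empty range inside the series")
    segment = series[start:end]
    return AnomalyAttributes(
        width=end - start,
        height=max(segment),
        distance_ratio=start / len(series),
    )


attrs = anomaly_attributes([1, 1, 9, 8, 1, 1, 1, 1, 1, 1], start=2, end=4)
```

Here the segment covers two samples, its peak is 9, and it begins one fifth of the way into the data.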
In an embodiment of the present invention, the third obtaining sub-module determines the width of the abnormal data by an adaptive sliding window method.
Next, the determination of the width of the abnormal data by the adaptive sliding window method is explained in detail, taking the CPU load time series data shown in fig. 3 as an example.
Firstly, setting an attribute of an initial sliding window, performing sliding window segmentation on the data to be detected by using the set initial sliding window, and determining a first abnormal sliding window, wherein the attribute of the sliding window at least comprises information such as the width of the sliding window;
in particular, the width $W_{SlidingWindow}$ of the initial sliding window may be determined from empirical values, and the sliding step is set equal to that width, $S = W_{SlidingWindow}$. The data to be detected is then segmented window by window, the maximum value and the minimum value of the data in each sliding window are compared with a preset numerical threshold to judge whether the window contains abnormal data, and the first batch of sliding windows containing abnormal data, namely the first abnormal sliding windows, is screened out. The preset numerical threshold may be determined from the data baseline value or from the mean value of the data to be detected.
Then, judging whether the first abnormal sliding window is continuous or not and recording the continuous times N of the first abnormal sliding window;
as shown in the time series data of FIG. 3, there are two abnormal data segments. Assuming an initial sliding window of width $W_{SlidingWindow}$, the two abnormal data segments can in theory be screened out by sliding the window along the data and comparing each window in turn. In practice the situation is usually more complicated: it is difficult to capture a complete abnormal data segment with just one or a few sliding window operations, and it is also difficult to make the width of the sliding window exactly equal to the width $W_{Outlier}$ of the abnormal data segment, as with the abnormal data segment shown on the right side of fig. 3. As another example, fig. 4 shows another time series; if the width of the initial sliding window is likewise set to $W_{SlidingWindow} = 5$, the first abnormal data segment spans 2 sliding window widths. Conversely, the sliding window may be too wide, so that two abnormal data segments fall within one sliding window, as with the abnormal data segments shown on the right side of fig. 4.
Therefore, in order to capture the information of the abnormal data segment accurately, it is necessary to adaptively adjust the size of the initial sliding window, that is, to adaptively adjust the width of the initial sliding window, and to confirm the start point of the abnormal data segment.
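The initial segmentation and the count N of consecutive abnormal windows can be sketched as follows. This is only an illustration under stated assumptions: the single "preset numerical threshold" is modelled as a pair of hypothetical bounds `upper` and `lower`, and the function names are not taken from the disclosure.

```python
def abnormal_windows(series, width, upper, lower):
    """Return the start index of every window that breaches a threshold."""
    starts = []
    # The sliding step equals the window width, so windows do not overlap.
    for start in range(0, len(series), width):
        window = series[start:start + width]
        if max(window) > upper or min(window) < lower:
            starts.append(start)
    return starts


def consecutive_runs(starts, width):
    """Group adjacent abnormal windows into (first_start, N) pairs."""
    runs = []
    for start in starts:
        if runs and start == runs[-1][0] + runs[-1][1] * width:
            first, count = runs[-1]
            runs[-1] = (first, count + 1)   # extends the current run
        else:
            runs.append((start, 1))         # begins a new run
    return runs


# Toy series: one abnormal segment at indices 3..6 and a single spike at 16.
series = [1] * 20
for i in (3, 4, 5, 6, 16):
    series[i] = 9
starts = abnormal_windows(series, width=5, upper=5, lower=0)
runs = consecutive_runs(starts, width=5)
```

With a window width of 5, the first segment straddles two windows (N = 2) and the spike falls in a single window (N = 1), which matches the situation described for fig. 4.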
In an embodiment of the present invention, the number N of consecutive occurrences of the first abnormal sliding window, the width $W_{SlidingWindow}$ of the initial sliding window, and the width $W_{Outlier}$ of the abnormal data segment satisfy the following relationships:
when $N = 1$: $W_{Outlier} < W_{SlidingWindow}$
when $N = 2$: $2 \le W_{Outlier} < 2 \cdot W_{SlidingWindow}$
when $N > 2$: $(N-2) \cdot W_{SlidingWindow} \le W_{Outlier} < N \cdot W_{SlidingWindow}$
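To make these bounds concrete, the following sketch returns the range of segment widths that is compatible with a given N and window width. The function name is hypothetical, and the lower bound of 1 for N = 1 is an added assumption (a segment contains at least one sample); the text only states the upper bound for that case.

```python
def outlier_width_bounds(n, window_width):
    """Return (lower_inclusive, upper_exclusive) bounds on the segment width."""
    if n < 1 or window_width < 1:
        raise ValueError("n and window_width must be positive")
    if n == 1:
        return (1, window_width)            # lower bound of 1 is an assumption
    if n == 2:
        return (2, 2 * window_width)
    return ((n - 2) * window_width, n * window_width)


bounds = outlier_width_bounds(4, 5)  # four consecutive abnormal windows of width 5
```

The interval widens as N grows, which is why the window count alone gives only a coarse estimate of the segment width.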
From the above, when the start point of the abnormal data segment cannot be determined, adaptively changing the sliding window width $W_{SlidingWindow}$ to determine the width $W_{Outlier}$ of the abnormal data segment is both complicated and inaccurate, so the position of the start point of the abnormal data segment must be confirmed first.
To confirm the starting point of the abnormal data segment, for each abnormal sliding window, the steps of reducing the sliding window width, re-segmenting the window, and deleting the non-abnormal sliding windows are repeated until the width of the sliding window meets the precision requirement, and the starting point of the first abnormal sliding window is then determined as the starting point of the abnormal data segment;
specifically: for each abnormal sliding window, the width of the sliding window is first reduced, for example to $W_{SlidingWindow} = W_{SlidingWindow}/2$, and sliding-window segmentation is then performed again with the reduced width, starting from the initial position of that abnormal sliding window. Assume that the required width measurement precision is as given in [formula image BDA0002403611170000241]; the reduced sliding window width then has not yet reached that precision [formula image BDA0002403611170000242]. When the sliding window is small enough, non-abnormal sliding windows, that is, sliding windows that do not contain the abnormal data segment, appear during re-segmentation. For example, fig. 5 shows the first group of abnormal sliding windows in fig. 4 segmented again; after this re-segmentation a non-abnormal sliding window appears and is filtered out. The width of the sliding window is then reduced further and the remaining abnormal sliding windows are segmented again, so that the starting point of the abnormal data segment is approached step by step until the width of the sliding window meets the precision requirement [formula image BDA0002403611170000243]. At this point, the starting point of the first abnormal sliding window is the starting point of the abnormal data segment.
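The start-point search described above can be sketched as a repeated halving of the first abnormal window. This is a hedged illustration only: the precision requirement is not recoverable from the text, so a hypothetical `precision` parameter (in samples, default 1) stands in for it, and `upper` and `lower` are assumed threshold bounds.

```python
def is_abnormal(window, upper, lower):
    """A window is abnormal if its max or min breaches a threshold."""
    return bool(window) and (max(window) > upper or min(window) < lower)


def refine_start(series, start, width, upper, lower, precision=1):
    """Narrow the abnormal window [start, start + width) to the segment start."""
    if precision < 1:
        raise ValueError("precision must be at least one sample")
    while width > precision:
        half = max(width // 2, precision)
        # Re-segment the current abnormal window with the reduced width and
        # keep the first sub-window that still contains abnormal data.
        for sub_start in range(start, start + width, half):
            sub_end = min(sub_start + half, start + width)
            if is_abnormal(series[sub_start:sub_end], upper, lower):
                start, width = sub_start, sub_end - sub_start
                break
        else:
            raise ValueError("window contains no abnormal data")
    return start


series = [1] * 20
for i in (3, 4, 5, 6):
    series[i] = 9
start_point = refine_start(series, start=0, width=5, upper=5, lower=0)
```

Non-abnormal sub-windows are skipped rather than stored, which plays the role of the filtering step in the text.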
The width of the abnormal data segment is then determined: sliding-window segmentation is performed from the starting point of the abnormal data segment, and the width of the sliding window is adaptively adjusted until the number N of consecutive abnormal sliding windows changes, which fixes the width of the abnormal data segment and hence its end point.
Specifically, starting from the starting point found for the abnormal data segment, the relationship between the number N of consecutive abnormal sliding windows, the sliding window width $W_{SlidingWindow}$, and the abnormal data width $W_{Outlier}$ reduces to only two cases:
when $N = 1$: $W_{Outlier} < W_{SlidingWindow}$
when $N > 1$: $W_{Outlier} > W_{SlidingWindow}$
Different adaptive sliding window processing can be applied in the two cases. When the sliding window is too wide, its width can be reduced by different amounts: for example, it can be halved directly, $W_{SlidingWindow} = W_{SlidingWindow}/2$, or a decrement operation can be applied to it. When the sliding window is too narrow, its width can be increased by different amounts: for example, it can be doubled directly, $W_{SlidingWindow} = W_{SlidingWindow} \cdot 2$, or an increment operation can be applied to it. During the adaptive sliding window processing, if N changes from 1 to 2, the width of the abnormal data segment is the width of the previous sliding window, $W_{Outlier} = \mathrm{last}(W_{SlidingWindow})$; if N changes from $N > 1$ to $N = 1$, the width of the abnormal data segment is the width of the current sliding window, $W_{Outlier} = W_{SlidingWindow}$. The width of the abnormal data segment is thereby determined, and its end point follows.
By means of the self-adaptive sliding window width adjusting method, the detection speed of the abnormal data segment can be improved, and the attribute information of the abnormal data segment can be effectively and accurately determined on the premise of greatly reducing the calculation cost.
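A minimal sketch of the width determination follows, using the unit decrement and increment variant of the adjustment rather than halving and doubling. It assumes the start point is already known, that the abnormal segment is a single contiguous run not immediately followed by another one, and that `upper` and `lower` are hypothetical threshold bounds; none of the function names come from the disclosure.

```python
def is_abnormal(window, upper, lower):
    return bool(window) and (max(window) > upper or min(window) < lower)


def count_consecutive(series, start, width, upper, lower):
    """N: number of back-to-back abnormal windows beginning at `start`."""
    n, pos = 0, start
    while pos < len(series) and is_abnormal(series[pos:pos + width], upper, lower):
        n += 1
        pos += width
    return n


def outlier_width(series, start, width, upper, lower):
    """Adjust the window width until N changes, then report the segment width."""
    n = count_consecutive(series, start, width, upper, lower)
    if n == 0:
        raise ValueError("no abnormal data at the given start point")
    if n == 1:
        # Window is at least as wide as the segment: shrink until N becomes 2.
        while width > 1:
            if count_consecutive(series, start, width - 1, upper, lower) > 1:
                return width  # the last width that still gave N == 1
            width -= 1
        return 1
    # Window is narrower than the segment: grow until N drops back to 1.
    while count_consecutive(series, start, width, upper, lower) > 1:
        width += 1
    return width  # the current width, at which N == 1


series = [1] * 20
for i in (3, 4, 5, 6):
    series[i] = 9
w = outlier_width(series, start=3, width=5, upper=5, lower=0)
end_point = 3 + w  # exclusive end index of the abnormal segment
```

Starting from a window that is too wide (5) or too narrow (2) leads to the same width in this toy series; halving and doubling would converge in fewer steps but only to within a factor of two.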
In an embodiment of the present invention, the evaluation sub-module includes:
a third calculation submodule configured to calculate a height evaluation value according to the height of the abnormal data;
a fourth calculation submodule configured to calculate a width evaluation value from a width of the abnormal data;
a fifth calculation submodule configured to calculate a distance evaluation value according to a distance proportion of the abnormal data;
the sixth calculating submodule is configured to determine a height weight value, a width weight value and a distance weight value, and calculate a total weight value according to the height weight value, the width weight value and the distance weight value;
and the seventh calculating submodule is configured to acquire an uncertainty factor evaluation value and calculate an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertainty factor evaluation value.
In an embodiment of the present invention, the third calculation submodule calculates a height evaluation value according to the height of the abnormal data based on the following formula:
$I_{height} = (c_i \cdot h + c_j)^m$
where $I_{height}$ is the height evaluation value, $h$ is the height of the abnormal data, $c_i$ and $c_j$ are constants, and $m$ is a real number.
In an embodiment of the present invention, the fourth calculation sub-module calculates a width evaluation value from the width of the abnormal data based on the following equation:
$I_{width} = (c_p \cdot w + c_q)^n$
where $I_{width}$ is the width evaluation value, $w$ is the width of the abnormal data, $c_p$ and $c_q$ are constants, and $n$ is a real number.
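The height and width evaluation values share the same form, an affine function of the attribute raised to a real power. The sketch below reads the trailing real number in each formula as an exponent, which is an interpretation of the flattened notation; the default constants are placeholders chosen for illustration, not values from the disclosure.

```python
def power_score(x, scale, offset, exponent):
    """Evaluate (scale * x + offset) ** exponent."""
    return (scale * x + offset) ** exponent


def height_score(h, c_i=1.0, c_j=0.0, m=1.0):
    """Height evaluation value for an abnormal segment of height h."""
    return power_score(h, c_i, c_j, m)


def width_score(w, c_p=1.0, c_q=0.0, n=1.0):
    """Width evaluation value for an abnormal segment of width w."""
    return power_score(w, c_p, c_q, n)


i_height = height_score(3.0, c_i=2.0, c_j=1.0, m=2.0)  # (2*3 + 1)^2
i_width = width_score(4, c_p=0.5, c_q=0.0, n=1.0)      # (0.5*4 + 0)^1
```

An exponent above 1 makes tall or wide segments weigh disproportionately more, while an exponent below 1 dampens them.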
In an embodiment of the present invention, the fifth calculation sub-module calculates a distance evaluation value according to a distance ratio of the abnormal data based on the following formula:
Figure BDA0002403611170000261
where $I_{dist}$ is the distance evaluation value, $d$ is the distance proportion of the abnormal data, and $c_u$, $c_k$, $c_h$ are constants.
In an embodiment of the present invention, the sixth calculating sub-module calculates a total weight value according to the height weight value, the width weight value and the distance weight value based on the following formula:
Figure BDA0002403611170000262
where $\alpha$ is the total weight value, $w_{height}$ is the height weight value, $w_{width}$ is the width weight value, $w_{dist}$ is the distance weight value, and $c_t$ is a constant.
In an embodiment of the present invention, the seventh calculating submodule calculates an action evaluation value of the abnormal data from the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value, and the uncertainty factor evaluation value based on the following formula:
Figure BDA0002403611170000271
where $\alpha$ is the total weight value, $I_{height}$ is the height evaluation value, $I_{width}$ is the width evaluation value, $I_{dist}$ is the distance evaluation value, and $S_{\{h,w,v,\dots\}}$ and $\gamma$ together constitute the uncertainty factor evaluation value; $S_{\{h,w,v,\dots\}}$ is the shape evaluation value, in which $\{h, w, v, \dots\}$ characterizes the influence of the height $h$, width $w$, and variance $v$ of the abnormal data on the shape, and $\gamma$ is the evaluation value of other uncertainty factors.
The constants $c_i$, $c_j$, $c_p$, $c_q$, $c_u$, $c_k$, $c_h$, $c_t$, the real numbers $m$ and $n$, the height weight value $w_{height}$, the width weight value $w_{width}$, the distance weight value $w_{dist}$, the shape evaluation value $S_{\{h,w,v,\dots\}}$, and the other uncertainty factor evaluation value $\gamma$ can be determined according to the requirements of the practical application and the characteristics of the data, or from empirical values; the present disclosure does not specifically limit their values.
After the action evaluation value of the abnormal data is obtained through calculation according to the embodiment, the influence of the abnormal data on the prediction result can be quantized, and different processing means can be intuitively adopted for different abnormal data sections according to the influence degree.
In an embodiment of the present invention, the apparatus further includes a section for performing preset processing on the abnormal data according to the action evaluation value, that is, as shown in fig. 9, the abnormal data detecting apparatus includes:
an obtaining module 901 configured to obtain data samples within a preset time period and divide the data samples into different classes of data sample sets;
a comparison module 902 configured to obtain data to be detected, compare the data to be detected with the data sample set, and determine suspected abnormal data;
a determining module 903, configured to perform segmentation distance comparison on the suspected abnormal data, and determine the suspected abnormal data meeting a preset condition as abnormal data;
an evaluation module 904 configured to perform an action evaluation on the abnormal data;
a processing module 905 configured to perform preset processing on the abnormal data according to the action evaluation value.
Wherein the preset processing may include one or more of the following processes: filtering processing, deleting processing, enlarging processing, reducing processing, and the like. Those skilled in the art can select an appropriate processing method according to the requirements of practical application and the characteristics of data, and the disclosure does not limit the processing method specifically.
In an embodiment of the present invention, the processing module 905 is configured to:
when the action evaluation value exceeds a preset evaluation threshold value, the abnormal data is considered to have a large adverse effect on a prediction result, and at the moment, the abnormal data is required to be filtered;
when the action evaluation value does not exceed the preset evaluation threshold value, that is, is less than or equal to the preset evaluation threshold value, it may be determined that the abnormal data does not have a large adverse effect on the prediction result, and at this time, a retention measure may be taken for the abnormal data without performing filtering processing.
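The threshold rule above can be sketched as follows, assuming each abnormal segment is described by a half-open index range together with its action evaluation value, and interpreting "filtering" as removing the affected samples. Both the tuple layout and the function name are illustrative assumptions.

```python
def filter_by_action_value(series, anomalies, threshold):
    """Drop samples of abnormal segments whose action value exceeds threshold.

    anomalies: iterable of (start, end, action_value), end exclusive.
    """
    dropped = set()
    for start, end, action_value in anomalies:
        if action_value > threshold:          # large adverse effect: filter out
            dropped.update(range(start, end))
        # otherwise the segment is retained unchanged
    return [value for index, value in enumerate(series) if index not in dropped]


series = [1, 1, 9, 9, 1, 5, 1]
anomalies = [(2, 4, 0.9), (5, 6, 0.2)]
cleaned = filter_by_action_value(series, anomalies, threshold=0.5)
```

Only the first segment is removed here; the second stays in the data because its action evaluation value does not exceed the threshold.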
The embodiment of the invention also discloses an electronic device, which comprises a memory and a processor; wherein
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to perform any of the method steps described above.
Fig. 10 is a schematic structural diagram of a computer system suitable for implementing an abnormal data detecting method according to an embodiment of the present invention.
As shown in fig. 10, the computer system 1000 includes a processing unit 1001 that can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM1003, various programs and data necessary for the operation of the system 1000 are also stored. The processing unit 1001, the ROM1002, and the RAM1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary. The processing unit 1001 may be implemented as a CPU, a GPU, an FPGA, an NPU, or other processing units.
In particular, the above described method may be implemented as a computer software program according to an embodiment of the present invention. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a medium readable thereby, the computer program comprising program code for performing the method for anomaly data detection. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may be a computer-readable storage medium included in the apparatus in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present invention.
The foregoing description is only exemplary of the preferred embodiments of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention according to the embodiments of the present invention is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept. For example, the above features and (but not limited to) the features with similar functions disclosed in the embodiments of the present invention are mutually replaced to form the technical solution.

Claims (18)

1. An abnormal data detection method, comprising:
acquiring data samples in a preset time period, and dividing the data samples into different types of data sample sets;
extracting data baselines of the different types of data sample sets, wherein the data baselines are used for characterizing data distribution characteristics of the data sample sets;
acquiring data to be detected, and comparing the data to be detected with the data sample set to determine suspected abnormal data;
comparing the sectional distances of the suspected abnormal data, and determining the suspected abnormal data meeting preset conditions as abnormal data;
wherein the comparing the data to be detected with the data sample set and the determining the suspected abnormal data comprises:
calculating the distance between the data to be detected and the data base lines of the different types of data sample sets;
and determining the data whose minimum distance from the data baselines of the different classes of data sample sets exceeds a first preset distance threshold as the suspected abnormal data.
2. The method of claim 1, wherein the obtaining data samples within a preset time period and the classifying the data samples into different classes of data sample sets comprises:
acquiring a data sample in a preset time period;
clustering the data samples according to the similarity among the data samples to obtain data sample sets of different categories;
the extracting the data baselines of the different classes of data sample sets comprises: and training to obtain data regression models corresponding to the different types of data sample sets, and extracting data baselines of the different types of data sample sets.
3. The method according to any one of claims 1-2, wherein the comparing the distances between the suspected abnormal data segments to determine the suspected abnormal data meeting a preset condition as abnormal data comprises:
carrying out sectional processing on the suspected abnormal data to obtain two or more suspected abnormal data sections;
calculating the distance between the suspected abnormal data segments to generate a distance matrix of the suspected abnormal data segments;
and determining the suspected abnormal data segment with the maximum distance from other suspected abnormal data segments exceeding a second preset distance threshold value as abnormal data.
4. The method of any of claims 1-3, further comprising:
and evaluating the effect of the abnormal data.
5. The method of claim 4, wherein said evaluating the effect of said anomaly data comprises:
acquiring attribute information of the abnormal data, wherein the attribute information of the abnormal data comprises one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and performing action evaluation on the abnormal data according to the attribute information of the abnormal data.
6. The method according to claim 5, wherein the evaluating the effect of the abnormal data according to the attribute information of the abnormal data comprises:
calculating according to the height of the abnormal data to obtain a height evaluation value;
calculating according to the width of the abnormal data to obtain a width evaluation value;
calculating according to the distance proportion of the abnormal data to obtain a distance evaluation value;
determining a height weight value, a width weight value and a distance weight value, and calculating to obtain a total weight value according to the height weight value, the width weight value and the distance weight value;
and acquiring an uncertain factor evaluation value, and calculating to obtain an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertain factor evaluation value.
7. The method of claim 6, further comprising:
and performing preset processing on the abnormal data according to the action evaluation value.
8. The method according to claim 7, wherein the performing of the preset processing on the abnormal data according to the action-assessed value includes:
and when the action evaluation value exceeds a preset evaluation threshold value, filtering the abnormal data.
9. An abnormal data detecting apparatus, comprising:
the acquisition module is configured to acquire data samples in a preset time period, divide the data samples into different types of data sample sets, and extract data baselines of the different types of data sample sets, wherein the data baselines are used for representing data distribution characteristics of the data sample sets;
the comparison module is configured to acquire data to be detected, compare the data to be detected with the data sample set and determine suspected abnormal data;
the determining module is configured to perform segmentation distance comparison on the suspected abnormal data and determine the suspected abnormal data meeting preset conditions as abnormal data;
wherein the comparing the data to be detected with the data sample set and the determining the suspected abnormal data comprises: calculating the distance between the data to be detected and the data base lines of the different types of data sample sets; and determining the data with the minimum distance from the data baseline of the different classes of data sample sets exceeding a first preset distance threshold as the suspected abnormal data.
10. The apparatus of claim 9, wherein the obtaining module comprises:
the first obtaining submodule is configured to obtain data samples within a preset time period;
the clustering submodule is configured to perform clustering processing on the data samples according to the similarity between the data samples to obtain data sample sets of different categories;
and the extraction submodule is configured to train to obtain data regression models corresponding to the different types of data sample sets, and extract the data base lines of the different types of data sample sets.
11. The apparatus of any of claims 9-10, wherein the determining module comprises:
the segmentation submodule is configured to perform segmentation processing on the suspected abnormal data to obtain two or more suspected abnormal data segments;
the second calculation submodule is configured to calculate the distance between the suspected abnormal data segments and generate a suspected abnormal data segment distance matrix;
and the second determining submodule is configured to determine the suspected abnormal data segments with the maximum distance from other suspected abnormal data segments exceeding a second preset distance threshold value as abnormal data.
12. The apparatus of any of claims 9-11, further comprising:
an evaluation module configured to perform an action evaluation on the anomaly data.
13. The apparatus of claim 12, wherein the evaluation module comprises:
a third obtaining sub-module configured to obtain attribute information of the abnormal data, wherein the attribute information of the abnormal data includes one or more of the following information: the width of the abnormal data, the height of the abnormal data, and the distance proportion of the distance between the starting position of the abnormal data and the starting position of the data to be detected to the total length of the data to be detected;
and the evaluation sub-module is configured to evaluate the effect of the abnormal data according to the attribute information of the abnormal data.
14. The apparatus of claim 13, wherein the evaluation sub-module comprises:
a third calculation submodule configured to calculate a height evaluation value according to the height of the abnormal data;
a fourth calculation submodule configured to calculate a width evaluation value from a width of the abnormal data;
a fifth calculation submodule configured to calculate a distance evaluation value according to a distance proportion of the abnormal data;
the sixth calculation submodule is configured to determine a height weight value, a width weight value and a distance weight value, and calculate a total weight value according to the height weight value, the width weight value and the distance weight value;
and the seventh calculating submodule is configured to acquire an uncertainty factor evaluation value and calculate an action evaluation value of the abnormal data according to the height evaluation value, the width evaluation value, the distance evaluation value, the total weight value and the uncertainty factor evaluation value.
15. The apparatus of claim 14, further comprising:
and the processing module is configured to perform preset processing on the abnormal data according to the action evaluation value.
16. The apparatus of claim 15, wherein the processing module is configured to:
and when the action evaluation value exceeds a preset evaluation threshold value, filtering the abnormal data.
17. An electronic device comprising a memory and a processor; wherein
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of any of claims 1-8.
18. A computer-readable storage medium having stored thereon computer instructions, characterized in that the computer instructions, when executed by a processor, carry out the method steps of any of claims 1-8.
CN202010154535.XA 2020-03-08 2020-03-08 Abnormal data detection method and device, electronic equipment and computer storage medium Active CN113435464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010154535.XA CN113435464B (en) 2020-03-08 2020-03-08 Abnormal data detection method and device, electronic equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113435464A CN113435464A (en) 2021-09-24
CN113435464B true CN113435464B (en) 2022-05-17

Family

ID=77752357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010154535.XA Active CN113435464B (en) 2020-03-08 2020-03-08 Abnormal data detection method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113435464B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807547B (en) * 2024-02-29 2024-05-10 国网山东省电力公司经济技术研究院 Regional level comprehensive energy large-scale data cleaning method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436277A (en) * 2017-07-12 2017-12-05 广东旭诚科技有限公司 Single-index data quality control method based on similarity-distance discrimination
CN108319981A (en) * 2018-02-05 2018-07-24 清华大学 Density-based time series data anomaly detection method and device
CN109032829A (en) * 2018-07-23 2018-12-18 腾讯科技(深圳)有限公司 Data anomaly detection method, device, computer equipment and storage medium
CN109767352A (en) * 2018-12-24 2019-05-17 国网山西省电力公司信息通信分公司 Security situation assessment method for an electric power cyber-physical fusion system
CN109978070A (en) * 2019-04-03 2019-07-05 北京市天元网络技术股份有限公司 Improved K-means outlier detection method and device
CN110008979A (en) * 2018-12-13 2019-07-12 阿里巴巴集团控股有限公司 Abnormal data prediction method, device, electronic equipment and computer storage medium
CN110634080A (en) * 2018-06-25 2019-12-31 中兴通讯股份有限公司 Abnormal electricity usage detection method, device, equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN113435464A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN107194430B (en) Sample screening method and device and electronic equipment
CN110348522B (en) Image detection and identification method and system, electronic equipment, and image classification network optimization method and system
CN110288017B (en) High-precision cascade target detection method and device based on dynamic structure optimization
EP3220353A1 (en) Image processing apparatus, image processing method, and recording medium
US10332244B2 (en) Methods and apparatuses for estimating an ambiguity of an image
CN113435464B (en) Abnormal data detection method and device, electronic equipment and computer storage medium
CN116309344A (en) Insulator abnormality detection method, device, equipment and storage medium
CN116704208B (en) Local interpretable method based on characteristic relation
CN117593115A (en) Feature value determining method, device, equipment and medium of credit risk assessment model
CN112434717B (en) Model training method and device
WO2019177130A1 (en) Information processing device and information processing method
CN115904955A (en) Performance index diagnosis method and device, terminal equipment and storage medium
CN111368837A (en) Image quality evaluation method and device, electronic equipment and storage medium
CN113177603B (en) Training method of classification model, video classification method and related equipment
CN113808088A (en) Pollution detection method and system
Thike et al. Parking space detection using complemented-ULBP background subtraction
CN111798237A (en) Abnormal transaction diagnosis method and system based on application log
CN110580494A (en) Data analysis method based on quantile logistic regression
CN111597934A (en) System and method for processing training data for statistical applications
CN114638851B (en) Image segmentation method, system and storage medium based on generation countermeasure network
CN117132896B (en) Method for detecting and identifying building cracking
CN117474915B (en) Abnormality detection method, electronic equipment and storage medium
CN115083442B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN111835830B (en) Data perception system, method and device
CN113516161B (en) Risk early warning method for tunnel constructors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40062496

Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20230601

Address after: Room 1-2-A06, Yungu Park, No. 1008 Dengcai Street, Sandun Town, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Aliyun Computing Co.,Ltd.

Address before: Box 847, four, Grand Cayman capital, Cayman Islands, UK

Patentee before: ALIBABA GROUP HOLDING Ltd.
