CN111092757B - Abnormal data detection method, system and equipment

Info

Publication number: CN111092757B
Application number: CN201911239601.7A
Authority: CN
Inventors: 陈芹浩
Original assignee: Wangsu Science and Technology Co Ltd
Current assignee: Wangsu Science and Technology Co Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2021-11-23
Anticipated expiration: 2039-12-06
Also published as: WO2021109314A1; CN111092757A

Abstract

The invention discloses a method, a system and equipment for detecting abnormal data. The method comprises the following steps: acquiring access data of a specified time period, training a threshold model on the access data, and judging according to the threshold model whether the access data of a target time node is abnormal data; if the access data of the target time node is judged to be abnormal data, determining a detection interval, counting the distribution of access data samples in the detection interval, and judging again whether the access data of the target time node is abnormal data; if the access data of the target time node is again judged to be abnormal data, acquiring a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and judging, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed. The technical solution provided by the application can improve the accuracy of abnormal data detection.

Description

Abnormal data detection method, system and equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, a system, and a device for detecting abnormal data.

Background

In current CDNs (Content Delivery Networks), an alarm module may be configured to improve user experience. When data access becomes abnormal, the alarm module can send out alarm information promptly, so that network administrators can locate and repair the abnormality and users are not left unable to access data for a long time.

Current data alarm approaches usually preset multiple types of alarm information; if an actual data abnormality matches one of the types, the corresponding alarm information is sent out. However, the types of data abnormality in a network are complex and numerous, and some data abnormalities are tolerable. The existing alarm approach therefore generates a large amount of unnecessary alarm information: on the one hand, considerable manpower and material resources are spent investigating abnormalities; on the other hand, truly serious data abnormalities can be buried in the mass of alarm information. A more accurate means of detecting abnormal data is therefore needed.

Disclosure of Invention

The application aims to provide a method, a system and equipment for detecting abnormal data, which can improve the accuracy of abnormal data detection.

In order to achieve the above object, an aspect of the present application provides a method for detecting abnormal data, where the method includes: acquiring access data of a specified time period, training the access data to obtain a threshold model, and judging whether the access data of a target time node is abnormal data or not according to the threshold model; if the access data of the target time node is judged to be abnormal data, determining a detection interval containing the target time node, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution; if the access data of the target time node is judged to be abnormal data again, a convergence rule and an amplitude threshold value corresponding to the access data of the target time node are obtained, and whether the access data of the target time node is abnormal data to be processed or not is judged based on the convergence rule and the amplitude threshold value.

In order to achieve the above object, another aspect of the present application further provides a system for detecting abnormal data, the system including: the threshold model judging unit is used for acquiring access data of a specified time period, training the access data to obtain a threshold model, and judging whether the access data of a target time node is abnormal data or not according to the threshold model; the distribution judging unit is used for determining a detection interval containing the target time node if the access data of the target time node is judged to be abnormal data, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution; and the screening unit is used for acquiring a convergence rule and an amplitude threshold value corresponding to the access data of the target time node if the access data of the target time node is judged to be abnormal data again, and judging whether the access data of the target time node is abnormal data to be processed or not based on the convergence rule and the amplitude threshold value.

In order to achieve the above object, another aspect of the present application further provides an abnormal data detection apparatus, which includes a processor and a memory, where the memory is used to store a computer program, and the computer program is executed by the processor to implement the above abnormal data detection method.

As can be seen from the above, according to the technical solutions provided by one or more embodiments of the present application, when abnormal data is detected, a threshold model can be obtained by training on the access data of a specified time period. The access data of the target time node can be preliminarily judged through the threshold model. If it is judged to be abnormal data, a detection interval containing the target time node can be determined, and the distribution of the access data samples in the detection interval can be counted. According to the counted distribution, whether the access data is abnormal data can be judged further. The benefit of this step is that data judged abnormal by a uniform threshold model may not actually be abnormal within a particular period of time; distribution statistics over the access data samples of that period clarify whether the access data is really abnormal. If the access data is still judged to be abnormal data, the convergence rule and the amplitude threshold of the target time node can then be acquired. The convergence rule filters out sudden data abnormalities that do not actually need to be processed, and the amplitude threshold prevents data from being misjudged as abnormal merely because the number of access requests is too small. After further screening by the convergence rule and the amplitude threshold, the final abnormal data to be processed can be determined. Screening layer by layer in these several ways therefore allows abnormal data to be detected more accurately.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a diagram illustrating steps of a method for detecting abnormal data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a steady-type domain name in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a periodic-variation-type domain name in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a spike-variation-type domain name in an embodiment of the present invention;

FIG. 5 is a schematic diagram of an isolation forest algorithm in an embodiment of the present invention;

FIG. 6 is a schematic diagram of the partitioning of data nodes according to an embodiment of the present invention;

FIG. 7 is a schematic configuration diagram of an abnormal data detection apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to the detailed description of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.

Referring to fig. 1, the method for detecting abnormal data may include the following steps.

S1: acquiring access data of a specified time period, training a threshold model based on the access data, and judging according to the threshold model whether the access data of a target time node is abnormal data.

In this embodiment, an isolation forest algorithm may be used to train on the access data of a specified time period, so as to obtain a threshold model for distinguishing normal data from abnormal data. The specified time period can be set flexibly according to the required training precision and training duration of the threshold model. In one specific application example, the access data of the last 30 days may be acquired.

It should be noted that the service characteristics of different types of domain names may differ, so that when these domain names provide services, the proportion of abnormal data and the time nodes at which abnormal data occurs may also differ. By analyzing the access data of a large number of domain names, domain names can be divided into three types: a steady type, a periodic variation type, and a spike variation type. Curves of the abnormal data for the three domain name types may be as shown in fig. 2, fig. 3, and fig. 4, respectively. In fig. 2, the proportion of abnormal data in the access data of a steady-type domain name stays within a small interval over time. In fig. 3, the proportion of abnormal data in the access data of a periodic-variation-type domain name changes periodically over time. In fig. 4, the proportion of abnormal data in the access data of a spike-variation-type domain name changes sharply. The access data of each domain name type may carry a corresponding domain name label. The domain name label may be a data identifier set manually by an administrator to distinguish domain name types, or a feature identifier obtained by big-data analysis of the access data of the different domain name types. To improve the precision of abnormal data detection, different training data can be selected for different domain name types, so that different threshold models are trained.

Specifically, the threshold model may be trained in the same manner for each of the above domain name types, except that each training run uses the access data of the corresponding domain name type. In practical applications, if the access data acquired for the specified time period contains access data of several domain name types, it can first be classified by domain name type: for each item of access data to be used in training, the domain name type to which it belongs can be identified from the domain name label it carries. Because different domain name types tolerate different proportions of abnormal data, a corresponding screening threshold can be allocated to each domain name type when training the threshold model. The screening threshold represents the maximum proportion of abnormal data that the corresponding domain name type can tolerate. Taking the steady-type domain name as an example, the screening threshold may be 1/1440, indicating that only 1 item of abnormal access data is allowed per 1440 items of access data. For the periodic-variation-type and spike-variation-type domain names, the screening threshold may be somewhat larger: for example, 10/1440 for the periodic-variation-type domain name and 20/1440 for the spike-variation-type domain name.
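As an illustration only, the classification by domain name label and the per-type screening thresholds described above might be sketched as follows. The record layout (a dict with a "label" key), the label strings and the function name are assumptions made for this sketch and do not come from the embodiment; only the three threshold values are taken from the text.

```python
from collections import defaultdict
from fractions import Fraction

# Screening thresholds quoted in the text: the maximum tolerated
# proportion of abnormal data for each domain name type.
SCREENING_THRESHOLDS = {
    "steady": Fraction(1, 1440),
    "periodic": Fraction(10, 1440),
    "spike": Fraction(20, 1440),
}


def group_by_domain_type(records):
    """Group access records by the domain name label they carry.

    Each record is assumed to be a dict with a "label" key
    (a hypothetical schema used only for this sketch).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["label"]].append(record)
    return dict(groups)


# Hypothetical sample records.
records = [
    {"label": "steady", "domain": "a.example"},
    {"label": "spike", "domain": "b.example"},
    {"label": "steady", "domain": "c.example"},
]
groups = group_by_domain_type(records)
for label, items in groups.items():
    # Each group would be trained separately with its own threshold.
    print(label, len(items), SCREENING_THRESHOLDS[label])
```

Keeping the thresholds as exact fractions simply mirrors the way the text states them; a floating-point value would serve equally well in practice.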

In this way, after the domain name type to which the access data belongs is identified, the screening threshold corresponding to that domain name type can be acquired. Then, the access anomaly ratio of each time node in the access data can be counted. In practical applications, a time node may be 1 minute, so the acquired access data may be divided at a granularity of 1 minute. The access data of each minute may include normal access data and abnormal access data; by calculating the proportion of abnormal access data in the total access data of that minute, the access anomaly ratio of every minute in the specified time period can be obtained.

For the access anomaly ratios obtained in this way, an isolation forest algorithm can be applied: each access anomaly ratio is regarded as a data node, and the data nodes are isolated layer by layer in the manner shown in fig. 5. The earlier a node is isolated, the more likely it is to be an abnormal node. For example, fig. 5 contains four nodes a, b, c and d; node d is isolated first, so node d is the most likely to be an abnormal node. Through the isolation forest algorithm, the nodes can finally be partitioned, giving the partition diagram shown in fig. 6. In fig. 6, the black dots represent the data nodes corresponding to the access anomaly ratios. Most of the data nodes cluster together, while a small number are scattered. The isolation forest algorithm yields the closed screening boundary shown in fig. 6; data nodes inside the screening boundary may be called aggregation nodes, and data nodes outside it may be called isolated nodes. The isolated nodes can be regarded as abnormal data nodes, and the number of isolated nodes in fig. 6 is determined by the screening threshold corresponding to the domain name type. Thus, introducing a fixed screening threshold into the isolation forest algorithm defines the range enclosed by the screening boundary. With continued training on a large amount of data, the position of the screening boundary becomes increasingly accurate, until for any input sample the model can accurately judge whether the sample falls inside or outside the screening boundary. The model having this screening boundary can then be used as the trained threshold model. Of course, a corresponding threshold model can be trained for each domain name type.
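The counting step that feeds the isolation forest can be sketched as below. This is a minimal sketch and does not implement the isolation forest itself: it only computes one access anomaly ratio per minute and estimates how many data nodes a given screening threshold would leave outside the screening boundary. The input format (pairs of minute index and abnormal flag), the function names, and the rule "number of nodes times threshold, rounded" are assumptions made for illustration.

```python
from collections import defaultdict


def per_minute_anomaly_ratio(requests):
    """Return {minute: abnormal / total} for (minute, is_abnormal) pairs."""
    total = defaultdict(int)
    abnormal = defaultdict(int)
    for minute, is_abnormal in requests:
        total[minute] += 1
        if is_abnormal:
            abnormal[minute] += 1
    return {minute: abnormal[minute] / total[minute] for minute in total}


def isolated_node_budget(num_nodes, screening_threshold):
    """Assumed rule: share of data nodes allowed outside the boundary."""
    return round(num_nodes * screening_threshold)


# Hypothetical requests: minute 0 has 1 abnormal request out of 4,
# minute 1 has no abnormal request out of 2.
requests = [(0, False), (0, False), (0, True), (0, False),
            (1, False), (1, False)]
ratios = per_minute_anomaly_ratio(requests)
print(ratios)  # {0: 0.25, 1: 0.0}

# 30 days of 1-minute time nodes with a threshold of 1/1440.
print(isolated_node_budget(30 * 1440, 1 / 1440))  # 30
```

Under these assumptions, 30 days of per-minute ratios for a steady-type domain name would leave roughly 30 data nodes outside the screening boundary.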

In this embodiment, after the threshold model is obtained through training, a preliminary judgment can be made on the access data to be detected. Taking the access data of any target time node as an example, the access anomaly ratio corresponding to the access data of the target time node can be calculated in the manner described above and input into the threshold model. The threshold model judges whether the input access anomaly ratio is an isolated node or an aggregation node. If the threshold model outputs an isolated node, the access data of the target time node can be judged to be abnormal data. If the threshold model outputs an aggregation node, the access data of the target time node can be judged to be non-abnormal data.

S3: if the access data of the target time node is judged to be abnormal data, determining a detection interval containing the target time node, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution.

In this embodiment, it is considered that the data selected for training the threshold model is randomly sampled, whereas in practice the volume of access data may differ greatly from one time to another, so the access anomaly ratio varies considerably across time nodes. Some time nodes with a relatively large access anomaly ratio are probably caused by a sudden increase in access data; the access anomaly ratios of such time nodes are acceptable and should not be treated as abnormal data. In view of this, in order to clarify whether the abnormal data detected in step S1 is truly abnormal, this embodiment may check the detected abnormal data once more.

Specifically, if the access data at the target time node is judged to be abnormal data, more data near the target time node may be acquired for analysis, so as to avoid a one-sided detection result. In practical applications, a detection interval containing the target time node may be determined first, and the detection interval may correspond to a detection duration. For example, the detection duration may be 10 minutes in total, 5 minutes before and 5 minutes after the target time node. Of course, the detection duration may also differ between domain name types. For a steady-type domain name, the detection duration may be relatively short, for example 20 minutes. For the periodic-variation-type and spike-variation-type domain names, the detection duration may be relatively long, for example 1 hour and 2 hours, respectively. In this way, after the access data of the target time node is preliminarily judged to be abnormal data, the detection duration corresponding to the domain name type to which that access data belongs can be acquired. Then, taking the target time node as the center, a detection interval that contains the target time node and whose length equals the acquired detection duration can be constructed. After the detection interval is constructed, the access data within it can be acquired. Of course, the access data acquired here is that of the identified domain name type; access data of other domain name types may be filtered out first.
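The construction of a detection interval centered on the target time node can be sketched as follows. The durations are the example values given above; the dictionary keys, the function name and the sample date are hypothetical.

```python
from datetime import datetime, timedelta

# Example detection durations from the text, keyed by domain name type.
DETECTION_DURATION = {
    "steady": timedelta(minutes=20),
    "periodic": timedelta(hours=1),
    "spike": timedelta(hours=2),
}


def detection_interval(target, domain_type):
    """Return (start, end) of an interval centered on the target time node."""
    half = DETECTION_DURATION[domain_type] / 2
    return target - half, target + half


# Hypothetical target time node at 12:05 for a steady-type domain name.
start, end = detection_interval(datetime(2019, 12, 6, 12, 5), "steady")
print(start.time(), end.time())  # 11:55:00 12:15:00
```

With a 20-minute duration, a target time node at 12:05 yields the interval from 11:55 to 12:15.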

In this embodiment, in order to improve the accuracy of data analysis, the access data falling within the detection interval on every day of a certain period may be used as the object of analysis. For example, if the access data of a target domain name at the target time node 12:05 is preliminarily judged to be abnormal data, the access data of that domain name from 11:55 to 12:15 on each of the last 30 days can be used for further analysis. After the data in the detection interval is acquired, the access anomaly ratio of each access data sample in the detection interval can be counted. As before, the data can be divided at a granularity of 1 minute, so that for each day one access anomaly ratio is produced for every minute in the detection interval. Subsequently, the mean and the standard deviation of the counted access anomaly ratios can be calculated, so that a normal distribution can be fitted to the access anomaly ratios according to the mean and the standard deviation. The normal distribution characterizes the general behavior of the data: data near the middle of the distribution can generally be regarded as normal, whereas data at its edges may be abnormal. In the fitted distribution, the center corresponds to the calculated mean, and the data spreads from the center to both sides in units of the standard deviation. Thus, once the normal distribution of the access anomaly ratios is obtained, a confidence interval can be determined in it according to the mean and the standard deviation. In one specific application example, the confidence interval may be (μ − 3σ, μ + 3σ), where μ is the mean and σ is the standard deviation; access anomaly ratios inside the confidence interval may be regarded as normal data, and access anomaly ratios outside it as abnormal data. The position of the access data of the target time node can then be identified in the distribution. If the access data of the target time node lies outside the confidence interval, it can be judged to be abnormal data; if it lies inside the confidence interval, it can be judged to be non-abnormal data.
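The confidence-interval check described above can be sketched with the Python standard library. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`). The function names and the sample ratios are hypothetical.

```python
from statistics import mean, pstdev


def confidence_interval(ratios, k=3.0):
    """Return (mu - k*sigma, mu + k*sigma) for the counted anomaly ratios."""
    mu = mean(ratios)
    sigma = pstdev(ratios)  # assumption: population standard deviation
    return mu - k * sigma, mu + k * sigma


def is_abnormal_again(target_ratio, ratios, k=3.0):
    """True if the target ratio lies outside the open confidence interval."""
    low, high = confidence_interval(ratios, k)
    return not (low < target_ratio < high)


# Hypothetical per-minute anomaly ratios collected in the detection interval.
history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013, 0.008, 0.011]
print(is_abnormal_again(0.011, history))  # False
print(is_abnormal_again(0.050, history))  # True
```

For this sample the mean is 0.0105 and the standard deviation is 0.0015, so the interval is approximately (0.006, 0.015): a ratio of 0.011 is kept as normal, while 0.050 is judged abnormal again.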

S5: if the access data of the target time node is judged to be abnormal data again, a convergence rule and an amplitude threshold value corresponding to the access data of the target time node are obtained, and whether the access data of the target time node is abnormal data to be processed or not is judged based on the convergence rule and the amplitude threshold value.

In this embodiment, in order to further improve the accuracy of data detection, different convergence rules and amplitude thresholds may also be configured for different domain name types. The convergence rule may be determined according to the service characteristics corresponding to the domain name type. For example, domain names may be divided by service characteristics into fields such as banking, payment and on-demand, and a different convergence rule may be formulated for each field. The convergence rule takes into account how abnormal data occurs over a period of time, so as to determine whether the access data at a given target time node is truly abnormal data to be processed. The reason is that the abnormal data identified in steps S1 and S3 may be a one-off burst that does not recur in subsequent data access, in which case handling it would waste manpower and material resources. The configured amplitude threshold, in turn, judges in absolute terms whether the number of abnormal requests in the access data at the target time node is large enough. The reason is that, for some target time nodes, the calculated access anomaly ratio is high only because the total number of accesses has dropped: the number of abnormal requests has not changed, but the ratio appears higher because the total number of access requests is smaller. Handling this situation would likewise waste manpower and material resources.
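A small worked example, using made-up request counts, shows why a ratio alone can mislead when the total number of requests falls while the number of abnormal requests stays the same.

```python
def anomaly_ratio(abnormal_requests, total_requests):
    """Proportion of abnormal requests among all requests of a time node."""
    return abnormal_requests / total_requests


# Hypothetical counts: the same 10 abnormal requests in a busy minute
# and in a quiet minute.
busy = anomaly_ratio(10, 1000)
quiet = anomaly_ratio(10, 100)
print(busy, quiet)  # 0.01 0.1
```

The ratio is ten times higher in the quiet minute even though the absolute number of abnormal requests is unchanged, which is the case the amplitude threshold is meant to filter out.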

In view of this, in this embodiment, the abnormal data identified so far may be screened further according to the convergence rule and the amplitude threshold. The convergence rule may differ between domain name types. For example, the convergence rule may require that, taking the target time node as the starting time node, the access data at a specified number of consecutive time nodes is all judged to be abnormal data. Alternatively, the convergence rule may require that abnormal data occurs a specified number of times within a preset duration containing the target time node. For example, for a steady-type domain name, the convergence rule may be that the access data of 4 consecutive minutes is all judged to be abnormal data. For a periodic-variation-type domain name, the convergence rule may be 6 occurrences of abnormal data within 10 minutes. For a spike-variation-type domain name, the convergence rule may be 10 occurrences of abnormal data within 20 minutes.
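The two kinds of convergence rule can be sketched as below, under stated assumptions: the per-minute results of the earlier checks are given as a list of booleans, the preset duration is taken to start at the target time node (the text only requires that it contain the node), and "a specified number of times" is read as "at least that many". The function names are hypothetical.

```python
def consecutive_rule(flags, start, count):
    """True if `count` consecutive time nodes from `start` are all abnormal."""
    window = flags[start:start + count]
    return len(window) == count and all(window)


def count_in_window_rule(flags, start, window, min_count):
    """True if at least `min_count` of the `window` nodes from `start` are abnormal."""
    return sum(flags[start:start + window]) >= min_count


# Hypothetical per-minute flags (True = judged abnormal in S1 and S3).
flags = [False, True, True, True, True, False,
         True, False, True, True, True, False]

print(consecutive_rule(flags, 1, 4))          # True: minutes 1 to 4
print(consecutive_rule(flags, 6, 4))          # False: minute 7 is normal
print(count_in_window_rule(flags, 1, 10, 6))  # True: 8 abnormal minutes
```

The first call corresponds to the steady-type example (4 consecutive minutes) and the last to the periodic-variation-type example (6 occurrences within 10 minutes).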

The amplitude threshold may be graded according to the magnitude of the access data. The magnitude of the access data may be expressed, for example, in QPS (Queries Per Second); the larger the magnitude of the access data, the larger the corresponding amplitude threshold. In practical applications, several magnitude intervals may be set, each corresponding to its own amplitude threshold.
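Looking up the amplitude threshold for a magnitude interval can be sketched with a sorted list of interval boundaries. The boundaries and threshold values below are invented for illustration; the text gives no concrete numbers.

```python
from bisect import bisect_right

# Hypothetical magnitude interval boundaries in QPS and the amplitude
# threshold (number of abnormal requests) assigned to each interval.
QPS_BOUNDS = [100, 1000, 10000]
AMPLITUDE_THRESHOLDS = [5, 20, 100, 500]  # one more entry than boundaries


def amplitude_threshold(qps):
    """Return the amplitude threshold of the interval containing `qps`."""
    return AMPLITUDE_THRESHOLDS[bisect_right(QPS_BOUNDS, qps)]


print(amplitude_threshold(50))     # 5
print(amplitude_threshold(2500))   # 100
print(amplitude_threshold(50000))  # 500
```

The thresholds increase with the magnitude, matching the statement that a larger magnitude of access data corresponds to a larger amplitude threshold.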

In this way, for the access data of the target time node, the domain name type to which it belongs can be identified, and the convergence rule corresponding to that domain name type can be acquired. In addition, the data magnitude corresponding to the access data of the target time node can be calculated, and the amplitude threshold corresponding to the magnitude interval in which that data magnitude lies can be acquired. Subsequently, when the convergence rule and the amplitude threshold are used to screen abnormal data, if the access data of the target time node satisfies the corresponding convergence rule and the number of abnormal requests in the access data of the target time node is greater than the corresponding amplitude threshold, the access data of the target time node is judged to be abnormal data to be processed. If the access data of the target time node does not satisfy the corresponding convergence rule, or the number of abnormal requests in it is less than or equal to the corresponding amplitude threshold, the access data of the target time node is not treated as abnormal data to be processed. That is, the convergence rule and the amplitude threshold must both be satisfied for data to be judged abnormal data to be processed; if either condition is not met, the data is not treated as abnormal data to be processed. The order in which the convergence rule and the amplitude threshold are evaluated is not limited in this embodiment.
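The final decision and the layer-by-layer structure of steps S1, S3 and S5 can be summarized in a short sketch. The function and parameter names are hypothetical, and the results of the earlier steps are passed in as plain booleans rather than computed here.

```python
def is_pending_anomaly(meets_convergence_rule, abnormal_requests,
                       amplitude_threshold):
    """Both conditions must hold; the evaluation order does not matter."""
    return bool(meets_convergence_rule) and abnormal_requests > amplitude_threshold


def layered_detection(flagged_by_model, flagged_by_distribution,
                      meets_convergence_rule, abnormal_requests,
                      amplitude_threshold):
    """Apply S1, S3 and S5 in turn; a time node must pass every layer."""
    if not flagged_by_model:          # S1: threshold model
        return False
    if not flagged_by_distribution:   # S3: confidence interval check
        return False
    return is_pending_anomaly(meets_convergence_rule,
                              abnormal_requests, amplitude_threshold)


print(layered_detection(True, True, True, 30, 20))   # True
print(layered_detection(True, True, True, 20, 20))   # False: not above threshold
print(layered_detection(True, False, True, 30, 20))  # False: cleared in S3
```

A count equal to the amplitude threshold is rejected, since the text requires the number of abnormal requests to be strictly greater than the threshold.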

The present application further provides a system for detecting abnormal data, the system comprising:

the threshold model judging unit is used for acquiring access data of a specified time period, training the access data to obtain a threshold model, and judging whether the access data of a target time node is abnormal data or not according to the threshold model;

the distribution judging unit is used for determining a detection interval containing the target time node if the access data of the target time node is judged to be abnormal data, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution;

and the screening unit is used for acquiring a convergence rule and an amplitude threshold value corresponding to the access data of the target time node if the access data of the target time node is judged to be abnormal data again, and judging whether the access data of the target time node is abnormal data to be processed or not based on the convergence rule and the amplitude threshold value.

In one embodiment, the threshold model judging unit includes:

the screening threshold determining module is used for identifying the domain name type to which the access data belongs and acquiring a screening threshold corresponding to the domain name type;

a screening boundary determining module, configured to count access anomaly ratios of each time node in the access data, and determine a screening boundary, where the screening boundary is used to divide the counted access anomaly ratios into aggregation nodes and isolated nodes, where the number of the isolated nodes is determined by the screening threshold;

and the threshold model generation module is used for taking the model with the screening boundary as the threshold model obtained by training.

In one embodiment, the distribution judging unit includes:

the data calculation module is used for counting the access anomaly ratio of each access data sample in the detection interval and calculating the mean and the standard deviation of the counted access anomaly ratios;

and the normal distribution module is used for fitting a normal distribution to the counted access anomaly ratios according to the mean and the standard deviation, and taking the resulting normal distribution as the distribution of the access data samples in the detection interval.

In one embodiment, the screening unit comprises:

the first judging module is used for judging that the access data of the target time node is abnormal data to be processed if the access data of the target time node meets the corresponding convergence rule and the number of abnormal requests in the access data of the target time node is greater than the corresponding amplitude threshold value;

and the second judging module is used for judging that the access data of the target time node is not taken as the abnormal data to be processed if the access data of the target time node does not meet the corresponding convergence rule or the number of abnormal requests in the access data of the target time node is less than or equal to the corresponding amplitude threshold value.

Referring to fig. 7, an embodiment of the present application further provides an abnormal data detection apparatus, where the apparatus includes a processor and a memory, where the memory is used to store a computer program, and when the computer program is executed by the processor, the abnormal data detection method may be implemented.

In this embodiment, the memory may include a physical device for storing information; typically, the information is digitized and then stored in a medium by electrical, magnetic, or optical means. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM, ROM, or USB disks; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, or bubble memories; and devices that store information optically, such as CDs or DVDs. Of course, there are other kinds of memory as well, such as quantum memory or graphene memory.

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.

The embodiments in the present specification are described in a progressive manner; the same and similar parts of the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the embodiments of the system and the apparatus, reference may be made to the description of the method embodiments above.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above description is only an embodiment of the present application, and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for detecting anomalous data, said method comprising:

acquiring access data of a specified time period, training the access data to obtain a threshold model, and judging whether the access data of a target time node is abnormal data or not according to the threshold model; the screening boundary in the threshold model divides isolated nodes, the number of the isolated nodes is determined by a screening threshold corresponding to a domain name type, and the domain name type is classified into a stable type, a periodic variation type and a spike variation type;

if the access data of the target time node is judged to be abnormal data, determining a detection interval which comprises the target time node and corresponds to the domain name type, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution; wherein a confidence interval is determined in the statistical distribution and abnormal data is determined based on the confidence interval;

if the access data of the target time node is judged to be abnormal data again, acquiring a convergence rule and an amplitude threshold value of the domain name type corresponding to the access data of the target time node, and judging whether the access data of the target time node is abnormal data to be processed or not based on the convergence rule and the amplitude threshold value.

2. The method of claim 1, wherein training the access data to obtain a threshold model comprises:

identifying the domain name type to which the access data belongs, and acquiring a screening threshold corresponding to the domain name type;

counting access abnormal proportions of each time node in the access data, and determining a screening boundary, wherein the screening boundary is used for dividing each counted access abnormal proportion into an aggregation node and an isolated node, and the number of the isolated nodes is determined by the screening threshold;

and taking the model with the screening boundary as a threshold model obtained by training.

3. The method of claim 1 or 2, wherein determining whether the access data of the target time node is abnormal data comprises:

calculating an access abnormal proportion corresponding to the access data of the target time node, and inputting the calculated access abnormal proportion into the threshold model; if the result output by the threshold model is an isolated node, judging that the access data of the target time node is abnormal data; and if the result output by the threshold model is an aggregation node, judging that the access data of the target time node is non-abnormal data.
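
The training and judging steps of claims 2 and 3 can be illustrated with a minimal Python sketch. The claims do not state how the screening boundary is computed, so the sketch substitutes a simple sorted cut-off: the screening threshold is assumed to be a fraction of training samples that should end up as isolated nodes, and higher access abnormal proportions are assumed to be the isolated ones. The names `ThresholdModel` and `train_threshold_model` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdModel:
    """Holds one screening boundary on the access abnormal proportion."""
    boundary: float

    def classify(self, proportion: float) -> str:
        # Assumption: proportions strictly above the boundary are isolated nodes.
        return "isolated" if proportion > self.boundary else "aggregation"


def train_threshold_model(proportions: Sequence[float],
                          screening_threshold: float) -> ThresholdModel:
    """Choose a boundary so that about `screening_threshold` (a fraction in
    [0, 1)) of the training samples fall above it as isolated nodes."""
    if not proportions:
        raise ValueError("need at least one training sample")
    if not 0.0 <= screening_threshold < 1.0:
        raise ValueError("screening_threshold must be in [0, 1)")
    ordered = sorted(proportions)
    n_isolated = int(len(ordered) * screening_threshold)
    # The boundary is the largest sample that still counts as an aggregation node.
    boundary = ordered[len(ordered) - n_isolated - 1]
    return ThresholdModel(boundary)


# Illustrative training data: access abnormal proportion per time node.
history = [0.01, 0.02, 0.015, 0.012, 0.018, 0.011, 0.013, 0.5, 0.6, 0.014]
model = train_threshold_model(history, screening_threshold=0.2)
print(model.boundary, model.classify(0.55), model.classify(0.015))
```

In practice the screening threshold would be looked up by domain name type (stable, periodic variation, spike variation) before training; that lookup is omitted here because the claims give no concrete values.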

4. The method of claim 1, wherein determining a detection interval containing the target time node comprises:

identifying the domain name type to which the access data of the target time node belongs, and acquiring the detection duration corresponding to the domain name type;

constructing a detection interval which comprises the target time node and has interval duration equal to the detection duration by taking the target time node as a center; wherein the constructed detection interval is used as the detection interval containing the target time node.
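
A minimal sketch of the interval construction in claim 4, written in Python. The duration values in `DETECTION_DURATION` and the type keys are hypothetical; the claim only states that the detection duration depends on the domain name type and that the interval is centered on the target time node.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Hypothetical detection durations per domain name type (illustrative values only).
DETECTION_DURATION = {
    "stable": timedelta(minutes=30),
    "periodic": timedelta(hours=2),
    "spike": timedelta(minutes=10),
}


def detection_interval(target: datetime, domain_type: str) -> Tuple[datetime, datetime]:
    """Return (start, end) of an interval centered on `target` whose length
    equals the detection duration of the given domain name type."""
    half = DETECTION_DURATION[domain_type] / 2
    return target - half, target + half


start, end = detection_interval(datetime(2019, 12, 6, 12, 0), "stable")
print(start, end)
```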

5. The method of claim 1, wherein counting the distribution of accessed data samples within the detection interval comprises:

counting the access abnormal proportion of each access data sample in the detection interval, and calculating the mean value and standard deviation of the access abnormal proportions obtained by counting;

and fitting a normal distribution to the access abnormal proportions obtained by counting according to the mean value and the standard deviation, and taking the resulting normal distribution as the distribution of the access data samples in the detection interval.

6. The method of claim 5, wherein re-determining whether the access data of the target time node is abnormal comprises:

determining a confidence interval in the result of normal distribution according to the mean and the standard deviation; if the access data of the target time node is located outside the confidence interval, judging that the access data of the target time node is abnormal data; and if the access data of the target time node is located in the confidence interval, judging that the access data of the target time node is non-abnormal data.
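
The second check in claims 5 and 6 can be sketched in Python as follows. The claims do not give the width of the confidence interval, so the multiplier `z` (default 3 standard deviations) is an assumption, as is the use of the population standard deviation. The function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def confidence_interval(proportions: Sequence[float], z: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) = mean -/+ z * std for the sampled abnormal proportions."""
    mean = statistics.fmean(proportions)
    std = statistics.pstdev(proportions)  # population standard deviation (assumption)
    return mean - z * std, mean + z * std


def is_abnormal_again(target_proportion: float,
                      proportions: Sequence[float],
                      z: float = 3.0) -> bool:
    """True if the target's abnormal proportion lies outside the confidence interval."""
    low, high = confidence_interval(proportions, z)
    return target_proportion < low or target_proportion > high


# Illustrative samples from the detection interval.
samples = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
print(is_abnormal_again(0.2, samples), is_abnormal_again(0.0105, samples))
```

The sketch passes the target's proportion separately from the samples. With a small sample that includes the target itself, a single outlier inflates the standard deviation and can remain inside a wide interval, so whether to include it is a design choice the claims leave open.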

7. The method of claim 1, wherein obtaining the convergence rule corresponding to the access data of the target time node comprises:

identifying the domain name type to which the access data of the target time node belongs, and acquiring a convergence rule corresponding to the domain name type; wherein the convergence rule comprises:

taking the target time node as the initial time node, the access data at a specified number of consecutive time nodes are all judged to be abnormal data;

or

abnormal data appears a specified number of times within a preset duration containing the target time node.
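
The two alternative convergence rules of claim 7 can be sketched in Python as below. Here `flags` is a hypothetical list holding, per time node, whether that node was judged abnormal by the earlier checks. Two assumptions are made: the preset duration is modeled as a window of nodes that begins at the target, and "specified times" is read as "at least that many times".

```python
from typing import Sequence


def consecutive_rule(flags: Sequence[bool], start: int, count: int) -> bool:
    """True if `count` consecutive nodes, beginning at index `start`, are all abnormal."""
    window = flags[start:start + count]
    return len(window) == count and all(window)


def frequency_rule(flags: Sequence[bool], start: int,
                   window_size: int, min_abnormal: int) -> bool:
    """True if at least `min_abnormal` abnormal nodes occur in the window of
    `window_size` nodes that begins at index `start` (assumed window placement)."""
    window = flags[start:start + window_size]
    return sum(window) >= min_abnormal


# Illustrative per-node results; index 1 is the target time node.
flags = [False, True, True, True, False, True]
print(consecutive_rule(flags, 1, 3), frequency_rule(flags, 1, 5, 4))
```

Which rule applies, and with what counts, would be selected by domain name type; the claims give no concrete values for these parameters.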

8. The method of claim 1, wherein the amplitude threshold corresponding to the access data of the target time node is divided according to the magnitude of the access data, wherein the larger the magnitude of the access data is, the larger the corresponding amplitude threshold is.

9. The method of claim 1, wherein determining whether the access data of the target time node is abnormal data to be processed comprises:

if the access data of the target time node meets the corresponding convergence rule and the number of abnormal requests in the access data of the target time node is greater than the corresponding amplitude threshold value, determining that the access data of the target time node is abnormal data to be processed;

and if the access data of the target time node does not meet the corresponding convergence rule, or the number of abnormal requests in the access data of the target time node is less than or equal to the corresponding amplitude threshold, judging that the access data of the target time node is not used as the abnormal data to be processed.
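
The final screening of claims 8 and 9 can be sketched in Python as follows. The tier floors and threshold values are hypothetical, and the "magnitude of the access data" is assumed to mean the total request count at the target time node. The only properties taken from the claims are that a larger magnitude maps to a larger amplitude threshold and that the comparison is strictly "greater than".

```python
from bisect import bisect_right

# Hypothetical magnitude tiers: a node with at least TIER_FLOORS[i] total
# requests uses TIER_THRESHOLDS[i] as its amplitude threshold.
TIER_FLOORS = [0, 1_000, 100_000]
TIER_THRESHOLDS = [10, 50, 500]


def amplitude_threshold(total_requests: int) -> int:
    """Look up the amplitude threshold for the given request magnitude."""
    idx = bisect_right(TIER_FLOORS, total_requests) - 1
    return TIER_THRESHOLDS[max(idx, 0)]


def is_pending_anomaly(meets_convergence: bool,
                       abnormal_requests: int,
                       total_requests: int) -> bool:
    """Abnormal data to be processed only if the convergence rule is met AND
    the abnormal request count exceeds the amplitude threshold."""
    return meets_convergence and abnormal_requests > amplitude_threshold(total_requests)


print(is_pending_anomaly(True, 51, 2_000), is_pending_anomaly(True, 50, 2_000))
```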

10. A system for detecting anomalous data, said system comprising:

the threshold model judging unit is used for acquiring access data of a specified time period, training the access data to obtain a threshold model, and judging whether the access data of a target time node is abnormal data or not according to the threshold model; the screening boundary in the threshold model divides isolated nodes, the number of the isolated nodes is determined by a screening threshold corresponding to a domain name type, and the domain name type is classified into a stable type, a periodic variation type and a spike variation type;

the distribution judging unit is used for determining a detection interval which contains the target time node and corresponds to the domain name type if the access data of the target time node is judged to be abnormal data, counting the distribution of access data samples in the detection interval, and judging whether the access data of the target time node is abnormal data again according to the counted distribution; wherein a confidence interval is determined in the statistical distribution and abnormal data is determined based on the confidence interval;

and the screening unit is used for acquiring a convergence rule and an amplitude threshold value of the domain name type corresponding to the access data of the target time node if the access data of the target time node is judged to be abnormal data again, and judging whether the access data of the target time node is abnormal data to be processed or not based on the convergence rule and the amplitude threshold value.

11. The system according to claim 10, wherein the threshold model determining unit comprises:

12. The system according to claim 10, wherein the distribution judgment unit includes:

13. The system of claim 10, wherein the screening unit comprises:

14. An apparatus for detection of anomalous data, characterized in that it comprises a memory for storing a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 9.