WO2021109314A1 - Method, system and device for detecting abnormal data - Google Patents

Method, system and device for detecting abnormal data

Info

Publication number
WO2021109314A1
WO2021109314A1 PCT/CN2020/070703 CN2020070703W WO2021109314A1 WO 2021109314 A1 WO2021109314 A1 WO 2021109314A1 CN 2020070703 W CN2020070703 W CN 2020070703W WO 2021109314 A1 WO2021109314 A1 WO 2021109314A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
access data
time node
target time
abnormal
Prior art date
Application number
PCT/CN2020/070703
Other languages
English (en)
French (fr)
Inventor
陈芹浩
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Publication of WO2021109314A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Definitions

  • the present invention relates to the technical field of data processing, in particular to a detection method, system and equipment for abnormal data.
  • an alarm module can be configured.
  • the alarm module can send out alarm information in a timely manner, so that the network manager can detect and repair the abnormality, and prevent the user from being in a data inaccessible state for a long time.
  • the current data alarm method usually sets multiple types of alarm information in advance. If the actual data abnormality matches one of the types, the corresponding alarm information will be sent out.
  • the types of data anomalies in the network are very complicated, and the number is quite large. In fact, some data anomalies occur within the allowable range. According to the existing alarm method, a lot of unnecessary alarm information will be generated. On the one hand, it will consume a lot of manpower and material resources for abnormal investigation, and on the other hand, it may make the really serious data anomalies drown in the numerous alarm information. Therefore, there is an urgent need for an accurate detection method for abnormal data.
  • the purpose of this application is to provide a detection method, system and equipment for abnormal data, which can improve the accuracy of abnormal data detection.
  • one aspect of the present application provides a method for detecting abnormal data.
  • the method includes: acquiring access data for a specified period of time, training a threshold model based on the access data, and judging, according to the threshold model, whether the access data of a target time node is abnormal data; if it is determined that the access data of the target time node is abnormal data, determining a detection interval that contains the target time node, counting the distribution of the access data samples in the detection interval, and judging again, according to the counted distribution, whether the access data of the target time node is abnormal data; if it is determined again that the access data of the target time node is abnormal data, obtaining the convergence rule and the amplitude threshold corresponding to the access data of the target time node, and determining, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
  • the present application also provides a detection system for abnormal data on another aspect.
  • the system includes: a threshold model judging unit, configured to obtain access data for a specified period of time, train a threshold model based on the access data, and judge, according to the threshold model, whether the access data of a target time node is abnormal data; a distribution judging unit, configured to, if it is determined that the access data of the target time node is abnormal data, determine a detection interval that contains the target time node, count the distribution of the access data samples in the detection interval, and judge again, according to the counted distribution, whether the access data of the target time node is abnormal data; and a screening unit, configured to, if it is determined again that the access data of the target time node is abnormal data, obtain the convergence rule and the amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
  • another aspect of the present application also provides a device for detecting abnormal data.
  • the device includes a processor and a memory.
  • the memory is used to store a computer program.
  • when the computer program is executed by the processor, the above-mentioned abnormal data detection method is implemented.
  • with the technical solutions provided by one or more embodiments of the present application, a threshold model can be trained on the access data within a specified time period when detecting abnormal data.
  • according to the threshold model, the access data of the target time node can be judged preliminarily. If it is determined to be abnormal data, the detection interval containing the target time node can be determined, and the distribution of the access data samples in the detection interval can be counted. According to the counted distribution, it can be further determined whether the access data is abnormal data.
  • the benefit of this processing is that data flagged as abnormal by the unified threshold model may not actually be abnormal for a particular period of the day. By performing distribution statistics on the access data samples within the detection interval, it is possible to clarify further whether the access data is abnormal.
  • if the access data is still judged to be abnormal data, the convergence rule and amplitude threshold of the target time node can then be obtained.
  • the convergence rule filters out sudden, short-lived data anomalies that do not actually need to be handled.
  • the amplitude threshold prevents access data from being misjudged as abnormal merely because the number of access requests is too small.
  • in this way, the final abnormal data to be processed can be determined. It can be seen that, through multiple rounds of screening, the detection of abnormal data becomes more accurate.
  • Figure 1 is a step diagram of a method for detecting abnormal data in an embodiment of the present invention
  • Figure 2 is a schematic diagram of a stationary domain name in an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of a periodically changing domain name in an embodiment of the present invention.
  • Fig. 4 is a schematic diagram of a spur variant domain name in an embodiment of the present invention.
  • Fig. 5 is a schematic diagram of an isolated forest algorithm in an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of data node division in an embodiment of the present invention.
  • Fig. 7 is a schematic structural diagram of an abnormal data detection device in an embodiment of the present invention.
  • This application provides a method for detecting abnormal data. Please refer to FIG. 1.
  • the method may include the following multiple steps.
  • S1 Obtain access data for a specified period of time, train to obtain a threshold model based on the access data, and determine whether the access data of the target time node is abnormal data according to the threshold model.
  • the isolation forest algorithm can be trained on the access data in a specified time period, so as to obtain a threshold model for distinguishing normal data from abnormal data.
  • the above specified time period can be flexibly set according to the training accuracy and training duration of the threshold model.
  • the access data of the last 30 days can be obtained.
  • the service characteristics corresponding to different types of domain names may also be different.
  • the proportion of abnormal data and the time node when abnormal data appear may be different.
  • domain names can be divided into three types: stationary, periodically changing, and spur-changing.
  • the graphs of abnormal data corresponding to these three domain name types can be shown in Figure 2, Figure 3, and Figure 4, respectively.
  • as shown in Figure 2, over time the proportion of abnormal data in the access data of a stationary domain name always stays within a small interval.
  • as shown in Figure 3, the proportion of abnormal data in the access data of a periodically changing domain name changes periodically over time.
  • as shown in Figure 4, the proportion of abnormal data in the access data of a spur-changing domain name shows sharp changes.
  • the access data corresponding to different domain name types can be attached with corresponding domain name tags.
  • the domain name label can be a data identifier manually set by the administrator to distinguish between different domain name types, or it can be a feature identifier obtained after big data analysis based on the access data of different domain name types.
  • this application does not limit how the domain name label is obtained.
  • different training data can be selected to train different threshold models.
  • the threshold model can be trained in the same way, except that the access data corresponding to the domain name type is used in each training process.
  • the access data can be classified according to the domain name type first.
  • the domain name type to which the access data belongs can be identified according to the domain name tag carried in the access data. Since different domain name types have different tolerances for the proportion of abnormal data, when threshold model training is performed, corresponding screening thresholds can be assigned to each domain name type.
  • the screening threshold may indicate the maximum proportion of abnormal data that can be tolerated by the corresponding domain name type.
  • the screening threshold may be, for example, 1/1440, which means that only 1 abnormal access data point is allowed in every 1440 access data points.
  • for domain name types that tolerate a larger proportion of abnormal data, the corresponding screening threshold can be slightly larger: for example, 10/1440 for periodically changing domain names and 20/1440 for spur-changing domain names.
  • the screening threshold corresponding to the domain name type can be obtained. Then, the abnormal access ratio at each time node in the access data can be counted. In practical applications, a time node may be 1 minute, so the obtained access data can be divided at a granularity of 1 minute.
  • the access data for each minute may include normal access data and abnormal access data. By calculating the share of abnormal access data in the total access data of each minute, the abnormal access ratio for every minute in the specified time period can be obtained.
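The per-minute ratio described above can be sketched as follows. This is only an illustration: the record layout `(timestamp, is_abnormal)` and the function name `abnormal_ratio_per_minute` are assumptions, and the source does not say how an individual access is labelled abnormal.

```python
from collections import defaultdict
from datetime import datetime


def abnormal_ratio_per_minute(records):
    """Map each 1-minute time node to abnormal accesses / total accesses.

    `records` is an iterable of (timestamp, is_abnormal) pairs; this layout
    is hypothetical and not taken from the source.
    """
    totals = defaultdict(int)
    abnormal = defaultdict(int)
    for timestamp, is_abnormal in records:
        # Truncate the timestamp to the minute to get the time node.
        node = timestamp.replace(second=0, microsecond=0)
        totals[node] += 1
        if is_abnormal:
            abnormal[node] += 1
    return {node: abnormal[node] / totals[node] for node in sorted(totals)}
```

Minutes with no accesses at all simply do not appear in the result, which avoids a division by zero.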
  • the isolation forest algorithm can treat each abnormal access ratio as a data node and isolate the nodes layer by layer, as shown in Figure 5.
  • the earlier a node is isolated, the more likely it is to be an abnormal node.
  • for example, in Figure 5 there are four nodes a, b, c and d; the first to be isolated is node d, so node d is likely to be an abnormal node.
  • the black dots can represent the data nodes corresponding to the abnormal ratio of accesses. It can be seen that most data nodes will be clustered together, while a small number of data nodes will be discrete.
  • the closed screening boundary shown in FIG. 6 can be obtained.
  • the data nodes located within the screening boundary can be referred to as aggregation nodes, and the data nodes located outside the screening boundary can be referred to as isolated nodes.
  • isolated nodes can be regarded as abnormal data nodes, and the number of isolated nodes in Figure 6 can be determined by the screening threshold corresponding to the domain name type. In this way, by introducing a given screening threshold into the isolation forest algorithm, the range enclosed by the screening boundary can be limited. Through continued training with a large amount of data, the location of the screening boundary becomes more and more accurate. Ultimately, for any input sample, it can be determined accurately whether the sample falls inside or outside the screening boundary. A model with this screening boundary can then be used as the trained threshold model. Of course, a corresponding threshold model can be trained for each domain name.
  • the access abnormality ratio corresponding to the access data of the target time node can be calculated in the above-mentioned manner, and the calculated access abnormality ratio can be input into the threshold model.
  • through the threshold model, it can be determined whether the input abnormal access ratio is an isolated node or an aggregation node. If the output of the threshold model is an isolated node, it can be determined that the access data of the target time node is abnormal data. If the output of the threshold model is an aggregation node, it can be determined that the access data of the target time node is non-abnormal data.
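To make the idea concrete, here is a minimal, pure-Python sketch of isolation on one-dimensional abnormal access ratios. It is a simplification written for illustration and not the patent's implementation: a real isolation forest builds trees on random subsamples, whereas this version re-splits the full training set for every query. The names `isolation_depth` and `ThresholdModel` and the defaults `n_trees` and `seed` are assumptions. The screening threshold fixes how many training nodes end up outside the boundary.

```python
import random


def isolation_depth(x, sample, rng, depth=0, max_depth=50):
    """Number of random splits needed to separate x from a 1-D sample."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in sample if (v < split) == (x < split)]
    return isolation_depth(x, same_side, rng, depth + 1, max_depth)


class ThresholdModel:
    """Flags ratios that are isolated at least as early as the cutoff depth."""

    def __init__(self, ratios, screening_threshold, n_trees=100, seed=0):
        self.ratios = list(ratios)
        self.n_trees = n_trees
        self.seed = seed
        depths = sorted(self.average_depth(r) for r in self.ratios)
        # The screening threshold sets the number of isolated training nodes.
        n_isolated = max(1, round(len(self.ratios) * screening_threshold))
        self.cutoff = depths[n_isolated - 1]

    def average_depth(self, x):
        rng = random.Random(self.seed)  # fixed seed keeps results repeatable
        total = sum(isolation_depth(x, self.ratios, rng)
                    for _ in range(self.n_trees))
        return total / self.n_trees

    def is_isolated(self, x):
        return self.average_depth(x) <= self.cutoff
```

A shallow average depth means the ratio is separated from the rest after very few splits, which corresponds to a node outside the screening boundary. Ties at the cutoff depth can flag slightly more nodes than the screening threshold allows; a production system would more likely rely on an established library implementation.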
  • the selected data is chosen at random, but in practice the volume of access data may vary greatly at different times, so the abnormal access ratio at different time nodes will also vary greatly. Some time nodes with a large abnormal access ratio may be caused by a sudden increase in access data. The abnormal access ratio of these time nodes is actually acceptable and should not be treated as abnormal data.
  • therefore, data detected as abnormal by the threshold model can be checked a second time.
  • the detection interval containing the target time node can be determined first, and the detection interval may correspond to a period of detection time.
  • for example, the detection interval may be centered on the target time node and extend 5 minutes before and after it, for a total detection duration of 10 minutes.
  • for different domain name types, the detection duration can also be different.
  • for stationary domain names, the detection duration may be relatively short, for example 20 minutes.
  • for periodically changing and spur-changing domain names, the detection duration can be relatively long, for example 1 hour and 2 hours, respectively.
  • the detection duration corresponding to the domain name type can be obtained according to the domain name type to which the access data of the target time node belongs. Then, the target time node may be taken as the center of the detection interval, and a detection interval containing the target time node and having the interval duration equal to the acquired detection duration may be constructed. After the detection interval is constructed, the access data in the detection interval can be obtained.
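The construction of a detection interval centered on the target time node can be sketched as below. The mapping `DETECTION_DURATION` and its keys are hypothetical; the durations follow the examples given in the text, but which duration belongs to which domain name type is an assumption here.

```python
from datetime import datetime, timedelta

# Hypothetical per-type detection durations, loosely based on the examples above.
DETECTION_DURATION = {
    "stationary": timedelta(minutes=20),
    "periodic": timedelta(hours=1),
    "spur": timedelta(hours=2),
}


def detection_interval(target_node, domain_type):
    """Return (start, end) of an interval centered on the target time node."""
    half = DETECTION_DURATION[domain_type] / 2
    return target_node - half, target_node + half
```

For a target time node of 12:05 and a 20-minute duration, this yields 11:55 to 12:15.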
  • the access data obtained here is for the identified domain name type, and access data of other domain name types can be filtered out first.
  • the access data in the detection interval on each day within a certain period of time may be taken as the object to be analyzed. For example, if the access data of a target domain name at the target time node 12:05 is initially determined to be abnormal data, then the access data of that domain name from 11:55 to 12:15 on each of the last 30 days can all be used as data for further analysis. After the data in the detection interval is acquired, the abnormal access ratio of each access data sample in the detection interval can be counted. As before, the abnormal access ratio can be computed at a granularity of 1 minute, so that for each day one abnormal access ratio is generated for every minute within the detection interval.
  • the mean and standard deviation of the counted abnormal access ratios can then be calculated. The purpose of calculating them is to fit the counted abnormal access ratios to a normal distribution characterized by that mean and standard deviation.
  • the normal distribution can reflect the general characteristics of the data. Generally speaking, the part of the data in the middle of the normal distribution can be regarded as normal data. The data at the edge of the normal distribution may be abnormal data. In the result of this normal distribution, the most central data corresponds to the calculated mean, which can be diffused in units of standard deviations from the center to both sides. In this way, after the results of the normal distribution of the abnormal proportion of visits are obtained by statistics, the confidence interval can be determined in the results of the normal distribution according to the mean value and the standard deviation.
  • the confidence interval may be $(\mu - 3\sigma,\ \mu + 3\sigma)$, where $\mu$ is the mean and $\sigma$ is the standard deviation, and abnormal access ratios within the confidence interval can be regarded as normal data.
  • the abnormal proportion of access outside the confidence interval is the abnormal data.
  • the location of the access data of the target time node can be identified in the result. If the access data of the target time node is outside the confidence interval, it can be determined that the access data of the target time node is abnormal data. If the access data of the target time node is within the confidence interval, it can be determined that the access data of the target time node is non-abnormal data.
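The second check can be sketched with the standard library. The function name `outside_confidence_interval` is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the source does not say which estimator is used.

```python
from statistics import mean, pstdev


def outside_confidence_interval(history, target_ratio, k=3.0):
    """True if target_ratio lies outside (mu - k*sigma, mu + k*sigma).

    `history` holds the per-minute abnormal access ratios collected from the
    detection interval on previous days.
    """
    mu = mean(history)
    sigma = pstdev(history)  # assumption: population standard deviation
    if sigma == 0:
        # Degenerate case: every historical ratio is identical.
        return target_ratio != mu
    return not (mu - k * sigma < target_ratio < mu + k * sigma)
```

A ratio that returns `True` here remains a candidate abnormality and moves on to the convergence rule and amplitude threshold checks.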
  • in order to further improve the accuracy of data detection, a convergence rule can be applied.
  • the convergence rule can be determined based on the service characteristics corresponding to the domain name type. For example, domain names can be divided into banking, payment, and on-demand fields according to business characteristics, and different convergence rules can be formulated for these different fields.
  • the convergence rule can be used to comprehensively consider the occurrence of abnormal data within a period of time, so as to determine whether the access data at a certain target time node is truly abnormal data to be processed. The purpose of this processing is that the abnormal data determined in steps S1 and S3 is likely to belong to sudden abnormal data.
  • the configured amplitude threshold can be used to determine whether the number of abnormal requests in the access data at the target time node is sufficient from the perspective of absolute value.
  • the purpose of such processing is that for certain target time nodes, the calculated access abnormality ratio will be relatively high, but this calculation result is often caused by a decrease in the total number of accesses. In fact, the number of abnormal requests has not changed, but because the total number of access requests has decreased, it appears that the percentage of abnormal access requests is relatively high. In this case, there is no need to waste manpower and material resources to deal with it.
  • the determined abnormal data can be further filtered according to the convergence rule and the amplitude threshold.
  • the convergence rules can be different according to the type of domain name.
  • the convergence rule may be that, starting from the target time node, the access data at a specified number of consecutive time nodes are all determined to be abnormal data.
  • the convergence rule may also be that abnormal data occurs a specified number of times within a preset time period that includes the target time node. For example, for a stationary domain name, the convergence rule may be that the access data of 4 consecutive minutes are all determined to be abnormal data.
  • for a periodically changing domain name, the convergence rule can be that abnormal data occurs 6 times within 10 minutes.
  • for a spur-changing domain name, the convergence rule can be that abnormal data occurs 10 times within 20 minutes.
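The two forms of convergence rule can be sketched as follows, assuming `flags` is a list with one boolean per 1-minute time node indicating whether that node was judged abnormal by the earlier checks. The function names are hypothetical. For the counting rule, the source only says the window includes the target time node; placing the window so that it ends at the target node is an assumption.

```python
def consecutive_rule(flags, target, n):
    """True if the n consecutive time nodes starting at `target` are all abnormal."""
    window = flags[target:target + n]
    return len(window) == n and all(window)


def count_in_window_rule(flags, target, window, min_count):
    """True if at least min_count of the `window` nodes ending at `target` are abnormal."""
    start = max(0, target - window + 1)
    return sum(flags[start:target + 1]) >= min_count
```

With the examples in the text, a stationary domain name would use `consecutive_rule(flags, target, 4)`, while the counting form would be called as `count_in_window_rule(flags, target, 10, 6)` or `count_in_window_rule(flags, target, 20, 10)`.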
  • the amplitude threshold can be divided according to the magnitude of the accessed data.
  • the magnitude of the access data may be measured, for example, in QPS (Queries Per Second, the number of requests per second).
  • several different magnitude intervals can be set, and each magnitude interval can correspond to its own amplitude threshold.
  • the domain name type to which the access data of the target time node belongs can be identified, and the convergence rule corresponding to the domain name type can be obtained.
  • the data magnitude corresponding to the access data of the target time node can also be calculated, and the amplitude threshold corresponding to the magnitude interval in which the data magnitude is located can be obtained.
  • if the access data of the target time node does not meet the corresponding convergence rule, or the number of abnormal requests in the access data of the target time node is less than or equal to the corresponding amplitude threshold, it is determined that the access data of the target time node is not abnormal data to be processed. That is to say, both the convergence rule and the amplitude threshold conditions must be met before the data is judged to be abnormal data to be processed. If either one is not satisfied, the data is not regarded as abnormal data to be processed.
  • the order of determining the convergence rule and the amplitude threshold is not limited in this embodiment.
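The final screening step can be sketched as below. The magnitude intervals in `QPS_BOUNDS` and the values in `AMPLITUDE_THRESHOLDS` are invented for illustration; the source only states that each magnitude interval has its own amplitude threshold.

```python
from bisect import bisect_right

# Hypothetical magnitude intervals: [0, 100), [100, 1000), [1000, 10000), [10000, inf)
QPS_BOUNDS = [100, 1000, 10000]
# Hypothetical amplitude threshold for each interval above.
AMPLITUDE_THRESHOLDS = [5, 20, 100, 500]


def amplitude_threshold(qps):
    """Look up the amplitude threshold for the interval that contains qps."""
    return AMPLITUDE_THRESHOLDS[bisect_right(QPS_BOUNDS, qps)]


def is_abnormal_to_process(meets_convergence_rule, abnormal_requests, qps):
    """Both conditions must hold: rule satisfied AND count above the threshold."""
    return meets_convergence_rule and abnormal_requests > amplitude_threshold(qps)
```

Because the two conditions are combined with a logical AND, the order in which they are evaluated does not affect the result.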
  • the present application also provides a detection system for abnormal data, the system includes:
  • a threshold model judging unit configured to obtain access data for a specified period of time, train to obtain a threshold model based on the access data, and determine whether the access data of the target time node is abnormal data according to the threshold model;
  • the distribution judgment unit is configured to, if it is judged that the access data of the target time node is abnormal data, determine the detection interval that contains the target time node, count the distribution of the access data samples in the detection interval, and judge again, according to the counted distribution, whether the access data of the target time node is abnormal data;
  • the screening unit is configured to, if it is determined again that the access data of the target time node is abnormal data, obtain the convergence rule and the amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
  • the threshold model judgment unit includes:
  • the screening threshold determination module is used to identify the domain name type to which the access data belongs, and obtain the screening threshold value corresponding to the domain name type;
  • the screening boundary determination module is used to count the abnormal access ratio of each time node in the access data and determine the screening boundary, where the screening boundary divides the counted abnormal access ratios into aggregation nodes and isolated nodes, and the number of isolated nodes is determined by the screening threshold;
  • the threshold model generation module is used to use the model with the screening boundary as the threshold model obtained by training.
  • the distribution judgment unit includes:
  • a data calculation module configured to count the abnormal access ratio of each access data sample in the detection interval, and calculate the mean value and standard deviation of the abnormal access ratio obtained by statistics
  • the normal distribution module is configured to fit a normal distribution to the counted abnormal access ratios according to the mean value and the standard deviation, and use the resulting normal distribution as the distribution of the access data samples in the detection interval.
  • the screening unit includes:
  • the first determination module is configured to determine that the access data of the target time node is abnormal data to be processed if the access data of the target time node meets the corresponding convergence rule and the number of abnormal requests in the access data of the target time node is greater than the corresponding amplitude threshold;
  • the second determination module is configured to determine that the access data of the target time node is not abnormal data to be processed if the access data of the target time node does not meet the corresponding convergence rule, or the number of abnormal requests in the access data of the target time node is less than or equal to the corresponding amplitude threshold.
  • an embodiment of the present application also provides a device for detecting abnormal data.
  • the device includes a processor and a memory.
  • the memory is used to store a computer program.
  • when the computer program is executed by the processor, the above-mentioned abnormal data detection method can be implemented.
  • the memory may include a physical device for storing information, which is usually digitized and then stored in a medium using electrical, magnetic, or optical methods.
  • the memory described in this embodiment may also include: a device that uses electrical energy to store information, such as RAM or ROM; a device that uses magnetic energy to store information, such as a hard disk, floppy disk, magnetic tape, magnetic core memory, bubble memory, or U disk; and a device that uses optical means to store information, such as a CD or DVD.
  • of course, there are other types of memory, such as quantum memory or graphene memory.
  • the processor can be implemented in any suitable manner.
  • the processor may take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • with the technical solutions provided by one or more embodiments of the present application, a threshold model can be trained on the access data within a specified time period when detecting abnormal data.
  • according to the threshold model, the access data of the target time node can be judged preliminarily. If it is determined to be abnormal data, the detection interval containing the target time node can be determined, and the distribution of the access data samples in the detection interval can be counted. According to the counted distribution, it can be further determined whether the access data is abnormal data.
  • the benefit of this processing is that data flagged as abnormal by the unified threshold model may not actually be abnormal for a particular period of the day. By performing distribution statistics on the access data samples within the detection interval, it is possible to clarify further whether the access data is abnormal.
  • if the access data is still judged to be abnormal data, the convergence rule and amplitude threshold of the target time node can then be obtained.
  • the convergence rule filters out sudden, short-lived data anomalies that do not actually need to be handled.
  • the amplitude threshold prevents access data from being misjudged as abnormal merely because the number of access requests is too small.
  • in this way, the final abnormal data to be processed can be determined. It can be seen that, through multiple rounds of screening, the detection of abnormal data becomes more accurate.
  • the embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention may adopt a form of a complete hardware implementation, a complete software implementation, or a combination of software and hardware implementations. Moreover, the present invention may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in computer readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

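A method, system, and device for detecting abnormal data, wherein the method comprises: acquiring access data of a specified period, training a threshold model based on the access data, and determining, according to the threshold model, whether the access data of a target time node is abnormal data (S1); if the access data of the target time node is abnormal data, determining a detection interval, collecting statistics on the distribution of access data samples within the detection interval, and determining again whether the access data of the target time node is abnormal data (S2); if the access data of the target time node is again determined to be abnormal data, acquiring a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determining, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed (S3). The method, system, and device can improve the accuracy of abnormal data detection.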

Description

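Method, system, and device for detecting abnormal data
Technical Field
The present invention relates to the technical field of data processing, and in particular to a method, system, and device for detecting abnormal data.
Background
In current CDNs (Content Delivery Networks), an alarm module can be configured to improve user experience. When a data access abnormality occurs, the alarm module can promptly issue alarm information, so that network administrators can detect and repair the abnormality and prevent users from being unable to access data for a long time.
Current data alarm approaches usually preset multiple types of alarm information; if an actual data abnormality matches one of the types, the corresponding alarm information is issued. However, data abnormalities in a network are of many varied types and large in number, and some of them actually occur within an allowable range. Under the existing alarm approach, a great deal of unnecessary alarm information is generated. On the one hand, this consumes considerable manpower and material resources on troubleshooting; on the other hand, truly serious data abnormalities may be drowned out by the many alarms. Therefore, an accurate means of detecting abnormal data is urgently needed.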
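Summary of the Invention
The purpose of the present application is to provide a method, system, and device for detecting abnormal data, which can improve the accuracy of abnormal data detection.
To achieve the above purpose, one aspect of the present application provides a method for detecting abnormal data, the method comprising: acquiring access data of a specified period, training a threshold model based on the access data, and determining, according to the threshold model, whether the access data of a target time node is abnormal data; if the access data of the target time node is determined to be abnormal data, determining a detection interval containing the target time node, collecting statistics on the distribution of access data samples within the detection interval, and determining again, according to the distribution, whether the access data of the target time node is abnormal data; if the access data of the target time node is again determined to be abnormal data, acquiring a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determining, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
To achieve the above purpose, another aspect of the present application provides a system for detecting abnormal data, the system comprising: a threshold model judging unit, configured to acquire access data of a specified period, train a threshold model based on the access data, and determine, according to the threshold model, whether the access data of a target time node is abnormal data; a distribution judging unit, configured to, if the access data of the target time node is determined to be abnormal data, determine a detection interval containing the target time node, collect statistics on the distribution of access data samples within the detection interval, and determine again, according to the distribution, whether the access data of the target time node is abnormal data; and a screening unit, configured to, if the access data of the target time node is again determined to be abnormal data, acquire a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
To achieve the above purpose, another aspect of the present application provides a device for detecting abnormal data, the device comprising a processor and a memory, the memory being configured to store a computer program which, when executed by the processor, implements the above method for detecting abnormal data.
As can be seen from the above, in the technical solutions provided by one or more embodiments of the present application, a threshold model can be trained on the access data of a specified period when detecting abnormal data. The threshold model provides a preliminary judgment on the access data of a target time node. If the data is judged abnormal, a detection interval containing the target time node can be determined and the distribution of the access data samples within it can be computed. The distribution result is then used to further determine whether the access data is abnormal. The benefit is that data judged abnormal by a uniform threshold model may not actually be abnormal within a particular period; computing the distribution of the access data samples within that period makes it clearer whether the access data is abnormal. If the access data is still judged abnormal, the convergence rule and amplitude threshold of the target time node can be acquired. The convergence rule filters out sudden, one-off data abnormalities, which do not actually need handling, and the amplitude threshold avoids misjudgment of abnormal data caused by too few requests in the access data. Further screening by the convergence rule and the amplitude threshold determines the abnormal data that finally needs to be processed. Layer-by-layer screening by multiple means thus makes abnormal data detection more accurate.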
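Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of the steps of the method for detecting abnormal data in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a stable domain name in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a periodically varying domain name in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a spike-varying domain name in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the isolation forest algorithm in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the partition of data nodes in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the device for detecting abnormal data in an embodiment of the present invention.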
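Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The present application provides a method for detecting abnormal data. Referring to FIG. 1, the method may include the following steps.
S1: Acquire access data of a specified period, train a threshold model based on the access data, and determine, according to the threshold model, whether the access data of a target time node is abnormal data.
In this embodiment, the isolation forest algorithm may be used to train on the access data of the specified period, yielding a threshold model for distinguishing normal data from abnormal data. The specified period can be set flexibly according to the required training accuracy and training time of the threshold model. In one specific application example, the access data of the most recent 30 days may be acquired.
It should be noted that different types of domain names may have different service characteristics; when these domain names provide services, both the proportion of abnormal data and the time nodes at which abnormal data appears may differ. By analyzing the access data of a large number of domain names, three domain name types can be identified: stable, periodically varying, and spike-varying. The abnormal-data curves for these three types may be as shown in FIG. 2, FIG. 3, and FIG. 4, respectively. In FIG. 2, the proportion of abnormal data in the access data of a stable domain name remains within a small range over time. In FIG. 3, the proportion of abnormal data in the access data of a periodically varying domain name changes periodically over time. In FIG. 4, the proportion of abnormal data in the access data of a spike-varying domain name changes sharply. Access data of different domain name types may carry a corresponding domain name label. The label may be a data identifier set manually by an administrator to distinguish domain name types, or a feature identifier obtained through big-data analysis of the access data of different domain name types; the present application does not limit how the domain name label is obtained. To improve the accuracy of abnormal data detection, different training data may be selected for different domain name types, so that different threshold models are trained.
Specifically, the threshold model can be trained in the same way for each of the above domain name types, except that each training run uses the access data of the corresponding domain name type. In practice, when training the threshold model, if the acquired access data of the specified period contains access data of different domain name types, the access data can first be classified by domain name type. Then, for the access data to be trained on, the domain name type to which it belongs can be identified from the domain name label it carries. Because different domain name types tolerate different proportions of abnormal data, a corresponding filter threshold can be assigned to each domain name type when training the threshold model. The filter threshold represents the maximum proportion of abnormal data that the corresponding domain name type can tolerate. Taking a stable domain name as an example, the filter threshold may be 1/1440, meaning that only 1 abnormal sample is allowed in every 1440 access data samples. For periodically varying and spike-varying domain names, the filter threshold may be somewhat larger; for example, 10/1440 for periodically varying domain names and 20/1440 for spike-varying domain names.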
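The per-type filter thresholds above can be sketched as a small lookup table. This is a minimal illustration, not the applicant's implementation: the type names `stable`, `periodic`, and `spike` and the helper `max_isolated_nodes` are hypothetical, while the fractions are the example values given in the text.

```python
from fractions import Fraction

# Example filter thresholds from the text: the largest share of abnormal
# samples each domain name type tolerates. The keys are hypothetical labels.
FILTER_THRESHOLDS = {
    "stable": Fraction(1, 1440),
    "periodic": Fraction(10, 1440),
    "spike": Fraction(20, 1440),
}


def max_isolated_nodes(domain_type: str, sample_count: int) -> int:
    """Upper bound on how many samples may be labelled isolated (abnormal)."""
    return int(FILTER_THRESHOLDS[domain_type] * sample_count)


# 30 days of per-minute samples for a stable domain name.
print(max_isolated_nodes("stable", 30 * 1440))
```

With one sample per minute, 1440 is the number of samples per day, so the stable threshold amounts to roughly one tolerated abnormal minute per day.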
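In this way, after the domain name type of the access data is identified, the filter threshold corresponding to that type can be obtained. The abnormal access ratio of each time node in the access data can then be computed. In practice, a time node may be 1 minute, so the acquired access data is divided at 1-minute granularity. Each minute of access data may include normal and abnormal access data; by calculating the share of abnormal access data in the total access data of that minute, the abnormal access ratio for every minute of the specified period is obtained. The isolation forest algorithm can then be applied to these ratios: each abnormal access ratio is treated as a data node, and the ratios are isolated layer by layer as shown in FIG. 5. The earlier a node is isolated, the more likely it is to be abnormal. For example, FIG. 5 contains four nodes a, b, c, and d; node d is isolated first, so it is the most likely to be abnormal. The isolation forest algorithm eventually partitions the nodes, giving the partition shown in FIG. 6, in which the black dots represent the data nodes corresponding to abnormal access ratios. Most data nodes cluster together, while a few are scattered. The algorithm yields the closed filter boundary shown in FIG. 6; data nodes inside the boundary are called aggregated nodes, and data nodes outside it are called isolated nodes. Isolated nodes are regarded as abnormal data nodes, and the number of isolated nodes in FIG. 6 is determined by the filter threshold of the domain name type. Introducing a fixed filter threshold into the isolation forest algorithm thus limits the extent of the filter boundary. With continued training on large amounts of data, the position of the filter boundary becomes increasingly precise, until for any input sample the boundary can accurately determine whether the sample falls inside or outside it. The model with this filter boundary is then taken as the trained threshold model. A corresponding threshold model can, of course, be trained for each domain name type.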
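The per-minute abnormal access ratio described above can be sketched as follows. The record format, a `(timestamp, is_abnormal)` pair per request, is an assumption made for illustration; the source does not say how an individual request is marked abnormal.

```python
from collections import defaultdict
from datetime import datetime


def abnormal_ratio_per_minute(records):
    """Group requests by minute and return {minute: abnormal / total}.

    `records` is an iterable of (timestamp, is_abnormal) pairs; this input
    shape is hypothetical, chosen only to keep the sketch self-contained.
    """
    totals = defaultdict(int)
    abnormal = defaultdict(int)
    for ts, is_abnormal in records:
        minute = ts.replace(second=0, microsecond=0)  # 1-minute time node
        totals[minute] += 1
        if is_abnormal:
            abnormal[minute] += 1
    return {minute: abnormal[minute] / totals[minute] for minute in totals}


sample = [
    (datetime(2019, 12, 6, 12, 5, 1), False),
    (datetime(2019, 12, 6, 12, 5, 20), True),
    (datetime(2019, 12, 6, 12, 5, 40), False),
    (datetime(2019, 12, 6, 12, 5, 59), False),
    (datetime(2019, 12, 6, 12, 6, 3), False),
]
ratios = abnormal_ratio_per_minute(sample)
```

Each resulting ratio is one data node for the isolation step; minutes with no requests simply produce no node.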
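In this embodiment, once the threshold model has been trained, it can be used for a preliminary judgment on the access data to be detected. Taking the access data of any target time node as an example, the abnormal access ratio of that data is computed as described above and input into the threshold model, which determines whether the ratio is an isolated node or an aggregated node. If the threshold model outputs an isolated node, the access data of the target time node is determined to be abnormal data; if it outputs an aggregated node, the access data is determined to be non-abnormal data.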
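Training and querying such a threshold model can be sketched with a small one-dimensional isolation forest. This is a simplified illustration under stated assumptions, not the patented model: the class `ThresholdModel`, its parameters, and the use of a score cutoff in place of the closed filter boundary of FIG. 6 are all choices made here. The cutoff is set so that about `contamination` (the filter threshold) of the training samples are isolated, with at least one sample always isolated.

```python
import math
import random


def _c(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def _build_tree(values, depth, max_depth, rng):
    # Stop when the node cannot or should not be split any further.
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return {"size": len(values)}
    split = rng.uniform(min(values), max(values))
    return {
        "split": split,
        "left": _build_tree([v for v in values if v < split], depth + 1, max_depth, rng),
        "right": _build_tree([v for v in values if v >= split], depth + 1, max_depth, rng),
    }


def _path_length(tree, x, depth=0):
    if "split" not in tree:
        return depth + _c(tree["size"])
    child = tree["left"] if x < tree["split"] else tree["right"]
    return _path_length(child, x, depth + 1)


class ThresholdModel:
    """Hypothetical 1-D isolation forest over abnormal access ratios."""

    def __init__(self, ratios, contamination, n_trees=100, sample_size=256, seed=0):
        rng = random.Random(seed)
        psi = min(sample_size, len(ratios))  # needs at least 2 samples
        max_depth = math.ceil(math.log2(max(psi, 2)))
        self.trees = [_build_tree(rng.sample(ratios, psi), 0, max_depth, rng)
                      for _ in range(n_trees)]
        self.norm = _c(psi)
        # Cutoff: the k-th highest training score, k set by the filter threshold.
        scores = sorted((self.score(r) for r in ratios), reverse=True)
        k = max(1, int(contamination * len(ratios)))
        self.cutoff = scores[k - 1]

    def score(self, x):
        mean_path = sum(_path_length(t, x) for t in self.trees) / len(self.trees)
        return 2.0 ** (-mean_path / self.norm)  # shorter path, higher score

    def is_isolated(self, x):
        return self.score(x) >= self.cutoff


train = [0.010 + 0.0001 * (i % 20) for i in range(199)] + [0.9]
model = ThresholdModel(train, contamination=1 / 200)
```

Here `model.is_isolated(ratio)` plays the role of the first judgment in step S1. Samples that are separated after only a few random splits get a short average path and therefore a high score, which matches the "isolated early means likely abnormal" reasoning in the text.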
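S3: If the access data of the target time node is determined to be abnormal data, determine a detection interval containing the target time node, collect statistics on the distribution of access data samples within the detection interval, and determine again, according to the distribution, whether the access data of the target time node is abnormal data.
In this embodiment, the data used to train the threshold model is sampled at random, whereas the volume of access data may in fact differ greatly from one moment to another, so the abnormal access ratio can also vary widely across time nodes. Some time nodes with a high abnormal access ratio are likely caused by a sudden surge in access data; the ratio at these time nodes is actually acceptable and should not be treated as abnormal data. Therefore, to confirm whether the abnormal data detected in step S1 is truly abnormal, this embodiment checks the detected abnormal data a second time.
Specifically, if the access data at the target time node is determined to be abnormal data, more data around the target time node can be acquired for analysis, so as to avoid a one-sided detection result. In practice, a detection interval containing the target time node is first determined; the interval corresponds to a detection duration. For example, it may be centered on the target time node and extend 5 minutes before and after it, for a detection duration of 10 minutes in total. The detection duration may also differ by domain name type: for stable domain names it may be relatively short, for example 20 minutes, while for periodically varying and spike-varying domain names it may be relatively long, for example 1 hour and 2 hours, respectively. Thus, after the access data of the target time node is preliminarily determined to be abnormal, the detection duration corresponding to the domain name type of that access data can be obtained. The target time node is then taken as the center of the detection interval, and a detection interval that contains the target time node and whose length equals the obtained detection duration is constructed. Once the interval is constructed, the access data within it can be acquired. The access data acquired here is that of the identified domain name type; access data of other domain name types can be filtered out first.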
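Constructing a detection interval centered on the target time node can be sketched as below. The dictionary keys and the function name `detection_interval` are hypothetical; the durations are the example values from the text.

```python
from datetime import datetime, timedelta

# Example detection durations per domain name type (values from the text;
# the type names are hypothetical labels).
DETECTION_DURATION = {
    "stable": timedelta(minutes=20),
    "periodic": timedelta(hours=1),
    "spike": timedelta(hours=2),
}


def detection_interval(target: datetime, domain_type: str):
    """Return (start, end) of an interval centered on the target time node."""
    half = DETECTION_DURATION[domain_type] / 2
    return target - half, target + half


start, end = detection_interval(datetime(2019, 12, 6, 12, 5), "stable")
```

For a stable domain name and a target time node of 12:05, this gives the window from 11:55 to 12:15.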
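In this embodiment, to improve the accuracy of the analysis, the access data falling within the detection interval on each day of a certain period can be taken as the object of analysis. For example, if the access data of a target domain name at the target time node 12:05 is preliminarily determined to be abnormal, the access data of that domain name between 11:55 and 12:15 on each of the most recent 30 days can be used for further analysis. After this data is acquired, the abnormal access ratio of each access data sample in the detection interval is computed. As before, the ratio can be computed at 1-minute granularity, so that on each day every minute of the detection interval yields one abnormal access ratio. The mean and standard deviation of these ratios are then calculated so that the ratios can be modeled as a normal distribution. A normal distribution reflects the general characteristics of the data: data in the middle of the distribution can usually be regarded as normal, and only data at its edges may be abnormal. In the resulting distribution, the center corresponds to the calculated mean, and the distribution spreads to both sides in units of the standard deviation. A confidence interval can therefore be determined in the distribution from the mean and the standard deviation. In one specific application example, the confidence interval may be (μ-3σ, μ+3σ); abnormal access ratios inside this interval are regarded as normal data, and only ratios outside it are abnormal data. The position of the access data of the target time node in the distribution is then identified. If it lies outside the confidence interval, the access data of the target time node is determined to be abnormal data; if it lies inside the confidence interval, the access data is determined to be non-abnormal data.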
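The second judgment can be sketched with the standard library's `statistics` module. The function names are hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the source does not say which estimator is intended.

```python
from statistics import mean, pstdev


def confidence_interval(ratios, k=3.0):
    """Return (mu - k*sigma, mu + k*sigma) for the historical ratios."""
    mu = mean(ratios)
    sigma = pstdev(ratios)  # assumption: population standard deviation
    return mu - k * sigma, mu + k * sigma


def is_abnormal_by_distribution(target_ratio, ratios, k=3.0):
    """True if the target ratio lies outside the open confidence interval."""
    low, high = confidence_interval(ratios, k)
    return not (low < target_ratio < high)


# Hypothetical per-minute ratios collected from the detection interval
# across several days.
history = [0.01, 0.02, 0.03] * 10
```

With `k=3.0` this reproduces the (μ-3σ, μ+3σ) interval of the example; a narrower or wider interval only requires a different `k`.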
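S5: If the access data of the target time node is again determined to be abnormal data, acquire a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
In this embodiment, to further improve detection accuracy, different convergence rules and amplitude thresholds can be configured for different domain name types. The convergence rule can be determined according to the service characteristics of the domain name type. For example, domain name types can be divided by service characteristics into several fields, such as banking, payment, and video on demand, and different convergence rules can be defined for these fields. A convergence rule takes into account how abnormal data appears over a period of time, and thereby determines whether the access data at a given target time node is truly abnormal data to be processed. The reason is that the abnormal data identified in steps S1 and S3 may well be sudden, one-off abnormal data that will not recur frequently in subsequent data access, so there is no need to spend manpower and material resources on it. The amplitude threshold, in turn, judges in absolute terms whether the number of abnormal requests in the access data at the target time node is large enough. The reason is that for some target time nodes the computed abnormal access ratio is high only because the total number of accesses has dropped. The number of abnormal requests has not actually changed; the ratio merely appears high because the total number of access requests is smaller. This case does not warrant manpower and material resources either.
Accordingly, in this embodiment the abnormal data identified can be further screened by the convergence rule and the amplitude threshold. The convergence rule may differ by domain name type. For example, the convergence rule may be that, starting from the target time node, the access data at a specified number of consecutive time nodes is all determined to be abnormal data. Alternatively, the convergence rule may be that abnormal data appears a specified number of times within a preset duration containing the target time node. For instance, for stable domain names the convergence rule may be that the access data is determined to be abnormal for 4 consecutive minutes; for periodically varying domain names, that abnormal data appears 6 times within 10 minutes; and for spike-varying domain names, that abnormal data appears 10 times within 20 minutes.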
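The two kinds of convergence rule can be sketched over a list of per-minute abnormal flags. Several details here are assumptions: the function names, the choice to start the counting window at the target time node, and reading "a specified number of times" as "at least that many".

```python
def consecutive_rule(flags, start, count):
    """True if `count` consecutive time nodes from `start` are all abnormal."""
    window = flags[start:start + count]
    return len(window) == count and all(window)


def window_count_rule(flags, start, window, min_hits):
    """True if at least `min_hits` of the `window` nodes from `start` are abnormal."""
    return sum(flags[start:start + window]) >= min_hits


# Example rules from the text, keyed by hypothetical domain name type labels.
CONVERGENCE_RULES = {
    "stable": lambda flags, i: consecutive_rule(flags, i, 4),
    "periodic": lambda flags, i: window_count_rule(flags, i, 10, 6),
    "spike": lambda flags, i: window_count_rule(flags, i, 20, 10),
}

flags = [True, True, True, True, False, False, True, False, False, False]
```

Here `flags[i]` is True when minute `i` passed both earlier judgments, so `CONVERGENCE_RULES["stable"](flags, 0)` asks whether the target minute and the three minutes after it were all abnormal.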
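The amplitude threshold can be tiered by the magnitude of the access data. The magnitude may, for example, be measured in QPS (Queries Per Second); the larger the magnitude, the larger the corresponding amplitude threshold may be. In practice, several magnitude ranges can be defined, each with its own amplitude threshold.
Thus, for the access data of the target time node, the domain name type to which it belongs can be identified and the convergence rule for that type obtained. The data magnitude of the access data of the target time node can also be computed, and the amplitude threshold of the magnitude range containing it obtained. When screening abnormal data by the convergence rule and the amplitude threshold, the access data of the target time node is determined to be abnormal data to be processed only if it satisfies the corresponding convergence rule and the number of abnormal requests in it is greater than the corresponding amplitude threshold. If the access data of the target time node does not satisfy the corresponding convergence rule, or the number of abnormal requests in it is less than or equal to the corresponding amplitude threshold, it is not treated as abnormal data to be processed. In other words, both conditions must hold for the data to be treated as abnormal data to be processed; if either fails, it is not. This embodiment does not limit the order in which the convergence rule and the amplitude threshold are checked.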
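The tiered amplitude threshold and the final decision can be sketched as follows. The tier boundaries and threshold values are invented for illustration, as are the function names; the source only states that a larger magnitude maps to a larger threshold and that both conditions must hold.

```python
import bisect

# Hypothetical magnitude tiers: QPS boundaries and one threshold per tier.
QPS_BOUNDS = [100, 1000, 10000]
AMPLITUDE_THRESHOLDS = [5, 20, 100, 500]  # one more entry than QPS_BOUNDS


def amplitude_threshold(qps):
    """Look up the amplitude threshold of the tier that contains `qps`."""
    return AMPLITUDE_THRESHOLDS[bisect.bisect_right(QPS_BOUNDS, qps)]


def needs_handling(rule_satisfied, abnormal_requests, qps):
    """Abnormal data to be processed only if both conditions hold."""
    return rule_satisfied and abnormal_requests > amplitude_threshold(qps)
```

Because the two conditions are combined with a plain `and`, the order in which they are evaluated does not affect the result, in line with the text.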
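The present application further provides a system for detecting abnormal data, the system comprising:
a threshold model judging unit, configured to acquire access data of a specified period, train a threshold model based on the access data, and determine, according to the threshold model, whether the access data of a target time node is abnormal data;
a distribution judging unit, configured to, if the access data of the target time node is determined to be abnormal data, determine a detection interval containing the target time node, collect statistics on the distribution of access data samples within the detection interval, and determine again, according to the distribution, whether the access data of the target time node is abnormal data;
a screening unit, configured to, if the access data of the target time node is again determined to be abnormal data, acquire a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
In one embodiment, the threshold model judging unit comprises:
a filter threshold determining module, configured to identify the domain name type to which the access data belongs and obtain the filter threshold corresponding to the domain name type;
a filter boundary determining module, configured to compute the abnormal access ratio of each time node in the access data and determine a filter boundary, the filter boundary being used to divide the computed abnormal access ratios into aggregated nodes and isolated nodes, wherein the number of isolated nodes is determined by the filter threshold;
a threshold model generating module, configured to take the model with the filter boundary as the trained threshold model.
In one embodiment, the distribution judging unit comprises:
a data calculation module, configured to compute the abnormal access ratio of each access data sample in the detection interval, and calculate the mean and standard deviation of the computed abnormal access ratios;
a normal distribution module, configured to model the computed abnormal access ratios as a normal distribution according to the mean and the standard deviation, and take the resulting normal distribution as the distribution of the access data samples within the detection interval.
In one embodiment, the screening unit comprises:
a first determining module, configured to determine that the access data of the target time node is abnormal data to be processed if it satisfies the corresponding convergence rule and the number of abnormal requests in it is greater than the corresponding amplitude threshold;
a second determining module, configured to determine that the access data of the target time node is not abnormal data to be processed if it does not satisfy the corresponding convergence rule, or the number of abnormal requests in it is less than or equal to the corresponding amplitude threshold.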
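How the three units chain together can be sketched as a small class in which each unit is a callable returning True when it judges the target time node abnormal. The class name `DetectionSystem`, its field names, and the callable interface are all hypothetical; the source describes the units functionally and does not prescribe an API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DetectionSystem:
    """Hypothetical wiring of the three units described above."""
    threshold_model_unit: Callable[[Any], bool]
    distribution_unit: Callable[[Any], bool]
    screening_unit: Callable[[Any], bool]

    def detect(self, node) -> bool:
        # Each later unit runs only if the earlier one judged the node abnormal.
        if not self.threshold_model_unit(node):
            return False
        if not self.distribution_unit(node):
            return False
        return self.screening_unit(node)


calls = []


def make_stage(name, result):
    def run(node):
        calls.append(name)  # record which units were actually invoked
        return result
    return run


system = DetectionSystem(make_stage("S1", True), make_stage("S3", False), make_stage("S5", True))
outcome = system.detect("12:05")
```

In this toy run the distribution unit rejects the node, so the screening unit is never invoked, which mirrors the layer-by-layer screening of steps S1, S3, and S5.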
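Referring to FIG. 7, an embodiment of the present application further provides a device for detecting abnormal data, the device comprising a processor and a memory, the memory being configured to store a computer program which, when executed by the processor, can implement the above method for detecting abnormal data.
In this embodiment, the memory may include a physical apparatus for storing information, typically one that digitizes the information and then stores it in a medium using electrical, magnetic, or optical means. The memory described in this embodiment may include: apparatus that stores information electrically, such as RAM or ROM; apparatus that stores information magnetically, such as hard disks, floppy disks, magnetic tape, magnetic core memory, bubble memory, or USB flash drives; and apparatus that stores information optically, such as CDs or DVDs. There are of course other kinds of memory as well, such as quantum memory or graphene memory.
In this embodiment, the processor can be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
As can be seen from the above, in the technical solutions provided by one or more embodiments of the present application, a threshold model can be trained on the access data of a specified period when detecting abnormal data. The threshold model provides a preliminary judgment on the access data of a target time node. If the data is judged abnormal, a detection interval containing the target time node can be determined and the distribution of the access data samples within it can be computed. The distribution result is then used to further determine whether the access data is abnormal. The benefit is that data judged abnormal by a uniform threshold model may not actually be abnormal within a particular period; computing the distribution of the access data samples within that period makes it clearer whether the access data is abnormal. If the access data is still judged abnormal, the convergence rule and amplitude threshold of the target time node can be acquired. The convergence rule filters out sudden, one-off data abnormalities, which do not actually need handling, and the amplitude threshold avoids misjudgment of abnormal data caused by too few requests in the access data. Further screening by the convergence rule and the amplitude threshold determines the abnormal data that finally needs to be processed. Layer-by-layer screening by multiple means thus makes abnormal data detection more accurate.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the embodiments of the system and the device can be understood by reference to the description of the foregoing method embodiments.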
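Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that comprises the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.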

Claims (14)

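  1. A method for detecting abnormal data, characterized in that the method comprises:
    acquiring access data of a specified period, training a threshold model based on the access data, and determining, according to the threshold model, whether the access data of a target time node is abnormal data;
    if the access data of the target time node is determined to be abnormal data, determining a detection interval containing the target time node, collecting statistics on the distribution of access data samples within the detection interval, and determining again, according to the distribution, whether the access data of the target time node is abnormal data;
    if the access data of the target time node is again determined to be abnormal data, acquiring a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determining, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.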
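  2. The method according to claim 1, characterized in that training a threshold model based on the access data comprises:
    identifying the domain name type to which the access data belongs, and obtaining the filter threshold corresponding to the domain name type;
    computing the abnormal access ratio of each time node in the access data, and determining a filter boundary, the filter boundary being used to divide the computed abnormal access ratios into aggregated nodes and isolated nodes, wherein the number of isolated nodes is determined by the filter threshold;
    taking the model with the filter boundary as the trained threshold model.
  3. The method according to claim 1 or 2, characterized in that determining whether the access data of a target time node is abnormal data comprises:
    calculating the abnormal access ratio corresponding to the access data of the target time node, and inputting the calculated abnormal access ratio into the threshold model; if the threshold model outputs an isolated node, determining that the access data of the target time node is abnormal data; if the threshold model outputs an aggregated node, determining that the access data of the target time node is non-abnormal data.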
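  4. The method according to claim 1, characterized in that determining a detection interval containing the target time node comprises:
    identifying the domain name type to which the access data of the target time node belongs, and obtaining the detection duration corresponding to the domain name type;
    constructing, with the target time node as the center, a detection interval that contains the target time node and whose length equals the detection duration; wherein the constructed detection interval serves as the detection interval containing the target time node.
  5. The method according to claim 1, characterized in that collecting statistics on the distribution of access data samples within the detection interval comprises:
    computing the abnormal access ratio of each access data sample in the detection interval, and calculating the mean and standard deviation of the computed abnormal access ratios;
    modeling the computed abnormal access ratios as a normal distribution according to the mean and the standard deviation, and taking the resulting normal distribution as the distribution of the access data samples within the detection interval.
  6. The method according to claim 5, characterized in that determining again whether the access data of the target time node is abnormal data comprises:
    determining a confidence interval in the normal distribution according to the mean and the standard deviation; if the access data of the target time node lies outside the confidence interval, determining that the access data of the target time node is abnormal data; if the access data of the target time node lies inside the confidence interval, determining that the access data of the target time node is non-abnormal data.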
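  7. The method according to claim 1, characterized in that acquiring the convergence rule corresponding to the access data of the target time node comprises:
    identifying the domain name type to which the access data of the target time node belongs, and obtaining the convergence rule corresponding to the domain name type; wherein the convergence rule comprises:
    starting from the target time node, the access data at a specified number of consecutive time nodes is all determined to be abnormal data;
    or
    abnormal data appears a specified number of times within a preset duration containing the target time node.
  8. The method according to claim 1, characterized in that the amplitude threshold corresponding to the access data of the target time node is tiered by the magnitude of the access data, wherein the larger the magnitude of the access data, the larger the corresponding amplitude threshold.
  9. The method according to claim 1, characterized in that determining whether the access data of the target time node is abnormal data to be processed comprises:
    if the access data of the target time node satisfies the corresponding convergence rule and the number of abnormal requests in the access data of the target time node is greater than the corresponding amplitude threshold, determining that the access data of the target time node is abnormal data to be processed;
    if the access data of the target time node does not satisfy the corresponding convergence rule, or the number of abnormal requests in the access data of the target time node is less than or equal to the corresponding amplitude threshold, determining that the access data of the target time node is not abnormal data to be processed.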
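  10. A system for detecting abnormal data, characterized in that the system comprises:
    a threshold model judging unit, configured to acquire access data of a specified period, train a threshold model based on the access data, and determine, according to the threshold model, whether the access data of a target time node is abnormal data;
    a distribution judging unit, configured to, if the access data of the target time node is determined to be abnormal data, determine a detection interval containing the target time node, collect statistics on the distribution of access data samples within the detection interval, and determine again, according to the distribution, whether the access data of the target time node is abnormal data;
    a screening unit, configured to, if the access data of the target time node is again determined to be abnormal data, acquire a convergence rule and an amplitude threshold corresponding to the access data of the target time node, and determine, based on the convergence rule and the amplitude threshold, whether the access data of the target time node is abnormal data to be processed.
  11. The system according to claim 10, characterized in that the threshold model judging unit comprises:
    a filter threshold determining module, configured to identify the domain name type to which the access data belongs and obtain the filter threshold corresponding to the domain name type;
    a filter boundary determining module, configured to compute the abnormal access ratio of each time node in the access data and determine a filter boundary, the filter boundary being used to divide the computed abnormal access ratios into aggregated nodes and isolated nodes, wherein the number of isolated nodes is determined by the filter threshold;
    a threshold model generating module, configured to take the model with the filter boundary as the trained threshold model.
  12. The system according to claim 10, characterized in that the distribution judging unit comprises:
    a data calculation module, configured to compute the abnormal access ratio of each access data sample in the detection interval, and calculate the mean and standard deviation of the computed abnormal access ratios;
    a normal distribution module, configured to model the computed abnormal access ratios as a normal distribution according to the mean and the standard deviation, and take the resulting normal distribution as the distribution of the access data samples within the detection interval.
  13. The system according to claim 10, characterized in that the screening unit comprises:
    a first determining module, configured to determine that the access data of the target time node is abnormal data to be processed if it satisfies the corresponding convergence rule and the number of abnormal requests in it is greater than the corresponding amplitude threshold;
    a second determining module, configured to determine that the access data of the target time node is not abnormal data to be processed if it does not satisfy the corresponding convergence rule, or the number of abnormal requests in it is less than or equal to the corresponding amplitude threshold.
  14. A device for detecting abnormal data, characterized in that the device comprises a memory and a processor, the memory being configured to store a computer program which, when executed by the processor, implements the method according to any one of claims 1 to 9.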
PCT/CN2020/070703 2019-12-06 2020-01-07 一种异常数据的检测方法、系统及设备 WO2021109314A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911239601.7A CN111092757B (zh) 2019-12-06 2019-12-06 一种异常数据的检测方法、系统及设备
CN201911239601.7 2019-12-06

Publications (1)

Publication Number Publication Date
WO2021109314A1 true WO2021109314A1 (zh) 2021-06-10

Family

ID=70396315

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070703 WO2021109314A1 (zh) 2019-12-06 2020-01-07 一种异常数据的检测方法、系统及设备

Country Status (2)

Country Link
CN (1) CN111092757B (zh)
WO (1) WO2021109314A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112008543B (zh) * 2020-07-20 2022-11-01 上海大制科技有限公司 一种焊枪电极帽修磨异常诊断方法
CN112649118A (zh) * 2020-11-10 2021-04-13 许继集团有限公司 一种基于边缘计算的输电线路温度传感系统
CN112526418B (zh) * 2020-11-24 2024-05-28 上海辰光医疗科技股份有限公司 用于磁共振成像的磁场均匀性测量的数据记录和处理方法
CN113091817A (zh) * 2021-04-06 2021-07-09 重庆大学 一种三甘醇脱水装置状态监测及故障诊断系统
CN113094284A (zh) * 2021-04-30 2021-07-09 中国工商银行股份有限公司 应用故障检测方法及装置
CN113391983A (zh) * 2021-06-07 2021-09-14 北京达佳互联信息技术有限公司 报警信息的生成方法、装置、服务器及存储介质
CN113840157B (zh) * 2021-09-23 2023-07-18 上海哔哩哔哩科技有限公司 访问检测方法、系统及装置
CN114152894B (zh) * 2021-12-02 2022-08-02 北京博示电子科技有限责任公司 一种检测灯管的方法、装置、电子设备和存储介质
CN114363062A (zh) * 2021-12-31 2022-04-15 深信服科技股份有限公司 一种域名检测方法、系统、设备及计算机可读存储介质
CN117195273B (zh) * 2023-11-07 2024-02-06 闪捷信息科技有限公司 基于时序数据异常检测的数据泄露检测方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090951A (zh) * 2014-07-04 2014-10-08 李阳 一种异常数据的处理方法
US20160026520A1 (en) * 2014-07-28 2016-01-28 Yahoo! Inc. Rainbow event drop detection system
CN105915555A (zh) * 2016-06-29 2016-08-31 北京奇虎科技有限公司 网络异常行为的检测方法及系统
US20180295146A1 (en) * 2017-04-05 2018-10-11 Yandex Europe Ag Methods and systems for detecting abnormal user activity
CN109948669A (zh) * 2019-03-04 2019-06-28 腾讯科技(深圳)有限公司 一种异常数据检测方法及装置
CN110083475A (zh) * 2019-04-23 2019-08-02 新华三信息安全技术有限公司 一种异常数据的检测方法及装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI592036B (zh) * 2014-11-14 2017-07-11 Chunghwa Telecom Co Ltd Wireless network signal range detection and display methods
CN107342880B (zh) * 2016-04-29 2021-06-08 中兴通讯股份有限公司 异常信息采集方法及系统
US10045218B1 (en) * 2016-07-27 2018-08-07 Argyle Data, Inc. Anomaly detection in streaming telephone network data
CN107657288B (zh) * 2017-10-26 2020-07-03 国网冀北电力有限公司 一种基于孤立森林算法的电力调度流数据异常检测方法
CN108446349B (zh) * 2018-03-08 2022-03-25 国网四川省电力公司电力科学研究院 一种gis异常数据的检测方法

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420652A (zh) * 2021-06-22 2021-09-21 中冶赛迪重庆信息技术有限公司 一种时序信号片段异常识别方法、系统、介质及终端
CN113505344A (zh) * 2021-07-16 2021-10-15 长鑫存储技术有限公司 机台插槽的异常侦测方法、修复方法和异常侦测系统
CN113505344B (zh) * 2021-07-16 2023-08-29 长鑫存储技术有限公司 机台插槽的异常侦测方法、修复方法和异常侦测系统
CN114264957A (zh) * 2021-12-02 2022-04-01 东软集团股份有限公司 一种异常单体检测方法及其相关设备
CN114264957B (zh) * 2021-12-02 2024-05-07 东软集团股份有限公司 一种异常单体检测方法及其相关设备
CN114131631A (zh) * 2021-12-16 2022-03-04 山东新一代信息产业技术研究院有限公司 一种巡检机器人报警阈值设置方法、装置及介质
CN114131631B (zh) * 2021-12-16 2024-02-02 山东新一代信息产业技术研究院有限公司 一种巡检机器人报警阈值设置方法、装置及介质
CN114416412A (zh) * 2022-01-14 2022-04-29 建信金融科技有限责任公司 一种基于Arthas的异常定位方法及系统
CN114422267A (zh) * 2022-03-03 2022-04-29 北京天融信网络安全技术有限公司 流量检测方法、装置、设备及介质
CN114422267B (zh) * 2022-03-03 2024-02-06 北京天融信网络安全技术有限公司 流量检测方法、装置、设备及介质
WO2024036709A1 (zh) * 2022-08-18 2024-02-22 深圳前海微众银行股份有限公司 一种异常数据检测方法及装置
CN117874653A (zh) * 2024-03-11 2024-04-12 武汉佳华创新电气有限公司 一种基于多源数据的电力系统安全监测方法及系统
CN117874653B (zh) * 2024-03-11 2024-05-31 武汉佳华创新电气有限公司 一种基于多源数据的电力系统安全监测方法及系统

Also Published As

Publication number Publication date
CN111092757B (zh) 2021-11-23
CN111092757A (zh) 2020-05-01

Similar Documents

Publication Publication Date Title
WO2021109314A1 (zh) 一种异常数据的检测方法、系统及设备
US11087329B2 (en) Method and apparatus of identifying a transaction risk
US20210089917A1 (en) Heuristic Inference of Topological Representation of Metric Relationships
US10878102B2 (en) Risk scores for entities
CN112822143B (zh) 一种ip地址的评估方法、系统及设备
RU2017118317A (ru) Система и способ автоматического расчета кибер-риска в бизнес-критических приложениях
US9565203B2 (en) Systems and methods for detection of anomalous network behavior
CN106649831B (zh) 一种数据过滤方法及装置
US20210092160A1 (en) Data set creation with crowd-based reinforcement
CN110362612B (zh) 由电子设备执行的异常数据检测方法、装置和电子设备
US20190065738A1 (en) Detecting anomalous entities
WO2016095626A1 (zh) 监控进程的方法和装置
US20150371044A1 (en) Targeted security alerts
CN110471821B (zh) 异常变更检测方法、服务器及计算机可读存储介质
US20140269339A1 (en) System for analysing network traffic and a method thereof
CN107682345B (zh) Ip地址的检测方法、检测装置及电子设备
CN114465870B (zh) 告警信息的处理方法及装置、存储介质和电子设备
US20210014102A1 (en) Reinforced machine learning tool for anomaly detection
US11074652B2 (en) System and method for model-based prediction using a distributed computational graph workflow
CN109561097B (zh) 结构化查询语言注入安全漏洞检测方法、装置、设备及存储介质
US10705940B2 (en) System operational analytics using normalized likelihood scores
CN109597746A (zh) 故障分析方法及装置
US11675647B2 (en) Determining root-cause of failures based on machine-generated textual data
CN114443437A (zh) 告警根因输出方法、装置、设备、介质和程序产品
CN110399405A (zh) 日志报警方法、装置、系统及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20896002

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20896002

Country of ref document: EP

Kind code of ref document: A1