WO2024060767A1 - Anomaly detection method and related apparatus - Google Patents

Anomaly detection method and related apparatus

Info

Publication number
WO2024060767A1
WO2024060767A1 PCT/CN2023/103993 CN2023103993W WO2024060767A1 WO 2024060767 A1 WO2024060767 A1 WO 2024060767A1 CN 2023103993 W CN2023103993 W CN 2023103993W WO 2024060767 A1 WO2024060767 A1 WO 2024060767A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
field
fields
subset
attribute
Prior art date
Application number
PCT/CN2023/103993
Other languages
French (fr)
Chinese (zh)
Inventor
杨松
周颖杰
陈亮
邵传领
王腾宇
吴迪
刘凡兴
邓怡然
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司 filed Critical 华为云计算技术有限公司
Publication of WO2024060767A1 publication Critical patent/WO2024060767A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols

Definitions

  • This application relates to the field of cloud service technology, and in particular to an anomaly detection method and related devices.
  • A web application firewall (WAF) is deployed between the server of the cloud platform and the Internet to which clients connect, and the WAF is used to detect anomalies in application programming interface (API) requests from the Internet and to protect against them.
  • The WAF filters API requests based on manually created security rules that cover some common attack patterns. When the WAF detects that an API request matches an attack pattern included in the security rules, it can reject the request, thereby preventing malicious parties from attacking the cloud platform through APIs.
  • WAF-based anomaly detection requires security rules to be created manually, which is time-consuming and labor-intensive, and achieving good detection results also requires in-depth manual analysis and summarization of various attack patterns.
  • In addition, manually created security rules inevitably have loopholes, which leads to missed detections and false detections.
  • This application provides an anomaly detection method and related devices, which can lower the barrier for relevant personnel to perform anomaly detection, reduce missed detections and false detections, and improve the efficiency of anomaly detection.
  • The technical solutions are as follows:
  • In a first aspect, an anomaly detection method is provided. The method includes: receiving configuration parameters of an anomaly detection task, where the configuration parameters indicate a sample set, a test set and candidate attribute fields, the sample set includes the log data used for parameter tuning in the cloud platform, the test set includes the log data to be detected for anomalies in the cloud platform, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform; based on the sample set and the candidate attribute fields, determining target attribute fields from the candidate attribute fields, where the target attribute fields are the attribute fields used to perform the anomaly detection task; based on the sample set and the target attribute fields, tuning the first hyperparameter of the first detection model; and, based on the target attribute fields, performing anomaly detection on the test set through the parameter-tuned first detection model to obtain the anomaly detection results of the test set.
  • This solution does not require relevant personnel in the cloud service field to manually create security rules, nor does it require in-depth manual analysis and summarization of various attack patterns. It therefore avoids the loopholes of manually created security rules, reduces missed detections and false detections, and improves the efficiency of anomaly detection. In addition, this solution allows an anomaly detection model to be built quickly and in a self-service manner to achieve task-specific anomaly detection, even if the relevant personnel have little or no professional knowledge of machine learning and deep learning, such as knowledge of model design and tuning.
  • The candidate attribute fields include m selected fields and n candidate fields, where m is an integer not less than 0 and n is an integer greater than 0. Determining the target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields includes: based on the sample set, the m selected fields and the n candidate fields, determining the field scores corresponding to the n candidate fields, where a field score represents the degree to which the anomaly detection effect is improved after the corresponding candidate field is added to the m selected fields; based on the field scores corresponding to the n candidate fields, determining p candidate fields from the n candidate fields, where p is a positive integer not greater than n; and determining the m selected fields and the p candidate fields as the target attribute fields. That is, the sample set is used to screen fields according to the degree to which each candidate field improves the anomaly detection effect, so that more valuable attribute fields are selected.
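The final selection step can be illustrated with a minimal sketch. The source only says that the p candidate fields are determined "based on the field scores"; taking the p highest-scoring fields is an assumption made here, and the function name `choose_target_fields` and the field names are hypothetical.

```python
def choose_target_fields(selected, field_scores, p):
    """Return the m selected fields plus p candidate fields.

    Assumption: a higher score means a larger improvement, and the
    p highest-scoring candidate fields are kept.
    """
    if not 0 < p <= len(field_scores):
        raise ValueError("p must be a positive integer not greater than n")
    ranked = sorted(field_scores, key=field_scores.get, reverse=True)
    return list(selected) + ranked[:p]


# Hypothetical field names and scores, for illustration only.
selected = ["src_ip"]
scores = {"uri": 0.42, "method": 0.10, "status": 0.31}
target = choose_target_fields(selected, scores, p=2)
```

With m = 1 and p = 2, the target attribute fields here are the one selected field followed by the two best-scoring candidates.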
  • The sample set includes a training subset and a validation subset. Determining the field scores corresponding to the n candidate fields based on the sample set, the m selected fields and the n candidate fields includes: forming the m selected fields into a selected field set and the n candidate fields into a candidate field set; based on the training subset, the selected field set and the candidate field set, determining the mutual information corresponding to each candidate field in the candidate field set, where the mutual information represents the correlation between the corresponding candidate field and all fields in the selected field set; selecting the k candidate fields with the smallest mutual information from the candidate field set, where k is a positive integer not greater than n; based on the training subset, the validation subset, the selected field set and the k candidate fields, determining the reconstruction losses corresponding to the k candidate fields, where a reconstruction loss represents the effect of performing anomaly detection on the validation subset using the corresponding candidate field together with the selected field set; based on the mutual information and reconstruction losses corresponding to the k candidate fields, selecting one candidate field from the k candidate fields and determining the field score of the selected candidate field; and moving the selected candidate field from the candidate field set to the selected field set, then returning to the step of determining the mutual information corresponding to each candidate field in the candidate field set, until the candidate field set is empty, at which point the field scores corresponding to the n candidate fields are obtained. That is, fields are screened using the mutual information between each candidate field and the selected fields together with the reconstruction loss corresponding to each candidate field, so that more valuable attribute fields are selected.
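The iterative scoring procedure can be sketched as a greedy loop. This is an illustration under stated assumptions, not the patented implementation: mutual information is computed empirically over discrete field values, "all fields in the selected field set" is modelled as the tuple of their values, the `reconstruction_loss` callable is a hypothetical stand-in for training and evaluating the second detection model, and the rule for combining mutual information and loss into a choice and a score (lowest loss, ties broken by lowest mutual information, score equal to the negated loss) is assumed because the source does not specify it.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def score_fields(train, selected, candidates, k, reconstruction_loss):
    """Greedily move candidate fields into the selected set and score them."""
    selected, pool, scores = list(selected), list(candidates), {}
    while pool:
        # Joint value of all currently selected fields, one tuple per record.
        joint = [tuple(row[f] for f in selected) for row in train]
        mi = {f: mutual_information([row[f] for row in train], joint) for f in pool}
        # Keep the k candidates that are least redundant with the selected set.
        shortlist = sorted(pool, key=mi.get)[:k]
        loss = {f: reconstruction_loss(selected + [f]) for f in shortlist}
        # Assumption: prefer the lowest loss, break ties by lowest mutual information.
        best = min(shortlist, key=lambda f: (loss[f], mi[f]))
        scores[best] = -loss[best]  # assumption: lower loss gives a higher score
        pool.remove(best)
        selected.append(best)
    return scores


# Toy training subset with hypothetical field names.
train = [
    {"ip": "a", "uri": "/x", "method": "GET"},
    {"ip": "a", "uri": "/y", "method": "GET"},
    {"ip": "b", "uri": "/x", "method": "POST"},
    {"ip": "b", "uri": "/y", "method": "POST"},
]
fake_loss = lambda fields: 1.0 / len(fields)  # placeholder for a trained model's loss
scores = score_fields(train, ["ip"], ["uri", "method"], k=1, reconstruction_loss=fake_loss)
```

In this toy data `method` is fully determined by `ip`, so its mutual information with the selected set is high and `uri` is shortlisted first.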
  • Based on the training subset, the validation subset, the selected field set and the k candidate fields, determining the reconstruction losses corresponding to the k candidate fields includes: for a first candidate field among the k candidate fields, adding the first candidate field to the selected field set to obtain a candidate field set (here, the selected field set plus the first candidate field), where the first candidate field is any one of the k candidate fields; based on the training subset and the candidate field set, determining the second detection model corresponding to the first candidate field; and, based on the validation subset and the candidate field set, determining the reconstruction loss corresponding to the first candidate field through the second detection model corresponding to the first candidate field.
  • Based on the training subset and the candidate field set, determining the second detection model corresponding to the first candidate field includes: determining the reference statistical features of the training subset based on the candidate field set, where the reference statistical features of the training subset include statistics of the data, in the training subset, of all fields included in the candidate field set; and training an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first candidate field.
  • Based on the validation subset and the candidate field set, determining the reconstruction loss corresponding to the first candidate field through the second detection model corresponding to the first candidate field includes: determining the reference statistical features of the validation subset based on the candidate field set, where the reference statistical features of the validation subset include statistics of the data, in the validation subset, of all fields included in the candidate field set; inputting the reference statistical features of the validation subset into the second detection model corresponding to the first candidate field to obtain the reference reconstructed features of the validation subset, where the reference reconstructed features of the validation subset include reconstructed statistics of the data, in the validation subset, of all fields included in the candidate field set; and, based on the reference statistical features and the reference reconstructed features of the validation subset, determining the reconstruction loss corresponding to the first candidate field.
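As a rough illustration of the last step, the reconstruction loss can be computed by comparing each statistical feature vector with its reconstruction. The source does not name the error measure; mean squared error averaged over the subset is an assumption here, and both function names are hypothetical.

```python
def reconstruction_loss(stats, recon):
    """Mean squared error between a feature vector and its reconstruction."""
    if len(stats) != len(recon):
        raise ValueError("feature vectors must have the same length")
    return sum((s - r) ** 2 for s, r in zip(stats, recon)) / len(stats)


def subset_loss(stat_rows, recon_rows):
    """Average the per-sample losses over a validation subset."""
    losses = [reconstruction_loss(s, r) for s, r in zip(stat_rows, recon_rows)]
    return sum(losses) / len(losses)


# Two toy samples: statistical features and their reconstructions.
loss = subset_loss([[1.0, 2.0], [0.0, 4.0]], [[1.0, 1.0], [0.0, 2.0]])
```

The first sample contributes 0.5 and the second 2.0, so the subset loss is their average.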
  • the configuration parameter also indicates the category of each attribute field in the candidate attribute fields. Different categories of attribute fields correspond to different types of statistics. This can improve the anomaly detection effect by calculating more valuable statistics.
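A minimal sketch of category-dependent statistics follows. The source does not list the categories or the statistics; the two categories (`"numeric"` and `"categorical"`) and the statistics chosen for each are assumptions made purely for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def field_statistics(values, category):
    """Compute statistics for one attribute field according to its category."""
    if category == "numeric":
        # Assumed statistics for numeric fields.
        return {"mean": mean(values), "std": pstdev(values), "max": max(values)}
    if category == "categorical":
        # Assumed statistics for categorical fields: cardinality and entropy in bits.
        counts = Counter(values)
        n = len(values)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return {"distinct": len(counts), "entropy": entropy}
    raise ValueError(f"unknown field category: {category}")


numeric_stats = field_statistics([1, 2, 3], "numeric")
categorical_stats = field_statistics(["GET", "GET", "POST", "POST"], "categorical")
```

The resulting per-field statistics would then be concatenated into the statistical feature vector that is fed to the detection model.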
  • The first hyperparameter includes the learning rate, the number of training epochs, and the hidden layer dimensions. These three parameters have a relatively large impact on model performance, so tuning only them can improve the execution efficiency of the anomaly detection task while still ensuring the performance of the parameter-tuned first detection model.
  • The first detection model includes an input layer, a first hidden layer and a second hidden layer. The hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer; the dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer. That is, the hidden layer dimensions are not set arbitrarily, so the search space of the hidden layer dimensions is small.
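The small search space can be illustrated with a sketch. The source does not state how the hidden layer dimensions are derived from the input dimension or which search strategy is used; the ratios in `hidden_dim_options`, the exhaustive grid search, and the `evaluate` callable (a stand-in for training the first detection model and measuring its validation loss) are all assumptions.

```python
from itertools import product


def hidden_dim_options(input_dim):
    """Derive a few (first, second) hidden layer sizes from the input dimension."""
    first = sorted({max(2, input_dim // 2), max(2, (3 * input_dim) // 4)})
    return [(h1, max(1, h1 // 2)) for h1 in first]


def grid_search(input_dim, evaluate, learning_rates=(1e-3, 1e-2), epochs=(10, 50)):
    """Try every combination and keep the one with the lowest loss."""
    best, best_loss = None, float("inf")
    for lr, ep, dims in product(learning_rates, epochs, hidden_dim_options(input_dim)):
        loss = evaluate(lr, ep, dims)
        if loss < best_loss:
            best = {"learning_rate": lr, "epochs": ep, "hidden_dims": dims}
            best_loss = loss
    return best


# Placeholder objective; a real one would train and validate the model.
toy_eval = lambda lr, ep, dims: abs(lr - 1e-2) + abs(ep - 50) / 100 + abs(dims[0] - 4)
best = grid_search(input_dim=8, evaluate=toy_eval)
```

Because the hidden sizes are tied to the input dimension, only two layer configurations are explored here, which keeps the total number of trials at eight.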
  • The first detection model includes an encoder, a decoder and a discriminator, and the parameters of the discriminator include an error threshold. Performing anomaly detection on the test set through the parameter-tuned first detection model based on the target attribute fields to obtain the anomaly detection results of the test set includes: determining the statistical features of the test set based on the target attribute fields, where the statistical features of the test set include statistics of the data of the target attribute fields in the test set; inputting the statistical features of the test set into the encoder to obtain the encoded features of the test set; inputting the encoded features of the test set into the decoder to obtain the reconstructed features of the test set; and inputting the statistical features and the reconstructed features of the test set into the discriminator to determine the anomaly detection results of the test set according to the error threshold.
  • the error threshold is determined according to the mean of multiple reconstruction losses, which include the error between the statistical features and the reconstructed features of each sample to be tested in the test set, or the error between the statistical features and the reconstructed features of each training sample in the training subset. That is, the error threshold is determined according to the average error of the sample population, which can improve the accuracy of anomaly detection.
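The encoder, decoder and discriminator flow, together with a mean-based threshold, can be sketched as follows. The source says only that the threshold is determined according to the mean of the reconstruction losses; multiplying the mean by a fixed `factor` is an assumption, as are the mean squared error measure and the toy encoder and decoder, which stand in for a trained model.

```python
def error_threshold(losses, factor=2.0):
    """Threshold derived from the mean reconstruction loss (factor is assumed)."""
    return factor * sum(losses) / len(losses)


def detect(stat_rows, encoder, decoder, threshold):
    """Flag a sample as anomalous when its reconstruction error exceeds the threshold."""
    results = []
    for stats in stat_rows:
        recon = decoder(encoder(stats))
        err = sum((s - r) ** 2 for s, r in zip(stats, recon)) / len(stats)
        results.append(err > threshold)  # the discriminator's decision
    return results


# Toy stand-ins for a trained model: compress to the mean, expand back to two values.
encoder = lambda v: [sum(v) / len(v)]
decoder = lambda z: [z[0], z[0]]

threshold = error_threshold([0.5, 1.5])  # losses from hypothetical training samples
flags = detect([[1.0, 1.0], [0.0, 4.0]], encoder, decoder, threshold)
```

The first sample is reconstructed exactly and passes, while the second has a reconstruction error of 4.0, above the threshold of 2.0, and is flagged.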
  • An anomaly detection device is provided, which has the function of implementing the behavior of the anomaly detection method in the first aspect.
  • The anomaly detection device includes one or more modules, which are used to implement the anomaly detection method provided by the first aspect.
  • an anomaly detection device which device includes:
  • the receiving module is used to receive the configuration parameters of the anomaly detection task, where the configuration parameters indicate the sample set, the test set and the candidate attribute fields, the sample set includes log data used for parameter tuning in the cloud platform, the test set includes log data to be detected for anomalies in the cloud platform, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform;
  • a determination module configured to determine the target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields, where the target attribute fields are the attribute fields used to perform the anomaly detection task;
  • a parameter tuning module configured to tune the first hyperparameter of the first detection model based on the sample set and the target attribute fields;
  • An anomaly detection module configured to perform anomaly detection on the test set based on the target attribute field through a parameter-tuned first detection model to obtain an anomaly detection result of the test set.
  • the candidate attribute fields include m selected fields and n candidate fields, m is an integer not less than 0, and n is an integer greater than 0;
  • the determination module includes:
  • the first determination sub-module is used to determine the field scores corresponding to the n candidate fields based on the sample set, the m selected fields and the n candidate fields, where a field score represents the extent to which the anomaly detection effect is improved after the corresponding candidate field is added to the m selected fields;
  • the second determination sub-module is used to determine p candidate fields from the n candidate fields based on the field scores corresponding to the n candidate fields, where the p is a positive integer not greater than n;
  • the third determination sub-module is used to determine the m selected fields and the p candidate fields as the target attribute fields.
  • the sample set includes a training subset and a validation subset;
  • the first determination sub-module is specifically used to:
  • form the m selected fields into a selected field set and the n candidate fields into a candidate field set, and, based on the training subset, the selected field set and the candidate field set, determine the mutual information corresponding to each candidate field in the candidate field set, where the mutual information represents the correlation between the corresponding candidate field and all fields in the selected field set;
  • determine the reconstruction losses corresponding to the k candidate fields, where a reconstruction loss represents the effect of performing anomaly detection on the validation subset using the corresponding candidate field together with the selected field set;
  • return to the step of determining the mutual information corresponding to each candidate field in the candidate field set, until the candidate field set is empty, to obtain the field scores corresponding to the n candidate fields.
  • the first determination sub-module is specifically used to:
  • for a first candidate field among the k candidate fields, the first candidate field is added to the selected field set to obtain a candidate field set, where the first candidate field is any one of the k candidate fields;
  • the reconstruction loss corresponding to the first candidate field is determined through the second detection model corresponding to the first candidate field.
  • the first determination sub-module is specifically used to:
  • An initial detection model is trained using the reference statistical features of the training subset to obtain a second detection model corresponding to the first candidate field.
  • the first determination sub-module is specifically used to:
  • the reconstruction loss corresponding to the first candidate field is determined.
  • the configuration parameter also indicates the category of each attribute field in the candidate attribute field, and attribute fields of different categories have different types of statistics corresponding to them.
  • the first hyperparameters include a learning rate, a number of training rounds, and a hidden layer dimension.
  • the first detection model includes an input layer, a first hidden layer and a second hidden layer; the hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer, the dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer.
  • the first detection model includes an encoder, a decoder and a discriminator, and the parameters of the discriminator include an error threshold;
  • the anomaly detection module includes:
  • the fourth determination sub-module is used to determine the statistical characteristics of the test set based on the target attribute field, where the statistical characteristics of the test set include statistics of the data of the target attribute field in the test set;
  • the first input submodule is used to input the statistical characteristics of the test set into the encoder to obtain the coding characteristics of the test set;
  • the second input submodule is used to input the coding features of the test set into the decoder to obtain the reconstructed features of the test set;
  • the third input submodule is used to input the statistical features and reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
  • a computing device cluster includes at least one computing device.
  • the computing device includes a processor and a memory.
  • the memory of the at least one computing device is used to store the program (that is, the instructions) required for executing the anomaly detection method of the first aspect.
  • the processor is configured to execute a program stored in the memory.
  • the computing device may also include a communication bus for establishing a connection between the processor and memory.
  • a fourth aspect provides a computer-readable storage medium in which a computer program is stored, which when run on a computer causes the computer to execute the anomaly detection method described in the first aspect.
  • a fifth aspect provides a computer program product containing instructions that, when run on a computer, causes the computer to execute the anomaly detection method described in the first aspect.
  • Figure 1 is a schematic structural diagram of an anomaly detection device provided by an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of another computing device cluster provided by an embodiment of the present application.
  • Figure 5 is a system architecture diagram involved in an anomaly detection method provided by an embodiment of the present application.
  • Figure 6 is a flow chart of an anomaly detection method provided by an embodiment of the present application.
  • Figure 7 is a flow chart of a method for determining field scores provided by an embodiment of the present application.
  • Figure 8 is a flow chart of another anomaly detection method provided by an embodiment of the present application.
  • Figure 9 is a flow chart of yet another anomaly detection method provided by an embodiment of the present application.
  • An embodiment of the present application provides an anomaly detection device, as shown in Figure 1.
  • the anomaly detection device includes:
  • the receiving module is used to receive the configuration parameters of the anomaly detection task, where the configuration parameters indicate the sample set, the test set and the candidate attribute fields, the sample set includes the log data used for parameter tuning in the cloud platform, the test set includes the log data to be detected for anomalies in the cloud platform, and the candidate attribute fields are the attribute fields corresponding to the log data of the cloud platform;
  • the determination module is used to determine the target attribute field from the candidate attribute field based on the sample set and the candidate attribute field, and the target attribute field is the attribute field used for anomaly detection tasks;
  • a parameter tuning module used to tune the first hyperparameter of the first detection model based on the sample set and the target attribute field;
  • the anomaly detection module is used to perform anomaly detection on the test set through the parameter-tuned first detection model based on the target attribute field to obtain the anomaly detection result of the test set.
  • the candidate attribute fields include m selected fields and n candidate fields, where m is an integer not less than 0 and n is an integer greater than 0;
  • The determination module includes:
  • the first determination sub-module is used to determine the field scores corresponding to the n candidate fields based on the sample set, the m selected fields and the n candidate fields, where a field score represents the extent to which the anomaly detection effect is improved after the corresponding candidate field is added to the m selected fields;
  • the second determination sub-module is used to determine p candidate fields from the n candidate fields based on the field scores corresponding to the n candidate fields, where p is a positive integer not greater than n;
  • the third determination sub-module is used to determine the m selected fields and p candidate fields as target attribute fields.
  • the sample set includes a training subset and a validation subset;
  • the first determination sub-module is specifically used to:
  • form the m selected fields into a selected field set and the n candidate fields into a candidate field set, and, based on the training subset, the selected field set and the candidate field set, determine the mutual information corresponding to each candidate field in the candidate field set, where the mutual information represents the correlation between the corresponding candidate field and all fields in the selected field set;
  • determine the reconstruction losses corresponding to the k candidate fields, where a reconstruction loss represents the effect of performing anomaly detection on the validation subset using the corresponding candidate field together with the selected field set.
  • the first determination sub-module is specifically used to:
  • for a first candidate field among the k candidate fields, the first candidate field is added to the selected field set to obtain a candidate field set, where the first candidate field is any one of the k candidate fields;
  • the reconstruction loss corresponding to the first candidate field is determined through the second detection model corresponding to the first candidate field.
  • the first determination sub-module is specifically used to:
  • the initial detection model is trained by using the reference statistical features of the training subset to obtain a second detection model corresponding to the first candidate field.
  • For the specific implementation method, please refer to the relevant introduction of the embodiment in Figure 6.
  • the first determination sub-module is specifically used to:
  • input the reference statistical features of the validation subset into the second detection model corresponding to the first candidate field to obtain the reference reconstructed features of the validation subset, where the reference reconstructed features of the validation subset include reconstructed statistics of the data, in the validation subset, of all fields included in the candidate field set;
  • the reconstruction loss corresponding to the first candidate field is determined.
  • the configuration parameter also indicates the category of each attribute field in the candidate attribute fields. Different categories of attribute fields correspond to different types of statistics. For specific implementation methods, please refer to the relevant introduction of the embodiment in Figure 6.
  • the first hyperparameter includes the learning rate, the number of training epochs, and the hidden layer dimensions. For specific implementation methods, please refer to the relevant introduction of the embodiment in Figure 6.
  • the first detection model includes an input layer, a first hidden layer and a second hidden layer; the hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer, the dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer.
  • the first detection model includes an encoder, a decoder and a discriminator, and the parameters of the discriminator include an error threshold;
  • The anomaly detection module includes:
  • a fourth determination submodule is used to determine the statistical characteristics of the test set based on the target attribute field, where the statistical characteristics of the test set include the statistics of the data of the target attribute field in the test set;
  • the first input submodule is used to input the statistical characteristics of the test set into the encoder to obtain the coding characteristics of the test set;
  • a second input submodule used for inputting the encoded features of the test set into the decoder to obtain the reconstructed features of the test set;
  • the third input submodule is used to input the statistical features and reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
  • the receiving module, determination module, parameter tuning module and anomaly detection module can all be implemented by software, or can be implemented by hardware, or can be implemented by a combination of software and hardware.
  • The following takes the parameter tuning module as an example to describe how a module can be implemented.
  • The implementation of the receiving module, determination module and anomaly detection module can refer to the implementation of the parameter tuning module.
  • a module can include code that runs on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container.
  • the above computing instance may be one or more.
  • a parameter tuning module can include code running on multiple hosts/VMs/containers.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs.
  • Each AZ includes one data center or multiple geographically close data centers. A region can usually include multiple AZs.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs.
  • a VPC is set up in a region.
  • a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.
  • a module may include at least one computing device, such as a server.
  • the parameter tuning module may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • Multiple computing devices included in the parameter tuning module can be distributed in the same region or in different regions. Multiple computing devices included in the parameter tuning module can be distributed in the same AZ or in different AZs. Similarly, multiple computing devices included in the parameter tuning module can be distributed in the same VPC or in multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the receiving module can be used to perform any steps in the anomaly detection method
  • the determining module can be used to perform any steps in the anomaly detection method
  • the parameter tuning module can be used to perform any steps in the anomaly detection method.
  • the steps responsible for implementation by the receiving module, determination module, parameter tuning module and anomaly detection module can be specified as needed.
  • Different steps of the anomaly detection method can be implemented by the receiving module, determination module, parameter tuning module and anomaly detection module respectively, so as to realize all functions of the anomaly detection device.
  • in this application, relevant personnel in the cloud service field do not need to manually create security rules or manually conduct in-depth analysis and summary of various attack modes. This avoids the loopholes of manually created security rules, reduces missed detections and false detections, and can also improve the efficiency of anomaly detection.
  • this solution also enables self-service, rapid construction of anomaly detection models for task-specific anomaly detection, even if the relevant personnel have little or no professional knowledge of machine learning and deep learning, such as knowledge of model design and tuning.
  • when the anomaly detection device provided in the above embodiment performs anomaly detection, the division of the above functional modules is used only as an example.
  • the above function allocation can be completed by different functional modules as needed. That is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the anomaly detection device provided by the above embodiments and the anomaly detection method embodiments belong to the same concept. Please refer to the method embodiments for the specific implementation process, which will not be described again here.
  • computing device 100 includes: bus 102 , processor 104 , memory 106 , and communication interface 108 .
  • the processor 104, the memory 106 and the communication interface 108 communicate through the bus 102.
  • Computing device 100 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 100.
  • the bus 102 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 2, but it does not mean that there is only one bus or one type of bus.
  • Bus 102 may include a path that carries information between various components of computing device 100 (e.g., memory 106, processor 104, communication interface 108).
  • the processor 104 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).
  • Memory 106 may include volatile memory, such as random access memory (RAM).
  • the memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a mechanical hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 106 stores executable program code, and the processor 104 executes the executable program code to respectively implement the functions of the aforementioned receiving module, determining module, parameter tuning module and anomaly detection module, thereby implementing the anomaly detection method. That is, the memory 106 stores instructions for executing the anomaly detection method.
  • the communication interface 108 uses transceiver modules such as, but not limited to, network interface cards and transceivers to implement communication between the computing device 100 and other devices or communication networks.
  • An embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • the computing device cluster includes at least one computing device 100.
  • the memory 106 in one or more computing devices 100 in the computing device cluster may store the same instructions for executing the anomaly detection method.
  • the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for executing the anomaly detection method.
  • a combination of one or more computing devices 100 may collectively execute instructions for performing the anomaly detection method.
  • the memories 106 in different computing devices 100 in the computing device cluster can store different instructions, respectively used to execute part of the functions of the anomaly detection device. That is, the instructions stored in the memory 106 in different computing devices 100 can implement the functions of one or more modules among the receiving module, the determining module, the parameter tuning module and the anomaly detection module.
  • one or more computing devices in a cluster of computing devices may be connected through a network.
  • the network may be a wide area network or a local area network, etc.
  • Figure 4 shows a possible implementation. As shown in Figure 4, two computing devices 100A and 100B are connected through a network. Specifically, the connection to the network is made through a communication interface in each computing device.
  • the memory 106 in the computing device 100A stores instructions for performing the functions of the receiving module and the determining module. At the same time, instructions for executing the functions of the parameter tuning module and the anomaly detection module are stored in the memory 106 of the computing device 100B.
  • the connection method between the computing devices shown in Figure 4 may be chosen for the following reason: the anomaly detection method provided by this application requires a large amount of data storage and computing resources, so the functions implemented by the parameter tuning module and the anomaly detection module are handed over to computing device 100B for execution.
  • the functions of computing device 100A shown in FIG. 4 may also be performed by multiple computing devices 100 .
  • the functions of computing device 100B may also be performed by multiple computing devices 100 .
  • the embodiment of the present application also provides another computing device cluster.
  • the connection relationship between the computing devices in the computing device cluster can be similar to the connection method of the computing device cluster described in FIG. 3 and FIG. 4 .
  • the difference is that the same instructions for executing the anomaly detection method may be stored in the memory 106 of one or more computing devices 100 in the computing device cluster.
  • the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for executing the anomaly detection method.
  • a combination of one or more computing devices 100 may collectively execute instructions for performing the anomaly detection method.
  • the embodiment of the present application also provides a computer program product including instructions.
  • the computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium.
  • when the computer program product runs on at least one computing device, the at least one computing device executes the above-mentioned anomaly detection method.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store or a data storage device such as a data center that contains one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to perform the above-mentioned anomaly detection method.
  • FIG. 5 is a system architecture diagram involved in an anomaly detection method provided by an embodiment of the present application.
  • this system can be called an anomaly detection system, which includes a client and a detection device.
  • the detection device is used to execute the anomaly detection task according to the configuration parameters of the received anomaly detection task. That is, the detection device is used to perform the steps of the anomaly detection method provided by the embodiment of the present application.
  • the client is used to send the configuration parameters of the anomaly detection task to the detection device. For example, when a configuration operation is detected, the client determines the configuration parameters of the anomaly detection task and sends the configuration parameters to the detection device.
  • the detection device is the computing device shown in Figure 2, or includes multiple computing devices shown in Figures 3/4.
  • Figure 6 is a flow chart of an anomaly detection method provided by an embodiment of the present application. Taking this method applied to detection equipment as an example, please refer to Figure 6. The method includes the following steps.
  • Step 601 Receive the configuration parameters of the anomaly detection task.
  • the configuration parameters indicate the sample set, the test set and the candidate attribute fields.
  • the sample set includes the log data used for parameter tuning in the cloud platform, and the test set includes the log data in the cloud platform on which anomaly detection is to be performed.
  • the candidate attribute fields are the attribute fields corresponding to the log data of the cloud platform.
  • relevant personnel manually configure some parameters that do not require professional knowledge of model design and tuning, and submit the configuration parameters of the anomaly detection task to the detection device.
  • the detection device receives the configuration parameters and, based on them, automatically performs feature engineering, model construction and parameter tuning, and then performs anomaly detection for the anomaly detection task through the parameter-tuned anomaly detection model.
  • the configuration parameter indicates the candidate attribute field, sample set and test set.
  • the sample set includes log data used for parameter tuning in the cloud platform
  • the test set includes log data to be used for anomaly detection in the cloud platform.
  • the log data in the cloud platform includes data of multiple attribute fields
  • the candidate attribute fields include part of the multiple attribute fields. Of course, they may also include all of the multiple attribute fields. That is, the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform.
  • the sample set is used for feature engineering, model building and parameter tuning. In the process of feature engineering, the detection equipment needs to filter out the target attribute fields from the candidate attribute fields based on the sample set.
  • the test set is the data to be detected indicated by the anomaly detection task. The subsequent detection equipment will perform anomaly detection on the test set through a parameter-tuned anomaly detection model based on the target attribute field.
  • the sample set includes a training subset and a validation subset.
  • This configuration parameter includes the start and end time of the training subset, the start and end time of the validation subset, and the start and end time of the test set.
  • the above anomaly detection model obtains the training subset from the log data of the cloud platform based on the start and end time of the training subset, obtains the verification subset from the log data of the cloud platform based on the start and end time of the verification subset, and obtains the test set from the log data of the cloud platform based on the start and end time of the test set.
  • the training subset and the verification subset are combined to build the anomaly detection model and tune parameters.
  • the configuration parameter also indicates the detection object.
  • the configuration parameter includes the identification of the detection object, and the detection object can be the target user, target service, or target host etc.
  • if the detection object is a target user, the anomaly detection task is used to detect anomalies in the target user's operations and access to the cloud platform.
  • the sample set and test set include log data related to the target user in the cloud platform.
  • if the detection object is a target service, the anomaly detection task is used to detect anomalies on the target service in the cloud platform.
  • the sample set and test set include log data related to the target service in the cloud platform.
  • if the detection object is a target host, the anomaly detection task is used to detect anomalies in operations and access related to the target host in the cloud platform.
  • the sample set and test set include log data related to the target host in the cloud platform.
  • the training subset includes multiple training samples
  • the validation subset includes multiple validation samples
  • the test set includes one or more samples to be tested.
  • this configuration parameter also indicates the time granularity of anomaly detection.
  • the anomaly detection model determines multiple samples included in the training subset, validation subset, and test set respectively.
  • the time granularity of each sample is equal to the time granularity of the anomaly detection.
  • a sample included in the training subset is called a training sample
  • a sample included in the verification subset is called a verification sample
  • a sample included in the test set is called a test sample.
  • the detection object is user a
  • the time granularity of anomaly detection is 24 hours
  • the starting and ending time of the training subset is from January 1 to June 30, 2022, that is, the training subset includes the data collected by users in these 6 months.
  • the detection device determines the log data of every 24 hours included in the 6 months in the training subset as a training sample included in the training subset.
  • this configuration parameter also indicates the time granularity of a single sample, and the time granularity of anomaly detection is an integer multiple of the time granularity of a single sample.
  • Each of the above samples includes multiple single samples. Based on the time granularity of a single sample, the detection device determines multiple single samples in each sample included in the training subset, verification subset, and test set respectively.
  • the detection object is user a
  • the time granularity of anomaly detection is 24 hours
  • the time granularity of a single sample is 1 hour.
  • the detection device determines the log data of each hour of the 24 hours included in each training sample as a single sample, thus obtaining 24 single samples included in each training sample.
  • the detection equipment determines the log data of each hour in the training subset as a single sample according to the time granularity of a single sample, and divides the 24 single samples every 24 hours in the training subset into a training sample.
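  • The two granularities can be illustrated with a minimal Python sketch. It assumes each log record is a `(timestamp, payload)` tuple and that samples are aligned to the start time of the subset; the function name `split_into_samples` and the record shape are hypothetical and are not taken from this application.

```python
from datetime import datetime, timedelta


def split_into_samples(records, start, end, detection_granularity, single_granularity):
    """Group log records into samples, each made of several single samples.

    records: iterable of (timestamp, payload) tuples (hypothetical record shape).
    Returns a list of samples; each sample is a list of single samples, and each
    single sample is the list of payloads whose timestamp falls in that slot.
    """
    # The detection granularity must be an integer multiple of the single-sample one.
    if detection_granularity % single_granularity != timedelta(0):
        raise ValueError("detection granularity is not a multiple of the single-sample granularity")
    per_sample = detection_granularity // single_granularity
    n_samples = (end - start) // detection_granularity
    samples = [[[] for _ in range(per_sample)] for _ in range(n_samples)]
    covered_end = start + n_samples * detection_granularity
    for ts, payload in records:
        if not (start <= ts < covered_end):
            continue  # outside the configured start and end time
        offset = ts - start
        i = offset // detection_granularity                         # index of the sample
        j = (offset % detection_granularity) // single_granularity  # slot inside the sample
        samples[i][j].append(payload)
    return samples
```

  • With a 24-hour detection granularity and a 1-hour single-sample granularity, every returned sample holds 24 single samples, matching the user a example above.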
  • the candidate attribute fields indicated by this configuration parameter include selected fields and candidate fields.
  • the selected fields refer to the fields that are already determined to be used in the subsequent anomaly detection process of the test set
  • the candidate fields refer to the fields that may be used in the subsequent anomaly detection process of the test set, pending selection.
  • the detection equipment will determine a part of the candidate fields from all the candidate fields included in the candidate attribute field, and determine this part of the candidate fields and all the selected fields as the target attribute fields.
  • the target attribute fields include all fields required in the subsequent process of anomaly detection on the test set, that is, the target attribute fields are the attribute fields used for anomaly detection tasks.
  • the anomaly detection model obtains the statistics of the candidate attribute fields in the log data through statistics, and performs feature engineering based on the statistics.
  • the candidate attribute fields include multiple attribute fields. Because different attribute fields have different characteristics, the specific statistical methods differ between attribute fields, so that more valuable statistics can be obtained.
  • the configuration parameter also indicates the category of each attribute field in the candidate attribute fields, and the types of statistics corresponding to different categories of attribute fields differ. The specific statistics corresponding to different categories of attribute fields will be introduced in detail in step 602 below.
  • the category of each attribute field is a first-type attribute, a second-type attribute, a third-type attribute, or a fourth-type attribute.
  • the first type of attributes is situational attributes. For example, fields that represent a specific scenario, such as service name or service type, belong to the first category of attributes.
  • the values of the attribute fields belonging to the second type of attributes are discrete values, and the types of values do not exceed the type threshold. That is, the values of the attribute fields belonging to the second type of attributes are discrete values and have fewer types.
  • all possible values of an attribute field with limited values are ordered. For example, the status field belongs to the second type of attribute, and the possible values of the status field include ‘0, 1, 2, 3’.
  • the values of the attribute fields belonging to the third type of attributes are discrete values, and the types of values exceed the type threshold, that is, the values of the attribute fields belonging to the third type of attributes are discrete values and have many types.
  • the remote operation address field belongs to the third category of attributes, and the value of the remote operation address field can be '192.168.2.1', '192.168.1.3', etc.
  • the values of attribute fields belonging to the fourth category of attributes are continuous values.
  • the total length field of a hypertext transfer protocol (HTTP) request belongs to the fourth category of attributes.
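  • The role of the type threshold in separating the second and third categories can be sketched as follows. In this application the category of each field is indicated by the configuration parameters, so this helper is only an illustration: the function name `categorize_field` and the `situational` and `continuous` flags are hypothetical, and whether a field is situational or continuous is passed in rather than inferred.

```python
def categorize_field(values, type_threshold, situational=False, continuous=False):
    """Return the attribute category ("first" .. "fourth") of one attribute field.

    values: observed values of the field in the log data.
    type_threshold: maximum number of distinct values for a second-type attribute.
    situational / continuous: supplied by the caller (assumption), not inferred.
    """
    if situational:
        return "first"   # e.g. service_name
    if continuous:
        return "fourth"  # e.g. request_length
    # Discrete values: few distinct values -> second type, many -> third type.
    return "second" if len(set(values)) <= type_threshold else "third"
```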
  • a piece of log data includes but is not limited to data in the following attribute fields: time (timestamp), userId (user identification), remote_addr (remote operation address), service_name (service name, also called service type), api_id (the application identification of the operation; one service can provide one or more applications), body_bytes_sent (the HTTP content length of the operation), forward_flag (the flag of whether the backend forwards), status (status, indicating whether the operation is successful), request_method (HTTP request method), request_length (total length of HTTP request), diff_time (response time), accessModel (access mode), deploy_type (database type).
  • time is used to determine the log data included in each sample
  • userId is used to filter users
  • time and userId are not used as statistical objects
  • service_name is the first type of attribute
  • categories of the other attribute fields are as follows.
  • the selected fields in the candidate attribute fields include three attribute fields: remote_addr, api_id, and body_bytes_sent, and the candidate fields in the candidate attribute fields include seven attribute fields: forward_flag, status, accessModel, deploy_type, request_method, diff_time, and request_length.
  • the statistics that need to be calculated can also be determined based on whether the anomaly detection task is time-series related.
  • this configuration parameter also indicates whether the anomaly detection task is timing dependent.
  • the statistics corresponding to some attribute fields also include user access profiles and/or access profile similarities. This content will also be introduced in detail in step 602.
  • log data in the cloud platform may be stored on multiple devices and multiple paths.
  • log data related to cloud service a is stored in device a
  • log data related to cloud service b is stored in device b
  • the log data of enterprise 1 is stored in path 1
  • the log data of enterprise 2 is stored in path 2.
  • the storage location of log data related to anomaly detection tasks can also be configured manually.
  • this configuration parameter also indicates the storage location of log data related to the anomaly detection task.
  • the anomaly detection model obtains the log data in the sample set and test set from the cloud platform according to the storage location.
  • Table 2 is an input information table for configuration parameters provided by the embodiment of the present application. According to Table 2, the above configuration parameters can be configured.
  • In order to ensure the successful execution of the anomaly detection task, after receiving the configuration parameters of the anomaly detection task, the detection device performs a logic check on the configuration parameters. If a logical anomaly of the configuration parameters is detected, the detection device feeds back prompt information to prompt reconfiguration. For example, if the time granularity of anomaly detection is smaller than the time granularity of a single sample, or the time granularity of anomaly detection is not an integer multiple of the time granularity of a single sample, or the time period corresponding to the start and end time of the test set overlaps with the time period corresponding to the start and end time of the training subset, the configuration parameters are logically abnormal. If the logic of the configuration parameters is detected to be normal, the detection device continues to perform step 602.
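  • A logic check of this kind might look like the following sketch. The `TaskConfig` container, its field names and the `check_config` function are hypothetical; the granularity rule follows the integer-multiple requirement stated earlier for the two time granularities, and the guard against a non-positive single-sample granularity is an extra assumption added only to keep the sketch safe.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskConfig:  # hypothetical container for a few of the configuration parameters
    train_start: datetime
    train_end: datetime
    test_start: datetime
    test_end: datetime
    detection_granularity: timedelta
    single_granularity: timedelta


def check_config(cfg):
    """Return a list of detected logical anomalies; an empty list means the logic is normal."""
    problems = []
    if cfg.single_granularity <= timedelta(0):
        # Extra guard (assumption, not from the text): avoids a division by zero below.
        problems.append("single-sample granularity must be positive")
    elif cfg.detection_granularity < cfg.single_granularity:
        problems.append("detection granularity is smaller than the single-sample granularity")
    elif cfg.detection_granularity % cfg.single_granularity != timedelta(0):
        problems.append("detection granularity is not an integer multiple of the single-sample granularity")
    # Two half-open periods overlap when each one starts before the other ends.
    if cfg.test_start < cfg.train_end and cfg.train_start < cfg.test_end:
        problems.append("test set period overlaps the training subset period")
    return problems
```

  • A non-empty result corresponds to the case where the detection device feeds back prompt information instead of continuing to step 602.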
  • Step 602 Based on the sample set and the candidate attribute fields, determine the target attribute field from the candidate attribute fields.
  • the target attribute field is the attribute field used for the anomaly detection task.
  • the detection device obtains the sample set indicated by the configuration parameter from the log data of the cloud platform.
  • the configuration data includes the start and end time of the training subset and the start and end time of the verification subset.
  • the sample set includes the training subset and the verification subset.
  • the detection device obtains all training samples included in the training subset according to the start and end time of the training subset.
  • the detection equipment obtains all verification samples included in the verification subset according to the start and end times of the verification subset.
  • the configuration data includes the identification of the detection object, the start and end time of the training subset, the start and end time of the verification subset, the storage location of the log data, and the time granularity of a single sample.
  • the sample set includes the training subset and the verification subset.
  • the detection device obtains all single samples included in the training subset based on the identification of the detection object, the start and end time of the training subset, the storage location of the log data, and the time granularity of the single sample. In the same way, the detection device obtains all single samples included in the verification subset according to the identification of the detection object, the start and end time of the verification subset, the storage location of the log data, and the time granularity of the single sample.
  • After obtaining the sample set, the detection device determines the target attribute field from the candidate attribute fields based on the sample set and the candidate attribute fields. That is, the detection device uses the sample set to filter out the target attribute fields from the candidate attribute fields through feature engineering.
  • candidate attribute fields include selected fields and candidate fields.
  • the embodiment of this application filters the candidate fields by determining the field score of each candidate field. This will be described in detail next.
  • the candidate attribute fields include m selected fields and n candidate fields, m is an integer not less than 0, and n is an integer greater than 0.
  • the detection device determines the field scores corresponding to the n candidate fields based on the sample set, the m selected fields and the n candidate fields.
  • the detection device determines p candidate fields from the n candidate fields based on the field scores corresponding to the n candidate fields, where p is a positive integer not greater than n.
  • the detection device determines the m selected fields and p candidate fields as target attribute fields.
  • the field score represents the degree to which the anomaly detection effect is improved after adding the corresponding candidate fields to the m selected fields. Put another way, the field score represents the effect of anomaly detection through the selected fields and the corresponding candidate fields.
  • the embodiment of this application determines the field score by combining the mutual information between fields and the reconstruction loss corresponding to the fields. This will be introduced next in conjunction with Figure 7.
  • Figure 7 is a flow chart of a method for determining field scores provided by an embodiment of the present application. The method includes steps 6021 to 6026.
  • Step 6021 The m selected fields are formed into a selected field set, and the n candidate fields are formed into a candidate field set.
  • Step 6022 Based on the training subset, the selected field set and the candidate field set, determine the mutual information corresponding to each candidate field in the candidate field set.
  • mutual information represents the correlation between the corresponding candidate field and all fields in the selected field set.
  • the detection device determines the mutual information corresponding to each candidate field in the candidate field set based on the training subset, the selected field set, and the candidate field set.
  • the implementation process is: based on the statistics of the data of the second candidate field in the training subset and the statistics of the data of all selected fields included in the selected field set, determining a plurality of first mutual information.
  • the plurality of first mutual information include the mutual information between the second candidate field and each selected field in the selected field set.
  • the second candidate field is any candidate field in the candidate field set; the maximum value among the plurality of first mutual information is determined as the mutual information corresponding to the second candidate field. Simply put, the mutual information between fields is calculated through the statistics corresponding to the fields.
  • the statistics of the data of the first selected field in the training subset include R first statistics
  • the statistics of the data of the second candidate field in the training subset include S second statistics.
  • the first selected field is any selected field in the selected field set, and R and S are both integers greater than 0.
  • the implementation process in which the detection device determines the plurality of first mutual information based on the statistics of the data of the second candidate field in the training subset and the statistics of the data of all selected fields included in the selected field set includes: determining S second mutual information through multiple rounds of iterations based on the R first statistics and the S second statistics; and determining the mean value of the S second mutual information as the mutual information, among the plurality of first mutual information, between the second candidate field and the first selected field.
  • in the j-th iteration, the detection device determines the mutual information between the j-th second statistic among the S second statistics and each of (R-j+1) first statistics among the R first statistics, to obtain (R-j+1) reference mutual information corresponding to the (R-j+1) first statistics, where j is an integer greater than 0 and not greater than R.
  • the detection device determines the maximum value among the (R-j+1) reference mutual information as the j-th second mutual information among the S second mutual information, and determines the (R-j) first statistics that remain after excluding, from the (R-j+1) first statistics, the first statistic corresponding to the maximum value as the (R-j) first statistics used in the (j+1)-th iteration.
  • status is a selected field
  • the statistics corresponding to status in the training subset include 500 first statistics, that is, it contains 500 dimensions.
  • diff_time is a candidate field
  • the statistics corresponding to diff_time in the training subset include 300 second statistics, that is, it contains 300 dimensions.
  • The detection device calculates the mutual information between the 1st dimension of diff_time and each of the 500 dimensions of status to obtain 500 reference mutual information values. Assuming that the maximum of these 500 values corresponds to the 321st dimension of status, the detection device determines this maximum as the mutual information between the 1st dimension of diff_time and status, that is, the first of the second mutual information values, and then removes the 321st dimension of status.
  • The detection device calculates the mutual information between the 2nd dimension of diff_time and each of the remaining 499 dimensions of status to obtain 499 reference mutual information values. Assuming that the maximum of these 499 values corresponds to the 432nd dimension of status, the detection device determines this maximum as the mutual information between the 2nd dimension of diff_time and status, that is, the second of the second mutual information values, and then removes the 432nd dimension of status.
  • Continuing in this way, the detection device obtains the mutual information between each of the 300 dimensions of diff_time and status, that is, 300 second mutual information values in total. The detection device uses the mean of these 300 second mutual information values as the mutual information between diff_time and status.
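  • The greedy matching in the status/diff_time example can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names are hypothetical, the statistics are assumed to be already discretized, and the candidate field is assumed to have no more dimensions than the selected field.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def field_mutual_information(candidate_dims, selected_dims):
    """Match each candidate dimension to its best remaining selected dimension.

    Each argument is a list of dimensions; each dimension is the list of
    values that one statistic takes over the training samples.
    Assumes len(candidate_dims) <= len(selected_dims).
    """
    remaining = list(range(len(selected_dims)))
    best_values = []
    for cand in candidate_dims:
        scores = [mutual_information(cand, selected_dims[r]) for r in remaining]
        best = max(range(len(scores)), key=scores.__getitem__)
        best_values.append(scores[best])
        remaining.pop(best)  # the matched dimension is not reused later
    # The field-level mutual information is the mean of the per-dimension maxima.
    return sum(best_values) / len(best_values)
```

  • Removing the matched dimension after each step mirrors the removal of the 321st and 432nd dimensions of status in the example, so that no dimension of the selected field is counted twice.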
  • the detection device can calculate the mutual information between statistics according to formula (1).
  • I( Su ; Sc ) represents the mutual information between the statistic Su and the statistic Sc
  • su and sc represent the values of Su and Sc respectively.
  • P(·,·) represents the joint probability distribution
  • P(·) represents the probability density.
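  • For reference, the standard definition of mutual information for discrete-valued statistics, written with the symbols above, is given below. It is offered as an assumption about the form of formula (1); for continuous values the sums would be replaced by integrals over probability densities.

```latex
I(S_u; S_c) = \sum_{s_u} \sum_{s_c} P(s_u, s_c) \, \log \frac{P(s_u, s_c)}{P(s_u)\, P(s_c)}
```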
  • the discretization method may be a quartile-based method or other methods, which are not limited in the embodiments of the present application.
  • the detection device discretizes statistics with continuous values based on the quartile method.
  • For a statistic whose quartiles are not yet known, the detection device calculates the quartiles based on all values of the statistic in all training samples, and the three calculated quartile points are denoted Q1, Q2 and Q3 in sequence.
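  • A minimal sketch of quartile-based discretization follows, using only the Python standard library. The function names, the "inclusive" quantile method and the rule that a value equal to a cut point falls into the higher bin are assumptions made for illustration.

```python
import bisect
import statistics


def quartile_cut_points(training_values):
    """Return [Q1, Q2, Q3] computed from all training values of one statistic."""
    return statistics.quantiles(training_values, n=4, method="inclusive")


def discretize(value, cut_points):
    """Map a continuous value to one of four bins (0, 1, 2 or 3)."""
    return bisect.bisect_right(cut_points, value)


cuts = quartile_cut_points([1, 2, 3, 4, 5, 6, 7, 8, 9])
bins = [discretize(v, cuts) for v in (1, 3, 6, 9)]
```

  • After this step every continuous statistic takes at most four distinct values, which allows the probabilities needed for mutual information to be estimated by counting.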
  • Step 6023 Select k candidate fields with the smallest mutual information from the candidate field set.
  • the detection device selects k candidate fields from the candidate field set in ascending order of mutual information.
  • k is a positive integer not greater than n. More precisely, k is not greater than the total number of fields included in the candidate field set.
  • The detection device determines k according to a preset value. If multiple candidate fields in the current candidate field set share the smallest mutual information and their total number exceeds the preset value, the detection device sets the current k equal to the total number of candidate fields with the smallest mutual information. If the total number of fields in the current candidate field set is less than the preset value, the detection device sets the current k equal to the total number of fields in the candidate field set, or to that total number reduced by a specified value (such as 1), to ensure that k does not exceed the total number of fields in the candidate field set.
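  • The rules for determining k in step 6023 can be sketched as follows. The function name and the `reduce_by` parameter are hypothetical; the guard that keeps at least one field is an added assumption.

```python
def choose_k_fields(mutual_info, preset_k, reduce_by=0):
    """Pick the candidate fields evaluated in this iteration.

    mutual_info: dict mapping candidate field name -> mutual information.
    preset_k:    preset number of fields to pick.
    reduce_by:   optional reduction applied when fewer than preset_k fields remain.
    """
    total = len(mutual_info)
    smallest = min(mutual_info.values())
    ties = sum(1 for v in mutual_info.values() if v == smallest)
    if total < preset_k:
        k = max(1, total - reduce_by)  # never exceeds the number of remaining fields
    elif ties > preset_k:
        k = ties  # keep every field tied at the smallest mutual information
    else:
        k = preset_k
    ranked = sorted(mutual_info, key=mutual_info.get)  # ascending mutual information
    return ranked[:k]
```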
  • Step 6024 Based on the training subset, the verification subset, the selected field set and the k candidate fields, determine the reconstruction losses corresponding to the k candidate fields.
  • the reconstruction loss represents the effect of anomaly detection on the validation subset through the corresponding set of candidate fields and selected fields.
  • The detection device determines the reconstruction losses corresponding to the k candidate fields based on the training subset, the verification subset, the selected field set and the k candidate fields as follows: for a first candidate field among the k candidate fields, where the first candidate field is any one of the k candidate fields, the first candidate field is added to the selected field set to obtain a candidate field set (that is, the selected field set with the first candidate field added); based on the training subset and this candidate field set, the second detection model corresponding to the first candidate field is determined; and based on the verification subset and this candidate field set, the reconstruction loss corresponding to the first candidate field is determined through the second detection model corresponding to the first candidate field.
  • the detection device can obtain k candidate field sets that correspond to the k candidate fields one-to-one, and k second detection models that correspond to the k candidate fields one-to-one.
  • After the detection device adds any one of the k candidate fields to the selected field set, it trains a second detection model on the training subset based on the selected field set with that candidate field added, and then uses the verification subset and the second detection model to verify how much adding that candidate field to the selected field set improves the anomaly detection effect.
  • The detection device determines the reference statistical features of the training subset based on the candidate field set (that is, the candidate field set corresponding to the first candidate field among the k candidate field sets). The reference statistical features of the training subset include the statistics of the data of all fields included in this candidate field set in the training subset.
  • the detection device trains the initial detection model through the reference statistical features of the training subset to obtain the second detection model corresponding to the first candidate field.
  • the detection device determines reference statistical characteristics of the verification subset based on the candidate field set, and the reference statistical characteristics of the verification subset include statistics of data of all fields included in the candidate field set in the verification subset.
  • The detection device inputs the reference statistical features of the verification subset into the second detection model corresponding to the first candidate field to obtain the reference reconstruction features of the verification subset. The reference reconstruction features of the verification subset include the reconstructed statistics of the data of all fields included in the candidate field set in the verification subset.
  • the detection device determines the reconstruction loss corresponding to the first candidate field based on the reference statistical features and reference reconstruction features of the verification subset.
  • the reconstruction loss corresponding to the first candidate field also represents the reconstruction effect on the reference statistical characteristics of the verification subset after adding the first candidate field to the selected field set.
  • the detection device determines the reconstruction loss of the first candidate field according to formula (2).
  • formula (2) represents the reconstruction loss corresponding to the first candidate field
  • n represents the total number of validation samples included in the validation subset
  • m represents the total dimension of the statistics of the first field to be selected in each validation sample and the data of all selected fields included in the selected field set
  • d represents the dth dimension in the total dimension
  • i represents the total number of single samples included in the validation subset.
  • d-th dimension reference statistic i.e., the input statistic of the second detection model
  • d-th dimension reconstruction statistic i.e., the output statistic of the second detection model
  • The detection device normalizes the reference statistical features and reference reconstruction features of the verification subset, and determines the reconstruction loss corresponding to the first candidate field based on the normalized reference statistical features and reference reconstruction features of the verification subset.
  • The detection device normalizes the reference statistics and the reconstruction statistics so that their value ranges are mapped to the interval [-1,1].
  • For example, a statistic at the time granularity of a single sample (such as a one-hour-granularity statistic) is normalized based on all statistics at that time granularity (such as the statistics at each hour within the 24 hours) of the verification sample to which the single sample belongs.
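  • The normalization and loss computation can be sketched as follows. Several points are assumptions rather than statements of the application: the loss is taken to be a mean squared error averaged over dimensions and validation samples, min-max scaling is used to reach [-1,1], and the scaling range is taken from the reference statistics of each validation sample and reused for its reconstruction. All names are hypothetical.

```python
def to_unit_range(values, lo, hi):
    """Min-max scale values so that [lo, hi] maps onto [-1, 1]."""
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def reconstruction_loss(reference, reconstructed):
    """Mean squared error between normalized inputs and outputs of the model.

    reference:     list of validation samples, each a list of m input statistics.
    reconstructed: list of the corresponding m output statistics per sample.
    """
    total = 0.0
    for x, x_hat in zip(reference, reconstructed):
        lo, hi = min(x), max(x)  # range taken from the reference statistics
        x_norm = to_unit_range(x, lo, hi)
        x_hat_norm = to_unit_range(x_hat, lo, hi)
        total += sum((a - b) ** 2 for a, b in zip(x_norm, x_hat_norm)) / len(x)
    return total / len(reference)
```

  • Scaling both vectors with the same range keeps the loss comparable across candidate fields whose raw statistics differ by orders of magnitude, which is the usual motivation for normalizing before comparing losses.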
  • the above-mentioned second detection model is an anomaly detection model.
  • the structure of the second detection model is roughly the same as that of the first detection model.
  • the differences include the dimensions of the input layer and the following differences in model parameters.
  • The dimensions of the input layer in the second detection model match the total number of fields included in the selected field set after the first candidate field is added, while the dimensions of the input layer in the first detection model match the number of fields included in the target attribute field.
  • all hyperparameters of the second detection model may be preset, while part of the hyperparameters of the first detection model (such as the first hyperparameter below) are to be tuned.
  • Step 6025 Based on the mutual information and reconstruction loss corresponding to the k candidate fields, select one candidate field from the k candidate fields, and determine the field score of the selected candidate field.
  • After obtaining the mutual information and reconstruction losses corresponding to the k candidate fields, the detection device can combine the mutual information and the reconstruction loss to determine the field score of a candidate field.
  • The detection device selects one candidate field from the k candidate fields and determines its field score as follows: based on the mutual information and reconstruction losses corresponding to the k candidate fields, it determines the comprehensive scores corresponding to the k candidate fields, determines the candidate field with the highest comprehensive score among the k candidate fields as the selected candidate field, and determines the comprehensive score of the selected candidate field as its field score. It should be understood that in this process the detection device obtains k comprehensive scores that correspond one-to-one to the k candidate fields.
  • the detection device can process the mutual information and reconstruction losses corresponding to the k candidate fields by weighted summation or other methods to obtain the comprehensive scores corresponding to the k candidate fields.
  • the detection device determines the mutual information scores corresponding to the k candidate fields based on the mutual information corresponding to the k candidate fields.
  • the higher the mutual information score of the candidate field the greater the improvement in the anomaly detection effect. That is, the size of the mutual information is negatively correlated with the anomaly detection effect, while the size of the mutual information score is positively correlated with the anomaly detection effect.
  • the detection device determines the comprehensive scores corresponding to the k candidate fields based on the mutual information scores and reconstruction losses corresponding to the k candidate fields.
  • the detection device determines the mutual information score of the candidate field according to formula (3), and determines the comprehensive score of the candidate field according to formula (4).
  • field represents a field to be selected
  • I represents the mutual information corresponding to the field to be selected
  • Score(I) represents the mutual information score of the field to be selected
  • Score(Loss) represents the reconstruction loss score of the field to be selected.
  • Score(field) represents the field score of the field to be selected.
  • the coefficient in formula (4) is an adjustable parameter, and its default value can be 1 or another value.
  • $\mathrm{Score}(I) = 1 - (2 \times \mathrm{sigmoid}(I) - 1)$ (3)
  • the detection device normalizes the mutual information corresponding to the candidate field through the sigmoid function, that is, the value range of the mutual information is mapped to the interval (0,1).
  • the value range of the mutual information score obtained by formula (3) is also the interval (0,1). The larger the mutual information score corresponding to a candidate field, the higher the degree of improvement of the anomaly detection effect of the candidate field.
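  • The scoring in step 6025 can be sketched as follows. `mi_score` follows formula (3). The forms of `loss_score` and of the weighted sum in `field_score` are hypothetical stand-ins, since the expressions for Score(Loss) and formula (4) are not reproduced here; they only illustrate a score that grows as the loss shrinks and a weighted combination of the two scores.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mi_score(mutual_info):
    """Formula (3): smaller mutual information gives a larger score."""
    return 1.0 - (2.0 * sigmoid(mutual_info) - 1.0)


def loss_score(loss):
    """Hypothetical loss score: smaller reconstruction loss gives a larger score."""
    return 1.0 / (1.0 + loss)


def field_score(mutual_info, loss, weight=1.0):
    """Hypothetical weighted sum of the two scores; weight defaults to 1."""
    return mi_score(mutual_info) + weight * loss_score(loss)


def pick_best(candidates, weight=1.0):
    """candidates: dict mapping field name -> (mutual information, reconstruction loss)."""
    scores = {f: field_score(mi, loss, weight) for f, (mi, loss) in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```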
  • Step 6026 Move the selected candidate field from the candidate field set to the selected field set, and return to step 6022 until the candidate field set is empty, so as to obtain the field scores corresponding to the n candidate fields.
  • After the detection device moves the selected candidate field from the candidate field set into the selected field set, if the candidate field set is not empty, it returns to step 6022; if the candidate field set is empty, the detection device has obtained the field scores corresponding to the n candidate fields respectively.
  • the detection device essentially determines the field scores corresponding to the n candidate fields through multiple rounds of iterations. In each iteration process, the field score of a candidate field is obtained. After n rounds of iteration, the field scores corresponding to the n candidate fields are obtained. To put it another way, the detection device determines the target attribute field from the candidate attribute field based on the sample set and the candidate attribute field as follows:
  • the detection device first forms the m selected fields into a selected field set, and the n candidate fields into a candidate field set.
  • In the i-th round of iteration, the detection device determines the mutual information corresponding to each candidate field in the candidate field set based on the statistics of the data, in the training subset, of all fields included in the selected field set and the candidate field set, where i is an integer greater than 0 and not greater than n.
  • The detection device selects k candidate fields from the candidate field set in ascending order of mutual information, and adds each of the k candidate fields to the selected field set respectively, so as to obtain k candidate field sets that correspond one-to-one to the k candidate fields.
  • The detection device obtains the statistics of the data of all fields included in a first candidate field set in the training subset, and the statistics of the data of all fields included in the first candidate field set in the verification subset, where the first candidate field set is any one of the k candidate field sets.
  • The detection device trains the initial detection model with the statistics of the data of all fields included in the first candidate field set in the training subset to obtain the second detection model corresponding to the first candidate field, where the first candidate field is the one among the k candidate fields that corresponds to the first candidate field set.
  • the detection device inputs statistics of data of all fields included in the first candidate field set in the verification subset into the second detection model to obtain reconstructed statistics of data of all fields included in the first candidate field set in the verification subset.
  • the detection device determines the reconstruction loss corresponding to the first candidate field based on the statistics and reconstruction statistics of the data of all fields included in the first candidate field set in the verification subset.
  • the detection device determines the comprehensive score of each of the k candidate fields based on the mutual information and reconstruction loss corresponding to each of the k candidate fields.
  • The detection device selects the candidate field with the highest comprehensive score among the k candidate fields, determines the comprehensive score of the selected candidate field as its field score, and moves the selected candidate field from the candidate field set into the selected field set to obtain the selected field set and the candidate field set for the next iteration. After the last round of iteration is completed, the detection device obtains the field score of each of the n candidate fields. Then, the detection device determines p candidate fields from the n candidate fields according to the field score of each of the n candidate fields, and determines the m selected fields and the p candidate fields as the target attribute field.
  • In each iteration, all current candidate fields are traversed, the mutual information between each candidate field and all currently selected fields is calculated one by one, and the k candidate fields with the smallest mutual information are selected. Then, each of the k candidate fields is added in turn to all currently selected fields to obtain k candidate field sets. The training subset and the verification subset are used to verify the reconstruction effects corresponding to the k candidate field sets, that is, to obtain the reconstruction losses corresponding to the k candidate fields. Based on the mutual information and reconstruction losses corresponding to the k candidate fields, the comprehensive score of each of the k candidate fields is determined.
  • Based on the comprehensive scores, the best candidate field in this round of iteration is selected from the k candidate fields, and the selected candidate field is moved from the set of all current candidate fields to the set of all currently selected fields. The above process is iterated until the set of current candidate fields is empty. After that, the target attribute field is determined based on the comprehensive scores of the candidate fields selected in each iteration.
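  • The iteration summarized above can be sketched as a greedy loop. The three callables are hypothetical placeholders for the mutual information computation, the training and validation of the second detection model, and the comprehensive score; a fixed k is used for brevity.

```python
def score_candidate_fields(selected, candidates, mutual_info_fn, loss_fn, score_fn, k):
    """Return a dict mapping each candidate field to its field score.

    mutual_info_fn(field, selected) -> mutual information of field vs. selected fields
    loss_fn(fields)                 -> reconstruction loss of a model using these fields
    score_fn(mutual_info, loss)     -> comprehensive score
    """
    selected = list(selected)
    remaining = list(candidates)
    field_scores = {}
    while remaining:
        mi = {f: mutual_info_fn(f, selected) for f in remaining}
        shortlist = sorted(remaining, key=mi.get)[:k]  # k smallest mutual information
        losses = {f: loss_fn(selected + [f]) for f in shortlist}
        scores = {f: score_fn(mi[f], losses[f]) for f in shortlist}
        best = max(shortlist, key=scores.get)
        field_scores[best] = scores[best]
        remaining.remove(best)  # move the best field out of the candidate set
        selected.append(best)   # and into the selected set for the next round
    return field_scores
```

  • Only k models are trained per round rather than one per remaining field, which is why the cheaper mutual information filter runs before the reconstruction loss is evaluated.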
  • In the case where the selected field set formed from the m selected fields is empty, after forming the n candidate fields into a candidate field set, the detection device determines, in the first round of iteration, the mutual information corresponding to each candidate field in the candidate field set based on the training subset, the selected field set and the candidate field set as follows: the n-1 candidate fields other than a first reference candidate field are used as a hypothetical selected field set, and the mutual information corresponding to the first reference candidate field is determined based on the training subset, the hypothetical selected field set and the first reference candidate field.
  • After determining a second reference candidate field, the detection device determines the mutual information corresponding to the second reference candidate field in the same manner.
  • In this way, the mutual information corresponding to n reference candidate fields is obtained in total. The mutual information corresponding to these n reference candidate fields is the mutual information corresponding to the n candidate fields in the candidate field set.
  • The detection device selects the k candidate fields with the smallest mutual information from the candidate field set. Based on the training subset, the verification subset, the hypothetical selected field set and the k candidate fields, the reconstruction losses corresponding to the k candidate fields are determined. The detection device selects one candidate field from the k candidate fields based on the mutual information and reconstruction losses corresponding to the k candidate fields, and determines the field score of the selected candidate field. The detection device moves the selected candidate field from the candidate field set into the selected field set. At this point, the first selected field in the selected field set is obtained.
  • In subsequent rounds of iteration, the detection device determines the mutual information corresponding to each candidate field in the candidate field set based on the training subset, the selected field set and the candidate field set, and so on, until the candidate field set is empty and the field scores corresponding to the n candidate fields are obtained.
  • After the detection device obtains the field scores corresponding to the n candidate fields, p candidate fields can be selected from the n candidate fields either through manual decision-making or through automatic decision-making, and the detection device then determines the m selected fields and the p candidate fields as the target attribute field.
  • the detection device sends field score information, where the field score information includes the field score of each of the n candidate fields.
  • the detection device receives field decision information, which indicates p candidate fields.
  • the detection device determines the p candidate fields from the n candidate fields based on the field decision information. That is, the detection equipment feeds back the comprehensive score of each field to be selected to the relevant personnel, who then make the final decision based on the comprehensive score.
  • the field score information also includes the mutual information (and/or mutual information score) corresponding to each candidate field obtained in the process of determining the field score and the reconstruction loss. That is, the detection equipment feeds back the mutual information, reconstruction loss and comprehensive score corresponding to each candidate field to the relevant personnel, who then make the final decision based on the mutual information, reconstruction loss and comprehensive score.
  • The detection device determines p candidate fields from the n candidate fields based on the preset number of fields and the field score of each of the n candidate fields.
  • the preset number of fields represents the total number of attribute fields required for anomaly detection on the test set, and p+m is equal to the preset number of fields.
  • the preset number of fields represents the total number of candidate fields that need to be selected from the n candidate fields, and p is equal to the preset number of fields.
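  • Automatic decision-making under the two interpretations of the preset number of fields can be sketched as follows. The function and parameter names are hypothetical, and picking the highest-scoring fields is an assumption about how the field scores are used.

```python
def select_target_fields(selected, field_scores, preset_number, counts_selected=True):
    """Return the target attribute fields.

    counts_selected=True:  preset_number is the total number of fields (p + m).
    counts_selected=False: preset_number is the number of candidate fields (p).
    """
    p = preset_number - len(selected) if counts_selected else preset_number
    p = max(0, min(p, len(field_scores)))
    ranked = sorted(field_scores, key=field_scores.get, reverse=True)
    return list(selected) + ranked[:p]
```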
  • the configuration parameter also indicates the category of each attribute field in the candidate attribute field, and attribute fields of different categories have different types of statistics corresponding to them.
  • the category of each attribute field is a first-type attribute or a second-type attribute or a third-type attribute or a fourth-type attribute.
  • Table 3 is a relationship table between different categories of attribute fields and statistics provided by the embodiment of the present application.
  • the statistics corresponding to each type of attribute field include basic statistics, first type statistics and second type statistics.
  • the first type statistics corresponding to the attribute fields belonging to the first type of attributes include maximum value, mean value, numerical standard deviation, information entropy and the number of field value types.
  • the first type of statistics corresponding to attribute fields belonging to the second type of attributes include maximum value, mean, numerical standard deviation, proportional standard deviation, information entropy and the number of field value types.
  • the first type of statistics corresponding to attribute fields belonging to the third type of attributes include maximum value, mean value, numerical standard deviation, proportional standard deviation, information entropy and the number of field value types.
  • the first type of statistics corresponding to attribute fields belonging to the fourth type of attributes include mean and numerical standard deviation.
  • the statistics corresponding to the attribute fields belonging to the second type of attributes and the third type of attributes also include user access profiles and/or access profile similarities.
  • In Table 3, '√' indicates the statistics that need to be calculated, '×' indicates the statistics that do not need to be calculated, and '*' indicates the statistics that need to be calculated when the anomaly detection task is time-series related.
  • the basic statistics are determined by counting the number of logs included in the corresponding sample.
  • Table 4 is a table of meanings of basic statistics corresponding to different categories of attribute fields provided by the embodiment of the present application. Taking the time granularity of a single sample as 1 hour and the time granularity of anomaly detection as 24 hours as an example, the meanings of the basic statistics corresponding to different categories of attribute fields are shown in Table 4.
  • the basic statistics of the first type of attributes include two types, and these two basic statistics correspond to the two situations of no differentiated services and differentiated services respectively.
  • the basic statistics for the remaining three types of attributes include one.
  • the basic statistics corresponding to the first type of attributes include the first basic statistic and the second basic statistic.
  • the first basic statistic includes the number of logs of each application under each type of cloud service that exist in each single sample included in the corresponding sample.
  • the second basic statistics include the number of logs of each cloud service that exist in each single sample included in the corresponding sample.
  • the basic statistics corresponding to the second type of attributes, the third type of attributes and the fourth type of attributes include the third basic statistics, and the third basic statistics include the attribute fields of the corresponding categories that exist in each single sample included in the corresponding sample.
  • ECS elastic compute service
  • OBS object storage service
  • For the first type of attributes with services differentiated, the detection device counts the number of logs generated by user 1 operating each application in ECS in each hour of the day and the number of logs generated by operating each application in OBS in each hour, obtaining 24*(2+3)-dimensional statistics in total. It also counts the number of logs generated by user 1 operating each application in ECS during this day and the number of logs generated by operating each application in OBS during this day, obtaining 1*(2+3)-dimensional statistics in total. The detection device thus obtains (24+1)*5-dimensional basic statistics in total.
  • For the first type of attributes without differentiating services, the detection device counts the number of logs generated by user 1 operating ECS in each hour of the day and the number of logs generated by operating OBS in each hour, obtaining 24*2-dimensional statistics in total. It also counts the number of logs generated by user 1 operating ECS during this day and the number of logs generated by operating OBS during this day, obtaining 1*2-dimensional statistics in total. Then, for the first type of attributes without differentiating services, the detection device obtains (24+1)*2-dimensional basic statistics in total.
  • For the status field, the detection device counts, for each value of the status field in the log data generated by user 1 in each hour of the day, the number of logs under each cloud service, obtaining 24*3-dimensional statistics in total. It also counts, for each value of the status field in the log data generated within the day, the number of logs under each cloud service, obtaining 1*3-dimensional statistics in total. Then, for each field belonging to the second type of attribute, the detection device obtains (24+1)*3-dimensional basic statistics in total, where '3' represents the number of field value types of the status field.
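  • The (24+1)*2-dimensional basic statistics for the case without differentiating services can be sketched as follows. The log representation as (hour, service) pairs and the function name are assumptions made for illustration.

```python
from collections import Counter


def basic_statistics(logs, services, hours=24):
    """Count logs per hour and per day for each cloud service.

    logs: iterable of (hour, service) pairs for one user on one day.
    Returns hours * len(services) hourly counts followed by
    len(services) daily counts.
    """
    logs = list(logs)
    hourly = Counter(logs)
    daily = Counter(service for _, service in logs)
    features = [hourly[(h, s)] for h in range(hours) for s in services]
    features += [daily[s] for s in services]
    return features


features = basic_statistics(
    [(0, "ECS"), (0, "ECS"), (0, "OBS"), (23, "OBS")], ["ECS", "OBS"]
)
```

  • The same pattern extends to the other cases by changing the key that is counted, for example (hour, application) pairs for differentiated services or (hour, status value) pairs for the status field.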
  • The detection device determines the first type of statistics based on the basic statistics.
  • Table 5 is a table of calculation methods of the first type of statistics corresponding to different types of attribute fields provided by the embodiment of the present application. Taking the time granularity of a single sample as 1 hour and the time granularity of anomaly detection as 24 hours as an example, based on Table 4, the calculation method of the first type of statistics corresponding to different categories of attribute fields is shown in Table 5.
  • Table 6 is a table of meanings of the first type of statistics corresponding to different types of attribute fields provided by the embodiment of the present application.
  • The maximum value in the first type of statistics represents the upper bound of each statistic included in the basic statistics of the corresponding sample, the mean represents the average level of each statistic included in the basic statistics of the corresponding sample, the numerical standard deviation represents the degree of dispersion of each statistic included in the basic statistics of the corresponding sample, the proportional standard deviation represents the degree of imbalance of each statistic included in the basic statistics of the corresponding sample, the information entropy represents the degree of disorder of each statistic included in the basic statistics of the corresponding sample, and the number of field value types represents the number of possible values of each attribute field in the corresponding sample.
  • The detection device counts the number of logs generated by user 1 operating these five applications in each hour of the day to obtain the 5 log counts included in the basic statistics for each hour, and calculates the maximum value, mean, numerical standard deviation and information entropy of these 5 log counts for each hour to obtain part of the first type of statistics for each hour.
  • The detection device counts the number of logs generated by user 1 operating these five applications during the day to obtain the 5 log counts included in the basic statistics for this day, and calculates the maximum value, mean, numerical standard deviation and information entropy of these 5 log counts to obtain part of the first type of statistics for this day.
  • The detection device counts the number of logs generated by user 1 operating ECS and the number of logs generated by operating OBS in each hour of the day to obtain the other 2 log counts included in the basic statistics for each hour, and calculates the maximum value, mean, numerical standard deviation and information entropy of these 2 log counts for each hour to obtain another part of the first type of statistics for each hour.
  • The detection device counts the number of logs generated by user 1 operating ECS and the number of logs generated by operating OBS during the day to obtain the other 2 log counts included in the basic statistics for this day, and calculates the maximum value, mean, numerical standard deviation and information entropy of these 2 log counts to obtain another part of the first type of statistics for this day.
  • The detection device counts, for each value of the status field in the log data generated by user 1 in each hour of the day, the number of logs under each cloud service to obtain the 3 log counts in the basic statistics for each hour, and calculates the maximum value, mean, numerical standard deviation, proportional standard deviation and information entropy of these 3 log counts for each hour to obtain part of the first type of statistics for each hour.
  • The detection device counts, for each value of the status field in the log data generated during the day, the number of logs under each cloud service to obtain the 3 log counts in the basic statistics for this day, and calculates the maximum value, mean, numerical standard deviation, proportional standard deviation and information entropy of these 3 log counts to obtain part of the first type of statistics for this day.
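  • Computing the first type of statistics from a group of log counts can be sketched as follows. The exact calculation methods of Table 5 are not reproduced here, so the definitions below are assumptions: population standard deviation for the numerical standard deviation, standard deviation of the shares of the total for the proportional standard deviation, Shannon entropy in nats, and the number of non-zero counts for the number of field value types. Which of these apply to a given attribute category is governed by Table 3.

```python
import math
import statistics


def first_type_statistics(counts):
    """Summarize one group of log counts (for example, one hour or one day)."""
    total = sum(counts)
    proportions = [c / total for c in counts] if total else [0.0] * len(counts)
    entropy = -sum(p * math.log(p) for p in proportions if p > 0)
    return {
        "max": max(counts),
        "mean": statistics.fmean(counts),
        "numeric_std": statistics.pstdev(counts),
        "proportion_std": statistics.pstdev(proportions),
        "entropy": entropy,
        "value_types": sum(1 for c in counts if c > 0),
    }
```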
  • the detection device determines the second type of statistics based on the first type of statistics, that is, the second type of statistics is determined based on the first type of statistics, and the second type of statistics represents the first type of the corresponding sample.
  • the second type of statistics reflects the relative standard distance of a single sample from the sample population mean, that is, it reflects the difference between the sample in the first type of statistics and the sample population in the historical observation period.
  • the detection device calculates the second type of statistic as a z-score according to formula (5).
  • $z_i = \dfrac{x_i - \mu}{\sigma}$ (5)
  • In formula (5), $x_i$ represents the i-th statistic in the first type of statistics for each day or each hour, $\mu$ and $\sigma$ respectively represent the mean and numerical standard deviation of $x_i$, and $z_i$ represents the i-th statistic in the second type of statistics for each day or each hour, that is, the second type of statistic corresponding to the i-th statistic in the first type of statistics.
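A z-score of the kind this passage describes can be sketched in a few lines. The function name `z_score`, the use of the population standard deviation, and the choice to return 0.0 when the history has no spread are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_score(history, value):
    """Relative standard distance of `value` from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)  # population std (assumption)
    if sigma == 0:
        return 0.0           # no spread in the history: treat as no deviation
    return (value - mu) / sigma


# One first-type statistic observed over a historical period.
history = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(history, 9)
```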
  • the configuration parameters also indicate whether the anomaly detection task is time-series related.
  • the statistics corresponding to the attribute fields belonging to the second type of attributes and the third type of attributes also include user access profiles and/or access profile similarities.
  • the user access profile represents one or more of changes in the number of logs of multiple single samples included in the corresponding sample, changes in information entropy corresponding to the second type of attributes, and changes in information entropy corresponding to the third type of attributes.
  • Access profile similarity characterizes the similarity between the user access profile of the corresponding sample and the user access profile of the reference sample.
  • the reference sample can be the first sample in the sample set, for example, the first training sample in the training subset.
  • the change in the number of logs includes the change in the number of logs of each cloud service over time, and/or the change in the total number of logs in all cloud services over time.
  • the change in information entropy corresponding to the second type of attributes includes the change over time of the information entropy, under each cloud service, for each value of each attribute field belonging to the second type of attributes, and/or the change over time of the information entropy of each second-type attribute under all cloud services.
  • the change in information entropy corresponding to the third type of attributes includes the change over time of the information entropy of each third-type attribute in each cloud service, and/or the change over time of the information entropy of each third-type attribute in all cloud services.
  • the detection device counts how the number of logs generated by user 1 operating a certain service within 24 hours of a day changes over time, and obtains a 24-dimensional user access profile.
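A 24-dimensional profile of this kind can be sketched by bucketing log timestamps by hour. The function name `access_profile` and the representation of logs as `datetime` objects are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime


def access_profile(timestamps):
    """Return a 24-element list: number of logs in each hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical log timestamps for one user and one service on one day.
logs = [
    datetime(2023, 1, 1, 0, 5),
    datetime(2023, 1, 1, 0, 30),
    datetime(2023, 1, 1, 13, 0),
]
profile = access_profile(logs)
```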
  • the detection device calculates access profile similarity as a cosine similarity according to formula (6).
  • $\mathrm{sim}(A, B) = \dfrac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$ (6)
  • In formula (6), vector A and vector B represent two user access profiles respectively, and i represents the element subscript in the vector.
  • the detection device calculates, according to formula (6), the access profile similarity between the user access profile of each training sample in the training subset and the user access profile of the reference sample, obtaining as many access profile similarities as there are training samples. In the same way, the detection device calculates the access profile similarity between the user access profile of each verification sample in the verification subset and that of the reference sample, and between the user access profile of each sample to be tested in the test set and that of the reference sample.
  • the detection device determines the mean and numerical standard deviation of the access profile similarities obtained from the training subset. Using this mean and standard deviation, it normalizes the access profile similarities obtained from the training subset, the verification subset and the test set, obtaining the normalized access profile similarities of the training subset, the verification subset and the test set respectively. The detection device subsequently uses the normalized access profile similarity as a statistic of the corresponding sample.
  • the detection device normalizes access profile similarity according to formula (7).
  • $\widehat{\mathrm{sim}}(g) = \dfrac{\mathrm{sim}(g) - \mu_{sim}}{\sigma_{sim}}$ (7)
  • In formula (7), sim(g) represents the access profile similarity to be normalized for a sample (a sample to be tested, a verification sample or a training sample), $\widehat{\mathrm{sim}}(g)$ represents the normalized access profile similarity of the sample, and $\mu_{sim}$ and $\sigma_{sim}$ respectively represent the mean and numerical standard deviation of the access profile similarities obtained based on the training subset.
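The similarity and normalization steps above can be sketched as follows. The names `cosine_similarity` and `normalize` are hypothetical, the two-element profiles are toy data (real profiles would be 24-dimensional), and the handling of zero vectors and zero spread is an assumption.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two access profiles of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def normalize(similarities, mu, sigma):
    """Standardize similarities with statistics taken from the training subset."""
    if sigma == 0:
        return [0.0 for _ in similarities]
    return [(s - mu) / sigma for s in similarities]


# Toy profiles; the reference sample is the first training sample.
train_profiles = [[1, 0], [1, 1], [0, 1]]
reference = train_profiles[0]
train_sims = [cosine_similarity(p, reference) for p in train_profiles]

# Mean and std come from the training subset only and are reused elsewhere.
mu, sigma = mean(train_sims), pstdev(train_sims)
train_norm = normalize(train_sims, mu, sigma)
test_norm = normalize([cosine_similarity([2, 0], reference)], mu, sigma)
```

Reusing the training mean and standard deviation for the verification subset and the test set keeps all three on the same scale, which matches the text's description of a single normalization basis.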
  • Step 603 Based on the sample set and the target attribute field, tune the first hyperparameter of the first detection model.
  • After determining the target attribute field, the detection device determines the statistical characteristics of the sample set based on the target attribute field.
  • the statistical characteristics of the sample set include statistics of the data of the target attribute field in the sample set. That is, the detection device determines the statistics of the data in the target attribute field in the sample set as the statistical characteristics of the sample set.
  • the detection device optimizes the first hyperparameter of the first detection model based on the statistical characteristics of the sample set to obtain a parameter-tuned first detection model.
  • the statistical characteristics of the sample set include the statistical characteristics of the training subset and the verification subset.
  • the detection device determines the statistics of the data of the target attribute field in the training subset as the statistical characteristics of the training subset, and determines the statistics of the data of the target attribute field in the verification subset as the statistical characteristics of the verification subset.
  • the detection device optimizes the first hyperparameter of the first detection model based on the statistical characteristics of the training subset and the statistical characteristics of the verification subset.
  • the first hyperparameters include learning rate, number of training rounds and hidden layer dimensions. There are multiple possible values for the learning rate, number of training rounds, and hidden layer dimensions to be tuned.
  • the detection device sets the first hyperparameter of the first detection model to each combination of values of the learning rate, number of training rounds and hidden layer dimensions, uses the statistical characteristics of the training subset to train the first detection model under each combination, and uses the statistical characteristics of the verification subset to verify the anomaly detection effect of each trained first detection model. After obtaining the anomaly detection effect corresponding to each combination, the detection device determines the combination with the best anomaly detection effect as the tuned first hyperparameter, which is used as the first hyperparameter of the parameter-tuned first detection model.
  • the detection device traverses all possible first hyperparameter combinations by grid search to find the optimal first hyperparameter combination in the first hyperparameter search space.
  • the search space of the first hyperparameter is the Cartesian product of the search spaces of the three hyperparameters: learning rate, number of training epochs, and hidden layer dimension.
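An exhaustive search over a Cartesian product of this kind can be sketched with `itertools.product`. The function `grid_search` and the `evaluate` callback are hypothetical: `evaluate` stands in for "train on the training subset, then measure the loss on the verification subset", and the toy scoring function below exists only to make the example runnable.

```python
from itertools import product


def grid_search(learning_rates, epoch_counts, hidden_dims, evaluate):
    """Try every combination and keep the one with the lowest validation loss."""
    best_combo, best_loss = None, float("inf")
    for combo in product(learning_rates, epoch_counts, hidden_dims):
        loss = evaluate(*combo)
        if loss < best_loss:
            best_combo, best_loss = combo, loss
    return best_combo, best_loss


def toy_evaluate(lr, epochs, dims):
    """Stand-in for train-then-validate; lowest at lr=0.01, 20 epochs, bottleneck 16."""
    return abs(lr - 0.01) + abs(epochs - 20) / 100 + abs(dims[1] - 16) / 100


best, loss = grid_search(
    learning_rates=[0.1, 0.01, 0.001],
    epoch_counts=[10, 20],
    hidden_dims=[(64, 8), (64, 16)],  # (first hidden layer, bottleneck layer)
    evaluate=toy_evaluate,
)
```

The number of evaluations is the product of the three list lengths (here 3 × 2 × 2 = 12), which is why the text stresses keeping the hidden layer search space small.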
  • the first detection model includes an input layer, a first hidden layer and a second hidden layer (which may also be called a bottleneck layer).
  • the hidden layer dimensions to be tuned include the dimensions of the first hidden layer and the second hidden layer.
  • the dimensions of the input layer are determined based on the number of fields included in the target attribute field, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimensions of the input layer. That is to say, the dimensions of the hidden layer are not set arbitrarily, and the search space of the hidden layer dimension is small.
  • the dimension of the input layer is also related to the dimension of the statistics of the fields included in the target attribute field. The higher the dimension of the statistics of the fields included in the target attribute field, that is, the greater the number of statistics, the higher the dimension of the input layer will be.
  • N and N1 satisfy $N = 2^{\mathrm{ceil}(\log_2 N_1)}$, where ceil() means rounding up; that is, N is an integer power of 2 that is not less than N1.
  • N2 = N/4
  • N3 is one of the integer powers of 2 located in the interval [8, N/8].
  • N2 = N/2
  • N3 is one of the integer powers of 2 located in the interval [4, N/4].
  • the values in the second interval are smaller than the values in the first interval.
  • the search space of N3 is {8, 16, 32, 64, 128, 256, 512}
  • N is an integer power of 2 in the second interval [64, 256], that is, N1 ∈ (32, 256]
  • N2 = N/2
  • the search space of N3 is {4, 8, 16, 32, 64}.
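One possible reading of the fragments above is sketched below. Several points are assumptions rather than statements from the text: that N1 is the input layer dimension, N2 the first hidden layer dimension and N3 the second hidden (bottleneck) layer dimension; that N is the smallest power of 2 not less than N1; and that any N above 256 falls in the first interval, whose bounds do not survive in this passage. The function name is hypothetical.

```python
def hidden_dim_search_space(n1):
    """Return (N, N2, candidate N3 values) for an input dimension n1."""
    if n1 < 1:
        raise ValueError("n1 must be a positive integer")
    # Smallest power of 2 that is not less than n1.
    n = 1 << (n1 - 1).bit_length()
    if 64 <= n <= 256:
        # Second interval: N2 = N/2, N3 in [4, N/4].
        n2, low, high = n // 2, 4, n // 4
    elif n > 256:
        # First interval (bounds assumed): N2 = N/4, N3 in [8, N/8].
        n2, low, high = n // 4, 8, n // 8
    else:
        raise ValueError("dimensions below the second interval are not covered")
    n3_values = []
    value = low
    while value <= high:
        n3_values.append(value)
        value *= 2
    return n, n2, n3_values


small = hidden_dim_search_space(200)
large = hidden_dim_search_space(4096)
```

Under these assumptions the two calls reproduce the two N3 search spaces listed in the text, which suggests the reading is at least internally consistent.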
  • the learning rate is equal to Where l is an integer in the interval [0, L+1], and L does not exceed the search point threshold.
  • the search point threshold is 12, 14, 16, etc.
  • the verification subset is used to test the degree of overfitting of the first detection model trained by the training subset.
  • all validation samples in the validation subset are regarded as positive samples.
  • the detection device may consider that the trained first detection model is close to a convergence state without overfitting. The preset ratio is, for example, 95%, 98% or 99%.
  • the value range of the number of training rounds is [1,100], that is, the search space of the number of training rounds is ⁇ 1,2,3,...,100 ⁇ .
  • the detection device determines the round with the best performance on the verification subset among the 100 rounds of training as the tuned number of training rounds, that is, the optimal number of rounds.
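Picking the best round can be sketched as an argmin over per-round validation results. Measuring "best performance" as the lowest validation loss is an assumption, and both the function name `best_round` and the loss values are made up for illustration.

```python
def best_round(validation_losses):
    """Return the 1-based training round with the lowest validation loss."""
    best_index = min(range(len(validation_losses)),
                     key=validation_losses.__getitem__)
    return best_index + 1


# Hypothetical validation loss recorded after each of five training rounds.
losses_per_round = [0.9, 0.5, 0.4, 0.45, 0.6]
optimal_rounds = best_round(losses_per_round)
```

Because the loss is recorded after every round of a single run, the search over the number of rounds costs one training run rather than one run per candidate value.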
  • the first detection model also includes second hyperparameters, and the second hyperparameters include optimizer parameters, activation function, loss function, parameter initialization method, batch size, and number of hidden layers.
  • the second hyperparameter is a hyperparameter that does not require tuning.
  • the second hyperparameter is a preset parameter.
  • the activation function of each hidden layer can be a nonlinear function such as ReLU or sigmoid.
  • the activation function of the output layer can be a linear function.
  • embodiments of this application provide an anomaly detection model based on an autoencoder. That is, the first detection model and the second detection model herein are both implemented based on the autoencoder.
  • the first detection model includes an encoder and a decoder connected in series.
  • the encoder includes an input layer, a first hidden layer and a second hidden layer connected in series.
  • the decoder includes a third hidden layer and an output layer connected in series.
  • the input layer of the encoder and the output layer of the decoder have the same dimension, and the first hidden layer and the third hidden layer have the same dimension.
  • the second hidden layer can be considered as the output layer of the encoder or the input layer of the decoder. That is, the second hidden layer can be considered as the network layer shared by the encoder and the decoder.
  • the encoder and the decoder are symmetrical to each other. The input layer of the encoder receives the statistical features of the samples, and the output layer of the decoder outputs the reconstructed features of the samples.
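The symmetric layout described above can be made concrete by listing the layer sizes and the shapes of the weight matrices between them. This is only a structural sketch under the stated layout; the function name `autoencoder_layout` and the example dimensions are hypothetical, and no training logic is shown.

```python
def autoencoder_layout(input_dim, first_hidden_dim, bottleneck_dim):
    """Return layer sizes and (in, out) weight shapes of the symmetric model."""
    # Encoder: input layer -> first hidden layer -> second hidden (bottleneck) layer.
    # Decoder: third hidden layer -> output layer, mirroring the encoder.
    layer_dims = [
        input_dim,         # input layer
        first_hidden_dim,  # first hidden layer
        bottleneck_dim,    # second hidden layer, shared by encoder and decoder
        first_hidden_dim,  # third hidden layer, same size as the first
        input_dim,         # output layer, same size as the input
    ]
    weight_shapes = list(zip(layer_dims[:-1], layer_dims[1:]))
    return layer_dims, weight_shapes


dims, shapes = autoencoder_layout(input_dim=200, first_hidden_dim=128,
                                  bottleneck_dim=16)
```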
  • the first detection model also includes a discriminator, and the parameters of the discriminator include an error threshold.
  • the detection device inputs the statistical features of the sample set into the input layer of the encoder. After the statistical features of the sample set are processed by the input layer, the first hidden layer and the second hidden layer of the encoder, the coding features of the sample set are obtained. After the coding features of the sample set are processed by the third hidden layer and output layer of the decoder, the reconstructed features of the sample set are obtained.
  • the detection device inputs the statistical features and reconstruction features of the sample set into the discriminator to determine the reconstruction loss of the sample set according to the error threshold.
  • the detection device performs training and hyperparameter tuning on the first detection model based on the reconstruction loss of the sample set.
  • the error threshold is determined based on the mean value of multiple reconstruction losses.
  • the multiple reconstruction losses include the error between the statistical characteristics of each training sample in the training subset and the reconstruction characteristics.
  • the detection device determines the mean value of the plurality of reconstruction losses as the error threshold.
  • the error threshold is determined based on the mean and standard deviation of the multiple reconstruction losses.
  • the detection device determines the error threshold (threshold) according to formula (8).
  • $\mathrm{threshold} = \mathrm{mean}(loss) + \alpha \cdot \mathrm{std}(loss)$ (8)
  • In formula (8), mean(loss) and std(loss) respectively represent the mean and standard deviation of the multiple reconstruction losses, and $\alpha$ is a preset parameter that can be 0.4, 0.5, 0.6 or another value.
  • the reconstruction loss between the statistical characteristics of a sample and its reconstruction characteristics can be the root mean square error (RMSE), the mean square error (MSE) or another form of error, which is not limited in the embodiments of this application. That is, the loss function can be an RMSE function, an MSE function or another function.
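The reconstruction loss and the error threshold can be sketched together as follows. The function names are hypothetical, the multiplier is called `alpha` here only for illustration, the population standard deviation is an assumption, and treating a loss above the threshold as anomalous is the usual convention rather than something this passage states explicitly.

```python
import math
from statistics import mean, pstdev


def rmse(features, reconstructed):
    """Root mean square error between a sample's features and its reconstruction."""
    squared = [(x - y) ** 2 for x, y in zip(features, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


def error_threshold(losses, alpha=0.5):
    """Mean of the reconstruction losses plus alpha times their std."""
    return mean(losses) + alpha * pstdev(losses)


def is_anomalous(loss, threshold):
    """Assumption: a reconstruction loss above the threshold marks an anomaly."""
    return loss > threshold


# Hypothetical reconstruction losses of the training samples.
training_losses = [2, 4, 4, 4, 5, 5, 7, 9]
threshold = error_threshold(training_losses, alpha=0.5)
flag = is_anomalous(rmse([0, 0], [3, 4]), threshold)
```

A larger `alpha` raises the threshold and flags fewer samples, so the preset value trades missed detections against false detections.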
  • the first hyperparameters that need to be tuned in the embodiment of the present application include the learning rate, the number of training rounds and the hidden layer dimension, which have a relatively large impact on the model performance.
  • the second hyperparameters that have a relatively small impact on the model performance can be pre-set to speed up the construction of the model and parameter tuning, thereby improving the execution efficiency of the anomaly detection task while ensuring the performance of the first detection model after parameter optimization.
  • Step 604 Based on the target attribute field, perform anomaly detection on the test set through the parameter-tuned first detection model to obtain anomaly detection results of the test set.
  • After obtaining the parameter-tuned first detection model, the detection device determines the statistical characteristics of the test set based on the target attribute field, and the statistical characteristics of the test set include statistics of the data of the target attribute field in the test set. The detection device inputs the statistical characteristics of the test set into the parameter-tuned first detection model to obtain anomaly detection results of the test set.
  • for example, the time granularity of a single sample is 1 hour and the time granularity of anomaly detection is 24 hours. If the detection object is user a, each sample to be tested in the test set includes the log data of user a within 24 hours, and the anomaly detection result of the sample to be tested indicates whether user a has abnormal operations and/or abnormal access behaviors within these 24 hours.
  • if the detection object is cloud service a, each sample to be tested in the test set includes the log data of cloud service a within 24 hours, and the anomaly detection result of the sample to be tested indicates whether cloud service a has abnormalities within these 24 hours.
  • the first detection model includes an encoder, a decoder and a discriminator, and the parameters of the discriminator include an error threshold.
  • the detection device inputs the statistical features of the test set into the encoder to obtain the coding features of the test set; it inputs the coding features of the test set into the decoder to obtain the reconstructed features of the test set.
  • the detection device inputs the statistical features and reconstructed features of the test set into the discriminator to determine the anomaly detection results of the test set according to the error threshold.
  • the error threshold is determined based on the mean value of multiple reconstruction losses.
  • the multiple reconstruction losses include the error between the statistical characteristics and reconstruction characteristics of each sample to be tested in the test set, or additionally include the error between the statistical features and reconstructed features of each training sample in the training subset.
  • the statistical characteristics of the sample to be tested include the statistics of the data of the target attribute field in the sample to be tested, and the reconstruction characteristics of the sample to be tested refer to the reconstruction statistics of the data of the target attribute field in the sample to be tested.
  • FIG. 8 is a flow chart of another anomaly detection method provided by an embodiment of the present application.
  • the detection device determines the target attribute field from the candidate attribute fields through automatic feature selection based on the original log data; that is, it automatically performs feature engineering to screen the candidate fields, and the candidate fields that pass screening, together with the selected fields among the candidate attribute fields, are used as the target attribute field.
  • the data after feature selection is obtained, and the data after feature selection includes the statistics of the data of the target attribute field in the sample.
  • the detection device automatically encodes and decodes the data after feature selection to obtain the reconstruction statistics of the data of the target attribute field in the sample, that is, outputs the reconstruction result.
  • the detection device performs error calculation on the data after feature selection and the reconstruction result (calculating RMSE as shown in FIG. 8) to obtain the reconstruction loss of the sample. Then, the detection device discriminates the reconstruction loss of the sample according to the error threshold and outputs the discrimination result of the sample, that is, the anomaly detection result.
  • FIG. 9 is a flow chart of yet another anomaly detection method provided by an embodiment of the present application.
  • cloud field personnel configure relevant parameters of the anomaly detection task on the client and submit the configuration file (including configuration parameters) to the anomaly detection system.
  • The anomaly detection system provides an anomaly detection self-service tool. Cloud field personnel or devices normalize the selected raw log data according to the log specification format.
  • the self-service tool automatically calculates statistics on normalized log data based on configuration files, and automatically filters fields based on statistics and configured candidate attribute fields.
  • the self-service tool automatically performs anomaly detection on the samples to be tested after filtering fields to obtain anomaly detection results.
  • abnormality detection of log data in the cloud platform can be realized.
  • this solution does not require relevant personnel in the cloud service field to manually create security rules, nor does it require manual in-depth analysis and summary of various attack modes. This avoids the loopholes in manually created security rules, reduces missed detections and false detections, and also improves the efficiency of anomaly detection.
  • in addition, even when relevant personnel have little or no professional knowledge of machine learning and deep learning, such as knowledge of model design and tuning, this solution enables them to build an anomaly detection model quickly and in a self-service manner to achieve anomaly detection for specific tasks.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless (such as infrared, radio or microwave) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks or tapes), optical media (such as digital versatile discs (DVD)), semiconductor media (such as solid state disks (SSD)), and so on.
  • the computer-readable storage media mentioned in the embodiments of this application may be non-volatile storage media, in other words, may be non-transitory storage media.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data need to comply with the relevant laws, regulations and standards of relevant countries and regions.
  • the log data involved in the embodiments of this application are all obtained with full authorization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present application relates to the technical field of cloud services. Disclosed are an anomaly detection method and a related apparatus. The solution does not require related personnel in the field of cloud services to manually create security rules, and does not require manual deep analysis and summarization of various attack modes, thereby avoiding vulnerabilities in manual creation of security rules, reducing missing and false detections, and improving the anomaly detection efficiency. In addition, even if the related personnel have less or even no professional knowledge about machine learning and deep learning, for example, knowledge related to model design and tuning, the solution can realize quick self-construction of an anomaly detection model to realize anomaly detection oriented to specific tasks.

Description

Anomaly detection method and related apparatus
This application claims priority to Chinese patent application No. 202211145851.6, entitled "Anomaly detection method and related apparatus" and filed on September 20, 2022, which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of cloud service technology, and in particular to an anomaly detection method and a related apparatus.
Background
With the development of cloud service technology, more and more users operate and access cloud platforms through application programming interfaces (APIs) to use the cloud services that the platforms provide. As users use cloud services, a large amount of valuable data is generated on the cloud platform, such as users' operation data, personal data and business information. To keep users' use of cloud services secure and to prevent attackers from stealing or destroying resources in the cloud platform through illegal operations, the operations and accesses in the cloud platform need to be comprehensively monitored so as to discover potential threats and anomalies in the cloud platform.
In the related art, a web application firewall (WAF) is deployed between the servers of the cloud platform and the Internet to which clients connect, and the WAF performs anomaly detection and protection on API requests coming from the Internet. For example, the WAF filters and guards against API requests according to manually created security rules, which contain some common attack patterns. When the WAF detects that certain API requests match an attack pattern contained in the security rules, it can reject those requests, thereby defending against attacks on the cloud platform launched by malicious actors through APIs.
However, the WAF-based anomaly detection solution requires security rules to be created manually, which is time-consuming and labor-intensive, and achieving a good anomaly detection effect also requires in-depth manual analysis and summary of various attack patterns. In addition, manually created security rules inevitably contain loopholes, so the above solution suffers from missed detections and false detections.
Summary of the invention
This application provides an anomaly detection method and a related apparatus, which can lower the threshold for relevant personnel to perform anomaly detection, reduce missed detections and false detections, and improve anomaly detection efficiency. The technical solutions are as follows:
According to a first aspect, an anomaly detection method is provided. The method includes:
receiving configuration parameters of an anomaly detection task, where the configuration parameters indicate a sample set, a test set and candidate attribute fields, the sample set includes log data in the cloud platform used for parameter tuning, the test set includes log data in the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform; determining a target attribute field from the candidate attribute fields based on the sample set and the candidate attribute fields, where the target attribute field is an attribute field used for performing the anomaly detection task; tuning a first hyperparameter of a first detection model based on the sample set and the target attribute field; and performing, based on the target attribute field, anomaly detection on the test set through the parameter-tuned first detection model to obtain an anomaly detection result of the test set.
This solution does not require relevant personnel in the cloud service field to manually create security rules, nor does it require in-depth manual analysis and summary of various attack patterns. It thereby avoids the loopholes of manually created security rules, reduces missed detections and false detections, and improves anomaly detection efficiency. In addition, even when relevant personnel have as little professional knowledge of machine learning and deep learning as possible, or none at all (for example, knowledge of model design and tuning), this solution enables them to build an anomaly detection model quickly and in a self-service manner to perform task-specific anomaly detection.
Optionally, the candidate attribute fields include m selected fields and n candidate fields, where m is an integer not less than 0 and n is an integer greater than 0. Determining the target attribute field from the candidate attribute fields based on the sample set and the candidate attribute fields includes: determining, based on the sample set, the m selected fields and the n candidate fields, a field score for each of the n candidate fields, where a field score represents the degree to which the anomaly detection effect is improved after the corresponding candidate field is added to the m selected fields; determining p candidate fields from the n candidate fields based on the field scores, where p is a positive integer not greater than n; and determining the m selected fields and the p candidate fields as the target attribute field. That is, the sample set is used to screen fields according to how much each candidate field improves the anomaly detection effect, so that more valuable attribute fields are selected.
Optionally, the sample set includes a training subset and a validation subset. Determining the field score for each of the n to-be-selected fields based on the sample set, the m selected fields, and the n to-be-selected fields includes: forming a selected field set from the m selected fields and a to-be-selected field set from the n to-be-selected fields; determining, based on the training subset, the selected field set, and the to-be-selected field set, mutual information for each to-be-selected field in the to-be-selected field set, where the mutual information represents the correlation between the corresponding to-be-selected field and all fields in the selected field set; selecting, from the to-be-selected field set, the k to-be-selected fields with the smallest mutual information, where k is a positive integer not greater than n; determining, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, a reconstruction loss for each of the k to-be-selected fields, where the reconstruction loss represents how well anomaly detection performs on the validation subset when the corresponding to-be-selected field is used together with the selected field set; selecting one to-be-selected field from the k to-be-selected fields based on their mutual information and reconstruction losses, and determining the field score of the selected field; and moving the selected field from the to-be-selected field set into the selected field set and returning to the step of determining the mutual information for each to-be-selected field in the to-be-selected field set, until the to-be-selected field set is empty, at which point the field scores of all n to-be-selected fields have been obtained. In other words, fields are screened using both the mutual information between the to-be-selected fields and the selected fields and the reconstruction loss of each to-be-selected field, so that the more valuable attribute fields are retained.
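The iterative scoring procedure described above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: `mutual_info` and `recon_loss` are hypothetical callables supplied by the caller, and the rule that combines mutual information and reconstruction loss (a weighted sum with weight `alpha`, negated to give the score) is an assumption, since this passage does not specify it.

```python
from typing import Callable, Dict, List, Sequence


def score_fields(
    selected: Sequence[str],
    pending: Sequence[str],
    mutual_info: Callable[[str, List[str]], float],
    recon_loss: Callable[[List[str]], float],
    k: int,
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Greedily score every to-be-selected field until none remain."""
    chosen = list(selected)      # selected field set
    remaining = list(pending)    # to-be-selected field set
    scores: Dict[str, float] = {}
    while remaining:
        # Mutual information of each remaining field with the selected set.
        mi = {f: mutual_info(f, chosen) for f in remaining}
        # Keep the k fields with the smallest mutual information.
        shortlist = sorted(remaining, key=mi.get)[:k]
        # Reconstruction loss when each shortlisted field joins the set.
        loss = {f: recon_loss(chosen + [f]) for f in shortlist}
        # Assumed combination rule: smallest loss + alpha * mutual information.
        best = min(shortlist, key=lambda f: loss[f] + alpha * mi[f])
        scores[best] = -(loss[best] + alpha * mi[best])
        # Move the chosen field into the selected field set and repeat.
        remaining.remove(best)
        chosen.append(best)
    return scores


# Toy stand-ins for the two measurements, for illustration only.
toy_mi = {"a": 0.1, "b": 0.5, "c": 0.2}
toy_loss = {"a": 0.3, "b": 0.1, "c": 0.3}
scores = score_fields(
    [], ["a", "b", "c"],
    mutual_info=lambda f, chosen: toy_mi[f],
    recon_loss=lambda fields: toy_loss[fields[-1]],
    k=2,
)
print(scores)
```

Because the shortlist is capped at k fields, at most k models need to be trained per iteration, which is the apparent purpose of filtering by mutual information first.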
Optionally, determining the reconstruction loss for each of the k to-be-selected fields based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields includes: for a first to-be-selected field among the k to-be-selected fields, adding the first to-be-selected field to the selected field set to obtain a candidate field set, where the first to-be-selected field is any one of the k to-be-selected fields; determining, based on the training subset and the candidate field set, a second detection model corresponding to the first to-be-selected field; and determining, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by using the second detection model corresponding to the first to-be-selected field.
Optionally, determining the second detection model corresponding to the first to-be-selected field based on the training subset and the candidate field set includes: determining reference statistical features of the training subset based on the candidate field set, where the reference statistical features of the training subset include statistics of the data of all fields in the candidate field set within the training subset; and training an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first to-be-selected field.
Optionally, determining the reconstruction loss corresponding to the first to-be-selected field based on the validation subset and the candidate field set by using the second detection model corresponding to the first to-be-selected field includes: determining reference statistical features of the validation subset based on the candidate field set, where the reference statistical features of the validation subset include statistics of the data of all fields in the candidate field set within the validation subset; inputting the reference statistical features of the validation subset into the second detection model corresponding to the first to-be-selected field to obtain reference reconstructed features of the validation subset, where the reference reconstructed features of the validation subset include reconstructed statistics of the data of all fields in the candidate field set within the validation subset; and determining the reconstruction loss corresponding to the first to-be-selected field based on the reference statistical features and the reference reconstructed features of the validation subset.
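By way of example only, the reconstruction loss on the validation subset can be sketched as below. The sketch assumes the loss is the mean squared error between the reference statistical features and the reference reconstructed features; the passage above does not name a specific error measure. The `reconstruct` callable stands in for the trained second detection model and is hypothetical.

```python
def reconstruction_loss(features, reconstruct):
    """Mean squared error between feature rows and their reconstructions.

    features:    list of rows of reference statistical features
    reconstruct: callable mapping one row to its reconstructed row
                 (a stand-in for the second detection model)
    """
    total = 0.0
    count = 0
    for row in features:
        rebuilt = reconstruct(row)
        for original, restored in zip(row, rebuilt):
            total += (original - restored) ** 2
            count += 1
    return total / count


# Toy validation features and a deliberately poor "model" that halves values.
validation_features = [[1.0, 2.0], [3.0, 4.0]]
loss = reconstruction_loss(validation_features, lambda row: [v * 0.5 for v in row])
print(loss)
```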
Optionally, the configuration parameters further indicate the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different types of statistics. Computing statistics that are more informative for each category can improve anomaly detection performance.
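As a hedged illustration of category-dependent statistics, the sketch below assumes two categories, "numeric" and "categorical", and picks a few plausible statistics for each. The category names and the chosen statistics are assumptions and are not taken from the text.

```python
from collections import Counter
from statistics import mean


def field_statistics(values, category):
    """Compute statistics whose type depends on the field category."""
    if category == "numeric":
        # Assumed statistics for numeric fields.
        return {"mean": mean(values), "max": max(values)}
    if category == "categorical":
        # Assumed statistics for categorical fields.
        counts = Counter(values)
        top_count = counts.most_common(1)[0][1]
        return {"distinct": len(counts), "top_share": top_count / len(values)}
    raise ValueError(f"unknown field category: {category}")


numeric_stats = field_statistics([1, 2, 3], "numeric")
categorical_stats = field_statistics(["GET", "GET", "PUT", "GET"], "categorical")
print(numeric_stats, categorical_stats)
```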
Optionally, the first hyperparameters include a learning rate, a number of training epochs, and hidden layer dimensions. That is, the first hyperparameters to be tuned are limited to these three parameters, which have a relatively large impact on model performance. This improves the execution efficiency of the anomaly detection task while preserving the performance of the tuned first detection model.
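One possible way to tune these three hyperparameters is an exhaustive search over a small grid, sketched below. The use of grid search, the `evaluate` callable (assumed to return a validation loss for a parameter setting), and the values in the search space are all assumptions made for illustration.

```python
from itertools import product


def tune(search_space, evaluate):
    """Return the parameter setting with the lowest evaluation loss."""
    best_params, best_loss = None, float("inf")
    for lr, epochs, hidden in product(
        search_space["learning_rate"],
        search_space["epochs"],
        search_space["hidden_dims"],
    ):
        params = {"learning_rate": lr, "epochs": epochs, "hidden_dims": hidden}
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Hypothetical search space and a toy loss with a known optimum.
space = {
    "learning_rate": [0.1, 0.01],
    "epochs": [20, 50],
    "hidden_dims": [(12, 8), (8, 4)],
}


def toy_evaluate(params):
    penalty = 0.0 if params["hidden_dims"] == (8, 4) else 0.1
    return (abs(params["learning_rate"] - 0.01)
            + abs(params["epochs"] - 50) / 100
            + penalty)


best_params, best_loss = tune(space, toy_evaluate)
print(best_params, best_loss)
```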
Optionally, the first detection model includes an input layer, a first hidden layer, and a second hidden layer. The hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer; the dimension of the input layer is determined based on the number of fields in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer. In other words, the hidden layer dimensions are not set arbitrarily, so the search space for the hidden layer dimensions is small.
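The dependency between layer dimensions can be sketched as follows. The sketch assumes that each target field contributes a fixed number of statistics to the input layer and that the hidden layer dimensions are fixed fractions of the input dimension; the parameter `stats_per_field` and the ratios are hypothetical, since the text only states that the dimensions are derived from one another.

```python
def hidden_dim_candidates(num_target_fields, stats_per_field=1,
                          ratios=((0.75, 0.5), (0.5, 0.25))):
    """Derive the input dimension and a small set of hidden-layer sizes."""
    # Input dimension follows from the number of target attribute fields.
    input_dim = num_target_fields * stats_per_field
    candidates = []
    for first_ratio, second_ratio in ratios:
        first_hidden = max(1, round(input_dim * first_ratio))
        second_hidden = max(1, round(input_dim * second_ratio))
        candidates.append((first_hidden, second_hidden))
    return input_dim, candidates


input_dim, candidates = hidden_dim_candidates(num_target_fields=8, stats_per_field=2)
print(input_dim, candidates)
```

Tying the hidden sizes to the input size leaves only a handful of candidate pairs to search, which is consistent with the small search space noted above.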
Optionally, the first detection model includes an encoder, a decoder, and a discriminator, and the parameters of the discriminator include an error threshold. Performing anomaly detection on the test set based on the target attribute fields by using the parameter-tuned first detection model to obtain the anomaly detection result of the test set includes: determining statistical features of the test set based on the target attribute fields, where the statistical features of the test set include statistics of the data of the target attribute fields in the test set; inputting the statistical features of the test set into the encoder to obtain encoded features of the test set; inputting the encoded features of the test set into the decoder to obtain reconstructed features of the test set; and inputting the statistical features and the reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
The error threshold is determined based on the mean of multiple reconstruction losses. The multiple reconstruction losses include the error between the statistical features and the reconstructed features of each sample under test in the test set, and may further include the error between the statistical features and the reconstructed features of each training sample in the training subset. In other words, the error threshold is determined from the average error over the sample population, which can improve the accuracy of anomaly detection.
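The thresholding performed by the discriminator can be illustrated with the following sketch. It assumes that the threshold equals the plain mean of the per-sample reconstruction errors and that a sample is flagged when its error exceeds the threshold; the text only states that the threshold is determined based on the mean, so the exact rule is an assumption.

```python
from statistics import mean


def error_threshold(test_errors, train_errors=()):
    """Mean reconstruction error over the test samples and, optionally,
    the training samples."""
    return mean(list(test_errors) + list(train_errors))


def detect(test_errors, threshold):
    """Flag each test sample whose error exceeds the threshold."""
    return [error > threshold for error in test_errors]


# Toy per-sample reconstruction errors, for illustration only.
test_errors = [0.1, 0.2, 0.1, 2.0]
threshold = error_threshold(test_errors)
flags = detect(test_errors, threshold)
print(threshold, flags)
```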
In a second aspect, an anomaly detection apparatus is provided. The anomaly detection apparatus has the function of implementing the behavior of the anomaly detection method in the first aspect. The anomaly detection apparatus includes one or more modules configured to implement the anomaly detection method provided in the first aspect.
That is, an anomaly detection apparatus is provided, and the apparatus includes:
a receiving module, configured to receive configuration parameters of an anomaly detection task, where the configuration parameters indicate a sample set, a test set, and candidate attribute fields, the sample set includes log data in a cloud platform that is used for parameter tuning, the test set includes log data in the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform;
a determination module, configured to determine target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields, where the target attribute fields are the attribute fields used to perform the anomaly detection task;
a parameter tuning module, configured to tune first hyperparameters of a first detection model based on the sample set and the target attribute fields; and
an anomaly detection module, configured to perform anomaly detection on the test set based on the target attribute fields by using the parameter-tuned first detection model, to obtain an anomaly detection result of the test set.
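For orientation only, the four modules listed above can be wired together as in the following skeleton. The class name, the configuration keys, and the callable signatures are hypothetical and are not prescribed by the text; each callable is a placeholder for the corresponding module.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class AnomalyDetectionApparatus:
    # Placeholders for the determination, parameter tuning,
    # and anomaly detection modules.
    determine_fields: Callable[[Any, List[str]], List[str]]
    tune: Callable[[Any, List[str]], Dict[str, Any]]
    detect: Callable[[Any, List[str], Dict[str, Any]], List[bool]]

    def run(self, config: Dict[str, Any]) -> List[bool]:
        # Receiving module: unpack the task configuration (assumed keys).
        sample_set = config["sample_set"]
        test_set = config["test_set"]
        candidate_fields = config["candidate_fields"]
        # Determination module: choose the target attribute fields.
        target_fields = self.determine_fields(sample_set, candidate_fields)
        # Parameter tuning module: tune the first hyperparameters.
        params = self.tune(sample_set, target_fields)
        # Anomaly detection module: produce the detection result.
        return self.detect(test_set, target_fields, params)


# Stub modules, for illustration only.
apparatus = AnomalyDetectionApparatus(
    determine_fields=lambda samples, fields: fields[:2],
    tune=lambda samples, fields: {"threshold": 1.0},
    detect=lambda tests, fields, params: [
        row["error"] > params["threshold"] for row in tests
    ],
)
config = {
    "sample_set": [],
    "test_set": [{"error": 0.2}, {"error": 3.5}],
    "candidate_fields": ["src_ip", "api_name", "region"],
}
results = apparatus.run(config)
print(results)
```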
Optionally, the candidate attribute fields include m selected fields and n to-be-selected fields, where m is an integer not less than 0 and n is an integer greater than 0;
the determination module includes:
a first determination submodule, configured to determine, based on the sample set, the m selected fields, and the n to-be-selected fields, a field score for each of the n to-be-selected fields, where the field score represents the degree to which anomaly detection performance improves when the corresponding to-be-selected field is added to the m selected fields;
a second determination submodule, configured to determine p to-be-selected fields from the n to-be-selected fields based on the field scores of the n to-be-selected fields, where p is a positive integer not greater than n; and
a third determination submodule, configured to determine the m selected fields and the p to-be-selected fields as the target attribute fields.
Optionally, the sample set includes a training subset and a validation subset;
the first determination submodule is specifically configured to:
form a selected field set from the m selected fields and a to-be-selected field set from the n to-be-selected fields, and determine, based on the training subset, the selected field set, and the to-be-selected field set, mutual information for each to-be-selected field in the to-be-selected field set, where the mutual information represents the correlation between the corresponding to-be-selected field and all fields in the selected field set;
select, from the to-be-selected field set, the k to-be-selected fields with the smallest mutual information, where k is a positive integer not greater than n;
determine, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, a reconstruction loss for each of the k to-be-selected fields, where the reconstruction loss represents how well anomaly detection performs on the validation subset when the corresponding to-be-selected field is used together with the selected field set;
select one to-be-selected field from the k to-be-selected fields based on the mutual information and the reconstruction losses of the k to-be-selected fields, and determine the field score of the selected field; and
move the selected field from the to-be-selected field set into the selected field set, and return to the step of determining the mutual information for each to-be-selected field in the to-be-selected field set, until the to-be-selected field set is empty, to obtain the field scores of the n to-be-selected fields.
Optionally, the first determination submodule is specifically configured to:
for a first to-be-selected field among the k to-be-selected fields, add the first to-be-selected field to the selected field set to obtain a candidate field set, where the first to-be-selected field is any one of the k to-be-selected fields;
determine, based on the training subset and the candidate field set, a second detection model corresponding to the first to-be-selected field; and
determine, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by using the second detection model corresponding to the first to-be-selected field.
Optionally, the first determination submodule is specifically configured to:
determine reference statistical features of the training subset based on the candidate field set, where the reference statistical features of the training subset include statistics of the data of all fields in the candidate field set within the training subset; and
train an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first to-be-selected field.
Optionally, the first determination submodule is specifically configured to:
determine reference statistical features of the validation subset based on the candidate field set, where the reference statistical features of the validation subset include statistics of the data of all fields in the candidate field set within the validation subset;
input the reference statistical features of the validation subset into the second detection model corresponding to the first to-be-selected field to obtain reference reconstructed features of the validation subset, where the reference reconstructed features of the validation subset include reconstructed statistics of the data of all fields in the candidate field set within the validation subset; and
determine the reconstruction loss corresponding to the first to-be-selected field based on the reference statistical features and the reference reconstructed features of the validation subset.
Optionally, the configuration parameters further indicate the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different types of statistics.
Optionally, the first hyperparameters include a learning rate, a number of training epochs, and hidden layer dimensions.
Optionally, the first detection model includes an input layer, a first hidden layer, and a second hidden layer; the hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer, the dimension of the input layer is determined based on the number of fields in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer.
Optionally, the first detection model includes an encoder, a decoder, and a discriminator, and the parameters of the discriminator include an error threshold;
the anomaly detection module includes:
a fourth determination submodule, configured to determine statistical features of the test set based on the target attribute fields, where the statistical features of the test set include statistics of the data of the target attribute fields in the test set;
a first input submodule, configured to input the statistical features of the test set into the encoder to obtain encoded features of the test set;
a second input submodule, configured to input the encoded features of the test set into the decoder to obtain reconstructed features of the test set; and
a third input submodule, configured to input the statistical features and the reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
In a third aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory. The memory of the at least one computing device is configured to store a program (that is, instructions) for executing the anomaly detection method provided in the first aspect, and to store data involved in implementing the anomaly detection method provided in the first aspect. The processor is configured to execute the program stored in the memory. The computing device may further include a communication bus configured to establish a connection between the processor and the memory.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to execute the anomaly detection method described in the first aspect.
In a fifth aspect, a computer program product containing instructions is provided. When run on a computer, the computer program product causes the computer to execute the anomaly detection method described in the first aspect.
The technical effects of the second, third, fourth, and fifth aspects are similar to those obtained by the corresponding technical means in the first aspect, and are not repeated here.
Description of the Drawings
Figure 1 is a schematic structural diagram of an anomaly detection apparatus provided by an embodiment of the present application;
Figure 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
Figure 3 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of another computing device cluster provided by an embodiment of the present application;
Figure 5 is a system architecture diagram involved in an anomaly detection method provided by an embodiment of the present application;
Figure 6 is a flowchart of an anomaly detection method provided by an embodiment of the present application;
Figure 7 is a flowchart of a method for determining field scores provided by an embodiment of the present application;
Figure 8 is a flowchart of another anomaly detection method provided by an embodiment of the present application;
Figure 9 is a flowchart of yet another anomaly detection method provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the implementations of the present application are described in further detail below with reference to the accompanying drawings.
Currently, cloud service scenarios involve many types of anomaly detection tasks. The heterogeneous resources involved in these tasks are not only numerous but also often have highly complex relationships with one another. Access to and operations on these resources across different types of cloud services typically generate massive amounts of log data on the cloud platform. Because the log data is extremely large in scale and it is extremely difficult to manually summarize and analyze rules for determining anomalies, manual monitoring and analysis of the data can hardly achieve accurate and efficient anomaly detection.
Existing machine-learning-based anomaly detection methods have two problems when applied to the above tasks. On the one hand, most existing methods are specifically designed and tuned for a particular type of anomaly detection task, and no general-purpose anomaly detection model has been built. On the other hand, designing and tuning an anomaly detection model relies heavily on expertise in machine learning and deep learning. Experts in the cloud service domain usually have rich business experience with these tasks but are unfamiliar with how to design and tune anomaly detection models. As a result, even domain experts who understand some machine learning algorithms find it difficult to build an anomaly detection model that is effective for a specific problem.
For these reasons, it is important to design a prototype system that domain experts can use on a self-service basis, that is general across the relevant anomaly detection tasks in a cloud platform, that is lightweight, and that can build anomaly detection models intelligently. With such a prototype system, domain experts can quickly build an anomaly detection model and perform anomaly detection for a target task while needing as little machine learning and deep learning knowledge as possible.
An embodiment of the present application provides an anomaly detection apparatus. As shown in Figure 1, the anomaly detection apparatus includes:
a receiving module, configured to receive configuration parameters of an anomaly detection task, where the configuration parameters indicate a sample set, a test set, and candidate attribute fields, the sample set includes log data in a cloud platform that is used for parameter tuning, the test set includes log data in the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform;
a determination module, configured to determine target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields, where the target attribute fields are the attribute fields used to perform the anomaly detection task;
a parameter tuning module, configured to tune first hyperparameters of a first detection model based on the sample set and the target attribute fields; and
an anomaly detection module, configured to perform anomaly detection on the test set based on the target attribute fields by using the parameter-tuned first detection model, to obtain an anomaly detection result of the test set. For a specific implementation, refer to the description of the embodiment in Figure 6.
Optionally, the candidate attribute fields include m selected fields and n to-be-selected fields, where m is an integer not less than 0 and n is an integer greater than 0;
the determination module includes:
a first determination submodule, configured to determine, based on the sample set, the m selected fields, and the n to-be-selected fields, a field score for each of the n to-be-selected fields, where the field score represents the degree to which anomaly detection performance improves when the corresponding to-be-selected field is added to the m selected fields;
a second determination submodule, configured to determine p to-be-selected fields from the n to-be-selected fields based on the field scores of the n to-be-selected fields, where p is a positive integer not greater than n; and
a third determination submodule, configured to determine the m selected fields and the p to-be-selected fields as the target attribute fields. For a specific implementation, refer to the description of the embodiment in Figure 6.
Optionally, the sample set includes a training subset and a validation subset;
the first determination submodule is specifically configured to:
form a selected field set from the m selected fields and a to-be-selected field set from the n to-be-selected fields, and determine, based on the training subset, the selected field set, and the to-be-selected field set, mutual information for each to-be-selected field in the to-be-selected field set, where the mutual information represents the correlation between the corresponding to-be-selected field and all fields in the selected field set;
select, from the to-be-selected field set, the k to-be-selected fields with the smallest mutual information, where k is a positive integer not greater than n;
determine, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, a reconstruction loss for each of the k to-be-selected fields, where the reconstruction loss represents how well anomaly detection performs on the validation subset when the corresponding to-be-selected field is used together with the selected field set;
select one to-be-selected field from the k to-be-selected fields based on the mutual information and the reconstruction losses of the k to-be-selected fields, and determine the field score of the selected field; and
move the selected field from the to-be-selected field set into the selected field set, and return to the step of determining the mutual information for each to-be-selected field in the to-be-selected field set, until the to-be-selected field set is empty, to obtain the field scores of the n to-be-selected fields. For a specific implementation, refer to the description of the embodiment in Figure 6.
Optionally, the first determination submodule is specifically configured to:
for a first to-be-selected field among the k to-be-selected fields, add the first to-be-selected field to the selected field set to obtain a candidate field set, where the first to-be-selected field is any one of the k to-be-selected fields;
determine, based on the training subset and the candidate field set, a second detection model corresponding to the first to-be-selected field; and
determine, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by using the second detection model corresponding to the first to-be-selected field. For a specific implementation, refer to the description of the embodiment in Figure 6.
Optionally, the first determination submodule is specifically configured to:
determine reference statistical features of the training subset based on the candidate field set, where the reference statistical features of the training subset include statistics of the data of all fields in the candidate field set within the training subset; and
train an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first to-be-selected field. For a specific implementation, refer to the description of the embodiment in Figure 6.
可选地,第一确定子模块具体用于:Optionally, the first determination sub-module is specifically used to:
基于候选字段集合确定验证子集的参考统计特征,验证子集的参考统计特征包括验证子集中候选字段集合包括的所有字段的数据的统计量;Determine the reference statistical characteristics of the verification subset based on the candidate field set, and the reference statistical characteristics of the verification subset include statistics of data of all fields included in the candidate field set in the verification subset;
将验证子集的参考统计特征输入第一待选字段对应的第二检测模型,以得到验证子集的参考重建特征,验证子集的参考重建特征包括验证子集中候选字段集合包括的所有字段的数据的重建统计量;Input the reference statistical characteristics of the verification subset into the second detection model corresponding to the first candidate field to obtain the reference reconstruction characteristics of the verification subset. The reference reconstruction characteristics of the verification subset include the characteristics of all fields included in the candidate field set in the verification subset. Reconstruction statistics of data;
基于验证子集的参考统计特征和参考重建特征,确定第一待选字段对应的重建损失。具体实现方式请参照图6实施例的相关介绍。Based on the reference statistical features and reference reconstruction features of the verification subset, the reconstruction loss corresponding to the first candidate field is determined. For specific implementation methods, please refer to the relevant introduction of the embodiment in Figure 6.
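As a non-limiting illustration, the comparison between the reference statistical features and the reference reconstruction features can be sketched as follows. The sketch assumes that the reconstruction loss is a mean squared error; this passage does not fix a particular loss function, and the function name `reconstruction_loss` is hypothetical.

```python
from statistics import fmean


def reconstruction_loss(stat_features, recon_features):
    """Mean squared error between statistics and their reconstruction.

    Assumption: MSE is only one possible choice of loss.
    """
    if len(stat_features) != len(recon_features):
        raise ValueError("feature vectors must have the same length")
    return fmean((s - r) ** 2 for s, r in zip(stat_features, recon_features))
```

A smaller value indicates that the second detection model reconstructs the statistics of the validation subset more faithfully.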
Optionally, the configuration parameter further indicates the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different kinds of statistics. For the specific implementation, refer to the description of the embodiment of FIG. 6.
Optionally, the first hyperparameter includes a learning rate, a number of training epochs, and hidden layer dimensions. For the specific implementation, refer to the description of the embodiment of FIG. 6.
Optionally, the first detection model includes an input layer, a first hidden layer, and a second hidden layer. The hidden layer dimensions include the dimensions of the first hidden layer and the second hidden layer. The dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer. For the specific implementation, refer to the description of the embodiment of FIG. 6.
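As a non-limiting illustration, the dependency between the number of target attribute fields and the layer dimensions can be sketched as follows. The number of statistics per field and the shrink ratios 0.5 and 0.25 are assumptions made only for this sketch; this passage states only that the hidden layer dimensions are determined based on the input layer dimension.

```python
def layer_dimensions(target_fields, stats_per_field=1, ratios=(0.5, 0.25)):
    """Derive layer sizes from the number of target attribute fields.

    Assumptions: each field contributes `stats_per_field` statistics, and
    each hidden layer is a fixed fraction of the input layer.
    """
    input_dim = len(target_fields) * stats_per_field
    first_hidden = max(1, int(input_dim * ratios[0]))
    second_hidden = max(1, int(input_dim * ratios[1]))
    return input_dim, first_hidden, second_hidden
```

For example, four target attribute fields with two statistics each give an input layer of dimension 8 and, under the assumed ratios, hidden layers of dimensions 4 and 2.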
Optionally, the first detection model includes an encoder, a decoder, and a discriminator, and the parameters of the discriminator include an error threshold.
The anomaly detection module includes:
a fourth determination submodule, configured to determine statistical features of the test set based on the target attribute fields, where the statistical features of the test set include statistics of the data of the target attribute fields in the test set;
a first input submodule, configured to input the statistical features of the test set into the encoder to obtain encoded features of the test set;
a second input submodule, configured to input the encoded features of the test set into the decoder to obtain reconstructed features of the test set; and
a third input submodule, configured to input the statistical features and the reconstructed features of the test set into the discriminator to determine an anomaly detection result of the test set according to the error threshold. For the specific implementation, refer to the description of the embodiment of FIG. 6.
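As a non-limiting illustration, the flow through the encoder, the decoder, and the discriminator can be sketched as follows. The sketch assumes that the discriminator compares a mean squared reconstruction error with the error threshold; the toy encoder and decoder are hypothetical stand-ins for trained networks.

```python
def discriminate(stat_features, recon_features, error_threshold):
    """Return True (anomalous) if the reconstruction error exceeds the threshold.

    Assumption: the error is a mean squared error.
    """
    error = sum((s - r) ** 2 for s, r in zip(stat_features, recon_features))
    return error / len(stat_features) > error_threshold


def detect(stat_features, encoder, decoder, error_threshold):
    encoded = encoder(stat_features)   # statistical features -> encoded features
    reconstructed = decoder(encoded)   # encoded features -> reconstructed features
    return discriminate(stat_features, reconstructed, error_threshold)


# Toy stand-ins (hypothetical): compress two features to their mean and expand back.
def toy_encoder(features):
    return [sum(features) / len(features)]


def toy_decoder(encoded):
    return [encoded[0], encoded[0]]
```

With these stand-ins, `detect([3.0, 3.0], toy_encoder, toy_decoder, 1.0)` reports no anomaly, whereas `detect([0.0, 10.0], toy_encoder, toy_decoder, 1.0)` reports an anomaly because the input cannot be reconstructed within the threshold.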
The receiving module, the determination module, the parameter tuning module, and the anomaly detection module may each be implemented in software, in hardware, or in a combination of software and hardware. The following uses the parameter tuning module as an example to describe an implementation. The receiving module, the determination module, and the anomaly detection module may be implemented in a similar manner, with reference to the implementation of the parameter tuning module.
As an example of a module serving as a software functional unit, the parameter tuning module may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, there may be one or more such computing instances. For example, the parameter tuning module may include code running on multiple hosts, virtual machines, or containers. It should be noted that the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same region or in different regions. Further, they may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. A region usually includes multiple AZs.
Similarly, the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs. A VPC is usually deployed within one region. For communication between two VPCs in the same region, and for cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and the VPCs are interconnected through the communication gateways.
As an example of a module serving as a hardware functional unit, the parameter tuning module may include at least one computing device, such as a server. Alternatively, the parameter tuning module may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the parameter tuning module may be distributed in the same region or in different regions, and in the same AZ or in different AZs. Similarly, they may be distributed in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that, in other embodiments, the receiving module may be used to perform any step of the anomaly detection method, as may the determination module and the parameter tuning module. The steps implemented by the receiving module, the determination module, the parameter tuning module, and the anomaly detection module may be specified as needed, and the four modules each implement different steps of the anomaly detection method so as to realize all functions of the anomaly detection apparatus.
In the embodiments of this application, personnel in the cloud service field do not need to create security rules manually or to analyze and summarize various attack patterns in depth by hand. This avoids the loopholes of manually created security rules, reduces missed detections and false detections, and improves the efficiency of anomaly detection. In addition, even when the relevant personnel have little or no expertise in machine learning and deep learning, such as knowledge of model design and tuning, this solution allows them to build an anomaly detection model quickly and in a self-service manner, and to perform anomaly detection for a specific task.
It should be noted that, when the anomaly detection apparatus provided in the foregoing embodiment performs anomaly detection, the division into the foregoing functional modules is used only as an example. In practical applications, the foregoing functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or some of the functions described above. In addition, the anomaly detection apparatus provided in the foregoing embodiment and the anomaly detection method embodiments belong to the same concept. For the specific implementation process, refer to the method embodiments; details are not repeated here.
An embodiment of this application further provides a computing device 100. As shown in FIG. 2, the computing device 100 includes a bus 102, a processor 104, a memory 106, and a communication interface 108. The processor 104, the memory 106, and the communication interface 108 communicate with one another through the bus 102. The computing device 100 may be a server or a terminal device. It should be understood that this application does not limit the number of processors or memories in the computing device 100.
The bus 102 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one line is used in FIG. 2, but this does not mean that there is only one bus or only one type of bus. The bus 102 may include a path for transferring information between the components of the computing device 100 (for example, the memory 106, the processor 104, and the communication interface 108).
The processor 104 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 106 may include volatile memory, such as random access memory (RAM). The memory 106 may further include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 106 stores executable program code, and the processor 104 executes the executable program code to implement the functions of the foregoing receiving module, determination module, parameter tuning module, and anomaly detection module, thereby implementing the anomaly detection method. That is, the memory 106 stores instructions for performing the anomaly detection method.
The communication interface 108 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 100 and other devices or a communication network.
An embodiment of this application further provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
As shown in FIG. 3, the computing device cluster includes at least one computing device 100. The memories 106 of one or more computing devices 100 in the computing device cluster may store the same instructions for performing the anomaly detection method.
In some possible implementations, the memories 106 of one or more computing devices 100 in the computing device cluster may each store some of the instructions for performing the anomaly detection method. In other words, a combination of one or more computing devices 100 may jointly execute the instructions for performing the anomaly detection method.
It should be noted that the memories 106 of different computing devices 100 in the computing device cluster may store different instructions, each used to perform some of the functions of the anomaly detection apparatus. That is, the instructions stored in the memories 106 of different computing devices 100 may implement the functions of one or more of the receiving module, the determination module, the parameter tuning module, and the anomaly detection module.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. FIG. 4 shows one possible implementation. As shown in FIG. 4, two computing devices 100A and 100B are connected through a network. Specifically, each computing device connects to the network through its communication interface. In this type of implementation, the memory 106 of the computing device 100A stores instructions for performing the functions of the receiving module and the determination module, and the memory 106 of the computing device 100B stores instructions for performing the functions of the parameter tuning module and the anomaly detection module.
The connection manner of the computing device cluster shown in FIG. 4 takes into account that the anomaly detection method provided in this application requires substantial data storage and computing resources. The functions of the parameter tuning module and the anomaly detection module are therefore assigned to the computing device 100B.
It should be understood that the functions of the computing device 100A shown in FIG. 4 may also be performed by multiple computing devices 100. Likewise, the functions of the computing device 100B may also be performed by multiple computing devices 100.
An embodiment of this application further provides another computing device cluster. For the connection relationship between the computing devices in this cluster, refer to the connection manner of the computing device clusters described with reference to FIG. 3 and FIG. 4. The difference is that the memories 106 of one or more computing devices 100 in this computing device cluster may store the same instructions for performing the anomaly detection method.
In some possible implementations, the memories 106 of one or more computing devices 100 in this computing device cluster may each store some of the instructions for performing the anomaly detection method. In other words, a combination of one or more computing devices 100 may jointly execute the instructions for performing the anomaly detection method.
An embodiment of this application further provides a computer program product containing instructions. The computer program product may be software or a program product that contains instructions and that can run on a computing device or be stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to perform the foregoing anomaly detection method.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium on which a computing device can store data, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state drive), or the like. The computer-readable storage medium includes instructions that instruct a computing device to perform the foregoing anomaly detection method.
FIG. 5 is a diagram of a system architecture involved in an anomaly detection method according to an embodiment of this application. Referring to FIG. 5, the system may be referred to as an anomaly detection system and includes a client and a detection device.
The detection device is configured to perform an anomaly detection task according to the received configuration parameter of the anomaly detection task. That is, the detection device is configured to perform the steps of the anomaly detection method provided in the embodiments of this application.
The client is configured to send the configuration parameter of the anomaly detection task to the detection device. For example, when the client detects a configuration operation, it determines the configuration parameter of the anomaly detection task and sends the configuration parameter to the detection device.
Optionally, the detection device is the computing device shown in FIG. 2, or includes the multiple computing devices shown in FIG. 3 or FIG. 4.
The system architecture and service scenarios described in the embodiments of this application are intended to explain the technical solutions of the embodiments more clearly, and do not limit the technical solutions provided in the embodiments. A person of ordinary skill in the art will appreciate that, as system architectures evolve and new service scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
FIG. 6 is a flowchart of an anomaly detection method according to an embodiment of this application. Taking application of the method to a detection device as an example, and referring to FIG. 6, the method includes the following steps.
Step 601: Receive a configuration parameter of an anomaly detection task. The configuration parameter indicates a sample set, a test set, and candidate attribute fields. The sample set includes log data of the cloud platform that is used for parameter tuning, the test set includes log data of the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform.
To detect abnormal operations and accesses in the cloud platform with an anomaly detection model even when the relevant personnel have little or no knowledge of machine learning and deep learning, in the embodiments of this application a person configures some parameters that do not involve model design or tuning expertise, and submits the configuration parameter of the anomaly detection task to the detection device. The detection device receives the configuration parameter and, based on it, automatically performs feature engineering, model construction, and parameter tuning. The parameter-tuned anomaly detection model then performs anomaly detection for the anomaly detection task.
The configuration parameter indicates the candidate attribute fields, the sample set, and the test set. The sample set includes log data of the cloud platform that is used for parameter tuning, and the test set includes log data of the cloud platform on which anomaly detection is to be performed. It should be understood that the log data of the cloud platform includes data of multiple attribute fields, and the candidate attribute fields may include some or all of these attribute fields. That is, the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform. The sample set is used for feature engineering, model construction, and parameter tuning. During feature engineering, the detection device needs to select target attribute fields from the candidate attribute fields based on the sample set. The test set is the data to be detected as indicated by the anomaly detection task. The detection device subsequently performs anomaly detection on the test set based on the target attribute fields by using the parameter-tuned anomaly detection model.
In the embodiments of this application, the sample set includes a training subset and a validation subset. The configuration parameter includes the start and end times of the training subset, of the validation subset, and of the test set. The foregoing anomaly detection model obtains the training subset from the log data of the cloud platform based on the start and end times of the training subset, obtains the validation subset based on the start and end times of the validation subset, and obtains the test set based on the start and end times of the test set. The training subset and the validation subset are used together to construct the anomaly detection model and tune its parameters.
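As a non-limiting illustration, obtaining a subset from the log data by its start and end times can be sketched as follows. The record layout (a dictionary with a `"time"` key) and the half-open interval convention are assumptions made only for this sketch.

```python
from datetime import datetime


def select_by_time(log_records, start, end):
    """Return the records whose timestamp lies in [start, end).

    Assumption: each record is a dict with a datetime under the key "time".
    """
    return [r for r in log_records if start <= r["time"] < end]
```

The same helper can be called three times, once each with the start and end times of the training subset, the validation subset, and the test set.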
As described above, there are multiple types of anomaly detection tasks in cloud service scenarios. Accordingly, the configuration parameter further indicates a detection object. For example, the configuration parameter includes an identifier of the detection object, and the detection object may be a target user, a target service, a target host, or the like. When the detection object is a target user, the anomaly detection task is used to detect anomalies in the target user's operations on and accesses to the cloud platform, and the sample set and the test set include log data of the cloud platform related to the target user. When the detection object is a target service, the anomaly detection task is used to detect anomalies in the target service of the cloud platform, and the sample set and the test set include log data of the cloud platform related to the target service. When the detection object is a target host, the anomaly detection task is used to detect anomalies in operations and accesses related to the target host in the cloud platform, and the sample set and the test set include log data of the cloud platform related to the target host.
The training subset includes multiple training samples, the validation subset includes multiple validation samples, and the test set includes one or more samples to be tested. To determine which time period of log data each sample of the training subset, the validation subset, and the test set covers, the configuration parameter further indicates a time granularity of anomaly detection. Based on the time granularity of anomaly detection, the anomaly detection model determines the samples included in the training subset, the validation subset, and the test set, and the time granularity of each sample equals the time granularity of anomaly detection. A sample of the training subset is referred to as a training sample, a sample of the validation subset as a validation sample, and a sample of the test set as a sample to be tested.
For example, the detection object is user a, the time granularity of anomaly detection is 24 hours, and the training subset runs from January 1 to June 30, 2022; that is, the training subset includes the log data related to user a in these six months. The detection device determines the log data of each 24-hour period within these six months as one training sample of the training subset.
To enable finer-grained statistics on each of the foregoing samples, and thereby improve anomaly detection performance through more precise feature engineering and model tuning, the configuration parameter further indicates a time granularity of a single record. The time granularity of anomaly detection is an integer multiple of the time granularity of a single record. Each of the foregoing samples includes multiple single records. Based on the time granularity of a single record, the detection device determines the single records in each sample of the training subset, the validation subset, and the test set.
Continuing the foregoing example, the detection object is user a, the time granularity of anomaly detection is 24 hours, and the time granularity of a single record is 1 hour. The detection device determines the log data of each hour of the 24 hours covered by a training sample as one single record, so that each training sample includes 24 single records. Put another way, the detection device determines the log data of each hour in the training subset as one single record according to the time granularity of a single record, and groups the 24 single records of each 24-hour period in the training subset into one training sample.
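As a non-limiting illustration, the two-level division in the foregoing example (24-hour samples, each made up of 1-hour units) can be sketched as follows. The record layout (a dictionary with a `"time"` key) and the function name `build_samples` are assumptions made only for this sketch; periods that contain no log data are simply omitted here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_samples(log_records, start, detection_granularity, unit_granularity):
    """Group records into samples, each split into fixed-size time units.

    Assumption: each record is a dict with a datetime under the key "time".
    """
    if detection_granularity % unit_granularity != timedelta(0):
        raise ValueError("detection granularity must be an integer multiple")
    units_per_sample = detection_granularity // unit_granularity
    samples = defaultdict(lambda: [[] for _ in range(units_per_sample)])
    for record in log_records:
        offset = record["time"] - start
        if offset < timedelta(0):
            continue  # record precedes the subset
        sample_idx = offset // detection_granularity
        unit_idx = (offset % detection_granularity) // unit_granularity
        samples[sample_idx][unit_idx].append(record)
    # Samples in time order; periods without any record are omitted.
    return [samples[i] for i in sorted(samples)]
```

With a detection granularity of `timedelta(hours=24)` and a unit granularity of `timedelta(hours=1)`, every returned sample is a list of 24 hourly buckets.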
Optionally, the candidate attribute fields indicated by the configuration parameter include selected fields and to-be-selected fields. A selected field is a field that is certain to be used in the subsequent anomaly detection on the test set. A to-be-selected field is a field for which, before feature engineering, it is not yet certain whether it will be used in the subsequent anomaly detection on the test set. During feature engineering, the detection device determines some of the to-be-selected fields included in the candidate attribute fields, and determines these to-be-selected fields together with all the selected fields as the target attribute fields. The target attribute fields include all fields required in the subsequent anomaly detection on the test set; that is, the target attribute fields are the attribute fields used to perform the anomaly detection task.
In the subsequent feature engineering, the anomaly detection model obtains statistics of the data of the candidate attribute fields in the log data and performs feature engineering based on these statistics. The candidate attribute fields include multiple attribute fields. Because different attribute fields have different characteristics, the specific way in which statistics are computed differs between attribute fields, which yields more valuable statistics. Accordingly, the configuration parameter further indicates the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different kinds of statistics. The statistics that correspond to each category of attribute field are described in detail in step 602 below.
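As a non-limiting illustration, the items indicated by the configuration parameter in step 601 can be collected in a single structure, sketched as follows. All class and attribute names are hypothetical; the embodiments do not prescribe a concrete data format for the configuration parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DetectionTaskConfig:
    """Hypothetical container for the configuration parameter of step 601."""
    detection_object: str            # identifier of the user, service, or host
    train_range: tuple               # (start, end) of the training subset
    validation_range: tuple          # (start, end) of the validation subset
    test_range: tuple                # (start, end) of the test set
    detection_granularity: timedelta
    unit_granularity: timedelta      # time granularity of a single record
    selected_fields: list
    to_be_selected_fields: list
    field_categories: dict           # field name -> category 1..4

    def __post_init__(self):
        # The detection granularity must be an integer multiple of the unit.
        if self.detection_granularity % self.unit_granularity != timedelta(0):
            raise ValueError("detection granularity must be an integer multiple")
        missing = [f for f in self.candidate_fields if f not in self.field_categories]
        if missing:
            raise ValueError(f"fields without a category: {missing}")

    @property
    def candidate_fields(self):
        # Candidate attribute fields = selected fields + to-be-selected fields.
        return self.selected_fields + self.to_be_selected_fields
```

Validating the configuration on construction lets the detection device reject an inconsistent task before any feature engineering starts.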
In this embodiment of the present application, the category of each attribute field is a first-type attribute, a second-type attribute, a third-type attribute, or a fourth-type attribute. As shown in Table 1, a first-type attribute is a situational attribute. For example, fields that characterize a specific scenario, such as service name or service type, are first-type attributes. An attribute field of the second type takes discrete values, and the number of distinct values does not exceed a type threshold; that is, its values are discrete and few in kind. Generally, the possible values of such a field with a limited value set are ordered. For example, the status field is a second-type attribute, and its possible values include '0, 1, 2, 3'. An attribute field of the third type takes discrete values, and the number of distinct values exceeds the type threshold; that is, its values are discrete and many in kind. For example, the remote operation address field is a third-type attribute, and its value may be '192.168.2.1', '192.168.1.3', and so on. An attribute field of the fourth type takes continuous values. For example, the total length field of a hypertext transfer protocol (HTTP) request is a fourth-type attribute.
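The four categories described above can be sketched as a small enumeration plus a helper that separates second-type from third-type fields by counting distinct values. This is only an illustrative sketch: the names `FieldCategory`, `discrete_category`, and `FIELD_CATEGORIES`, as well as the threshold value 10, are assumptions and do not appear in the application, where the category of each field is supplied through the configuration parameter.

```python
from enum import Enum


class FieldCategory(Enum):
    CONTEXT = 1         # first type: situational attribute (e.g. service_name)
    DISCRETE_FEW = 2    # second type: discrete, distinct values <= type threshold
    DISCRETE_MANY = 3   # third type: discrete, distinct values > type threshold
    CONTINUOUS = 4      # fourth type: continuous values


def discrete_category(values, type_threshold):
    """Classify a discrete-valued field as second or third type.

    `type_threshold` is a hypothetical parameter standing in for the
    "type threshold" mentioned in the text.
    """
    distinct = len(set(values))
    if distinct <= type_threshold:
        return FieldCategory.DISCRETE_FEW
    return FieldCategory.DISCRETE_MANY


# Only the four fields whose category is stated explicitly in the text.
FIELD_CATEGORIES = {
    "service_name": FieldCategory.CONTEXT,
    "status": discrete_category(["0", "1", "2", "3"], type_threshold=10),
    "remote_addr": FieldCategory.DISCRETE_MANY,
    "request_length": FieldCategory.CONTINUOUS,
}
```

In practice the mapping would be written by the person configuring the task rather than inferred from data; the helper merely shows how the threshold separates the two discrete categories.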
Table 1
As an example, a log record includes, but is not limited to, data of the following attribute fields: time (timestamp), userId (user identifier), remote_addr (remote operation address), service_name (service name, also called service kind or service type), api_id (identifier of the application operated on; one service may provide one or more applications), body_bytes_sent (HTTP content length of the operation), forward_flag (flag indicating whether the backend forwards the request), status (status, indicating whether the operation succeeded), request_method (HTTP request method), request_length (total length of the HTTP request), diff_time (response time), accessModel (access mode), and deploy_type (database type). Here, time is used to determine the log data included in each sample, and userId is used to filter users; time and userId are not themselves objects of statistics. service_name is a first-type attribute, and the categories of the remaining attribute fields are as follows.
In some embodiments, the selected fields among the candidate attribute fields are the three attribute fields remote_addr, api_id, and body_bytes_sent, and the candidate (to-be-selected) fields among the candidate attribute fields are the seven attribute fields forward_flag, status, accessModel, deploy_type, request_method, diff_time, and request_length.
Besides determining the statistics to be computed from the category of each attribute field, the statistics can also be determined from whether the anomaly detection task is time-series related. Optionally, the configuration parameter also indicates whether the anomaly detection task is time-series related. When the task is time-series related, the statistics corresponding to certain attribute fields (such as attribute fields of the second type and/or the third type) additionally include a user access profile and/or an access profile similarity. This is also described in detail in step 602.
In a cloud service scenario, log data in the cloud platform may be stored on multiple devices and under multiple paths. For example, log data related to cloud service a is stored on device a, log data related to cloud service b is stored on device b, the log data of enterprise 1 is stored under path 1, and the log data of enterprise 2 is stored under path 2. Accordingly, the storage location of the log data related to the anomaly detection task can also be configured manually. Optionally, the configuration parameter also indicates the storage location of the log data related to the anomaly detection task. The anomaly detection model obtains the log data of the sample set and the test set from the cloud platform according to that storage location.
All of the configuration parameters described above can be set by cloud-domain experts on the basis of their business experience; the personnel involved need little or no specialist knowledge of machine learning or deep learning. Table 2 is an input information table for the configuration parameters provided by an embodiment of the present application, and the above configuration parameters can be configured according to Table 2.
Table 2
To ensure that the anomaly detection task executes successfully, the detection device performs a logic check on the configuration parameter after receiving it. If the detection device finds that the configuration parameter is logically abnormal, it returns prompt information asking for the configuration to be redone. For example, the configuration parameter is logically abnormal if the time granularity of anomaly detection is smaller than the time granularity of a single sample, if the time granularity of anomaly detection is not equal to the time granularity of a single sample, or if the time period defined by the start and end times of the test set overlaps the time period defined by the start and end times of the training subset. If the logic of the configuration parameter is found to be normal, the detection device proceeds to step 602.
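The logic check can be illustrated with a short sketch. It covers only two of the listed conditions (detection granularity smaller than the single-sample granularity, and overlap between the test-set and training-subset periods); the function name `check_config` and its parameters are hypothetical and are not taken from the application.

```python
from datetime import datetime, timedelta


def check_config(detect_granularity, sample_granularity, train_period, test_period):
    """Return a list of human-readable problems; an empty list means the check passed.

    Periods are (start, end) tuples treated as half-open intervals [start, end).
    """
    problems = []
    if detect_granularity < sample_granularity:
        problems.append("detection granularity is smaller than single-sample granularity")
    train_start, train_end = train_period
    test_start, test_end = test_period
    # Two half-open intervals overlap when each one starts before the other ends.
    if train_start < test_end and test_start < train_end:
        problems.append("test set period overlaps the training subset period")
    return problems


# Example: 24-hour detection granularity, 1-hour samples, back-to-back periods.
issues = check_config(
    timedelta(hours=24),
    timedelta(hours=1),
    (datetime(2023, 1, 1), datetime(2023, 1, 8)),
    (datetime(2023, 1, 8), datetime(2023, 1, 9)),
)
```

A non-empty result would correspond to the case in which the detection device returns prompt information instead of continuing to step 602.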
Step 602: Based on the sample set and the candidate attribute fields, determine the target attribute fields from the candidate attribute fields, the target attribute fields being the attribute fields used for the anomaly detection task.
In this embodiment of the present application, the detection device obtains the sample set indicated by the configuration parameter from the log data of the cloud platform.
For example, the configuration data includes the start and end times of the training subset and the start and end times of the validation subset, and the sample set includes the training subset and the validation subset. The detection device obtains all training samples in the training subset according to the start and end times of the training subset. Likewise, the detection device obtains all validation samples in the validation subset according to the start and end times of the validation subset.
As another example, the configuration data includes the identifier of the detection object, the start and end times of the training subset, the start and end times of the validation subset, the storage location of the log data, and the time granularity of a single sample, and the sample set includes the training subset and the validation subset. The detection device obtains all single samples in the training subset according to the identifier of the detection object, the start and end times of the training subset, the storage location of the log data, and the time granularity of a single sample. Likewise, the detection device obtains all single samples in the validation subset according to the identifier of the detection object, the start and end times of the validation subset, the storage location of the log data, and the time granularity of a single sample. For the specific implementation, refer also to the related description in step 601.
After obtaining the sample set, the detection device determines the target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields. That is, the detection device uses the sample set to screen the target attribute fields out of the candidate attribute fields through feature engineering.
As described above, the candidate attribute fields include selected fields and candidate (to-be-selected) fields. To remove redundant fields from the candidate fields and ensure that the screened target attribute fields are valuable for the anomaly detection task, this embodiment of the present application screens fields by determining a field score for each candidate field. This is described in detail below.
In this embodiment of the present application, the candidate attribute fields include m selected fields and n candidate fields, where m is an integer not less than 0 and n is an integer greater than 0. Based on the sample set, the m selected fields, and the n candidate fields, the detection device determines the field score corresponding to each of the n candidate fields. Based on those field scores, the detection device determines p candidate fields from the n candidate fields, where p is a positive integer not greater than n. The detection device takes the m selected fields and the p candidate fields as the target attribute fields. The field score characterizes how much the anomaly detection effect improves when the corresponding candidate field is added to the m selected fields. Put another way, the field score characterizes the effect of anomaly detection performed with the selected fields plus the corresponding candidate field.
To assess how much the anomaly detection effect improves when a candidate field is added to the m selected fields, this embodiment of the present application determines the field score by combining the mutual information between fields with the reconstruction loss corresponding to the field. This is described below with reference to Figure 7.
Figure 7 is a flowchart of a method for determining field scores provided by an embodiment of the present application. The method includes steps 6021 to 6026.
Step 6021: Form the m selected fields into a selected field set, and form the n candidate fields into a candidate field set.
Step 6022: Based on the training subset, the selected field set, and the candidate field set, determine the mutual information corresponding to each candidate field in the candidate field set.
Here, the mutual information characterizes the correlation between the corresponding candidate field and all fields in the selected field set.
In this embodiment of the present application, the detection device determines the mutual information corresponding to each candidate field in the candidate field set as follows. It determines the statistics of the data, in the training subset, of all fields in the selected field set and in the candidate field set. For a second candidate field, which is any candidate field in the candidate field set, it determines a plurality of first mutual information values based on the statistics of the data, in the training subset, of the second candidate field and of all selected fields in the selected field set; the plurality of first mutual information values comprises the mutual information between the second candidate field and each selected field in the selected field set. It then takes the maximum of the plurality of first mutual information values as the mutual information corresponding to the second candidate field. In short, the mutual information between fields is computed from the statistics corresponding to the fields.
In one embodiment, the statistics of the data of a first selected field in the training subset include R first statistics, and the statistics of the data of the second candidate field in the training subset include S second statistics, where the first selected field is any selected field in the selected field set and R and S are both integers greater than 0. The detection device determines the plurality of first mutual information values as follows: based on the R first statistics and the S second statistics, it determines S second mutual information values through multiple rounds of iteration, and it takes the mean of the S second mutual information values as the mutual information, among the plurality of first mutual information values, between the second candidate field and the first selected field.
In the j-th round of iteration, the detection device determines the mutual information between the j-th of the S second statistics and each of (R-j+1) first statistics among the R first statistics, obtaining (R-j+1) reference mutual information values in one-to-one correspondence with the (R-j+1) first statistics, where j is an integer greater than 0 and not greater than R. The detection device takes the maximum of the (R-j+1) reference mutual information values as the j-th of the S second mutual information values, and takes the (R-j) first statistics that remain after removing the first statistic corresponding to that maximum as the (R-j) first statistics for round j+1.
For example, status is a selected field, and the statistics corresponding to status in the training subset include 500 first statistics, that is, 500 dimensions. diff_time is a candidate field, and the statistics corresponding to diff_time in the training subset include 300 second statistics, that is, 300 dimensions. In round 1, the detection device computes the mutual information between dimension 1 of diff_time and each of the 500 dimensions of status, obtaining 500 reference mutual information values. Suppose the maximum of these 500 values corresponds to dimension 321 of status; the detection device then takes that maximum as the mutual information between dimension 1 of diff_time and status, which is the 1st second mutual information value, and removes dimension 321 of status. In round 2, the detection device computes the mutual information between dimension 2 of diff_time and each of the remaining 499 dimensions of status, obtaining 499 reference mutual information values. Suppose the maximum of these 499 values corresponds to dimension 432 of status; the detection device then takes that maximum as the mutual information between dimension 2 of diff_time and status, which is the 2nd second mutual information value, and removes dimension 432 of status. Proceeding in the same way for the remaining 298 dimensions of diff_time, the detection device obtains the mutual information between each of the 300 dimensions of diff_time and status, that is, 300 second mutual information values. The detection device takes the mean of these 300 second mutual information values as the mutual information between diff_time and status.
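The iterative matching described above can be sketched as follows. This is a minimal illustration, assuming that each statistic is available as a list of already-discretized values with one value per training sample; the function names are hypothetical, and the sketch simply stops if the selected field runs out of unmatched statistics, a case the text does not discuss.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) of two equally long sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # P(x,y) * log(P(x,y) / (P(x) P(y))), with probabilities estimated from counts.
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def field_pair_mi(selected_stats, candidate_stats):
    """Mean of the per-dimension maxima, removing each matched selected statistic."""
    remaining = list(selected_stats)          # the R first statistics
    second_mi = []
    for cand in candidate_stats:              # the S second statistics, in order
        if not remaining:                     # assumption: stop when none are left
            break
        scores = [mutual_information(cand, sel) for sel in remaining]
        best = max(range(len(scores)), key=scores.__getitem__)
        second_mi.append(scores[best])
        remaining.pop(best)                   # this statistic is not reused
    return sum(second_mi) / len(second_mi)


def candidate_field_mi(selected_fields_stats, candidate_stats):
    """Mutual information of a candidate field: maximum over all selected fields."""
    return max(field_pair_mi(stats, candidate_stats) for stats in selected_fields_stats)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
pair = field_pair_mi([a, b], [a])   # statistic `a` matches itself, MI = log 2
```

Removing the matched statistic after each round keeps one highly informative selected statistic from dominating every round, which appears to be the purpose of the removal step in the text.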
In this embodiment of the present application, the detection device may compute the mutual information between statistics according to formula (1). In formula (1), I(Su; Sc) denotes the mutual information between statistic Su and statistic Sc, and su and sc denote values taken by Su and Sc respectively. P(·,·) denotes the joint probability distribution, and P(·) denotes the probability density.
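Formula (1) itself is not reproduced in this text. Assuming it is the standard definition of mutual information for discretized statistics, written with the symbols defined above, it would read:

```latex
I(S_u; S_c) = \sum_{s_u} \sum_{s_c} P(s_u, s_c)\,\log \frac{P(s_u, s_c)}{P(s_u)\,P(s_c)}
```

Under this definition the value is zero when the two statistics are independent and grows as they share more information.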
The larger the mutual information between two statistics, the stronger their association and the more redundant information they share, and the smaller the contribution to the anomaly detection effect of adding a statistic of the candidate field on top of a statistic of the selected field. The smaller the mutual information between two statistics, the weaker their association and the less redundant information they share, and the greater the contribution to the anomaly detection effect of adding a statistic of the candidate field on top of a statistic of the selected field.
To simplify the computation, this embodiment of the present application discretizes statistics that take continuous values and computes the mutual information from the discretized statistics. The discretization method may be a quartile-based method or another method, which is not limited in the embodiments of the present application.
For example, the detection device discretizes continuous-valued statistics with a quartile-based method. In a specific implementation, for each statistic, the detection device computes the quartiles from all values of that statistic over all training samples, and the three quartile points are denoted Q1, Q2, and Q3 in order. Let IQR = Q3 - Q1. Six intervals are determined using Q1 - 1.5·IQR, Q1, Q2, Q3, and Q3 + 1.5·IQR as cut points, and the statistic of every training sample is discretized into these six intervals. The statistics of the validation subset and the test set are subsequently discretized using the same six intervals.
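A minimal sketch of this quartile-based discretization is given below. The function names are hypothetical, the quartiles are computed with the standard library's "inclusive" method (the text does not say which quantile convention is used), and values that fall exactly on a cut point are assigned to the interval above it, which is also an assumption.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_cut_points(train_values):
    """Derive the five cut points from the training values of one statistic."""
    q1, q2, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return [q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr]


def discretize(values, cut_points):
    """Map each value to an interval index in 0..5 using the fitted cut points."""
    return [bisect_right(cut_points, v) for v in values]


# Fit on training data, then reuse the same cut points for other data.
cuts = fit_cut_points([1, 2, 3, 4, 5, 6, 7, 8, 9])   # Q1=3, Q2=5, Q3=7, IQR=4
bins = discretize([-10, 0, 3, 6, 7, 100], cuts)
```

Fitting the cut points on the training samples only and reusing them for the validation and test data keeps the interval boundaries identical across the three data sets.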
Step 6023: Select, from the candidate field set, the k candidate fields with the smallest mutual information.
That is, the detection device selects k candidate fields from the candidate field set in ascending order of mutual information. Here, k is a positive integer not greater than n; more precisely, k is not greater than the total number of fields currently in the candidate field set.
In this embodiment of the present application, the detection device determines k from a preset value. If the current candidate field set contains multiple candidate fields with the same mutual information, such that the number of candidate fields with the smallest mutual information exceeds the preset value, the detection device sets the current k equal to the number of candidate fields that currently have the smallest mutual information. If the total number of fields in the current candidate field set is less than the preset value, the detection device sets the current k equal to the total number of fields in the candidate field set, or to that total minus a specified value (such as 1), so long as k does not exceed the total number of fields in the candidate field set.
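The rule for choosing k can be sketched as follows. The function names and the dictionary representation are assumptions made for illustration, and the sketch implements only the first of the two options for the case where fewer fields remain than the preset value (k equal to the number of remaining fields).

```python
def choose_k(mutual_info, preset_k):
    """mutual_info maps each remaining candidate field to its mutual information."""
    n = len(mutual_info)
    if n < preset_k:
        return n                      # fewer fields left than the preset value
    smallest = min(mutual_info.values())
    ties = sum(1 for v in mutual_info.values() if v == smallest)
    # If more fields tie for the smallest value than the preset allows, take them all.
    return max(preset_k, ties)


def pick_lowest(mutual_info, preset_k):
    """Return the k candidate fields with the smallest mutual information."""
    k = choose_k(mutual_info, preset_k)
    return sorted(mutual_info, key=mutual_info.get)[:k]


chosen = pick_lowest({"diff_time": 0.3, "status": 0.1, "forward_flag": 0.2}, 2)
```

Because `sorted` is stable, fields with equal mutual information keep their original order, so tied fields are never split arbitrarily when k covers all of them.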
Step 6024: Based on the training subset, the validation subset, the selected field set, and the k candidate fields, determine the reconstruction loss corresponding to each of the k candidate fields.
Here, the reconstruction loss characterizes the effect of performing anomaly detection on the validation subset using the corresponding candidate field together with the selected field set.
In this embodiment of the present application, the detection device determines the reconstruction loss corresponding to each of the k candidate fields as follows. For a first candidate field, which is any one of the k candidate fields, the detection device adds the first candidate field to the selected field set to obtain a combined field set (the selected field set plus that one candidate field). Based on the training subset and the combined field set, it determines a second detection model corresponding to the first candidate field. Based on the validation subset and the combined field set, it determines the reconstruction loss corresponding to the first candidate field by means of that second detection model. It should be understood that in this process the detection device obtains k combined field sets in one-to-one correspondence with the k candidate fields, and k second detection models in one-to-one correspondence with the k candidate fields.
In short, after adding any one of the k candidate fields to the selected field set, the detection device trains a second detection model on the training subset using the selected field set with that one candidate field added, and then uses the validation subset and the second detection model to assess how much the anomaly detection effect improves once that candidate field has been added to the selected field set.
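The per-candidate evaluation loop can be sketched as below. The two callables `train_model` and `validation_loss` are hypothetical placeholders for training the second detection model and for computing its reconstruction loss on the validation subset; the application does not define such an interface.

```python
def evaluate_candidates(selected_fields, candidates, train_model, validation_loss):
    """Return a dict mapping each candidate field to its reconstruction loss.

    train_model(fields) -> model trained on the training subset for `fields`
    validation_loss(model, fields) -> reconstruction loss on the validation subset
    Both callables are assumptions supplied by the caller.
    """
    losses = {}
    for field in candidates:
        combined = list(selected_fields) + [field]   # selected set plus one candidate
        model = train_model(combined)                # one second detection model each
        losses[field] = validation_loss(model, combined)
    return losses


# Toy stand-ins so the sketch runs end to end; they carry no real meaning.
toy_losses = evaluate_candidates(
    ["remote_addr"],
    ["status", "diff_time"],
    train_model=lambda fields: len(fields),
    validation_loss=lambda model, fields: 1.0 / model,
)
```

Each candidate is evaluated against the same selected field set, so the resulting losses are directly comparable across the k candidates.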
Here, the detection device determines the reference statistical features of the training subset based on the combined field set, that is, the selected field set with the first candidate field added. The reference statistical features of the training subset comprise the statistics of the data, in the training subset, of all fields in that combined field set. The detection device trains an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first candidate field.
The detection device determines the reference statistical features of the validation subset based on the combined field set (the selected field set with the first candidate field added); these comprise the statistics of the data, in the validation subset, of all fields in that combined field set. The detection device inputs the reference statistical features of the validation subset into the second detection model corresponding to the first candidate field to obtain the reference reconstruction features of the validation subset, which comprise the reconstructed statistics of the data, in the validation subset, of all fields in that combined field set. Based on the reference statistical features and the reference reconstruction features of the validation subset, the detection device determines the reconstruction loss corresponding to the first candidate field. This reconstruction loss also characterizes how well the reference statistical features of the validation subset are reconstructed after the first candidate field is added to the selected field set.
As an example, the detection device determines the reconstruction loss of the first candidate field according to formula (2). Formula (2) expresses the reconstruction loss corresponding to the first candidate field in terms of the reference statistical features X of the validation subset and the corresponding reference reconstruction features. In formula (2), n denotes the total number of validation samples in the validation subset, m denotes the total dimension of the statistics of the data, in each validation sample, of the first candidate field and of all selected fields in the selected field set, d denotes the d-th of those dimensions, and i indexes the single samples in the validation subset. For the i-th single sample in the validation subset, the formula compares the d-th reference statistic in its reference statistical features (the input statistic of the second detection model) with the d-th reconstructed statistic in its reference reconstruction features (the output statistic of the second detection model).
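Formula (2) is not reproduced in this text, so its exact form is unknown. The sketch below assumes a mean squared error between the input statistics and the reconstructed statistics, averaged over all samples and dimensions; this is a common choice for reconstruction-based detectors but is an assumption here, as is the function name.

```python
def reconstruction_loss(reference, reconstructed):
    """Mean squared error between two equally shaped lists of rows.

    reference[i][d]      -> d-th input statistic of the i-th single sample
    reconstructed[i][d]  -> d-th output statistic of the i-th single sample
    Assumes non-empty inputs of matching shape.
    """
    total = 0.0
    count = 0
    for x_row, y_row in zip(reference, reconstructed):
        for x, y in zip(x_row, y_row):
            total += (x - y) ** 2
            count += 1
    return total / count


loss = reconstruction_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 2.0]])
```

A smaller value means the second detection model reproduces the validation statistics more closely.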
Optionally, the detection device normalizes the reference statistical features and the reference reconstruction features of the validation subset, and determines the reconstruction loss corresponding to the first candidate field from the normalized features. For example, the detection device uses normalization to map the value ranges of the reference statistics and the reconstructed statistics to the interval [-1, 1]. For a statistic of a single sample at the single-sample time granularity (such as a one-hour statistic), normalization is performed over all statistics at the single-sample time granularity of the validation sample to which that single sample belongs (such as the hourly statistics for each of the 24 hours). For a statistic of a validation sample at the anomaly detection time granularity (such as a 24-hour statistic), normalization is performed over the statistics at the anomaly detection time granularity of all validation samples (such as the 24-hour statistics for each of 7 days).
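The text does not say which normalization maps the values to [-1, 1]. A minimal sketch, assuming plain min-max scaling over the group of statistics being normalized together (for example the 24 hourly values of one validation sample), could look like this; the function name is hypothetical.

```python
def scale_to_unit_interval(values):
    """Min-max scale a group of values to [-1, 1].

    Assumption: a constant group is mapped to 0.0, since the text gives no rule
    for that case.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


scaled = scale_to_unit_interval([0, 5, 10])
```

Scaling hourly statistics within their own validation sample, and daily statistics across all validation samples, matches the two grouping rules described in the paragraph above.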
The second detection model is an anomaly detection model. To keep the anomaly detection effect of the second detection model used in field screening reasonably consistent with that of the first detection model used later for anomaly detection on the test set, the second detection model has roughly the same structure as the first detection model. The differences are that the dimension of the input layer may differ and that the model parameters may differ, as follows. For example, the input-layer dimension of the second detection model matches the total number of fields in the selected field set after the first candidate field has been added, whereas the input-layer dimension of the first detection model matches the number of fields in the target attribute fields. As another example, all hyperparameters of the second detection model may be preset, whereas some hyperparameters of the first detection model (such as the first hyperparameter described below) are to be tuned.
Step 6025: Based on the mutual information and the reconstruction loss corresponding to each of the k candidate fields, select one candidate field from the k candidate fields and determine the field score of the selected candidate field.
After obtaining the mutual information and the reconstruction loss corresponding to each of the k candidate fields, the detection device can combine the two to determine the field score of a candidate field.
In this embodiment of the present application, the detection device selects one candidate field from the k candidate fields and determines its field score as follows: based on the mutual information and the reconstruction loss corresponding to each of the k candidate fields, it determines the comprehensive score of each of the k candidate fields, takes the candidate field with the highest comprehensive score as the selected candidate field, and takes that comprehensive score as the field score of the selected candidate field. It should be understood that in this process the detection device obtains k comprehensive scores in one-to-one correspondence with the k candidate fields.
Greater mutual information between fields indicates a stronger correlation between them. A greater reconstruction loss corresponding to a field (that is, a smaller drop in reconstruction loss after the field is added to the selected fields, relative to the loss obtained without it) indicates that the field contributes less to the anomaly detection performance. Therefore, to remove as far as possible the candidate fields that are strongly correlated with the selected fields and contribute little to anomaly detection, the detection device may process the mutual information and the reconstruction loss of each of the k candidate fields by weighted summation or in another manner to obtain the comprehensive score of each of the k candidate fields.
Optionally, the detection device determines a mutual information score for each of the k candidate fields based on the corresponding mutual information. A candidate field with a higher mutual information score brings a greater improvement in anomaly detection performance. That is, mutual information is negatively correlated with the anomaly detection performance, whereas the mutual information score is positively correlated with it. The detection device then determines the comprehensive score of each of the k candidate fields based on its mutual information score and reconstruction loss.
As an example, the detection device determines the mutual information score of a candidate field according to formula (3) and the comprehensive score of the candidate field according to formula (4). In formulas (3) and (4), field denotes a candidate field, I denotes the mutual information corresponding to the candidate field, Score(I) denotes the mutual information score of the candidate field, Score(Loss) denotes the reconstruction loss corresponding to the candidate field, and Score(field) denotes the field score of the candidate field. β is an adjustable parameter whose default value may be 1 or another value.
Score(I)=1-(2×sigmoid(I)-1)                (3)
It can be seen from formula (3) that the detection device normalizes the mutual information corresponding to a candidate field through the sigmoid function, which maps the mutual information into the interval (0,1). Because mutual information is non-negative, sigmoid(I) lies in [0.5,1), so 2×sigmoid(I)-1 lies in [0,1) and the mutual information score given by formula (3) lies in (0,1]. A larger mutual information score for a candidate field indicates that the field improves the anomaly detection performance to a greater degree.
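Formula (3) can be checked numerically with a short sketch. The function names are illustrative only; the body follows formula (3) as written, and the sample mutual information values are arbitrary.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def mutual_information_score(mi: float) -> float:
    """Formula (3): Score(I) = 1 - (2 * sigmoid(I) - 1)."""
    return 1.0 - (2.0 * sigmoid(mi) - 1.0)


# The score falls as mutual information grows, so weakly correlated fields score higher.
for mi in (0.0, 0.5, 2.0):
    print(mi, mutual_information_score(mi))
```

A mutual information of zero gives the maximum score, and the score approaches zero as the mutual information grows.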
Step 6026: Move the selected candidate field from the candidate field set into the selected field set and return to step 6022; when the candidate field set is empty, the field scores corresponding to the n candidate fields are obtained.
That is, after the detection device moves the selected candidate field from the candidate field set into the selected field set, it returns to step 6022 if the candidate field set is not empty; if the candidate field set is empty, the detection device has obtained the field score of each of the n candidate fields.
As the above description of steps 6021 to 6026 shows, the detection device in essence determines the field scores of the n candidate fields through multiple rounds of iteration. Each round yields the field score of one candidate field, and n rounds yield the field scores of all n candidate fields. Put another way, the detection device determines the target attribute fields from the candidate attribute fields, based on the sample set and the candidate attribute fields, as follows:
The detection device determines the statistics of the data of the m selected fields and of the n candidate fields in the training subset, and determines the target attribute fields by performing multiple rounds of iteration.
The detection device first forms the selected field set from the m selected fields and the candidate field set from the n candidate fields. In the i-th round of iteration, where i is an integer greater than 0 and not greater than n, the detection device determines the mutual information corresponding to each candidate field in the candidate field set, based on the statistics of the data of all fields in the selected field set and the candidate field set in the training subset. In ascending order of mutual information, the detection device selects k candidate fields from the candidate field set and adds each of them separately to the selected field set, obtaining k trial field sets in one-to-one correspondence with the k candidate fields; each trial field set consists of the selected field set plus one candidate field.
For a first trial field set, which is any one of the k trial field sets, the detection device obtains the statistics of the data of all fields in the first trial field set in the training subset and determines the statistics of the data of all fields in the first trial field set in the verification subset. The detection device trains an initial detection model with the training-subset statistics to obtain the second detection model corresponding to the first candidate field, where the first candidate field is the one of the k candidate fields that corresponds to the first trial field set. The detection device inputs the verification-subset statistics into the second detection model to obtain reconstructed statistics of the data of all fields in the first trial field set in the verification subset, and determines the reconstruction loss corresponding to the first candidate field based on the verification-subset statistics and the reconstructed statistics.
Based on the mutual information and the reconstruction loss corresponding to each of the k candidate fields, the detection device determines the comprehensive score of each of the k candidate fields. It selects the candidate field with the highest comprehensive score, takes that comprehensive score as the field score of the selected candidate field, and moves the selected candidate field from the candidate field set into the selected field set, which gives the selected field set and the candidate field set for the next round. After the last round, the detection device has the field score of each of the n candidate fields. The detection device then determines p candidate fields from the n candidate fields according to these field scores and determines the m selected fields and the p candidate fields as the target attribute fields.
In short, each round of iteration traverses all current candidate fields, computes the mutual information between each candidate field and all currently selected fields, and picks the k candidate fields with the smallest mutual information. Each of these k candidate fields is then added separately to the currently selected fields to obtain k trial field sets. The training subset and the verification subset are used to evaluate the reconstruction performance of each of the k trial field sets, which gives the reconstruction loss corresponding to each of the k candidate fields. Based on the mutual information and the reconstruction loss corresponding to each of the k candidate fields, the comprehensive score of each of them is determined. The best candidate field of the round is chosen from the k candidate fields by comprehensive score and moved from the set of current candidate fields into the set of currently selected fields. This process is repeated until the set of current candidate fields is empty. The target attribute fields are then determined according to the comprehensive scores of the candidate fields selected in each round.
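The iteration summarized above can be sketched as a greedy loop. This is a sketch under assumptions rather than the embodiment itself: the mutual information estimate, the train-and-validate step, and the rule that combines them into a comprehensive score are passed in as hypothetical callables (`mutual_info`, `reconstruction_loss`, `comprehensive_score`), and the toy values at the end are invented for illustration. The special handling of an empty selected field set is not covered.

```python
from typing import Callable, Dict, List, Sequence


def rank_candidate_fields(
    selected: Sequence[str],
    candidates: Sequence[str],
    k: int,
    mutual_info: Callable[[str, List[str]], float],
    reconstruction_loss: Callable[[List[str]], float],
    comprehensive_score: Callable[[float, float], float],
) -> Dict[str, float]:
    """Return the field score of every candidate field, in the order the fields were selected."""
    selected_set = list(selected)
    pending = list(candidates)
    field_scores: Dict[str, float] = {}
    while pending:
        # Mutual information between each pending field and the currently selected fields.
        mi = {f: mutual_info(f, selected_set) for f in pending}
        # Shortlist the k fields with the smallest mutual information.
        shortlist = sorted(pending, key=lambda f: mi[f])[:k]
        # One train/validate run per shortlisted field, on the selected fields plus that field.
        scores = {
            f: comprehensive_score(mi[f], reconstruction_loss(selected_set + [f]))
            for f in shortlist
        }
        best = max(shortlist, key=lambda f: scores[f])
        field_scores[best] = scores[best]
        pending.remove(best)
        selected_set.append(best)
    return field_scores


# Toy stand-ins: fixed mutual information and loss per field, and a score that penalizes both.
toy_mi = {"a": 0.1, "b": 0.9, "c": 0.4}
toy_loss = {"a": 0.3, "b": 0.1, "c": 0.2}
result = rank_candidate_fields(
    selected=["user_id"],
    candidates=["a", "b", "c"],
    k=2,
    mutual_info=lambda f, sel: toy_mi[f],
    reconstruction_loss=lambda fields: toy_loss[fields[-1]],
    comprehensive_score=lambda mi, loss: -(mi + loss),
)
```

Shortlisting by mutual information before training keeps the number of model training runs per round at k rather than at the number of remaining candidate fields.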
If m is 0, that is, the candidate attribute fields contain no selected field because the cloud platform personnel did not specify any selected field this time, the detection device forms the (empty) selected field set from the m selected fields and forms the candidate field set from the n candidate fields. In the first round of iteration, the mutual information corresponding to each candidate field in the candidate field set is then determined, based on the training subset, the selected field set, and the candidate field set, as follows. Each of the n candidate fields is taken in turn as a reference candidate field. Once the first reference candidate field is determined, the other n-1 candidate fields are treated as a hypothetical selected field set, and the mutual information corresponding to the first reference candidate field is determined based on the training subset, the hypothetical selected field set, and the first reference candidate field. After the second reference candidate field is determined, its mutual information is determined in the same way. By analogy, the first round of iteration yields the mutual information corresponding to each of the n reference candidate fields, which is the mutual information corresponding to each of the n candidate fields in the candidate field set.
The detection device then selects the k candidate fields with the smallest mutual information from the candidate field set and determines the reconstruction loss corresponding to each of them based on the training subset, the verification subset, the hypothetical selected field set, and the k candidate fields. Based on the mutual information and the reconstruction loss corresponding to each of the k candidate fields, the detection device selects one candidate field from the k candidate fields and determines its field score. The detection device moves the selected candidate field from the candidate field set into the selected field set, which gives the first selected field in the selected field set. In the second round of iteration, the detection device determines the mutual information corresponding to each candidate field in the candidate field set based on the training subset, the selected field set, and the candidate field set, and continues in the manner described above until the candidate field set is empty, at which point the field scores corresponding to the n candidate fields are obtained.
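The first-round construction for m = 0 amounts to a leave-one-out pairing of each reference candidate field with the remaining n-1 fields. A minimal sketch, in which the function name `hypothetical_selected_sets` and the field names are hypothetical:

```python
from typing import Dict, List, Sequence


def hypothetical_selected_sets(candidates: Sequence[str]) -> Dict[str, List[str]]:
    """Map each reference candidate field to the other n-1 fields, used as its hypothetical selected set."""
    return {ref: [f for f in candidates if f != ref] for ref in candidates}


pairs = hypothetical_selected_sets(["status", "region", "api_name"])
```

Each entry of `pairs` supplies the two inputs needed to estimate the mutual information of one reference candidate field in the first round.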
After the field scores of the n candidate fields are obtained, p candidate fields can be selected from the n candidate fields either by manual decision or by automatic decision, and the detection device then determines the m selected fields and the p candidate fields as the target attribute fields.
In one embodiment in which the p candidate fields are selected by manual decision, the detection device sends field score information that includes the field score of each of the n candidate fields. The detection device receives field decision information that indicates p candidate fields and, based on the field decision information, determines the p candidate fields from the n candidate fields. In other words, the detection device reports the comprehensive score of each candidate field to the relevant personnel, who make the final decision based on these scores.
Optionally, the field score information further includes the mutual information (and/or the mutual information score) and the reconstruction loss corresponding to each candidate field, as obtained while determining the field scores. In other words, the detection device reports the mutual information, the reconstruction loss, and the comprehensive score of each candidate field to the relevant personnel, who make the final decision by considering all three.
In one embodiment in which the target attribute fields are determined by automatic decision, the detection device determines p candidate fields from the n candidate fields according to a preset number of fields and the field score of each of the n candidate fields. The preset number of fields may denote the total number of attribute fields required for anomaly detection on the test set, in which case p+m equals the preset number of fields. Alternatively, it may denote the number of candidate fields to be selected from the n candidate fields, in which case p equals the preset number of fields.
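A minimal sketch of the automatic decision follows. It assumes that the p candidate fields with the highest field scores are the ones chosen, which the text does not state explicitly; the function name, the `counts_total` flag, and the sample scores are hypothetical.

```python
from typing import Dict, List


def select_fields_automatically(
    field_scores: Dict[str, float],
    m: int,
    preset_count: int,
    counts_total: bool = True,
) -> List[str]:
    """Pick p candidate fields by descending field score (assumed rule).

    counts_total=True:  preset_count is the total number of attribute fields, so p = preset_count - m.
    counts_total=False: preset_count is the number of candidate fields to pick, so p = preset_count.
    """
    p = preset_count - m if counts_total else preset_count
    p = max(0, min(p, len(field_scores)))
    ranked = sorted(field_scores, key=field_scores.get, reverse=True)
    return ranked[:p]


scores = {"a": 0.9, "b": 0.2, "c": 0.5}
chosen_total = select_fields_automatically(scores, m=2, preset_count=4)
chosen_direct = select_fields_automatically(scores, m=2, preset_count=1, counts_total=False)
```

The two calls correspond to the two interpretations of the preset number of fields given in the paragraph above.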
The statistics corresponding to the attribute fields in the embodiments of the present application are described next.
As noted above, the configuration parameter also indicates the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different kinds of statistics. For example, each attribute field is a first-type, second-type, third-type, or fourth-type attribute, and the kinds of statistics corresponding to these four categories differ. For the characteristics of the four categories of attribute fields, refer to the description of Table 1 in step 601.
Table 3 shows the relationship between attribute fields of different categories and statistics provided by an embodiment of the present application. As shown in Table 3, the statistics corresponding to each category of attribute field include basic statistics, first-type statistics, and second-type statistics. The first-type statistics corresponding to attribute fields of the first type include the maximum, the mean, the numerical standard deviation, the information entropy, and the number of distinct field values. The first-type statistics corresponding to attribute fields of the second type include the maximum, the mean, the numerical standard deviation, the proportional standard deviation, the information entropy, and the number of distinct field values. The first-type statistics corresponding to attribute fields of the third type include the same six statistics as for the second type. The first-type statistics corresponding to attribute fields of the fourth type include the mean and the numerical standard deviation. In addition, when the anomaly detection task is time-series related, the statistics corresponding to attribute fields of the second and third types also include the user access profile and/or the access profile similarity. In Table 3, '√' marks a statistic that needs to be calculated, '×' marks a statistic that does not need to be calculated, and '*' marks a statistic that needs to be calculated when the anomaly detection task is time-series related.
Table 3
The basic statistics are determined by counting the logs included in the corresponding sample. Table 4 gives the meanings of the basic statistics corresponding to attribute fields of different categories provided by an embodiment of the present application, taking a single-sample time granularity of 1 hour and an anomaly-detection time granularity of 24 hours as an example. The first type of attribute has two kinds of basic statistics, corresponding to the case in which services are not differentiated and the case in which services are differentiated, respectively. Each of the other three types of attribute has one kind of basic statistic.
Table 4
As Table 4 shows, the basic statistics corresponding to the first type of attribute include a first basic statistic and a second basic statistic. The first basic statistic includes the number of logs of each cloud service present in each single sample included in the corresponding sample, and the number of logs of each cloud service present in the multiple single samples of the corresponding sample taken as a whole. The second basic statistic includes the number of logs of each application of each cloud service present in each single sample included in the corresponding sample, and the number of logs of each application of each cloud service present in the multiple single samples taken as a whole. The basic statistics corresponding to the second, third, and fourth types of attribute include a third basic statistic. The third basic statistic includes, for each value of an attribute field of the corresponding category, the number of logs in each cloud service present in each single sample included in the corresponding sample, and the number of logs in each cloud service present in the multiple single samples taken as a whole.
Take user 1 as the detection object. User 1 operates two services within one day: the elastic compute service (ECS) and the object storage service (OBS). ECS provides 2 applications and OBS provides 3 applications. For a first-type attribute with services differentiated, the detection device counts the logs generated by user 1 operating each application of ECS in each hour of the day and the logs generated by operating each application of OBS in each hour, which gives 24*(2+3) dimensions of statistics. It also counts the logs generated by user 1 operating each application of ECS over the whole day and the logs generated by operating each application of OBS over the whole day, which gives (2+3) dimensions of statistics. For a first-type attribute with services differentiated, the detection device therefore obtains (24+1)*5 dimensions of basic statistics in total.
For a first-type attribute without differentiating services, the detection device counts the logs generated by user 1 operating ECS in each hour of the day and the logs generated by operating OBS in each hour, which gives 24*2 dimensions of statistics. It also counts the logs generated by user 1 operating ECS over the whole day and the logs generated by operating OBS over the whole day, which gives 1*2 dimensions of statistics. For a first-type attribute without differentiating services, the detection device therefore obtains (24+1)*2 dimensions of basic statistics in total.
As another example, suppose the log data generated by user 1 while using cloud services includes a status field whose value range is [0,1,2]; the status field is a second-type attribute. The detection device counts, for each value of the status field, the logs under each cloud service in the log data generated by user 1 in each hour of the day, which gives 24*3 dimensions of statistics. It also counts, for each value of the status field, the logs under each cloud service in the log data generated over the whole day, which gives 1*3 dimensions of statistics. For each field that is a second-type attribute, the detection device therefore obtains (24+1)*3 dimensions of basic statistics in total, where '3' is the number of distinct values of the status field.
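The dimension counts in the three examples above can be reproduced with a small counting sketch. The log record layout, the application names, and the helper `basic_statistics` are assumptions made for illustration; the status counts here are taken over all services so that the result matches the (24+1)*3 dimension stated above.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical log record: (hour of day, cloud service, application, status value).
Log = Tuple[int, str, str, int]


def basic_statistics(
    logs: Sequence[Log],
    keys: Sequence[Hashable],
    key_of: Callable[[Log], Hashable],
    hours: int = 24,
) -> List[int]:
    """Return per-hour counts for every key, followed by whole-day counts for every key."""
    hourly = [Counter() for _ in range(hours)]
    for log in logs:
        hourly[log[0]][key_of(log)] += 1
    per_hour = [hourly[h][key] for h in range(hours) for key in keys]
    per_day = [sum(hourly[h][key] for h in range(hours)) for key in keys]
    return per_hour + per_day


SERVICES = ["ECS", "OBS"]
APPLICATIONS = [("ECS", "app1"), ("ECS", "app2"),
                ("OBS", "app1"), ("OBS", "app2"), ("OBS", "app3")]
STATUS_VALUES = [0, 1, 2]

logs = [(0, "ECS", "app1", 0), (0, "OBS", "app3", 1),
        (5, "ECS", "app1", 0), (23, "OBS", "app2", 2)]

per_application = basic_statistics(logs, APPLICATIONS, lambda log: (log[1], log[2]))
per_service = basic_statistics(logs, SERVICES, lambda log: log[1])
per_status = basic_statistics(logs, STATUS_VALUES, lambda log: log[3])
print(len(per_application), len(per_service), len(per_status))  # prints: 125 50 75
```

The three vector lengths are (24+1)*5, (24+1)*2, and (24+1)*3, matching the services-differentiated case, the services-not-differentiated case, and the status field example.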
After obtaining the basic statistics, the detection device determines the first-type statistics from them; that is, the first-type statistics are derived from the basic statistics. Table 5 gives the calculation methods of the first-type statistics corresponding to attribute fields of different categories provided by an embodiment of the present application. Taking a single-sample time granularity of 1 hour and an anomaly-detection time granularity of 24 hours as an example, and building on Table 4, the first-type statistics corresponding to attribute fields of different categories are calculated as shown in Table 5.
Table 5
Table 6 gives the meanings of the first-type statistics corresponding to attribute fields of different categories provided by an embodiment of the present application, building on Table 4. Among the first-type statistics, the maximum characterizes the upper bound of each statistic included in the basic statistics of the corresponding sample; the mean characterizes its average level; the numerical standard deviation characterizes its degree of dispersion; the proportional standard deviation characterizes its imbalance; the information entropy characterizes its degree of disorder; and the number of distinct field values characterizes the number of possible values of each attribute field in the corresponding sample.
Table 6
Take user 1 as the detection object again. User 1 operates two services within one day, ECS and OBS, where ECS provides 2 applications and OBS provides 3 applications. For a first-type attribute with services differentiated, the detection device counts the logs generated by user 1 operating each of the 5 applications in each hour of the day, which gives the 5 log counts included in the basic statistics of each hour, and calculates the maximum, mean, numerical standard deviation, and information entropy of these 5 log counts to obtain one part of the first-type statistics of each hour. Likewise, the detection device counts the logs generated by user 1 operating each of the 5 applications over the whole day, which gives the 5 log counts included in the basic statistics of the day, and calculates the maximum, mean, numerical standard deviation, and information entropy of these 5 log counts to obtain one part of the first-type statistics of the day. For a first-type attribute without differentiating services, the detection device counts the logs generated by user 1 operating ECS and the logs generated by operating OBS in each hour of the day, which gives the other 2 log counts included in the basic statistics of each hour, and calculates the maximum, mean, numerical standard deviation, and information entropy of these 2 log counts to obtain the other part of the first-type statistics of each hour. Likewise, the detection device counts the logs generated by user 1 operating ECS and the logs generated by operating OBS over the whole day, which gives the other 2 log counts included in the basic statistics of the day, and calculates the maximum, mean, numerical standard deviation, and information entropy of these 2 log counts to obtain the other part of the first-type statistics of the day.
As another example, suppose the log data generated by user 1 while using cloud services includes data of a status field, the status field takes values in [0, 1, 2], and the status field is a second-type attribute. The detection device counts, for each hour of the day, the number of logs for each value of the status field under each cloud service in the log data generated by user 1. This yields 3 log counts in the base statistics for each hour, and the detection device computes the maximum, mean, numerical standard deviation, proportional standard deviation and information entropy of these 3 log counts to obtain part of the first-type statistics for that hour. The detection device also counts the number of logs for each value of the status field under each cloud service in the log data generated over the whole day, obtains 3 log counts in the base statistics for the day, and computes their maximum, mean, numerical standard deviation, proportional standard deviation and information entropy to obtain part of the first-type statistics for the day.
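The status-field example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `first_type_stats`, the sample log values, the use of the population standard deviation, the base-2 entropy, and the reading of "proportional standard deviation" as the standard deviation of count proportions are all assumptions, not details given in the source.

```python
import math
from collections import Counter

def first_type_stats(counts):
    """Summarise a list of log counts (the base statistics of one sample)."""
    n = len(counts)
    total = sum(counts)
    mean = total / n
    # Assumption: population standard deviation of the raw counts.
    num_std = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    # Assumption: proportions are counts divided by the total count.
    props = [c / total for c in counts] if total else [0.0] * n
    prop_mean = sum(props) / n
    prop_std = math.sqrt(sum((p - prop_mean) ** 2 for p in props) / n)
    # Shannon entropy (base 2) of the proportions; zero proportions are skipped.
    entropy = -sum(p * math.log2(p) for p in props if p > 0)
    return {
        "max": max(counts),
        "mean": mean,
        "num_std": num_std,
        "prop_std": prop_std,
        "entropy": entropy,
        # Assumption: number of distinct values actually observed.
        "n_values": sum(1 for c in counts if c > 0),
    }

# Hypothetical status values of the logs produced by user 1 in one hour.
status_values = [0, 0, 1, 2, 0, 1, 0, 0]
by_status = Counter(status_values)
counts = [by_status.get(v, 0) for v in (0, 1, 2)]  # one count per status value
stats = first_type_stats(counts)
```

The same function could be applied once per hour and once per day, since only the list of counts changes between the two granularities.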
After obtaining the first-type statistics, the detection device determines the second-type statistics from them. The second-type statistics characterize the difference between each statistic in the first-type statistics of the corresponding sample and the same statistic over all samples. Put simply, a second-type statistic reflects the relative standardized distance of a single sample from the mean of the sample population, that is, how the sample differs in its first-type statistics from the sample population within the historical observation period.
As an example, the detection device computes the second-type statistics as z-scores according to formula (5):

$$z_i = \frac{x_i - \mu}{\sigma}, \quad i = 1, 2, \ldots, n_s \tag{5}$$

In formula (5), $x_i$ denotes the i-th statistic in the first-type statistics for each day or each hour, $\mu$ and $\sigma$ denote the mean and numerical standard deviation of $x_i$, and $z_i$ denotes the i-th statistic in the second-type statistics for each day or each hour, that is, the second-type statistic corresponding to the i-th first-type statistic. $n_s$ denotes the total number of statistics included in the first-type statistics; for example, according to Table 5, n_s = (1*N)*(5 + num(2nd_attr)*6 + num(3nd_attr)*6 + num(4nd_attr)*2).
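A minimal sketch of the z-score step follows. The helper names, the toy history, the population standard deviation and the small `eps` guard against constant statistics are assumptions made for illustration; the source only states that a z-score is computed per first-type statistic.

```python
import math

def fit_mean_std(history):
    """Per-statistic mean and population std over historical samples."""
    n = len(history)
    dims = len(history[0])
    means = [sum(row[i] for row in history) / n for i in range(dims)]
    stds = [
        math.sqrt(sum((row[i] - means[i]) ** 2 for row in history) / n)
        for i in range(dims)
    ]
    return means, stds

def z_scores(sample, means, stds, eps=1e-12):
    """Second-type statistics: one z-score per first-type statistic."""
    # eps avoids division by zero when a statistic never varies (assumption).
    return [(x - m) / (s + eps) for x, m, s in zip(sample, means, stds)]

# Hypothetical first-type statistics of three historical samples (two statistics each).
history = [[10.0, 2.0], [12.0, 4.0], [14.0, 6.0]]
means, stds = fit_mean_std(history)
z = z_scores([16.0, 4.0], means, stds)
```

Here the first statistic of the new sample lies well above the historical mean while the second sits exactly on it, so only the first z-score is far from zero.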
As described above, the configuration parameters also indicate whether the anomaly detection task is time-series related. When it is, the statistics corresponding to attribute fields of the second-type and third-type attributes further include a user access profile and/or an access profile similarity. The user access profile characterizes one or more of the following for the multiple single samples included in the corresponding sample: the change in the number of logs, the change in the information entropy of the second-type attributes, and the change in the information entropy of the third-type attributes. The access profile similarity characterizes the similarity between the user access profile of the corresponding sample and that of a reference sample. The reference sample may be the first sample in the sample set, for example, the first training sample in the training subset.
The change in the number of logs includes how the number of logs of each cloud service changes over time, and/or how the total number of logs of all cloud services changes over time. The change in the information entropy of the second-type attributes includes how the information entropy of each value of each second-type attribute field under each cloud service changes over time, and/or how the information entropy of each second-type attribute under all cloud services changes over time. The change in the information entropy of the third-type attributes includes how the information entropy of each third-type attribute in each cloud service changes over time, and/or how the information entropy of each third-type attribute in all cloud services changes over time.
For example, the detection device records how the number of logs generated by user 1 operating a given service changes over the 24 hours of a day, which yields a 24-dimensional user access profile.
The detection device computes the access profile similarity as a cosine similarity according to formula (6):

$$\mathrm{sim}(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}} \tag{6}$$

In formula (6), vectors A and B denote two user access profiles, and i is the element index within the vectors.
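The cosine similarity between two access profiles can be sketched as below. The 24-dimensional profiles are invented hourly log counts, and returning 0.0 for an all-zero profile is an assumption of this sketch rather than something the source specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equally long access profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero profile counts as dissimilar
    return dot / (norm_a * norm_b)

# Hypothetical 24-dimensional profiles: hourly log counts for one service.
reference = [0] * 8 + [5, 9, 12, 10, 6, 8, 11, 9, 7, 3] + [0] * 6
same_shape = [2 * v for v in reference]      # same pattern, doubled volume
night_shift = [4] * 8 + [0] * 10 + [4] * 6   # activity only outside working hours

sim_same = cosine_similarity(reference, same_shape)
sim_night = cosine_similarity(reference, night_shift)
```

Because cosine similarity ignores overall magnitude, a profile with the same hourly shape but more traffic stays close to 1, while activity concentrated in different hours drives the value toward 0.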
Using formula (6), the detection device computes the access profile similarity between the user access profile of each training sample in the training subset and that of the reference sample, which yields as many access profile similarities as there are training samples. In the same way, the detection device uses formula (6) to compute the access profile similarity between the user access profile of each validation sample in the validation subset and that of the reference sample, and between the user access profile of each sample under test in the test set and that of the reference sample.
Optionally, the detection device determines the mean and numerical standard deviation of the access profile similarities obtained from the training subset. Based on this mean and standard deviation, it normalizes the access profile similarities obtained from the training subset, from the validation subset and from the test set, to obtain the normalized access profile similarities of the training subset, the validation subset and the test set respectively. The detection device subsequently uses the normalized access profile similarity as one statistic of the corresponding sample.
As an example, the detection device normalizes the access profile similarity according to formula (7):

$$\widehat{\mathrm{sim}}(g) = \frac{\mathrm{sim}(g) - \mu_{sim}}{\sigma_{sim}} \tag{7}$$

In formula (7), $\mathrm{sim}(g)$ denotes the access profile similarity of a sample (a sample under test, a validation sample or a training sample) before normalization, $\widehat{\mathrm{sim}}(g)$ denotes the normalized access profile similarity of that sample, and $\mu_{sim}$ and $\sigma_{sim}$ denote the mean and numerical standard deviation of the access profile similarities obtained from the training subset.
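The key point of this normalization is that the mean and standard deviation are fitted on the training subset only and then reused for the validation subset and the test set. A minimal sketch, with invented similarity values and an assumed population standard deviation and `eps` guard:

```python
import statistics

def fit_similarity_scaler(train_sims):
    """Mean and population std of the training-subset similarities."""
    return statistics.fmean(train_sims), statistics.pstdev(train_sims)

def normalize(sim, mu, sigma, eps=1e-12):
    # eps avoids division by zero if all training similarities are equal (assumption).
    return (sim - mu) / (sigma + eps)

# Hypothetical similarities to the reference sample.
train_sims = [0.90, 0.95, 1.00]
mu, sigma = fit_similarity_scaler(train_sims)

train_norm = [normalize(s, mu, sigma) for s in train_sims]
val_norm = [normalize(s, mu, sigma) for s in [0.93, 0.97]]
test_norm = [normalize(s, mu, sigma) for s in [0.40]]  # reuses training mu and sigma
```

A test sample whose profile barely resembles the reference ends up many standard deviations below zero, which makes it stand out as an input feature.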
The statistics corresponding to the attribute fields in the embodiments of the present application have been described above. It should be understood that the above is not intended to limit the embodiments of the present application; the statistics to be computed and the way each statistic is computed may be the same or different across embodiments.
Step 603: Tune the first hyperparameters of the first detection model based on the sample set and the target attribute fields.
After determining the target attribute fields, the detection device determines the statistical features of the sample set based on them. The statistical features of the sample set include the statistics of the data of the target attribute fields in the sample set; that is, the detection device takes these statistics as the statistical features of the sample set. The detection device then tunes the first hyperparameters of the first detection model based on the statistical features of the sample set, to obtain the parameter-tuned first detection model.
In embodiments where the sample set includes a training subset and a validation subset, the statistical features of the sample set include those of the training subset and those of the validation subset. The detection device takes the statistics of the data of the target attribute fields in the training subset as the statistical features of the training subset, and the statistics of the data of the target attribute fields in the validation subset as the statistical features of the validation subset. The detection device tunes the first hyperparameters of the first detection model based on the statistical features of the training subset and of the validation subset.
The first hyperparameters include the learning rate, the number of training epochs and the hidden layer dimensions. Each of these has multiple candidate values. The detection device sets the first hyperparameters of the first detection model to each combination of candidate values in turn, trains the model so configured on the statistical features of the training subset, and evaluates the anomaly detection performance of the trained model on the statistical features of the validation subset. After obtaining the anomaly detection performance for every combination, the detection device takes the combination with the best performance as the tuned first hyperparameters, which become the first hyperparameters of the parameter-tuned first detection model. Put simply, the detection device performs a grid search: it traverses all possible combinations of the first hyperparameters to find the optimal combination within their search space. This search space is the Cartesian product of the search spaces of the three hyperparameters: learning rate, number of training epochs and hidden layer dimensions.
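The exhaustive search over the Cartesian product can be sketched with `itertools.product`. The callback `train_and_score` is a hypothetical stand-in for "train on the training subset, score on the validation subset"; the toy scoring function and candidate values below are invented so that the example runs without a real model.

```python
from itertools import product

def grid_search(learning_rates, epoch_counts, hidden_dims, train_and_score):
    """Try every combination and keep the one with the highest validation score."""
    best_score, best_combo = float("-inf"), None
    for lr, epochs, dims in product(learning_rates, epoch_counts, hidden_dims):
        score = train_and_score(lr, epochs, dims)
        if score > best_score:
            best_score, best_combo = score, (lr, epochs, dims)
    return best_combo, best_score

def toy_score(lr, epochs, dims):
    """Hypothetical scoring only; a real callback would train and validate a model."""
    _, n3 = dims
    return -abs(lr - 0.01) * 100 - abs(epochs - 20) / 100 - abs(n3 - 16) / 1000

best, score = grid_search(
    learning_rates=[0.001, 0.01, 0.1],
    epoch_counts=[10, 20, 50],
    hidden_dims=[(128, 8), (128, 16), (128, 32)],  # (N2, N3) pairs
    train_and_score=toy_score,
)
```

With 3 candidates per hyperparameter the loop evaluates 27 combinations, which illustrates why keeping each individual search space small matters for a grid search.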
In this embodiment of the present application, the first detection model includes an input layer, a first hidden layer and a second hidden layer (also called the bottleneck layer). The hidden layer dimensions to be tuned are those of the first and second hidden layers. The dimension of the input layer is determined by the number of fields in the target attribute fields, and the dimensions of the first and second hidden layers are determined by the dimension of the input layer. In other words, the hidden layer dimensions are not set arbitrarily, so their search space is small.
The more fields the target attribute fields include, the higher the dimension of the input layer. The dimension of the input layer also depends on the dimension of the statistics of those fields: the more statistics each field has, the higher the dimension of the input layer.
Let the dimension of the input layer be N1, that of the first hidden layer be N2, and that of the second hidden layer be N3. In some embodiments, N1 satisfies $N = 2^{\mathrm{ceil}(\log_2 N1)}$, where ceil() denotes rounding up; that is, N is the smallest integer power of 2 that is not less than N1. When N lies in a first interval, N2 = N/4 and N3 is one of the integer powers of 2 in the interval [8, N/8]. When N lies in a second interval, N2 = N/2 and N3 is one of the integer powers of 2 in the interval [4, N/4]. All values in the second interval are smaller than those in the first interval.
For example, when N is an integer power of 2 in the first interval [512, 4096], that is, N1 ∈ (256, 4096], N2 = N/4 and the search space of N3 is {8, 16, 32, 64, 128, 256, 512}, limited to values not exceeding N/8. When N is an integer power of 2 in the second interval [64, 256], that is, N1 ∈ (32, 256], N2 = N/2 and the search space of N3 is {4, 8, 16, 32, 64}, limited to values not exceeding N/4.
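The rule for deriving N, N2 and the candidate values of N3 from the input dimension N1 can be sketched as follows. The function name is hypothetical, the two intervals are the example intervals [512, 4096] and [64, 256], and raising an error outside them is a choice of this sketch, since the source does not say what happens there.

```python
def hidden_dim_search_space(n1):
    """Return (N, N2, candidate N3 values) for an input dimension N1."""
    # Smallest power of two not less than n1, i.e. 2**ceil(log2(n1)),
    # computed with integer arithmetic to avoid floating-point rounding.
    n = 1 << (n1 - 1).bit_length()
    if 512 <= n <= 4096:       # first interval from the example
        n2, low, high = n // 4, 8, n // 8
    elif 64 <= n <= 256:       # second interval from the example
        n2, low, high = n // 2, 4, n // 4
    else:
        raise ValueError(f"N={n} is outside the example intervals")
    n3_choices = []
    value = low
    while value <= high:       # powers of two in [low, high]
        n3_choices.append(value)
        value *= 2
    return n, n2, n3_choices

space_300 = hidden_dim_search_space(300)
space_4096 = hidden_dim_search_space(4096)
space_256 = hidden_dim_search_space(256)
```

For an input dimension of 300 this gives N = 512, N2 = 128 and four candidates for N3, so the hidden-dimension part of the grid stays very small.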
Optionally, the learning rate is given by an expression indexed by l [formula not reproduced], where l is an integer in the interval [0, L+1] and L does not exceed a search point threshold. In other words, the search space of the learning rate consists of the values of this expression for l = 0, 1, …, L+1, so there are L+2 search points in total: L search points inside the learning rate optimization interval and two search points on its boundaries. The search point threshold is a value such as 12, 14 or 16.
As for the number of training epochs, more epochs generally bring the model closer to convergence on the training subset, but they also make the trained model more prone to overfitting. To let the model converge while keeping overfitting as low as possible, the embodiments of the present application use the validation subset to test how much the first detection model, once trained on the training subset, has overfitted. When the log data in the validation subset contains no anomalies, all validation samples in the validation subset are treated as positive samples (that is, normal samples). If the trained first detection model performs well on the validation subset, meaning that more than a preset proportion of the validation samples are detected as positive samples, the detection device may consider that the trained first detection model is close to convergence without overfitting. The preset proportion is, for example, 95%, 98% or 99%.
In some embodiments, the number of training epochs takes values in [1, 100]; that is, its search space is {1, 2, 3, …, 100}. Among the 100 training epochs, the detection device takes the epoch count with the best performance on the validation subset as the tuned number of training epochs, that is, the optimal number of epochs.
In this embodiment of the present application, the first detection model also has second hyperparameters, which include the optimizer parameters, the activation function, the loss function, the parameter initialization method, the batch size and the number of hidden layers. The second hyperparameters do not need tuning; for example, they are preset. The activation function of each hidden layer may be a nonlinear function such as ReLU or sigmoid, and the activation function of the output layer may be a linear function.
To achieve lightweight automatic anomaly detection, the embodiments of the present application provide an anomaly detection model based on an autoencoder. That is, the first detection model and the second detection model described herein are implemented on the basis of an autoencoder.
In this embodiment of the present application, the first detection model includes an encoder and a decoder connected in series. The encoder includes an input layer, a first hidden layer and a second hidden layer connected in series, and the decoder includes a third hidden layer and an output layer connected in series. The input layer of the encoder and the output layer of the decoder have the same dimension, and the first hidden layer and the third hidden layer have the same dimension. The second hidden layer can be regarded either as the output layer of the encoder or as the input layer of the decoder; that is, it is a network layer shared by the encoder and the decoder, and the encoder and decoder are symmetric. The input layer of the encoder receives the statistical features of a sample, and the output layer of the decoder outputs the reconstructed features of the sample.
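The symmetric layer layout can be made concrete with a small, untrained forward pass in plain Python. Everything here is illustrative: the layer sizes, the random initialization and the helper names are assumptions, ReLU is only one of the hidden-layer activations the text allows, and no training is performed.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_autoencoder(n1, n2, n3, seed=0):
    """Encoder: N1 -> N2 -> N3 (bottleneck). Decoder: N3 -> N2 -> N1."""
    rng = random.Random(seed)
    sizes = [n1, n2, n3, n2, n1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(layers, x):
    h = x
    for idx, (weights, bias) in enumerate(layers):
        h = dense(h, weights, bias)
        if idx < len(layers) - 1:
            h = relu(h)  # nonlinear hidden layers, linear output layer
    return h

layers = build_autoencoder(n1=16, n2=8, n3=4)
features = [0.5] * 16              # hypothetical statistical features of one sample
reconstruction = forward(layers, features)
```

The output has the same length as the input, which is what allows a per-sample reconstruction error to be computed between the statistical features and the reconstructed features.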
In addition, the first detection model includes a discriminator whose parameters include an error threshold. The detection device feeds the statistical features of the sample set into the input layer of the encoder; after processing by the input layer, the first hidden layer and the second hidden layer of the encoder, the encoded features of the sample set are obtained. The encoded features are then processed by the third hidden layer and the output layer of the decoder to obtain the reconstructed features of the sample set. The detection device feeds the statistical features and the reconstructed features of the sample set into the discriminator, which determines the reconstruction loss of the sample set with respect to the error threshold. The detection device trains the first detection model and tunes its hyperparameters based on the reconstruction loss of the sample set.
The error threshold is determined from the mean of multiple reconstruction losses. During parameter tuning of the first detection model, these reconstruction losses include the error between the statistical features and the reconstructed features of each training sample in the training subset. For example, the detection device takes the mean of the reconstruction losses as the error threshold. Alternatively, the error threshold is determined from the mean and standard deviation of the reconstruction losses; for example, the detection device determines the error threshold (threshold) according to formula (8). In formula (8), mean(loss) and std(loss) denote the mean and standard deviation of the reconstruction losses, and α is a preset parameter that may take a value such as 0.4, 0.5 or 0.6.
threshold=mean(loss)+α*std(loss)          (8)
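Formula (8) and the discriminator's decision can be sketched as follows. The loss values are invented, RMSE is used as one of the loss forms the text permits, the population standard deviation is an assumption, and flagging a sample as anomalous when its loss exceeds the threshold is the natural reading of the text rather than an explicit statement in it.

```python
import math
import statistics

def rmse(features, reconstruction):
    """Root mean square error between statistical and reconstructed features."""
    n = len(features)
    return math.sqrt(sum((f - r) ** 2 for f, r in zip(features, reconstruction)) / n)

def error_threshold(losses, alpha=0.5):
    """threshold = mean(loss) + alpha * std(loss), as in formula (8)."""
    # Assumption: population standard deviation.
    return statistics.fmean(losses) + alpha * statistics.pstdev(losses)

def is_anomalous(loss, threshold):
    # Assumption: a loss above the threshold marks the sample as anomalous.
    return loss > threshold

# Hypothetical reconstruction losses of the training samples.
train_losses = [0.10, 0.12, 0.08, 0.11, 0.09]
threshold = error_threshold(train_losses, alpha=0.5)

sample_loss = rmse([1.0, 2.0], [1.0, 4.0])
flag = is_anomalous(sample_loss, threshold)
```

A larger α widens the band of losses accepted as normal, trading missed detections against false alarms.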
The reconstruction loss between the statistical features and the reconstructed features of a sample may be the root mean square error (RMSE), the mean square error (MSE) or another form of error; the embodiments of the present application do not limit this. That is, the loss function may be an RMSE function, an MSE function or another function.
As can be seen from the above, the first hyperparameters to be tuned in the embodiments of the present application are the learning rate, the number of training epochs and the hidden layer dimensions, the three parameters with a relatively large impact on model performance. The second hyperparameters, which have a relatively small impact on model performance, can be preset to speed up model construction and parameter tuning. This improves the execution efficiency of the anomaly detection task while preserving the performance of the parameter-tuned first detection model.
Step 604: Based on the target attribute fields, perform anomaly detection on the test set using the parameter-tuned first detection model, to obtain the anomaly detection result of the test set.
After obtaining the parameter-tuned first detection model, the detection device determines the statistical features of the test set based on the target attribute fields. The statistical features of the test set include the statistics of the data of the target attribute fields in the test set. The detection device feeds the statistical features of the test set into the parameter-tuned first detection model to obtain the anomaly detection result of the test set.
Take a single-sample time granularity of 1 hour and an anomaly detection time granularity of 24 hours as an example. If the detection object is user a, each sample under test in the test set includes the log data of user a over 24 hours, and the anomaly detection result of that sample indicates whether user a performed abnormal operations and/or exhibited abnormal access behavior during those 24 hours. If the detection object is cloud service a, each sample under test in the test set includes the log data of cloud service a over 24 hours, and the anomaly detection result of that sample indicates whether cloud service a was abnormal during those 24 hours.
In this embodiment of the present application, the first detection model includes an encoder, a decoder and a discriminator, and the parameters of the discriminator include an error threshold. The detection device feeds the statistical features of the test set into the encoder to obtain the encoded features of the test set, and feeds the encoded features into the decoder to obtain the reconstructed features of the test set. The detection device then feeds the statistical features and the reconstructed features of the test set into the discriminator, which determines the anomaly detection result of the test set according to the error threshold.
The error threshold is determined from the mean of multiple reconstruction losses. During anomaly detection on the test set, these reconstruction losses include the error between the statistical features and the reconstructed features of each sample under test in the test set, and may additionally include the error between the statistical features and the reconstructed features of each training sample in the training subset. The statistical features of a sample under test include the statistics of the data of the target attribute fields in that sample, and its reconstructed features are the reconstructed statistics of the data of the target attribute fields in that sample.
Figure 8 is a flowchart of another anomaly detection method provided by an embodiment of the present application. Referring to Figure 8, the detection device determines the target attribute fields from the candidate attribute fields through automatic feature selection based on the raw log data. That is, it automatically performs feature engineering to screen the to-be-selected fields among the candidate attribute fields; the fields that pass screening, together with the already-selected fields among the candidate attribute fields, form the target attribute fields. Screening yields the feature-selected data, which includes the statistics of the data of the target attribute fields in each sample. The detection device runs the feature-selected data through the autoencoder to obtain the reconstructed statistics of the data of the target attribute fields in the sample, that is, it outputs the reconstruction result. The detection device computes the error between the feature-selected data and the reconstruction result (RMSE, as shown in Figure 8) to obtain the reconstruction loss of the sample. It then evaluates the reconstruction loss against the error threshold and outputs the discrimination result of the sample, that is, the anomaly detection result.
Figure 9 is a flowchart of yet another anomaly detection method provided by an embodiment of the present application. Referring to Figure 9, cloud-domain personnel configure the parameters of the anomaly detection task on a client and submit a configuration file (containing the configuration parameters) to the anomaly detection system. The anomaly detection system provides a self-service tool for anomaly detection. Cloud-domain personnel or devices normalize the selected raw log data according to the log specification format. Based on the configuration file, the self-service tool automatically computes statistics on the normalized log data and automatically screens fields based on the statistics and the configured candidate attribute fields. The self-service tool then automatically performs anomaly detection on the samples under test after field screening, to obtain the anomaly detection result.
In summary, in the embodiments of the present application, anomaly detection on log data in a cloud platform can be achieved with only simple manual configuration. Personnel in the cloud service domain need not create security rules by hand, nor analyze and summarize the various attack patterns in depth. This avoids the loopholes inherent in manually created security rules, reduces missed and false detections, and improves the efficiency of anomaly detection. In addition, the solution does not require personnel to have expertise in machine learning or deep learning, such as model design and tuning, yet still allows an anomaly detection model to be built quickly and in a self-service manner for task-specific anomaly detection.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part as a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)). It is worth noting that the computer-readable storage medium mentioned in the embodiments of this application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that "at least one" herein means one or more, and "a plurality of" means two or more. In the description of the embodiments of this application, unless otherwise stated, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, to describe the technical solutions of the embodiments clearly, the words "first", "second", and the like are used to distinguish between identical or similar items whose functions and effects are basically the same. Those skilled in the art will understand that the words "first", "second", and the like do not limit quantity or execution order, nor do they require the items to be different.
It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the applicable laws, regulations, and standards of the relevant countries and regions. For example, the log data involved in the embodiments of this application is obtained with full authorization.
The above are embodiments provided by this application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the principles of this application shall fall within its scope of protection.

Claims (23)

  1. An anomaly detection method, characterized in that the method comprises:
    receiving configuration parameters of an anomaly detection task, the configuration parameters indicating a sample set, a test set, and candidate attribute fields, wherein the sample set includes log data in a cloud platform used for parameter tuning, the test set includes log data in the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform;
    determining target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields, the target attribute fields being the attribute fields used to perform the anomaly detection task;
    tuning a first hyperparameter of a first detection model based on the sample set and the target attribute fields;
    performing, based on the target attribute fields, anomaly detection on the test set by means of the parameter-tuned first detection model to obtain an anomaly detection result of the test set.
  2. The method according to claim 1, characterized in that the candidate attribute fields include m selected fields and n to-be-selected fields, m being an integer not less than 0 and n being an integer greater than 0;
    determining the target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields comprises:
    determining, based on the sample set, the m selected fields, and the n to-be-selected fields, the field scores respectively corresponding to the n to-be-selected fields, each field score characterizing the degree to which the anomaly detection effect improves after the corresponding to-be-selected field is added to the m selected fields;
    determining p to-be-selected fields from the n to-be-selected fields based on the field scores respectively corresponding to the n to-be-selected fields, p being a positive integer not greater than n;
    determining the m selected fields and the p to-be-selected fields as the target attribute fields.
  3. The method according to claim 2, characterized in that the sample set includes a training subset and a validation subset;
    determining the field scores respectively corresponding to the n to-be-selected fields based on the sample set, the m selected fields, and the n to-be-selected fields comprises:
    forming the m selected fields into a selected field set and the n to-be-selected fields into a to-be-selected field set, and determining, based on the training subset, the selected field set, and the to-be-selected field set, the mutual information corresponding to each to-be-selected field in the to-be-selected field set, the mutual information characterizing the correlation between the corresponding to-be-selected field and all fields in the selected field set;
    selecting, from the to-be-selected field set, the k to-be-selected fields with the smallest mutual information, k being a positive integer not greater than n;
    determining, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, the reconstruction losses respectively corresponding to the k to-be-selected fields, each reconstruction loss characterizing the effect of performing anomaly detection on the validation subset using the corresponding to-be-selected field together with the selected field set;
    selecting one to-be-selected field from the k to-be-selected fields based on the mutual information and reconstruction losses respectively corresponding to the k to-be-selected fields, and determining the field score of the selected to-be-selected field;
    moving the selected to-be-selected field from the to-be-selected field set into the selected field set, and returning to the step of determining, based on the training subset, the selected field set, and the to-be-selected field set, the mutual information corresponding to each to-be-selected field in the to-be-selected field set, until the to-be-selected field set is empty, whereupon the field scores respectively corresponding to the n to-be-selected fields are obtained.
  4. The method according to claim 3, characterized in that determining, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, the reconstruction losses respectively corresponding to the k to-be-selected fields comprises:
    for a first to-be-selected field among the k to-be-selected fields, adding the first to-be-selected field to the selected field set to obtain a candidate field set, the first to-be-selected field being any one of the k to-be-selected fields;
    determining, based on the training subset and the candidate field set, a second detection model corresponding to the first to-be-selected field;
    determining, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by means of the second detection model corresponding to the first to-be-selected field.
  5. The method according to claim 4, characterized in that determining, based on the training subset and the candidate field set, the second detection model corresponding to the first to-be-selected field comprises:
    determining reference statistical features of the training subset based on the candidate field set, the reference statistical features of the training subset including statistics of the data, in the training subset, of all fields included in the candidate field set;
    training an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first to-be-selected field.
  6. The method according to claim 4 or 5, characterized in that determining, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by means of the second detection model corresponding to the first to-be-selected field comprises:
    determining reference statistical features of the validation subset based on the candidate field set, the reference statistical features of the validation subset including statistics of the data, in the validation subset, of all fields included in the candidate field set;
    inputting the reference statistical features of the validation subset into the second detection model corresponding to the first to-be-selected field to obtain reference reconstruction features of the validation subset, the reference reconstruction features of the validation subset including reconstructed statistics of the data, in the validation subset, of all fields included in the candidate field set;
    determining the reconstruction loss corresponding to the first to-be-selected field based on the reference statistical features and the reference reconstruction features of the validation subset.
  7. The method according to claim 5 or 6, characterized in that the configuration parameters further indicate the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different kinds of statistics.
  8. The method according to any one of claims 1 to 7, characterized in that the first hyperparameter includes a learning rate, a number of training rounds, and a hidden layer dimension.
  9. The method according to claim 8, characterized in that the first detection model includes an input layer, a first hidden layer, and a second hidden layer; the hidden layer dimension includes the dimensions of the first hidden layer and the second hidden layer, the dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer.
  10. The method according to any one of claims 1 to 9, characterized in that the first detection model includes an encoder, a decoder, and a discriminator, and the parameters of the discriminator include an error threshold;
    performing, based on the target attribute fields, anomaly detection on the test set by means of the parameter-tuned first detection model to obtain the anomaly detection result of the test set comprises:
    determining statistical features of the test set based on the target attribute fields, the statistical features of the test set including statistics of the data of the target attribute fields in the test set;
    inputting the statistical features of the test set into the encoder to obtain encoded features of the test set;
    inputting the encoded features of the test set into the decoder to obtain reconstructed features of the test set;
    inputting the statistical features and the reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
  11. An anomaly detection apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive configuration parameters of an anomaly detection task, the configuration parameters indicating a sample set, a test set, and candidate attribute fields, wherein the sample set includes log data in a cloud platform used for parameter tuning, the test set includes log data in the cloud platform on which anomaly detection is to be performed, and the candidate attribute fields are attribute fields corresponding to the log data of the cloud platform;
    a determining module, configured to determine target attribute fields from the candidate attribute fields based on the sample set and the candidate attribute fields, the target attribute fields being the attribute fields used to perform the anomaly detection task;
    a parameter tuning module, configured to tune a first hyperparameter of a first detection model based on the sample set and the target attribute fields;
    an anomaly detection module, configured to perform, based on the target attribute fields, anomaly detection on the test set by means of the parameter-tuned first detection model to obtain an anomaly detection result of the test set.
  12. The apparatus according to claim 11, characterized in that the candidate attribute fields include m selected fields and n to-be-selected fields, m being an integer not less than 0 and n being an integer greater than 0;
    the determining module comprises:
    a first determining submodule, configured to determine, based on the sample set, the m selected fields, and the n to-be-selected fields, the field scores respectively corresponding to the n to-be-selected fields, each field score characterizing the degree to which the anomaly detection effect improves after the corresponding to-be-selected field is added to the m selected fields;
    a second determining submodule, configured to determine p to-be-selected fields from the n to-be-selected fields based on the field scores respectively corresponding to the n to-be-selected fields, p being a positive integer not greater than n;
    a third determining submodule, configured to determine the m selected fields and the p to-be-selected fields as the target attribute fields.
  13. The apparatus according to claim 12, characterized in that the sample set includes a training subset and a validation subset;
    the first determining submodule is specifically configured to:
    form the m selected fields into a selected field set and the n to-be-selected fields into a to-be-selected field set, and determine, based on the training subset, the selected field set, and the to-be-selected field set, the mutual information corresponding to each to-be-selected field in the to-be-selected field set, the mutual information characterizing the correlation between the corresponding to-be-selected field and all fields in the selected field set;
    select, from the to-be-selected field set, the k to-be-selected fields with the smallest mutual information, k being a positive integer not greater than n;
    determine, based on the training subset, the validation subset, the selected field set, and the k to-be-selected fields, the reconstruction losses respectively corresponding to the k to-be-selected fields, each reconstruction loss characterizing the effect of performing anomaly detection on the validation subset using the corresponding to-be-selected field together with the selected field set;
    select one to-be-selected field from the k to-be-selected fields based on the mutual information and reconstruction losses respectively corresponding to the k to-be-selected fields, and determine the field score of the selected to-be-selected field;
    move the selected to-be-selected field from the to-be-selected field set into the selected field set, and return to the step of determining, based on the training subset, the selected field set, and the to-be-selected field set, the mutual information corresponding to each to-be-selected field in the to-be-selected field set, until the to-be-selected field set is empty, whereupon the field scores respectively corresponding to the n to-be-selected fields are obtained.
  14. The apparatus according to claim 13, characterized in that the first determining submodule is specifically configured to:
    for a first to-be-selected field among the k to-be-selected fields, add the first to-be-selected field to the selected field set to obtain a candidate field set, the first to-be-selected field being any one of the k to-be-selected fields;
    determine, based on the training subset and the candidate field set, a second detection model corresponding to the first to-be-selected field;
    determine, based on the validation subset and the candidate field set, the reconstruction loss corresponding to the first to-be-selected field by means of the second detection model corresponding to the first to-be-selected field.
  15. The apparatus according to claim 14, characterized in that the first determining submodule is specifically configured to:
    determine reference statistical features of the training subset based on the candidate field set, the reference statistical features of the training subset including statistics of the data, in the training subset, of all fields included in the candidate field set;
    train an initial detection model with the reference statistical features of the training subset to obtain the second detection model corresponding to the first to-be-selected field.
  16. The apparatus according to claim 14 or 15, characterized in that the first determining submodule is specifically configured to:
    determine reference statistical features of the validation subset based on the candidate field set, the reference statistical features of the validation subset including statistics of the data, in the validation subset, of all fields included in the candidate field set;
    input the reference statistical features of the validation subset into the second detection model corresponding to the first to-be-selected field to obtain reference reconstruction features of the validation subset, the reference reconstruction features of the validation subset including reconstructed statistics of the data, in the validation subset, of all fields included in the candidate field set;
    determine the reconstruction loss corresponding to the first to-be-selected field based on the reference statistical features and the reference reconstruction features of the validation subset.
  17. The apparatus according to claim 15 or 16, characterized in that the configuration parameters further indicate the category of each attribute field among the candidate attribute fields, and attribute fields of different categories correspond to different kinds of statistics.
  18. The apparatus according to any one of claims 11 to 17, characterized in that the first hyperparameter includes a learning rate, a number of training rounds, and a hidden layer dimension.
  19. The apparatus according to claim 18, characterized in that the first detection model includes an input layer, a first hidden layer, and a second hidden layer; the hidden layer dimension includes the dimensions of the first hidden layer and the second hidden layer, the dimension of the input layer is determined based on the number of fields included in the target attribute fields, and the dimensions of the first hidden layer and the second hidden layer are determined based on the dimension of the input layer.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the first detection model includes an encoder, a decoder, and a discriminator, and the parameters of the discriminator include an error threshold;
    the anomaly detection module comprises:
    a fourth determining submodule, configured to determine statistical features of the test set based on the target attribute fields, the statistical features of the test set including statistics of the data of the target attribute fields in the test set;
    a first input submodule, configured to input the statistical features of the test set into the encoder to obtain encoded features of the test set;
    a second input submodule, configured to input the encoded features of the test set into the decoder to obtain reconstructed features of the test set;
    a third input submodule, configured to input the statistical features and the reconstructed features of the test set into the discriminator to determine the anomaly detection result of the test set according to the error threshold.
  21. A computing device cluster, characterized by comprising at least one computing device, each computing device including a processor and a memory;
    the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 10 are implemented.
  23. A computer program product, characterized in that computer instructions are stored in the computer program product, and when the computer instructions are executed by a processor, the steps of the method according to any one of claims 1 to 10 are implemented.
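The field scoring loop recited in claims 2 and 3 can be sketched as a greedy forward selection. The sketch below is illustrative only and rests on several assumptions that the claims leave open: fields hold discrete values, the mutual information between a field awaiting selection and the selected set is taken as the sum of pairwise mutual information, the field with the lowest reconstruction loss is chosen (ties broken by lower mutual information), and the field score is the reduction in loss achieved by adding the field. The `loss_fn` callable is a hypothetical stand-in for training a second detection model on the training subset and measuring its reconstruction loss on the validation subset.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_field_scores(train, selected, pending, k, loss_fn):
    """Score every pending field by greedy forward selection.

    train    -- list of dict records (the training subset)
    selected -- fields already selected (may be empty)
    pending  -- fields still awaiting selection
    k        -- size of the low-mutual-information shortlist per round
    loss_fn  -- hypothetical callable: list of fields -> validation loss
    """
    selected, remaining, scores = list(selected), list(pending), {}

    def column(field):
        return [rec[field] for rec in train]

    current_loss = loss_fn(selected)
    while remaining:
        # Assumption: redundancy with the selected set = sum of pairwise MI.
        mi = {c: sum(mutual_information(column(c), column(s)) for s in selected)
              for c in remaining}
        shortlist = sorted(remaining, key=mi.get)[:k]
        # Evaluate each shortlisted field together with the selected set.
        losses = {c: loss_fn(selected + [c]) for c in shortlist}
        # Assumption: pick the lowest loss, breaking ties by lower MI.
        best = min(shortlist, key=lambda c: (losses[c], mi[c]))
        scores[best] = current_loss - losses[best]  # improvement from adding it
        current_loss = losses[best]
        selected.append(best)
        remaining.remove(best)
    return scores
```

Under these assumptions, the p fields of claim 2 could be taken as the p highest-scoring entries of the returned dictionary, for example with `sorted(scores, key=scores.get, reverse=True)[:p]`.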
PCT/CN2023/103993 2022-09-20 2023-06-29 Anomaly detection method and related apparatus WO2024060767A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211145851.6 2022-09-20
CN202211145851.6A CN117792662A (en) 2022-09-20 2022-09-20 Abnormality detection method and related device

Publications (1)

Publication Number Publication Date
WO2024060767A1 true WO2024060767A1 (en) 2024-03-28

Family

ID=90387895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103993 WO2024060767A1 (en) 2022-09-20 2023-06-29 Anomaly detection method and related apparatus

Country Status (2)

Country Link
CN (1) CN117792662A (en)
WO (1) WO2024060767A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109491860A (en) * 2018-10-17 2019-03-19 深圳壹账通智能科技有限公司 Method for detecting abnormality, terminal device and the medium of application program
CN110197066A (en) * 2019-05-29 2019-09-03 轲飞(北京)环保科技有限公司 Virtual machine monitoring method and monitoring system under a kind of cloud computing environment
CN110414690A (en) * 2018-04-28 2019-11-05 第四范式(北京)技术有限公司 The method and device of prediction is executed using machine learning model
CN112235327A (en) * 2020-12-16 2021-01-15 中移(苏州)软件技术有限公司 Abnormal log detection method, device, equipment and computer readable storage medium
CN112395159A (en) * 2020-11-17 2021-02-23 华为技术有限公司 Log detection method, system, device and medium
US20220156134A1 (en) * 2020-11-16 2022-05-19 Servicenow, Inc. Automatically correlating phenomena detected in machine generated data to a tracked information technology change

Also Published As

Publication number Publication date
CN117792662A (en) 2024-03-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867062

Country of ref document: EP

Kind code of ref document: A1