WO2020078385A1 - Data collection method, device, storage medium, and system - Google Patents

Data collection method, device, storage medium, and system

Info

Publication number
WO2020078385A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
data
data set
nodes
processing node
Prior art date
Application number
PCT/CN2019/111481
Other languages
English (en)
French (fr)
Inventor
何永健
王辉
李冰杰
徐志威
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2020078385A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available

Definitions

  • This application relates to the field of big data technology, and in particular to a data collection method, device, storage medium, and system.
  • A fixed preset rule is usually set in the distributed data collection system in advance and used as the criterion for distinguishing normal data from abnormal data. Each time the distributed data collection system collects data, it checks the collected data against the preset rule, determines the data that meets the rule to be normal data, and determines the data that does not meet the rule to be abnormal data.
  • The embodiments of the present application provide a data collection method, device, storage medium, and system, which can solve the problem in the related art that anomaly detection based on fixed preset rules cannot accurately detect abnormal data and therefore affects the normal operation of subsequent work.
  • the technical solution is as follows:
  • a data collection method is provided, which is applied to a designated processing node of a distributed data collection system.
  • The distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes. The method includes:
  • the anomaly detection model is obtained by training on a data set obtained from at least one collection node of the plurality of collection nodes;
  • the abnormal data is stored to a first storage node of the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the plurality of storage nodes include a second storage node corresponding to the designated processing node, the second storage node is used to store a data set acquired by the designated processing node, and the method further includes:
  • the specified time period refers to a time period whose start time is the time at which data collection starts and whose length is a specified duration;
  • the method further includes:
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the method further includes:
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the method further includes:
  • the master processing node is used to synthesize the anomaly detection models trained by the multiple processing nodes to obtain a synthesized anomaly detection model, and to store the synthesized anomaly detection model in a storage space shared by the multiple processing nodes so that each processing node can use it for anomaly detection.
  • anomaly detection is performed on the data set through an anomaly detection model to determine the anomaly data in the data set, including:
  • Anomaly detection is performed on the data set through the anomaly detection model to obtain an anomaly index for each piece of data in the data set, and data whose anomaly index is within a preset range is determined as anomaly data.
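
The thresholding step above can be sketched as follows. This is only an illustration under stated assumptions: the anomaly index is taken to be a score in [0, 1], the preset range is treated as a closed interval, and `select_anomalies`, `score_fn`, and the toy scoring rule are hypothetical names that do not appear in the application.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def select_anomalies(
    data_set: Iterable[T],
    score_fn: Callable[[T], float],
    preset_range: Tuple[float, float],
) -> List[T]:
    """Return the pieces of data whose anomaly index lies inside preset_range."""
    low, high = preset_range
    return [record for record in data_set if low <= score_fn(record) <= high]


# Toy stand-in for a trained model (assumption): score by distance from 10.
def toy_score(x: float) -> float:
    return min(abs(x - 10.0) / 10.0, 1.0)


readings = [9.8, 10.1, 10.0, 42.0]
anomalies = select_anomalies(readings, toy_score, (0.6, 1.0))
```

In a real deployment `score_fn` would be the scoring function of the trained anomaly detection model; the range (0.6, 1.0) is an arbitrary illustrative choice.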
  • The plurality of processing nodes includes a master processing node and at least one slave processing node, and when the designated processing node is the master processing node, before the collected data set is obtained from at least one collection node of the plurality of collection nodes, the method further includes:
  • a data acquisition instruction is sent to at least one slave processing node, where the data acquisition instruction carries a type identifier of the at least one collection node, and the at least one slave processing node is used to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
  • The multiple processing nodes include a master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, obtaining the collected data set from at least one collection node of the multiple collection nodes includes:
  • receiving a data acquisition instruction sent by the master processing node, where the data acquisition instruction carries a type identifier of at least one collection node, and the master processing node is used to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
  • the training based on the first sample data set to obtain an initial anomaly detection model includes:
  • Each binary tree includes multiple layers of nodes, the first layer includes a root node, and each node is connected to two branch nodes of the next layer. Each node includes a piece of data in the first sample data set, and the node value of each node is the key value of that data on the specified attribute. Each node is used to divide the data whose key value on the specified attribute is less than the node value into the first branch node of the next layer, and the data whose key value on the specified attribute is not less than the node value into the second branch node of the next layer.
  • the establishing multiple binary trees according to the first sample data set includes:
  • the data whose key value on the specified attribute is less than the node value of the current node is divided into the first branch node of the layer below the current node, and the data whose key value on the specified attribute is not less than the node value of the current node is divided into the second branch node of the layer below the current node, until each divided node includes only one piece of data or only data with the same key value on the specified attribute, at which point a binary tree is obtained.
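
A minimal sketch of the tree-building procedure described above, following the wording of this text rather than the standard Isolation Forest split rule. The names `Node`, `build_tree`, `subtree_keys`, and `is_ordered`, the record layout, and the fixed random seed are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

Record = Dict[str, float]


@dataclass
class Node:
    node_value: float                # key value on the specified attribute
    records: List[Record]            # data held by this node
    first: Optional["Node"] = None   # data with key <  node_value
    second: Optional["Node"] = None  # data with key >= node_value


def build_tree(records: List[Record], attr: str, rng: random.Random) -> Optional[Node]:
    """Divide records on `attr` until a node holds one record or equal keys."""
    if not records:
        return None
    if len(records) == 1 or len({r[attr] for r in records}) == 1:
        return Node(records[0][attr], list(records))
    # Pick one record at random; its key value becomes the node value.
    i = rng.randrange(len(records))
    pivot, rest = records[i], records[:i] + records[i + 1:]
    node = Node(pivot[attr], [pivot])
    node.first = build_tree([r for r in rest if r[attr] < pivot[attr]], attr, rng)
    node.second = build_tree([r for r in rest if r[attr] >= pivot[attr]], attr, rng)
    return node


def subtree_keys(node: Optional[Node], attr: str) -> List[float]:
    """Collect the key values of every record stored in a subtree."""
    if node is None:
        return []
    own = [r[attr] for r in node.records]
    return own + subtree_keys(node.first, attr) + subtree_keys(node.second, attr)


def is_ordered(node: Optional[Node], attr: str) -> bool:
    """Check the less-than / not-less-than division rule at every node."""
    if node is None:
        return True
    return (all(k < node.node_value for k in subtree_keys(node.first, attr))
            and all(k >= node.node_value for k in subtree_keys(node.second, attr))
            and is_ordered(node.first, attr) and is_ordered(node.second, attr))


rng = random.Random(0)
sample = [{"cpu": c, "mem": m} for c, m in [(0.2, 0.5), (0.9, 0.4), (0.2, 0.7), (0.5, 0.1)]]
attr = rng.choice(sorted(sample[0]))  # randomly chosen specified attribute
tree = build_tree(sample, attr, rng)
```

Because the chosen record is stored in the node and removed from the remaining data, each recursive call works on a strictly smaller list, so the construction always terminates.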
  • a data collection device for use in a designated processing node of a distributed data collection system.
  • the distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes.
  • the device includes:
  • a first obtaining module configured to obtain the collected data set from at least one collecting node of the plurality of collecting nodes, the data set including at least one piece of data;
  • an anomaly detection module configured to perform anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set, where the anomaly detection model is obtained by training on a data set acquired from at least one collection node of the plurality of collection nodes;
  • the first storage module is configured to store the abnormal data to a first storage node among the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the plurality of storage nodes include a second storage node corresponding to the designated processing node, the second storage node is used to store the data set acquired by the designated processing node, and the apparatus further includes:
  • a second obtaining module configured to obtain a data set collected within a specified time period from the second storage node and use the data set as a first sample data set, where the specified time period refers to a time period whose start time is the time at which data collection starts and whose length is a specified duration;
  • the training module is used for training according to the first sample data set to obtain an initial abnormality detection model.
  • the device further includes:
  • a third obtaining module configured to obtain, from the second storage node, a data set collected within a preset time period before the current time, and use the data set as a second sample data set;
  • the update module is configured to continue training according to the second sample data set to obtain an updated abnormality detection model.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • a fourth acquisition module configured to acquire the anomaly detection model trained by the master processing node, and receive the anomaly detection model trained by the at least one slave processing node;
  • the second storage module is configured to store the synthesized abnormality detection model in a storage space shared by the multiple processing nodes for each processing node to perform abnormality detection.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • a first sending module used to obtain the anomaly detection model trained by the designated processing node and send it to the master processing node, where the master processing node is used to synthesize the anomaly detection models trained by the multiple processing nodes to obtain a synthesized anomaly detection model, and to store the synthesized anomaly detection model in a storage space shared by the multiple processing nodes so that each processing node can use it for anomaly detection.
  • the abnormality detection module includes:
  • An anomaly detection submodule configured to perform anomaly detection on the data set through the anomaly detection model, obtain an anomaly index for each piece of data in the data set, and determine data whose anomaly index is within a preset range as anomalous data.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • a monitoring module used to monitor the multiple collection nodes;
  • a second sending module configured to send a data acquisition instruction to at least one slave processing node when it is detected that at least one of the plurality of collection nodes has collected a data set, where the data acquisition instruction carries the type identifier of the at least one collection node, and the at least one slave processing node is used to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the first acquisition module includes:
  • a receiving submodule configured to receive a data acquisition instruction sent by the master processing node, where the data acquisition instruction carries a type identifier of at least one collection node, and the master processing node is used to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
  • the obtaining submodule is used to obtain the collected data set from the corresponding collecting node according to the type identifier carried in the data obtaining instruction.
  • the training module includes:
  • Each binary tree includes multiple layers of nodes, the first layer includes a root node, and each node is connected to two branch nodes of the next layer. Each node includes a piece of data in the first sample data set, and the node value of each node is the key value of that data on the specified attribute. Each node is used to divide the data whose key value on the specified attribute is less than the node value into the first branch node of the next layer, and the data whose key value on the specified attribute is not less than the node value into the second branch node of the next layer.
  • the establishment sub-module is also used to:
  • the data whose key value on the specified attribute is less than the node value of the current node is divided into the first branch node of the layer below the current node, and the data whose key value on the specified attribute is not less than the node value of the current node is divided into the second branch node of the layer below the current node, until each divided node includes only one piece of data or only data with the same key value on the specified attribute, at which point a binary tree is obtained.
  • a processing node for use in a distributed data collection system.
  • The distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes, where the processing node is any processing node in the distributed data collection system;
  • the processing node includes a processor and a memory, and the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the data collection method described in the first aspect.
  • a computer-readable storage medium in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the data collection method described in the first aspect.
  • a distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes;
  • the multiple collection nodes are used to collect a data set, and the data set includes at least one piece of data;
  • Any one of the plurality of processing nodes is used to obtain the collected data set from at least one collection node of the plurality of collection nodes;
  • the any processing node is also used to perform anomaly detection on the data set through an anomaly detection model to determine anomaly data in the data set.
  • The anomaly detection model is obtained by training on a data set acquired from at least one collection node of the multiple collection nodes;
  • the any processing node is further used to store the abnormal data to the first storage node of the plurality of storage nodes;
  • the first storage node is used to store the detected abnormal data.
  • any processing node obtains the collected data set from at least one of the plurality of collecting nodes; and then performs anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set; Then, the abnormal data is stored to a first storage node among the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the anomaly detection model trained based on the collected data set can reflect the rules for distinguishing normal data from abnormal data, and learn the criteria for distinguishing between normal data and abnormal data.
  • Any processing node uses the anomaly detection model to perform anomaly detection on the acquired data set, and determines and stores the abnormal data in the data set, so that the detection result better matches the truly abnormal data, which improves the accuracy of anomaly detection and ensures that subsequent work proceeds normally.
  • FIG. 1 is a schematic structural diagram of a distributed data collection system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a data collection method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another distributed data collection system provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a data collection device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a distributed data collection system provided by an embodiment of the present application.
  • The distributed data collection system includes multiple collection nodes 101, multiple processing nodes 102, and multiple storage nodes 103. At least one collection node 101 is connected to one processing node 102, so each processing node 102 corresponds to at least one collection node 101. One processing node 102 is connected to one storage node 103, so the multiple processing nodes 102 correspond one-to-one to the multiple storage nodes 103.
  • each collection node 101 has a data collection function and can collect data.
  • Each processing node 102 has an abnormality detection function, and can perform abnormality detection on the collected data.
  • Each storage node 103 has a data storage function and can store the collected data.
  • each collection node 101 is used to collect a data set from a data source.
  • The designated processing node is used to obtain the collected data set from at least one collection node 101 connected to it, perform anomaly detection on the data set through an anomaly detection model to obtain the abnormal data in the data set, and then store the detected abnormal data to the first storage node.
  • the plurality of storage nodes 103 include the first storage node, and the first storage node is used to store the detected abnormal data.
  • The collection nodes, processing nodes, and storage nodes included in the distributed data collection system provided by the embodiments of the present application may be servers or functional modules within a server; that is, different nodes may be deployed on the same server or on different servers.
  • FIG. 2 is a flowchart of a data collection method provided by an embodiment of the present application, which is applied to a designated processing node of the distributed data collection system shown in FIG. 1, and the designated processing node is any processing node in the distributed data collection system.
  • the method includes the following steps:
  • Step 201 Acquire a collected data set from at least one collection node of the plurality of collection nodes, where the data set includes at least one piece of data.
  • Step 202 Perform anomaly detection on the data set through an anomaly detection model to determine anomaly data in the data set.
  • the anomaly detection model is obtained by training based on a data set obtained from at least one collection node of the plurality of collection nodes.
  • Step 203 Store the abnormal data to a first storage node among the plurality of storage nodes, where the first storage node is used to store the detected abnormal data.
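
Steps 201 to 203 can be sketched as one pass of a processing node. This is a minimal, in-memory illustration: `InMemoryStorageNode`, `collect_once`, the `temp` field, and the threshold rule standing in for the trained model are hypothetical and are not taken from the application.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class InMemoryStorageNode:
    """Stand-in for the first storage node; a real one would be a remote store."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def store(self, records: List[Record]) -> None:
        self.records.extend(records)


def collect_once(
    collection_nodes: List[Callable[[], List[Record]]],
    is_abnormal: Callable[[Record], bool],
    first_storage_node: InMemoryStorageNode,
) -> List[Record]:
    # Step 201: obtain the collected data set from the collection nodes.
    data_set = [record for fetch in collection_nodes for record in fetch()]
    # Step 202: run anomaly detection to determine the abnormal data.
    abnormal = [record for record in data_set if is_abnormal(record)]
    # Step 203: store the abnormal data to the first storage node.
    first_storage_node.store(abnormal)
    return abnormal


storage = InMemoryStorageNode()
nodes = [
    lambda: [{"temp": 21.0}, {"temp": 120.0}],
    lambda: [{"temp": 22.5}],
]
# Assumption: a fixed threshold stands in for the trained anomaly detection model.
found = collect_once(nodes, lambda r: r["temp"] > 90.0, storage)
```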
  • any processing node obtains the collected data set from at least one of the plurality of collecting nodes; and then performs anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set; Then, the abnormal data is stored to a first storage node among the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the anomaly detection model trained based on the collected data set can reflect the rules for distinguishing normal data from abnormal data, and learn the criteria for distinguishing between normal data and abnormal data.
  • Any processing node uses the anomaly detection model to perform anomaly detection on the acquired data set, and determines and stores the abnormal data in the data set, so that the detection result better matches the truly abnormal data, which improves the accuracy of anomaly detection and ensures that subsequent work proceeds normally.
  • FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of the present application, which is applied to the distributed data collection system of the embodiment shown in FIG. 1.
  • The distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes, and the multiple processing nodes include a master processing node and at least one slave processing node.
  • the method includes the following steps:
  • Step 301 The master processing node monitors the plurality of collection nodes and, when it detects that at least one of the plurality of collection nodes has collected a data set, sends a data acquisition instruction to at least one slave processing node, where the data acquisition instruction carries the type identifier of the at least one collection node.
  • The type identifier of a collection node indicates the type of the collection node. In a distributed data collection system, collection nodes of the same type can share data, so when the master processing node detects that a collection node has collected a data set, the data set only needs to be acquired from any collection node of that type. Therefore, the data acquisition instruction may carry the type identifier of the at least one collection node.
  • The type identifier may be the node type name of the collection node, such as kafka or FTP (File Transfer Protocol).
  • the main processing node can monitor each collection node in real time, and can also periodically monitor each collection node.
  • The data acquisition instruction may carry the type identifier of the at least one collection node and the cache location of the collected data set. For example, it may carry the type identifier kafka of the collection node and the cache location topic of the data set, so that a slave processing node can get the data from that topic of the kafka collection node.
  • The master processing node can monitor each collection node and, when it detects that at least one collection node of the plurality of collection nodes has collected a data set, obtain the type identifier of the at least one collection node and the cache location of the data set, carry them in the data acquisition instruction, and send the instruction to at least one slave processing node.
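
One possible shape for the data acquisition instruction and the dispatch step is sketched below. The class names, the use of `queue.Queue` as a slave inbox, and the `/incoming` cache location are assumptions for illustration; the application does not specify a message format or transport.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CollectionStatus:
    type_id: str          # e.g. "kafka" or "FTP"
    cache_location: str   # e.g. a Kafka topic name
    has_new_data: bool


@dataclass(frozen=True)
class DataAcquisitionInstruction:
    # Each target is (type identifier, cache location of the data set).
    targets: Tuple[Tuple[str, str], ...]


def monitor_and_dispatch(
    statuses: List[CollectionStatus],
    slave_inboxes: List["queue.Queue[DataAcquisitionInstruction]"],
) -> Optional[DataAcquisitionInstruction]:
    """Send one instruction to every slave if any collection node has new data."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    targets = tuple(dict.fromkeys(
        (s.type_id, s.cache_location) for s in statuses if s.has_new_data
    ))
    if not targets:
        return None
    instruction = DataAcquisitionInstruction(targets)
    for inbox in slave_inboxes:
        inbox.put(instruction)
    return instruction


statuses = [
    CollectionStatus("kafka", "topic1", True),
    CollectionStatus("kafka", "topic1", True),     # same type, shared data
    CollectionStatus("FTP", "/incoming", False),   # hypothetical location
]
inboxes = [queue.Queue(), queue.Queue()]
sent = monitor_and_dispatch(statuses, inboxes)
```

Collection nodes of the same type appear only once in the instruction, which matches the idea that the data set can be fetched from any node of that type.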
  • The master processing node may carry in the data acquisition instruction the type identifiers of all of the at least one collection node, or only some of those type identifiers.
  • When the data acquisition instruction carries only some of the type identifiers, that subset is determined by the master processing node according to the number of type identifiers and the number of idle processing nodes.
  • The master processing node determines the number of type identifiers and the number of idle processing nodes and calculates the ratio of the former to the latter. When the ratio is less than 1, the same type identifier is allocated to at least two idle processing nodes; when the ratio is not less than 1, each idle processing node is allocated at least one type identifier, and different idle processing nodes are allocated different type identifiers.
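
The ratio rule above can be sketched as follows. The round-robin order is an assumption: the text only fixes which case applies, not how identifiers are paired with nodes, and `allocate_type_ids` and the node names are hypothetical.

```python
from typing import Dict, List


def allocate_type_ids(type_ids: List[str], idle_nodes: List[str]) -> Dict[str, List[str]]:
    """Allocate collection-node type identifiers to idle processing nodes."""
    allocation: Dict[str, List[str]] = {node: [] for node in idle_nodes}
    if not type_ids or not idle_nodes:
        return allocation
    ratio = len(type_ids) / len(idle_nodes)
    if ratio < 1:
        # Fewer type identifiers than idle nodes: cycle through the identifiers,
        # so at least one identifier is shared by two or more nodes.
        for i, node in enumerate(idle_nodes):
            allocation[node].append(type_ids[i % len(type_ids)])
    else:
        # At least as many identifiers as nodes: deal them out so every node
        # gets at least one and no identifier goes to two nodes.
        for i, type_id in enumerate(type_ids):
            allocation[idle_nodes[i % len(idle_nodes)]].append(type_id)
    return allocation


shared = allocate_type_ids(["kafka", "FTP"], ["master", "slave1", "slave2", "slave3"])
split = allocate_type_ids(["a", "b", "c", "d"], ["slave1", "slave2"])
```

With two identifiers and four idle nodes, each identifier ends up on two nodes; with four identifiers and two idle nodes, each node receives two distinct identifiers.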
  • the idle processing node may be a master processing node and at least one slave processing node, or at least one slave processing node, or a master processing node.
  • the master processing node may allocate the collection node indicated by the first type identifier to itself and one slave processing node; and assign the collection node indicated by the second type identifier to the other two slave processing nodes.
  • The master processing node can also assign the collection node indicated by the first type identifier to one slave processing node and assign the second type identifier to the other two slave processing nodes, while the master processing node itself only monitors whether the collection nodes have collected data and performs the allocation, without participating in acquiring the data sets of the collection nodes.
  • the currently idle processing nodes are 2 slave processing nodes.
  • the master processing node may assign two type identifiers to one slave processing node, and assign the other two type identifiers to another slave processing node.
  • Step 302 According to the type identification of the at least one collection node, any processing node obtains the collected data set from the corresponding collection node and stores it in the second storage node.
  • Here, the processing node is any one of the master processing node and the slave processing nodes that receive the data acquisition instruction.
  • the multiple storage nodes may further include multiple second storage nodes, each processing node corresponds to a second storage node, and the second storage node is used to store the data set acquired by the corresponding processing node.
  • The master processing node may obtain the type identifier of the at least one collection node, obtain the collected data set from the corresponding collection node according to the type identifier, and store the acquired data set to the second storage node corresponding to the master processing node.
  • When a slave processing node receives a data acquisition instruction, it can determine, according to the type identifier carried in the instruction, the collection node whose data set is to be acquired, obtain the collected data set from that collection node, and store the acquired data set to the second storage node corresponding to the slave processing node.
  • When the number of received data acquisition instructions reaches a preset number, the multiple data acquisition instructions may be processed together.
  • A collection node allows only one processing node at a time to obtain its collected data set; multiple processing nodes cannot obtain the data set of the same collection node simultaneously. Thus, while one processing node is obtaining the data set collected by a collection node, other processing nodes cannot obtain that data set and can only obtain the data sets collected by other collection nodes, or obtain no data set at all. In this way, each processing node obtains the data sets collected by different collection nodes, which prevents multiple processing nodes from obtaining the same data set.
  • For each slave processing node, regardless of whether the data acquisition instruction carries the type identifiers of all of the at least one collection node or only some of them, the slave processing node only needs to obtain the data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.
  • When a slave processing node obtains the collected data set from the corresponding collection node according to a type identifier carried in the data acquisition instruction, another processing node may already be obtaining the data set collected by that collection node.
  • In that case, the slave processing node obtains the collected data set from another collection node corresponding to the type identifier. If, for every collection node corresponding to the type identifier, another processing node is already obtaining the collected data set, the slave processing node stops obtaining data sets for that type identifier and moves on to the next type identifier carried in the instruction; if the data acquisition instruction carries only this one type identifier, the slave processing node stops acquiring data sets.
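
The one-reader-at-a-time rule and the fallback order can be sketched with non-blocking locks. Modelling a collection node as an in-process object guarded by `threading.Lock` is an assumption made for illustration, as are the names `CollectionNode` and `fetch_by_type`; a distributed system would need a distributed lock or lease instead.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CollectionNode:
    name: str
    data_set: List[str]
    lock: threading.Lock = field(default_factory=threading.Lock)


def fetch_by_type(
    type_ids: List[str],
    nodes_by_type: Dict[str, List[CollectionNode]],
) -> Dict[str, List[str]]:
    """Fetch one data set per type identifier, skipping busy collection nodes."""
    fetched: Dict[str, List[str]] = {}
    for type_id in type_ids:
        for node in nodes_by_type.get(type_id, []):
            # A failed non-blocking acquire means another processing node is
            # already reading this collection node, so try the next one.
            if not node.lock.acquire(blocking=False):
                continue
            try:
                fetched[type_id] = list(node.data_set)
            finally:
                node.lock.release()
            break
        # If every node of this type was busy, nothing is fetched for it and
        # the loop simply moves on to the next type identifier.
    return fetched


kafka_a = CollectionNode("kafka-a", ["k1", "k2"])
kafka_b = CollectionNode("kafka-b", ["k1", "k2"])  # same type shares data
ftp_a = CollectionNode("ftp-a", ["f1"])
kafka_a.lock.acquire()  # simulate another processing node reading kafka-a
ftp_a.lock.acquire()    # simulate another processing node reading ftp-a
result = fetch_by_type(["kafka", "FTP"], {"kafka": [kafka_a, kafka_b], "FTP": [ftp_a]})
```

Here the kafka data set is read from the second kafka node because the first is busy, and nothing is read for FTP because its only node is busy.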
  • Step 303 Any processing node obtains the data set collected in the specified time period from the corresponding second storage node, and uses the data in the data set as the first sample data set.
  • The specified time period refers to a time period whose start time is the time at which data collection starts and whose length is a specified duration.
  • the specified duration can be set to one day, two days, 12 hours, etc., or can be set to other durations.
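
Selecting the first sample data set by time window can be sketched as below. The timestamped record layout, the half-open interval, the function name `first_sample_data_set`, and the example date are assumptions; the text only states that the window starts when collection starts and lasts for the specified duration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

StoredRecord = Tuple[datetime, Dict[str, float]]  # (collection time, data)


def first_sample_data_set(
    stored: List[StoredRecord],
    collection_start: datetime,
    specified_duration: timedelta,
) -> List[Dict[str, float]]:
    """Return data collected in [collection_start, collection_start + duration)."""
    window_end = collection_start + specified_duration
    return [
        record
        for collected_at, record in stored
        if collection_start <= collected_at < window_end
    ]


start = datetime(2019, 10, 16, 0, 0)  # arbitrary illustrative start time
stored = [
    (start + timedelta(hours=1), {"value": 1.0}),
    (start + timedelta(hours=23), {"value": 2.0}),
    (start + timedelta(hours=25), {"value": 3.0}),
]
sample = first_sample_data_set(stored, start, timedelta(days=1))
```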
  • For example, the processing node may be a Spark Streaming component and the collection node may be Kafka, with the data set collected by Kafka cached in the corresponding topic. Any Spark Streaming component obtains the data set from Kafka's topic1 according to the type identifier Kafka of the collection node and the storage location topic1 of the data set, encapsulates the first sample data set into a DStream (discretized data stream), and then traverses the RDDs (Resilient Distributed Datasets) in the DStream to obtain each piece of data for subsequent model training.
  • the first sample data set acquired by a processing node may be as shown in Table 1 below.
  • Table 1 is only an exemplary first sample data set provided by an embodiment of the present application; the first sample data set may also take other forms, which is not limited in this embodiment of the present application.
  • Step 304 Any processing node performs training according to the first sample data set to obtain an initial abnormality detection model.
  • Any one of the processing nodes may be the master processing node or any one of the at least one slave processing node. Each of these processing nodes therefore obtains a first sample data set and trains on it to obtain an initial anomaly detection model, so that multiple initial anomaly detection models are eventually obtained. Because the first sample data sets obtained by different processing nodes come from different data sets, the data obtained is more comprehensive.
  • the anomaly detection model is directly trained based on the acquired data set, which is more in line with the discrimination standard of the data set itself, and the anomaly detection model does not need to set the discrimination standard in advance, and is suitable for anomaly detection of massive data sets with high accuracy.
  • the anomaly detection model can be implemented based on the Isolation-Forest (isolated forest) algorithm.
  • When any processing node obtains the first sample data set, it can first build multiple binary trees from the first sample data set and then synthesize the multiple binary trees to obtain an initial anomaly detection model.
  • Each binary tree includes multiple layers of nodes. The first layer includes one root node, and each node is connected to two branch nodes in the next layer. Each node contains one piece of data from the first sample data set, and the node value of each node is the key value of that node's data on the specified attribute. Each node divides data whose key value on the specified attribute is less than the node value into the first branch node of the next layer, and data whose key value on the specified attribute is not less than the node value into the second branch node of the next layer.
  • When building multiple binary trees from the first sample data set, any processing node can randomly select any attribute from all data attributes of the first sample data set as the specified attribute, then randomly select one key value from all key values of the specified attribute as the node value of the root node and add the data corresponding to that node value to the root node. Starting from the root node, data whose key value on the specified attribute is less than the node value of the current node is divided into the first branch node under the current node, and data whose key value on the specified attribute is not less than the node value of the current node is divided into the second branch node under the current node. The division continues until a node reached by the division contains only one piece of data, or multiple pieces of data with the same key value on the specified attribute, at which point one binary tree is obtained.
  • the code to build a binary tree can be as follows:
  • Att is the specified attribute
  • Value is the node value
  • X is the set of all key values of the randomly selected specified attribute
  • e is the current height
  • l is the specified height
  • the specified height is set in advance.
  • If the randomly selected node is higher than the specified height, or no data has a key value on the specified attribute that is less than the node value, or no data has a key value on the specified attribute that is not less than the node value, the selected node value is inappropriate, and the process returns to reselect the node value of the node.
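The tree-building procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the embodiment: the dict-based node layout, the function names `build_tree` and `build_forest`, and the sample rows are hypothetical; `height` and `max_height` play the roles of e and l. Instead of looping to reselect an unsuitable node value, the sketch never picks the smallest key value, which guarantees that both branches are non-empty. For brevity, nodes store only the node value rather than the full piece of data.

```python
import random


def build_tree(rows, att, height=0, max_height=8, rng=random):
    """Recursively build one binary tree over `rows` on the attribute `att`."""
    keys = sorted({row[att] for row in rows})  # X: all key values of att
    # Stop when the height limit is reached or every row shares one key value
    # (this also covers a node that holds a single row).
    if height >= max_height or len(keys) == 1:
        return {"size": len(rows)}
    # Skipping the smallest key stands in for the "reselect" step in the text.
    value = rng.choice(keys[1:])
    left = [row for row in rows if row[att] < value]    # first branch node
    right = [row for row in rows if row[att] >= value]  # second branch node
    return {
        "att": att,
        "value": value,
        "left": build_tree(left, att, height + 1, max_height, rng),
        "right": build_tree(right, att, height + 1, max_height, rng),
    }


def build_forest(rows, n_trees=10, max_height=8, seed=0):
    """Build several trees, each on a randomly chosen specified attribute."""
    rng = random.Random(seed)
    attributes = sorted(rows[0])
    return [
        build_tree(rows, rng.choice(attributes), 0, max_height, rng)
        for _ in range(n_trees)
    ]


# Hypothetical first sample data set.
rows = [{"cpu": 12, "mem": 40}, {"cpu": 15, "mem": 42},
        {"cpu": 13, "mem": 41}, {"cpu": 95, "mem": 43}]
forest = build_forest(rows, n_trees=5)
print(len(forest))
```

Records that end up alone in a leaf close to the root are the ones that were easy to separate, which is what the later scoring step relies on.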
  • Step 305 Each slave processing node obtains the trained anomaly detection model and sends it to the master processing node.
  • Step 306 The master processing node obtains the currently trained anomaly detection model, and receives at least one anomaly detection model trained by the slave processing node.
  • Step 307 The master processing node synthesizes the anomaly detection models trained by the multiple processing nodes to obtain the synthesized anomaly detection model.
  • The master processing node synthesizes its own currently trained anomaly detection model with the anomaly detection models received from the at least one slave processing node, and finally obtains a synthesized anomaly detection model.
  • Because the master processing node synthesizes anomaly detection models that multiple processing nodes trained on different data sets, it takes into account all the data sets collected by the at least one collection node within the specified time period, so that the final synthesized anomaly detection model better reflects the regularities of the data itself, which helps the accuracy of the anomaly detection model.
  • Step 308 The main processing node stores the synthesized abnormality detection model in a storage space shared by the multiple processing nodes for each processing node to perform abnormality detection.
  • the storage space shared by the multiple processing nodes may be located at the main processing node, at other processing nodes, or at a storage server independently accessible to the multiple processing nodes.
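Steps 305-308 can be sketched as follows. The text does not say how the models are synthesized; the sketch assumes that each model is a list of binary trees and that synthesis simply concatenates those lists. The `synthesize` function, the `shared_storage` dict, and the placeholder tree names are hypothetical.

```python
def synthesize(models):
    """Merge per-node models (lists of trees) into one model."""
    merged = []
    for forest in models:
        merged.extend(forest)
    return merged


# Stand-in for the storage space shared by the processing nodes.
shared_storage = {}

# Placeholder trees; in practice these would be trained binary trees.
master_model = ["tree_m1", "tree_m2"]
slave_models = [["tree_a1"], ["tree_b1", "tree_b2"]]

# The master combines its own model with the models received from the slaves
# and stores the result where every processing node can read it.
shared_storage["anomaly_model"] = synthesize([master_model, *slave_models])
print(len(shared_storage["anomaly_model"]))
```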
  • The embodiment of the present application may update the anomaly detection model in the following manner, so that the updated anomaly detection model better matches the current rules for distinguishing normal data from abnormal data, thereby improving the accuracy of anomaly detection.
  • The above method further includes: any processing node obtains, from the second storage node, the data set collected within a preset duration before the current time and uses it as the second sample data set, then continues training on the second sample data set to obtain a new anomaly detection model as the updated anomaly detection model.
  • each slave processing node obtains the currently trained anomaly detection model and sends it to the master processing node.
  • the master processing node obtains the currently trained anomaly detection model, receives at least one anomaly detection model trained by the slave processing node, and then synthesizes the anomaly detection models trained by multiple processing nodes to obtain a synthesized anomaly detection model.
  • the current abnormality detection model stored in the storage space shared by the multiple processing nodes is replaced with the synthesized abnormality detection model for each processing node to perform abnormality detection.
  • The anomaly detection model can be updated periodically. For example, every day at zero o'clock the processing node obtains the data set collected within the preset duration before zero o'clock and updates the anomaly detection model; another time may also be set for the update. The update period may be the same as or different from the above preset duration.
  • Replacing the anomaly detection model currently stored in the shared storage space with one trained on the newly collected data set updates the model, so that when the data keeps changing, the anomaly detection model currently stored in the storage space shared by the multiple processing nodes can be updated to follow those changes, thereby improving the accuracy of anomaly detection.
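The periodic update can be sketched as follows. This is a single-node simplification that skips the master/slave synthesis step; the function `update_model`, the `train_stub` placeholder, and the record layout are assumptions made for the example.

```python
from datetime import datetime, timedelta


def update_model(shared_storage, records, now, preset_duration, train):
    """Retrain on data collected in [now - preset_duration, now) and replace the stored model."""
    window_start = now - preset_duration
    second_sample = [r for r in records if window_start <= r["ts"] < now]
    shared_storage["anomaly_model"] = train(second_sample)
    return second_sample


def train_stub(sample):
    """Placeholder for real training; only records the sample size."""
    return {"trained_on": len(sample)}


midnight = datetime(2019, 1, 2, 0, 0)
records = [{"ts": midnight - timedelta(hours=h)} for h in (1, 12, 30)]
shared_storage = {"anomaly_model": {"trained_on": 0}}

# Daily update at zero o'clock with a preset duration of one day.
second_sample = update_model(
    shared_storage, records, midnight, timedelta(days=1), train_stub
)
print(shared_storage["anomaly_model"])
```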
  • Each processing node corresponds to a second storage node among the multiple storage nodes. When any processing node obtains the collected data set from the at least one collection node, it may store the obtained data set in its corresponding second storage node, which backs up the unprocessed data set for subsequent updates of the anomaly detection model and may also serve other processing.
  • the method provided in the embodiment of FIG. 3 is applied to the distributed data collection system of the embodiment shown in FIG. 1.
  • The designated processing node in the distributed data collection system may be the master processing node. In this case, the designated processing node performs the operations performed by the master processing node in steps 301-308 above and, by interacting with the slave processing nodes, executes the method shown in the embodiment of FIG. 3.
  • The designated processing node in the distributed data collection system may also be any slave processing node. In this case, the designated processing node performs the operations performed by the slave processing nodes in steps 301-308 above and, by interacting with the master processing node, executes the method shown in the embodiment of FIG. 3.
  • The master processing node monitors the multiple collection nodes so that any of the multiple processing nodes obtains the collected data set from at least one of the multiple collection nodes and trains the anomaly detection model directly on the acquired data set. The resulting model can reflect the rules that separate normal data from abnormal data and learn the criteria for distinguishing them.
  • the abnormality detection model currently stored in the storage space shared by multiple processing nodes can be updated according to the data changes, thereby improving the accuracy of abnormality detection.
  • FIG. 4 is a flowchart of a data collection method provided by an embodiment of the present application, which is applied to the distributed data collection system of the embodiment shown in FIG. 1.
  • The distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes; the multiple processing nodes include a master processing node and at least one slave processing node.
  • the method includes the following steps:
  • Step 401 The master processing node monitors the multiple collection nodes and, when it detects that at least one of the multiple collection nodes has collected a data set, sends a data acquisition instruction to at least one slave processing node. The data acquisition instruction carries the type identification of the at least one collection node.
  • Step 402 Any processing node obtains the collected data set from the corresponding collection node according to the type identification of the at least one collection node.
  • The processing node is either the master processing node or any one of the slave processing nodes that receive the data acquisition instruction.
  • The data set includes at least one piece of data.
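Steps 401 and 402 can be sketched as follows. The sketch models collection nodes as in-memory buffers keyed by type identification and slave processing nodes as simple inboxes; the class `DataAcquisitionInstruction`, the functions, and the identifier strings are hypothetical and only illustrate the message flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAcquisitionInstruction:
    """Instruction carrying the type identifications of ready collection nodes."""
    collector_types: tuple


def monitor_and_dispatch(collectors, slave_inboxes):
    """Master side: detect collection nodes holding data and notify the slaves."""
    ready = tuple(sorted(t for t, buffer in collectors.items() if buffer))
    if not ready:
        return None
    instruction = DataAcquisitionInstruction(ready)
    for inbox in slave_inboxes:
        inbox.append(instruction)
    return instruction


def fetch(instruction, collectors):
    """Processing-node side: read the data sets named in the instruction."""
    return {t: list(collectors[t]) for t in instruction.collector_types}


# Hypothetical collection nodes: one has collected data, one has not.
collectors = {"kafka-topic1": [{"v": 1}], "kafka-topic2": []}
slave_inboxes = [[], []]

instruction = monitor_and_dispatch(collectors, slave_inboxes)
data_sets = fetch(slave_inboxes[0][0], collectors)
print(data_sets)
```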
  • Step 403 Any processing node obtains the abnormality detection model from the storage space shared by the multiple processing nodes, and performs abnormality detection on the data set through the abnormality detection model to determine abnormal data in the data set.
  • When any processing node obtains the data set, it may obtain the anomaly detection model from the storage space shared by the multiple processing nodes. The anomaly detection model is the one trained and stored in the embodiment of FIG. 3 described above.
  • anomaly detection is performed on the data set through the anomaly detection model to obtain an anomaly index for each piece of data in the data set, and the data whose anomaly index is within a preset range is determined as anomaly data.
  • an anomaly detection model is used to perform anomaly detection on the data set.
  • The anomaly detection model includes multiple binary trees, and each binary tree divides the key values of an attribute according to the node value of each node. Therefore, when anomaly detection is performed on a piece of data in the data set, the key value of the data on the specified attribute is input into the anomaly detection model and divided layer by layer through each binary tree until it reaches a branch node in the last layer of the binary tree; the path length of the key value, that is, the height of the tree, is recorded.
  • The anomaly index of a key value of the data is calculated according to the following formulas (1)-(3), and the anomaly indices of all key values of the data are calculated in the same way. The anomaly index of the piece of data is then determined from the anomaly indices of all its key values. If the anomaly index of the piece of data is within a preset range, the piece of data is determined to be abnormal data.
  • e represents the number of nodes that data x passes through from the root node of the binary tree to the branch node in the last layer, and c(n) is the correction term;
  • H(k) = ln(k) + γ, where γ is Euler's constant, with a value of 0.5772156649;
  • S(x, n) represents the anomaly index of the current data.
  • S(x, n) → 1 means that data x is more likely to be abnormal, and S(x, n) → 0 means that data x is less likely to be abnormal.
  • If S(x, n) of most data in the data set is close to 0.5, the data set as a whole has no obvious anomaly.
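For reference, the standard Isolation Forest scoring, which is consistent with the symbols defined above, can be written as below. Here $E(h(x))$ denotes the average path length of $x$ over all binary trees and $n$ the number of samples; both symbols, and the assumption that formulas (1)-(3) take this standard form, are not confirmed by the text above.

$$S(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(k) = \ln(k) + \gamma, \qquad \gamma \approx 0.5772156649$$

With this form, $E(h(x)) \to 0$ gives $S \to 1$, $E(h(x)) = c(n)$ gives $S = 0.5$, and very long paths give $S \to 0$, which matches the interpretation of values near 1, 0.5, and 0 given above.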
  • The average of the anomaly indices of all key values of the piece of data can be calculated and used as the anomaly index of that piece of data. It is then determined whether the anomaly index of the data is within the preset range; if it is, the data is determined to be abnormal.
  • the preset range may be preset, such as (0.6, 1), and of course, may also be other, such as (0.8, 1).
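The scoring and thresholding described above can be sketched as follows. The sketch assumes the standard Isolation Forest form of the anomaly index, with H(k) = ln(k) + 0.5772156649 as stated in the text; the function names are hypothetical, and treating the preset range (0.6, 1) as an open interval is also an assumption.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant as given in the text


def c(n):
    """Correction term c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, for n >= 2."""
    if n < 2:
        raise ValueError("c(n) is only defined here for n >= 2")
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_index(mean_path_length, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))


def record_index(per_key_indices):
    """Average the anomaly indices of all key values of one piece of data."""
    return sum(per_key_indices) / len(per_key_indices)


def is_abnormal(index, preset_range=(0.6, 1.0)):
    """Data is abnormal when its index lies strictly inside the preset range."""
    low, high = preset_range
    return low < index < high


n = 256
short_path = anomaly_index(2.0, n)    # isolated quickly, index is high
typical_path = anomaly_index(c(n), n)  # average path, index is 0.5
print(is_abnormal(short_path), is_abnormal(typical_path))
```

Narrowing the range to (0.8, 1) flags fewer pieces of data, trading missed anomalies for fewer false alarms.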
  • Any processing node uses the above method to perform anomaly detection on each piece of data in the acquired data set through the anomaly detection model, which quickly determines the anomaly index of all data in the data set and then the abnormal data in the data set, achieving anomaly detection of the data set.
  • Step 404 Any processing node stores the abnormal data to the first storage node of the plurality of storage nodes, where the first storage node is used to store the detected abnormal data.
  • Each processing node corresponds to a first storage node among the multiple storage nodes.
  • The abnormal data may be stored in the first storage node corresponding to the processing node.
  • The first storage node may store the detected abnormal data so that the user can retrieve the abnormal data and analyze the cause of the abnormality.
  • the first storage node may be an Elasticsearch component
  • the user may retrieve abnormal data through the Elasticsearch component, and analyze the retrieval results.
  • the Elasticsearch component can set an index name for storage and retrieval of abnormal data when storing abnormal data.
  • the index name of abnormal data may be normal, but of course it may be other.
  • the user can search for abnormal data according to the index name, and the retrieval results of some of the abnormal data obtained can be as follows:
  • normal indicates the index name of abnormal data
  • type is the index type
  • _id is the unique identifier of the current piece of data
  • The content of _source is one piece of data (data) from the acquired data set together with the anomaly index (normalScore) of that data.
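The shape of such a retrieval result can be sketched as follows. The field names `_id`, `_source`, `data`, and `normalScore` and the index name `normal` come from the description above; the `_index` and `_type` keys, the `doc` type value, the identifiers, and the sample values are hypothetical. The sketch uses plain dicts and does not call an Elasticsearch client.

```python
def to_index_document(doc_id, data, normal_score,
                      index_name="normal", index_type="doc"):
    """Wrap one abnormal piece of data in a hit-like structure."""
    return {
        "_index": index_name,   # index name of abnormal data
        "_type": index_type,    # index type (hypothetical value)
        "_id": doc_id,          # unique identifier of this piece of data
        "_source": {"data": data, "normalScore": normal_score},
    }


def search_by_index(documents, index_name):
    """Return the documents stored under the given index name."""
    return [d for d in documents if d["_index"] == index_name]


# Hypothetical stored documents.
documents = [
    to_index_document("a1", {"cpu": 95, "mem": 43}, 0.82),
    to_index_document("b7", {"cpu": 13, "mem": 41}, 0.45, index_name="other"),
]
hits = search_by_index(documents, "normal")
print(hits[0]["_source"]["normalScore"])
```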
  • each processing node may separately store the normal data and the abnormal data in the data set in the first storage node.
  • index names can be set for normal data and abnormal data, so as to be used for retrieving normal data and abnormal data.
  • the index name of abnormal data can be set to normal, and of course it can be set to other.
  • The method provided in the embodiment of FIG. 4 is applied to the distributed data collection system of the embodiment shown in FIG. 1, and the designated processing node in the distributed data collection system may be the master processing node.
  • In this case, the designated processing node performs the operations performed by the master processing node in steps 401-405 above and, by interacting with the slave processing nodes, executes the method shown in the embodiment of FIG. 4.
  • The designated processing node in the distributed data collection system may also be a slave processing node.
  • In this case, the designated processing node performs the operations performed by the slave processing nodes in steps 401-405 above and, by interacting with the master processing node, executes the method shown in the embodiment of FIG. 4.
  • The distributed data collection system may include multiple Kafka modules, multiple Spark Streaming modules, multiple Elasticsearch modules, and multiple HBase modules.
  • the multiple Spark Streaming modules include a master Spark Streaming module and at least one slave Spark Streaming module.
  • The operations performed by the collection node in the foregoing embodiments of FIGS. 3 and 4 may be performed by the Kafka module.
  • the operations performed by the processing node in the above embodiments of FIGS. 3 and 4 may be performed by the Spark Streaming module.
  • The Elasticsearch module can also provide analysis and search functions, making it convenient for users to search and analyze abnormal data and assess the cause of the abnormality, which facilitates follow-up work.
  • The operations performed by the second storage node in the foregoing embodiments of FIGS. 3 and 4 may be performed by the HBase module.
  • Take as an example a distributed data collection system that includes a Kafka module, a Spark Streaming module, an Elasticsearch module, and an HBase module.
  • The Spark Streaming module monitors the Kafka module.
  • The Spark Streaming module obtains the collected data set from the Kafka module and stores the obtained data set in the HBase module.
  • The Spark Streaming module uses the anomaly detection model, trained in advance on data sets obtained from the Kafka module, to perform anomaly detection on the currently acquired data set and determine the abnormal data in the data set.
  • the Spark Streaming module then stores the abnormal data to the Elasticsearch module.
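The flow through the four modules can be sketched as follows. Plain Python lists stand in for the HBase and Elasticsearch modules and a list of dicts stands in for a batch read from Kafka; `process_batch`, `toy_score`, and the sample values are hypothetical, and the toy scoring function only imitates a trained anomaly detection model.

```python
def process_batch(batch, score, raw_store, anomaly_store,
                  preset_range=(0.6, 1.0)):
    """Back up a batch, score each piece of data, and store the abnormal ones."""
    raw_store.extend(batch)  # backup of the unprocessed data set (HBase role)
    low, high = preset_range
    abnormal = []
    for record in batch:
        index = score(record)
        if low < index < high:
            abnormal.append({"data": record, "normalScore": index})
    anomaly_store.extend(abnormal)  # abnormal data (Elasticsearch role)
    return abnormal


def toy_score(record):
    """Stand-in for the anomaly detection model."""
    return 0.9 if record["cpu"] > 90 else 0.4


batch = [{"cpu": 12}, {"cpu": 95}, {"cpu": 15}]  # data set read from Kafka
raw_store, anomaly_store = [], []
abnormal = process_batch(batch, toy_score, raw_store, anomaly_store)
print(len(raw_store), len(anomaly_store))
```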
  • The master processing node monitors the multiple collection nodes so that the multiple processing nodes can obtain the collected data set when at least one of the multiple collection nodes collects a data set. Each processing node then obtains the anomaly detection model from the shared storage space and uses it to perform anomaly detection on the data set to determine the abnormal data in the data set; the abnormal data is then stored in the first storage node among the multiple storage nodes for subsequent use.
  • The anomaly detection model can reflect the rules that separate normal data from abnormal data, and because the multiple processing nodes all use the anomaly detection model stored in the shared storage space, they can perform anomaly detection on massive data sets in parallel, which improves the speed of anomaly detection while maintaining its accuracy.
  • When subsequent data keeps changing, the data collection method provided by the embodiment of the present application can update the anomaly detection model currently stored in the storage space shared by the multiple processing nodes according to the changes in the data, and perform anomaly detection with the updated model to make the detection results more accurate.
  • FIG. 6 is a schematic structural diagram of a data collection device provided by an embodiment of the present application. Referring to FIG. 6, the device is applied to a designated processing node of a distributed data collection system.
  • the distributed data collection system includes multiple collection nodes, multiple processing nodes, and multiple storage nodes.
  • The device includes a first acquisition module 601, an anomaly detection module 602, and a first storage module 603.
  • The first acquisition module 601 is configured to obtain the collected data set from at least one collection node of the multiple collection nodes, where the data set includes at least one piece of data;
  • The anomaly detection module 602 is configured to perform anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set; the anomaly detection model is trained on data sets obtained from at least one of the multiple collection nodes;
  • the first storage module 603 is configured to store the abnormal data to a first storage node among the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the multiple storage nodes include a second storage node corresponding to the designated processing node.
  • the second storage node is used to store data acquired by the designated processing node.
  • the device further includes:
  • The second acquisition module is used to obtain, from the second storage node, the data set collected in the specified time period and use it as the first sample data set.
  • The specified time period is the period that starts at the moment data collection begins and lasts for the specified duration;
  • the training module is used for training according to the first sample data set to obtain an initial anomaly detection model.
  • the device further includes:
  • a third obtaining module configured to obtain, from the second storage node, a data set collected within a preset time period before the current time, and use the data set as a second sample data set;
  • the update module is used to continue training according to the second sample data set to obtain an updated abnormality detection model.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • the fourth acquisition module is used to acquire the anomaly detection model trained by the master processing node and receive the anomaly detection model trained by the at least one slave processing node;
  • the synthesis module is used to synthesize the abnormality detection models trained by the multiple processing nodes to obtain the synthesized abnormality detection model
  • the second storage module is used to store the synthesized abnormality detection model in a storage space shared by the multiple processing nodes for each processing node to perform abnormality detection.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • The first sending module is used to obtain the anomaly detection model trained by the designated processing node and send it to the master processing node.
  • The master processing node is used to synthesize the anomaly detection models trained by the multiple processing nodes to obtain the synthesized anomaly detection model, and to store the synthesized anomaly detection model in a storage space shared by the multiple processing nodes for each processing node to perform anomaly detection.
  • the anomaly detection module 602 includes:
  • the anomaly detection submodule is used to perform anomaly detection on the data set through an anomaly detection model to obtain an anomaly index for each piece of data in the data set, and determine data with an anomaly index within a preset range as anomalous data.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the device further includes:
  • Monitoring module for monitoring the multiple collection nodes
  • The second sending module is configured to send a data acquisition instruction to at least one slave processing node when it detects that at least one of the multiple collection nodes has collected a data set. The data acquisition instruction carries the type identifier of the at least one collection node, and the at least one slave processing node is used to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
  • the multiple processing nodes include a master processing node and at least one slave processing node.
  • the first acquisition module includes:
  • The receiving submodule is used to receive a data acquisition instruction sent by the master processing node. The data acquisition instruction carries the type identification of at least one collection node, and the master processing node is used to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
  • the obtaining submodule is used to obtain the collected data set from the corresponding collecting node according to the type identifier carried in the data obtaining instruction.
  • The training module includes:
  • an establishment sub-module, used to build multiple binary trees from the first sample data set;
  • a synthesis sub-module, used to synthesize the multiple binary trees to obtain the initial anomaly detection model.
  • Each binary tree includes multiple layers of nodes. The first layer includes one root node, and each node is connected to two branch nodes in the next layer. Each node contains one piece of data from the first sample data set, and the node value of each node is the key value of that node's data on the specified attribute. Each node divides data whose key value on the specified attribute is less than the node value into the first branch node of the next layer, and data whose key value on the specified attribute is not less than the node value into the second branch node of the next layer.
  • The establishment sub-module is also used for:
  • randomly selecting any attribute from all data attributes of the first sample data set as the specified attribute;
  • randomly selecting one key value from all key values of the specified attribute as the node value of the root node, and adding the data corresponding to the node value of the root node to the root node;
  • starting from the root node, dividing data whose key value on the specified attribute is less than the node value of the current node into the first branch node under the current node, and dividing data whose key value on the specified attribute is not less than the node value of the current node into the second branch node under the current node, until a node reached by the division contains only one piece of data or multiple pieces of data with the same key value on the specified attribute, at which point one binary tree is obtained.
  • any processing node obtains the collected data set from at least one of the plurality of collecting nodes; and then performs anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set; Then, the abnormal data is stored to a first storage node among the plurality of storage nodes, and the first storage node is used to store the detected abnormal data.
  • the anomaly detection model trained based on the collected data set can reflect the rules for distinguishing normal data from abnormal data, and learn the criteria for distinguishing between normal data and abnormal data.
  • Any processing node uses the anomaly detection model to perform anomaly detection on the acquired data set, determines the abnormal data in the data set, and stores it, so that the detection results better match the truly abnormal data, improving the accuracy of anomaly detection and allowing follow-up work to proceed normally.
  • When the data collection device provided in the above embodiment collects data, the above division into functional modules is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the data collection device and the data collection method embodiment provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiments, and details are not described here.
  • The server 700 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one instruction is stored in the memory 702 and is loaded and executed by the processor 701.
  • the server 700 may also have components such as a wired or wireless network interface, a keyboard, and an input-output interface for input and output.
  • the server 700 may also include other components for implementing device functions, which will not be repeated here.
  • the server 700 is used to perform the operations performed by the control device or the node device in the above data acquisition method.
  • a computer-readable storage medium is also provided, for example, a memory including instructions that can be executed by the processor in the terminal or server to complete the data collection method in the above embodiments.
  • the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.
  • the program may be stored in a computer-readable storage medium.
  • the mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

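A data collection method, device, storage medium, and system, belonging to the field of big data technology. The method is applied to a designated processing node of a distributed data collection system and includes: obtaining a collected data set from at least one of multiple collection nodes, the data set including at least one piece of data (201); performing anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the multiple collection nodes (202); and storing the abnormal data in a first storage node among multiple storage nodes, the first storage node being used to store detected abnormal data (203). The anomaly detection model trained on the collected data sets can reflect the regularities that separate normal data from abnormal data and learn the criteria for distinguishing them. Performing anomaly detection on the data set through the anomaly detection model makes the detection results better match the truly abnormal data and improves the accuracy of anomaly detection.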

Description

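Data collection method, device, storage medium, and system
This application claims priority to Chinese patent application No. 201811215823.0, filed on October 18, 2018 and entitled "Data collection method, device, storage medium, and system", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of big data technology, and in particular to a data collection method, device, storage medium, and system.
Background
In the field of big data technology, network problems such as network crashes and malicious attacks may produce abnormal data that does not meet requirements and affects subsequent use of the data. During data collection, the collected data therefore needs to be checked to determine the abnormal data in it, so that the validity of massive data can be verified.
In the related art, a fixed preset rule is usually configured in the distributed data collection system in advance and used as the criterion for distinguishing normal data from abnormal data. Each time the distributed data collection system collects data, it classifies the collected data according to the preset rule: data that satisfies the preset rule is determined to be normal data, and data that does not satisfy the preset rule is determined to be abnormal data.
When massive data is collected, the data changes over time, and the criterion for distinguishing normal data from abnormal data may change as well. Continuing to detect according to the fixed preset rule may fail to detect abnormal data accurately, which affects subsequent work.
Summary
The embodiments of this application provide a data collection method, device, storage medium, and system that can solve the problem in the related art that anomaly detection based on a fixed preset rule fails to detect abnormal data accurately and affects subsequent work. The technical solutions are as follows: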
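In a first aspect, a data collection method is provided, applied to a designated processing node of a distributed data collection system, the distributed data collection system including multiple collection nodes, multiple processing nodes, and multiple storage nodes, the method including:
obtaining a collected data set from at least one of the multiple collection nodes, the data set including at least one piece of data;
performing anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the multiple collection nodes;
storing the abnormal data in a first storage node among the multiple storage nodes, the first storage node being used to store detected abnormal data.
Optionally, the multiple storage nodes include a second storage node corresponding to the designated processing node, the second storage node being used to store the data sets obtained by the designated processing node, and the method further includes:
obtaining, from the second storage node, the data set collected within a specified time period, and using the data set as a first sample data set, the specified time period being the period that starts at the moment data collection begins and lasts for a specified duration;
training on the first sample data set to obtain an initial anomaly detection model.
Optionally, the method further includes:
obtaining, from the second storage node, the data set collected within a preset duration before the current time, and using the data set as a second sample data set;
continuing training on the second sample data set to obtain an updated anomaly detection model.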
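Optionally, the multiple processing nodes include one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the method further includes:
obtaining the anomaly detection model trained by the master processing node, and receiving the anomaly detection model trained by the at least one slave processing node;
synthesizing the anomaly detection models trained by the multiple processing nodes to obtain a synthesized anomaly detection model;
storing the synthesized anomaly detection model in a storage space shared by the multiple processing nodes, for each processing node to use in anomaly detection.
Optionally, the multiple processing nodes include one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the method further includes:
obtaining the anomaly detection model trained by the designated processing node and sending it to the master processing node, the master processing node being used to synthesize the anomaly detection models trained by the multiple processing nodes to obtain a synthesized anomaly detection model and to store the synthesized anomaly detection model in a storage space shared by the multiple processing nodes, for each processing node to use in anomaly detection.
Optionally, performing anomaly detection on the data set through the anomaly detection model to determine abnormal data in the data set includes:
performing anomaly detection on the data set through the anomaly detection model to obtain an anomaly index for each piece of data in the data set, and determining data whose anomaly index falls within a preset range to be abnormal data.
Optionally, the multiple processing nodes include one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, before obtaining the collected data set from at least one of the multiple collection nodes, the method further includes:
monitoring the multiple collection nodes;
when it is detected that at least one of the multiple collection nodes has collected a data set, sending a data acquisition instruction to at least one slave processing node, the data acquisition instruction carrying the type identifier of the at least one collection node, the at least one slave processing node being used to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
Optionally, the multiple processing nodes include one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, obtaining the collected data set from at least one of the multiple collection nodes includes:
receiving a data acquisition instruction sent by the master processing node, the data acquisition instruction carrying the type identifier of at least one collection node, the master processing node being used to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
obtaining the collected data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.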
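Optionally, training on the first sample data set to obtain the initial anomaly detection model includes:
building multiple binary trees from the first sample data set, and synthesizing the multiple binary trees to obtain the initial anomaly detection model;
each binary tree includes multiple layers of nodes, the first layer includes one root node, each node is connected to two branch nodes in the next layer, each node contains one piece of data from the first sample data set, the node value of each node is the key value of that node's data on a specified attribute, and each node is used to divide data whose key value on the specified attribute is less than the node value into the first branch node of the next layer and data whose key value on the specified attribute is not less than the node value into the second branch node of the next layer.
Optionally, building multiple binary trees from the first sample data set includes:
randomly selecting any attribute from all data attributes of the first sample data set as the specified attribute;
randomly selecting one key value from all key values of the specified attribute as the node value of the root node, and adding the data corresponding to the node value of the root node to the root node;
starting from the root node, dividing data whose key value on the specified attribute is less than the node value of the current node into the first branch node in the layer below the current node, and dividing data whose key value on the specified attribute is not less than the node value of the current node into the second branch node in the layer below the current node, until a node reached by the division contains only one piece of data or contains multiple pieces of data with the same key value on the specified attribute, at which point one binary tree is obtained.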
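In a second aspect, a data collection device is provided, applied to a designated processing node of a distributed data collection system, the distributed data collection system including multiple collection nodes, multiple processing nodes, and multiple storage nodes, the device including:
a first acquisition module, used to obtain a collected data set from at least one of the multiple collection nodes, the data set including at least one piece of data;
an anomaly detection module, used to perform anomaly detection on the data set through an anomaly detection model to determine abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the multiple collection nodes;
a first storage module, used to store the abnormal data in a first storage node among the multiple storage nodes, the first storage node being used to store detected abnormal data.
Optionally, the multiple storage nodes include a second storage node corresponding to the designated processing node, the second storage node being used to store the data sets obtained by the designated processing node, and the device further includes:
a second acquisition module, used to obtain, from the second storage node, the data set collected within a specified time period and use the data set as a first sample data set, the specified time period being the period that starts at the moment data collection begins and lasts for a specified duration;
a training module, used to train on the first sample data set to obtain an initial anomaly detection model.
Optionally, the device further includes:
a third acquisition module, used to obtain, from the second storage node, the data set collected within a preset duration before the current time and use the data set as a second sample data set;
an update module, used to continue training on the second sample data set to obtain an updated anomaly detection model.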
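Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
a fourth obtaining module, configured to obtain the anomaly detection model trained by the master processing node, and to receive the anomaly detection model trained by the at least one slave processing node;
a combining module, configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model;
a second storage module, configured to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the apparatus further includes:
a first sending module, configured to obtain the anomaly detection model trained by the designated processing node and send it to the master processing node, the master processing node being configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model, and to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
Optionally, the anomaly detection module includes:
an anomaly detection submodule, configured to perform anomaly detection on the data set with the anomaly detection model to obtain an anomaly index for each record in the data set, and to determine the records whose anomaly index falls within a preset range to be abnormal data.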
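Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
a monitoring module, configured to monitor the plurality of collection nodes;
a second sending module, configured to send a data acquisition instruction to at least one slave processing node when it is detected that at least one of the plurality of collection nodes has collected a data set, the data acquisition instruction carrying the type identifier of the at least one collection node, and the at least one slave processing node being configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the first obtaining module includes:
a receiving submodule, configured to receive a data acquisition instruction sent by the master processing node, the data acquisition instruction carrying the type identifier of at least one collection node, and the master processing node being configured to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
an obtaining submodule, configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.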
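Optionally, the training module includes:
a building submodule, configured to build a plurality of binary trees from the first sample data set;
a combining submodule, configured to combine the plurality of binary trees to obtain the initial anomaly detection model;
each binary tree includes multiple levels of nodes, the first level includes one root node, and each node is connected to two branch nodes in the next level; each node contains one record of the first sample data set, and the node value of each node is the key value of that record on a designated attribute; each node sends the data whose key value on the designated attribute is less than the node value to a first branch node in the next level, and sends the data whose key value on the designated attribute is not less than the node value to a second branch node in the next level.
Optionally, the building submodule is further configured to:
randomly select any one of the data attributes of the first sample data set as the designated attribute;
randomly select one of the key values of the designated attribute as the node value of the root node, and add the record corresponding to that node value to the root node;
starting from the root node, send the data whose key value on the designated attribute is less than the node value of the current node to the first branch node in the level below the current node, and send the data whose key value on the designated attribute is not less than the node value of the current node to the second branch node in the level below the current node, until the node reached contains only one record, or contains several records with the same key value on the designated attribute, at which point one binary tree is obtained.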
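In a third aspect, a processing node is provided, applied in a distributed data collection system, the distributed data collection system including a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the processing node being any processing node in the distributed data collection system;
the processing node includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the data collection method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the data collection method of the first aspect.
In a fifth aspect, a distributed data collection system is provided, the distributed data collection system including a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes;
the plurality of collection nodes are configured to collect data sets, each data set including at least one record;
any one of the plurality of processing nodes is configured to obtain a collected data set from at least one of the plurality of collection nodes;
that processing node is further configured to perform anomaly detection on the data set with an anomaly detection model and determine the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes;
that processing node is further configured to store the abnormal data in a first storage node among the plurality of storage nodes;
the first storage node is configured to store detected abnormal data.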
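In the embodiments of this application, any processing node obtains a collected data set from at least one of the plurality of collection nodes, performs anomaly detection on the data set with an anomaly detection model to determine the abnormal data in it, and then stores the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data. Because the anomaly detection model is trained on the collected data sets, it reflects the regularities that separate normal data from abnormal data and learns the criterion that distinguishes them. When a processing node uses this model to detect and store the abnormal data in a newly obtained data set, the detection result therefore matches the truly abnormal data more closely, which raises the accuracy of anomaly detection and keeps subsequent work running normally.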
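Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a distributed data collection system provided by an embodiment of this application;
FIG. 2 is a flowchart of a data collection method provided by an embodiment of this application;
FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of this application;
FIG. 4 is a flowchart of another data collection method provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of another distributed data collection system provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of a data collection apparatus provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
For ease of understanding, the system architecture involved in the embodiments of this application is introduced before the embodiments are explained in detail.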
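FIG. 1 is a schematic structural diagram of a distributed data collection system provided by an embodiment of this application. Referring to FIG. 1, the system includes a plurality of collection nodes 101, a plurality of processing nodes 102 and a plurality of storage nodes 103. At least one collection node 101 is connected to one processing node 102, so each processing node 102 corresponds to at least one collection node 101. One processing node 102 is connected to one storage node 103, so the processing nodes 102 and the storage nodes 103 are in one-to-one correspondence.
Each collection node 101 has a data collection function and can collect data. Each processing node 102 has an anomaly detection function and can perform anomaly detection on the collected data. Each storage node 103 has a data storage function and can store the collected data.
Take any one of the processing nodes 102 as the designated processing node. Each collection node 101 collects data sets from a data source. The designated processing node obtains the collected data sets from the at least one collection node 101 connected to it, performs anomaly detection on them with an anomaly detection model to obtain the abnormal data in each data set, and then stores the detected abnormal data in a first storage node.
The plurality of storage nodes 103 includes the first storage node, which is used to store detected abnormal data.
It should be noted that the collection nodes, processing nodes and storage nodes in the distributed data collection system provided by the embodiments of this application may each be a server, or may be functional modules within a server. In other words, different nodes may be deployed on the same server or on different servers.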
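FIG. 2 is a flowchart of a data collection method provided by an embodiment of this application. The method is applied in a designated processing node of the distributed data collection system shown in FIG. 1, the designated processing node being any processing node in that system. Referring to FIG. 2, the method includes the following steps:
Step 201: Obtain a collected data set from at least one of the plurality of collection nodes, the data set including at least one record.
Step 202: Perform anomaly detection on the data set with an anomaly detection model and determine the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes.
Step 203: Store the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data.
In the embodiments of this application, any processing node obtains a collected data set from at least one of the plurality of collection nodes, performs anomaly detection on the data set with an anomaly detection model to determine the abnormal data in it, and then stores the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data. Because the anomaly detection model is trained on the collected data sets, it reflects the regularities that separate normal data from abnormal data and learns the criterion that distinguishes them. When a processing node uses this model to detect and store the abnormal data in a newly obtained data set, the detection result therefore matches the truly abnormal data more closely, which raises the accuracy of anomaly detection and keeps subsequent work running normally.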
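FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of this application. The method is applied in the distributed data collection system of the embodiment shown in FIG. 1, which includes a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the processing nodes including one master processing node and at least one slave processing node. Referring to FIG. 3, the method includes the following steps:
Step 301: The master processing node monitors the plurality of collection nodes. When it detects that at least one of them has collected a data set, it sends a data acquisition instruction to at least one slave processing node, the instruction carrying the type identifier of the at least one collection node.
The type identifier of a collection node indicates the type of that node. In the distributed data collection system, collection nodes of the same type share data, so when the master processing node detects that some collection node has collected a data set, the data set can be obtained from any collection node of that type, given only the type. It is therefore sufficient for the data acquisition instruction to carry the type identifier of the at least one collection node. The type identifier may be, for example, the name of the node type, such as kafka or FTP (File Transfer Protocol).
The master processing node may monitor each collection node in real time or periodically.
Optionally, the data acquisition instruction may carry both the type identifier of the at least one collection node and the cache location of the collected data set. For example, it may carry the type identifier kafka and the cache location topic, so that a slave processing node can fetch the data from that topic of the Kafka collection node.
Accordingly, the master processing node may monitor each collection node and, on detecting that at least one of them has collected a data set, obtain the type identifier of the at least one collection node and the cache location of the data set, place both in a data acquisition instruction, and send the instruction to at least one slave processing node.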
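When the at least one collection node detected includes collection nodes of several types, the master processing node may put all of their type identifiers in the data acquisition instruction, or only some of them. When the instruction carries only some of the type identifiers, the master processing node chooses that subset according to the number of type identifiers and the number of idle processing nodes.
Optionally, the master processing node determines the number of type identifiers and the number of idle processing nodes and computes the ratio of the former to the latter. When the ratio is less than 1, it assigns the same type identifier to at least two idle processing nodes. When the ratio is not less than 1, it assigns at least one type identifier to each idle processing node, and the type identifiers assigned to different idle processing nodes are different.
The idle processing nodes may be the master processing node together with at least one slave processing node, or at least one slave processing node only, or the master processing node only.
For example, suppose there are 2 type identifiers and 4 idle processing nodes, namely 1 master processing node and 3 slave processing nodes. The master processing node may assign the collection nodes indicated by the first type identifier to itself and one slave processing node, and the collection nodes indicated by the second type identifier to the other two slave processing nodes. Alternatively, the master processing node may assign the collection nodes indicated by the first type identifier to one slave processing node and the second type identifier to the other two slave processing nodes, while the master itself only monitors the collection nodes and performs the assignment without taking part in fetching their data sets.
As another example, suppose there are 4 type identifiers and the idle processing nodes are 2 slave processing nodes. The master processing node may assign two type identifiers to one slave processing node and the other two type identifiers to the other slave processing node.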
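The ratio rule and the two examples above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `assign_type_ids`, the node labels and the round-robin order are hypothetical, and the application does not prescribe how identifiers are paired with nodes beyond the ratio rule.

```python
from itertools import cycle


def assign_type_ids(type_ids, idle_nodes):
    """Map each idle node to a list of type identifiers (hypothetical helper)."""
    if not type_ids or not idle_nodes:
        return {}
    assignment = {node: [] for node in idle_nodes}
    if len(type_ids) / len(idle_nodes) < 1:
        # Fewer identifiers than idle nodes: several nodes share one identifier.
        for node, type_id in zip(idle_nodes, cycle(type_ids)):
            assignment[node].append(type_id)
    else:
        # At least as many identifiers as nodes: deal them out round-robin, so
        # every node gets at least one and no identifier is handed out twice.
        for type_id, node in zip(type_ids, cycle(idle_nodes)):
            assignment[node].append(type_id)
    return assignment


shared = assign_type_ids(["kafka", "ftp"], ["master", "slave1", "slave2", "slave3"])
split = assign_type_ids(["t1", "t2", "t3", "t4"], ["slave1", "slave2"])
```

With 2 identifiers and 4 idle nodes, each identifier ends up on two nodes. With 4 identifiers and 2 idle nodes, each node receives two distinct identifiers.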
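Step 302: Any processing node obtains the collected data set from the corresponding collection node according to the type identifier of the at least one collection node and stores it in a second storage node, the processing node being the master processing node or any slave processing node that has received a data acquisition instruction.
In the distributed collection system, the plurality of storage nodes may also include a plurality of second storage nodes. Each processing node corresponds to one second storage node, which stores the data sets obtained by that processing node.
The master processing node, on detecting that the at least one collection node has collected a data set, may obtain the type identifier of the at least one collection node, fetch the collected data set from the corresponding collection node according to that identifier, and store the fetched data set in the second storage node corresponding to the master processing node.
Each slave processing node may act as soon as it receives a data acquisition instruction: it determines from the type identifier carried in the instruction which collection node to read, fetches the collected data set from that node, and stores it in the second storage node corresponding to the slave processing node. Alternatively, it may wait until the number of received data acquisition instructions reaches a preset number and then process them together. Or it may start a timer when the first data acquisition instruction arrives and, once a given interval has elapsed, fetch the collected data sets from the corresponding collection nodes according to the type identifiers carried in the instructions received during that interval, and then restart the timer.
A collection node allows only one processing node at a time to fetch its collected data set; several processing nodes cannot fetch the data set of the same collection node simultaneously. While one processing node is fetching the data set of a collection node, other processing nodes cannot fetch from it and must either fetch from other collection nodes or not fetch at all. In this way each processing node fetches the data sets of different collection nodes, which prevents several processing nodes from obtaining the same data set.
It should be noted that, whether the data acquisition instruction carries the type identifiers of all of the at least one collection node or only of some of them, a slave processing node simply fetches data sets from the collection nodes corresponding to the type identifiers carried in the instruction.
When a slave processing node tries to fetch the collected data set from the collection node corresponding to one of the type identifiers in the instruction, another processing node may already be fetching from that collection node. In that case the slave processing node fetches from another collection node corresponding to the same type identifier. If every collection node corresponding to that type identifier is being read by some other processing node, the slave processing node stops fetching for that type identifier and moves on to the next type identifier in the instruction. If the instruction carries only this one type identifier, the slave processing node stops fetching.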
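The exclusive-access and fallback behaviour described above can be sketched as follows. This is a minimal single-process illustration, assuming a lock per collection node. The class `CollectionNode`, its `try_fetch` method and the function `fetch_by_type_ids` are hypothetical names, and a real deployment would rely on whatever exclusion mechanism the collection component offers.

```python
import threading


class CollectionNode:
    """Hypothetical collection node that admits one reader at a time."""

    def __init__(self, name, type_id, data_set):
        self.name = name
        self.type_id = type_id
        self.data_set = list(data_set)
        self.lock = threading.Lock()  # held while a processing node is reading

    def try_fetch(self):
        """Return a copy of the buffered data set, or None if the node is busy."""
        if not self.lock.acquire(blocking=False):
            return None
        try:
            return list(self.data_set)
        finally:
            self.lock.release()


def fetch_by_type_ids(type_ids, nodes):
    """For each type identifier, fetch from the first node of that type that is free."""
    fetched = {}
    for type_id in type_ids:
        for node in nodes:
            if node.type_id != type_id:
                continue
            data = node.try_fetch()
            if data is not None:
                fetched[type_id] = data
                break  # one data set per type identifier is enough
        # If every node of this type was busy, fall through to the next identifier.
    return fetched
```

A type identifier whose collection nodes are all busy is simply absent from the result, which mirrors the rule that the slave processing node gives up on that identifier and continues with the next one.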
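Step 303: Any processing node obtains, from its corresponding second storage node, the data sets collected within a designated time period and uses the data in them as a first sample data set, where the designated time period starts at the moment data collection begins and lasts for a designated duration.
The designated duration may be set to one day, two days, 12 hours, or any other length. For example, the processing node may be a Spark Streaming component and the collection node may be Kafka, with the data sets collected by Kafka cached in the corresponding topic. A Spark Streaming component fetches the data set from topic1 of Kafka according to the type identifier Kafka and the storage location topic1, wraps the first sample data set in a DStream (data stream), and then iterates over the RDDs (Resilient Distributed Datasets) in the DStream to obtain each record for the subsequent model training.
For example, the first sample data set obtained by a processing node may be as shown in Table 1 below.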
Table 1
     Number_count   Id_count   p_count    Map_agent_count   Name_len_count   prefix_count   price
0    0              0          0          0                 0                0              0
1    0.744680851    0.744681   0.065217   0.0315            0.571681286      0.21965486     56.04167
2    0.021270596    0.021277   0.021739   0.0315            0                0              79.89256
3    0.021270596    0.021277   0.021739   0                 0.028571429      0              4.165
4    0.021270596    0.021277   0          0                 0.028571429      0              67.4862
5    0.042625844    0.06383    0.021739   0                 0                0.0216325      67.1586
6    0.021270596    0.021277   0.043478   0                 0.084625962      0              56.41654
7    0.021270596    0.042553   0.021739   0.0315            0.028571429      0.04929858     66.469
8    0.064825843    0.021277   0          0                 0.028571429      0.0216325      157.655
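It should be noted that Table 1 is only one example of a first sample data set; the first sample data set may take other forms, and the embodiments of this application place no limit on this.
Step 304: Any processing node trains on the first sample data set to obtain an initial anomaly detection model.
The processing node here may be the master processing node or any one of the slave processing nodes. Each of the master and slave processing nodes therefore obtains its own first sample data set, trains on it, and obtains its own initial anomaly detection model, so several initial anomaly detection models are produced in the end. Because different processing nodes draw their first sample data sets from different data sets, the data covered is more comprehensive.
The anomaly detection model is trained directly on the obtained data sets, so it fits the distinguishing criterion inherent in the data. It needs no preset criterion, suits anomaly detection on massive data sets, and is highly accurate. The model may be implemented with the Isolation Forest algorithm.
On obtaining the first sample data set, a processing node may first build a plurality of binary trees from it and then combine these trees to obtain the initial anomaly detection model.
Each binary tree includes multiple levels of nodes. The first level includes one root node, and each node is connected to two branch nodes in the next level. Each node contains one record of the first sample data set, and the node value of each node is the key value of that record on a designated attribute. Each node sends the data whose key value on the designated attribute is less than the node value to a first branch node in the next level, and the data whose key value is not less than the node value to a second branch node in the next level.
Optionally, when building the binary trees from the first sample data set, a processing node may proceed as follows. It randomly selects any one of the data attributes of the first sample data set as the designated attribute. It randomly selects one of the key values of the designated attribute as the node value of the root node and adds the record corresponding to that node value to the root node. Then, starting from the root node, it sends the data whose key value on the designated attribute is less than the node value of the current node to the first branch node in the level below the current node, and the data whose key value is not less than the node value of the current node to the second branch node in that level, until the node reached contains only one record, or contains several records with the same key value on the designated attribute. One binary tree is then obtained.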
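Code for building one binary tree may be as follows:
[Code listing provided as images PCTCN2019111481-appb-000001 and PCTCN2019111481-appb-000002]
Here Att is the designated attribute, Value is the node value, X is the set of all key values of the randomly selected designated attribute, e is the current height, and l is the designated height, which is set in advance.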
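It should be noted that a randomly selected node value is unsuitable if the node lies beyond the designated height, if no data has a key value on the designated attribute that is less than the node value, or if no data has a key value on the designated attribute that is not less than the node value. In any of these cases the procedure goes back and selects the node value for that node again.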
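The tree construction described in step 304 can be sketched in Python as below. This is a hedged illustration, not the listing from the application, which is available only as an image. The names `build_tree`, `build_forest`, `max_height` and `seed` are assumptions, and so is the choice to stop growing a branch once `max_height` is reached, which follows common Isolation Forest practice. Excluding the smallest key value when drawing a node value has the same effect as re-drawing until both branches are non-empty.

```python
import random


def build_tree(rows, attr, height=0, max_height=8, rng=random):
    """Build one binary tree over `rows` (a list of dicts) on attribute `attr`."""
    keys = sorted({row[attr] for row in rows})
    # Stop when the node holds a single record, only records with the same key
    # value on `attr`, or when the height limit is reached (assumed stop rule).
    if len(keys) <= 1 or height >= max_height:
        return {"size": len(rows)}
    # Never pick the smallest key: "< value" would otherwise leave the first
    # branch empty and the value would have to be selected again.
    value = rng.choice(keys[1:])
    first = [row for row in rows if row[attr] < value]
    second = [row for row in rows if row[attr] >= value]
    return {
        "attr": attr,
        "value": value,
        "first": build_tree(first, attr, height + 1, max_height, rng),
        "second": build_tree(second, attr, height + 1, max_height, rng),
    }


def build_forest(rows, n_trees=10, max_height=8, seed=0):
    """Build several trees, each on one randomly chosen attribute."""
    rng = random.Random(seed)
    attrs = sorted(rows[0])
    return [build_tree(rows, rng.choice(attrs), 0, max_height, rng)
            for _ in range(n_trees)]


sample = [
    {"p_count": 0.065217, "price": 56.04167},
    {"p_count": 0.021739, "price": 79.89256},
    {"p_count": 0.021739, "price": 4.165},
    {"p_count": 0.0, "price": 67.4862},
]
forest = build_forest(sample, n_trees=5)
```

Each leaf records only how many records reached it, which is all that the path-length calculation at detection time needs.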
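Step 305: Each slave processing node obtains its trained anomaly detection model and sends it to the master processing node.
Step 306: The master processing node obtains its own currently trained anomaly detection model and receives the anomaly detection model trained by the at least one slave processing node.
Step 307: The master processing node combines the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model.
The master processing node combines its own currently trained anomaly detection model with the received models trained by the at least one slave processing node, yielding a single combined anomaly detection model.
In other words, the master processing node combines models that the processing nodes trained on different data sets, so the result takes into account all the data sets collected by the at least one collection node within the designated time period. The combined anomaly detection model therefore reflects the regularities of the data more fully, which safeguards its accuracy.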
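The application does not spell out how the models are combined. One plausible reading, since each model is a set of binary trees, is that the combined model is the union of all trees. The sketch below assumes exactly that; the name `merge_models` and the model layout with `trees` and `sample_size` fields are hypothetical.

```python
def merge_models(master_model, slave_models):
    """Combine per-node models by pooling their trees (assumed combination rule)."""
    merged = {"trees": []}
    for model in [master_model, *slave_models]:
        for tree in model["trees"]:
            # Keep each tree's training sample size alongside it, so that a
            # score computed later can be normalised per tree.
            merged["trees"].append({"tree": tree, "sample_size": model["sample_size"]})
    return merged


master = {"trees": ["tree_a", "tree_b"], "sample_size": 100}
slaves = [{"trees": ["tree_c"], "sample_size": 50}]
combined = merge_models(master, slaves)
```

Pooling keeps every tree unchanged, so the combined model can be rebuilt at any time from the per-node models without retraining.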
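Step 308: The master processing node stores the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
The storage space shared by the processing nodes may be located on the master processing node, on another processing node, or on a separate storage server that all the processing nodes can access.
When massive amounts of data are collected, the criterion that separates normal data from abnormal data may change over time. The embodiments of this application can therefore update the anomaly detection model as described below, so that the updated model better matches the current regularities separating normal and abnormal data, which improves detection accuracy.
After the anomaly detection model has been built according to steps 301-308, the method further includes: any processing node obtains, from the second storage node, the data sets collected within a preset duration before the current moment and uses them as a second sample data set; it then continues training on the second sample data set to obtain a new anomaly detection model, which serves as the updated anomaly detection model.
Then, following steps 305-308, each slave processing node obtains its currently trained anomaly detection model and sends it to the master processing node. The master processing node obtains its own currently trained model, receives the models trained by the at least one slave processing node, and combines the models trained by the plurality of processing nodes into a combined anomaly detection model. It then replaces the anomaly detection model currently stored in the shared storage space with the combined model, for each processing node to use in anomaly detection.
It should be noted that, to keep the anomaly detection model accurate, the model may be updated periodically. For example, a processing node may be configured to fetch, at midnight every day, the data sets collected within the preset duration before midnight and update the model, or to update the model on some other schedule. The update period may be equal to the preset duration or different from it. The model trained on newly collected data sets replaces the model currently stored in the shared storage space, so that as the data keeps changing, the stored model is updated to follow those changes and detection accuracy improves.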
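Each processing node corresponds to one second storage node among the plurality of storage nodes. When a processing node fetches a collected data set from the at least one collection node, it may store the fetched data set in its corresponding second storage node. This backs up the unprocessed data set for later updates of the anomaly detection model and for other processing.
In addition, the method of the FIG. 3 embodiment is applied in the distributed data collection system of the embodiment shown in FIG. 1. The designated processing node in that system may be the master processing node, in which case it performs the operations that steps 301-308 assign to the master processing node and carries out the method of the FIG. 3 embodiment by interacting with the slave processing nodes. The designated processing node may also be any slave processing node, in which case it performs the operations that steps 301-308 assign to a slave processing node and carries out the method by interacting with the master processing node.
In summary, in the embodiments of this application the master processing node monitors the plurality of collection nodes so that any of the processing nodes can obtain a collected data set from at least one collection node. The anomaly detection model trained directly on the obtained data sets reflects the regularities that separate normal data from abnormal data and learns the criterion that distinguishes them. Moreover, as the data keeps changing, the anomaly detection model stored in the shared storage space can be updated to follow those changes, which improves detection accuracy.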
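FIG. 4 is a flowchart of a data collection method provided by an embodiment of this application. The method is applied in the distributed data collection system of the embodiment shown in FIG. 1, which includes a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the processing nodes including one master processing node and at least one slave processing node. Referring to FIG. 4, the method includes the following steps:
Step 401: The master processing node monitors the plurality of collection nodes. When it detects that at least one of them has collected a data set, it sends a data acquisition instruction to at least one slave processing node, the instruction carrying the type identifier of the at least one collection node.
Step 402: Any processing node obtains the collected data set from the corresponding collection node according to the type identifier of the at least one collection node, the processing node being the master processing node or any slave processing node that has received a data acquisition instruction, and the data set including at least one record.
Step 403: Any processing node obtains the anomaly detection model from the storage space shared by the plurality of processing nodes, performs anomaly detection on the data set with the model, and determines the abnormal data in the data set.
On obtaining a data set, a processing node may fetch the anomaly detection model from the shared storage space. This is the model trained and stored in the FIG. 3 embodiment.
After fetching the model, the processing node performs anomaly detection on the data set with it to obtain an anomaly index for each record, and determines the records whose anomaly index falls within a preset range to be abnormal data.
It should be noted that the anomaly detection model contains a plurality of binary trees, and each binary tree partitions the key values of one attribute by the node values of its nodes. To check a record, the key value of the record on the designated attribute is fed into the model and each binary tree partitions it level by level until it reaches a branch node in the last level. The path length of the key value, that is, the tree height, is recorded. The anomaly index of the key value is then computed with formulas (1)-(3) below. The anomaly indices of all key values of the record are computed in the same way, and the anomaly index of the record is determined from them. If the anomaly index of the record falls within the preset range, the record is determined to be abnormal.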
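h(x) = e + c(n)    Formula (1)
where e is the number of nodes that data x passes through on its way from the root node of the binary tree to a branch node in the last level, and c(n) is a correction term;
c(n) = 2H(n-1) - (2(n-1)/n)    Formula (2)
where H(k) = ln(k) + ξ, and ξ is Euler's constant, taken as 0.5772156649;
[Formula (3), provided as image PCTCN2019111481-appb-000003]
where S(x,n) denotes the anomaly index of the current data;
[Image PCTCN2019111481-appb-000004]
S(x,n)→1 indicates that data x is more likely to be abnormal, and S(x,n)→0 indicates that data x is less likely to be abnormal. When the anomaly index S(x,n) of most of the data in the data set is close to 0.5, the data set as a whole shows no obvious anomaly.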
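Formulas (1) and (2) can be evaluated directly. Formula (3) appears only as an image in this text, so the sketch below assumes the score used by the standard Isolation Forest algorithm, S(x,n) = 2^(-E(h(x))/c(n)), where E(h(x)) is the mean path length over the trees. That form agrees with the limiting behaviour stated above, but it is an assumption, as are the function names.

```python
import math

EULER_GAMMA = 0.5772156649  # value of Euler's constant given in the text


def harmonic(k):
    """H(k) = ln(k) + Euler's constant, as defined under formula (2)."""
    return math.log(k) + EULER_GAMMA


def correction(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, formula (2); taken as 0 for n <= 1."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def path_length(e, n):
    """h(x) = e + c(n), formula (1)."""
    return e + correction(n)


def anomaly_index(path_lengths, n):
    """Assumed formula (3): 2 ** (-mean(h) / c(n)); requires n >= 2."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / correction(n))
```

A record whose mean path length equals c(n) scores exactly 0.5, shorter paths push the score towards 1, and longer paths push it towards 0.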
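In one possible implementation, the anomaly index of a record is determined by computing the mean of the anomaly indices of all its key values and taking that mean as the anomaly index of the record. The anomaly index of the record is then checked against the preset range, and if it lies within the range the record is determined to be abnormal.
The preset range may be set in advance, for example (0.6, 1), or another range such as (0.8, 1).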
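The averaging and range check just described amount to a few lines of Python. The sketch assumes the per-key anomaly indices are already available and treats the preset range as an open interval, as the notation (0.6, 1) suggests. The function names are hypothetical.

```python
def record_anomaly_index(key_indices):
    """Anomaly index of a record: mean of the anomaly indices of its key values."""
    return sum(key_indices) / len(key_indices)


def is_abnormal(index, low=0.6, high=1.0):
    """True if the index lies inside the open preset range (low, high)."""
    return low < index < high


def split_records(records_with_key_indices, low=0.6, high=1.0):
    """Separate records into (abnormal, normal) lists by their anomaly index."""
    abnormal, normal = [], []
    for record, key_indices in records_with_key_indices:
        index = record_anomaly_index(key_indices)
        (abnormal if is_abnormal(index, low, high) else normal).append((record, index))
    return abnormal, normal
```

Tightening the range to (0.8, 1) only requires passing `low=0.8`, which is how the alternative range mentioned above would be applied.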
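In the embodiments of this application, a processing node uses the approach above to run the anomaly detection model over every record of the obtained data set, quickly determines the anomaly index of every record, and from these determines the abnormal data in the data set, which completes anomaly detection on the data set.
Step 404: Any processing node stores the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data.
Each processing node corresponds to one first storage node among the plurality of storage nodes. When a processing node obtains abnormal data, it may store the data in its own first storage node. The first storage node keeps the detected abnormal data so that users can later search it and analyse the causes of the anomalies.
For example, the first storage node may be an Elasticsearch component, so that users can search the abnormal data through Elasticsearch and analyse the results. When storing abnormal data, the Elasticsearch component may assign it an index name used for storage and retrieval. The index name of the abnormal data may be, for example, anormal, or something else. Users can search the abnormal data by index name, and part of the search results may look as follows: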
{
"_index": "anormal",
"_type": "item",
"_id": "AWUTmL_XiommX4iKIZq7",
"_score": 1,
"_source": {"_data": [0.744680851, 0.744680851, 0.0246956, 0, 1, 0.71658592, 115], "anormalscore": "0.746989258992569"}
}
{
"_index": "anormal",
"_type": "item",
"_id": "AWUTfrijefhihji235_ht",
"_score": 1,
"_source": {"_data": [1, 1, 0, 0.29852985985, 0, 0.418926478, 0.0246956, 1, 0], "anormalscore": "0.7879261458"}
}
Here anormal is the index name of the abnormal data, _type is the index type, _id is the unique identifier of the current record, and _source contains one record of the obtained data set (_data) together with the anomaly index of that record (anormalscore).
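In one possible implementation, each processing node may store the normal data and the abnormal data of a data set separately in the first storage node.
Accordingly, different index names may be set for normal data and abnormal data so that each can be searched. For example, the index name of the abnormal data may be anormal and that of the normal data may be normal, or other names may be used.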
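The routing of records to the two index names can be sketched without any client library, by building the documents as plain dictionaries. The field names `_data` and `anormalscore` follow the search results shown above. The function name `to_index_documents` and the default range are assumptions, and actually submitting the documents to Elasticsearch is left out.

```python
def to_index_documents(records, anomaly_indices, low=0.6, high=1.0):
    """Build one document per record, choosing the index name by anomaly index."""
    documents = []
    for data, index in zip(records, anomaly_indices):
        # Records inside the preset range go to "anormal", the rest to "normal".
        index_name = "anormal" if low < index < high else "normal"
        documents.append({
            "_index": index_name,
            "_type": "item",
            "_source": {"_data": list(data), "anormalscore": str(index)},
        })
    return documents


documents = to_index_documents(
    [[0.744680851, 0.0246956, 115], [0.021270596, 0.021739, 4.165]],
    [0.746989258992569, 0.41],
)
```

Keeping the two kinds of records under separate index names means a search for abnormal data never has to filter out normal records.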
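It should be noted that the method of the FIG. 4 embodiment is applied in the distributed data collection system of the embodiment shown in FIG. 1. The designated processing node in that system may be the master processing node, in which case it performs the operations that steps 401-404 assign to the master processing node and carries out the method of the FIG. 4 embodiment by interacting with the slave processing nodes. The designated processing node may also be a slave processing node, in which case it performs the operations that steps 401-404 assign to a slave processing node and carries out the method by interacting with the master processing node.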
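In one possible implementation, the distributed data collection system of the embodiments of this application may include a plurality of Kafka modules, a plurality of Spark Streaming modules, a plurality of Elasticsearch modules and a plurality of HBase modules, the Spark Streaming modules including one master Spark Streaming module and at least one slave Spark Streaming module. The operations performed by the collection nodes in the FIG. 3 and FIG. 4 embodiments may be performed by the Kafka modules. The operations performed by the processing nodes may be performed by the Spark Streaming modules. The operations performed by the first storage node may be performed by the Elasticsearch modules, which also provide analysis and search functions so that users can search and analyse the abnormal data, assess the causes of the anomalies and carry on with subsequent work. The operations performed by the second storage node may be performed by the HBase modules.
Next, as shown in FIG. 5, the system is described taking as an example a distributed data collection system that includes one Kafka module, one Spark Streaming module, one Elasticsearch module and one HBase module.
The Spark Streaming module monitors the Kafka module. When it detects that the Kafka module has collected a data set, it fetches the collected data set from the Kafka module and stores it in the HBase module. It then uses the anomaly detection model, trained in advance on data sets fetched from the Kafka module, to perform anomaly detection on the newly fetched data set and determine the abnormal data in it. Finally, the Spark Streaming module stores the abnormal data in the Elasticsearch module.
In summary, in the embodiments of this application the master processing node monitors the plurality of collection nodes so that the processing nodes can fetch a data set as soon as at least one collection node has collected one. Each processing node then obtains the anomaly detection model from the shared storage space, uses it to perform anomaly detection on the data set and determine the abnormal data, and stores the abnormal data in a first storage node among the plurality of storage nodes for later use. In the distributed data collection system of this application, the anomaly detection model reflects the regularities that separate normal data from abnormal data, and because the processing nodes all use the model held in the shared storage space, they can run anomaly detection on massive data sets in parallel, which preserves detection accuracy while increasing detection speed. In addition, as the data keeps changing, the data collection method provided by the embodiments of this application can update the anomaly detection model held in the shared storage space to follow those changes, and detection with the updated model gives more accurate results.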
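FIG. 6 is a schematic structural diagram of a data collection apparatus provided by an embodiment of this application. Referring to FIG. 6, the apparatus is applied in a designated processing node of a distributed data collection system that includes a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes. The apparatus includes a first obtaining module 601, an anomaly detection module 602 and a first storage module 603.
The first obtaining module 601 is configured to obtain a collected data set from at least one of the plurality of collection nodes, the data set including at least one record;
the anomaly detection module 602 is configured to perform anomaly detection on the data set with an anomaly detection model and determine the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes;
the first storage module 603 is configured to store the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data.
Optionally, the plurality of storage nodes includes a second storage node corresponding to the designated processing node, the second storage node being used to store the data sets obtained by the designated processing node, and the apparatus further includes:
a second obtaining module, configured to obtain, from the second storage node, the data sets collected within a designated time period and use them as a first sample data set, where the designated time period starts at the moment data collection begins and lasts for a designated duration;
a training module, configured to train on the first sample data set to obtain an initial anomaly detection model.
Optionally, the apparatus further includes:
a third obtaining module, configured to obtain, from the second storage node, the data sets collected within a preset duration before the current moment and use them as a second sample data set;
an updating module, configured to continue training on the second sample data set to obtain an updated anomaly detection model.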
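Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
a fourth obtaining module, configured to obtain the anomaly detection model trained by the master processing node, and to receive the anomaly detection model trained by the at least one slave processing node;
a combining module, configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model;
a second storage module, configured to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the apparatus further includes:
a first sending module, configured to obtain the anomaly detection model trained by the designated processing node and send it to the master processing node, the master processing node being configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model, and to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
Optionally, the anomaly detection module 602 includes:
an anomaly detection submodule, configured to perform anomaly detection on the data set with the anomaly detection model to obtain an anomaly index for each record in the data set, and to determine the records whose anomaly index falls within a preset range to be abnormal data.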
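Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
a monitoring module, configured to monitor the plurality of collection nodes;
a second sending module, configured to send a data acquisition instruction to at least one slave processing node when it is detected that at least one of the plurality of collection nodes has collected a data set, the data acquisition instruction carrying the type identifier of the at least one collection node, and the at least one slave processing node being configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
Optionally, the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the first obtaining module includes:
a receiving submodule, configured to receive a data acquisition instruction sent by the master processing node, the data acquisition instruction carrying the type identifier of at least one collection node, and the master processing node being configured to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
an obtaining submodule, configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.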
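Optionally, the training module includes:
a building submodule, configured to build a plurality of binary trees from the first sample data set;
a combining submodule, configured to combine the plurality of binary trees to obtain the initial anomaly detection model;
each binary tree includes multiple levels of nodes, the first level includes one root node, and each node is connected to two branch nodes in the next level; each node contains one record of the first sample data set, and the node value of each node is the key value of that record on a designated attribute; each node sends the data whose key value on the designated attribute is less than the node value to a first branch node in the next level, and sends the data whose key value on the designated attribute is not less than the node value to a second branch node in the next level.
Optionally, the building submodule is further configured to:
randomly select any one of the data attributes of the first sample data set as the designated attribute;
randomly select one of the key values of the designated attribute as the node value of the root node, and add the record corresponding to that node value to the root node;
starting from the root node, send the data whose key value on the designated attribute is less than the node value of the current node to the first branch node in the level below the current node, and send the data whose key value on the designated attribute is not less than the node value of the current node to the second branch node in the level below the current node, until the node reached contains only one record, or contains several records with the same key value on the designated attribute, at which point one binary tree is obtained.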
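In the embodiments of this application, any processing node obtains a collected data set from at least one of the plurality of collection nodes, performs anomaly detection on the data set with an anomaly detection model to determine the abnormal data in it, and then stores the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being used to store detected abnormal data. Because the anomaly detection model is trained on the collected data sets, it reflects the regularities that separate normal data from abnormal data and learns the criterion that distinguishes them. When a processing node uses this model to detect and store the abnormal data in a newly obtained data set, the detection result therefore matches the truly abnormal data more closely, which raises the accuracy of anomaly detection and keeps subsequent work running normally.
It should be noted that the data collection apparatus of the above embodiment is described in terms of the functional modules listed above only by way of example. In practice, the functions may be distributed over different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the data collection apparatus and the data collection method embodiments share the same concept, and the details of the implementation are given in the method embodiments and are not repeated here.
FIG. 7 is a schematic structural diagram of a server provided by an embodiment of this application. The server 700 may vary considerably with configuration or performance and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where the memory 702 stores at least one instruction that is loaded and executed by the processor 701. The server 700 may of course also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here.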
The server 700 is configured to perform the operations performed by the processing node in the data collection method described above.
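In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory containing instructions that can be executed by a processor in the above terminal or server to carry out the data collection method of the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of this application and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.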

Claims (23)

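  1. A data collection method, characterized in that it is applied in a designated processing node of a distributed data collection system, the distributed data collection system including a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the method including:
    obtaining a collected data set from at least one of the plurality of collection nodes, the data set including at least one record;
    performing anomaly detection on the data set with an anomaly detection model and determining the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes;
    storing the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being configured to store detected abnormal data.
  2. The method of claim 1, characterized in that the plurality of storage nodes includes a second storage node corresponding to the designated processing node, the second storage node being configured to store the data sets obtained by the designated processing node, and the method further includes:
    obtaining, from the second storage node, the data sets collected within a designated time period and using them as a first sample data set, where the designated time period starts at the moment data collection begins and lasts for a designated duration;
    training on the first sample data set to obtain an initial anomaly detection model.
  3. The method of claim 2, characterized in that the method further includes:
    obtaining, from the second storage node, the data sets collected within a preset duration before the current moment and using them as a second sample data set;
    continuing training on the second sample data set to obtain an updated anomaly detection model.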
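  4. The method of any one of claims 1-3, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the method further includes:
    obtaining the anomaly detection model trained by the master processing node, and receiving the anomaly detection model trained by the at least one slave processing node;
    combining the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model;
    storing the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
  5. The method of any one of claims 1-3, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the method further includes:
    obtaining the anomaly detection model trained by the designated processing node and sending it to the master processing node, the master processing node being configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model, and to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
  6. The method of claim 1, characterized in that performing anomaly detection on the data set with the anomaly detection model and determining the abnormal data in the data set includes:
    performing anomaly detection on the data set with the anomaly detection model to obtain an anomaly index for each record in the data set, and determining the records whose anomaly index falls within a preset range to be abnormal data.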
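  7. The method of claim 1, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, before the collected data set is obtained from at least one of the plurality of collection nodes, the method further includes:
    monitoring the plurality of collection nodes;
    when it is detected that at least one of the plurality of collection nodes has collected a data set, sending a data acquisition instruction to at least one slave processing node, the data acquisition instruction carrying the type identifier of the at least one collection node, and the at least one slave processing node being configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
  8. The method of claim 1, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, obtaining the collected data set from at least one of the plurality of collection nodes includes:
    receiving a data acquisition instruction sent by the master processing node, the data acquisition instruction carrying the type identifier of at least one collection node, and the master processing node being configured to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
    obtaining the collected data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.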
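  9. The method of claim 2, characterized in that training on the first sample data set to obtain the initial anomaly detection model includes:
    building a plurality of binary trees from the first sample data set, and combining the plurality of binary trees to obtain the initial anomaly detection model;
    each binary tree includes multiple levels of nodes, the first level includes one root node, and each node is connected to two branch nodes in the next level; each node contains one record of the first sample data set, and the node value of each node is the key value of that record on a designated attribute; each node sends the data whose key value on the designated attribute is less than the node value to a first branch node in the next level, and sends the data whose key value on the designated attribute is not less than the node value to a second branch node in the next level.
  10. The method of claim 9, characterized in that building the plurality of binary trees from the first sample data set includes:
    randomly selecting any one of the data attributes of the first sample data set as the designated attribute;
    randomly selecting one of the key values of the designated attribute as the node value of the root node, and adding the record corresponding to that node value to the root node;
    starting from the root node, sending the data whose key value on the designated attribute is less than the node value of the current node to the first branch node in the level below the current node, and sending the data whose key value on the designated attribute is not less than the node value of the current node to the second branch node in the level below the current node, until the node reached contains only one record, or contains several records with the same key value on the designated attribute, at which point one binary tree is obtained.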
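  11. A data collection apparatus, characterized in that it is applied in a designated processing node of a distributed data collection system, the distributed data collection system including a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the apparatus including:
    a first obtaining module, configured to obtain a collected data set from at least one of the plurality of collection nodes, the data set including at least one record;
    an anomaly detection module, configured to perform anomaly detection on the data set with an anomaly detection model and determine the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes;
    a first storage module, configured to store the abnormal data in a first storage node among the plurality of storage nodes, the first storage node being configured to store detected abnormal data.
  12. The apparatus of claim 11, characterized in that the plurality of storage nodes includes a second storage node corresponding to the designated processing node, the second storage node being configured to store the data sets obtained by the designated processing node, and the apparatus further includes:
    a second obtaining module, configured to obtain, from the second storage node, the data sets collected within a designated time period and use them as a first sample data set, where the designated time period starts at the moment data collection begins and lasts for a designated duration;
    a training module, configured to train on the first sample data set to obtain an initial anomaly detection model.
  13. The apparatus of claim 11, characterized in that the apparatus further includes:
    a third obtaining module, configured to obtain, from the second storage node, the data sets collected within a preset duration before the current moment and use them as a second sample data set;
    an updating module, configured to continue training on the second sample data set to obtain an updated anomaly detection model.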
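  14. The apparatus of any one of claims 11-13, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
    a fourth obtaining module, configured to obtain the anomaly detection model trained by the master processing node, and to receive the anomaly detection model trained by the at least one slave processing node;
    a combining module, configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model;
    a second storage module, configured to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
  15. The apparatus of any one of claims 11-13, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the apparatus further includes:
    a first sending module, configured to obtain the anomaly detection model trained by the designated processing node and send it to the master processing node, the master processing node being configured to combine the anomaly detection models trained by the plurality of processing nodes to obtain a combined anomaly detection model, and to store the combined anomaly detection model in a storage space shared by the plurality of processing nodes, for each processing node to use in anomaly detection.
  16. The apparatus of claim 11, characterized in that the anomaly detection module includes:
    an anomaly detection submodule, configured to perform anomaly detection on the data set with the anomaly detection model to obtain an anomaly index for each record in the data set, and to determine the records whose anomaly index falls within a preset range to be abnormal data.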
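  17. The apparatus of claim 11, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is the master processing node, the apparatus further includes:
    a monitoring module, configured to monitor the plurality of collection nodes;
    a second sending module, configured to send a data acquisition instruction to at least one slave processing node when it is detected that at least one of the plurality of collection nodes has collected a data set, the data acquisition instruction carrying the type identifier of the at least one collection node, and the at least one slave processing node being configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the received data acquisition instruction.
  18. The apparatus of claim 11, characterized in that the plurality of processing nodes includes one master processing node and at least one slave processing node, and when the designated processing node is any slave processing node, the first obtaining module includes:
    a receiving submodule, configured to receive a data acquisition instruction sent by the master processing node, the data acquisition instruction carrying the type identifier of at least one collection node, and the master processing node being configured to send the data acquisition instruction when it detects that the at least one collection node has collected a data set;
    an obtaining submodule, configured to obtain the collected data set from the corresponding collection node according to the type identifier carried in the data acquisition instruction.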
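  19. The apparatus of claim 12, characterized in that the training module includes:
    a building submodule, configured to build a plurality of binary trees from the first sample data set;
    a combining submodule, configured to combine the plurality of binary trees to obtain the initial anomaly detection model;
    each binary tree includes multiple levels of nodes, the first level includes one root node, and each node is connected to two branch nodes in the next level; each node contains one record of the first sample data set, and the node value of each node is the key value of that record on a designated attribute; each node sends the data whose key value on the designated attribute is less than the node value to a first branch node in the next level, and sends the data whose key value on the designated attribute is not less than the node value to a second branch node in the next level.
  20. The apparatus of claim 19, characterized in that the building submodule is further configured to:
    randomly select any one of the data attributes of the first sample data set as the designated attribute;
    randomly select one of the key values of the designated attribute as the node value of the root node, and add the record corresponding to that node value to the root node;
    starting from the root node, send the data whose key value on the designated attribute is less than the node value of the current node to the first branch node in the level below the current node, and send the data whose key value on the designated attribute is not less than the node value of the current node to the second branch node in the level below the current node, until the node reached contains only one record, or contains several records with the same key value on the designated attribute, at which point one binary tree is obtained.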
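  21. A processing node, characterized in that it is applied in a distributed data collection system, the distributed data collection system including a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes, the processing node being any processing node in the distributed data collection system;
    the processing node includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the data collection method of any one of claims 1 to 10.
  22. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, the at least one instruction being loaded and executed by a processor to implement the data collection method of any one of claims 1 to 10.
  23. A distributed data collection system, characterized in that the distributed data collection system includes a plurality of collection nodes, a plurality of processing nodes and a plurality of storage nodes;
    the plurality of collection nodes are configured to collect data sets, each data set including at least one record;
    any one of the plurality of processing nodes is configured to obtain a collected data set from at least one of the plurality of collection nodes;
    that processing node is further configured to perform anomaly detection on the data set with an anomaly detection model and determine the abnormal data in the data set, the anomaly detection model being trained on data sets obtained from at least one of the plurality of collection nodes;
    that processing node is further configured to store the abnormal data in a first storage node among the plurality of storage nodes;
    the first storage node is configured to store detected abnormal data.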
PCT/CN2019/111481 2018-10-18 2019-10-16 数据采集方法、装置、存储介质及系统 WO2020078385A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811215823.0A CN111078488B (zh) 2018-10-18 2018-10-18 数据采集方法、装置、存储介质及系统
CN201811215823.0 2018-10-18

Publications (1)

Publication Number Publication Date
WO2020078385A1 true WO2020078385A1 (zh) 2020-04-23

Family

ID=70283367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111481 WO2020078385A1 (zh) 2018-10-18 2019-10-16 数据采集方法、装置、存储介质及系统

Country Status (2)

Country Link
CN (1) CN111078488B (zh)
WO (1) WO2020078385A1 (zh)


Also Published As

Publication number Publication date
CN111078488A (zh) 2020-04-28
CN111078488B (zh) 2021-11-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19873339

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19873339

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.12.2021)
