CN113887674B - Abnormal behavior detection method and system based on big data - Google Patents

Info

Publication number
CN113887674B
CN113887674B CN202111474046.3A CN202111474046A CN113887674B CN 113887674 B CN113887674 B CN 113887674B CN 202111474046 A CN202111474046 A CN 202111474046A CN 113887674 B CN113887674 B CN 113887674B
Authority
CN
China
Prior art keywords
binary
forest
data
hyperplane
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111474046.3A
Other languages
Chinese (zh)
Other versions
CN113887674A (en)
Inventor
邵俊
张孜勉
万友平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd
Priority to CN202111474046.3A
Publication of CN113887674A
Application granted
Publication of CN113887674B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/245 Classification techniques relating to the decision surface
    • G06F18/2451 Classification techniques relating to the decision surface linear, e.g. hyperplane
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for detecting abnormal behaviors based on big data, wherein the method comprises the following steps: acquiring mass data to be trained, and importing the mass data to be trained into a first memory; preprocessing massive data to be trained in the first memory to obtain a first data set; generating a binary forest based on the first data set, the binary forest being a set consisting of a number of binary trees, the binary trees being obtained based on a subset of the first data set; and detecting abnormal behaviors based on the binary forest. The invention can improve the generation efficiency of the effective hyperplane, and eliminate the hyperplane with lower precision by grading the hyperplane so as to reduce the memory occupation and improve the operation speed, stability and interpretability.

Description

Abnormal behavior detection method and system based on big data
Technical Field
The invention belongs to the field of anomaly detection methods, and particularly relates to a method and a system for detecting abnormal behaviors based on big data.
Background
In fields such as manufacturing, medical treatment and finance, automatic anomaly detection often has to be carried out on massive data. Finding outliers across a large number of data samples and data dimensions helps to quickly identify samples in which anomalies may exist. Since such outlier samples tend to have few labels, the mainstream, well-performing models are still unsupervised models, such as the iForest model, which is widely used in industry because it is efficient and does not depend on a specific data distribution. For example, the existing Chinese patent No. ZL202010025249.3 discloses an SMT solder joint defect detection method based on iForest model verification. That method computes local binary pattern values and performs edge detection on image samples to obtain binary pattern texture feature vectors, obtains accurate training samples by screening out abnormal samples with a constructed and verified isolation forest model, constructs an accurate BP neural network model from the training samples, and thereby obtains the defect detection result for a solder joint. The existing patent can thus use image processing technology and the fast, accurate partitioning of the isolation forest model to screen sample data and improve its accuracy, complete the quality evaluation of solder joint pictures through the constructed BP neural network model, and accurately control solder joint formation.
However, the following problems remain: first, each split of a decision tree can use only a single feature, which often makes abnormal points difficult to isolate within a limited tree depth, so that the result deviates considerably; second, the algorithm needs to construct a large number of random trees, which occupies a large amount of memory.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method for detecting abnormal behaviors based on big data, which comprises the following steps:
acquiring mass data to be trained, and importing the mass data to be trained into a first memory;
preprocessing massive data to be trained in the first memory to obtain a first data set;
generating a binary forest based on the first data set, the binary forest being a set consisting of a number of binary trees, the binary trees being obtained based on a subset of the first data set;
and detecting abnormal behaviors based on the binary forest.
Further, the first data set includes N training samples and m feature data.
Further, the obtaining of the binary tree based on the subset of the first data set specifically includes:
randomly selecting n samples from the N training samples, wherein N is greater than or equal to n;
allocating a first weight probability to the m feature data;
randomly generating a hyperplane based on the n samples based on the first weight probability;
and dividing the n samples based on the hyperplane to form a binary tree.
Further, assigning a first weight probability to the m feature data specifically includes:
calculating the dimension entropy of each feature data, denoting the dimension entropy of feature d_i (1 ≤ i ≤ m) as ent(d_i);
for B randomly selected samples p, cutting the values of feature d_i equally into bin groups, and counting the number of samples b_j in each group j;
the dimension entropy corresponding to the feature data is as follows:
[formula not reproduced in the source text]
and the dimension entropy is used as the first weight probability.
Further, the randomly generating a hyperplane based on the n samples based on the first weight probability specifically includes:
constructing a three-dimensional hyperplane, wherein the hyperplane equation for dividing is ax + by + cz + u = 0;
among the n samples, those satisfying ax + by + cz + u < 0 are assigned to the left subtree;
those satisfying ax + by + cz + u > 0 are assigned to the right subtree;
each time a hyperplane division is constructed, the probability that feature d_i is selected as feature data in the hyperplane is [formula not reproduced in the source text];
coefficients are then randomly assigned to each of the selected feature data to obtain the hyperplane, wherein the depth of the binary tree does not exceed the value [formula not reproduced in the source text].
Further, the hyperplane includes any number of dimensions, and each hyperplane can divide a node of a decision tree into left and right subtrees.
Further, the hyperplane division specifically includes:
for a training sample p, dividing according to the above method of dividing the left subtree and the right subtree by the hyperplane; if the node where the training sample p is located contains only that one sample point, the division is considered complete.
Further, the generating a binary forest based on the first data set specifically includes:
randomly generating a plurality of binary trees based on the N training samples to form a first binary forest;
for any binary tree r of the first binary forest, dividing a first data set containing known abnormal points by using the binary tree r, denoting the set of known abnormal points as {q1, q2, ..., qs}, and recording as an abnormal point any sample point whose node in the division result contains only one sample; if the abnormal point qk is successfully identified by the tree, r_k = 1, otherwise r_k = 0; the score of the binary tree r is:
[formula not reproduced in the source text]
where s represents the number of known abnormal points, qk represents the k-th abnormal point, k ∈ [1, s], and S_r represents the capacity of the binary tree r to identify abnormal points, the value being larger the more abnormal points are identified;
and sorting the scores of all binary trees in the first binary forest from high to low, retaining only the n highest-scoring binary trees, and removing the remaining trees from the binary forest to obtain a second binary forest.
Further, the detecting abnormal behavior based on the binary forest specifically includes:
acquiring the depth L(p) of the leaf node where a sample p is located in each binary tree of the second binary forest;
scoring the abnormal degree of the sample point p, wherein the scoring formula is as follows:
[formula not reproduced in the source text]
wherein E(L(p)) represents the average value of the depths of the leaf nodes where the sample p is located in each binary tree of the second binary forest, c(n) = 2 × (ln(n-1) + 0.5772156649) - 2 × (n-1)/n, and n represents the n-th tree after sorting in the second binary forest;
and, according to the score obtained by the formula, the sample is considered a normal point if the score is less than 0.5.
The invention also provides a system for detecting abnormal behaviors based on big data, which comprises:
the data acquisition module is used for acquiring mass data to be trained;
the first memory is used for storing the mass data to be trained;
the preprocessing module is used for preprocessing massive data to be trained in the first memory to obtain a first data set;
a model generation module configured to generate a binary forest based on the first data set, the binary forest being a set composed of a number of binary trees, the binary trees being obtained based on a subset of the first data set;
and the abnormal detection module is used for detecting abnormal behaviors based on the binary forest.
Compared with the prior art, the invention splits with hyperplanes and introduces dimension entropy to preselect features, which improves the generation efficiency of effective hyperplanes; it also scores the hyperplanes and eliminates those with low precision, which reduces memory occupation and improves operation speed, stability and interpretability.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flow diagram illustrating a method for big data based abnormal behavior detection, according to an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating a binary tree according to an embodiment of the invention;
FIG. 3 is a schematic diagram illustrating hyperplane division according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a system for big data based abnormal behavior detection, according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
As shown in fig. 1, the invention discloses a method for detecting abnormal behavior based on big data, comprising the following steps:
acquiring mass data to be trained, and importing the mass data to be trained into a first memory;
preprocessing massive data to be trained in the first memory to obtain a first data set;
generating a binary forest based on the first data set, the binary forest being a set consisting of a number of binary trees, the binary trees being obtained based on a subset of the first data set;
and detecting abnormal behaviors based on the binary forest.
Embodiment 2
The embodiment provides a method for detecting abnormal behaviors based on big data, which comprises the following steps:
acquiring mass data to be trained, and importing the mass data to be trained into a first memory;
preprocessing massive data to be trained in a first memory to obtain a first data set, wherein the first data set comprises N training samples and m characteristic data; preferably, the preprocessing comprises the digitization of the type data and the elimination of missing samples;
generating a binary forest based on the first data set, wherein the binary forest is a set formed by a plurality of binary trees, and the binary trees are acquired based on the subset of the first data set;
and detecting abnormal behaviors based on the binary forest.
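The preprocessing named in the steps above (digitization of type data and elimination of missing samples) could be sketched as follows. This is only an illustration: the function name `preprocess`, the use of `None` to mark a missing value, and the integer coding of categorical values in order of first appearance are all assumptions that the source does not specify.

```python
from typing import Any, Dict, List, Optional, Sequence


def preprocess(rows: Sequence[Sequence[Optional[Any]]]) -> List[List[float]]:
    """Drop samples with missing values, then turn categorical values into numbers.

    Assumptions (not from the source): a missing value is None, and each
    categorical column is coded 0, 1, 2, ... in order of first appearance.
    """
    # Elimination of missing samples: keep only fully populated rows.
    complete = [list(r) for r in rows if all(v is not None for v in r)]

    # Digitization of type (categorical) data, with one code table per column.
    codes: Dict[int, Dict[Any, int]] = {}
    result: List[List[float]] = []
    for row in complete:
        numeric_row: List[float] = []
        for col, value in enumerate(row):
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                numeric_row.append(float(value))
            else:
                table = codes.setdefault(col, {})
                if value not in table:
                    table[value] = len(table)
                numeric_row.append(float(table[value]))
        result.append(numeric_row)
    return result
```

The result corresponds to the first data set: N remaining rows, each with m numeric feature values.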
As shown in fig. 2, the present embodiment will be described in detail by the following contents, so as to facilitate understanding of the formation process of the binary forest in the present embodiment. The obtaining of the binary tree based on the subset of the first data set in this embodiment may specifically include:
randomly selecting n samples from the N training samples, wherein N is greater than or equal to n;
allocating a first weight probability to the m characteristic data;
randomly generating a hyperplane based on the n samples based on the first weight probability;
and dividing the n samples based on the hyperplane to form a binary tree.
In this embodiment, when generating the hyperplane, assigning a first weight probability to the m feature data may specifically include:
calculating the dimension entropy of each feature data, denoting the dimension entropy of feature d_i (1 ≤ i ≤ m) as ent(d_i);
for B randomly selected samples p, cutting the values of feature d_i equally into bin groups, and counting the number of samples b_j in each group j;
the dimension entropy corresponding to the feature data is as follows:
[formula not reproduced in the source text]
and the dimension entropy is used as the first weight probability.
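The binning and counting steps above can be illustrated with a short sketch. Since the entropy formula itself is not reproduced in the source text, the code assumes the usual Shannon form, ent(d_i) = -Σ_j (b_j / B) · ln(b_j / B); that form, the natural logarithm, equal-width bins, and the function names are assumptions rather than the patent's stated method.

```python
import math
import random
from typing import List, Sequence


def dimension_entropy(values: Sequence[float], bins: int) -> float:
    """Entropy of one feature after cutting its value range into equal-width bins.

    Assumed form: -sum_j (b_j / B) * ln(b_j / B), with b_j the count in bin j
    and B = len(values).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # every sample lands in the same group
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        j = min(int((v - lo) / width), bins - 1)  # put the maximum into the last bin
        counts[j] += 1
    total = len(values)
    return -sum((b / total) * math.log(b / total) for b in counts if b > 0)


def feature_entropies(samples: Sequence[Sequence[float]], bins: int,
                      sample_size: int, rng: random.Random) -> List[float]:
    """Compute ent(d_i) for every feature from B randomly selected samples."""
    subset = rng.sample(list(samples), min(sample_size, len(samples)))
    m = len(subset[0])
    return [dimension_entropy([row[i] for row in subset], bins) for i in range(m)]
```

A feature whose values spread evenly over the bins gets the largest entropy, while a constant feature gets 0 and would never be favored when building a hyperplane.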
Embodiment 3
On the basis of the second embodiment, the present embodiment may further include the following:
referring to fig. 3, in the present embodiment, after the first weight probability is obtained, the hyperplane is generated by the first weight probability. The generated hyperplane comprises any number of dimensions, and each hyperplane can divide the nodes of one decision tree into a left subtree and a right subtree. When the hyperplane is used for decision tree division, each hyperplane divides the nodes of the decision tree downwards once. The node may include a plurality of sample points, or may have only one sample point or no sample points. The generated hyperplane is illustrated in this embodiment using three dimensions as an example. In an application scenario, based on the first weight probability, randomly generating a hyperplane based on n samples may specifically include:
constructing a three-dimensional hyperplane, wherein an equation for dividing the hyperplane is ax + by + cz + u = 0;
among the n samples, those satisfying ax + by + cz + u < 0 are assigned to the left subtree;
those satisfying ax + by + cz + u > 0 are assigned to the right subtree;
each time a hyperplane division is constructed, the probability that feature d_i is selected as feature data in the hyperplane is [formula not reproduced in the source text];
coefficients are then randomly assigned to each of the selected feature data to obtain the hyperplane, wherein the depth of the binary tree does not exceed the value [formula not reproduced in the source text].
The depth represents the number of levels of the binary tree. For example, as shown in fig. 3, the node containing samples a, b, c, d, e and the node containing samples p, q, r are in the first layer and have a depth of 1, while the node containing samples a, b, c, the node containing samples d, e, the node containing sample p and the node containing samples q, r are in the second layer and have a depth of 2.
The generation process of the hyperplane is further described below to facilitate understanding. In a practical application scenario, a random operator is used to construct a random number between 0 and 1 [symbol not reproduced in the source text]; if the random number satisfies [condition not reproduced in the source text], the feature [symbol not reproduced in the source text] is selected, otherwise it is not selected;
suppose that the a selected features are [symbols not reproduced in the source text], ..., dma;
a + 1 random numbers between 0 and 1 are constructed using the random operator [symbols not reproduced in the source text];
the generated hyperplane is [formula not reproduced in the source text].
For example, if the selected features are [symbols not reproduced in the source text], a three-dimensional hyperplane is generated;
random numbers between 0 and 1 are constructed using the random operator [symbols not reproduced in the source text];
the generated hyperplane is [formula not reproduced in the source text].
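The generation steps above can be sketched in code under several stated assumptions. The selection condition and the hyperplane formula are not reproduced in the source text, so this sketch assumes that a feature is selected when its random number is below a caller-supplied selection probability, that the first a random numbers are used directly as coefficients, and that the (a + 1)-th is used as the offset u. The names `random_hyperplane`, `signed_value` and `goes_left` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A hyperplane is stored as ({feature index: coefficient}, offset u).
Hyperplane = Tuple[Dict[int, float], float]


def random_hyperplane(select_probs: Sequence[float], rng: random.Random) -> Hyperplane:
    """Select features by probability, then draw a + 1 random numbers in [0, 1)."""
    if not any(p > 0 for p in select_probs):
        raise ValueError("at least one feature needs a positive selection probability")
    selected: List[int] = []
    while not selected:  # assumption: redraw when no feature was selected
        selected = [i for i, p in enumerate(select_probs) if rng.random() < p]
    coeffs = {i: rng.random() for i in selected}  # one coefficient per selected feature
    offset = rng.random()                         # the (a + 1)-th number, used as u
    return coeffs, offset


def signed_value(plane: Hyperplane, sample: Sequence[float]) -> float:
    """Left-hand side of the hyperplane equation evaluated at one sample."""
    coeffs, offset = plane
    return sum(c * sample[i] for i, c in coeffs.items()) + offset


def goes_left(plane: Hyperplane, sample: Sequence[float]) -> bool:
    """Negative values go left; all others go right (handling of ties is an assumption)."""
    return signed_value(plane, sample) < 0
```

Because every drawn number is non-negative, this literal reading only separates data that has been centered or otherwise takes negative values. How the source maps the random numbers onto the final coefficients cannot be recovered from the text, so the mapping here should be treated as a placeholder.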
In this embodiment, the division by the hyperplane may specifically include:
for a training sample p, dividing according to the above method of dividing the left subtree and the right subtree by the hyperplane; if the node where the training sample p is located contains only that one sample point, the division is considered complete.
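The division rule above, combined with the depth limit mentioned earlier, amounts to a recursive construction that might be sketched as follows. The callback `new_rule` is a hypothetical stand-in for the random hyperplane generator; `max_depth` is passed in because the depth-limit formula is not reproduced in the source text; numbering the root as depth 0 (so its children are at depth 1, as in the fig. 3 description) and turning a node into a leaf when a hyperplane fails to separate it are further assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Sample = Sequence[float]
SplitRule = Callable[[Sample], float]  # signed value of a sample w.r.t. a hyperplane


@dataclass
class Node:
    samples: List[Sample]
    depth: int
    rule: Optional[SplitRule] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(samples: Sequence[Sample],
               new_rule: Callable[[List[Sample]], SplitRule],
               max_depth: int, depth: int = 0) -> Node:
    """Split recursively until a node holds one sample or the depth limit is reached."""
    node = Node(list(samples), depth)
    if len(node.samples) <= 1 or depth >= max_depth:
        return node
    rule = new_rule(node.samples)
    values = [rule(s) for s in node.samples]
    left = [s for s, v in zip(node.samples, values) if v < 0]
    right = [s for s, v in zip(node.samples, values) if v >= 0]
    if not left or not right:
        return node  # assumption: a hyperplane that separates nothing ends the branch
    node.rule = rule
    node.left = build_tree(left, new_rule, max_depth, depth + 1)
    node.right = build_tree(right, new_rule, max_depth, depth + 1)
    return node


def leaf_depth(node: Node, sample: Sample) -> int:
    """Depth L(p) of the leaf node in which a sample ends up."""
    while node.rule is not None:
        node = node.left if node.rule(sample) < 0 else node.right
    return node.depth
```

A sample that is isolated after few splits ends in a shallow leaf, which is the property the later scoring step relies on.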
The embodiment divides the samples in the first data set to obtain a binary forest formed by binary trees. In an application scenario, the generating a binary forest based on the first data set in this embodiment may specifically include:
randomly generating a plurality of binary trees based on the N training samples to form a first binary forest;
for any binary tree r of the first binary forest, dividing a first data set containing known abnormal points by using the binary tree r, denoting the set of known abnormal points as {q1, q2, ..., qs}, and recording as an abnormal point any sample point whose node in the division result contains only one sample; if the abnormal point qk is successfully identified by the tree, r_k = 1, otherwise r_k = 0; the score of the binary tree r is:
[formula not reproduced in the source text]
where s represents the number of known abnormal points, qk represents the k-th abnormal point, k ∈ [1, s], and S_r represents the capacity of the binary tree r to identify abnormal points, the value being larger the more abnormal points are identified;
and sorting the scores of all binary trees in the first binary forest from high to low, retaining only the n highest-scoring binary trees, and removing the remaining trees from the binary forest to obtain a second binary forest.
When the binary tree r is used for dividing, the partitioning stops once the dividing depth reaches the preset value ([formula not reproduced in the source text]), and sample points whose node contains only one sample are recorded as abnormal points. Successful identification means that a sample point marked as an abnormal point in the division result belongs to the known abnormal point set; in that case the corresponding abnormal point qk in the abnormal point set is output.
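The scoring and pruning of trees described above might be sketched as follows. The score formula is not reproduced in the source text, so the sketch assumes S_r is the fraction of known abnormal points that tree r identifies, (1/s) · Σ r_k; since s is the same for every tree, ranking by this fraction gives the same order as ranking by the plain sum. The `identifies` callback, which decides whether a tree isolates a given point, is a hypothetical stand-in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # whatever object represents one binary tree


def tree_score(tree: T, known_outliers: Sequence[object],
               identifies: Callable[[T, object], bool]) -> float:
    """Assumed score S_r: share of known abnormal points the tree identifies."""
    r = [1 if identifies(tree, q) else 0 for q in known_outliers]  # r_k per outlier q_k
    return sum(r) / len(r)


def prune_forest(trees: Sequence[T], known_outliers: Sequence[object],
                 identifies: Callable[[T, object], bool], keep: int) -> List[T]:
    """Sort trees by score from high to low and retain only the top `keep` trees."""
    ranked = sorted(trees,
                    key=lambda t: tree_score(t, known_outliers, identifies),
                    reverse=True)
    return ranked[:keep]
```

Discarding the low-scoring trees is what reduces the memory footprint of the second binary forest relative to the first.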
Embodiment 4
On the basis of the foregoing embodiment, the present embodiment may further include the following:
after the binary forest is obtained, the abnormal behavior detection of the mass data to be trained can be completed through the obtained binary forest. In an application scenario, the abnormal behavior detection performed based on the binary forest in this embodiment may specifically include:
acquiring the depth L(p) of the leaf node where a sample p is located in each binary tree of the second binary forest;
scoring the abnormal degree of the sample point p, wherein the scoring formula is as follows:
[formula not reproduced in the source text]
wherein E(L(p)) represents the average value of the depths of the leaf nodes where the sample p is located in each binary tree of the second binary forest, c(n) = 2 × (ln(n-1) + 0.5772156649) - 2 × (n-1)/n, and n represents the n-th tree after sorting in the second binary forest;
and, according to the score obtained by the formula, the sample is considered a normal point if the score is less than 0.5.
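The scoring step can be illustrated with a short sketch. The scoring formula itself is not reproduced in the source text; the code assumes the form used by the standard isolation forest, score = 2^(-E(L(p)) / c(n)), which is consistent with the c(n) expression and the 0.5 threshold given above but is not confirmed by the source. The function names are hypothetical, and n is simply passed through as an argument to c(n).

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # constant as written in the text


def c(n: int) -> float:
    """Normalizing term c(n) = 2 * (ln(n - 1) + 0.5772156649) - 2 * (n - 1) / n."""
    if n < 2:
        raise ValueError("c(n) is only defined here for n >= 2")
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(leaf_depths: Sequence[float], n: int) -> float:
    """Assumed score 2 ** (-E(L(p)) / c(n)), with E(L(p)) the mean leaf depth."""
    mean_depth = sum(leaf_depths) / len(leaf_depths)
    return 2.0 ** (-mean_depth / c(n))


def is_normal(score: float, threshold: float = 0.5) -> bool:
    """A sample is treated as a normal point when its score is below 0.5."""
    return score < threshold
```

Under this assumed form, samples isolated near the root score close to 1, and samples whose mean depth exceeds c(n) score below 0.5 and are treated as normal.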
When the binary forest is constructed from the mass data to be trained, the data includes users and the behavior information corresponding to each user, where a user is a single individual and the behavior information may be specific behavior and action information of the user or specific operation information of the user. Whether a user exhibits abnormal behavior or abnormal operation can be judged through the constructed binary forest, so that abnormal users can be identified. When abnormal behavior detection is performed on mass data to be detected, the behavior information corresponding to a user may be any action in the behavior sequence "enter bank - insert card - input password - withdraw money - leave bank" followed when the user withdraws money from an ATM, such as withdrawing money, or may be an action outside that sequence, such as pulling down a hat brim or avoiding a camera. The behavior information may also be operation information generated when the user shops on a shopping website, which can be acquired through the website's background records, or operation information generated when the user withdraws money from an ATM, which can be acquired through the ATM's background records and a camera arranged on the ATM. Activity information of the user in a public place can likewise be obtained through a camera installed in that place.
According to the embodiment of the invention, the dimension entropy is introduced by a method for splitting the hyperplane to pre-select the characteristics, so that the generation efficiency of the effective hyperplane is improved, and the hyperplane with lower precision is removed by grading the hyperplane so as to reduce the memory occupation and improve the operation speed, stability and interpretability.
Embodiment 5
As shown in fig. 4, an embodiment of the present invention further provides a system for detecting abnormal behavior based on big data, which may include:
the data acquisition module is used for acquiring mass data to be trained;
the first memory is used for storing the mass data to be trained;
the preprocessing module is used for preprocessing massive data to be trained in the first memory to obtain a first data set;
a model generation module for generating a binary forest based on the first data set, the binary forest being a set composed of a plurality of binary trees, the binary trees being obtained based on the subset of the first data set;
and the anomaly detection module is used for detecting the anomaly behaviors based on the binary forest.
Embodiment 6
The disclosed embodiments provide a non-volatile computer storage medium having stored thereon computer-executable instructions that may perform the method steps as described in the embodiments above.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The foregoing describes preferred embodiments of the present invention. It is intended to explain the spirit of the invention clearly and concisely, not to limit it; all modifications, substitutions, and alterations falling within the spirit and scope of the invention as defined by the appended claims are included.

Claims (8)

1. A big data-based abnormal behavior detection method is characterized by comprising the following steps:
acquiring mass data to be trained, and importing the mass data to be trained into a first memory;
preprocessing mass data to be trained in the first memory to obtain a first data set, wherein the first data set comprises N training samples and m characteristic data;
generating a binary forest based on the first data set, the binary forest being a set consisting of a number of binary trees, the binary trees being obtained based on a subset of the first data set;
detecting abnormal behaviors based on the binary forest;
generating a binary forest based on the first data set, specifically comprising:
randomly generating a plurality of binary trees based on the N training samples to form a first binary forest;
for any binary tree r of the first binary forest, dividing a first data set containing known abnormal points by using the binary tree r, the set of known abnormal points being denoted {q1, q2, ..., qs}, and treating a sample point as an abnormal point when the node containing it in the division result holds only that one sample point; if the abnormal point qk is successfully identified by the tree, r_k = 1, otherwise r_k = 0, and the score of the binary tree r is:
[Formula image DEST_PATH_IMAGE001 not reproduced: score S_r of binary tree r]
where s represents the number of known abnormal points, qk represents the k-th abnormal point, k ∈ [1, s], and S_r represents the ability of the binary tree r to identify abnormal points, the value being larger the more abnormal points are identified;
and sorting the scores of all binary trees in the first binary forest from high to low, retaining only the n binary trees with the highest scores, and removing the remaining trees from the binary forest to obtain a second binary forest.
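The pruning step of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the score formula is only available as an image, so the sketch assumes S_r is the fraction of known abnormal points that tree r isolates, which is consistent with the statement that the value grows with the number of identified points. The names `tree_score` and `prune_forest`, and the representation of each tree by the set of points it isolated, are hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def tree_score(isolated: Set[Hashable], known_outliers: Sequence[Hashable]) -> float:
    """Assumed form of S_r: the share of known abnormal points q_k with r_k = 1."""
    s = len(known_outliers)
    if s == 0:
        raise ValueError("at least one known abnormal point is required")
    # r_k = 1 when the tree isolated q_k in a node of its own, otherwise 0.
    return sum(1 for q in known_outliers if q in isolated) / s


def prune_forest(isolated_by_tree: List[Set[Hashable]],
                 known_outliers: Sequence[Hashable],
                 n: int) -> List[int]:
    """Return the indices of the n highest-scoring trees (the second binary forest)."""
    scores = [tree_score(isolated, known_outliers) for isolated in isolated_by_tree]
    ranked = sorted(range(len(scores)), key=lambda r: scores[r], reverse=True)
    return ranked[:n]


# Four trees, three known abnormal points; keep the two best trees.
kept = prune_forest([{"q1"}, {"q1", "q2", "q3"}, set(), {"q2", "q3"}],
                    ["q1", "q2", "q3"], n=2)
```

Here `kept` holds the indices of the trees that isolated three and two of the known points, in that order.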
2. The method of claim 1, wherein the binary tree is obtained based on a subset of the first data set, and specifically comprises:
randomly selecting n samples from the N training samples, wherein N is greater than or equal to n;
allocating a first weight probability to the m feature data;
randomly generating a hyperplane from the n samples according to the first weight probability;
and dividing the n samples based on the hyperplane to form a binary tree.
3. The method of claim 2, wherein assigning a first weight probability to the m feature data comprises:
calculating the dimension entropy of each feature data, the dimension entropy of feature d_i (1 ≤ i ≤ m) being denoted ent(d_i);
B randomly selected samples p are cut into bin groups of equal width according to their values of feature d_i, and the number of samples b_j in each group j is counted;
the dimension entropy corresponding to the feature data is:
[Formula image 88162DEST_PATH_IMAGE002 not reproduced: dimension entropy ent(d_i)]
and the dimension entropy is the first weight probability.
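A possible reading of the binning step in claim 3 is sketched below. The entropy formula itself is only available as an image, so the sketch assumes the usual Shannon entropy over the bin frequencies b_j divided by the number of samples; the function name `dimension_entropy` and the choice of the natural logarithm are assumptions.

```python
import math
from typing import Sequence


def dimension_entropy(values: Sequence[float], bins: int) -> float:
    """Shannon entropy of equal-width bin counts for one feature (assumed form of ent(d_i))."""
    if not values or bins < 1:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature falls into a single bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        j = min(int((v - lo) / width), bins - 1)  # the maximum goes into the last bin
        counts[j] += 1  # counts[j] plays the role of b_j
    total = len(values)
    return -sum((b / total) * math.log(b / total) for b in counts if b > 0)


uniform = dimension_entropy([0.0, 1.0, 2.0, 3.0], bins=4)   # one sample per bin
skewed = dimension_entropy([0.0, 0.0, 0.0, 10.0], bins=2)   # three samples in one bin
```

Under this assumption a feature whose values spread evenly over the bins gets a higher weight than one whose values pile up in a single bin.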
4. The method of claim 3, wherein randomly generating the hyperplane based on the n samples based on the first weighted probability comprises:
constructing a three-dimensional hyperplane, wherein the hyperplane equation for dividing is ax + by + cz + u = 0;
those of the n samples satisfying ax + by + cz + u < 0 are assigned to the left subtree;
those satisfying ax + by + cz + u > 0 are assigned to the right subtree;
each time a hyperplane division is constructed, the probability that a feature [formula image DEST_PATH_IMAGE003 not reproduced] is selected as feature data in the hyperplane is
[Formula image 615089DEST_PATH_IMAGE004 not reproduced: selection probability]
and coefficients are then randomly assigned to each of the selected feature data to obtain a hyperplane, wherein the depth of the binary tree does not exceed the value
[Formula image DEST_PATH_IMAGE005 not reproduced: depth limit]
5. The method of claim 4, wherein the hyperplane may have any number of dimensions, and each hyperplane divides a node of a decision tree into left and right subtrees.
6. The method of claim 4, wherein the hyperplane partitioning specifically comprises:
for a training sample p, dividing according to the above method of dividing into left and right subtrees by the hyperplane, the division being considered complete when the node in which the sample is located contains only that one sample point.
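The tree construction described in claims 4 to 6 might look like the sketch below. Several details are assumptions because the claims give them only as formula images: features are drawn with probability proportional to the supplied weights, coefficients are drawn uniformly from [-1, 1], the constant term u is chosen so that the hyperplane falls inside the range of the projected samples, a sample lying exactly on the hyperplane goes to the right subtree, and the depth limit is passed in by the caller. `Node`, `side`, `build_tree` and `leaf_depth` are hypothetical names.

```python
import random
from typing import List, Optional, Sequence

Sample = Sequence[float]


class Node:
    """A binary-tree node; leaves carry no hyperplane and no children."""

    def __init__(self, size: int, depth: int) -> None:
        self.size = size
        self.depth = depth
        self.features: List[int] = []   # indices of the features in the hyperplane
        self.coeffs: List[float] = []   # one random coefficient per selected feature
        self.offset = 0.0               # the constant term u
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def side(node: Node, p: Sample) -> float:
    """Value of the node's hyperplane equation at sample p."""
    return sum(c * p[f] for f, c in zip(node.features, node.coeffs)) + node.offset


def build_tree(samples: List[Sample], weights: Sequence[float], k: int,
               max_depth: int, rng: random.Random, depth: int = 0) -> Node:
    node = Node(len(samples), depth)
    if len(samples) <= 1 or depth >= max_depth:
        return node  # a single sample point: division is complete
    # Draw up to k distinct features, favouring those with larger weights.
    node.features = sorted(set(rng.choices(range(len(weights)), weights=weights, k=k)))
    node.coeffs = [rng.uniform(-1.0, 1.0) for _ in node.features]
    proj = [side(node, p) for p in samples]  # offset is still 0 here
    node.offset = -rng.uniform(min(proj), max(proj))
    left = [p for p, v in zip(samples, proj) if v + node.offset < 0]
    right = [p for p, v in zip(samples, proj) if v + node.offset >= 0]
    if not left or not right:
        node.features, node.coeffs, node.offset = [], [], 0.0
        return node  # degenerate hyperplane: keep the node as a leaf
    node.left = build_tree(left, weights, k, max_depth, rng, depth + 1)
    node.right = build_tree(right, weights, k, max_depth, rng, depth + 1)
    return node


def leaf_depth(node: Node, p: Sample) -> int:
    """Depth of the leaf that sample p reaches."""
    while node.left is not None and node.right is not None:
        node = node.left if side(node, p) < 0 else node.right
    return node.depth


data = [(float(i), float(i * i % 7), float(-i)) for i in range(16)]
tree = build_tree(data, weights=[0.5, 0.3, 0.2], k=2, max_depth=8, rng=random.Random(0))
```

The weights must contain at least one positive value, since `random.choices` rejects weights that sum to zero. Sample points that end up in shallow leaves are the ones the forest isolates most easily.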
7. The method according to claim 1, wherein the detecting abnormal behavior based on the binary forest specifically comprises:
acquiring the depth L(p) of the leaf node where a sample p is located in each binary tree of the second binary forest;
scoring the degree of abnormality of the sample point p, wherein the scoring formula is:
[Formula image 611471DEST_PATH_IMAGE006 not reproduced: abnormality score of sample point p]
wherein E(L(p)) represents the average depth of the leaf nodes where the sample p is located across the binary trees of the second binary forest, c(n) = 2 × (ln(n − 1) + 0.5772156649) − 2 × (n − 1)/n, and n represents the n-th tree after sorting in the second binary forest;
and, according to the score obtained by the formula, if the score is less than 0.5, the sample point is considered a normal point.
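The scoring step of claim 7 can be illustrated as follows. The normalising term c(n) is taken directly from the claim; the scoring formula itself is only available as an image, so the sketch assumes the form used by the standard isolation forest, 2 raised to the power −E(L(p))/c(n), which agrees with the 0.5 threshold stated in the claim. The function names are hypothetical, and n is simply passed through as given in the claim.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # the constant that appears in the claim


def c(n: int) -> float:
    """Normalising term c(n) = 2 * (ln(n - 1) + 0.5772156649) - 2 * (n - 1) / n."""
    if n < 2:
        raise ValueError("c(n) needs n >= 2 because of ln(n - 1)")
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n


def anomaly_score(depths: Sequence[float], n: int) -> float:
    """Assumed score 2 ** (-E(L(p)) / c(n)), where depths holds L(p) for each tree."""
    mean_depth = sum(depths) / len(depths)  # E(L(p))
    return 2 ** (-mean_depth / c(n))


def is_normal(score: float) -> bool:
    """Claim 7: a score below 0.5 marks the sample point as normal."""
    return score < 0.5


shallow = anomaly_score([2, 3, 2, 4], n=256)      # isolated quickly in every tree
deep = anomaly_score([14, 15, 13, 16], n=256)     # needs many divisions to isolate
```

Under this assumed formula, a sample whose average leaf depth equals c(n) scores exactly 0.5, shallower samples score higher, and deeper samples score lower.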
8. A system for implementing the big data based abnormal behavior detection method according to any one of claims 1 to 7, comprising:
the data acquisition module is used for acquiring mass data to be trained;
the first memory is used for storing the mass data to be trained;
the preprocessing module is used for preprocessing the mass data to be trained in the first memory to obtain a first data set;
a model generation module configured to generate a binary forest based on the first data set, the binary forest being a set composed of a number of binary trees, the binary trees being obtained based on a subset of the first data set;
and the abnormality detection module is used for detecting abnormal behaviors based on the binary forest.
CN202111474046.3A 2021-12-06 2021-12-06 Abnormal behavior detection method and system based on big data Active CN113887674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111474046.3A CN113887674B (en) 2021-12-06 2021-12-06 Abnormal behavior detection method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111474046.3A CN113887674B (en) 2021-12-06 2021-12-06 Abnormal behavior detection method and system based on big data

Publications (2)

Publication Number Publication Date
CN113887674A CN113887674A (en) 2022-01-04
CN113887674B true CN113887674B (en) 2022-03-22

Family

ID=79015616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111474046.3A Active CN113887674B (en) 2021-12-06 2021-12-06 Abnormal behavior detection method and system based on big data

Country Status (1)

Country Link
CN (1) CN113887674B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580580B (en) * 2022-05-07 2022-08-16 深圳索信达数据技术有限公司 Intelligent operation and maintenance abnormity detection method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3076384A1 (en) * 2017-12-28 2019-07-05 Worldline DETECTION OF ANOMALIES BY A COMBINING APPROACH SUPERVISORY AND NON-SUPERVISE LEARNING
CN110570244A (en) * 2019-09-04 2019-12-13 深圳创新奇智科技有限公司 hot-selling commodity construction method and system based on abnormal user identification
CN111081016A (en) * 2019-12-18 2020-04-28 北京航空航天大学 Urban traffic abnormity identification method based on complex network theory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292350A (en) * 2017-08-04 2017-10-24 电子科技大学 The method for detecting abnormality of large-scale data
CN111833172A (en) * 2020-05-25 2020-10-27 百维金科(上海)信息科技有限公司 Consumption credit fraud detection method and system based on isolated forest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Entropy Isolation Forest Based on Dimension Entropy for Anomaly Detection;Liefa Liao et al.;《Computational Intelligence and Intelligent Systems》;20190208;365-376 *
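Research on Outlier Detection Algorithms Based on Multiple Sampling and Dimension Entropy; 罗斌 (Luo Bin); China Master's Theses Full-text Database, Information Science and Technology; 20200115; I138-906 *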

Also Published As

Publication number Publication date
CN113887674A (en) 2022-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant