CN114285601B - Multi-dense-block detection and extraction method for big data - Google Patents
Multi-dense-block detection and extraction method for big data
- Publication number
- CN114285601B (application CN202111406838.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- dense
- blocks
- block
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Complex Calculations (AREA)
Abstract
The invention discloses a method for detecting and extracting multiple dense blocks from big data, and aims to solve the technical problem that prior-art dense block detection methods have low accuracy and low recall. The method includes: acquiring K-dimensional tensor data D, the number m of dense blocks to be extracted, and the size range of the dense blocks; measuring the suspicious degree of the K-dimensional tensor data D by using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the size range of the dense blocks; and extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list. The invention can effectively improve the accuracy and recall of dense block detection while maintaining detection efficiency.
Description
Technical Field
The invention relates to a method for detecting and extracting multiple dense blocks of big data, and belongs to the technical field of abnormal data detection.
Background
With the advent of the big data era, detecting abnormal data has become increasingly important. Abnormal data arising from network attacks usually exhibit "consistency". For example, a group of IP addresses sends requests to several fixed ports of a group of target IP addresses at the same time interval; likewise, fraudulent data produced by purchasing zombie followers (fake accounts) to inflate influence shows a high degree of consistency in which users a particular group of accounts follows. When the data are modeled as a tensor, this consistency produces dense blocks in the tensor data, so abnormal data detection tasks such as network attack detection and social network zombie follower detection can be performed by detecting and extracting the dense blocks in the tensor data.
At present, dense block detection and extraction methods for tensor models mainly include the following:
1. Dense block detection based on tensor decomposition, including methods such as HOSVD and CP decomposition. Although these methods can detect dense blocks, they do not scale well under density metrics and do not provide reasonable bounds.
2. A method that proposes the Suspiciousness metric for measuring the suspicious degree of a dense block, together with the CrossSpot method for detecting dense blocks in a tensor. However, this method can detect only a single dense block; if the tensor contains several dense blocks of the same scale, they may be merged into one large dense block, which lowers the accuracy of the detection result.
3. A dense block detection framework that supports different metrics and can detect multiple dense blocks, but still cannot solve the problem that merging dense blocks of the same scale reduces detection accuracy.
4. A binary-tree-based suspicious block detection method. Each detected dense block is placed on a left leaf and treated as an ordinary tensor to be decomposed further, and a mathematical measure is designed to decide whether the dense block on a leaf should be decomposed: if the rule is met, decomposition continues; otherwise it stops; and so on until all leaf nodes have been processed. This method solves the problem of merging dense blocks of the same scale, but has two problems: first, a decomposition must be performed each time before the termination condition can be checked, so efficiency is very low; second, because a decomposition approach is used, the first k dense blocks found are not guaranteed to be the k dense blocks with the highest suspicious degree.
Disclosure of Invention
To address the low detection accuracy of existing dense block detection methods, the invention provides a method for detecting and extracting multiple dense blocks from big data.
To solve the above technical problem, the invention adopts the following technical means:
The invention provides a method for detecting and extracting multiple dense blocks from big data, which comprises the following steps:
acquiring K-dimensional tensor data D, the number m of dense blocks to be extracted, and the size range of the dense blocks;
measuring the suspicious degree of the K-dimensional tensor data D by using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the size range of the dense blocks;
extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list.
Further, the method for measuring the suspicious degree of the K-dimensional tensor data D by using the density tracking coefficient based on the piecewise function comprises the following steps:
taking the K-dimensional tensor data D as input data;
deleting the column with the lowest count in each dimension of the input data to obtain the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K;
computing the density tracking coefficient of the remaining block b_i of each dimension based on the piecewise function;
computing, from the density tracking coefficient, the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D.
Further, the density tracking coefficient is expressed as follows:
[Formula image not available in this text.] Its symbols denote, in order: the density tracking coefficient of the remaining block b_i of the i-th dimension, the total count of the remaining block b_i, and the product of the sizes of the remaining block b_i.
Further, the suspicious degree DTS value is calculated as follows:
[Formula image not available in this text.] Its terms are: the DTS value of the remaining block b_i of the i-th dimension; the Suspiciousness metric; N_i, the product of the sizes in the i-th dimension of the original K-dimensional tensor data D; and C, the total count of the original K-dimensional tensor data D.
Further, the method for generating the snapshots list according to the suspicious degree and the dense block size range comprises the following steps:
comparing the DTS values of the remaining blocks b_i of all dimensions to obtain the remaining block b_max with the highest DTS value;
judging whether the remaining block b_max satisfies the dense block size range and, if so, taking the remaining block b_max as a data snapshot B;
storing the data snapshot B and its DTS value in the snapshots list;
taking the remaining block b_max as new input data and repeating the suspicious degree measurement and data snapshot extraction until the input data is empty, to obtain the final snapshots list.
Further, the method for extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list comprises the following steps:
finding the data snapshot Bmax with the maximum DTS value in the snapshots list and taking it as a dense block;
deleting the data snapshot Bmax from the K-dimensional tensor data D to obtain updated tensor data D;
generating a new snapshots list from the updated tensor data D and extracting new dense blocks until m dense blocks have been extracted.
Adopting the above technical means provides the following advantages:
The invention provides a method for detecting and extracting multiple dense blocks from big data. For high-dimensional tensor data, the data blocks are tracked in real time through the designed density tracking coefficient and DTS value to obtain their suspicious degree. During dense block detection, all suspicious data snapshots are stored together in a snapshots list, and dense blocks are selected one by one from that list, so dense blocks are neither merged nor split; detection efficiency is maintained and the accuracy of the detected dense blocks is effectively improved. The method fully accounts for the density of each data block: the density tracking coefficient gives low-density remaining blocks lower DTS values and high-density remaining blocks higher DTS values, so that low-density remaining blocks are eliminated during subsequent extraction, which further improves the accuracy of dense block detection.
Drawings
FIG. 1 is a flow chart of a big data multiple dense block detection and extraction method of the present invention;
FIG. 2 is a flow chart of single dense block detection and extraction in the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings:
The invention provides a method for detecting and extracting multiple dense blocks from big data which, as shown in FIGS. 1 and 2, specifically comprises the following steps:
Step A, acquiring the K-dimensional tensor data D. The specific operation comprises 4 steps: 1. data integration, namely integrating the big data to be detected into a specified data center through ETL (extract, transform, load) technology; 2. data desensitization, namely desensitizing sensitive information (such as identification numbers) from the production environment; 3. data cleaning, namely cleaning the desensitized data to ensure its accuracy and consistency; 4. data preprocessing, namely modeling the relational data obtained from the production environment and converting it into K-dimensional tensor data D.
The number m of dense blocks to be extracted and the size range [min, max] of the dense blocks are set manually, where min is the lower limit and max is the upper limit of the dense block size.
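As an illustration of the data preprocessing in Step A, the following minimal sketch converts relational records into a K-dimensional count tensor with NumPy. The function name `records_to_tensor`, the sample records, and the choice of using the number of occurrences as each cell value are assumptions made for illustration; the text does not specify how the tensor is built.

```python
import numpy as np

def records_to_tensor(records):
    """Build a K-dimensional count tensor from K-column relational records.

    Hypothetical helper: each distinct attribute value becomes one index
    along its dimension, and each cell counts how often that combination
    of values occurs in `records`.
    """
    columns = list(zip(*records))              # one tuple of values per attribute
    index_maps, codes = [], []
    for col in columns:
        # np.unique returns the sorted distinct values and, for every record,
        # the position of its value in that sorted array.
        values, inverse = np.unique(np.array(col), return_inverse=True)
        index_maps.append(values)
        codes.append(inverse)
    shape = tuple(len(v) for v in index_maps)
    tensor = np.zeros(shape, dtype=np.int64)
    # np.add.at accumulates correctly when the same cell appears several times.
    np.add.at(tensor, tuple(codes), 1)
    return tensor, index_maps

# Example: (source IP, target IP, port) request records, so K = 3.
records = [
    ("10.0.0.1", "10.0.1.1", 80),
    ("10.0.0.1", "10.0.1.1", 80),
    ("10.0.0.2", "10.0.1.2", 443),
]
D, index_maps = records_to_tensor(records)
```

The returned `index_maps` make it possible to translate the indices of a detected block back to the original attribute values.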
Step B, measuring the suspicious degree of the K-dimensional tensor data D by using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the dense block size range. The specific operation is as follows:
Step B01, taking the K-dimensional tensor data D as input data, and creating an empty list, snapshots, for storing the snapshot of the tensor data D generated at each step of the optimization process.
Step B02, calculating the count of each column in each dimension of the input data and deleting the column with the lowest count to obtain the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K.
Assume the input data is a 100-by-100 two-dimensional tensor (matrix), so the dimension K = 2, with rows as the first dimension and columns as the second. Adding the 100 elements of each row gives the count of that row; the counts of the 100 rows are compared and the row with the lowest count is removed, giving a 99-by-100 matrix, which is the remaining block b_1 of the first dimension. In the same way, the column with the lowest count is removed from the 100 columns to give a 100-by-99 matrix, which is the remaining block b_2 of the second dimension.
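The 100-by-100 example can be sketched as follows. This is a minimal NumPy sketch rather than the patent's implementation; the helper name `remove_min_slice` and the random test matrix are assumptions, and for K > 2 a "column" is taken to mean a slice along one dimension.

```python
import numpy as np

def remove_min_slice(block, axis):
    """Return `block` without its lowest-count slice along `axis`."""
    other_axes = tuple(a for a in range(block.ndim) if a != axis)
    counts = block.sum(axis=other_axes)        # one count per slice along `axis`
    return np.delete(block, int(np.argmin(counts)), axis=axis)

rng = np.random.default_rng(0)
D = rng.integers(0, 5, size=(100, 100))        # stand-in input data, K = 2

b1 = remove_min_slice(D, axis=0)               # drop the lowest-count row
b2 = remove_min_slice(D, axis=1)               # drop the lowest-count column
print(b1.shape, b2.shape)
```

When several slices share the lowest count, `np.argmin` picks the first one; the text does not say how ties should be broken.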
Step B03, to avoid the loss of accuracy caused by merging dense blocks of the same scale, the invention designs a density tracking coefficient based on a piecewise function, expressed as follows:
[Formula image not available in this text.] Its symbols denote, in order: the density tracking coefficient of the remaining block b_i of the i-th dimension; the total count of the remaining block b_i, i.e. the sum of all elements in b_i; and the product of the sizes of the remaining block b_i.
For example, for a remaining block of size 50 × 50, the total count is the sum of the 2500 elements in the remaining block, the product of the sizes is 50 × 50 = 2500, and K = 2.
The density tracking coefficient of the remaining block b_i of each dimension is calculated in turn using the density tracking coefficient expression (formula (3)).
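The two quantities that the density tracking coefficient depends on can be computed directly. The sketch below only evaluates those inputs for the 50 × 50 example; it does not compute the coefficient itself, because the piecewise expression is not available in this text. The name `block_stats` and the all-ones block are assumptions for illustration.

```python
import numpy as np

def block_stats(block):
    """Return (total count, product of sizes, K) for a remaining block."""
    total_count = int(block.sum())             # sum of all elements in the block
    size_product = int(np.prod(block.shape))   # e.g. 50 * 50 for a 50-by-50 block
    return total_count, size_product, block.ndim

block = np.ones((50, 50), dtype=np.int64)      # stand-in block: every cell holds 1
total_count, size_product, K = block_stats(block)
print(total_count, size_product, K)            # 2500 2500 2
```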
Step B04, calculating, from the density tracking coefficient, the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D. The calculation formula is as follows:
[Formula image not available in this text.] Its terms are: the DTS value of the remaining block b_i of the i-th dimension; the Suspiciousness metric; N_i, the product of the sizes in the i-th dimension of the original K-dimensional tensor data D; and C, the total count of the original K-dimensional tensor data D, i.e. the sum of all elements in D.
Suspiciousness is a metric proposed in earlier research. It assumes that the original input tensor follows a Poisson distribution and evaluates how likely a dense block is to appear under that distribution.
Step B05, comparing the DTS values of the remaining blocks b_i of all dimensions to obtain the remaining block b_max with the highest DTS value, and judging whether b_max satisfies the dense block size range, i.e. min < size(b_max) < max. If it does, the remaining block b_max is taken as a data snapshot B, and the data snapshot B and its DTS value are stored in the snapshots list; otherwise, nothing is stored.
Step B06, taking the remaining block b_max from step B05 as new input data and repeating steps B02 to B05 until the input data is empty, at which point the loop stops, yielding a snapshots list that contains multiple data snapshots and their corresponding DTS values.
Assume the DTS value of the remaining block b_1 of the first dimension (the 99-by-100 matrix) is greater than that of the remaining block b_2 of the second dimension (the 100-by-99 matrix); then b_1 is selected as b_max. When the steps are repeated, b_1 is the new input data: removing the row with the lowest count from its 99 rows gives a 98-by-100 matrix, removing the column with the lowest count from its 100 columns gives a 99-by-99 matrix, and the DTS values of the 98-by-100 and 99-by-99 matrices are compared; and so on.
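The loop of steps B02 to B06 can be sketched as below. Several things here are assumptions: the `score` argument stands in for the DTS value (whose formula is not available in this text), `density` is only a placeholder score, size(b_max) is interpreted as the product of the block's sizes, and each snapshot is stored as per-dimension index arrays plus its score so that the block can later be located in D.

```python
import numpy as np

def density(block):
    """Placeholder score (NOT the patent's DTS): average value per cell."""
    return float(block.sum()) / block.size if block.size else 0.0

def build_snapshots(D, score, min_size, max_size):
    """Greedy peeling: repeatedly drop one lowest-count slice, keep scored snapshots."""
    idx = [np.arange(n) for n in D.shape]       # surviving indices per dimension
    snapshots = []
    while all(len(ix) > 0 for ix in idx):       # stop once the input is empty
        block = D[np.ix_(*idx)]
        best_score, best_idx = None, None
        for axis in range(D.ndim):
            other = tuple(a for a in range(D.ndim) if a != axis)
            counts = block.sum(axis=other)      # count of each slice along `axis`
            keep = np.delete(idx[axis], int(np.argmin(counts)))
            cand = idx[:axis] + [keep] + idx[axis + 1:]   # remaining block b_i
            s = score(D[np.ix_(*cand)])
            if best_score is None or s > best_score:
                best_score, best_idx = s, cand
        idx = best_idx                          # b_max becomes the new input
        size = int(np.prod([len(ix) for ix in idx]))
        if min_size < size < max_size:          # strict bounds, as in step B05
            snapshots.append((idx, best_score))
    return snapshots

D = np.zeros((6, 6))
D[:3, :3] = 5                                   # one planted 3-by-3 dense block
snapshots = build_snapshots(D, density, min_size=4, max_size=20)
best_idx, best_score = max(snapshots, key=lambda item: item[1])
```

Each pass removes exactly one slice, so the loop ends after at most as many passes as the sum of the tensor's sizes.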
Step C, extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list.
Step C01, acquiring the snapshots list generated in step B06 and finding the data snapshot Bmax with the maximum DTS value in it, to serve as a dense block.
Step C02, deleting the data snapshot Bmax from the K-dimensional tensor data D to obtain the updated tensor data D.
Step C03, repeating steps B to C02 with the updated tensor data D, i.e. continuing to update the snapshots list, extract dense blocks, and update the tensor data D, until m dense blocks have been extracted, and finally outputting the m dense blocks.
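Step C can be sketched as an outer loop around snapshot generation. In this minimal sketch, `make_snapshots` is a hypothetical callable that returns a list of `(index_arrays, score)` pairs for the current tensor, standing in for step B, and "deleting" a block is interpreted as setting its cells to zero; the text does not say how deletion is carried out.

```python
import numpy as np

def extract_dense_blocks(D, m, make_snapshots):
    """Extract up to m blocks, each time taking the highest-scoring snapshot."""
    work = D.copy()                             # leave the caller's tensor intact
    blocks = []
    for _ in range(m):
        snapshots = make_snapshots(work)        # step B on the updated tensor
        if not snapshots:                       # nothing left to extract
            break
        idx, _score = max(snapshots, key=lambda item: item[1])   # step C01
        blocks.append(idx)
        work[np.ix_(*idx)] = 0                  # step C02: remove the block's entries
    return blocks
```

Because the snapshots list is rebuilt after every deletion, a block that has already been extracted cannot be returned a second time.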
Compared with the 4 prior-art approaches described in the Background, the invention has the following advantages:
1. Dense block detection based on tensor decomposition is an earlier technique and scales poorly to high-dimensional data, whereas the method of the invention can process high-dimensional data and scales well to it.
2. Both the Suspiciousness-based CrossSpot detection method and the M-Zoom dense block detection framework suffer from the merging of dense blocks of the same size; the merged large block contains several non-dense regions, so the accuracy of the detected dense block is very low. For example, suppose a 100-by-100 matrix contains two dense blocks: one 20-by-20 block whose row and column indices run from 1 to 20, and another whose row and column indices run from 21 to 40. If dense block detection is performed with the Suspiciousness-based CrossSpot method or the M-Zoom framework, a single 40-by-40 block (row and column indices from 1 to 40) is detected, with the two small dense blocks on its diagonal. The upper-right and lower-left corners of this large block are non-dense, so its accuracy is only about 50%; with 3 such small blocks the accuracy may be only about 33%, and so on.
In the suspicious degree calculation of the invention, density is fully taken into account: the density tracking coefficient gives low-density remaining blocks lower DTS values and high-density remaining blocks higher DTS values, so that low-density remaining blocks are eliminated during subsequent extraction. In the above example, the method finds two 20-by-20 dense blocks rather than a single 40-by-40 block.
3. Although the binary-tree-based suspicious block detection method can solve the merging of dense blocks of the same size, it is inefficient, and because a large dense block is split first and judged afterwards, it is likely to split a large dense block of higher suspicious degree into 2 small dense blocks of insufficient suspicious degree, so that the dense blocks finally extracted are not the most suspicious ones. The method of the invention does not have this problem: the DTS value of each remaining block is calculated in real time through the density tracking coefficient, dense blocks are detected in real time, and there is no splitting operation, so efficiency is higher and the truly most suspicious blocks can be found accurately.
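The 50% and 33% figures in item 2 follow from simple arithmetic: k dense blocks of side 20 placed on the diagonal occupy k × 400 cells of a merged block that has (20k)² cells, a fraction of 1/k. A short check, with the hypothetical helper name `merged_block_accuracy`:

```python
def merged_block_accuracy(k, side=20):
    """Fraction of a merged (k*side)-by-(k*side) block covered by k dense side-by-side blocks."""
    dense_cells = k * side * side
    merged_cells = (k * side) ** 2
    return dense_cells / merged_cells

print(merged_block_accuracy(2))   # 0.5
print(merged_block_accuracy(3))   # 0.333...
```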
The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make modifications and variations without departing from the technical principle of the invention, and such modifications and variations should also be regarded as falling within the protection scope of the invention.
Claims (1)
1. A method for detecting and extracting multiple dense blocks from big data, characterized by comprising the following steps:
acquiring K-dimensional tensor data D, the number m of dense blocks to be extracted, and the size range of the dense blocks;
measuring the suspicious degree of the K-dimensional tensor data D by using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the size range of the dense blocks;
extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list;
the method for measuring the suspicious degree of the K-dimensional tensor data D by using the density tracking coefficient based on the piecewise function comprises the following steps:
taking the K-dimensional tensor data D as input data;
adding all elements of the column corresponding to each attribute in each dimension of the input data to obtain the count of the column corresponding to each attribute in each dimension;
deleting the column with the lowest count in each dimension of the input data to obtain the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K;
computing the density tracking coefficient of the remaining block b_i of each dimension based on the piecewise function;
computing, from the density tracking coefficient, the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D;
the density tracking coefficient is expressed as follows:
[Formula image not available in this text.] Its symbols denote, in order: the density tracking coefficient of the remaining block b_i of the i-th dimension, the total count of the remaining block b_i, and the product of the sizes of the remaining block b_i;
the suspicious degree DTS value is calculated as follows:
[Formula image not available in this text.] Its terms are: the DTS value of the remaining block b_i of the i-th dimension; the Suspiciousness metric; the product of the sizes in the i-th dimension of the original K-dimensional tensor data D; and the total count of the original K-dimensional tensor data D;
the method for generating the snapshots list according to the suspicious degree and the dense block size range comprises the following steps:
comparing the DTS values of the remaining blocks b_i of all dimensions to obtain the remaining block b_max with the highest DTS value;
judging whether the remaining block b_max satisfies the dense block size range and, if so, taking the remaining block b_max as a data snapshot B;
storing the data snapshot B and its DTS value in the snapshots list;
taking the remaining block b_max as new input data and repeating the suspicious degree measurement and data snapshot extraction until the input data is empty, to obtain the final snapshots list;
the method for extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list comprises the following steps:
finding the data snapshot Bmax with the maximum DTS value in the snapshots list and taking it as a dense block;
deleting the data snapshot Bmax from the K-dimensional tensor data D to obtain updated tensor data D;
generating a new snapshots list from the updated tensor data D and extracting new dense blocks until m dense blocks have been extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111406838.7A CN114285601B (en) | 2021-11-24 | 2021-11-24 | Multi-dense-block detection and extraction method for big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111406838.7A CN114285601B (en) | 2021-11-24 | 2021-11-24 | Multi-dense-block detection and extraction method for big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114285601A CN114285601A (en) | 2022-04-05 |
CN114285601B true CN114285601B (en) | 2023-02-14 |
Family
ID=80870105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111406838.7A Active CN114285601B (en) | 2021-11-24 | 2021-11-24 | Multi-dense-block detection and extraction method for big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114285601B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108366048A (en) * | 2018-01-10 | 2018-08-03 | 南京邮电大学 | A kind of network inbreak detection method based on unsupervised learning |
CN109064189A (en) * | 2018-07-13 | 2018-12-21 | 北京亚鸿世纪科技发展有限公司 | Brush list detecting and alarm device based on the detection of intensive block |
CN109753797A (en) * | 2018-12-10 | 2019-05-14 | 中国科学院计算技术研究所 | For the intensive subgraph detection method and system of streaming figure |
CN111523012A (en) * | 2019-02-01 | 2020-08-11 | 慧安金科(北京)科技有限公司 | Method, apparatus, and computer-readable storage medium for detecting abnormal data |
CN113420608A (en) * | 2021-05-31 | 2021-09-21 | 高新兴科技集团股份有限公司 | Human body abnormal behavior identification method based on dense space-time graph convolutional network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2786284A4 (en) * | 2011-11-28 | 2015-08-05 | Hewlett Packard Development Co | Clustering event data by multiple time dimensions |
CN110322356B (en) * | 2019-04-22 | 2020-08-07 | 山东大学 | Medical insurance abnormity detection method and system based on HIN mining dynamic multi-mode |
US20210150305A1 (en) * | 2019-11-19 | 2021-05-20 | Ciena Corporation | Forecasting time-series data in a network environment |
Non-Patent Citations (4)
Title |
---|
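D-Cube: Dense-Block Detection in Terabyte-Scale Tensors; Kijung Shin et al.; ACM; 20170228; full text * |
M-Zoom: Fast Dense-Block Detection; Kijung Shin et al.; Springer International Publishing AG 2016; 20161231; full text * |
An abnormal behavior recognition method based on deep learning; Yang Rui et al.; Journal of Wuyi University (Natural Science Edition); 20180515 (No. 02); full text * |
Multi-dense-block detection method in tensor data; Fan Weijun et al.; Application Research of Computers; 20190228; full text * |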
Also Published As
Publication number | Publication date |
---|---|
CN114285601A (en) | 2022-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108521434B (en) | A kind of network security intrusion detecting system based on block chain technology | |
CN107786388B (en) | Anomaly detection system based on large-scale network flow data | |
CN109344845B (en) | Feature matching method based on triple deep neural network structure | |
CN108282460B (en) | Evidence chain generation method and device for network security event | |
CN108335290B (en) | Image area copying and tampering detection method based on LIOP feature and block matching | |
CN113364787B (en) | Botnet flow detection method based on parallel neural network | |
WO2019200739A1 (en) | Data fraud identification method, apparatus, computer device, and storage medium | |
CN103455597B (en) | Distributed information towards magnanimity web graph picture hides detection method | |
CN113037567B (en) | Simulation method of network attack behavior simulation system for power grid enterprise | |
CN111125750B (en) | Database watermark embedding and detecting method and system based on double-layer ellipse model | |
WO2023082641A1 (en) | Electronic archive generation method and apparatus, and terminal device and storage medium | |
CN115037543A (en) | Abnormal network flow detection method based on bidirectional time convolution neural network | |
CN114827380B (en) | Network security detection method based on artificial intelligence | |
CN116366313A (en) | Small sample abnormal flow detection method and system | |
CN110428438B (en) | Single-tree modeling method and device and storage medium | |
CN112035621A (en) | Enterprise name similarity detection method based on statistics | |
CN115952067A (en) | Database operation abnormal behavior detection method and readable storage medium | |
CN114285601B (en) | Multi-dense-block detection and extraction method for big data | |
CN111709021B (en) | Attack event identification method based on mass alarms and electronic device | |
CN114218610B (en) | Multi-dense block detection and extraction method based on Possion distribution | |
CN109739840A (en) | Data processing empty value method, apparatus and terminal device | |
CN112333155B (en) | Abnormal flow detection method and system, electronic equipment and storage medium | |
CN104484869A (en) | Image matching method and system for ordinal measure features | |
CN111680286B (en) | Refinement method of Internet of things equipment fingerprint library | |
CN113132291B (en) | Heterogeneous terminal feature generation and identification method based on network traffic at edge side |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||