CN114285601B

CN114285601B - Multi-dense-block detection and extraction method for big data

Info

Publication number: CN114285601B
Application number: CN202111406838.7A
Authority: CN
Inventors: 王俊松; 边荟凇; 洪海兵; 金易琛
Original assignee: Nanjing College of Information Technology
Current assignee: Nanjing College of Information Technology
Priority date: 2021-11-24
Filing date: 2021-11-24
Publication date: 2023-02-14
Anticipated expiration: 2041-11-24
Also published as: CN114285601A

Abstract

The invention discloses a multi-dense-block detection and extraction method for big data, which aims to solve the technical problem that prior-art dense block detection methods have low detection accuracy and low recall. The method comprises: acquiring K-dimensional tensor data D, the number m of dense blocks to be extracted, and the size range of the dense blocks; measuring the suspicious degree of the K-dimensional tensor data D using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the size range of the dense blocks; and extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list. The invention can effectively improve the accuracy and recall of dense block detection while maintaining detection efficiency.

Description

Multi-dense-block detection and extraction method for big data

Technical Field

The invention relates to a method for detecting and extracting multiple dense blocks of big data, and belongs to the technical field of abnormal data detection.

Background

With the advent of the big data era, detecting abnormal data has become increasingly important. Abnormal data arising from network attacks usually exhibits "consistency". For example, a group of IP addresses sends requests to several fixed ports of a group of target IPs at the same time interval; similarly, in fraud data produced by purchasing zombie followers (fake followers) to inflate influence, a particular group of users shows a high degree of consistency in the users it follows. When the data is modeled as a tensor, this consistency causes dense blocks to appear in the tensor data, so abnormal-data detection tasks such as network attack detection and social-network zombie follower detection can be carried out by detecting and extracting the dense blocks in the tensor data.

At present, dense block detection and extraction methods for tensor models mainly include the following:

1. Dense block detection methods based on tensor decomposition, such as HOSVD and CP decomposition. Although these methods can detect dense blocks, they do not extend well to different density measures and do not provide reasonable boundaries.

2. The CrossSpot method. This work proposes the Suspiciousness metric for measuring the suspicious degree of a dense block, together with the CrossSpot method for detecting a dense block in a tensor. However, the method can only detect a single dense block, and if the tensor contains several dense blocks of the same scale, they may be merged into one large dense block, so the accuracy of the detection result is low.

3. The M-Zoom framework. This dense block detection framework supports different metrics and can detect multiple dense blocks, but it still cannot solve the loss of detection accuracy caused by merging dense blocks of the same scale.

4. A binary-tree-based suspicious block detection method. Each detected dense block is placed on a left leaf and is then treated as an ordinary tensor to be decomposed further. A mathematical measure is designed to decide whether the dense block at a leaf should be decomposed: if the rule is satisfied, decomposition continues; otherwise it stops, and so on until all leaf nodes have been processed. This method solves the problem of merging same-scale dense blocks, but has two drawbacks: first, a decomposition must be performed each time before the termination condition can be checked, so efficiency is very low; second, because a decomposition approach is used, the first k dense blocks found are not guaranteed to be the k dense blocks with the highest suspicious degree.

Disclosure of Invention

To address the low detection accuracy of existing dense block detection methods, the invention provides a multi-dense-block detection and extraction method for big data.

To solve the above technical problem, the invention adopts the following technical means:

The invention provides a multi-dense-block detection and extraction method for big data, comprising the following steps:

acquiring K-dimensional tensor data D, the number m of dense blocks to be extracted, and the size range of the dense blocks;

measuring the suspicious degree of the K-dimensional tensor data D by using a density tracking coefficient based on a piecewise function, and generating a snapshots list according to the suspicious degree and the size range of the dense blocks;

extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list.

Further, the method for measuring the suspicious degree of the K-dimensional tensor data D by using the density tracking coefficient based on the piecewise function comprises the following steps:

taking K-dimensional tensor data D as input data;

deleting the column with the least count under each dimension of the input data to obtain the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K;

computing the density tracking coefficient of the remaining block b_i of each dimension based on a piecewise function;

computing, from the density tracking coefficient, the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D.

Further, the density tracking coefficient is expressed as follows:

[formula not reproduced in the source text]

where the terms of the formula are the density tracking coefficient of the remaining block b_i of the i-th dimension, the total count of the remaining block b_i, and the product of the sizes of the remaining block b_i.

Further, the suspicious degree DTS value is calculated as follows:

[formula not reproduced in the source text]

where the terms of the formula are the DTS value of the remaining block b_i of the i-th dimension; the Suspiciousness metric; N_i, the product of the sizes in the i-th dimension of the original K-dimensional tensor data D; and C, the total count of the original K-dimensional tensor data D.

Further, the method for generating the snapshots list according to the suspicious degree and the dense block size range comprises the following steps:

comparing the DTS values of the remaining blocks b_i of each dimension to obtain the remaining block b_max with the highest DTS value;

judging whether the remaining block b_max satisfies the dense block size range, and if so, taking the remaining block b_max as data snapshot B;

storing the data snapshot B and its DTS value in the snapshots list;

taking the remaining block b_max as new input data, and performing the suspicious degree measurement and data snapshot extraction again until the input data is empty, to obtain the final snapshots list.

Further, the method for extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list comprises the following steps:

finding the data snapshot Bmax with the maximum DTS value in the snapshots list and taking it as a dense block;

deleting the data snapshot Bmax from the K-dimensional tensor data D to obtain updated tensor data D;

generating a new snapshots list from the updated tensor data D and extracting new dense blocks, until m dense blocks have been extracted.

The following advantages can be obtained by adopting the technical means:

The invention provides a multi-dense-block detection and extraction method for big data. For high-dimensional tensor data, the designed density tracking coefficient and DTS value track the data blocks in the tensor in real time to obtain their suspicious degree. During dense block detection, all suspicious data snapshots are stored together in a snapshots list, and dense blocks are selected one by one from that list, so dense blocks are neither spliced together nor split apart; this preserves detection efficiency and effectively improves the accuracy of the detected dense blocks. The method takes the density of each data block fully into account: the density tracking coefficient gives low-density remaining blocks a lower DTS value and high-density remaining blocks a higher DTS value, so low-density remaining blocks are eliminated in the subsequent extraction, further improving the accuracy of dense block detection.

Drawings

FIG. 1 is a flow chart of a big data multiple dense block detection and extraction method of the present invention;

FIG. 2 is a flow chart of single dense block detection and extraction in the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the accompanying drawings as follows:

the invention provides a method for detecting and extracting multiple dense blocks of big data, which specifically comprises the following steps as shown in figures 1 and 2:

Step A: obtain the K-dimensional tensor data D. This comprises four operations: 1. data integration, in which the big data to be detected is gathered into a designated data center through ETL (extract, transform, load) technology; 2. data desensitization, in which sensitive information from the production environment (such as identification numbers) is desensitized; 3. data cleaning, in which the desensitized data is cleaned to ensure its accuracy and consistency; 4. data preprocessing, in which the relational data obtained from the production environment is modeled and converted into K-dimensional tensor data D.

The number m of dense blocks to be extracted and the dense block size range [min, max] are set manually, where min is the lower limit of the dense block size and max is the upper limit.
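The data-modeling part of Step A can be illustrated with a minimal sketch. It assumes that each relational record contributes a count of 1 to the tensor cell addressed by its K attribute values and that the tensor is stored sparsely; the function name `build_count_tensor`, the attribute names, and the sample records are hypothetical and do not come from the patent.

```python
from collections import Counter


def build_count_tensor(records, attributes):
    """Map relational records (dicts) to a sparse K-dimensional count tensor.

    Keys are K-tuples of attribute values; values are the number of
    records that share that combination (an assumed modeling choice).
    """
    tensor = Counter()
    for record in records:
        key = tuple(record[name] for name in attributes)
        tensor[key] += 1
    return tensor


# Hypothetical connection log with K = 3: source IP, destination IP, port.
records = [
    {"src": "10.0.0.1", "dst": "10.0.1.9", "port": 80},
    {"src": "10.0.0.1", "dst": "10.0.1.9", "port": 80},
    {"src": "10.0.0.2", "dst": "10.0.1.9", "port": 443},
]
D = build_count_tensor(records, ("src", "dst", "port"))
total_count = sum(D.values())  # the overall count C of the tensor
```

A sparse mapping is used here because tensors built from logs are typically mostly empty; a dense array would work equally well for small examples.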

Step B: measure the suspicious degree of the K-dimensional tensor data D using a density tracking coefficient based on a piecewise function, and generate a snapshots list according to the suspicious degree and the dense block size range. The specific operations are as follows:

Step B01: take the K-dimensional tensor data D as input data, and create an empty list, snapshots, for storing the snapshot of the tensor data D generated at each step of the optimization process.

Step B02: compute the count of each column under each dimension of the input data and delete the column with the least count, obtaining the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K.

Suppose the input data is a 100 by 100 two-dimensional tensor (a matrix), so K = 2, with rows as the first dimension and columns as the second. Summing the 100 elements of each row gives the count of each row; comparing the counts of the 100 rows and removing the row with the least count yields a 99 by 100 matrix, which is the remaining block b_1 of the first dimension. In the same way, removing the column with the least count from the 100 columns yields a 100 by 99 matrix, which is the remaining block b_2 of the second dimension.
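The row and column removal in this example can be sketched on a small matrix. This is only an illustration of Step B02 for K = 2; the function name `remaining_blocks` is hypothetical, and breaking ties by the lowest index is an assumption, since the text does not say how ties are handled.

```python
def remaining_blocks(matrix):
    """Return (b_1, b_2) for a two-dimensional tensor.

    b_1 is the matrix without its least-count row and b_2 is the matrix
    without its least-count column. Ties go to the lowest index (assumed).
    """
    row_counts = [sum(row) for row in matrix]
    col_counts = [sum(col) for col in zip(*matrix)]
    drop_row = row_counts.index(min(row_counts))
    drop_col = col_counts.index(min(col_counts))
    b_1 = [row for r, row in enumerate(matrix) if r != drop_row]
    b_2 = [[v for c, v in enumerate(row) if c != drop_col] for row in matrix]
    return b_1, b_2


matrix = [
    [5, 4, 0],
    [6, 7, 1],
    [0, 1, 0],
]
b_1, b_2 = remaining_blocks(matrix)
# b_1 drops the last row (count 1); b_2 drops the last column (count 1).
```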

Step B03: to avoid the loss of accuracy caused by merging dense blocks of the same scale, the invention designs a density tracking coefficient based on a piecewise function, expressed as follows:

[formula not reproduced in the source text]

where the terms of the formula are the total count of the remaining block b_i, i.e., the sum of all elements in b_i, and the product of the sizes of the remaining block b_i.

For example, if the remaining block has size 50 by 50, its total count is the sum of the 2500 elements in the block, and K = 2.

The density tracking coefficient of the remaining block b_i of each dimension is calculated in turn using formula (3).

Step B04: from the density tracking coefficient, calculate the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D. The calculation formula is as follows:

[formula not reproduced in the source text]

where the terms of the formula are the DTS value of the remaining block b_i of the i-th dimension; the Suspiciousness metric; N_i, the product of the sizes in the i-th dimension of the original K-dimensional tensor data D; and C, the total count of the original K-dimensional tensor data D, i.e., the sum of all elements in the K-dimensional tensor data D.

Suspiciousness is a metric proposed in earlier work; it assumes that the original input tensor follows a Poisson distribution and evaluates the probability of a dense block appearing under that distribution.

Step B05: compare the DTS values of the remaining blocks b_i of each dimension to obtain the remaining block b_max with the highest DTS value, and judge whether b_max satisfies the dense block size range, i.e., min < size(b_max) < max. If it does, take the remaining block b_max as data snapshot B and store B and its DTS value in the snapshots list; otherwise, do not store it.

Step B06: take the remaining block b_max from step B05 as new input data and repeat steps B02 to B05 until the input data is empty; then stop the loop, obtaining a snapshots list containing multiple data snapshots and their corresponding DTS values.

Suppose the DTS value of the remaining block b_1 of the first dimension (a 99 by 100 matrix) is greater than that of the remaining block b_2 of the second dimension (a 100 by 99 matrix); then b_1 is selected as b_max. When the steps are repeated, b_1 (the 99 by 100 matrix) is the new input data: removing the row with the least count from its 99 rows gives a 98 by 100 matrix, removing the column with the least count from its 100 columns gives a 99 by 99 matrix, and the DTS values of the 98 by 100 and 99 by 99 matrices are compared; and so on.
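The loop formed by steps B01 to B06 can be sketched for the two-dimensional case as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the DTS formula is not available in this text, so the `score` callable is a hypothetical stand-in (plain density in the usage example); size(b_max) is assumed to mean the number of cells in the block; and ties are broken in favor of the first candidate.

```python
def build_snapshots(matrix, min_size, max_size, score):
    """Peel a matrix one row or column at a time and record snapshots.

    score(block_count, block_cells) is a stand-in for the DTS value.
    Returns a list of (score, row_indices, col_indices) tuples.
    """
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    snapshots = []  # B01: empty snapshots list

    def mass(rs, cs):
        return sum(matrix[r][c] for r in rs for c in cs)

    while rows and cols:
        # B02: find the least-count row and column of the current block.
        worst_row = min(rows, key=lambda r: sum(matrix[r][c] for c in cols))
        worst_col = min(cols, key=lambda c: sum(matrix[r][c] for r in rows))
        candidates = [
            ([r for r in rows if r != worst_row], cols),  # b_1
            (rows, [c for c in cols if c != worst_col]),  # b_2
        ]
        # B03-B04: score every non-empty remaining block.
        scored = [
            (score(mass(rs, cs), len(rs) * len(cs)), rs, cs)
            for rs, cs in candidates
            if rs and cs
        ]
        if not scored:
            break  # B06: the input is exhausted
        # B05: keep b_max and store it if it fits the size range.
        best_score, rows, cols = max(scored, key=lambda item: item[0])
        if min_size < len(rows) * len(cols) < max_size:
            snapshots.append((best_score, list(rows), list(cols)))
    return snapshots


def density(block_count, block_cells):
    """Stand-in score: average count per cell (NOT the patent's DTS)."""
    return block_count / block_cells


matrix = [
    [9, 9, 0, 0],
    [9, 9, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
snapshots = build_snapshots(matrix, min_size=3, max_size=10, score=density)
best = max(snapshots, key=lambda item: item[0])
```

With this toy input the highest-scoring stored snapshot is the 2 by 2 block in the top-left corner. Plain density on its own favors ever smaller blocks, which is one reason the size range matters in this sketch.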

Step C: extract m dense blocks from the K-dimensional tensor data D according to the snapshots list.

Step C01: obtain the snapshots list generated in step B06, and find the data snapshot Bmax with the maximum DTS value in it to serve as a dense block.

Step C02: delete the data snapshot Bmax from the K-dimensional tensor data D to obtain the updated tensor data D.

Step C03: repeat steps B to C02 with the updated tensor data D, continually updating the snapshots list, extracting dense blocks, and updating the tensor data D until m dense blocks have been extracted; finally, output the m dense blocks.
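Steps C01 to C03 can be sketched as an outer loop around whatever routine produces the snapshots list. In this sketch, "deleting" a snapshot from D is assumed to mean zeroing the cells it covers, and `toy_snapshots` is a deliberately simple, hypothetical stand-in for Step B (it scores every contiguous 2 by 2 window by its total count) so that the example stays self-contained.

```python
def extract_dense_blocks(matrix, m, make_snapshots):
    """Extract up to m blocks; make_snapshots(data) stands in for Step B.

    Each snapshot is a (score, row_indices, col_indices) tuple.
    """
    data = [row[:] for row in matrix]  # work on a copy of D
    blocks = []
    while len(blocks) < m:
        snapshots = make_snapshots(data)
        if not snapshots:
            break  # nothing left to extract
        # C01: the snapshot with the maximum score is the next dense block.
        score, rows, cols = max(snapshots, key=lambda item: item[0])
        blocks.append((score, rows, cols))
        # C02: delete the block from D (assumed: zero out its cells).
        for r in rows:
            for c in cols:
                data[r][c] = 0
        # C03: loop, rebuilding the snapshots list from the updated D.
    return blocks


def toy_snapshots(data, side=2):
    """Hypothetical stand-in for Step B: score contiguous windows by count."""
    found = []
    for r in range(len(data) - side + 1):
        for c in range(len(data[0]) - side + 1):
            rows = list(range(r, r + side))
            cols = list(range(c, c + side))
            count = sum(data[i][j] for i in rows for j in cols)
            if count > 0:
                found.append((count, rows, cols))
    return found


matrix = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 8, 8],
    [0, 0, 8, 8],
]
blocks = extract_dense_blocks(matrix, m=2, make_snapshots=toy_snapshots)
```

Passing the snapshot generator in as a parameter keeps the outer loop independent of how suspicious degree is measured, so a DTS-based routine could be substituted without changing the extraction logic.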

Compared with the four prior-art techniques described in the background, the invention has the following advantages:

1. Dense block detection based on tensor decomposition is an earlier technique and scales poorly to high-dimensional data, whereas the present method can process high-dimensional data and scales well to it.

2. Both the Suspiciousness-based CrossSpot detection method and the M-Zoom dense block detection framework merge dense blocks of the same size, and the resulting large spliced block contains several non-dense regions, so the accuracy of the detected dense block is very low. For example, suppose a 100 by 100 matrix contains two dense blocks: one is a 20 by 20 block whose row and column indices run from 1 to 20, and the other is a block whose row and column indices run from 21 to 40. If dense block detection is performed with CrossSpot or M-Zoom, a single large 40 by 40 dense block (row and column indices 1 to 40) is detected, with the two small dense blocks lying on its diagonal. The upper-right and lower-left corners of this large block are non-dense, so its accuracy is only about 50% (800 dense cells out of 1600); with three such small blocks the accuracy would be only about 33%, and so on.

In the suspicious degree calculation, the present method takes density fully into account: the density tracking coefficient gives low-density remaining blocks a lower DTS value and high-density remaining blocks a higher DTS value, so low-density remaining blocks are eliminated in the subsequent extraction. In the example above, the method finds the two 20 by 20 dense blocks rather than a single 40 by 40 block.

3. Although the binary-tree-based suspicious block detection method can solve the problem of merging dense blocks of the same size, its efficiency is low. Moreover, because it first splits a large dense block and only then evaluates the result, it may split a large dense block with a higher suspicious degree into two small dense blocks with insufficient suspicious degree, so the dense blocks finally extracted are not the most suspicious ones. The present method does not have this problem: it calculates the DTS value of each remaining block in real time through the density tracking coefficient and detects dense blocks in real time without any splitting operation, so it is more efficient and can accurately find the truly most suspicious blocks.
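The 50% and 33% figures in item 2 above can be checked with a short calculation. The sketch assumes that "accuracy" means the fraction of cells in the merged block that belong to a true dense block, and that the small blocks are equal-sized squares placed along the diagonal; the function name is hypothetical.

```python
def merged_block_accuracy(num_blocks, side):
    """Fraction of a merged diagonal block that is actually dense."""
    dense_cells = num_blocks * side * side
    merged_cells = (num_blocks * side) ** 2
    return dense_cells / merged_cells


two_blocks = merged_block_accuracy(2, 20)    # 800 / 1600
three_blocks = merged_block_accuracy(3, 20)  # 1200 / 3600
```

Under these assumptions the ratio simplifies to 1 / num_blocks, so the accuracy of the merged block falls as more same-sized dense blocks are merged.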

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims

1. A multi-dense-block detection and extraction method for big data, characterized by comprising the following steps:

extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list;

the method for measuring the suspicious degree of the K-dimensional tensor data D by using the density tracking coefficient based on the piecewise function comprises the following steps:

taking K-dimensional tensor data D as input data;

adding all elements of the column corresponding to each attribute under each dimension of the input data to obtain the count of the column corresponding to each attribute under each dimension;

deleting the column with the least count under each dimension of the input data to obtain the remaining block b_i of each dimension, where i denotes the dimension, i = 1, 2, ..., K;

computing the density tracking coefficient of the remaining block b_i of each dimension based on a piecewise function;

computing, from the density tracking coefficient, the suspicious degree DTS value of the remaining block b_i of each dimension with respect to the input tensor data D;

the density tracking coefficient is expressed as follows:

[formula not reproduced in the source text]

where the terms of the formula are the density tracking coefficient of the remaining block b_i of the i-th dimension, the total count of the remaining block b_i, and the product of the sizes of the remaining block b_i;

the suspicious degree DTS value is calculated as follows:

[formula not reproduced in the source text]

where the terms of the formula are the DTS value of the remaining block b_i of the i-th dimension, the Suspiciousness metric, the product of the sizes in the i-th dimension of the original K-dimensional tensor data D, and the total count of the original K-dimensional tensor data D;

the method for generating the snapshots list according to the suspicious degree and the dense block size range comprises the following steps:

comparing the DTS values of the remaining blocks b_i of each dimension to obtain the remaining block b_max with the highest DTS value;

judging whether the remaining block b_max satisfies the dense block size range, and if so, taking the remaining block b_max as data snapshot B;

storing the data snapshot B and its DTS value in the snapshots list;

taking the remaining block b_max as new input data, and performing the suspicious degree measurement and data snapshot extraction again until the input data is empty, to obtain the final snapshots list;

the method for extracting m dense blocks from the K-dimensional tensor data D according to the snapshots list comprises the following steps: