CN114218610A

CN114218610A - Multi-dense block detection and extraction method based on Poisson distribution

Info

Publication number: CN114218610A
Application number: CN202111406839.1A
Authority: CN
Inventors: 王俊松; 边荟凇; 虞振峰; 陈诚
Original assignee: Nanjing College of Information Technology
Current assignee: Nanjing College of Information Technology
Priority date: 2021-11-24
Filing date: 2021-11-24
Publication date: 2022-03-22
Anticipated expiration: 2041-11-24
Also published as: CN114218610B

Abstract

The invention discloses a multi-dense block detection and extraction method based on Poisson distribution, which comprises the following steps: measuring the suspiciousness of multi-dimensional tensor data with a dense block suspiciousness measurement method to obtain a snapshots list containing a plurality of suspicious data snapshots; extracting a single dense block from the multi-dimensional tensor data according to the snapshots list; removing the extracted single dense block from the multi-dimensional tensor data to obtain updated multi-dimensional tensor data; generating a new snapshots list from the updated multi-dimensional tensor data, and extracting new dense blocks until m dense blocks have been extracted. The dense block suspiciousness measurement method is derived from a Poisson distribution that includes both count and density factors. The invention can effectively improve the accuracy and recall rate of dense block detection while maintaining detection efficiency.

Description

Multi-dense block detection and extraction method based on Poisson distribution

Technical Field

The invention relates to a multi-dense block detection and extraction method based on Poisson distribution, and belongs to the technical field of abnormal data detection.

Background

With the advent of the big data age, data anomaly detection becomes more and more important. Abnormal data not only brings data fraud and affects normal data analysis work, but also can cause loss, leakage and the like of normal data, so that the rapid and accurate detection and extraction of abnormal data from mass data is a research focus in the field of data detection.

Abnormal data often includes dense abnormal data, which tends to be consistent and appears as dense block structures in tensor data. Many dense block detection and extraction methods have been proposed for such data, but the existing methods have shortcomings. For example, the Suspiciousness-based CrossSpot detection method can only detect a single dense block: if the tensor contains several dense blocks of the same scale, they are merged into one large dense block, so the accuracy and recall rate of the detection result are low. The binary-tree-based suspicious block detection method first decomposes the dense block each time and only then judges whether the termination condition is met, so its detection efficiency is very low, excessive decomposition occurs easily, and the detection accuracy is insufficient.

Disclosure of Invention

To address the low detection accuracy and low recall rate of existing dense block detection methods, the invention provides a multi-dense block detection and extraction method based on Poisson distribution. Building on the Poisson distribution, the method considers both the count and the density of dense blocks, provides a novel dense block suspiciousness measurement method, and can effectively improve the efficiency, accuracy and recall rate of dense block detection.

In order to solve the technical problems, the invention adopts the following technical means:

the invention provides a multi-dense block detection and extraction method based on Poisson distribution, which comprises the following steps:

acquiring multi-dimensional tensor data, the number m of dense blocks to be extracted and the size range of the dense blocks;

measuring the suspiciousness of the multi-dimensional tensor data with a dense block suspiciousness measurement method to obtain a snapshots list containing a plurality of suspicious data snapshots;

extracting a single dense block from the multi-dimensional tensor data according to the snapshots list;

removing the extracted single dense block from the multi-dimensional tensor data to obtain updated multi-dimensional tensor data;

generating a new snapshots list from the updated multi-dimensional tensor data, and extracting new dense blocks until m dense blocks have been extracted;

the dense block suspiciousness measurement method is derived from a Poisson distribution that includes both count and density factors.

Further, with the dimension of the multi-dimensional tensor data D set to K, measuring the suspiciousness of the multi-dimensional tensor data with the dense block suspiciousness measurement method comprises the following steps:

taking K-dimensional tensor data D as input data;

deleting, in each dimension of the input data, the column with the least count to obtain the residual block b_i of each dimension, where i denotes the dimension, i = 1, 2, …, K;

calculating the suspiciousness DGCS value of the residual block b_i of each dimension with the dense block suspiciousness measurement method, obtaining K DGCS values.

Further, the expression of the Poisson distribution including the two factors of count and density is as follows:

where P(Y = qc) denotes the probability of occurrence of the dense block in the original multi-dimensional tensor data under the Poisson distribution, Q denotes the density of the original multi-dimensional tensor data, q denotes the density of the dense block, and c denotes the total count of the dense block.

Further, the expression of the suspiciousness DGCS value is as follows:

where the symbols of the expression denote, respectively: the suspiciousness DGCS value of the residual block b_i of the i-th dimension; the product of the sizes of the residual block b_i; the total count of the residual block b_i; N, the product of the sizes of the multi-dimensional tensor data D; C, the total count of the multi-dimensional tensor data D; the probability that the residual block b_i occurs in the multi-dimensional tensor data D; the density of the residual block b_i; and Q, the density of the multi-dimensional tensor data D;

according to the calculation formula of the density, the expression of the suspiciousness DGCS value satisfies the following equation:

Q = C/N (3)

Further, the method for obtaining the snapshots list containing the plurality of suspicious data snapshots comprises the following steps:

comparing the K DGCS values, and recording the residual block with the highest DGCS value as b_max;

judging whether the residual block b_max satisfies the dense block size range, and if so, taking the residual block b_max as data snapshot B;

associating the data snapshot B with its DGCS value and storing it in the snapshots list;

taking the residual block b_max as new input data, measuring the suspiciousness again, extracting a new data snapshot and storing it in the snapshots list, until the input data is empty, thereby obtaining the snapshots list containing a plurality of suspicious data snapshots.

Further, the data to be detected is converted into multi-dimensional tensor data through data integration, data desensitization, data cleaning and data modeling.

The following advantages can be obtained by adopting the technical means:

the invention provides a multi-dense block detection and extraction method based on Poisson distribution. The data to be detected is converted into multi-dimensional tensor data for subsequent detection. During dense block detection, the suspiciousness DGCS value of each data block in the multi-dimensional tensor data is calculated with the designed dense block suspiciousness measurement method to obtain the data snapshots that meet the requirements; all suspicious data snapshots are stored together in a snapshots list, dense blocks are selected one by one according to the list, and finally the several most suspicious dense blocks are obtained. The method stores the data snapshots in a list and avoids splicing or decomposing dense blocks, which ensures the independence and integrity of the dense blocks in each detection pass. In addition, the dense block suspiciousness measurement method considers the count and the density of a data block at the same time, so that dense blocks with high density obtain higher DGCS values and are easier to detect. This solves the problems of existing dense block detection methods and improves the efficiency, accuracy and recall rate of dense block detection.

Drawings

FIG. 1 is a flow chart of a multi-dense block detection and extraction method based on Poisson distribution according to the present invention;

FIG. 2 is a flow chart of single dense block detection and extraction in the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the accompanying drawings as follows:

the invention provides a multi-dense block detection and extraction method based on Poisson distribution, which specifically comprises the following steps as shown in figures 1 and 2:

step A, acquiring multi-dimensional tensor data, the number m of dense blocks to be extracted and the size range of the dense blocks.

The multi-dimensional tensor data is acquired as follows: 1. data integration: the data to be detected is integrated into a designated data center using ETL technology; 2. data desensitization: sensitive information from the production environment (such as identification numbers) in the data to be detected is desensitized; 3. data cleaning: the desensitized data is cleaned to ensure its accuracy and consistency; 4. data preprocessing: the cleaned data is modeled, converting the data to be detected into multi-dimensional tensor data.
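The final modeling step can be sketched as follows. This is a minimal illustration, assuming the cleaned data arrives as records with named fields and that each tensor cell simply counts how often a combination of field values occurs; the function `build_count_tensor`, the field names and the sample records are hypothetical and do not come from the patent.

```python
from collections import Counter

def build_count_tensor(records, dims):
    """Aggregate cleaned records into a sparse K-dimensional count tensor.

    records: iterable of dicts; dims: the K field names used as tensor modes.
    Returns a Counter mapping a K-tuple of field values to its count.
    """
    tensor = Counter()
    for rec in records:
        key = tuple(rec[d] for d in dims)
        tensor[key] += 1
    return tensor

# Hypothetical, already desensitized and cleaned log entries.
records = [
    {"user": "u1", "item": "p1", "day": "2021-11-01"},
    {"user": "u1", "item": "p1", "day": "2021-11-01"},
    {"user": "u2", "item": "p3", "day": "2021-11-02"},
]
tensor = build_count_tensor(records, dims=("user", "item", "day"))
```

A sparse mapping is used here because real interaction data usually fills only a small fraction of the cells; a dense array would work equally well for small tensors.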

In the embodiment of the invention, the dimension of the multi-dimensional tensor data D is K. The number m of dense blocks to be extracted and the size range [min, max] of the dense blocks are set manually, where min is the lower limit of the dense block size and max is the upper limit.

Step B, measuring the suspiciousness of the multi-dimensional tensor data with a dense block suspiciousness measurement method to obtain a snapshots list containing a plurality of suspicious data snapshots.

In the method, the dense block suspiciousness measurement method is derived from a Poisson distribution that includes both count and density factors.

Among existing dense block detection methods, some measure the suspiciousness of dense blocks with a Poisson distribution on the count alone, that is:

where P(Y = c) denotes the probability of occurrence of a dense block in the original multi-dimensional tensor data under the Poisson distribution based on the count factor, and c denotes the total count of the dense block, i.e., the sum of all elements in the dense block.
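As a point of reference, the textbook Poisson probability mass function is P(Y = c) = λ^c e^(−λ) / c!. The sketch below evaluates it in log space. The rate `lam` (for example, the count expected in a block under the background density) and the use of the negative log-probability as a score are assumptions made for illustration, not the expression used in the patent.

```python
import math

def poisson_log_pmf(c, lam):
    """log P(Y = c) for Y ~ Poisson(lam), with integer c >= 0 and lam > 0."""
    return c * math.log(lam) - lam - math.lgamma(c + 1)

def count_suspiciousness(c, lam):
    """Negative log-probability: larger means the count c is less expected."""
    return -poisson_log_pmf(c, lam)

# Assumed background rate of 5 events; compare an ordinary and an extreme count.
ordinary = count_suspiciousness(5, 5.0)
extreme = count_suspiciousness(50, 5.0)
```

Working with `math.lgamma` instead of a factorial keeps the computation stable for large counts.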

The count-only Poisson distribution does not take the density of the dense blocks into account, so dense blocks of the same scale can be merged into one whole block. The Poisson distribution containing both count and density factors used in the method solves this problem.

The expression of the Poisson distribution including both count and density factors is as follows:

where P(Y = qc) denotes the probability of occurrence of a dense block in the original multi-dimensional tensor data under the Poisson distribution based on the two factors of count and density, Q denotes the density of the original multi-dimensional tensor data, and q denotes the density of the dense block.

The invention provides a novel dense block suspiciousness measurement method, in which the suspiciousness DGCS value of a dense block can be derived from formula (7). The suspiciousness DGCS value has the following expression:

where the symbols of the expression denote, respectively: the product of the sizes of the residual block b_i; and the total count of the residual block b_i, with i = 1, 2, …, K.

Q = C/N (9)

Assume that the original multi-dimensional tensor data is a 100 × 100 matrix and the residual block b_i is a 50 × 50 matrix. Then K = 2, N = 100 × 100 = 10000, C is the sum of the 10000 elements of the original multi-dimensional tensor data, the product of the sizes of b_i is 50 × 50 = 2500, and the total count of b_i is the sum of the 2500 elements in the residual block.
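The arithmetic of this example can be checked with a short sketch. The sizes come from the text; the total counts `C` and `c_block` are hypothetical values chosen only to make the densities concrete, and computing the block density as block count divided by block size is assumed by analogy with Q = C/N.

```python
from math import prod

def density(total_count, sizes):
    """Density = total count divided by the product of the mode sizes."""
    return total_count / prod(sizes)

# Sizes from the worked example: a 100 x 100 tensor and a 50 x 50 residual block.
N = prod((100, 100))       # 10000 cells in D
n_block = prod((50, 50))   # 2500 cells in b_i

C = 400                    # hypothetical total count of D
c_block = 300              # hypothetical total count of b_i

Q = density(C, (100, 100))       # density of the whole tensor
q = density(c_block, (50, 50))   # density of the residual block (assumed analogous)
```

With these assumed counts the residual block is three times as dense as the tensor as a whole, which is the kind of contrast a density-aware measure is meant to pick up.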

In the embodiment of the present invention, the specific operation of step B is as follows:

Step B01, taking the K-dimensional tensor data D as input data, and creating an empty list snapshots for storing the snapshots generated at each step of the optimization process on the tensor data D.

Step B02, deleting the column with the least count in each dimension of the input data to obtain the residual block b_i of each dimension.

Assume the input data is a 100 × 100 two-dimensional tensor matrix, so that K = 2, the first dimension being rows and the second dimension being columns. Adding the 100 elements of each row gives the count of that row; the counts of the 100 rows are compared, and the row with the least count is removed, yielding a 99 × 100 two-dimensional tensor matrix, which is the residual block b₁ of the first dimension. Removing the column with the least count from the 100 columns in the same way yields a 100 × 99 two-dimensional tensor matrix, which is the residual block b₂ of the second dimension.
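Step B02 can be sketched as below, using a small 3 × 3 example instead of 100 × 100. The sparse dictionary representation, the helper names `slice_counts` and `remove_least_slice`, and the tie-breaking rule (smallest index wins) are assumptions for illustration; the text does not specify how ties are resolved.

```python
def slice_counts(tensor, kept, mode):
    """Sum the counts of each kept index along `mode`, restricted to the block."""
    counts = {idx: 0 for idx in kept[mode]}
    for key, value in tensor.items():
        if all(key[d] in kept[d] for d in range(len(kept))):
            counts[key[mode]] += value
    return counts

def remove_least_slice(tensor, kept, mode):
    """Return a copy of `kept` without the least-count index along `mode`."""
    counts = slice_counts(tensor, kept, mode)
    # Assumed tie-break: among equal counts, drop the smallest index.
    victim = min(counts, key=lambda idx: (counts[idx], idx))
    reduced = [set(s) for s in kept]
    reduced[mode].discard(victim)
    return reduced

# Sparse 3 x 3 matrix: keys are (row, column), values are counts.
tensor = {(0, 0): 5, (0, 1): 4, (1, 0): 3, (1, 1): 6, (2, 2): 1}
kept = [{0, 1, 2}, {0, 1, 2}]

# One residual block per dimension: b_1 drops a row, b_2 drops a column.
candidates = [remove_least_slice(tensor, kept, mode) for mode in range(2)]
```

Here row 2 and column 2 each hold a total count of 1, so `candidates[0]` keeps rows 0 and 1 and `candidates[1]` keeps columns 0 and 1.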

Step B03, calculating the suspiciousness DGCS value of the residual block b_i of each dimension using formula (11), obtaining K DGCS values.

Step B04, comparing the K DGCS values, and marking the residual block with the highest DGCS value as b_max.

Step B05, judging whether the residual block b_max satisfies the dense block size range, i.e. min < size(b_max) < max; if so, taking the residual block b_max as data snapshot B, otherwise doing nothing.

Step B06, associating the data snapshot B with its DGCS value and storing it in the snapshots list.

Step B07, taking the residual block b_max from step B04 as new input data, measuring the suspiciousness again, extracting a new data snapshot and storing it in the snapshots list, and stopping the loop when the input data is empty, thereby obtaining the snapshots list containing a plurality of suspicious data snapshots.

Assume the DGCS value of the residual block b₁ of the first dimension (the 99 × 100 two-dimensional tensor matrix) is greater than the DGCS value of the residual block b₂ of the second dimension (the 100 × 99 two-dimensional tensor matrix); then b₁ is selected as b_max. When the above steps are repeated, b₁ (the 99 × 100 two-dimensional tensor matrix) serves as the new input data: removing the row with the least count from the 99 rows of b₁ gives a 98 × 100 two-dimensional tensor matrix, removing the column with the least count from the 100 columns of b₁ gives a 99 × 99 two-dimensional tensor matrix, and the DGCS values of the 98 × 100 and 99 × 99 two-dimensional tensor matrices are compared; and so on.
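The loop of steps B01 to B07 can be sketched as follows. This is a minimal illustration under several assumptions: the tensor is stored sparsely as a dictionary, size(b_max) is taken to be the product of the mode sizes, ties are broken in favour of the smallest index and the first dimension, and `stand_in_score` is a hypothetical placeholder (total count times density) that is NOT the DGCS formula.

```python
from math import prod

def block_count(tensor, kept):
    """Total count of the entries whose index lies inside the block."""
    return sum(v for k, v in tensor.items()
               if all(k[d] in kept[d] for d in range(len(kept))))

def drop_least(tensor, kept, mode):
    """Copy of `kept` with the least-count index along `mode` removed."""
    counts = dict.fromkeys(kept[mode], 0)
    for k, v in tensor.items():
        if all(k[d] in kept[d] for d in range(len(kept))):
            counts[k[mode]] += v
    victim = min(counts, key=lambda i: (counts[i], i))
    return [s - {victim} if d == mode else set(s) for d, s in enumerate(kept)]

def stand_in_score(tensor, block):
    """Hypothetical placeholder, NOT the DGCS formula: total count times density."""
    size = prod(len(s) for s in block)
    if size == 0:
        return 0.0
    c = block_count(tensor, block)
    return c * (c / size)

def build_snapshots(tensor, shape, score, min_size, max_size):
    """Greedy shrinking: record every in-range b_max together with its score."""
    kept = [set(range(n)) for n in shape]               # B01: start from all of D
    snapshots = []
    while all(kept):                                    # B07: stop once the input is empty
        candidates = [drop_least(tensor, kept, m) for m in range(len(shape))]  # B02
        scored = [(score(tensor, b), b) for b in candidates]                   # B03
        best_score, best = max(scored, key=lambda t: t[0])                     # B04
        if min_size < prod(len(s) for s in best) < max_size:                   # B05
            snapshots.append((best_score, best))                               # B06
        kept = best
    return snapshots

# 6 x 6 example: a dense 2 x 2 block in the top-left corner plus two noise cells.
tensor = {(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 10, (3, 2): 1, (4, 5): 1}
snaps = build_snapshots(tensor, (6, 6), stand_in_score, min_size=1, max_size=36)
top_score, top_block = max(snaps, key=lambda t: t[0])
```

With the placeholder score, the highest-scoring snapshot is the 2 × 2 block covering rows 0 to 1 and columns 0 to 1. Passing the score as a parameter keeps the loop independent of the particular suspiciousness measure.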

Step C, extracting a single dense block from the multi-dimensional tensor data according to the snapshots list, specifically: finding the data snapshot B_max with the maximum DGCS value in the snapshots list and taking it as the dense block.

Step D, removing the extracted single dense block from the multi-dimensional tensor data to obtain the updated multi-dimensional tensor data.

Step E, repeating steps B to D on the updated multi-dimensional tensor data to generate a new snapshots list, and extracting new dense blocks until m dense blocks have been extracted.
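Steps C to E can be sketched as an outer loop around the snapshot generation. The sketch assumes that "removing" a block means deleting every entry that falls inside it, and it takes the snapshot generator as a parameter `make_snapshots`; both that parameter and the `fake_snapshots` stub used in the example are hypothetical stand-ins for step B.

```python
def extract_dense_blocks(tensor, m, make_snapshots):
    """Repeatedly take the top-scoring snapshot and remove it from the data."""
    remaining = dict(tensor)
    blocks = []
    for _ in range(m):
        snapshots = make_snapshots(remaining)            # step B on the current data
        if not snapshots:
            break                                        # nothing left to extract
        _, block = max(snapshots, key=lambda t: t[0])    # step C: highest score wins
        blocks.append(block)
        # Step D (assumed semantics): drop every entry inside the extracted block.
        remaining = {k: v for k, v in remaining.items()
                     if not all(k[d] in block[d] for d in range(len(block)))}
    return blocks, remaining

def fake_snapshots(remaining):
    """Stub for step B: returns fixed (score, block) pairs while their cells exist."""
    out = []
    if (0, 0) in remaining:
        out.append((5.0, [{0}, {0}]))
    if (1, 1) in remaining:
        out.append((3.0, [{1}, {1}]))
    return out

tensor = {(0, 0): 9, (1, 1): 4, (2, 2): 1}
blocks, rest = extract_dense_blocks(tensor, 2, fake_snapshots)
```

Because each extracted block is removed before the next pass, the second pass cannot return the same block again, which is what allows several blocks of similar size to be found separately.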

Compared with the prior art, the invention has the following advantages:

1. In the process of detecting dense blocks, the dense blocks are neither spliced nor decomposed, which ensures the independence and integrity of each dense block and effectively improves the recall rate; the method is particularly suitable for detecting several dense blocks of the same or similar size. Assume a 100 × 100 matrix containing two dense blocks: one 20 × 20 block whose row and column labels both run from 1 to 20, and another 20 × 20 block whose row and column labels both run from 21 to 40. If detection is performed with the Suspiciousness-based CrossSpot detection method, a single large 40 × 40 dense block (row and column labels both from 1 to 40) may be detected, with the two small dense blocks lying on its diagonal. The upper-right and lower-left corners of this large block are not dense, so the recall rate of the large dense block is only about 50%. The method of the invention takes density fully into account: a block with higher density (a small dense block) receives a larger DGCS value and a block with lower density (the large dense block) receives a smaller DGCS value, so the method can accurately detect the two 20 × 20 dense blocks rather than one 40 × 40 block, and the recall rate of the dense blocks is close to 100%, far higher than that of the prior art.
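The "about 50%" figure can be checked with a short calculation. The sketch assumes, for illustration only, that every cell inside a dense block holds the same count and that the off-diagonal corners of the merged block are empty.

```python
def density(count, n_cells):
    """Total count divided by the number of cells."""
    return count / n_cells

cell_value = 1                       # hypothetical count per dense cell
small_cells = 20 * 20                # cells in one 20 x 20 block
merged_cells = 40 * 40               # cells in the merged 40 x 40 block

small_count = cell_value * small_cells
merged_count = 2 * small_count       # both small blocks; empty corners add nothing

q_small = density(small_count, small_cells)
q_merged = density(merged_count, merged_cells)
dense_fraction = 2 * small_cells / merged_cells   # share of merged cells that are dense
```

Under these assumptions the merged block has half the density of either small block, and only half of its cells are actually dense, which is why a density-aware score ranks the two small blocks above the merged one.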

2. The method has higher detection efficiency and higher detection accuracy. Assume a 100 × 100 matrix containing one 30 × 30 dense block and one 15 × 15 dense block, with the number of dense blocks to detect set to 2. If detection is performed with the binary-tree-based suspicious block detection method, that method may decompose the 30 × 30 dense block into one 20 × 20 small dense block and one 10 × 10 small dense block. After the decomposition the suspiciousness of the 10 × 10 small dense block decreases, and the method finally retrieves the 15 × 15 dense block and the 20 × 20 small dense block, so the detection accuracy is insufficient and the extracted dense blocks are not the most suspicious ones. The method of the invention does not have this problem: it does not split dense blocks, it calculates the DGCS value of the residual blocks directly, it detects dense blocks in real time with higher efficiency, and it can accurately find the truly most suspicious blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A multi-dense block detection and extraction method based on Poisson distribution, characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the dimension of the multi-dimensional tensor data D is K, and measuring the suspiciousness of the multi-dimensional tensor data with the dense block suspiciousness measurement method comprises:

taking K-dimensional tensor data D as input data;

deleting the column with the least count in each dimension of the input data to obtain the residual block b_i of each dimension, where i denotes the dimension, i = 1, 2, …, K;

3. The method of claim 1, wherein the expression of the Poisson distribution including the count and density factors is as follows:

4. The Poisson distribution-based multi-dense block detection and extraction method as claimed in claim 2 or 3, wherein the suspiciousness DGCS value is expressed as follows:

where the symbols of the expression denote, respectively: the product of the sizes of the residual block b_i; the total count of the residual block b_i; N, the product of the sizes of the multi-dimensional tensor data D; and C, the total count of the multi-dimensional tensor data D,

Q = C/N

5. The method of claim 2, wherein the step of obtaining a snapshots list comprising a plurality of suspicious data snapshots comprises:

6. The Poisson distribution-based multi-dense block detection and extraction method as claimed in claim 1, wherein the data to be detected is converted into multi-dimensional tensor data by data integration, data desensitization, data cleaning and data modeling.