CN106228173A

CN106228173A - A forensic data reduction method based on spatial statistics

Info

Publication number: CN106228173A
Application number: CN201510305873.8A
Authority: CN
Inventors: 彭涛; 姜明华; 杨贤; 宋坤芳; 胡鸣; 魏雄; 梁晶
Original assignee: Wuhan Textile University
Current assignee: Wuhan Textile University
Priority date: 2015-06-02
Filing date: 2015-06-02
Publication date: 2016-12-14

Abstract

A forensic data reduction method based on spatial statistics, comprising two parts: static forensic data reduction and dynamic forensic data reduction. First, according to the needs of the investigation, the collected data are given a feature-based description (characterization), so that the data can be mapped into a multidimensional space while still reflecting their essential characteristics. Next, the forensic data are mapped to points in that space, the mapped points are divided into multiple data sets, and each data set is reduced separately. Then, a forensic data reduction algorithm based on the local Hausdorff distance is applied to each data set. Finally, the reduction effect is evaluated. The method substantially reduces the scale of the forensic data without losing the value of the original forensic data set, thereby improving the efficiency of digital forensics.

Description

A forensic data reduction method based on spatial statistics

Technical field

The present invention relates to the field of data reduction in big data processing, and in particular to a forensic data reduction method based on spatial statistics.

Background art

Investigating computer crime requires processing massive amounts of data, which poses great challenges for network transmission, storage, and processing. The difficulty lies in removing invalid, redundant, and highly similar data from the massive forensic data to obtain a relatively small data set, so that the scale of the evidence is reduced without losing the value of the original forensic data and the efficiency of digital forensics improves. On the one hand, removing invalid, redundant, and highly similar data must not destroy the value of the original data, so that the reduced data remain valid. On the other hand, the reduction of large forensic data sets must itself be fast, so that it does not impair forensic efficiency.

Therefore, developing a forensic data reduction method that quickly and efficiently extracts the data of forensic value from large data sets has become a technical problem in urgent need of a solution.

Summary of the invention

To overcome the above problems, the present invention provides a big data reduction method for forensics based on spatial distribution. The method substantially reduces the scale of the forensic data without losing the value of the original forensic data set, thereby improving the efficiency of digital forensics.

To achieve the above object, the present invention adopts the following technical solution:

A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:

1) According to the needs of the investigation, the collected data are characterized; the characterized data can be mapped into a multidimensional space while still reflecting the essential characteristics of the data;

2) The forensic data are mapped to points in space, the mapped points are divided into multiple data sets, and each data set is reduced separately;

3) A forensic data reduction algorithm based on the local Hausdorff distance is used to reduce each data set;

4) The reduction effect is evaluated.

Preferably, the evaluation is performed using the formula R = \frac{V_{R}}{V_{O}}, where V_O is the value of the original data set and V_R is the value of the reduced data set.

Compared with the prior art, the forensic data reduction method based on spatial statistics of the present invention has the following advantages:

Brief description of the drawings

Fig. 1 shows the forensic process with data reduction;

Fig. 2 shows the characterization of forensic data;

Fig. 3 shows the geometric meaning of the Hausdorff distance;

Fig. 4 shows how the Hausdorff distance characterizes the similarity of different data sets;

Fig. 5 compares the reduced data set with the original data set;

Fig. 6 is the QQ plot of the original data set and the reduced data set;

Fig. 7 is a schematic block diagram of forensic data collection;

Fig. 8 is a schematic block diagram of the characterization of text-type data;

Fig. 9 is a schematic diagram of the seamless partitioning of a data set.

Detailed description of the invention

The invention is further described below with reference to embodiments, but the scope of the invention is not limited thereto.

The forensic data reduction method based on spatial statistics comprises two parts, static forensic data reduction and dynamic forensic data reduction. The specific steps are as follows:

1) Extraction of spatial features from different kinds of forensic data:

According to the needs of the investigation, the collected data are characterized so that they can be mapped into a multidimensional space while still reflecting their essential characteristics. For example, document-type data are characterized using forensic keywords, while attack-type data can be characterized record by record according to protocol, attack frequency, duration, and so on. Fig. 2 shows the characterization of forensic data.

Definition 1: Given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, one forensic record can be expressed as D_i = (F, V), and the data set as D = {D_k | k = 1, ..., n}.

Based on Definition 1, the features of the forensic data are determined according to the forensic type:

(1) Document data: determine the keywords and search the documents for the frequency with which each keyword occurs, obtaining the text data set D_T = (W, V), where W is the keyword set and V is the frequency set.

(2) Attack data: following the feature classification of attack data developed by Professor Sal Stolfo of Columbia University and Professor Wenke Lee of North Carolina State University, each attack record can be decomposed into 41 features.

(3) Account data: the main features of account data include account type, account number, access time, access frequency, and access location; these features are used to characterize the acquired account data.
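
As an illustration of Definition 1 for document data, the following minimal Python sketch builds the pair (W, V) from one document and a keyword list. The function name `characterize_document`, the simple regular-expression tokenizer, and the sample text are assumptions made for this illustration; the source does not specify them.

```python
import re
from collections import Counter
from typing import List, Tuple


def characterize_document(text: str, keywords: List[str]) -> Tuple[List[str], List[int]]:
    """Return (W, V): the keyword set W and the matching frequency set V."""
    # Assumed tokenizer: lowercase the text and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A keyword that never occurs gets frequency 0.
    frequencies = [counts[k.lower()] for k in keywords]
    return keywords, frequencies


doc = "Transfer the funds tonight. The transfer must use the offshore account."
W, V = characterize_document(doc, ["transfer", "account", "password"])
print(W, V)  # V is the coordinate vector of this document in keyword space
```

Each document then becomes one point whose coordinates are the entries of V, which is the form the later partitioning and reduction steps operate on.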

2) Seamless partitioning of large data sets:

In general, forensic data sets are very large, and reducing the original data set directly makes the complexity of the algorithm high. Because the forensic data are first mapped to points in space, we find that these spatially distributed data exhibit locality, that is, data within a local range have no statistical relationship with data outside that range. The present invention exploits this property: the original data set is divided into small data sets with a certain degree of overlap, and each small data set is reduced, thereby reducing the whole original big data set. The specific steps are as follows:

Step 1: Obtain the characterized data set S;

Step 2: Set the size M of a data partition;

Step 3: Choose a reference point o from the original data set S (generally the center point);

Step 4: Select from S the point nearest to o as the initial point x_i;

Step 5: Use the K-NN search algorithm, centered on x_i, to find the sub data set S_i;

Step 6: If the number n of data points in data set D is greater than K, set S = S - S_i and jump to step 4; otherwise the algorithm exits. After the above steps, a data set S = {S_k | k = 1, ..., N} composed of the partitioned data sets is obtained.
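
The partitioning loop can be sketched in Python as follows. This is a minimal sketch under several assumptions that the source does not confirm: the reference point o is taken to be the centroid, the nearest-neighbour search is brute force, a single parameter `k` stands in for both the partition size M and the neighbour count K, and the points left over at the end form the final subset. Because the search runs only over the points still remaining, the subsets produced here are disjoint; the overlap mentioned in the text is not modelled.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean, used here as the reference point o (assumption)."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def partition(points: List[Point], k: int) -> List[List[Point]]:
    """Split points into subsets of at most k points each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(points)
    if not remaining:
        return []
    o = centroid(remaining)  # step 3: reference point, fixed once
    subsets = []
    while remaining:
        # Step 4: the remaining point nearest to o becomes the seed x_i.
        seed = min(remaining, key=lambda p: math.dist(p, o))
        # Step 5: brute-force k-nearest-neighbour search around the seed.
        remaining.sort(key=lambda p: math.dist(p, seed))
        subsets.append(remaining[:k])
        # Step 6: remove S_i from S and repeat while points remain.
        remaining = remaining[k:]
    return subsets


points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11)]
for subset in partition(points, k=4):
    print(subset)
```

Sorting the remaining points on every pass costs O(n log n) per subset, which is acceptable for a sketch; a spatial index would be the usual choice for data sets of forensic scale.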

3) Forensic data reduction method based on the local Hausdorff distance

Fig. 3 shows the geometric meaning of the Hausdorff distance. As can be seen, the Hausdorff distance characterizes the spatial similarity between two data sets well. From Fig. 3, the Hausdorff distance is defined by formula (1):

d_{H}(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y),\ \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\} \qquad (1)

Fig. 4 shows the Hausdorff distance used to describe the similarity between two different data sets. As can be seen, the smaller the Hausdorff distance between two data sets, the greater their similarity.
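
For finite point sets the supremum and infimum in formula (1) become a maximum and a minimum, so the distance can be computed directly. The following Python sketch assumes the Euclidean metric for d(x, y) and non-empty point sets; the function names are chosen for this illustration only.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def directed_hausdorff(xs: List[Point], ys: List[Point]) -> float:
    """max over x in xs of the distance from x to its nearest point in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)


def hausdorff(xs: List[Point], ys: List[Point]) -> float:
    """Symmetric Hausdorff distance of formula (1) for finite, non-empty sets."""
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))


X = [(0, 0), (1, 0)]
Y = [(0, 0), (1, 0), (4, 0)]
print(hausdorff(X, Y))  # 3.0: the point (4, 0) lies 3 away from its nearest point in X
```

The brute-force evaluation costs O(|X| |Y|) distance computations, which is one reason the method applies it to small partitions rather than to the whole data set.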

Using the idea of the local Hausdorff distance, the small data sets obtained by partitioning are reduced first. The algorithm has two stages: (1) obtaining the reduction threshold μ; (2) running the reduction algorithm based on the parameter μ.

Obtaining the reduction threshold μ:

Step 1: Determine the reduction rate δ, set the initial threshold μ_i and the threshold adjustment parameter λ, and randomly select a data set A;

Step 2: Take from data set A a data item x that has not yet been selected;

Step 3: Remove instance x to obtain data set B;

Step 4: Calculate the Hausdorff distance Hd between data set S and data set B;

Step 5: If Hd is less than the given threshold μ_i, the instance can be removed; otherwise the instance is retained. This yields a new data set S;

Step 6: If data set S has not been fully traversed, jump to step 2; otherwise go to step 7;

Step 7: Calculate the reduction rate δ′ from |B| and |S|, where |B| denotes the number of instances in data set B and |S| denotes the number of instances in data set S. If δ′ > δ, set μ = μ - λ and jump to step 2.
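
The threshold search can be sketched in Python as below. The formula for the achieved reduction rate is not legible in the source, so this sketch assumes it is the fraction of instances removed, 1 - |B| / |S|; that choice, the function names, and the sample values are all assumptions. The removal test uses the fact that the Hausdorff distance between a finite set and the same set with one instance x removed equals the distance from x to its nearest remaining neighbour.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def reduce_once(sample: List[Point], mu: float) -> List[Point]:
    """One pass: drop each instance whose removal changes the set by less than mu."""
    kept = list(sample)
    for x in sample:
        if len(kept) < 2:
            break  # never empty the set
        others = list(kept)
        others.remove(x)
        # Hausdorff distance between kept and kept-without-x.
        hd = min(math.dist(x, p) for p in others)
        if hd < mu:
            kept = others
    return kept


def calibrate_threshold(sample: List[Point], target_rate: float,
                        mu_init: float, lam: float) -> float:
    """Lower mu in steps of lam until the achieved rate no longer exceeds target_rate."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    mu = mu_init
    while True:
        kept = reduce_once(sample, mu)
        # ASSUMPTION: reduction rate = fraction of instances removed.
        rate = 1 - len(kept) / len(sample)
        if rate <= target_rate:
            return mu
        mu -= lam


sample = [(0, 0), (1, 0), (3, 0), (10, 0)]
print(calibrate_threshold(sample, target_rate=0.25, mu_init=4, lam=1))
```

The loop always terminates for a non-negative target rate, because once μ is no longer positive no instance passes the removal test and the achieved rate is zero.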

Implementation of the algorithm based on the parameter μ:

Step 1: Partition the original large data set with the seamless partitioning method to obtain the data set S = {S_k | k = 1, ..., N};

Step 2: Obtain the parameter μ;

Step 3: Take a small data set S_i from the set S and select an instance x from it;

Step 4: Remove instance x from data set S_i to obtain data set S_i′;

Step 5: Calculate the Hausdorff distance Hd between data set S_i and data set S_i′;

Step 6: If Hd is less than the given threshold μ, the instance can be removed and S_i = S_i′; otherwise the instance is retained;

Step 7: If data set S_i has not been fully traversed, jump to step 4; otherwise go to step 3;

Step 8: Finally, the reduced data set S′ is obtained.
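
A minimal Python sketch of this reduction stage follows. It assumes the Euclidean metric, visits the instances of each subset in their stored order, and never removes the last remaining instance of a subset; these details and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def hausdorff(xs: List[Point], ys: List[Point]) -> float:
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    forward = max(min(math.dist(x, y) for y in ys) for x in xs)
    backward = max(min(math.dist(x, y) for x in xs) for y in ys)
    return max(forward, backward)


def reduce_subset(subset: List[Point], mu: float) -> List[Point]:
    """Steps 3 to 7 for one subset S_i."""
    kept = list(subset)
    for x in subset:
        if len(kept) < 2:
            break  # keep at least one representative
        candidate = list(kept)
        candidate.remove(x)  # step 4: S_i' = S_i without x
        if hausdorff(kept, candidate) < mu:  # steps 5 and 6
            kept = candidate
    return kept


def reduce_all(subsets: List[List[Point]], mu: float) -> List[Point]:
    """Step 8: the reduced data set S' is the union of the reduced subsets."""
    reduced: List[Point] = []
    for subset in subsets:
        reduced.extend(reduce_subset(subset, mu))
    return reduced


subsets = [[(0, 0), (0.1, 0), (0.2, 0), (5, 5)], [(9, 9)]]
print(reduce_all(subsets, mu=0.3))
```

In the first subset the three nearly coincident points collapse to a single representative, while the isolated point (5, 5) survives because removing it would change the shape of the set by far more than μ.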

Fig. 5 compares the reduced data set with the original data set when the sub data set size is K = 20 and the threshold parameter is 0.3 (the reduction threshold; the reduction rate is about 57%).

4) Evaluation of the reduction effect

The extent to which the reduced data set preserves the value of the original data set is the most direct indicator for evaluating a reduction algorithm. Let V_O be the value of the original data set and V_R the value of the reduced data set; their ratio is R:

R = \frac{V_{R}}{V_{O}} \qquad (2)

The closer R is to 1, the better the reduction effect.

However, the criterion for the value of a data set differs between data sets and between application environments. Existing methods for evaluating the reduction effect mainly target data sets used for classification, so classification accuracy is their only effective measure of the reduction effect, which is a limitation. Any data can be described as points in space, and the positions of these points and the relationships between them have certain characteristics, namely spatial statistical characteristics. There are many ways to describe the spatial characteristics of data. A QQ plot (quantile-quantile plot) compares the probability distributions of two data sets by comparing the quantiles of the spatial positions of their data. It presents the similarity of the two data sets graphically and cannot quantify it. The present scheme quantifies the position-quantile statistics of the data so that the similarity of data sets can be expressed numerically and, taking the characteristics of digital forensic data sets into account, provides a reduction effect evaluation method based on spatial statistics. In addition, feeding the evaluation result back allows the reduction algorithm to be optimized.

The steps for computing the QQ plot are as follows:

Step 1: Calculate the mean over all dimensions of the original data set S and of the reduced data set S′ (\bar{X} and \bar{X}′, respectively), as follows:

\bar{X} = \frac{1}{n} \frac{1}{D} \sum_{i=1}^{n} \sum_{k=1}^{D} X_{ik};

Step 2: Calculate the variance of the original data set and of the reduced data set (σ² and σ′², respectively, whose square roots are the standard deviations σ and σ′), as follows:

\sigma^{2} = \frac{1}{n} \frac{1}{D} \sum_{i=1}^{n} \sum_{k=1}^{D} (X_{ik} - \mu)^{2}, where μ denotes the mean from step 1;

Step 3: Calculate the standardized values d_i = (X_i - μ)/σ (i = 1, ..., n) of the two data sets (d = {d_i | i = 1, ..., n} and d′ = {d′_i | i = 1, ..., m}, respectively), and sort each set;

Step 4: Plot the standardized values of the original data set and of the reduced data set as the horizontal and vertical coordinates, respectively, to obtain the scatter plot.
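
The four steps can be sketched in Python for the simplest case. The sketch assumes one-dimensional data (D = 1), so that each X_i is a scalar, and it pairs the two sorted sets at a fixed number of common quantile positions because the sets generally differ in size (n versus m). Both simplifications, the nearest-rank quantile rule, and the function names are assumptions that the source does not specify.

```python
import math
from typing import List, Tuple


def standardize(values: List[float]) -> List[float]:
    """Steps 1 to 3: mean, variance with a 1/n factor, sorted standardized values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    sigma = math.sqrt(variance)
    if sigma == 0:
        raise ValueError("all values are identical; cannot standardize")
    return sorted((v - mean) / sigma for v in values)


def qq_pairs(original: List[float], reduced: List[float],
             n_quantiles: int = 5) -> List[Tuple[float, float]]:
    """Step 4: (x, y) coordinates of the QQ plot at common quantile positions."""
    d = standardize(original)
    d_reduced = standardize(reduced)

    def pick(sorted_values: List[float], q: float) -> float:
        # Nearest-rank quantile (assumed rule); q lies strictly between 0 and 1.
        index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
        return sorted_values[index]

    positions = [(i + 0.5) / n_quantiles for i in range(n_quantiles)]
    return [(pick(d, q), pick(d_reduced, q)) for q in positions]


original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
reduced = [1.0, 3.0, 5.0, 7.0]
for x, y in qq_pairs(original, reduced):
    print(round(x, 3), round(y, 3))
```

When the two data sets follow similar distributions the printed pairs lie close to the line y = x, which is the straight-line behaviour described for Fig. 6.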

Quantification of the QQ plot:

Step 1: Calculate the means of the standardized values of the two data sets;

Step 2: Calculate the ratio R;

Step 3: Evaluate whether R achieves the required reduction effect. If not, increase the reduction parameter μ, re-run the reduction algorithm, and return to step 1; otherwise go to step 4;

Step 4: Store or transmit the reduced data set.

Fig. 6 compares the QQ plots of the data before and after reduction. The more similar two data sets are, the closer their QQ plot is to a straight line. At a reduction rate of 13.5%, the QQ plot between the reduced data set and the original data set changes little compared with the QQ plot of the original data set against itself, and is still approximately a straight line. As the reduction rate increases, the QQ plot between the reduced data set and the original data set begins to deviate from a straight line, and the greater the reduction rate, the greater the deviation. With the proposed method, the similarity between the data sets remains good when the reduction rate reaches 70%.

Fig. 7 shows the flow of forensic data collection, characterization, reduction, and storage. On the deployed network, multiple agents collect network data. The data types currently supported are mainly file-type data, account-type data, and attack-type data. After the original forensic data are obtained, the reduction algorithm is applied to them; the reduction process includes evaluation of the reduction effect. The reduced data set provides the base data for later forensic analysis.

Fig. 8 shows the characterization process, taking file-type data as an example. To characterize file-type data, the keywords and their weights must be determined for the specific forensic purpose. The file-type data are then searched on this basis and the keyword frequencies counted, so that each file is described by its keywords and their frequencies and converted into a point in the multidimensional space.

Fig. 9 is a schematic diagram of the seamless partitioning of a random data set. The original data set is generally very large, and reducing it directly is inefficient. The original data set is therefore partitioned seamlessly, the aim being that the partitioned data sets preserve the spatial statistical characteristics of the original data set as far as possible.

Reducing the small data sets obtained by partitioning achieves the reduction of the large data set. The end result is that the scale of the forensic data set is reduced without destroying the spatial distribution of the data.

Claims

1. A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:

The characterization is as follows: given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, one forensic record is expressed as D_i = (F, V), and the data set as D = {D_k | k = 1, ..., n};

The features of the forensic data are determined according to the forensic type:

A. Document data: determine the keywords and search the documents for the frequency with which each keyword occurs, obtaining the text data set D_T = (W, V), where W is the keyword set and V is the frequency set;

B. Attack data: each attack record is decomposed into 41 features;

C. Account data: the main features of account data include account type, account number, access time, access frequency, and access location; these features are used to characterize the acquired account data;

The reduction process comprises the following steps:

Step 1: Obtain the characterized data set S;

Step 2: Set the size M of a data partition;

Step 3: Choose a reference point o from the original data set S;

Step 6: If the number n of data points in data set D is greater than K, set S = S - S_i and jump to step 4; otherwise the algorithm exits. A data set S = {S_k | k = 1, ..., N} composed of the partitioned data sets is obtained;

4) The reduction effect is evaluated.

2. The forensic data reduction method according to claim 1, characterized in that:

The evaluation is performed using the formula R = \frac{V_{R}}{V_{O}}, where V_O is the value of the original data set and V_R is the value of the reduced data set.