CN106228173A - Forensic data reduction method based on spatial statistics - Google Patents

Forensic data reduction method based on spatial statistics

Info

Publication number
CN106228173A
Authority
CN
China
Prior art keywords
data
data set
reduction
forensic
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510305873.8A
Other languages
Chinese (zh)
Inventor
彭涛
姜明华
杨贤
宋坤芳
胡鸣
魏雄
梁晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Textile University
Priority to CN201510305873.8A
Publication of CN106228173A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A forensic data reduction method based on spatial statistics, comprising two parts: static forensic data reduction and dynamic forensic data reduction. First, according to the needs of the investigation, the collected data are given a feature-based description, so that the described data can be mapped into a multidimensional space while still reflecting the essential characteristics of the data. Next, the forensic data are mapped to points in that space, the mapped points are partitioned into multiple datasets, and each dataset is reduced separately. Then, a forensic data reduction algorithm based on the local Hausdorff distance is applied to the datasets. Finally, the reduction effect is evaluated. The method can substantially reduce the scale of the forensic data without losing the value of the original forensic dataset, thereby improving the efficiency of digital forensics.

Description

Forensic data reduction method based on spatial statistics
Technical field
The present invention relates to the field of data reduction in big data processing, and in particular to a forensic data reduction method based on spatial statistics.
Background
Computer crime forensics requires processing massive amounts of data, which poses great challenges for network transmission, storage, and processing. The difficulty lies in removing invalid, redundant, and highly similar data from the massive forensic data to obtain a relatively small dataset, so that the scale of the forensic data is reduced without losing the value of the original data and the efficiency of digital forensics is improved. On the one hand, removing invalid, redundant, and highly similar data must not destroy the value of the original data, so that the reduced data remain valid. On the other hand, the reduction of large forensic datasets must itself be fast, so that it does not impair forensic efficiency.
Therefore, developing a forensic data reduction method that quickly and efficiently extracts the data of forensic value from a large dataset has become a technical problem in urgent need of a solution.
Summary of the invention
To overcome the above problems, the invention provides a forensic big data reduction method based on spatial distribution. The method can substantially reduce the scale of the forensic data without losing the value of the original forensic dataset, thereby improving the efficiency of digital forensics.
To achieve the above object, the invention adopts the following technical solution:
A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:
1) according to the needs of the investigation, the collected data are given a feature-based description; the data so described can be mapped into a multidimensional space while still reflecting the essential characteristics of the data;
2) the forensic data are mapped to points in space, the mapped points are partitioned into multiple datasets, and each dataset is reduced separately;
3) a forensic data reduction algorithm based on the local Hausdorff distance is used to reduce the datasets;
4) the reduction effect is evaluated.
Preferably, the evaluation uses the formula $R = V_R / V_O$, where $V_O$ is the value of the original dataset and $V_R$ is the value of the reduced dataset.
Compared with the prior art, the forensic data reduction method based on spatial statistics of the present invention has the following advantages:
Brief description of the drawings
Fig. 1 shows the forensic process with reduction;
Fig. 2 shows the feature-based description of forensic data;
Fig. 3 shows the geometric meaning of the Hausdorff distance;
Fig. 4 shows how the Hausdorff distance characterizes the similarity of different datasets;
Fig. 5 compares the reduced dataset with the original dataset;
Fig. 6 shows the QQ plots of the original dataset and the reduced dataset;
Fig. 7 is a schematic block diagram of forensic data collection;
Fig. 8 is a schematic block diagram of the feature-based description of text-type data;
Fig. 9 is a schematic diagram of the seamless partitioning of a dataset.
Detailed description of the embodiments
The invention is further described below with reference to embodiments, but the scope of the invention is not limited to them.
The forensic data reduction method based on spatial statistics comprises two parts, static forensic data reduction and dynamic forensic data reduction. The specific steps are as follows:
1) Extraction of spatial characteristics from different kinds of forensic data:
The collected data are given a feature-based description according to the needs of the investigation, so that they can be mapped into a multidimensional space while still reflecting their essential characteristics. For example, document-type data are described using forensic keywords, while each record of attack data can be described by features such as protocol, attack frequency, and duration. Fig. 2 shows the feature-based description of forensic data.
Definition 1: Given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, a single forensic record can be expressed as $D_i = (F, V)$, and the dataset as $D = \{D_k \mid k = 1, \dots, n\}$.
Based on Definition 1, the features of the forensic data are determined according to the type of evidence:
(1) Document data: determine the keywords and search the documents for the frequency with which each keyword occurs, obtaining the text dataset $D_T = (W, V)$, where W is the keyword set and V is the frequency set.
(2) Attack data: following the feature taxonomy for attack data developed by Professor Sal Stolfo of Columbia University and Professor Wenke Lee of North Carolina State University, each attack record can be decomposed into 41 features.
(3) Account data: the main features of account data include account type, account number, access time, access frequency, and access location; these features are used to give the collected account data a feature-based description.
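As an illustration of the document-data case, the following minimal sketch maps a document to a point whose coordinates are keyword frequencies. The function name `keyword_vector`, the tokenization rule, and the sample keywords are assumptions made for this example; the patent does not specify them.

```python
import re
from collections import Counter


def keyword_vector(text, keywords):
    """Return the frequency of each keyword in text, in keyword order.

    The result is the point (in len(keywords)-dimensional space) that
    represents the document. Tokenization by lowercase alphanumeric runs
    is an assumption of this sketch, not part of the source.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[w.lower()] for w in keywords]


# Hypothetical keyword set W and document, for illustration only.
keywords = ["transfer", "password", "account"]
doc = "Transfer the funds. The account password was reset; account locked."
point = keyword_vector(doc, keywords)  # frequency set V for this document
```

Each document then becomes one point, and a collection of documents becomes a point set that the later partitioning and reduction steps can operate on.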
2) Seamless partitioning of the large dataset:
In general, a forensic dataset is very large, and performing the reduction directly on the original dataset makes the complexity of the algorithm high. Because the forensic data are first mapped to points in space, we found that the data distributed in this space exhibit locality: the data within a local region have no statistical relationship with the data outside that region. The invention exploits this property by dividing the original dataset into small datasets with a certain degree of overlap and reducing each small dataset, thereby reducing the whole original dataset. The specific steps are as follows:
Step 1: Obtain the dataset S after feature description;
Step 2: Set the partition size M;
Step 3: Choose a reference point o from the original dataset S (usually the center point);
Step 4: Select from S the point nearest to o as the initial point $x_i$;
Step 5: Use a K-NN search centered on $x_i$ to find the sub-dataset $S_i$;
Step 6: If the number n of data items in dataset D satisfies n > K, set $S = S - S_i$ and go to Step 4; otherwise the algorithm terminates. After these steps, a dataset $S = \{S_k \mid k = 1, \dots, N\}$ composed of the partitioned datasets is obtained.
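The partitioning steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the reference point o is taken to be the centroid, the K-NN search is brute force, the sub-datasets do not overlap, and the final leftover points form a last (possibly smaller) sub-dataset. The function name `partition` is hypothetical.

```python
import math


def partition(points, k):
    """Split a non-empty list of points into sub-datasets of at most k points.

    Each sub-dataset is the k nearest remaining neighbours of the remaining
    point that lies closest to the reference point o.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(points)
    dim = len(remaining[0])
    # Reference point o: centroid of the original dataset (assumption).
    o = tuple(sum(p[d] for p in remaining) / len(remaining) for d in range(dim))
    subsets = []
    while remaining:
        # Initial point x_i: the remaining point nearest to o.
        x_i = min(remaining, key=lambda p: math.dist(p, o))
        # Brute-force K-NN search around x_i (x_i itself is included).
        remaining.sort(key=lambda p: math.dist(p, x_i))
        subsets.append(remaining[:k])
        # S = S - S_i
        remaining = remaining[k:]
    return subsets


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
subsets = partition(points, 3)
```

For large datasets a spatial index such as a k-d tree would replace the brute-force sort; the sketch keeps to the standard library for clarity.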
3) Forensic data reduction method based on the local Hausdorff distance
Fig. 3 shows the geometric meaning of the Hausdorff distance. As can be seen, the Hausdorff distance characterizes the spatial similarity between two datasets well. From Fig. 3, the Hausdorff distance is defined by formula (1):
$$d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y),\; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\} \tag{1}$$
Fig. 4 shows the Hausdorff distance being used to describe the similarity between two groups of different datasets. The smaller the Hausdorff distance between two datasets, the greater their similarity.
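For finite point sets the supremum and infimum in formula (1) become a maximum and a minimum, so the distance can be computed directly. The sketch below assumes Euclidean distance for d(x, y), which the source does not state explicitly; the function names are illustrative.

```python
import math


def directed_hausdorff(xs, ys):
    """max over x in xs of the distance from x to its nearest point in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)


def hausdorff(xs, ys):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))


X = [(0, 0), (1, 0)]
Y = [(0, 0), (4, 0)]
d = hausdorff(X, Y)
```

In this example the point (4, 0) in Y is 3 units from its nearest neighbour in X, which dominates both directed distances, so the Hausdorff distance is 3.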
Using the idea of the local Hausdorff distance, the small datasets obtained by partitioning are reduced first. The algorithm has two stages: (1) obtaining the reduction threshold μ; (2) carrying out the reduction based on the parameter μ.
Obtaining the reduction threshold μ:
Step 1: Determine the reduction rate δ, give an initial threshold $\mu_i$ and a threshold adjustment parameter λ, and randomly select a dataset A;
Step 2: Take from dataset A a data item x that has not yet been selected;
Step 3: Remove instance x to obtain dataset B;
Step 4: Compute the Hausdorff distance Hd between dataset S and dataset B;
Step 5: If Hd is less than the given threshold $\mu_i$, the instance can be removed; otherwise the instance is retained. A new dataset S is obtained;
Step 6: If dataset S has not been fully traversed, go to Step 2; otherwise go to Step 7;
Step 7: Compute the reduction rate δ′ from |B| and |S|, where |B| is the number of instances in dataset B and |S| is the number of instances in dataset S. If δ′ > δ, set μ = μ − λ and go to Step 2.
Reduction algorithm based on the parameter μ:
Step 1: Use the seamless partitioning method to partition the original large dataset, obtaining $S = \{S_k \mid k = 1, \dots, N\}$;
Step 2: Obtain the parameter μ;
Step 3: Take a small dataset $S_i$ from the collection of datasets and select an instance x from it;
Step 4: Remove instance x from $S_i$ to obtain dataset $S_i'$;
Step 5: Compute the Hausdorff distance Hd between $S_i$ and $S_i'$;
Step 6: If Hd is less than the given threshold μ, the instance can be removed and $S_i = S_i'$; otherwise the instance is retained;
Step 7: If $S_i$ has not been fully traversed, go to Step 4; otherwise go to Step 3;
Step 8: Finally, the reduced dataset S′ is obtained.
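A minimal sketch of this threshold-based reduction is given below, assuming Euclidean distance and a single pass over each sub-dataset in its stored order. The names `reduce_subset` and `reduce_all` are hypothetical, and keeping at least one point per sub-dataset is an added safeguard rather than something stated in the source.

```python
import math


def hausdorff(xs, ys):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(xs, ys), directed(ys, xs))


def reduce_subset(points, mu):
    """Drop each instance whose removal changes the set by less than mu."""
    kept = list(points)
    i = 0
    while i < len(kept) and len(kept) > 1:
        candidate = kept[:i] + kept[i + 1:]      # S_i' = S_i without x
        if hausdorff(kept, candidate) < mu:
            kept = candidate                     # remove x: S_i = S_i'
        else:
            i += 1                               # retain x, move to the next
    return kept


def reduce_all(subsets, mu):
    """Reduce every sub-dataset and merge the results into S'."""
    return [p for s in subsets for p in reduce_subset(s, mu)]


reduced = reduce_subset([(0, 0), (0.1, 0), (5, 5)], mu=0.3)
```

Because $S_i'$ is a subset of $S_i$, the Hausdorff distance here equals the distance from the removed instance to its nearest remaining neighbour, so a production version could use a nearest-neighbour query instead of the full computation.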
Fig. 5 compares the reduced dataset with the original dataset for sub-dataset size K = 20 and a reduction threshold parameter of 0.3 (the reduction rate is about 57%).
4) Evaluation of the reduction effect
The most direct indicator for evaluating a reduction algorithm is the extent to which the reduced dataset preserves the value of the original dataset. Let $V_O$ be the value of the original dataset and $V_R$ the value of the reduced dataset; their ratio is R:
$$R = \frac{V_R}{V_O} \tag{2}$$
The closer R is to 1, the better the reduction effect.
However, the criterion for the value of a dataset differs across datasets and application environments. Existing methods for evaluating the reduction effect are aimed mainly at classification datasets, so classification accuracy is the only effective measure of the reduction effect, which is a limitation. Any data can be described as points in space, and the positions of these points and the relationships between them have certain characteristics, namely spatial statistical characteristics. There are many ways to describe the spatial characteristics of data. A QQ plot (quantile-quantile plot) compares the probability distributions of two datasets by comparing the quantiles of the spatial positions of their data. It presents the similarity of two datasets graphically but cannot quantify it. The present scheme quantifies the statistics of the position quantiles of the data, so that the similarity of datasets can be expressed numerically, and, taking into account the characteristics of digital forensic datasets, provides a reduction-effect evaluation method based on spatial statistics. In addition, feeding the evaluation of the reduction effect back into the algorithm allows the reduction algorithm to be optimized.
The steps for computing the QQ plot are as follows:
Step 1: Compute the mean over all dimensions of the original dataset S and of the reduced dataset S′ (denoted $\bar{X}$ and $\bar{X}'$ respectively), using: $\bar{X} = \frac{1}{n}\frac{1}{D}\sum_{i=1}^{n}\sum_{k=1}^{D} X_{ik}$;
Step 2: Compute the standard deviations of the original dataset and of the reduced dataset (given through $\sigma^2$ and $\sigma'^2$ respectively), using: $\sigma^2 = \frac{1}{n}\frac{1}{D}\sum_{i=1}^{n}\sum_{k=1}^{D}(X_{ik} - \mu)^2$, where μ here denotes the mean from Step 1;
Step 3: Compute the standardized values of the two datasets, $d_i = (X_i - \mu)/\sigma$ $(i = 1, \dots, n)$, giving $d = \{d_i \mid i = 1, \dots, n\}$ and $d' = \{d'_i \mid i = 1, \dots, m\}$, and sort each;
Step 4: Plot the standardized values of the original dataset and of the reduced dataset as the horizontal and vertical coordinates respectively to obtain the scatter plot;
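The QQ-plot computation can be sketched as below. The sketch makes assumptions the source leaves open: the data are treated as one-dimensional values, the population standard deviation is used, and because the two datasets generally differ in size (n and m), the sorted standardized values are paired by linearly interpolated quantiles. All function names are illustrative, and constant data (zero standard deviation) are not handled.

```python
import statistics


def standardize(values):
    """Return the sorted standardized values d_i = (X_i - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return sorted((v - mu) / sigma for v in values)


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of already sorted values, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def qq_points(original, reduced):
    """Pair quantiles of the two standardized datasets (x: original, y: reduced)."""
    d, d_red = standardize(original), standardize(reduced)
    num = min(len(d), len(d_red))  # needs at least 2 points per dataset
    qs = [i / (num - 1) for i in range(num)]
    return [(quantile(d, q), quantile(d_red, q)) for q in qs]


pts = qq_points([1, 2, 3, 4, 5], [1, 3, 5])
```

If the reduced dataset preserves the distribution of the original, the returned pairs lie close to a straight line when plotted.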
Quantification of the QQ plot:
Step 1: Compute the mean of the standardized values of each of the two datasets;
Step 2: Compute the ratio R;
Step 3: Evaluate whether R achieves the required reduction effect. If not, increase the reduction parameter μ, re-run the reduction algorithm, and return to Step 1; otherwise go to Step 4;
Step 4: Store or transmit the reduced dataset.
Fig. 6 compares the QQ plots of the data before and after reduction. When two datasets are identical, their QQ plot is a straight line; in other words, the more similar two datasets are, the closer the QQ plot between them is to a straight line. At a reduction rate of 13.5%, the QQ plot between the reduced dataset and the original dataset changes little compared with the QQ plot of the original dataset against itself and is still approximately a straight line. As the reduction rate increases, the QQ plot between the reduced dataset and the original dataset begins to deviate from a straight line, and the greater the reduction rate, the greater the deviation. With the proposed method, the similarity between the datasets remains good when the reduction rate reaches 70%.
Fig. 7 shows the flow of forensic data collection, feature description, reduction, and storage. On the deployed network, multiple agents collect network data; the data currently supported mainly include file-type data, account-type data, and attack-type data. After the original forensic data are obtained, the reduction algorithm is applied to them; the reduction process includes the evaluation of the reduction effect. The reduced dataset provides the underlying data for later forensic analysis.
Fig. 8 shows the feature description process, taking file-type data as an example. To characterize file-type data, the keywords and their weights are determined for the specific case; the file-type data are then searched on this basis and the keyword frequencies are counted, so that each file is described by its keywords and their frequencies and is converted into a point in a multidimensional space.
Fig. 9 is a schematic diagram of the seamless partitioning of a random dataset. In general the original dataset is very large and reducing it directly is inefficient, so the original dataset is partitioned seamlessly; the aim is for the partitioned datasets to preserve the spatial statistical characteristics of the original dataset as far as possible.
By reducing the small datasets obtained by partitioning, the large dataset is reduced. The end result is that the scale of the forensic dataset is reduced without destroying the spatial distribution of the data.

Claims (2)

1. A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:
1) according to the needs of the investigation, the collected data are given a feature-based description; the data so described can be mapped into a multidimensional space while still reflecting the essential characteristics of the data;
The feature-based description is as follows: given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, a single forensic record is expressed as $D_i = (F, V)$, and the dataset as $D = \{D_k \mid k = 1, \dots, n\}$;
According to the type of evidence, the features of the forensic data are determined:
A. document data: determine the keywords and search the documents for the frequency with which each keyword occurs, obtaining the text dataset $D_T = (W, V)$, where W is the keyword set and V is the frequency set;
B. attack data: each attack record is decomposed into 41 features;
C. account data: the main features of account data include account type, account number, access time, access frequency, and access location; these features are used to give the collected account data a feature-based description;
2) the forensic data are mapped to points in space, the mapped points are partitioned into multiple datasets, and each dataset is reduced separately;
The reduction process comprises the following steps:
Step 1: Obtain the dataset S after feature description;
Step 2: Set the partition size M;
Step 3: Choose a reference point o from the original dataset S;
Step 4: Select from S the point nearest to o as the initial point $x_i$;
Step 5: Use a K-NN search centered on $x_i$ to find the sub-dataset $S_i$;
Step 6: If the number n of data items in dataset D satisfies n > K, set $S = S - S_i$ and go to Step 4; otherwise the algorithm terminates; a dataset $S = \{S_k \mid k = 1, \dots, N\}$ composed of the partitioned datasets is obtained;
3) a forensic data reduction algorithm based on the local Hausdorff distance is used to reduce the datasets;
4) the reduction effect is evaluated.
2. The forensic data reduction method according to claim 1, characterized in that:
the evaluation uses the formula $R = V_R / V_O$, where $V_O$ is the value of the original dataset and $V_R$ is the value of the reduced dataset.
CN201510305873.8A 2015-06-02 2015-06-02 A kind of forensic data reduction method based on spatial statistics Pending CN106228173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510305873.8A CN106228173A (en) 2015-06-02 2015-06-02 A kind of forensic data reduction method based on spatial statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510305873.8A CN106228173A (en) 2015-06-02 2015-06-02 A kind of forensic data reduction method based on spatial statistics

Publications (1)

Publication Number Publication Date
CN106228173A true CN106228173A (en) 2016-12-14

Family

ID=57528717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510305873.8A Pending CN106228173A (en) 2015-06-02 2015-06-02 A kind of forensic data reduction method based on spatial statistics

Country Status (1)

Country Link
CN (1) CN106228173A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131145A (en) * 2023-08-03 2023-11-28 卡斯柯信号(北京)有限公司 Track map data verification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082925A (en) * 2007-07-09 2007-12-05 山西大学 Rough set property reduction method based on SQL language
CN101345692A (en) * 2008-08-05 2009-01-14 陈明 Bridge data illation reduction method for implementing data volume transmission reduction
CN102262682A (en) * 2011-08-19 2011-11-30 上海应用技术学院 Rapid attribute reduction method based on rough classification knowledge discovery

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
公伟 等: "云取证模型的构建与分析", 《计算机工程》 *
彭涛: "基于特征和实例的海量数据约简方法研究", 《中国博士学位论文全文数据库-信息科技辑》 *
熊欣 等: "基于道格拉斯改进的雷达回波数据简化算法", 《中国航海》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131145A (en) * 2023-08-03 2023-11-28 卡斯柯信号(北京)有限公司 Track map data verification method and device
CN117131145B (en) * 2023-08-03 2024-03-26 卡斯柯信号(北京)有限公司 Track map data verification method and device

Similar Documents

Publication Publication Date Title
Al Shalabi et al. Data mining: A preprocessing engine
CN101692224B (en) High-resolution remote sensing image search method fused with spatial relation semantics
Saraç et al. An ant colony optimization based feature selection for web page classification
US9043316B1 (en) Visual content retrieval
CN104408483B (en) SAR texture image classification methods based on deep neural network
Zhang et al. Three-way decisions of rough vague sets from the perspective of fuzziness
CN114244603B (en) Anomaly detection and comparison embedded model training and detection method, device and medium
CN111488917A (en) Garbage image fine-grained classification method based on incremental learning
CN107832631A (en) The method for secret protection and system of a kind of data publication
CN104268629A (en) Complex network community detecting method based on prior information and network inherent information
CN103366365A (en) SAR image varying detecting method based on artificial immunity multi-target clustering
CN101996245A (en) Form feature describing and indexing method of image object
CN104809161B (en) A kind of method and system that sparse matrix is compressed and is inquired
CN102722578B (en) Unsupervised cluster characteristic selection method based on Laplace regularization
DE102020133266A1 (en) Technologies for the refinement of stochastic similarity search candidates
Herrera et al. SAX-quantile based multiresolution approach for finding heatwave events in summer temperature time series
Graham et al. Finding and visualizing graph clusters using pagerank optimization
CN103034869A (en) Part maintaining projection method of adjacent field self-adaption
CN111626311B (en) Heterogeneous graph data processing method and device
CN106503386A (en) The good and bad method and device of assessment luminous power prediction algorithm performance
Kithulgoda et al. The incremental Fourier classifier: Leveraging the discrete Fourier transform for classifying high speed data streams
CN106228173A (en) A kind of forensic data reduction method based on spatial statistics
CN102034102B (en) Image-based significant object extraction method as well as complementary significance graph learning method and system
CN116894173A (en) Water quality tracing method, device, equipment and storage medium
Chung et al. Finding and visualizing graph clusters using PageRank optimization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161214

RJ01 Rejection of invention patent application after publication