CN106228173A - A kind of forensic data reduction method based on spatial statistics - Google Patents
A kind of forensic data reduction method based on spatial statistics
- Publication number: CN106228173A
- Application number: CN201510305873.8A
- Authority: CN (China)
- Prior art keywords: data, data set, reduction, forensic, point
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/757—Matching configurations of points or features
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A forensic data reduction method based on spatial statistics, comprising two parts: static forensic data reduction and dynamic forensic data reduction. First, according to the needs of the investigation, the collected data are given a feature description, so that the data can be mapped into a multidimensional space while still reflecting their essential characteristics. Next, the forensic data are mapped to points in that space, the mapped points are divided into multiple data sets, and each data set is reduced separately. Then a forensic data reduction algorithm based on the local Hausdorff distance is applied to each data set. Finally, the reduction effect is evaluated. The method greatly reduces the scale of the forensic data without losing the value of the original forensic data set, thereby improving the efficiency of digital forensics.
Description
Technical field
The present invention relates to the field of data reduction in big data processing, and in particular to a forensic data reduction method based on spatial statistics.
Background
Computer crime investigations must process massive volumes of data, which poses great challenges for network transmission, storage, and processing. The difficulty is how to remove invalid, redundant, and highly similar data from the massive forensic data and obtain a relatively small data set, reducing the scale of the evidence without losing the value of the original forensic data and thereby improving the efficiency of digital forensics. On the one hand, removing invalid, redundant, and highly similar data must not destroy the value of the original data, so that the reduced data remain valid. On the other hand, the large forensic data set must be reduced quickly, so that the reduction itself does not slow down the investigation.
A forensic data reduction method that quickly and efficiently extracts the data of forensic value from a large data set has therefore become an urgent technical problem.
Summary of the invention
To overcome the above problems, the invention provides a forensic big data reduction method based on spatial distribution. The method greatly reduces the scale of the forensic data without losing the value of the original forensic data set, thereby improving the efficiency of digital forensics.
To achieve this object, the technical solution adopted by the invention is as follows:
A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:
1) according to the needs of the investigation, the collected data are given a feature description, such that the data so described can be mapped into a multidimensional space while reflecting the essential characteristics of the data;
2) the forensic data are mapped to points in space, the mapped points are divided into multiple data sets, and each data set is reduced separately;
3) a forensic data reduction algorithm based on the local Hausdorff distance is used to reduce the data sets;
4) the reduction effect is evaluated.
Preferably, the evaluation uses the formula $R = V_R / V_O$, where $V_O$ is the value of the original data set and $V_R$ is the value of the reduced data set.
Compared with the prior art, the forensic data reduction method based on spatial statistics of the present invention has the following advantages:
Brief description of the drawings
Fig. 1 shows the forensic process with reduction;
Fig. 2 shows the feature description of forensic data;
Fig. 3 shows the geometric meaning of the Hausdorff distance;
Fig. 4 shows how the Hausdorff distance characterizes the similarity of different data sets;
Fig. 5 compares the reduced data set with the original data set;
Fig. 6 shows the QQ plots of the original data set and the reduced data set;
Fig. 7 is a schematic block diagram of forensic data collection;
Fig. 8 is a schematic block diagram of the feature description of text-type data;
Fig. 9 is a schematic diagram of the seamless partitioning of a data set.
Detailed description of the invention
The invention is further described below with reference to embodiments, but the scope of the invention is not limited to them.
The forensic data reduction method based on spatial statistics comprises two parts, static forensic data reduction and dynamic forensic data reduction. The specific steps are as follows:
1) Extraction of spatial characteristics from different forensic data:
The collected data are given a feature description according to the needs of the investigation, so that they can be mapped into a multidimensional space while still reflecting the essential characteristics of the data. For example, document-type data are described by forensic keywords, while each record of attack-type data can be described by protocol, attack frequency, duration, and so on. Fig. 2 shows the feature description of forensic data.
Definition 1: Given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, a forensic record can be expressed as $D_i = (F, V)$, and the data set as $D = \{D_k \mid k = 1, \dots, n\}$.
Based on Definition 1, the features of the forensic data are determined according to the forensic type:
(1) Document data: determine the keywords, search the documents for the frequency with which each keyword occurs, and obtain the text data set $D_T = (W, V)$, where W is the keyword set and V is the frequency set.
(2) Attack-type data: following the feature classification of attack data developed by Professor Sal Stolfo of Columbia University and Professor Wenke Lee of North Carolina State University, each attack record is decomposed into 41 features.
(3) Account-type data: the main features of account data are account type, account number, access time, access frequency, and access location; these features are used to describe the collected account-type data.
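As an illustration of the document-data case in item (1), the following minimal Python sketch maps a text to the pair (W, V). The function name `characterize_document`, the regular-expression tokenizer, and the sample keywords are assumptions made for this example; the invention does not specify them.

```python
import re
from collections import Counter

def characterize_document(text, keywords):
    """Map a document to (W, V): the keyword list W and the frequency list V."""
    # Lowercase alphanumeric tokens; a real forensic pipeline would need a
    # tokenizer suited to the language and file format (assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    W = list(keywords)
    V = [counts[w.lower()] for w in W]  # Counter returns 0 for absent keywords
    return W, V

doc = "Transfer the funds tonight. The transfer must clear before the audit."
W, V = characterize_document(doc, ["transfer", "audit", "password"])
print(W, V)
```

The frequency list V can then be treated as the coordinates of one point in a space with one dimension per keyword.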
2) Seamless partitioning of the large data set:
In general, a forensic data set is very large, and reducing the original data set directly makes the complexity of the algorithm high. Because the forensic data are first mapped to points in space, these points exhibit locality: data within a local region have no statistical relationship with data outside that region. The invention exploits this property by dividing the original data set into small data sets with a certain degree of overlap and reducing each small data set, thereby reducing the whole original data set. The specific steps are as follows:
Step 1: Obtain the data set S after feature description;
Step 2: Set the partition size M;
Step 3: Choose a reference point o in the original data set S (usually the central point);
Step 4: Select from S the point nearest to o as the initial point $x_i$;
Step 5: Use a K-NN search centered on $x_i$ to find the sub data set $S_i$;
Step 6: If the number n of data points in data set D satisfies n > K, set $S = S - S_i$ and go to Step 4; otherwise the algorithm terminates.
These steps yield a data set composed of the partitioned subsets, $S = \{S_k \mid k = 1, \dots, N\}$.
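The partitioning steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the reference point o is taken to be the centroid, the K-NN search is a brute-force sort by Euclidean distance, the subsets are disjoint (the overlap mentioned in the text is not modelled), and any leftover points form a final smaller subset. The name `partition` is hypothetical.

```python
import math

def partition(points, k):
    """Split points into local subsets of at most k points each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(points)
    if not remaining:
        return []
    dim = len(remaining[0])
    # Reference point o: centroid of the whole data set (assumption).
    o = tuple(sum(p[d] for p in remaining) / len(remaining) for d in range(dim))
    subsets = []
    while remaining:
        # Initial point x_i: the remaining point nearest to o.
        seed = min(remaining, key=lambda p: math.dist(p, o))
        # S_i: the k remaining points nearest to the seed (seed included).
        remaining.sort(key=lambda p: math.dist(p, seed))
        subsets.append(remaining[:k])
        remaining = remaining[k:]  # S = S - S_i
    return subsets

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for subset in partition(points, 3):
    print(sorted(subset))
```

Sorting the whole remainder for every subset is quadratic in the worst case; a spatial index would be the natural replacement for large data sets.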
3) Forensic data reduction method based on the local Hausdorff distance
Fig. 3 shows the geometric meaning of the Hausdorff distance; the Hausdorff distance characterizes the spatial similarity between two data sets well. The definition of the Hausdorff distance obtained from Fig. 3 is formula (1).
Fig. 4 shows the Hausdorff distance used to describe the similarity between two different groups of data sets. The smaller the Hausdorff distance between two data sets, the greater their similarity.
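For reference, the standard definition of the Hausdorff distance between two finite point sets is $H(A, B) = \max\{h(A, B), h(B, A)\}$ with $h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert$. The sketch below computes it by brute force in Python; the Euclidean norm and the function names are assumptions for illustration, and formula (1) is assumed to match this standard form.

```python
import math

def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0, 0), (1, 0)]
B = [(0, 0), (1, 0), (1, 3)]
print(hausdorff(A, B))
```

Here every point of A also lies in B, so the distance is determined entirely by the extra point (1, 3) and its nearest neighbour in A.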
Following the local Hausdorff distance idea, the small data sets obtained by partitioning are reduced first. The algorithm has two stages: (1) obtaining the reduction threshold μ; (2) running the reduction algorithm with parameter μ.
Obtaining the reduction threshold μ:
Step 1: Determine the reduction rate δ, give an initial threshold $\mu_i$ and a threshold adjustment parameter λ, and randomly select a data set A;
Step 2: Take from data set A a data item x that has not yet been selected;
Step 3: Remove instance x to obtain data set B;
Step 4: Compute the Hausdorff distance Hd between data set S and data set B;
Step 5: If Hd is less than the given threshold $\mu_i$, the instance can be removed; otherwise keep it. This gives a new data set S;
Step 6: If data set S has not been fully traversed, go to Step 2; otherwise go to Step 7;
Step 7: Compute the reduction rate δ′, where |B| is the number of instances in data set B and |S| is the number of instances in data set S. If δ′ > δ, set μ = μ − λ and go to Step 2.
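A possible reading of the threshold search is sketched below in Python. Several points are assumptions rather than statements of the invention: the reduction rate is taken as the fraction of instances removed, the search stops at the first threshold whose rate does not exceed the target, and the names `reduce_once` and `calibrate_threshold` are hypothetical. The sketch also uses the fact that, when one set is the other minus a single point x, their Hausdorff distance equals the distance from x to its nearest remaining neighbour.

```python
import math

def reduce_once(points, mu):
    """One pass over the sample: drop x when removing it moves the set by < mu.

    Since S without x is a subset of S, the Hausdorff distance between the two
    sets equals the distance from x to its nearest retained neighbour.
    """
    kept = list(range(len(points)))  # indices of retained instances
    for i in range(len(points)):
        rest = [j for j in kept if j != i]
        if rest and min(math.dist(points[i], points[j]) for j in rest) < mu:
            kept = rest
    return [points[j] for j in kept]

def calibrate_threshold(sample, target_rate, mu0, lam):
    """Lower mu from mu0 in steps of lam until the reduction rate <= target_rate."""
    mu = mu0
    while mu > 0:
        # Assumed definition: reduction rate = fraction of instances removed.
        rate = 1 - len(reduce_once(sample, mu)) / len(sample)
        if rate <= target_rate:
            return mu, rate
        mu -= lam
    return 0.0, 0.0  # no positive threshold met the target

sample = [(0, 0), (0.7, 0), (1.4, 0), (2.1, 0), (2.3, 0)]
mu, rate = calibrate_threshold(sample, target_rate=0.3, mu0=1.0, lam=0.5)
print(mu, round(rate, 2))
```

With the initial threshold 1.0 the pass removes four of the five points, which exceeds the target rate, so the threshold is lowered once before the search stops.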
Reduction algorithm based on parameter μ:
Step 1: Partition the original large data set with the seamless partitioning method to obtain the data set $S = \{S_k \mid k = 1, \dots, N\}$;
Step 2: Obtain the parameter μ;
Step 3: Take a small data set $S_i$ from the data set collection and select an instance x from it;
Step 4: Remove instance x from $S_i$ to obtain data set $S_i'$;
Step 5: Compute the Hausdorff distance Hd between $S_i$ and $S_i'$;
Step 6: If Hd is less than the given threshold μ, the instance can be removed and $S_i = S_i'$; otherwise keep the instance;
Step 7: If data set $S_i$ has not been fully traversed, go to Step 4; otherwise go to Step 3;
Step 8: Finally obtain the reduced data set S′.
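The reduction steps above can be sketched in Python as follows, assuming the data set has already been partitioned and μ is given. The brute-force Hausdorff computation, the Euclidean metric, and the names `reduce_subset` and `reduce_dataset` are illustrative assumptions, not the implementation of the invention.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def reduce_subset(subset, mu):
    """Drop each instance whose removal changes the subset by less than mu."""
    current = list(subset)
    for x in list(subset):
        if len(current) < 2:
            break  # never empty a subset completely
        candidate = list(current)
        candidate.remove(x)  # S_i' = S_i without x
        if hausdorff(current, candidate) < mu:
            current = candidate  # accept the removal: S_i = S_i'
    return current

def reduce_dataset(subsets, mu):
    """Reduce every partitioned subset and merge the survivors into S'."""
    reduced = []
    for s in subsets:
        reduced.extend(reduce_subset(s, mu))
    return reduced

subsets = [[(0, 0), (0.1, 0), (3, 0)], [(10, 10), (10, 10.2)]]
print(reduce_dataset(subsets, 0.3))
```

Because each subset is small, the quadratic cost of the Hausdorff computation stays bounded by the partition size rather than by the size of the whole data set.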
Fig. 5 compares the reduced data set with the original data set for sub data set size K = 20 and threshold parameter 0.3 (the reduction threshold; the reduction rate is about 57%).
4) Evaluation of the reduction effect
The most direct indicator for evaluating a reduction algorithm is the extent to which the reduced data set preserves the value of the original data set. Let $V_O$ be the value of the original data set and $V_R$ the value of the reduced data set; their ratio is R:
$$R = \frac{V_R}{V_O}$$
The closer R is to 1, the better the reduction effect.
However, the criterion for the value of a data set differs across data sets and application environments. Existing methods for evaluating reduction are aimed mainly at classification data sets, where classification accuracy is the only effective measure of the reduction effect, which is a limitation. Any data can be described as points in space, and the positions of these points and the relationships between them have certain characteristics, namely spatial statistical characteristics. There are many ways to describe the spatial characteristics of data. A QQ plot (quantile-quantile plot) compares the probability distributions of two data sets by comparing the quantiles of the spatial positions of their data. It presents the similarity of the two data sets as a fitted plot and cannot by itself quantify that similarity. The present scheme quantifies the position-quantile statistics so that the similarity of data sets can be expressed numerically and, combined with the characteristics of digital forensic data sets, provides a method for evaluating the reduction effect based on spatial statistics. The evaluation is also fed back to optimize the reduction algorithm.
The steps for computing the QQ plot are as follows:
Step 1: Compute the mean of each dimension of the original data set S and of the reduced data set S′;
Step 2: Compute the standard deviation of the original data set and of the reduced data set ($\sigma^2$ and $\sigma'^2$ respectively);
Step 3: Compute the standard scores of the two data sets, $d_i = (X_i - \mu)/\sigma$ for $i = 1, \dots, n$ (giving $d = \{d_i \mid i = 1, \dots, n\}$ and $d' = \{d'_i \mid i = 1, \dots, m\}$ respectively), and sort each;
Step 4: Use the standard scores of the original data set and of the reduced data set as the horizontal and vertical coordinates respectively to obtain the scatter plot.
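The QQ plot computation can be sketched in Python for a single dimension as follows. Because the two data sets generally differ in size (n and m), the sketch pairs them at common quantile levels using linear interpolation; that pairing rule, the use of the population standard deviation, and the function names are assumptions for illustration.

```python
import statistics

def standard_scores(values):
    """Sorted standard scores d_i = (X_i - mean) / std of one dimension."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # must be non-zero for this to work
    return sorted((v - mu) / sigma for v in values)

def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def qq_points(original, reduced, n_quantiles=11):
    """(x, y) pairs: quantiles of the original vs. the reduced standard scores."""
    d = standard_scores(original)
    d_prime = standard_scores(reduced)
    qs = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    return [(quantile(d, q), quantile(d_prime, q)) for q in qs]

original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
reduced = [1, 3, 5, 7, 9]
for x, y in qq_points(original, reduced):
    print(round(x, 3), round(y, 3))
```

If the reduced data set preserves the distribution of the original, the printed pairs fall close to a straight line, which is the property the QQ plot is used to check.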
Quantification of the QQ plot:
Step 1: Compute the means of the standard scores of the two data sets;
Step 2: Compute the ratio R;
Step 3: Evaluate whether R reaches the required reduction effect. If not, increase the reduction parameter μ, re-run the reduction algorithm, and return to Step 1; otherwise go to Step 4;
Step 4: Store or transmit the reduced data set.
Fig. 6 compares the QQ plots of the data before and after reduction. The QQ plot of two identical data sets is a straight line; that is, the more similar two data sets are, the closer the QQ plot between them is to a straight line. When the reduction rate is 13.5%, the QQ plot between the reduced data set and the original data set changes little compared with the QQ plot of the original data set against itself and is still approximately a straight line. As the reduction rate rises, the QQ plot between the reduced data set and the original data set begins to deviate from a straight line, and the larger the reduction rate, the larger the deviation. With the proposed method, the similarity between the data sets is still good when the reduction rate reaches 70%.
Fig. 7 shows the flow of forensic data collection, feature description, reduction, and storage. On the deployed network, multiple agents collect network data; the data types currently supported are mainly file-type data, account-type data, and attack-type data. After the original forensic data are obtained, the reduction algorithm is applied to them; the reduction process includes evaluation of the reduction effect. The reduced data set provides the base data for later forensic analysis.
Fig. 8 shows the feature description process, using file-type data as an example. To describe file-type data, the keywords and their weights are determined for the specific case, the file-type data are searched on that basis, and the keyword frequencies are counted. Each file is thus described by its keywords and their frequencies and converted into a point in the multidimensional space.
Fig. 9 is a schematic diagram of the seamless partitioning of a random data set. The original data set is generally very large, and reducing it directly is inefficient. The original data set is therefore partitioned seamlessly, with the aim that the partitioned data sets preserve the spatial statistical characteristics of the original data set as far as possible.
Reducing the small data sets obtained by partitioning achieves the reduction of the large data set. The end result is a smaller forensic data set whose spatial distribution is not destroyed.
Claims (2)
1. A forensic data reduction method based on spatial statistics, comprising two parts, static forensic data reduction and dynamic forensic data reduction, characterized in that:
1) according to the needs of the investigation, the collected data are given a feature description, such that the data so described can be mapped into a multidimensional space while reflecting the essential characteristics of the data;
the feature description being: given forensic target data D, a feature set F for a specific forensic purpose, and the set V of corresponding values, a forensic record is expressed as $D_i = (F, V)$ and the data set as $D = \{D_k \mid k = 1, \dots, n\}$;
the features of the forensic data being determined according to the forensic type:
a. document data: determine the keywords, search the documents for the frequency with which each keyword occurs, and obtain the text data set $D_T = (W, V)$, where W is the keyword set and V is the frequency set;
b. attack-type data: each attack record is decomposed into 41 features;
c. account-type data: the main features of account data are account type, account number, access time, access frequency, and access location; these features are used to describe the collected account-type data;
2) the forensic data are mapped to points in space, the mapped points are divided into multiple data sets, and each data set is reduced separately;
the reduction comprising the following steps:
Step 1: Obtain the data set S after feature description;
Step 2: Set the partition size M;
Step 3: Choose a reference point o in the original data set S;
Step 4: Select from S the point nearest to o as the initial point $x_i$;
Step 5: Use a K-NN search centered on $x_i$ to find the sub data set $S_i$;
Step 6: If the number n of data points in data set D satisfies n > K, set $S = S - S_i$ and go to Step 4; otherwise the algorithm terminates;
yielding a data set composed of the partitioned subsets, $S = \{S_k \mid k = 1, \dots, N\}$;
3) a forensic data reduction algorithm based on the local Hausdorff distance is used to reduce the data sets;
4) the reduction effect is evaluated.
2. The forensic data reduction method according to claim 1, characterized in that:
the evaluation uses the formula $R = V_R / V_O$, where $V_O$ is the value of the original data set and $V_R$ is the value of the reduced data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510305873.8A CN106228173A (en) | 2015-06-02 | 2015-06-02 | A kind of forensic data reduction method based on spatial statistics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510305873.8A CN106228173A (en) | 2015-06-02 | 2015-06-02 | A kind of forensic data reduction method based on spatial statistics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106228173A true CN106228173A (en) | 2016-12-14 |
Family
ID=57528717
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510305873.8A Pending CN106228173A (en) | 2015-06-02 | 2015-06-02 | A kind of forensic data reduction method based on spatial statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106228173A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101082925A (en) * | 2007-07-09 | 2007-12-05 | 山西大学 | Rough set property reduction method based on SQL language |
CN101345692A (en) * | 2008-08-05 | 2009-01-14 | 陈明 | Bridge data illation reduction method for implementing data volume transmission reduction |
CN102262682A (en) * | 2011-08-19 | 2011-11-30 | 上海应用技术学院 | Rapid attribute reduction method based on rough classification knowledge discovery |
Non-Patent Citations (3)
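Title |
---|
Gong Wei et al.: "Construction and Analysis of a Cloud Forensics Model", Computer Engineering * |
Peng Tao: "Research on Massive Data Reduction Methods Based on Features and Instances", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Xiong Xin et al.: "A Radar Echo Data Simplification Algorithm Based on an Improved Douglas Method", Navigation of China * |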
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117131145A (en) * | 2023-08-03 | 2023-11-28 | 卡斯柯信号(北京)有限公司 | Track map data verification method and device |
CN117131145B (en) * | 2023-08-03 | 2024-03-26 | 卡斯柯信号(北京)有限公司 | Track map data verification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20161214 |