CN102622302B

CN102622302B - Recognition method for fragment data type

Info

Publication number: CN102622302B
Application number: CN201110031123.8A
Authority: CN
Inventors: 汤燕彬; 杨泽明; 刘宝旭
Original assignee: Institute of High Energy Physics of CAS
Current assignee: Institute of High Energy Physics of CAS
Priority date: 2011-01-26
Filing date: 2011-01-26
Publication date: 2014-10-29
Anticipated expiration: 2031-01-26
Also published as: CN102622302A

Abstract

The invention provides a method for recognizing the type of fragment data, comprising the following steps: first, extracting the byte frequency distribution F(x) of the fragment data x to be tested; then calculating the similarity Tx between the byte frequency distributions of x and a given sample S, and judging whether Tx falls within the similarity range of a fragment data type Ti among the known data types T. If it does, the fragment data x is judged to belong to the type represented by Ti; if Tx does not fall within the range of any known data type in T, the type of x is judged to be unrecognizable. The method can recognize the type of fragment data and provides a basis for subsequent fragment data reconstruction, so that files with specific content can be restored from fragment data, providing technical support for judicial forensics.

Description

Recognition method for fragment data type

Technical field

The present invention relates to a method for recognizing the type of fragment data on a computer hard disk, on other removable storage media, or in a memory image, and in particular to a method for recognizing fragment data types based on byte frequency distribution.

Background technology

A disk cluster (or block) consists of one or more sectors. The sector is the smallest physical storage unit of a disk, while the cluster is the smallest unit allocated by the operating system. A cluster generally spans multiple sectors, for example 2, 4, 8, 16, 32, or 64. Each cluster can be occupied by only one file, even if that file contains only a few bytes; two or more files are never allowed to share one cluster, since this would corrupt the data. The sector is a physical unit and the cluster is a logical one: the cluster size can be changed by the operating system, and grouping sectors into clusters simplifies system management.
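To make the one-file-per-cluster rule concrete, the following sketch computes how many clusters a file occupies and how many bytes of its last cluster stay unused. The function name `cluster_usage` and the default of 8 sectors per cluster are illustrative assumptions, not values fixed by the text.

```python
import math

SECTOR_SIZE = 512  # bytes per sector, as used throughout the text


def cluster_usage(file_size: int, sectors_per_cluster: int = 8) -> tuple[int, int]:
    """Return (clusters occupied, unused bytes in the last cluster)."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    cluster_size = SECTOR_SIZE * sectors_per_cluster
    clusters = math.ceil(file_size / cluster_size)
    return clusters, clusters * cluster_size - file_size


print(cluster_usage(10))    # (1, 4086)
print(cluster_usage(5000))  # (2, 3192)
```

Even a 10-byte file reserves a whole 4096-byte cluster under these assumptions, which is why no second file can be placed in the remainder.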

When storing data to disk, a file system works in units of clusters or blocks and scatters the data across different locations on the disk. In the prior art, the parts of a file that are scattered across different locations of the disk are called file fragments. These file fragments degrade system performance and slow operation, so they are handled by a conventional disk defragmenter. The defragmenter analyzes the fragments on the hard disk, then moves and merges them so that each file occupies an independent, contiguous storage area, thereby improving disk space utilization and file read speed.

Besides the conventional file fragments described above, a disk holds another kind of data: data residing in unallocated clusters or blocks. Such data usually arises after the disk has been in use for some time, as a result of repeated copying, creation, and deletion of files. For example, after a file is deleted, part of its actual content remains stored in the space it occupied. This kind of data is incomplete and easily overwritten. Taking file deletion as an example: after a file is deleted, the space that stored it is marked as "unallocated space", and when the file system later reclaims unallocated space it may write new content into that region. Until then, the unallocated space still holds part of the deleted file's content; when new content is written there, the existing data is overwritten by the new data.

Although such data is usually incomplete and easily overwritten, fairly complete content can be obtained once it has been extracted and reconstructed, and that content can serve as electronic evidence.

For clarity, the present invention defines the data stored in unallocated clusters or blocks of a disk as fragment data. Moreover, for every file type there is fragment data of the corresponding type, and recognizing the type of fragment data is a foundation for file reassembly or file recovery. The present invention therefore works in units of one 512 B sector, and defines the fragment data type as the type of the data represented by a 512 B unit of fragment data.

As the above analysis shows, fragment data plays an important role in forming electronic evidence, and recognizing the fragment data type can improve the success rate of subsequent file reassembly and reduce the associated computation. However, no prior art currently analyzes and uses such fragment data or recognizes its type.

Summary of the invention

To solve the above problem, the present invention provides a method for recognizing the type of fragment data, which provides a basis for subsequent fragment data reassembly.

To solve the above technical problem, the invention provides the following technical solution:

A method for recognizing the type of fragment data, comprising the following steps:

Step 1: extract the byte frequency distribution F(x) of the fragment data x to be tested, where F(x) = {f_0, f_1, ..., f_i, ..., f_255} and f_i is the number of times byte value i occurs in the fragment data, taken in units of one sector;

Step 2: calculate the similarity T_x between the byte frequency distributions of the fragment data x to be tested and a given sample S using formula (1):

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

where A = F(x) is the byte frequency distribution of the sector containing the fragment data x under test, S is the byte frequency distribution of the sample data, and n = 256 is the number of elements in each distribution;

Step 3: judge whether the similarity T_x between the byte frequency distributions of the fragment data x to be tested and the sample S falls within the similarity range of a fragment data type T_i among the known types T. If it does, the fragment data x is judged to belong to the type represented by T_i; if it does not fall within the range of any known type in T, the type of the fragment data x is judged to be unrecognizable;

where T = {T_1, T_2, ..., T_i, ..., T_m} denotes that T comprises m fragment data types, and T_i denotes the i-th fragment data type, i = 1, ..., m.

Further, the method also comprises step 4:

Step 4: when the similarity T_x between the byte frequency distributions of the fragment data x to be tested and the sample S falls within the similarity range of a known type T_i, further judge whether a structural feature δ_x is present in the fragment data x. If it is, determine whether δ_x ∈ T_j holds; if it holds and i = j, the fragment data x is judged to belong to the type represented by T_i;

where δ_x is a structural feature of a certain file type, and T_j is the set of structural features of a data type that is not yet determined.

Further, the method also comprises step 5:

Step 5: when the similarity T_x from step 3 falls within the similarity range of a known type T_i but is lower than a preset level, or when i ≠ j in step 4, judge whether the number of other fragment data units in the data block containing the fragment data under test whose similarity falls within the range of T_i reaches a predetermined number. If it does, the fragment data x is judged to belong to the type represented by T_i; otherwise the fragment data x is judged to be unrecognizable.

In addition, the following steps precede step 1 of the above method:

Step A: extract the sample model S and determine the similarity between fragment data of each file type and the sample model S;

Step B: extract the structural features δ of each file type, where δ = {δ_1, δ_2, ..., δ_i, ..., δ_m} denotes the structural features of m file types.

The fragment data of the present invention includes fragment data on disk and fragment data in memory.

The method provided by the invention can recognize the type of fragment data and provides a basis for subsequent fragment data reassembly, so that files with specific content can be recovered from fragment data, providing technical support for judicial forensics.

The technical solution of the present invention is described in detail below with reference to the drawings and specific embodiments.

Brief description of the drawings

Fig. 1 is a flowchart of the fragment data type recognition method of the present invention;

Fig. 2 is a flowchart of a specific embodiment of the fragment data type recognition method of the present invention;

Fig. 3 is a detailed flowchart of step S15 in Fig. 2;

Fig. 4 is a detailed flowchart of step S16 in Fig. 2;

Fig. 5 is a flowchart of fragment data forensics work.

Embodiment

Fig. 1 shows the flowchart of the fragment data type recognition method of the present invention.

Step S1: before fragment data type recognition begins, preparatory work must be carried out to obtain the byte frequency distribution samples of the data regions of the various file types, together with their distinctive structural features. If these samples and features already exist, this step can be skipped and recognition can start directly from step S2. Otherwise they must be extracted in this step through extensive work such as collection, comparison, analysis, and summarization, providing the basis for type recognition in the following steps.

Step S2: for the fragment data to be recognized, extract its byte frequency distribution.

Step S3: build the recognition model using the Tanimoto coefficient and calculate the similarity between the byte frequency distributions of the fragment data under test and a given sample.

Step S4: compare the calculated similarity with the similarity between the byte frequency distributions of fragment data of a known type and the same sample, and judge whether the calculated similarity falls within the range of the latter. If it does, the fragment data under test is determined to be of the same type as the fragment data of that known type; if it does not, the fragment data under test is determined to be unrecognizable.

The basis for this judgment is obtained in advance: the similarity between the byte frequency distributions of fragment data of a given known type and a given sample lies in a known range. It is therefore possible to judge whether the calculated similarity falls within this range; if it does, the fragment data under test belongs to that type.

In addition, the present invention proposes two kinds of optimization parameters to assist fragment data type recognition. The first is to search the fragment data for structural features specific to the data type concerned. The second is to consider the correlation between fragments, that is, the fragment data under test and adjacent fragment data are related in type. These two methods improve the accuracy of fragment data recognition and ensure that the original data is not altered during recognition, thereby guaranteeing the authenticity and reliability of the data.

Fig. 2 is a flowchart of a specific embodiment of the fragment data type recognition method of the present invention, which comprises the following steps: 1) preprocessing; 2) building the recognition model; 3) preliminarily judging the type of the fragment data under test; 4) introducing the relevant structural features of the fragment data under test as optimization parameter 1; 5) introducing the correlation between fragments by distance as optimization parameter 2. Using the optimization parameters improves the accuracy of fragment data type recognition. Each step is described below:

Step S11: preprocessing. The preprocessing stage includes extracting byte frequency distribution samples from the data regions of the various file types and building the sample model S, where S = {S_1, S_2, ..., S_i, ..., S_m} is the set of sample models and S_i is one of its elements; the sample model is abstracted mathematically and is denoted S;

It also includes extracting the structural features δ specific to each file type, where δ = {δ_1, δ_2, ..., δ_i, ..., δ_m} is the set of file structural features.
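The text does not say how a sample model is abstracted from the collected samples or how a type's similarity range is fixed. The sketch below shows one possible approach under two loudly stated assumptions: the sample model is taken as the element-wise mean of several byte frequency distributions, and the similarity range is taken as the observed minimum and maximum over fragments of known type. Both function names are hypothetical.

```python
def mean_distribution(distributions: list[list[int]]) -> list[float]:
    """Average equal-length byte frequency distributions element-wise.

    Assumption: averaging is only one way to build a sample model.
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


def similarity_range(similarities: list[float]) -> tuple[float, float]:
    """Assumption: use the observed minimum and maximum as the type's range."""
    if not similarities:
        raise ValueError("need at least one similarity value")
    return min(similarities), max(similarities)


# Toy 2-element distributions keep the example readable; real ones have 256 entries.
print(mean_distribution([[2, 0], [0, 2]]))  # [1.0, 1.0]
print(similarity_range([0.6, 0.9, 0.75]))   # (0.6, 0.9)
```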

Byte frequency distribution means setting aside the operating system level and counting the frequency distribution of the raw data byte by byte. In the function F(x), f_i is the number of times byte value i (the decimal value of a byte) occurs in the fragment data, taken in units of one sector. With F(x), byte frequency distribution features can be extracted that reflect the inherent differences between data types. The advantage of this feature is that it discards the superficial attributes assigned by the operating system, such as file type, file extension, and special file identifiers; it is based on the content of the fragment data itself and so truly reflects the characteristics of the fragment data.
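A minimal sketch of extracting F(x) from one 512 B unit is given below. The function name `byte_frequency` and the sample sector contents are illustrative assumptions.

```python
from collections import Counter

SECTOR_SIZE = 512  # the test unit used in the text, in bytes


def byte_frequency(fragment: bytes) -> list[int]:
    """Return F(x): a 256-element list whose entry i counts byte value i."""
    counts = Counter(fragment)
    return [counts.get(value, 0) for value in range(256)]


# A made-up sector: four non-zero bytes followed by 508 zero bytes.
sector = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + bytes(SECTOR_SIZE - 4)
freq = byte_frequency(sector)
print(freq[0xFF], freq[0x00], sum(freq))  # 2 508 512
```

Because only the raw bytes are counted, the result is the same whatever file name or extension the data once carried.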

The structural features δ specific to a file type are the continuous binary data markers distinctive to that file type. They are found not only at the start of a file but possibly also in the middle or at the end. They must be obtained through analysis of a large amount of data, either automatically by machine using some algorithm or by manual analysis.

Different file types have different structural features δ. Taking the JPEG file type as an example, a JPEG file mainly contains the binary data markers shown in Table 1 below.

Table 1

Code	Meaning
FFD8	SOI (Start of Image)
FFE0	APP0 marker
FFDB	DQT (Define Quantization Table)
FFC4	DHT (Define Huffman Table)
FFC0	SOF0 (Start of Frame)
FFDA	SOS (Start of Scan)
FFD9	EOI (End of Image)

Step S12: extract the byte frequency distribution F(x) of the test fragment data x (x denotes the identifier of the fragment data under test), where F(x) = {f_0, f_1, ..., f_i, ..., f_255}.

Step S13: using the Tanimoto coefficient, formula (1), calculate the similarity T_x between the byte frequency distributions of the sample S and the test data F(x).

The Tanimoto coefficient measures the similarity of document data and reduces to the Jaccard coefficient for binary attributes. The present invention proposes a fragment data recognition model based on byte frequency distribution. The model takes 512 B of fragment data as the smallest test unit and counts the byte frequency distribution F(x) in each 512 B test unit; the similarity T_x between the byte frequency distributions of the sample S and the test fragment data F(x) is then obtained with the Tanimoto coefficient.

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

where A = F(x), the byte frequency distribution of the sector containing the test fragment data x, is a one-dimensional vector with 256 elements; S is the byte frequency distribution of the sample data;

A \cdot S = \sum_{i=1}^{n} A_{i} S_{i}, \quad \|A\|^{2} = \sum_{i=1}^{n} A_{i} A_{i},

n = 256.

It follows that T takes values in [0, 1]. When T = 0, the similarity of A and S is lowest; as T approaches 1, the similarity of A and S is highest. As T goes from 0 to 1, the similarity of A and S goes from low to high.
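The range [0, 1] can be checked directly from formula (1). The short derivation below assumes, as holds for byte counts, that every entry of A and S is non-negative and that A and S are not both zero.

```latex
\|A\|^{2} + \|S\|^{2} - A \cdot S
  = \|A - S\|^{2} + A \cdot S
  \geq A \cdot S \geq 0
\quad\Longrightarrow\quad
0 \leq T(A, S) = \frac{A \cdot S}{\|A - S\|^{2} + A \cdot S} \leq 1
```

Under these assumptions T = 1 exactly when A = S, and T = 0 exactly when A · S = 0, that is, when no byte value occurs in both distributions.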

The similarity can be calculated with the aid of a computer: for example, a calculation program is written, S and A are entered through an input interface, and the similarity T_x between the byte frequency distributions of the sample S and the test data F(x) is calculated automatically.
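Such a calculation program could look like the sketch below, which implements formula (1) for two equal-length vectors. The function name `tanimoto` and the choice to return 0.0 when both vectors are all zero are assumptions made here for illustration; the text does not cover that edge case.

```python
def tanimoto(a: list[float], s: list[float]) -> float:
    """Tanimoto coefficient T(A, S) = A.S / (|A|^2 + |S|^2 - A.S)."""
    if len(a) != len(s):
        raise ValueError("A and S must have the same length")
    dot = sum(x * y for x, y in zip(a, s))
    denominator = sum(x * x for x in a) + sum(y * y for y in s) - dot
    # Assumption: two all-zero vectors are reported as similarity 0.0.
    return dot / denominator if denominator else 0.0


# Toy 2- and 3-element vectors; real distributions have 256 entries.
print(tanimoto([1, 2, 0], [1, 2, 0]))  # 1.0
print(tanimoto([1, 0], [0, 1]))        # 0.0
```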

Step S14: after the similarity T_x between the byte frequency distributions of the sample S and the test data F(x) has been calculated, preliminarily judge whether the similarity T_x of the fragment data x under test falls within the range of the similarity between the byte frequency distributions of fragment data of a known type and the same sample.

In the present invention, the data types T, derived from the similarity between fragment data of each type and the sample, are stored in advance: T = {T_1, T_2, ..., T_i, ..., T_m}, representing m fragment data types in total. T_i denotes the i-th data type and is described by two parameters, T_i1 and T_i2. T_i1 is the similarity of the i-th data type. It is a range between 0 and 1, and each data type has its own effective similarity range; for the JPEG file type, for example, the effective similarity range calculated with the Tanimoto coefficient is [0.55, 1]. T_i2 is the set of data structural features of the i-th data type; for the JPEG file type, for example, this set can be the content of Table 1 above.

Based on the stored data types T, it is preliminarily judged whether the similarity T_x of the fragment data x under test falls within the range T_i1. If it does, the fragment data x can be considered to belong to the i-th type. If it does not, x is considered not to belong to the i-th type, and the judgment continues with whether T_x falls within the range of the next type, T_(i+1), that is, the similarity range of another known type. If T_x falls within the similarity range of none of the known types, the type of the fragment data under test cannot be recognized using the m stored types.
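The preliminary judgment can be sketched as a lookup over stored ranges. In the sketch below only the JPEG range [0.55, 1] comes from the text; the second entry, the dictionary name `SIMILARITY_RANGES`, the function name, and the first-match-wins rule for overlapping ranges are all assumptions.

```python
from typing import Optional

# Stored T_i1 ranges. Only the JPEG range is taken from the text.
SIMILARITY_RANGES = {
    "jpeg": (0.55, 1.0),
    "example-type-b": (0.20, 0.40),  # invented placeholder range
}


def preliminary_type(t_x: float, ranges=SIMILARITY_RANGES) -> Optional[str]:
    """Return the first type whose range contains t_x, or None if unrecognized."""
    for name, (low, high) in ranges.items():
        if low <= t_x <= high:
            return name
    return None


print(preliminary_type(0.56))  # jpeg
print(preliminary_type(0.45))  # None
```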

Step S15: introduce the relevant structural feature δ_x of the fragment data under test as optimization parameter 1, as shown in Fig. 3. That is, when the similarity T_x between the byte frequency distributions of the fragment data x and a given sample S falls within the range of a known similarity T_i, further judge whether δ_x is present in the fragment data x. If it is, continue to judge whether δ_x ∈ T_j holds, where T_j denotes the data structural feature set of another type that is not yet determined. If δ_x ∈ T_j holds, continue to judge whether i equals j. If i = j, the data structural feature set represented by T_j is identical to T_i2, and the test fragment data x can be judged to belong to the type represented by T_i. If i and j are not equal, continue with step S16. If δ_x ∈ T_j does not hold, or no δ_x is present in the fragment data x, the case is not analyzed further and the result of step S14 is taken as the overall result.

In step S15, when the similarity T_x between the byte frequency distributions of the fragment data x to be tested and a given sample S falls within the similarity range of a known type T_i, it is further confirmed whether the structural feature δ_x of the fragment data x also belongs to the structural feature set of that known type T_i, so that the type of the test fragment data x is determined more accurately.
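The decision in step S15 can be sketched as follows, using the JPEG codes of Table 1 as the only structure set. The names `STRUCTURE_SETS`, `types_with_marker`, and `refine`, the returned labels, and the plain substring search are assumptions for illustration; a two-byte code can also occur by chance in unrelated data, which this naive search does not rule out.

```python
JPEG_MARKERS = frozenset(
    bytes.fromhex(code)
    for code in ("FFD8", "FFE0", "FFDB", "FFC4", "FFC0", "FFDA", "FFD9")
)
# Hypothetical store of structure sets; only the JPEG set comes from Table 1.
STRUCTURE_SETS = {"jpeg": JPEG_MARKERS}


def types_with_marker(fragment: bytes, structure_sets=STRUCTURE_SETS) -> set[str]:
    """Return every type j whose structure set has a marker inside the fragment."""
    return {
        name
        for name, markers in structure_sets.items()
        if any(marker in fragment for marker in markers)
    }


def refine(preliminary: str, fragment: bytes) -> str:
    """Apply the step S15 decision to a type already chosen in step S14."""
    matched = types_with_marker(fragment)
    if not matched:
        return "keep-S14-result"  # no structural feature found in any known set
    if preliminary in matched:
        return "confirmed"        # i = j
    return "go-to-S16"            # i != j


fragment = bytes.fromhex("FFDB0043") + bytes(508)
print(refine("jpeg", fragment))    # confirmed
print(refine("jpeg", bytes(512)))  # keep-S14-result
```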

Step S16: introduce the correlation between fragments by distance as optimization parameter 2. Because the probability that the fragments of the same file are distributed within 32 data blocks is 80%, fragment data on disk is not randomly distributed: there is a certain correlation between fragments, and a continuous run of fragment data belongs to the same file.

The measure of step S16 can be adopted in two situations: when i ≠ j in step S15, or when in step S14 the similarity T_x of the fragment data x under test falls within the range T_i1 but is low. For example, for the JPEG file type the effective similarity range is [0.55, 1]; if T_x is 0.56, the similarity between x and the JPEG file type is clearly very low. As shown in Fig. 4, judge whether the sequence number of the current fragment data x is the last in its data block. If not, increment the sequence number by 1 and judge whether the similarity of the fragment data with that sequence number is within the range of T_i. Repeat this comparison until all other data in the data block containing x has been compared, then count the fragment data units whose similarity is within the range of T_i. If the proportion of fragment data units whose similarity is within the range of T_i is greater than 80%, the fragment data x is judged to belong to the type represented by T_i; otherwise x is judged to be unrecognizable.

In step S16, what is judged is the proportion of the data block made up by fragment data falling within the range of T_i; if it is greater than 80%, for example, the fragment data x can be considered with high confidence to belong to the type represented by T_i.
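The counting in step S16 can be sketched as a simple vote over the other fragments of the same data block. The function name `vote_by_neighbours` and its parameters are assumptions; the 80% threshold and the strict "greater than" comparison follow the text.

```python
def vote_by_neighbours(
    similarities: list[float], low: float, high: float, min_ratio: float = 0.8
) -> bool:
    """Decide whether enough neighbouring fragments fall within [low, high].

    `similarities` holds the T values of the other fragments in the data block.
    """
    if not similarities:
        return False  # assumption: with no neighbours the fragment stays unrecognized
    hits = sum(1 for t in similarities if low <= t <= high)
    return hits / len(similarities) > min_ratio


# JPEG range [0.55, 1] from the text; the similarity values are made up.
print(vote_by_neighbours([0.6] * 9 + [0.3], 0.55, 1.0))            # True
print(vote_by_neighbours([0.6, 0.7, 0.9, 0.58, 0.3], 0.55, 1.0))   # False
```

The second call returns False because exactly 80% of the neighbours match, which is not greater than 80%.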

The above embodiment can effectively determine the type of fragment data. In addition, memory is allocated by page (4 KB), which is an integer multiple of 512 B; therefore the fragment data described in the present invention can also refer to data in memory.
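Because a 4 KB page is an integer multiple of 512 B, a memory page can be cut into the same test units as a disk. The sketch below shows this split; the function name `split_into_units` is an assumption.

```python
def split_into_units(data: bytes, size: int = 512) -> list[bytes]:
    """Cut a buffer into consecutive units of `size` bytes (the last may be shorter)."""
    return [data[start:start + size] for start in range(0, len(data), size)]


page = bytes(4096)  # one 4 KB memory page, zero-filled for the example
units = split_into_units(page)
print(len(units), len(units[0]))  # 8 512
```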

The fragment data type recognition method of the present invention can provide electronic evidence for judicial forensics. On the one hand, it recognizes the type of fragment data and further improves the recognition rate; on the other hand, it preserves the reliability of the data during recognition and its consistency with the original data, laying groundwork for subsequent fragment data reassembly.

Fig. 5 is a flowchart of the overall fragment data forensics work. The preparation stage and the fragment data extraction stage are preliminary work for the present invention; they are not described in detail here and can use existing general methods. After the fragment data has been extracted, it is analyzed, which includes rejecting contiguous file data blocks and carrying out the fragment data type recognition of the present invention. The fragment data is then reassembled, the fragment evidence is presented and submitted to the court, and a conclusion is drawn from the fragment data obtained.

The fragment data type recognition of the present invention provides the foundation for the next step in electronic forensics, fragment data reassembly. Moreover, because the fragment data remains consistent with the original data throughout acquisition, recognition, and reassembly, the reliability and authenticity of the resulting electronic evidence are fundamentally guaranteed.

Claims

1. A method for recognizing the type of fragment data, characterized in that it comprises the following steps:

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

Step 3: judge whether the similarity T_x between the byte frequency distributions of the fragment data x to be tested and a given sample S falls within the similarity range of a fragment data type T_i among the known types T. If it does, the fragment data x is judged to belong to the type represented by T_i; if it does not fall within the range of any known type in T, the type of the fragment data x is judged to be unrecognizable;

2. The method for recognizing the type of fragment data according to claim 1, characterized in that it also comprises step 4,

where δ_x is a structural feature of a certain file type, and T_j is the set of structural features of a data type that is not yet determined.

3. The method for recognizing the type of fragment data according to claim 1 or 2, characterized in that it also comprises step 5,

4. The method for recognizing the type of fragment data according to claim 1, characterized in that the following steps precede step 1:

Step A: extract the sample model S and determine the similarity between fragment data of each file type and the sample model S.

5. The method for recognizing the type of fragment data according to claim 1, characterized in that the following steps precede step 1:

6. The method for recognizing the type of fragment data according to claim 1, characterized in that the fragment data includes fragment data on disk and fragment data in memory.

7. The method for recognizing the type of fragment data according to claim 3, characterized in that the number of data blocks containing the fragment data under test is 2^5 to 2^8.

8. The method for recognizing the type of fragment data according to claim 3, characterized in that the predetermined number is more than 80% of the number of data blocks containing the fragment data under test.