CN102622302A

CN102622302A - Recognition method for fragment data type

Info

Publication number: CN102622302A
Application number: CN2011100311238A
Authority: CN
Inventors: 汤燕彬; 杨泽明; 刘宝旭
Original assignee: Institute of High Energy Physics of CAS
Current assignee: Institute of High Energy Physics of CAS
Priority date: 2011-01-26
Filing date: 2011-01-26
Publication date: 2012-08-01
Anticipated expiration: 2031-01-26
Also published as: CN102622302B

Abstract

The invention provides a method for recognizing the type of fragment data, comprising the following steps: first, extracting the byte frequency distribution F(x) of the fragment data x to be tested; then calculating the similarity T_x of the byte frequency distribution between the fragment data x to be tested and a sample S, and judging whether T_x falls within the similarity range of a fragment data type T_i among the known data types T; if it does, judging that the fragment data x to be tested belongs to the type represented by T_i; and if T_x falls within the range of none of the known data types T, judging that the type of the fragment data x to be tested cannot be recognized. The method can recognize the type of fragment data and provides a basis for subsequent fragment data reconstruction, so that files with particular content can be restored from fragment data, providing technical support for judicial forensics.

Description

Recognition method for fragment data type

Technical field

The present invention relates to a method for recognizing the type of fragment data on the hard disk of a computer or on other removable storage media, or in a memory image, and in particular to a method for recognizing the fragment data type based on byte frequency distribution.

Background technology

A disk cluster (or block) consists of one or more sectors. The sector is the smallest physical storage unit of a disk, whereas the cluster is the smallest unit allocated by the operating system. A cluster generally comprises several sectors, for example 2, 4, 8, 16, 32 or 64. Each cluster can be occupied by only one file, even if that file contains only a few bytes; two or more files are never allowed to share one cluster, since this would corrupt the data. The sector is a physical unit and the cluster is a logical one: the cluster size can be changed by the operating system, and grouping sectors into clusters simplifies system management.
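The relationship between sectors, clusters and file size described above can be illustrated with a short sketch. It assumes a 512 B sector and a file of at least one byte; the function names `clusters_needed` and `slack_bytes` are hypothetical and chosen only for this illustration.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed, matching the 512 B unit used in this document)

def clusters_needed(file_size: int, sectors_per_cluster: int) -> int:
    """Whole clusters occupied by a file of file_size bytes (file_size >= 1)."""
    cluster_size = SECTOR_SIZE * sectors_per_cluster
    # Integer ceiling division: even a file of a few bytes takes one full cluster.
    return -(-file_size // cluster_size)

def slack_bytes(file_size: int, sectors_per_cluster: int) -> int:
    """Bytes allocated to the file's clusters but not used by its content."""
    cluster_size = SECTOR_SIZE * sectors_per_cluster
    return clusters_needed(file_size, sectors_per_cluster) * cluster_size - file_size
```

With 8 sectors per cluster, for example, a 10-byte file still occupies one 4096-byte cluster, and the remaining 4086 bytes cannot be used by any other file.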

When a file system stores data on a disk, it does so in units of clusters or blocks, which may be scattered across different locations of the whole disk. In the prior art, the parts of a file that are stored at different locations of the disk are called file fragments. Such file fragments degrade system performance and reduce operating speed, so they are handled by conventional defragmentation. Defragmentation analyzes the file fragments on the hard disk and moves and merges them so that each file occupies a separate, contiguous storage area, thereby improving the utilization of disk space and the speed at which files are read from the disk.

Besides the conventional file fragments described above, a disk also holds another kind of data, namely the data in unallocated clusters or blocks. Such data usually arise because, after the disk has been in use for some time, files have been repeatedly copied, created and deleted. For example, after a file is deleted, part of its actual content remains stored in the space it occupied. Data of this kind are incomplete and easily overwritten. Taking file deletion as an example: after the file is deleted, the space that stored it is marked as "unallocated space", and the file system may write new content into this region when it reclaims unallocated space. The unallocated space still holds part of the content of the deleted file, however, and when new content is written into it the existing data are overwritten by the new data.

Although such data are usually incomplete and easily overwritten, comparatively complete content can be obtained once they are extracted and reconstructed, so that they can be used as electronic evidence.

For clarity, the present invention defines such data, kept in unallocated clusters or blocks of a disk, as fragment data. For every file type there is fragment data of the corresponding type, and recognizing the type of fragment data is the basis of file reassembly or file restoration. The present invention therefore takes the 512 B sector as its unit, and defines the fragment data type as the type of data represented by fragment data in units of 512 B.
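A minimal sketch of cutting raw data into the 512 B units defined above follows. The function name `split_into_fragments` and the decision to drop an incomplete trailing unit are assumptions made for this illustration and are not specified by the invention.

```python
SECTOR_SIZE = 512  # the 512 B unit defined in the text

def split_into_fragments(raw: bytes, size: int = SECTOR_SIZE) -> list[bytes]:
    """Cut raw data into consecutive size-byte units.

    Assumption: a trailing piece shorter than `size` is dropped.
    """
    usable = len(raw) - len(raw) % size
    return [raw[i:i + size] for i in range(0, usable, size)]
```

Each element of the returned list can then be treated as one fragment data x to be tested.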

As the above analysis shows, fragment data play an important role in forming electronic evidence, and recognizing the fragment data type can improve the recognition rate of subsequent file reassembly and reduce the corresponding amount of computation. At present, however, no prior art analyzes and makes use of such fragment data or recognizes the fragment data type.

Summary of the invention

To solve the above problem, the present invention provides a method for recognizing the fragment data type, which identifies the type of fragment data and provides a basis for subsequent fragment data reassembly.

To solve the above technical problem, the present invention provides the following technical solution:

A method for recognizing the fragment data type comprises the following steps:

Step 1: extract the byte frequency distribution F(x) of the fragment data x to be tested, where F(x) = {f_0, f_1, ..., f_i, ..., f_255} and f_i is the number of times the byte value i occurs in the fragment data, taken in units of one sector;

Step 2: calculate, by formula (1), the similarity T_x of the byte frequency distribution between the fragment data x to be tested and a sample S,

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

where A = F(x) is the byte frequency distribution of the sector containing the fragment data x to be tested, S is the byte frequency distribution of the sample data, and n = 256;

Step 3: judge whether the similarity T_x of the byte frequency distribution between the fragment data x to be tested and the sample S falls within the similarity range of a fragment data type T_i among the known data types T. If it does, the fragment data x to be tested is judged to belong to the type represented by T_i; if it falls within the range of none of the known data types T, the type of the fragment data x to be tested is judged to be unrecognizable;

where T = {T_1, T_2, ..., T_i, ..., T_m} denotes that T comprises m fragment data types in total, and T_i denotes the i-th fragment data type, i = 1, ..., m.
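As a small worked illustration of formula (1), consider toy vectors with only three components instead of n = 256. The values below are invented for this example and do not come from the invention.

```latex
A = (2, 1, 0), \quad S = (1, 1, 1)
A \cdot S = 2 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 = 3
\|A\|^{2} = 4 + 1 + 0 = 5, \quad \|S\|^{2} = 1 + 1 + 1 = 3
T(A, S) = \frac{3}{5 + 3 - 3} = \frac{3}{5} = 0.6
```

If the similarity range of some type T_i were, for example, [0.55, 1], this value would fall within it and step 3 would assign x to T_i.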

Further, the method for recognizing the fragment data type also comprises step 4:

Step 4: when the similarity T_x of the byte frequency distribution between the fragment data x to be tested and the sample S falls within the similarity range of a known data type T_i, further judge whether a structural feature δ_x is present in the fragment data x. If it is, determine whether δ_x ∈ T_j holds; if it holds and i = j, the fragment data x to be tested is judged to belong to the type represented by the known data type T_i;

where δ_x is a structural feature of a certain file type, and T_j is the set of structural features of a data type j that is not known in advance.

Further, the method for recognizing the fragment data type also comprises step 5:

Step 5: when the similarity T_x in step 3 falls within the similarity range of a known data type T_i but is lower than a preset range, or when i ≠ j in step 4, judge whether the number of other fragment data, in the data blocks where the fragment data to be tested is located, whose similarity falls within the range of said known data type T_i reaches a predetermined quantity. If it does, the fragment data x is judged to belong to the type represented by data type T_i; otherwise the fragment data x is judged to be unrecognizable.

In addition, the following steps precede step 1 of the above method for recognizing the fragment data type:

Step A: extract the sample model S, and determine the similarity between the fragment data of the various file types and the sample model S;

Step B: extract the structural features δ of the various file types, where δ = {δ_1, δ_2, ..., δ_i, ..., δ_m} denotes the structural features of the m file types.

The fragment data of the present invention includes fragment data on the various kinds of disks and fragment data in memory.

The method provided by the present invention can recognize the type of fragment data and provides a basis for subsequent fragment data reassembly, so that files with particular content can be recovered from fragment data, providing technical support for judicial forensics.

The technical solution of the present invention is explained in detail below with reference to the accompanying drawings and specific embodiments.

Description of drawings

Fig. 1 is a flowchart of the fragment data type recognition method of the present invention;

Fig. 2 is a flowchart of a specific embodiment of the fragment data type recognition method of the present invention;

Fig. 3 is a detailed flowchart of step S15 in Fig. 2;

Fig. 4 is a detailed flowchart of step S16 in Fig. 2;

Fig. 5 is a flowchart of the fragment data forensics work.

Embodiment

Fig. 1 shows a flowchart of the fragment data type recognition method of the present invention.

Step S1: before fragment data type recognition begins, preparatory work must first be carried out, namely obtaining the byte frequency distribution samples of the data regions of the various file types and their specific structural features. If these samples and features already exist, this step can be skipped and recognition can begin directly from step S2. If not, they must be extracted in this step through extensive work such as collection, comparison, analysis and summarization, so as to provide the basis for the type recognition in the following steps.

Step S2: for the fragment data to be recognized, extract the byte frequency distribution of the fragment data to be tested.

Step S3: use the Tanimoto coefficient to establish the corresponding recognition model, and calculate the similarity between the byte frequency distribution of the fragment data to be tested and that of a sample.

Step S4: compare the calculated similarity with the similarity between the byte frequency distribution of fragment data of a known type and the same sample, and judge whether the calculated similarity falls within the range of the latter. If it does, the fragment data to be tested is confirmed to be of the same type as the fragment data of that known type; if it does not, the fragment data to be tested is confirmed to be unrecognizable.

The basis for this judgment must be obtained in advance: the similarity between the byte frequency distribution of fragment data of a known type and a given sample should be a known range. Only then can it be judged whether the calculated similarity falls within this range; if it does, the fragment data to be tested belongs to that type.
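One way such a known range might be obtained in advance is sketched below. The source does not say how the range is derived; taking the minimum and maximum similarity over a set of fragments of known type is purely an assumption of this sketch, and the function names are hypothetical.

```python
def byte_frequency(fragment: bytes) -> list[int]:
    """Count how often each byte value 0..255 occurs in the fragment."""
    counts = [0] * 256
    for value in fragment:
        counts[value] += 1
    return counts

def tanimoto(a: list[int], s: list[int]) -> float:
    """Similarity according to formula (1)."""
    dot = sum(x * y for x, y in zip(a, s))
    denom = sum(x * x for x in a) + sum(y * y for y in s) - dot
    return dot / denom if denom else 0.0  # assumption: two all-zero vectors give 0.0

def similarity_range(known_fragments: list[bytes], sample: list[int]) -> tuple[float, float]:
    """Assumed rule: the range is [min, max] of the similarities observed
    between fragments of one known type and the sample distribution."""
    scores = [tanimoto(byte_frequency(f), sample) for f in known_fragments]
    return min(scores), max(scores)
```

In practice a more robust rule (for instance, discarding outliers) may be preferable; the sketch only shows where the known range enters the method.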

In addition, the present invention proposes two kinds of optimization parameters to assist fragment data type recognition. The first is to search the fragment data for specific structural features of the relevant data type. The second is to take the relevance of fragment data into account, that is, the type of the fragment data to be tested is related to the types of the adjacent fragment data. These two methods improve the accuracy of fragment data recognition and ensure that the original data are not changed during recognition, thereby guaranteeing the authenticity and reliability of the data.

Fig. 2 is a flowchart of a specific embodiment of the fragment data type recognition method of the present invention, which comprises the following steps: 1) preprocessing; 2) establishing the recognition model; 3) preliminarily judging the type to which the fragment data to be tested belongs; 4) introducing the relevant structural features of the fragment data to be tested as optimization parameter 1; 5) introducing the relevance of the distance between fragment data as optimization parameter 2. Using the optimization parameters improves the accuracy of fragment data type recognition. Each of these steps is described in detail below:

Step S11: preprocessing. The preprocessing stage includes extracting the byte frequency distribution samples of the data regions of the various file types and establishing the sample model S, where S = {S_1, S_2, ..., S_i, ..., S_m}; S denotes the set of sample models and S_i is one of its elements. This is a mathematical abstraction of the sample models, denoted S;

It also includes extracting the specific structural features δ of the file types, where δ = {δ_1, δ_2, ..., δ_i, ..., δ_m} denotes the set of file structural features.

The byte frequency distribution is the frequency distribution of the raw data counted byte by byte, independently of the operating system level. In the function F(x), f_i denotes the number of times the byte value i (that is, the decimal value of a byte in the computer) occurs in the fragment data, taken in units of one sector. With this function F(x), the characteristics of the byte frequency distribution can be extracted according to the intrinsic differences between data types. The advantage of this feature is that it sets aside the superficial attributes assigned by the operating system, such as file type, file extension and special file identifiers, and relies on the content of the fragment data itself, so that it truly reflects the characteristics of the fragment data.
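The function F(x) described above can be sketched in a few lines. The function name `byte_frequency` is chosen for this illustration only.

```python
def byte_frequency(fragment: bytes) -> list[int]:
    """Return F(x) = [f_0, ..., f_255], where f_i is the number of times
    the byte value i occurs in the fragment."""
    counts = [0] * 256
    for value in fragment:  # iterating over bytes yields integers 0..255
        counts[value] += 1
    return counts
```

For a 512 B fragment the 256 counts always sum to 512, regardless of which file type the fragment came from.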

The specific structural features δ of a file type are the contiguous binary data signatures peculiar to each file type. These structural features are found not only at the start of a file but may also occur in the middle or at the end. They must be obtained by analyzing a large amount of data; they can be obtained automatically by machine using suitable algorithms, or by manual analysis.

The specific structural features δ differ from one file type to another. Taking the JPEG file type as an example, a JPEG file mainly contains the binary data signatures shown in Table 1 below.

Table 1

Code	Meaning
FFD8	SOI (Start of Image)
FFE0	APP0 marker
FFDB	DQT (Define Quantization Table)
FFC4	DHT (Define Huffman Table)
FFC0	SOF0 (Start of Frame)
FFDA	SOS (Start of Scan)
FFD9	EOI (End of Image)
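Searching a fragment for the signatures of Table 1 could look like the following sketch. The dictionary `JPEG_MARKERS` simply restates Table 1; the function name `find_markers` and the returned structure are assumptions of this illustration. Because these signatures are only two bytes long, they can also occur by chance in data of other types, so a hit is a hint rather than proof.

```python
JPEG_MARKERS = {
    "SOI": bytes.fromhex("FFD8"),
    "APP0": bytes.fromhex("FFE0"),
    "DQT": bytes.fromhex("FFDB"),
    "DHT": bytes.fromhex("FFC4"),
    "SOF0": bytes.fromhex("FFC0"),
    "SOS": bytes.fromhex("FFDA"),
    "EOI": bytes.fromhex("FFD9"),
}

def find_markers(fragment: bytes, markers: dict[str, bytes] = JPEG_MARKERS) -> dict[str, list[int]]:
    """Map each marker name to the byte offsets at which it occurs in the fragment."""
    hits: dict[str, list[int]] = {}
    for name, pattern in markers.items():
        offsets = []
        pos = fragment.find(pattern)
        while pos != -1:
            offsets.append(pos)
            pos = fragment.find(pattern, pos + 1)
        if offsets:
            hits[name] = offsets
    return hits
```

A fragment from the middle of a file may contain none of these markers, which is why the method treats structural features as an optional optimization parameter.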

Step S12: extract the byte frequency distribution F(x) of the fragment data x to be tested (x denotes the identifier of the fragment data to be tested), where F(x) = {f_0, f_1, ..., f_i, ..., f_255}.

Step S13: using the Tanimoto coefficient, i.e. formula (1), calculate the similarity T_x of the byte frequency distribution between the sample S and the test data F(x).

The Tanimoto coefficient measures the similarity of document data and reduces to the Jaccard coefficient in the case of binary attributes. The present invention proposes a fragment data recognition model based on byte frequency distribution. The model takes 512 B of fragment data as the smallest test unit and counts the byte frequency distribution F(x) in each 512 B test unit; the similarity T_x of the byte frequency distribution between the sample S and the test fragment data F(x) is then obtained with the Tanimoto coefficient.

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

where A = F(x) is the byte frequency distribution of the sector containing the fragment data x to be tested, a one-dimensional vector with 256 elements, and S is the byte frequency distribution of the sample data;

A \cdot S = \sum_{i=1}^{n} A_{i} S_{i}, \quad \|A\|^{2} = \sum_{i=1}^{n} A_{i} A_{i}, \quad n = 256.

The range of T is therefore [0, 1]. When T = 0, the similarity between A and S is lowest; as T approaches 1, the similarity between A and S is highest. As T goes from 0 to 1, the similarity between A and S goes from low to high.
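The stated bounds can be checked with a short derivation, assuming (as is the case for byte counts) that all components of A and S are non-negative and that A and S are not both zero:

```latex
\|A\|^{2} + \|S\|^{2} - A \cdot S = \|A - S\|^{2} + A \cdot S \ge A \cdot S \ge 0
\Rightarrow \quad 0 \le T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S} \le 1
T(A, S) = 1 \iff \|A - S\|^{2} = 0 \iff A = S
T(A, S) = 0 \iff A \cdot S = 0
```

In terms of byte frequencies, T = 0 means that no byte value occurs in both the fragment and the sample, and T = 1 means that the two distributions are identical.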

The similarity can be calculated with the aid of a computer. For example, a calculation program can be written which takes S and A through an input interface and automatically calculates the similarity T_x of the byte frequency distribution between the sample S and the test data F(x).

Step S14: after the similarity T_x of the byte frequency distribution between the sample S and the test data F(x) has been calculated, preliminarily judge whether the similarity T_x of the fragment data x to be tested falls within the similarity range between the byte frequency distribution of fragment data of a known type and the same sample.
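A minimal sketch of such a calculation program, implementing formula (1) directly, might look as follows. The function name `tanimoto` and the handling of two all-zero vectors (returning 0.0) are assumptions of this sketch; the source does not address that corner case.

```python
def tanimoto(a: list[int], s: list[int]) -> float:
    """Similarity T(A, S) = A.S / (|A|^2 + |S|^2 - A.S), as in formula (1)."""
    if len(a) != len(s):
        raise ValueError("A and S must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, s))
    denom = sum(x * x for x in a) + sum(y * y for y in s) - dot
    if denom == 0:  # both vectors are all zeros; assumed to count as "no similarity"
        return 0.0
    return dot / denom
```

For the method described here, both arguments would be 256-element byte frequency distributions, one for the fragment under test and one for the sample.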

In the present invention, the data types T, derived from the similarity between the various types of fragment data and the sample, are stored in advance: T = {T_1, T_2, ..., T_i, ..., T_m}, denoting m fragment data types in total. Here T_i denotes the i-th data type and is represented by two parameters, T_i1 and T_i2. T_i1 represents the similarity of the i-th data type; it is a range between 0 and 1, and the similarity of each data type has its own valid range. Taking the JPEG file type as an example, the valid range of the similarity calculated with the Tanimoto coefficient is [0.55, 1], that is, between 0.55 and 1. T_i2 represents the set of data structure features of the i-th data type; for example, the set of data structure features of the JPEG file type can be the content of Table 1 above.

On the basis of the data types T stored in advance as described above, preliminarily judge whether the similarity T_x of the fragment data x to be tested falls within the T_i1 range. If T_x falls within the T_i1 range, the fragment data x can be considered to belong to the i-th class of fragment data. If T_x does not fall within the T_i1 range, the fragment data x can be considered not to belong to the i-th class, and it is then necessary to judge whether T_x falls within the T_(i+1) range, that is, the similarity range of another known type. If T_x falls within the similarity range of none of the m pre-stored types, the type of the fragment data to be tested is considered unrecognizable.
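The preliminary judgment of step S14 amounts to checking the stored ranges one after another. In the sketch below, the JPEG range [0.55, 1] is taken from the text, while the second entry `type_b` and its range are invented placeholders; the function name `preliminary_type` is likewise hypothetical.

```python
from typing import Optional

# T_i1 ranges per known type. Only the JPEG range comes from the text;
# "type_b" is a made-up entry used to show the loop over several types.
KNOWN_RANGES = {
    "jpeg": (0.55, 1.0),
    "type_b": (0.20, 0.40),
}

def preliminary_type(t_x: float, ranges: dict[str, tuple[float, float]]) -> Optional[str]:
    """Return the first known type whose similarity range contains t_x,
    or None if the fragment cannot be recognized."""
    for name, (low, high) in ranges.items():
        if low <= t_x <= high:
            return name
    return None
```

Returning the first matching type mirrors the sequential check from T_i to T_(i+1) described above; how overlapping ranges should be resolved is not stated in the source.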

Step S15: introduce the relevant structural feature δ_x of the fragment data to be tested as optimization parameter 1, as shown in Fig. 3. That is, when the similarity T_x of the byte frequency distribution between the fragment data x and a sample S falls within the similarity range of a known data type T_i, further judge whether δ_x is present in the fragment data x. If it is, continue by judging whether δ_x ∈ T_j holds, where T_j denotes the set of data structure features of a type j that is not known in advance. If δ_x ∈ T_j holds, continue by judging whether i equals j. If i = j, the set of data structure features represented by T_j is identical to T_i2, and the fragment data x to be tested can be judged to belong to the type represented by data type T_i. If i and j are not equal, continue with step S16. If δ_x ∈ T_j does not hold, or δ_x is not present in the fragment data x, this case is not analyzed further and the result of step S14 is taken as the overall result.

In step S15, when the similarity T_x of the byte frequency distribution between the fragment data x to be tested and a sample S falls within the similarity range of a known data type T_i, it is further confirmed whether the structural feature δ_x of the fragment data x to be tested also belongs to the structural feature set of this known data type T_i, so that the type of the fragment data x can be judged more accurately.
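The three possible outcomes of step S15 can be sketched as follows. The names `Outcome`, `check_structure` and `FEATURE_SETS` are hypothetical. The JPEG entries are a subset of Table 1, and the PNG signature is added only as an illustrative second type that does not appear in the source. Treating a fragment that contains features of several types, including type i, as confirmed is an assumption of this sketch.

```python
from enum import Enum

class Outcome(Enum):
    CONFIRMED = "confirmed"            # a feature of type i was found (i == j)
    KEEP_PRELIMINARY = "keep"          # no known feature found: keep the S14 result
    NEEDS_NEIGHBOURS = "neighbours"    # only features of other types found (i != j): go to S16

# Structural feature sets (the role of T_i2). Illustrative values only.
FEATURE_SETS = {
    "jpeg": [bytes.fromhex("FFDB"), bytes.fromhex("FFC4"), bytes.fromhex("FFDA")],
    "png": [b"\x89PNG"],
}

def check_structure(fragment: bytes, preliminary: str,
                    feature_sets: dict[str, list[bytes]]) -> Outcome:
    """Compare the preliminary type i with the types j whose features occur in the fragment."""
    found = {name for name, patterns in feature_sets.items()
             if any(pattern in fragment for pattern in patterns)}
    if not found:
        return Outcome.KEEP_PRELIMINARY
    if preliminary in found:
        return Outcome.CONFIRMED
    return Outcome.NEEDS_NEIGHBOURS
```

The function only inspects the fragment and never modifies it, in line with the requirement that the original data remain unchanged during recognition.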

Step S16: introduce the relevance of the distance between fragment data as optimization parameter 2. Because fragments of the same file have an 80% probability of being distributed within 32 data blocks, fragment data are not randomly distributed on the disk; there is a certain relevance between fragments, that is, a continuous run of fragment data belongs to the same file.

The measure of step S16 is adopted in two cases: when i ≠ j in step S15, or when, in step S14, the similarity T_x of the fragment data x to be tested falls within the T_i1 range but is low. For example, for the JPEG file type the valid similarity range is [0.55, 1]; if the similarity T_x of the fragment data x to be tested is 0.56, the degree of similarity between x and the JPEG file type is evidently low. As shown in Fig. 4, judge whether the sequence number of the fragment data currently being tested is the last in the data blocks where it is located. If not, increment the sequence number by 1 and judge whether the similarity of the fragment data with that sequence number lies within the T_i range. The comparison is repeated until all other data in the data blocks where the fragment data x to be tested is located have been compared, and the number of fragment data whose similarity lies within the T_i range is then counted. If the proportion of fragment data whose similarity lies within the T_i range is greater than 80%, the fragment data x is judged to belong to the type represented by data type T_i; otherwise the fragment data x is judged to be unrecognizable.

That is, step S16 determines what proportion of the data blocks is made up of fragment data falling within the T_i range; if, for example, it is greater than 80%, the fragment data x can be considered with considerable certainty to belong to the type represented by data type T_i.
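The counting in step S16 can be sketched as a simple vote among the other fragments in the same group of data blocks. The function name `neighbour_vote` is hypothetical, and returning False when there are no neighbours is an assumption of this sketch; the 80% threshold comes from the text.

```python
def neighbour_vote(neighbour_similarities: list[float], low: float, high: float,
                   threshold: float = 0.8) -> bool:
    """Return True if the share of neighbouring fragments whose similarity
    lies within [low, high] is greater than the threshold."""
    if not neighbour_similarities:
        return False  # assumption: without neighbours the type stays unrecognized
    inside = sum(1 for t in neighbour_similarities if low <= t <= high)
    return inside / len(neighbour_similarities) > threshold
```

For a borderline fragment with T_x = 0.56 and the JPEG range [0.55, 1], for instance, the fragment would be accepted as JPEG only if more than 80% of its neighbours also fall within that range.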

The above embodiment makes it possible to determine the type of fragment data effectively. In addition, memory is allocated in pages (4 KB), which is an integer multiple of 512 B; the fragment data described in the present invention can therefore also refer to data in memory.

The fragment data type recognition method of the present invention can provide electronic evidence information for judicial forensics. On the one hand, it ensures that the type of fragment data can be recognized and, further, improves the recognition rate of the type; on the other hand, it ensures the reliability of the data during fragment data type recognition and their consistency with the original data, laying groundwork for subsequent fragment data reassembly.

Fig. 5 is a flowchart of the overall fragment data forensics work. The preparation stage and the fragment data extraction stage are preliminary work for the present invention; they are not described in detail here and can use existing general-purpose methods. After the fragment data have been extracted, they are analyzed, which includes rejecting contiguous file data blocks and performing the fragment data type recognition of the present invention. The fragment data are then reassembled, and the fragment evidence is presented and submitted to the court, that is, a conclusion is reached on the basis of the fragment data obtained.

The fragment data type recognition of the present invention provides the foundation for the next step in electronic forensics, the reassembly of fragment data. Moreover, because the fragment data remain consistent with the original data throughout acquisition, recognition and reassembly, the reliability and authenticity of the electronic evidence obtained are fundamentally guaranteed.

Claims

1. A method for recognizing the fragment data type, characterized by comprising the following steps:

T(A, S) = \frac{A \cdot S}{\|A\|^{2} + \|S\|^{2} - A \cdot S}

Formula (1)

where A = F(x) is the byte frequency distribution of the sector containing the fragment data x to be tested, and S is the byte frequency distribution of the sample data;

n = 256;

2. The method for recognizing the fragment data type according to claim 1, characterized by further comprising step 4,

3. The method for recognizing the fragment data type according to claim 1 or 2, characterized by further comprising step 5,

4. The method for recognizing the fragment data type according to claim 1, characterized in that the following step precedes step 1:

Step A: extract the sample model S, and determine the similarity between the fragment data of the various file types and the sample model S.

5. The method for recognizing the fragment data type according to the claim, characterized in that the following steps precede step 1:

6. The method for recognizing the fragment data type according to claim 1, characterized in that the fragment data includes fragment data on the various kinds of disks and fragment data in memory.

7. The method for recognizing the fragment data type according to claim 3, characterized in that the number of data blocks where the fragment data to be tested is located is 2⁵ to 2⁸ blocks.

8. The method for recognizing the fragment data type according to claim 3, characterized in that the predetermined quantity is more than 80% of the number of data blocks where the fragment data to be tested is located.