CN103942122B - Method for identifying AVI-type blocks - Google Patents

Method for identifying AVI-type blocks

Info

Publication number
CN103942122B
Authority
CN
China
Prior art keywords
block
avi
types
byte
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410164339.5A
Other languages
Chinese (zh)
Other versions
CN103942122A (en)
Inventor
杨涛
杨一涛
潘俊
孙国梓
刘力颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201410164339.5A
Publication of CN103942122A
Application granted
Publication of CN103942122B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for identifying AVI-type blocks. The method combines the byte identification codes of the Audio Video Interleaved (AVI) format with a C4.5 decision tree to recognize blocks that belong to AVI files on storage media such as hard disks and USB flash drives. It is designed as a preprocessing step for carving deleted data from such media without relying on file system metadata; file carving in general involves two steps, classification and recovery. The steps of the invention are as follows: first, blocks that contain specific identification codes are identified by byte identification code matching; then, for the blocks not yet identified, a decision tree is trained on a training set that simulates the storage environment of the disk and is used to classify them. The scheme adapts to complex, multi-file, large-capacity storage environments. In addition, the invention achieves good recognition accuracy for blocks that originally belonged to AVI files, and is of considerable practical value in application fields such as forensic evidence collection and data recovery.

Description

Method for identifying AVI-type blocks
Technical field
The present invention relates to the technical field of computer data mining, and more particularly to a method for identifying AVI-type blocks.
Background art
With the development of information technology, data recovery is becoming increasingly important as the last line of defense for information security, and demand for it in forensic, military, and civilian fields keeps growing. Traditional data recovery methods cannot recover fragmented data even when the remaining metadata is used. How to recover data that may be damaged and lacks metadata is therefore a problem in urgent need of a solution. Damaged data is often highly valuable and sometimes contains information critical to a case. In the civilian field, video recovery also has a wide range of applications; for example, a wedding company may need to retrieve a client's wedding video that was deleted by accident. Video recovery therefore has great economic value for certain enterprises. While the development of information technology lets people create astonishing amounts of data, it also poses the problem of data recovery to researchers.
Early data recovery relied too heavily on the metadata provided by the file system; file carving methods that recover data without depending on metadata appeared later. File carving recovers data on the basis of the internal structure and content of files. The earliest file carving methods read data according to the header and footer signatures of a file and are suitable only when files are stored sequentially. Research shows that about 15% to 20% of files larger than a few megabytes become fragmented, which means that a large number of fragmented files exist on a disk. For fragmented files, a carving method that reads contiguously will produce errors. It is therefore necessary to study carving methods that are applicable to fragmented files.
At present, a framework has been proposed for carving fragmented files. It mainly comprises two parts: block identification and recovery. However, existing identification methods for AVI (Audio Video Interleaved) generally suffer from a low recognition rate. The present invention proposes a new method for classifying AVI-type blocks.
Summary of the invention
The object of the present invention is to propose a method for identifying AVI-type blocks on storage media such as disks. The method first performs a preliminary identification using the byte identification codes inherent in the AVI format. For the remaining blocks it then applies the C4.5 decision tree method, using the byte frequency distribution (BFD) as the feature, to identify AVI-type blocks that contain no byte identification code. Identification of AVI-type blocks is thus achieved through two successive rounds of recognition.
The technical solution adopted by the present invention to solve the technical problem is as follows. On the basis of an analysis of the characteristics of AVI-type blocks, the invention mines the byte identification codes and byte frequency distribution information that a block may contain, and then identifies target blocks by byte identification code matching and by the C4.5 decision tree method. The method mainly comprises the steps of image backup, block extraction, byte identification code matching, and C4.5 decision tree identification.
Method flow:
Step 1: Image backup.
The content of the storage medium is backed up in full to another storage medium using a dedicated backup tool, so that the data source is not damaged during data recovery. The backup covers every sector from the first to the last. The backed-up data includes both the metadata portion and the actual data portion.
Step 2: Extract blocks.
The storage medium is scanned and the file table is used to distinguish the blocks recorded in the file table from those that are not. The unrecorded blocks comprise blocks that hold no stored data and data blocks whose metadata has been lost or damaged. These unrecorded blocks are backed up to another storage medium and serve as the objects from which target blocks are identified.
Step 3: Byte identification code matching.
The byte identification codes exclusive to AVI-type blocks include List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (## stands for a number such as 01, 02, 03), rec, idx1, and so on. Each block is searched in turn for byte identification codes; when a byte identification code from the set above appears in a block, the block is determined to be an AVI fragment.
Step 4: C4.5 decision tree identification.
After the file types contained in the image have been determined, a training set composed of blocks of these types is built. Where the number of files of each type is unknown, an equal number of blocks is chosen for each type, and the number of blocks is kept sufficiently large. The byte frequency distribution (BFD) of each block is then extracted and used as the feature, and a decision tree is built on the training set according to the C4.5 algorithm. The decision tree is then used to identify each block in the test set.
The C4.5 algorithm builds the classification tree through the following steps: (1) calculate the entropy of the class variable; (2) take each attribute in turn as the root and calculate the resulting information gain; (3) select the attribute with the largest information gain as the root.
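The selection criterion in these three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `entropy`, `split_scores`, and `best_threshold` are hypothetical, the attribute is assumed to be numeric (as a byte frequency is) and split by a threshold, and the gain ratio is computed alongside the information gain because C4.5 is commonly described as ranking splits by gain ratio.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_scores(values, labels, threshold):
    """Return (information gain, gain ratio) for the binary split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0, 0.0  # a split that separates nothing carries no information
    n = len(labels)
    weights = (len(left) / n, len(right) / n)
    gain = entropy(labels) - weights[0] * entropy(left) - weights[1] * entropy(right)
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain, gain / split_info

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain ratio)."""
    distinct = sorted(set(values))
    best = (None, 0.0)
    for a, b in zip(distinct, distinct[1:]):
        threshold = (a + b) / 2
        _, ratio = split_scores(values, labels, threshold)
        if ratio > best[1]:
            best = (threshold, ratio)
    return best
```

Applying `best_threshold` to every attribute column and keeping the attribute with the highest score gives the root; repeating the procedure on each branch grows the rest of the tree.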
Beneficial effects:
1. The present invention can identify AVI-type blocks with a high recognition rate.
2. The present invention adapts to complex storage environments and can identify target blocks in an environment that contains blocks of many file types, such as pictures, videos, and documents.
Brief description of the drawings:
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a flow chart of the C4.5 algorithm.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1 and Fig. 2, the present invention proposes a method for identifying AVI-type blocks, comprising the following steps:
Step 1: Image backup
The objects to be backed up include storage media such as hard disks, USB flash drives, and optical discs. Ghost is a tool for cloning hard disks. For USB flash drives there is software such as UBackUp and other USB flash drive backup tools. Optical discs can be backed up with disc burning software. The backup here is a full backup: both deleted and undeleted data stored on the backup object are copied and stored on another medium.
1) Select another storage medium.
2) Select a backup tool appropriate to the backup object and make a full backup of all data on the backup object.
3) When the backup is complete, preserve the original storage medium. The data backed up on the other storage medium is used for identifying AVI-type blocks.
In step 1 of the present invention, suitable backup software is selected according to the type of storage medium, and the original storage medium is preserved once the backup is complete. The backup covers every sector from the first to the last. The backed-up data includes both the metadata portion and the actual data portion.
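The source relies on existing backup tools for this step. Purely as an illustration of what a first-to-last-sector copy involves, the following Python sketch copies a readable source stream to a destination in sector-aligned chunks. The function name `image_medium`, the 512-byte sector size, and the SHA-256 digest are assumptions added here; the patent does not describe a hash check.

```python
import hashlib
import io

SECTOR_SIZE = 512  # assumed sector size; some media use 4096

def image_medium(src, dst, sector_size=SECTOR_SIZE, sectors_per_read=2048):
    """Copy every byte from src to dst; return (bytes_copied, sha256 hex digest)."""
    digest = hashlib.sha256()
    copied = 0
    chunk_size = sector_size * sectors_per_read
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of the medium reached
        dst.write(chunk)
        digest.update(chunk)
        copied += len(chunk)
    return copied, digest.hexdigest()

if __name__ == "__main__":
    # In-memory stand-ins for a device and an image file.
    source = io.BytesIO(bytes(range(256)) * 10)
    image = io.BytesIO()
    size, checksum = image_medium(source, image)
    print(size, checksum)
```

Recording the digest alongside the image would let later steps confirm that the working copy still matches what was read from the original medium.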
Step 2: Extract blocks
1) Scan the image data and analyze the metadata to determine which blocks in the image are allocated and which are unallocated.
2) Data in allocated blocks does not need to be recovered, so the allocated blocks are marked. The unallocated blocks are then read out one after another and stored in some file format (txt is used here). Each block stored in txt format is an object of identification.
In step 2 of the present invention, allocated blocks, that is, the blocks that do not need to be recovered, are marked according to the metadata information. Each unallocated block is saved as a txt file for subsequent identification.
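A minimal Python sketch of this extraction is given below. It assumes that the metadata analysis has already produced the set of allocated block indices; how that set is derived depends on the file system and is not shown. The names `extract_unallocated` and `save_blocks`, the 4096-byte block size, and the output file naming are hypothetical.

```python
from pathlib import Path

def extract_unallocated(image: bytes, allocated: set[int], block_size: int = 4096):
    """Yield (index, data) for every block whose index is not marked as allocated."""
    for offset in range(0, len(image), block_size):
        index = offset // block_size
        if index not in allocated:
            yield index, image[offset:offset + block_size]

def save_blocks(blocks, out_dir: Path):
    """Write each block to its own file; the source uses txt as the container format."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, data in blocks:
        path = out_dir / f"block_{index:08d}.txt"
        path.write_bytes(data)  # raw bytes are kept unchanged despite the .txt name
        paths.append(path)
    return paths
```

Writing the raw bytes with `write_bytes` keeps every byte value intact, which matters because both later identification rounds operate on byte values.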
Step 3: Byte identification code matching
The AVI file type is one kind of RIFF container file. RIFF file types contain various byte identification codes that are used to distinguish data types. Analysis of RIFF file types shows that, apart from the RIFF identification code itself, files of these types share no other identification codes. In other words, the type of a block can be determined from its byte identification codes other than RIFF.
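To make the role of these identification codes concrete, the following Python sketch reads the start of a RIFF file: a 4-byte `RIFF` tag, a 4-byte little-endian size, and a 4-byte form type (`AVI ` for AVI files), followed by chunks that each begin with a 4-byte identifier and a 4-byte size. The function names are hypothetical, and the walker treats LIST chunks as opaque rather than descending into them. A block taken from the middle of a fragmented file has no such header, which is why the method looks for the inner identification codes instead.

```python
import struct

def parse_riff_header(block: bytes):
    """Return (form_type, declared_size) if the block starts with a RIFF header, else None."""
    if len(block) < 12 or block[:4] != b"RIFF":
        return None
    (size,) = struct.unpack("<I", block[4:8])
    return block[8:12], size

def iter_chunks(data: bytes, start: int = 12):
    """Walk top-level chunks: 4-byte id, 4-byte little-endian size, payload padded to even length."""
    pos = start
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        yield chunk_id, size
        pos += 8 + size + (size & 1)  # skip header, payload, and the pad byte if size is odd
```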
1) Determine the byte identification codes that are distinctive to AVI files. Analysis of the file format shows that the following byte identification codes are exclusive to AVI files: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (## stands for a number such as 01, 02, 03), rec, idx1.
2) Use the KMP method to match byte identification codes against each block stored in txt format. As soon as a txt file contains one matching byte identification code, matching stops and the block is considered to be an AVI-type block.
3) The identified blocks form a set. The identified blocks are removed from the original set of txt files. The remaining txt files are passed to the second round of identification by the C4.5 decision tree method.
In step 3 of the present invention, the byte identification codes distinctive to AVI files are as follows: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (## stands for a number such as 01, 02, 03), rec, idx1. These identification codes are used for byte matching against each block to be identified. Using the KMP method, byte identification code matching is carried out on each block stored in txt format. As soon as a txt file contains one matching byte identification code, matching stops and the block is considered to be an AVI-type block.
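The first-round matching can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the patented implementation. The fixed signatures use the on-disk spelling of the AVI format (`LIST`, `AVI ` and `rec ` with a trailing space) where the text above writes List, avi, and rec. The `##wb`, `##dc`, and `##db` codes are matched with a regular expression for two ASCII digits instead of KMP, which is a substitution made here for brevity. All function names are hypothetical.

```python
import re

# Fixed signatures, assumed on-disk spelling.
FIXED_SIGNATURES = [b"LIST", b"AVI ", b"hdrl", b"avih", b"strl", b"strf", b"strd",
                    b"JUNK", b"odml", b"movi", b"rec ", b"idx1"]
# ##wb / ##dc / ##db: two ASCII digits followed by the chunk suffix.
STREAM_CHUNK = re.compile(rb"[0-9]{2}(?:wb|dc|db)")

def failure_table(pattern: bytes) -> list[int]:
    """KMP prefix function: longest proper prefix that is also a suffix, per position."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = failure_table(pattern)
    k = 0
    for i, byte in enumerate(text):
        while k and byte != pattern[k]:
            k = table[k - 1]
        if byte == pattern[k]:
            k += 1
            if k == len(pattern):
                return i - k + 1
    return -1

def is_avi_block(block: bytes) -> bool:
    """Stop at the first matching signature, as the step describes."""
    for signature in FIXED_SIGNATURES:
        if kmp_find(block, signature) != -1:
            return True
    return STREAM_CHUNK.search(block) is not None
```

Blocks for which `is_avi_block` returns `False` are the ones handed to the second round.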
Step 4: C4.5 decision tree identification
After a preliminary understanding of the data types on the storage medium has been obtained, a training set that matches the storage environment of the medium is built. This data set contains blocks of every file type present on the storage medium, and the number of blocks of each file type is sufficiently large and equal. The following preprocessing steps are then applied to these blocks:
1) Matlab is used to extract the BFD feature of each input block. The BFD features from all files form a matrix of size (number of blocks) × 256, which is saved as a csv file. Each row represents the BFD feature of one block, and each column corresponds to one byte value used as a feature.
2) The class label of each row is determined by the file type to which the block belongs. If the row holds the BFD feature of an AVI fragment, it is labeled Yes; otherwise it is labeled No.
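The source performs this preprocessing in Matlab. The Python sketch below is a stand-in showing the same data layout: one row per block, 256 feature columns, and a Yes/No label. Normalizing counts to relative frequencies is an assumption, since the text does not say whether raw counts or frequencies are stored, and the names `byte_frequency_distribution`, `write_training_csv`, and the column headers are hypothetical.

```python
import csv
import io

def byte_frequency_distribution(block: bytes) -> list[float]:
    """Return 256 relative frequencies, one for each byte value 0..255."""
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    total = len(block) or 1  # avoid division by zero for an empty block
    return [c / total for c in counts]

def write_training_csv(samples, out):
    """samples: iterable of (block_bytes, is_avi). Writes 256 feature columns plus a label."""
    writer = csv.writer(out)
    writer.writerow([f"b{v}" for v in range(256)] + ["label"])
    for block, is_avi in samples:
        label = "Yes" if is_avi else "No"
        writer.writerow(byte_frequency_distribution(block) + [label])

if __name__ == "__main__":
    buffer = io.StringIO()
    write_training_csv([(b"00dc\x00\x00", True), (b"plain text", False)], buffer)
    print(buffer.getvalue().splitlines()[0][:40])
```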
From the csv file obtained by preprocessing, a decision tree is built using the C4.5 decision tree method. Each node of the decision tree is a byte value used as a feature. The blocks remaining after byte identification code matching are then identified one by one with the C4.5 decision tree. The specific steps are as follows:
1) Read a block to be identified and extract its BFD feature.
2) Using the C4.5 decision tree already built and the BFD of the block to be identified, select branches one by one according to the threshold at each node; identification is complete when a leaf node is reached.
3) Identify all other blocks according to steps 1 and 2.
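The traversal in these steps can be sketched in Python as follows. The `Node` structure and its field names are hypothetical, and the tree in the usage note is built by hand for illustration rather than trained; the sketch only shows how a block's BFD is routed from the root to a leaf by comparing one byte frequency with a threshold at each node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One decision tree node; all field names are hypothetical."""
    label: Optional[str] = None    # set only on leaves: "Yes" (AVI) or "No"
    byte_value: int = 0            # BFD column tested at an internal node
    threshold: float = 0.0         # split point for that column
    low: Optional["Node"] = None   # branch taken when frequency <= threshold
    high: Optional["Node"] = None  # branch taken when frequency > threshold

def bfd(block: bytes) -> list[float]:
    """Relative frequency of each byte value 0..255 in the block."""
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    total = len(block) or 1
    return [c / total for c in counts]

def classify(features: list[float], node: Node) -> str:
    """Follow threshold comparisons from the root until a leaf is reached."""
    while node.label is None:
        if features[node.byte_value] <= node.threshold:
            node = node.low
        else:
            node = node.high
    return node.label

def classify_all(blocks: dict[str, bytes], tree: Node) -> dict[str, str]:
    """Apply steps 1 and 2 to every remaining block."""
    return {name: classify(bfd(data), tree) for name, data in blocks.items()}
```

For example, with the toy tree `Node(byte_value=0, threshold=0.5, low=Node(label="Yes"), high=Node(label="No"))`, a block consisting mostly of zero bytes follows the `high` branch and is labeled No.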
In step 4 of the present invention, the C4.5 algorithm performs a second round of identification on the blocks remaining after byte identification code matching, to ensure that blocks which contain no identification code but are in fact of AVI type are identified. To make the decision tree better match the storage environment of the actual storage medium, a preliminary analysis of the main file types contained on the storage medium is carried out before the training set is prepared. The block types in the training set (that is, the file types to which the blocks belong) are then made consistent with the file types on the storage medium, with an equal and sufficiently large number of blocks of each type. Once the training set is obtained, its BFD features are extracted with Matlab and the class label of each row is determined by the file type to which each block belongs. Finally, a csv file representing the training set is produced. Processing the training set with the C4.5 decision tree method yields the decision tree for that training set. For each block to be identified, after its BFD is extracted, branches of the decision tree are selected one by one according to the threshold at each node; identification is complete when a leaf node is reached.

Claims (8)

1. A method for identifying AVI-type blocks, characterized in that the method comprises the following steps:
Step 1: Image backup;
The backup is a full backup, in which both deleted and undeleted data stored on the backup object are copied and stored on another medium, comprising:
1) selecting another storage medium;
2) selecting a backup tool appropriate to the backup object and making a full backup of all data on the backup object;
3) when the backup is complete, preserving the original storage medium; the data backed up on the other storage medium is used for identifying AVI-type blocks;
Step 2: Extract blocks;
1) scanning the image data and analyzing the metadata to determine which blocks in the image are allocated and which are unallocated;
2) data in allocated blocks does not need to be recovered; the allocated blocks are marked; the unallocated blocks are then read out one after another and stored in txt file format; each block stored in txt format is an object of identification;
Step 3: Byte identification code matching;
The AVI file type is one kind of RIFF container file; RIFF file types contain various byte identification codes used to distinguish data types; analysis of RIFF file types shows that, apart from the RIFF identification code itself, files of these types share no other identification codes; the type of a block is determined from its byte identification codes other than RIFF;
Step 4: C4.5 decision tree identification;
A training set that matches the storage environment of the storage medium is built; the training set contains blocks of every file type on the storage medium, and the number of blocks of each file type is sufficiently large and equal; these blocks are then preprocessed, comprising:
1) using Matlab to extract the byte frequency distribution feature of each input block; the byte frequency distribution features from all files form a matrix of size (number of blocks) × 256, which is saved as a csv file; each row represents the byte frequency distribution feature of one block, and each column corresponds to one byte value used as a feature;
2) determining the class label of each row according to the file type to which each block belongs; if the row holds the byte frequency distribution feature of an AVI fragment, it is labeled Yes; otherwise it is labeled No;
From the csv file obtained by preprocessing, a decision tree is built using the C4.5 decision tree method, each node of the decision tree being a byte value used as a feature; the blocks remaining after byte identification code matching are identified one by one with the C4.5 algorithm, comprising:
Step 4-2-1: reading a block to be identified and extracting its byte frequency distribution feature;
Step 4-2-2: using the C4.5 decision tree already built and the byte frequency distribution of the block to be identified, selecting branches one by one according to the threshold at each node; identification is complete when a leaf node is reached;
Step 4-2-3: identifying all other blocks according to step 4-2-1 and step 4-2-2.
2. The method for identifying AVI-type blocks according to claim 1, characterized in that step 1 of the method comprises: selecting suitable backup software according to the type of storage medium and, once the backup is complete, preserving the original storage medium; the backup covers every sector from the first to the last; the backed-up data includes both the metadata portion and the actual data portion.
3. The method for identifying AVI-type blocks according to claim 1, characterized in that step 2 of the method comprises: marking allocated blocks, that is, the blocks that do not need to be recovered, according to the metadata information; saving each unallocated block as a txt file for subsequent identification.
4. The method for identifying AVI-type blocks according to claim 1, characterized in that step 3 of the method comprises: the byte identification codes distinctive to AVI files are as follows: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db, rec, idx1; the identification codes are used for byte matching against each block to be identified, where ## stands for a number 01, 02, 03, and so on.
5. The method for identifying AVI-type blocks according to claim 1, characterized in that in step 3 of the method, the KMP method is used to match byte identification codes against each block stored in txt format; as soon as a txt file contains one matching byte identification code, matching stops and the block is considered to be an AVI-type block; the identified blocks form a set and are removed from the original set of txt files, and the remaining txt files are passed to the second round of identification by the C4.5 decision tree method.
6. The method for identifying AVI-type blocks according to claim 1, characterized in that in step 4 of the method, the C4.5 algorithm performs a second round of identification on the blocks remaining after byte identification code matching, to ensure that blocks which contain no identification code but are in fact of AVI type are identified; before the training set is prepared, a preliminary analysis of the main file types contained on the storage medium is carried out, and the block types in the training set are then made consistent with the file types on the storage medium, with an equal number of blocks of each type.
7. The method for identifying AVI-type blocks according to claim 1, characterized in that in step 4 of the method, once the training set is obtained, its byte frequency distribution features are extracted with Matlab, the class label of each row is determined by the file type to which each block belongs, and finally a csv file representing the training set is produced; processing the training set with the C4.5 decision tree method yields the decision tree for that training set; for each block to be identified, after its byte frequency distribution is extracted, branches of the decision tree are selected one by one according to the threshold at each node, and identification is complete when a leaf node is reached.
8. The method for identifying AVI-type blocks according to claim 1, characterized in that the method identifies AVI-type blocks on the basis of identification codes and the C4.5 decision tree method.
CN201410164339.5A 2014-04-22 2014-04-22 Method for identifying AVI-type blocks Active CN103942122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410164339.5A CN103942122B (en) 2014-04-22 2014-04-22 Method for identifying AVI-type blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410164339.5A CN103942122B (en) 2014-04-22 2014-04-22 Method for identifying AVI-type blocks

Publications (2)

Publication Number Publication Date
CN103942122A CN103942122A (en) 2014-07-23
CN103942122B true CN103942122B (en) 2017-09-29

Family

ID=51189795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410164339.5A Active CN103942122B (en) 2014-04-22 2014-04-22 Method for identifying AVI-type blocks

Country Status (1)

Country Link
CN (1) CN103942122B (en)

Also Published As

Publication number Publication date
CN103942122A (en) 2014-07-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant