WO2016101006A1

WO2016101006A1 - Data reduction method for digital forensic data

Info

Publication number: WO2016101006A1
Application number: PCT/AU2015/000475
Authority: WO
Inventors: Darren Paul QUICK; Kim Kwang CHOO
Original assignee: University Of South Australia
Priority date: 2014-12-23
Filing date: 2015-08-11
Publication date: 2016-06-30

Abstract

A data reduction method and tool for collection of digital forensic data from a target source is described. A target data source is forensically accessed and the files are filtered to generate a first set of files to be added to a data container. These files are further processed to create representations of the original files with reduced file sizes. Video files are converted into composite image files in which each image file comprises a plurality of frames sampled from the video file, and image files are converted into a standard format and size. The target data source is also processed to identify hidden data. The processed files are added to the data container using a compressed container format. Additionally, a report is generated on all files, together with hard disk drive and partition information, which can be reviewed by a practitioner. The method can be performed in parallel with or subsequent to collection of a full forensic bit-for-bit copy and enables rapid collection, processing, indexing and searching of subset data to take place, which can quickly highlight devices that contain potential evidential material that may require full imaging and analysis, as well as devices for which full analysis may not be required.

Description

DATA REDUCTION METHOD FOR DIGITAL FORENSIC DATA

PRIORITY DOCUMENT

[0001] The present application claims priority from:

Australian Provisional Patent Application No. 2014905242 titled "DATA REDUCTION METHOD FOR DIGITAL FORENSIC DATA" filed on 23 December 2014.

The content of this application is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] The present invention relates to forensic investigations of digital devices and data.

BACKGROUND

[0003] Digital forensic analysis is the process of identifying, preserving, analysing, and presenting digital evidence in a manner that is legally acceptable. With the growing use of digital devices such as computers and mobile phones, there has been a growing demand for digital forensic investigations and analysis by digital forensic laboratories within Law Enforcement and other investigative agencies. There are guidelines and recommended practices by which practitioners abide to ensure that the process they use to examine data and information is carried out in a legally acceptable manner. One of these practices is to undertake an examination on a forensic copy of media, rather than the original. Current practice is to make a full bit-for-bit forensic copy of exhibit media, such as computer hard drives or USB storage media, which can take many hours, and the time required increases with the volume of data to be copied. The speed of hard drives and USB media has an upper limit, which limits any improvement in transfer rates, and so the time to make a full forensic image is increasing as the size of media grows.

[0004] To date, the growth of digital forensic data has outpaced the capabilities of digital forensic laboratories to store and analyse the presented data, which has led to large backlogs of evidence awaiting analysis, often for many months to years. With the predicted ongoing growth in technology and capacity of storage devices, including the increasing use of consumer devices (smart phones and tablets), together with the use of cloud storage, this problem is expected to worsen over the coming years. While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue of data growth has not been adequately resolved.

[0005] The issue of the volume of data required to be analysed in a digital forensic examination has been raised over many years. In 1999, McKemmish in "What is forensic computing? Trends & Issues in Crime and Criminal Justice no. 118" (Canberra: Australian Institute of Criminology, http://aic.gov.au/publications/current%20series/tandi/101-120/tandi118.html) stated that the rapid increase in the size of storage media is probably the greatest single challenge to forensic analysis. In the intervening years, there have been many publications stating that the increasing volume of data is a major issue for forensic analysis. However, no overall solutions have been proposed, and the problem is still discussed and remains largely unsolved. For example, while there are various tools and techniques to assist an investigator, the time and effort to undertake analysis still remains a serious challenge.

[0006] Moore's Law is the observation that the number of transistors on an integrated circuit doubles every 18-24 months and that this assists in predicting the development of technology. Similarly Kryder (as cited in Walter C 2005. Kryder's Law. Scientific American 25 July. http://www.scientificamerican.com/article/kryders-law/) observed that in the space of under 15 years, the storage density of hard disks had increased 1,000 fold, from 100 million bits per square inch in 1990, to 2005 when 110 gigabit drives were released by Seagate. Kryder's Law can equate to the storage density doubling every 12 months, and has been holding true since 1995. This is about twice the pace of Moore's Law. While storage capacity is doubling every year, the capacity to process data is only doubling every 18 to 24 months, leading to an ever-growing gap in the capability to process the volume of data seized using processing power alone.

[0007] Data published by the US Federal Bureau of Investigation (FBI) Regional Computer Forensic Laboratory (RCFL) reveals that the volume of data analysed rose from 82 terabytes in 2003 to 5,896 terabytes (or 5.8 petabytes (PB)) in 2012 - an annual increase of 67 percent. The total volume of forensic data examined by the FBI RCFL in the period 2003 to 2012 was approximately 20PB. To store this volume of data poses significant cost issues. For example, in 2011, the cost of a commercial solution to house 14PB of data, with the ability to scale to 15PB, was an estimated US$18m. A cheaper option in 2013 is to store the data using widely available 3TB removable hard drives. This solution requires 6,588 external hard disk drives with an estimated total cost of US$922,292. However, forensic data stored in this way would simply be archived and would not be available for immediate review. Tape storage or other solutions would potentially be cheaper, but also require a method to retrieve the data from the stored medium prior to enabling access to the data for processing or searching. Consequently, the data is not readily available for review or analysis.

[0008] One approach to reduce data volume is to use compression and thus forensic bit-for-bit copies of hard drives or other media (commonly referred to as forensic 'images') are often compressed, using containers such as the Expert Witness, E01, or other compressed formats. Data analysis was conducted on the figures for the volume of data comprising a range of forensic case types examined by the South Australia Police (SAPOL) Electronic Crime Section (ECS) and it was found that the compression amount varied according to the data on each evidence item and ranged from 92 percent to two percent of the total volume, with the average compression observed across 107 hard drives being 51.1 percent. When this compression percentage is applied to the FBI's 20PB of data, this reduces the storage requirement to just over 10PB of forensic images. Hence, using compressed forensic image formats would reduce the cost to store the data, although storage costs are still significant.
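The arithmetic behind the "just over 10PB" figure can be checked with a short sketch. It assumes that the 51.1 percent figure denotes the compressed size as a fraction of the original volume, which is the reading consistent with the result quoted above; the function name is illustrative only.

```python
# Figures quoted in the paragraph above; no new data is introduced.
TOTAL_PB = 20.0              # approximate FBI RCFL volume, 2003 to 2012
COMPRESSED_FRACTION = 0.511  # ASSUMPTION: compressed size / original size


def compressed_volume_pb(total_pb: float, fraction: float) -> float:
    """Storage needed once images shrink to `fraction` of their original size."""
    return total_pb * fraction


if __name__ == "__main__":
    print(round(compressed_volume_pb(TOTAL_PB, COMPRESSED_FRACTION), 2))
```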

[0009] Given the underlying issue of the large quantity of raw data to be stored, and the significant costs associated with providing storage and access, several researchers have proposed data reduction methods as a potential solution to the problem of 'big digital forensic data' (ie the volume, variety, and velocity of digital forensic data seized and presented for analysis). However, traditional data mining and data reduction methods often focus on random extracts of data for processing, raising concerns about their suitability to analysis of forensic data holdings as such approaches may miss crucial evidence. Thus these traditional data reduction processes must be undertaken on the understanding that by not collecting or storing all data, there is a subsequent risk that evidential information is potentially missed and therefore a randomly collected subset of data may not be suitable for full or thorough analysis at a later date.

[0010] However, few methods have been outlined to date to enable data reduction in forensic investigations for current and future cases. One approach is Forensic Feature Extraction (FFE) and Cross Drive Analysis methods. FFE is outlined as a scan of a disk image for email addresses, message information, date and time information, cookies, social security and credit card numbers. The information from the data scan is stored as XML for analysis and comparison. With this approach the original data is not stored, and is interpreted; as such, there may be instances where new techniques are not able to be applied to the original or historical data. There have been many developments in recent years whereby additional information is able to be extracted from data holdings that were previously unknown. As an example, Windows Registry analysis methodologies include newly discovered areas for locating information. Other approaches avoid data reduction and instead are selective about which files are extracted, such as those based on searches or predefined profiles. However, these approaches also risk missing important data.

[0011] Another issue that arises in a forensic investigation is the initial time taken simply to take a forensic copy and index the data so that analysis can begin. Frequently investigators are provided with a large quantity of data and/or devices early in the investigative process and there is considerable pressure to rapidly identify the most relevant pieces of evidence in order to determine the best investigative leads to follow up. A frequent first step is to index the full forensic copy of the hard disk (or disks) to enable further investigation and analysis of disk contents. In many investigations it is now not uncommon for indexing of a full forensic copy to take several days, and in some cases the total quantity of collected data can exceed 6 Terabytes, which can take longer. Further, in particularly large cases the index and database can become too large for typical software to function. Thus whilst indexing has a valuable part to play in the digital forensic analysis process, the increasing time to index cases is becoming problematic.

[0012] The growth in the number of devices as well as the storage capacity impacts forensic examinations in many ways, including increasing lengths of time to create forensic copies and conduct analysis, which contributes to the increase in the backlog of requests. Digital forensic practitioners and laboratories, especially those in government and law enforcement agencies, are, and will continue to be, under pressure to deliver more with less, especially in today's economic landscape. This gives rise to a variety of needs for digital forensic practitioners and laboratories, such as the need for: more efficient methods of collecting and preserving evidence; the capacity to triage evidence prior to conducting full analysis; reduced data storage requirements; the ability to conduct a review of information in a timely manner for intelligence, research and evidential purposes; the ability to archive important data; the ability to quickly retrieve and review archived data; and the provision of a source of data to enable a review of current and historical cases (intelligence, research and knowledge management).

[0013] There is thus a need to provide improved methods to reduce the volume of digital forensic images of digital devices and data, or at least to provide a useful alternative to current tools and methods in relation to data reduction of digital forensic data from source or a full forensic image.

SUMMARY

[0014] According to a first aspect of the present invention, there is provided a data reduction method for digital forensic data, the method comprising:

forensically accessing a data source;

filtering the files in the data source to generate a first set of files to be added to a data container; processing the first set of files wherein a plurality of files are created which are reduced size representations of files on the data source, wherein processing comprises

converting one or more video files in the first set of files to one or more composite image files, each image file comprising a plurality of frames sampled from the video file;

converting one or more image files to a standard format and size, and

updating the first set of files with the converted files;

generating a report on all files; and

exporting the first set of files to a compressed container format.

[0015] In one form, each composite image file comprises a plurality of frames sampled from the video file at a predefined sampling interval or frequency. In one form, the predefined sampling frequency is every frame.

[0016] In one form, the video file is divided into contiguous portions, and sampling comprises selecting a frame from each contiguous portion.

[0017] In one form, the method further comprises processing each sampled frame and omitting frames which fail an image quality check.

[0018] In one form, each composite image file comprises a plurality of thumbnail images arranged according to a predefined layout, and each thumbnail image has a predefined size and represents a frame of the video file. In one form, each composite image further comprises an information portion comprising one or more items of contextual information.

[0019] In one form, filtering the files in the data source further comprises performing a known file check on a file, and omitting a file from the first set of files if it is determined to be identical to a known file generated by a third party. In one form, the known file check comprises performing a Hash calculation and comparing the Hash calculation with a known Hash value for the file.

[0020] In one form the filtering step further comprises one or more of recovering deleted files and folders, performing a file signature analysis, and/or expanding compressed container files.

[0021] In one form, the method further comprises reviewing the processed first set of files, and allowing a user to add or omit files from the first set of files.

[0022] In one form, the method further comprises generating a report on all files. In one form, reporting comprises providing a list of all files and a report on hard disk drive and partition information.

[0023] According to a second aspect of the present invention, there is provided a computer readable medium comprising instructions for causing a computer to perform the method of the first aspect.

[0024] According to a third aspect of the present invention, there is provided a computing apparatus comprising a communications interface, a memory and at least one processor, wherein the at least one processor is configured to perform the method of the first aspect.

BRIEF DESCRIPTION OF DRAWINGS

[0025] Embodiments of the present invention will be discussed with reference to the accompanying drawings wherein:

[0026] Figure 1A is a flowchart of a data reduction method for digital forensic data according to an embodiment;

[0027] Figure 1B is a more detailed flowchart of a data reduction method for digital forensic data according to an embodiment;

[0028] Figure 2A is a schematic diagram of conversion of a video into a sequence of thumbnail images according to an embodiment;

[0029] Figure 2B is a schematic diagram of generation of a thumbnail image from a sequence of video frames according to an embodiment;

[0030] Figure 2C is a schematic diagram of generation of a thumbnail image from a sequence of video frames according to another embodiment;

[0031] Figure 2D is a thumbnail image generated from a sequence of video frames according to an embodiment; and

[0032] Figure 3 is a schematic diagram of a computer apparatus according to an embodiment.

[0033] In the following description, like reference characters designate like or corresponding parts throughout the figures.

DESCRIPTION OF EMBODIMENTS

[0034] Typically, a digital forensic investigative process comprises identifying or discovering the location of potential evidence, such as a personal computer, mobile phone, portable storage, network stored data, or cloud storage comprising digital data. This identification/discovery is undertaken with appropriate legal authority to collect media containing potential evidence. Having identified and obtained the digital devices or data, the evidence is then preserved for subsequent analysis. This typically involves taking a full forensic bit-for-bit copy (image) of digital data sources using common forensic tools appropriate for the specific device, media or digital data (together digital data sources).

[0035] Embodiments of a data reduction method for digital forensic data will now be described. By utilising embodiments of the method described a reduced subset of data is generated from the primary data sources allowing a greater understanding of data to be made in substantially less time and at a substantially reduced cost, thus enabling investigators to rapidly analyse the data and identify potential leads for follow up. This is achieved by focusing on a subset of pre-categorised file and data types, thumbnailing video files, and reducing the dimensions of picture files. This enables a much quicker review process, which can be combined with reports or with analysis of subsets to identify entities, and search of external data sources to provide additional information about the identified entities. This data reduction methodology addresses a range of areas, including: reducing the volume of data to be stored for review, forensic triage, rapid review, data mining, intelligence analysis, presentation, and archival needs. This methodology considers the type of data to be collected, stored, and reviewed, with a focus on data which will provide the maximum information for minimal size (volume and time), and fits within existing digital forensic processes.

[0036] The data reduction method is typically performed either in parallel with or subsequent to taking a full forensic bit-for-bit copy, on the proviso that common forensic rules and practices are complied with, namely that no change to the original media is made where possible. That is, the reduction process should not be undertaken to the detriment of the preservation process and hence, evidential and legal requirements take priority. If changes to media are suspected to result from the subset reduction collection process, this should either not be undertaken, or be done subsequent to the evidence preservation process to ensure the evidence is not put at risk of not being accepted in court due to any changes made. The subset reduction process can be run across the original (write-blocked) media, or a full forensic image. However, if time or cost prohibits taking a full forensic bit-for-bit copy (image) of data sources, the data reduction method could be used instead of taking a full forensic bit-for-bit copy, provided of course that common forensic rules and practices are complied with, and the original source is available for subsequent analysis if necessary.

[0037] Figure 1A is a flowchart of a data reduction method for digital forensic data according to an embodiment. The methodology broadly comprises loading 2 or forensically accessing the target media or digital data source by a computing apparatus configured to implement the data reduction method and filtering 3 the data to display and select key files and file types to be included in the data subset, and identify files to be ignored. The digital data is then processed and converted 4 to identify hidden data and to reduce file size, followed by a review step 5 to allow the user to review and select any other files determined to hold possible relevant information, and to discard unusable files such as overwritten files. Reports on the data source 6 may be optionally generated and analysis may be performed (this is optional and could be performed before or after the preservation step). The data is then preserved 7 such as by exporting the selected files as a logical evidence file (eg L01). This method does not necessarily replace the need for full analysis. Rather it enables rapid analysis to assist in early stage identification of potential leads and evidence, and to facilitate further investigation as new data comes to light. The process can also be applied in a triage manner, as the framework enables rapid collection, processing, indexing and searching of subset data to take place, which can quickly highlight devices that contain potential evidential material, which may require full imaging and analysis. The application as a triage process can avoid imaging and analysing exhibits which may not have any relevant information (the likelihood of which can be determined in the triage review). The review stage may provide a quick review that identifies key evidence, in which case, full analysis may not be required, saving the time to fully image and process the entire exhibit data. However, if required, more detailed analysis can then be performed on the complete data set.

[0038] To further illustrate the method, Figure 1B is a more detailed flowchart of a data reduction method for digital forensic data 10 according to an embodiment. The forensic access or load step 2 is broken into two steps 11 and 12. Step 11 comprises connecting a physical drive or media device to the computing apparatus implementing the data reduction method or mounting a forensic image (ie the bit-for-bit copy) as a physical drive. To ensure the reduction process is undertaken in a forensically sound manner, hardware and/or software write blockers and forensic software are used to enable the collection of data subsets. In one embodiment a SATA hard drive is connected via a hardware write blocker to ensure data is not altered. Once a digital data source is connected or mounted, the evidence is then loaded into forensic software configured to implement the data reduction process 3.

[0039] Forensic software (e.g. EnCase, X-Ways, The Sleuth Kit Framework, etc) is then used to access the write-protected hard drive and to filter files according to a configuration (config) file or database 13. This allows the user to define or select files containing potential data of interest, such as Windows Registry files, Internet browsing history, log files, documents, software initialisation files, software data files and other files of potential importance. Additionally the filtering process can identify unusable files, such as files that have been overwritten since a predefined date or time (whether by a user, by the operating system or software program). The filtering process can be specified by the forensic investigator using various methods. In one embodiment the forensic investigator edits a configuration file to define filtering criteria which is then read by the forensic software during loading of the software, or during an initialisation phase of the forensic software. Alternatively a database can be configured with the pre-built conditions or filters and accessed by the forensic software during an initialisation phase. In another embodiment, the forensic software is configured with pre-built conditions, filters and file types, and a user interface allows the user to select the filtering to be performed. For example a list of possible files, file types or criteria could be presented to the user as a checklist, and the user can review and select (or deselect) which files are considered as the important files for inclusion in the data subset. Additionally a combination of the above methods may be used. For example a configuration file or database may be used to define typical values which can then be overridden via a user interface. The software for performing filtering could be software written specifically for the data reduction process, or it could be written to utilise existing forensic software such as EnCase, X-Ways, Forensic Toolkit, etc. Software could also be written which will run from a boot disk or USB to collect the reduced subset of data in field situations. This can be designed to be undertaken by users with limited training, and the subset data is then forwarded to a forensic practitioner.

[0040] The configuration (config) file may comprise a list of file types, extensions or folder paths, and can be updated with new information and data types. Table 1 lists example file types, extensions and folder paths in a config file according to an embodiment.

TABLE 1

Example Config file listing file types, extensions and folder paths for inclusion or exclusion in reduced dataset.

Selective Image (SI) Files
Full Path find: \home\, destinations, ExplorerStartup, Microsoft Windows\Explorer, \AppleComputer\MobileSync\, prefetch, recent, Recycle Bin, Recycler, thumbcache, tilecache, windows messenger
Name find: bitcoin, ccleaner, com.apple.ical, config, destinations, dogecoin, eraser, iconcache, litecoin, log.1, log.2, log.3, log.4, log.5, log.6, log.7, log.8, log.9, login.keychain, thumbcache, wallet
Name Matches: $Journal, $LogFile, $MFT, $UsnJrnl, +startup, allocation, attributes, catalog, data, DS Store, extents, FAT1, FAT2, INFO2, inode table, journal, MDB, NTUSER.DAT, Primary FAT, README, SAM, Secondary FAT, SECURITY, SOFTWARE, strings, superblock, SYSTEM, tasks.xml, thumbs.db, UsrClass.dat, VBR, volume header, volumeboot
Extension matches: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, bak, bash history, binarycookies, cache, cfg, chk, config, contact, dashboard, dat, data, db, db3, drift, edb, ers, etl, evt, evtx, fav, fkc, history, inf, ini, itdb, ithmb, journal, keep, keychain, little, lnk, localstorage, log, log1, log2, log3, log4, log5, log6, log7, log8, log9, mbdb, met, nfo, nri, ods, old, pf, plist, pis, previous, props, rbk, reg, reg, resource, searchindexcache, sh history, shadow, signed, sqlite, sqlite3, stats, strings, sxml, ticketscore, wer, win, wmdb, wpl, xpi, xsslog

Documents
Extension Matches: csv, doc, docx, dot, dotx, odg, ods, odt, ots, pdf, ppt, pptx, pub, rtf, txt, wri, xls, xlsm, xlsx, myob, prf (MYOB Premier) (review; wps, pnm, pcx, emf, wmf, ove)

Email
Full Path find: mychatlogs, yahoo\messenger\profiles, \myreceivedfiles\mychatlogs, \thunderbird\
Name find: ichat.pictures, showletter, showfolder, xsslog
Name Matches: cached-consensus, cached-descriptors, cached-descriptors.new, state, hostname, logs.cab
Extension matches: eml, emlx, emlxpart, imm, ipd, msg, pst, dst, ost

Internet
Full Path find: my received files, AppData\Roaming\Opera\, \Google\Chrome\UserData\, \AppleComputer\Safari\, \Mozilla\Firefox\Profiles\, temporary internet files
Name find: CACHE OO, ebayisapi
Name Matches: _CACHE_001_, _CACHE_002_, _CACHE_003_, _CACHE_MAP_, bookmarks, bookmarks.adr, config.db, config.dbx, cookies, cookies4.dat, current session, current tabs, data, dcache4.url, dht.dat, dht.dat.old, favicons, filecache.db, filecache.dbx, global_history.dat, history, index.dat, resume.dat, resume.dat.old, rss.dat, rss.dat.old, search_field_history.dat, sessionstore, settings.dat, settings.dat.old, shortcuts, snapshot.db, sync config.db, syncdb, syncdiagnostic.log, synciddb, tasks.xml, TempPassword.$$$, top sites, typed history.xml, vlink4.dat, wand.dat, web data
Extension matches: aol, bookmarks, db, dbx, htm, html, torrent, webbookmark, webhistory, xml, xsl, win, json

Pictures (Review and/or reduce)
Extension matches: bmp, jpg, jpeg, png, gif, psd, tif, tiff, raw, dng, nef

Video (Review and/or thumbnail)
Extension matches: 3g2, 3gp, avi, divx, flv, m2v, m4v, mod, mov, mp4, mpg, mts, vob, wmv, xvid, mpeg, mkv, m2ts, mpo

ZIP (Review)
Extension matches: gho, iso, jbc, rar, vhd, vmdk, vmx, zip, wim, dd, 001, 002, 003, 004, 005, 006, 007, 008, 009, v2i
Full Path find: Bestcrypt, truecrypt

Audio (Review)
Extension matches: amr, m4a, mp3, ogg, wav, wma, aiff, amr, mid, oga

Exclude (Review)
Extension matches: dll, exe, cab, fon, gpd, hip, ico, inf, man, manifest, mum, mui, pnf, ppd, ttf, wmf, qtr, sys, msi, ipa
Full Path find: licence.rtf, HCData.edb, datastore.edb
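By way of illustration only, a config-driven filter of the kind listed in Table 1 might be sketched as follows. The rule semantics are assumptions and are not stated in the specification: "find" is treated as a case-insensitive substring test, "Matches" as an exact case-insensitive match, and the Exclude category is evaluated first. The class and function names are hypothetical, and only a few entries from Table 1 are reproduced.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath
from typing import Optional, Tuple


@dataclass(frozen=True)
class Category:
    """One Table 1 category; all rule values are stored in lower case."""
    name: str
    path_find: Tuple[str, ...] = ()          # substring of the full path
    name_find: Tuple[str, ...] = ()          # substring of the file name
    name_matches: Tuple[str, ...] = ()       # exact file name
    extension_matches: Tuple[str, ...] = ()  # exact extension, no dot

    def matches(self, full_path: str) -> bool:
        path = PureWindowsPath(full_path)
        lower_path = full_path.lower()
        lower_name = path.name.lower()
        ext = path.suffix.lower().lstrip(".")
        return (
            any(s in lower_path for s in self.path_find)
            or any(s in lower_name for s in self.name_find)
            or lower_name in self.name_matches
            or (ext != "" and ext in self.extension_matches)
        )


# Abbreviated rules taken from Table 1. ASSUMPTION: Exclude is checked first.
CATEGORIES = [
    Category("Exclude", extension_matches=("dll", "exe", "sys")),
    Category("Selective Image", path_find=("prefetch",), name_find=("wallet",),
             name_matches=("ntuser.dat", "$mft"),
             extension_matches=("evtx", "lnk", "sqlite")),
    Category("Documents", extension_matches=("doc", "docx", "pdf", "xlsx")),
    Category("Pictures", extension_matches=("jpg", "jpeg", "png")),
    Category("Video", extension_matches=("avi", "mp4", "mov")),
]


def classify(full_path: str) -> Optional[str]:
    """Return the first matching category name, or None if no rule applies."""
    for category in CATEGORIES:
        if category.matches(full_path):
            return category.name
    return None
```

Files for which classify returns None would simply fall outside the subset unless the practitioner adds them during review.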

[0041] A set of optional filtering steps 14 may be performed such as one or more of a folder or drive scan to recover any deleted file entries or folders, a file signature analysis, and a known file check which may include calculating a hash value such as an MD5 or SHA value and comparing this to a database of known hash values for known files. A drive scan is a scan of a drive or media to locate Master File Table ($MFT) remnants of previous files. This can take additional time, so a decision is made based on the time available balanced with need and benefits. The benefits may be that additional files are identified, such as deleted files with some remnants, although often the data is partially overwritten. A signature analysis is a comparison of the header information for each file to determine the file type (eg word doc, pdf, etc) and may include a comparison with the file extension or file type reference information. This scan can take additional time, so a decision is made based on the time available balanced with need and benefits. The benefits may be that additional data is identified, and that files are identified where the file extension is missing or incorrect for the actual data contained within the file (such as .dat files that are sqlite database files). Known file checking is a process to identify files which are identical to known files generated by a third party, and thus do not need to be collected as they have not been changed. This may include operating system files, software files and/or audio and video files such as music and movie files. One approach to performing a known file check is to use a Hash calculation. A Hash calculation is a process of calculating a Hash algorithm value, such as MD5 or SHA1, for each file, and then comparing each value to a database of known Hash values, such as the NSRL database of known operating system and software files, or from other sources such as iTunes, Google Play, etc. This comparison can take additional time, so a decision is made based on the time available balanced with need and benefits. The benefits may be identification of known operating system and software files, which can be excluded and not collected.
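The known file check described above can be sketched with the Python standard library. This is a minimal illustration, not the implementation of any particular forensic tool: the known hash set is assumed to be an in-memory set of lower-case hexadecimal digests (for example, loaded beforehand from a reference source such as the NSRL), and the function names are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def file_digest(path: str, algorithm: str = "sha1",
                chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:  # read-only access to the source file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def split_known_files(paths: Iterable[str], known_hashes: Set[str],
                      algorithm: str = "sha1") -> Tuple[List[str], List[str]]:
    """Return (keep, skip): files to collect and files matching a known hash."""
    keep: List[str] = []
    skip: List[str] = []
    for path in paths:
        if file_digest(path, algorithm) in known_hashes:
            skip.append(path)   # identical to a known third-party file
        else:
            keep.append(path)   # unknown or modified, so collect it
    return keep, skip
```

Because every file must be read in full to compute its digest, the cost of this step grows with the volume of data, which is why the paragraph above treats it as optional.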

[0042] Other optional processes include expanding compressed container files. Container files are files which contain compressed or stored files in a single file, such as ZIP or VMDK. As these files can contain other files, there may be a need to collect particular files from within a container. This would involve opening the container file, selecting files within the container, and including these within the selected files for the subset. This is undertaken if needed, based on a risk managed decision. It may take additional time to mount the container or view the contents. The benefits may be additional data which is identified and selected to be included in the subset, or the selection of the entire ZIP container to include in the subset. Other common forensic processes can also be undertaken, such as partition finding and data carving. This would depend on need in balance with time considerations. As an example, there may be occasions where exhibit media is faulty or has been formatted or deleted. In these cases other forensic processes, such as a partition scan or a data carve process, may need to be undertaken before data can be identified or a subset made. These processes can take time, but may be necessary to identify any data on deleted or faulty media, with the filter process then applied to the recovered data.
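Selecting files from within a container can be illustrated for the ZIP case using the standard zipfile module. The sketch below only lists candidate members and does not extract or modify anything; the function name and the extension-based selection rule are assumptions made for illustration.

```python
import zipfile
from pathlib import PurePosixPath
from typing import Iterable, List


def select_from_zip(zip_path: str, wanted_extensions: Iterable[str]) -> List[str]:
    """List members of a ZIP container whose extension is in the wanted set."""
    wanted = {ext.lower().lstrip(".") for ext in wanted_extensions}
    selected: List[str] = []
    with zipfile.ZipFile(zip_path) as container:  # opened read-only by default
        for info in container.infolist():
            if info.is_dir():
                continue  # directory entries hold no file data
            ext = PurePosixPath(info.filename).suffix.lower().lstrip(".")
            if ext in wanted:
                selected.append(info.filename)
    return selected
```

Other container types mentioned above, such as VMDK, would need to be mounted or parsed with tools appropriate to that format.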

[0043 ] These optional steps are undertaken if time permits as they can take many hours in some cases. Thus if the digital forensic investigation is being performed on-site, or if there are urgent time pressures, then these steps will typically be omitted. Alternatively if the investigation is being performed in a forensic lab environment, then these steps are more likely to be performed, although in most cases the decision will be made by the digital forensics practitioner based upon the specific case requirements. For example if the device is running the Mac OSX operating system compared to a Microsoft Windows OS, there may be greater need to undertake file signature analysis as many files on OSX may not have a file extension from which the file type can be quickly determined, for example to allow quick identification of video files on which to perform data reduction as described herein. Ultimately the decision is made by the practitioner based on investigation requirements with a risk management focus weighing up benefits versus cost. The cost would relate to time and resources, the benefits may be additional data identified and included in the subset collection for processing and review.
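The file signature analysis referred to above can be illustrated with a minimal Python sketch that compares the leading bytes of a file against a small table of well-known signatures. The table is a deliberately tiny illustrative subset and the function names are assumptions for this example; forensic software would use a much larger signature database.

```python
# Illustrative subset of well-known file signatures (magic numbers).
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"SQLite format 3\x00": "sqlite",
}
# Extensions treated as equivalent to a detected type.
ALIASES = {"jpeg": "jpg"}


def detect_type(header):
    """Return the type whose signature starts the header bytes, else None."""
    for magic, file_type in SIGNATURES.items():
        if header.startswith(magic):
            return file_type
    return None


def extension_mismatch(filename, header):
    """True when the header identifies a type that differs from the extension."""
    detected = detect_type(header)
    if detected is None:
        return False
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ALIASES.get(extension, extension) != detected
```

With this sketch a file named cache.dat whose header begins with the SQLite signature is flagged, as is a file with no extension at all, which corresponds to the OSX situation described above. Formats that are themselves ZIP based would also be flagged and would need handling in a fuller table.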

[0044] Processing and conversion of the digital data 4 is then performed, and this is further illustrated in steps 15 and 16. In this embodiment, processing and conversion step 4 further comprises collecting and thumbnailing video files 15, and collecting picture files, processing them to reduce the dimensions to a predefined size, and converting them to JPG (or JPEG) format if required 16. In one embodiment the forensic software is configured to scan through the folders and sub-folders of a hard drive or mounted media and process each video or image file identified.

[0045] Thumbnailing video files comprises generating one or more image files for each video, with each image file comprising a sequence of thumbnails of frames taken at various intervals or sampling points along the video file. That is, each image is a composite image of thumbnails of frames from the video file. In one embodiment, the sampling rate or interval is fixed so the number of image files will increase with the length of the video. In this way a video file can be stored and viewed in a compressed format. This is further illustrated in Figures 2A to 2D.

[0046] A sampling interval or frequency is selected (or a predetermined default value may be used), and video frames are extracted at this selected interval/frequency. The extracted frames are then inserted as thumbnails into one or more image files. The overall size of the image file as well as the layout of thumbnails, such as the dimensions of the matrix or grid, can be predefined (i.e. default values), or the software may allow the user to specify the size and layout (e.g. overriding any default values). In an alternative embodiment the user could specify the number of images per video file, and the matrix and file dimensions could be selected to meet these criteria, generally subject to some constraints such as minimum thumbnail sizes. Typically the image size will be selected to correspond to, or slightly vary from, typical aspect ratios and resolutions of display screens such as 4:3, 16:9, 1024x600, 1024x768, 720x480, 720x576, 1280x720, 1280x1024, 1440x1080, 1600x1200, 1920x1080, etc.

[0047] In one embodiment, each image file is a 1024 pixel wide by 600 pixel high JPEG format file, and is comprised of a 5x5 grid (i.e. 25 frames) of thumbnails, with each thumbnail having a size of 200 x 120 pixels. Additional information may be added to the image to provide information on the source, such as a filename, sampling interval or the time of a thumbnail within the video file. Alternatively this information may be stored in a log file. In the case of sampling every frame (i.e. an interval of 1), this represents a 1/25th reduction in the length of the video and the number without loss of frame information (typically there will be a loss of resolution). However if the sampling interval is increased to 5, or to 1 frame per 0.5 or 1 second, greater reductions in file size can be generated. Additionally basic image processing can be performed on sampled frames, and an image quality check can be used to detect frames which are very dark or blurry, or otherwise undesirable or unnecessary, so that they can be skipped. Alternatively a nearby frame (in the temporal sequence) can be selected instead. The image quality check can also be used to detect unchanging frames, such as video staring at the same unchanging scene, and the unchanging frames can be omitted. Detection of unchanging frames can be performed by comparing pixel change values from adjacent images. The dimensions of the matrix (rows and columns) and image file parameters (length, width, pixel depth, compression format) will typically be predefined (e.g. as default values), although the software will generally allow the user to override these default values. For example the user could specify one image per second of video footage, in which case the size of the grid will be determined by the frame rate of the video, e.g. 20 frames per second (fps) would use a 5x4 grid, 24 fps a 6x4 grid, 30 fps a 6x5 grid, etc.
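The relationship between frame rate, grid dimensions and the number of composite images can be sketched as follows. This is an illustrative Python example under the assumption that the grid is chosen as the most nearly square arrangement that holds exactly the requested number of frames, which reproduces the 5x4, 6x4 and 6x5 examples given above; the function names are hypothetical.

```python
import math


def grid_for_frame_count(frames_per_image):
    """Return (columns, rows) of the most nearly square grid holding that many frames."""
    rows = math.isqrt(frames_per_image)
    while frames_per_image % rows:
        rows -= 1
    return frames_per_image // rows, rows


def composite_image_count(total_frames, sampling_interval, columns, rows):
    """Number of composite images needed when one frame is kept per interval."""
    sampled = math.ceil(total_frames / sampling_interval)
    return math.ceil(sampled / (columns * rows))
```

For example, a 1500 frame video sampled at every frame into 5x5 grids would yield 60 composite images, whereas sampling every 25th frame would yield 3. A prime frame count degenerates to a single row under this rule, so a practical implementation might instead round up to a convenient grid.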

[0048] Figure 2A is a schematic diagram of conversion of a video 21 into a sequence of images 23 each comprised of a grid of thumbnails of frames (or slots) in the video file. As shown in Figure 2A, the video file 21 is comprised of a sequence of frames, and this sequence is divided into distinct portions 22 from which one image file is generated. In one embodiment these portions 22 are contiguous portions, although in other embodiments there could be gaps or they could overlap, e.g. the last frame of one image could be the first frame of the next image to provide some continuity between images. Figure 2B is a schematic diagram of the generation of an image with a 5x5 grid of thumbnails from a sequence of video frames according to an embodiment. In this embodiment, the sequence of frames is broken into contiguous portions or blocks 24, 25, 26, 27, and 28, each of which comprises 5 consecutive frames. The image 29 is generated from a 5x5 grid of frames, with each row comprised of a block. That is, the first block of 5 frames 24 forms the first row of image 29, the second block 25 forms the second row, the third block 26 forms the third row, the fourth block 27 forms the fourth row, and the fifth block 28 forms the fifth row. The image can be selected to have a standard size, e.g. 1024x600. The image 29 thus provides a snapshot summary of a portion of the video file, and multiple images then provide a summary of the video file. This effectively summarises and reduces the size of the video file, and allows an investigator to rapidly scan through the video to detect points of interest.

[0049] Figure 2C is a schematic diagram of generation of an image 30 from a sequence of video frames 21 according to another embodiment. In this embodiment, the image 30 is generated by sampling the sequence of video frames 21 at regular intervals. In this embodiment the image again has a predefined size, such as 1024 pixels wide by 600 pixels high, and is generated from a 5x5 grid of frames obtained by sampling the sequence of video frames every 25 frames. Thus the first row of image 30 comprises the 1st frame 31, the 26th frame 32, the 51st frame 33, the 76th frame 34 and the 101st frame 35. The 126th frame 36 is then used as the first column of the second row. This embodiment samples the sequence of frames 21 at regular intervals. However in other embodiments this could be varied. For example rather than using a regular spacing, a random frame in a portion of contiguous frames could be selected. In one embodiment the sampling interval may be a time interval (e.g. one frame per second, or every 5 seconds), rather than a spatial or sequential (i.e. frame number based) interval. In another embodiment, the selected frame could be analysed to determine certain image quality characteristics, and if the selected frame fails to meet quality requirements (i.e. fails a quality check) then an adjacent or nearby frame that passes the quality requirements may be selected in its place. If no suitable frame can be located then no slot may be selected from the portion. For example the image quality characteristics may require a minimum brightness or brightness range so that dark frames are ignored, or edge detection could be performed to determine if the frame is blurry or in focus. More generally the sequence of frames 21 is divided into portions, and a single frame is selected from each portion.
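The regular sampling of Figure 2C and the quality based substitution of a nearby frame can be sketched in Python as follows. This is an illustrative example only: the function names are hypothetical, the quality check is supplied by the caller as a function, and the substitution searches forward within the portion only, which is a simplification of the "adjacent or nearby frame" selection described above.

```python
def sample_frame_numbers(total_frames, interval, count):
    """1-based frame numbers taken every `interval` frames, at most `count` of them."""
    return list(range(1, total_frames + 1, interval))[:count]


def pick_frame(candidate, portion_end, passes_quality):
    """First frame in [candidate, portion_end] that passes the check, else None."""
    for frame in range(candidate, portion_end + 1):
        if passes_quality(frame):
            return frame
    return None
```

Sampling a 1000 frame video every 25 frames for a 5x5 grid gives frames 1, 26, 51, 76 and 101 for the first row and frame 126 at the start of the second row, matching the example above. A return value of None corresponds to the case where no slot is selected from the portion.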

[0050] Figure 2D is an example of an image 40 comprising a 5x5 grid of thumbnails generated from a sequence of video frames according to an embodiment. In this embodiment the image 40 has a header portion 41 which provides contextual information, such as the file source (file name), sampling information, temporal position within the video, or the date and/or time one or more frames in the video were recorded (according to the values stored in the video file, i.e. the date/time extracted from the file). Below the header is a 5x5 grid 42 of frames 43 obtained by sampling a sequence of video frames. The time of each frame is indicated 44 in the bottom right of each frame. The image has a width 45 of 1024 pixels and a height 46 of 600 pixels so that each frame from the video file has a resolution of approximately 195 x 110 pixels with a 50 pixel high header strip. More generally the user can select parameters such as the size of the output image, the dimensions of the matrix forming the output image, and the sampling interval. In this way a video file can be summarised into a set of 1 or more picture or image files comprised of thumbnails of frames in the video file.

[0051] Implementation can be performed by directly calling image or video processing functions from a video library such as FFmpeg's libavcodec, OpenCV, or language specific libraries, or by calling an external program to process a file. In one embodiment processing of the video files can be performed by calling the movie thumbnailer software "mtn" (http://moviethumbnail.sourceforge.net/). For example the main forensic software could be configured to call "mtn" on each video file, save the images in a specified directory and log the results, such as by executing a system call or open command to execute the program mtn and pipe the output to the controlling program. Alternatively a batch file approach could be used. Table 2 below outlines a batch file approach in which a first batch file is used to send console data (i.e. log information) to a file using the Mtee program (http://www.commandline.co.uk/mtee/, which is a Windows version of the Unix "tee" command), and a second batch file is used to call mtn using a standard set of input values.

TABLE 2

Batch files for generating image files according to one embodiment

Batch File 1: Obtain Source Drive and Destination Folder and log process

echo off
REM Prompt for the source location, exhibit number and output folder
set /P folder="Enter the drive letter or path to scan i.e. 0:\ : "
set /P exhibit="Enter the exhibit number i.e. 14-A12345-1 PC HD P I : "
set /P output="Enter the full path to save the zip file: "
echo The location to search is: %folder%
echo The exhibit number is: %exhibit%
echo The output folder is: %output%
set outputf=C:\tmp\Video
pause
mkdir C:\tmp\Video
echo on
REM Run the thumbnailing batch file and copy all console output to a log file
mtn.bat 2>&1 | mtee /t /d /+ %outputf%\outputlog.txt
echo off
REM Archive the thumbnails and log, then remove the temporary folder
Zip -r "%output%\%exhibit%tn.zip" C:\tmp\Video\
rd /S /Q C:\tmp\Video
echo All done
Pause
echo on

Batch File 2: Create Thumbnails

mtn.exe -c 5 -r 5 -P -g 1 -k 000000 -W -b 0.80 -D 4 -h 120 -O %dest_folder% %source_drive%\*.*

[0052] At step 16, image files are identified and processed to rescale the dimensions to a predefined size. This may involve downscaling or downsampling the image to a predefined or user defined size such as 1024x600, 1024x720, 256x180 (or other common screen resolutions as outlined above), and/or an aspect ratio of 4:3 or 16:9. For example a 4000x3000 pixel image downsized to 800x600 pixels reduces the storage size from 7.24MB to 131kB. The reduced size image may be a thumbnail image. Similarly to summarisation of frames in video files, a summary image of a set of images could be generated. For example the summary image may have a size of 1024x720 and comprise a 4x4 grid of thumbnails each of size 256x180. Additionally images in non JPG (JPEG) formats such as TIF, GIF, PNG, BMP, etc. can be converted into JPG (or JPEG) format, or some other efficient compressed file format (whether lossy or lossless). JPEG format is suitable as it has been designed to efficiently compress photographic images and further provides flexibility in the amount of compression through specifying a quality ratio or compression factor. However in other embodiments, images could be converted into another compressed file format. Similarly to the case for processing video files, image rescaling and conversion can be performed by calling an appropriate image library (e.g. libavcodec, OpenCV or libjpeg) or calling an appropriate image processing program (e.g. ImageMagick, Gimp, etc.) from within the forensic software or via a batch program. A log of the files converted and the filenames and paths may also be stored.
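The target dimensions for a rescaled picture can be computed as in the following Python sketch, which assumes that the aspect ratio is preserved and that images already smaller than the target are left unchanged. The function name is hypothetical and the actual resampling would be delegated to an image library or program such as those named above.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the box, keeping aspect ratio; never enlarge."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions a 4000x3000 image with an 800x600 target becomes 800x600, a 4000x2000 image becomes 800x400, and a 640x480 image is not resized.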

TABLE 3

Batch files for reducing image file dimensions

Batch File 1: Obtain Source Drive and Destination Folder and log process

echo off
REM Prompt for the source location, exhibit number, output folder and picture size
set /P folder="Enter the drive letter or path to scan i.e. 0:\ : "
set /P exhibit="Enter the exhibit number i.e. 14-A12345-1 PC HD P I : "
set /P output="Enter the full path to save the zip file: "
set /P frame="Enter the picture size e.g. 800x600: "
echo The location to search is: %folder%
echo The exhibit number is: %exhibit%
echo The output folder is: %output%
echo The size is %frame%
set outputf=C:\tmp\Pictures
pause
mkdir C:\tmp\Pictures
echo on
REM Run the conversion batch file and copy all console output to a log file
CONVERT.bat 2>&1 | mtee /t /d /+ "%outputf%\PicOutputLog.txt"
echo off
REM Archive the converted pictures and log, then remove the temporary folder
Zip -r "%output%\%exhibit%pics.zip" %outputf%
rd /S /Q C:\tmp\Pictures
echo All done
Pause
echo on

Batch File 2: Reduce image dimensions

FOR /R "%folder%" %%G in (*.jpg *.jpeg *.bmp *.png *.gif) DO (
mkdir "%outputf%\%%~pG"

convert "%%G" -resize %frame%^>

)

[0053] If file recovery, signature analysis or hash calculations (step 14) are performed, these will typically be performed prior to thumbnailing of videos 15 or downsizing and conversion of picture files 16. However the steps could be performed in parallel, with any relevant video or image files identified passed to video or image processing queues. For example a queue of files to be processed could be maintained, and as new video or image files are identified they are added to the queue. In another embodiment, video and audio files could be compared with remote libraries to verify if local files are simply local copies of music or movie files which can be omitted from the archive. This may involve performing basic analysis such as calculating a checksum or extracting other parameters (e.g. file name, codec, bit or frame rate) and then comparing these values with those available from an online source (e.g. iTunes). Files verified as local copies can then be omitted from the reduced dataset.
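The queue based parallel processing described above could be arranged, for instance, as in the following Python sketch. The function name, the use of threads and the sentinel value are assumptions made for illustration; the handler passed in stands in for the video thumbnailing or picture conversion step.

```python
import queue
import threading


def process_queue(paths, handle, worker_count=2):
    """Feed paths to worker threads through a queue; results are in completion order."""
    pending = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            path = pending.get()
            if path is None:  # sentinel: no more files will be queued
                break
            outcome = handle(path)
            with lock:
                results.append((path, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for path in paths:  # files are queued while the workers are already running
        pending.put(path)
    for _ in threads:
        pending.put(None)
    for thread in threads:
        thread.join()
    return results
```

In a fuller implementation the loop that queues paths would be driven by the filtering step, so that files are added as they are identified rather than from a precomputed list.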

[0054] The review step 5 comprises presenting the user with options to view the selected files, as well as the remaining (unselected) files and data on the media, and to allow the user to select additional files to be added to the collection 17. The review step may indicate which files are known files and are not being preserved. For example the user may review picture files, very large files, music files, movie files, etc to select or deselect files to add to the dataset.

[0055] An optional reporting step 6 can be performed which comprises the steps of generating a full file list and metadata information such as dates, times, deleted status, location, size, etc., which is exported to a report in XLSX or CSV format 18, and generating a report which details the drive and partition information 19, such as the number of files, total space used, etc. These may be generated using forensic software such as the EnCase Export function, X-Ways report function, TSK fls listing, FTK Imager directory listing, or using custom software. The reports can then be reviewed or analysed. As an example, questions that can be answered from this report could include the filename contents of a folder, a timeline of file activity, or the data volume of a particular file or folder. These reports would be stored with the data subset files. The storage demand for these reports is quite small, and multiple reports can be merged into a single spreadsheet or database for analysis across devices and across cases, such as producing timeline reports of file activity. This can also be merged with other report types and Internet history for a more complete picture of use.
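A file listing of the kind described above could be produced with custom software along the lines of the following Python sketch. The column names and function names are assumptions for this example, timestamps are rendered in UTC, and only live files visible to the operating system are listed; deleted entries and the fuller metadata shown in Table 4 would come from a forensic tool.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["name", "extension", "path", "size_bytes", "last_accessed", "last_written"]


def _stamp(seconds):
    """Format a POSIX timestamp as dd/mm/yyyy hh:mm in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%d/%m/%Y %H:%M")


def write_file_listing(root, report_path):
    """Walk `root`, write one CSV row per file, and return the number of rows."""
    rows = 0
    with open(report_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for folder, _dirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                info = os.stat(full)
                writer.writerow({
                    "name": name,
                    "extension": os.path.splitext(name)[1].lstrip(".").lower(),
                    "path": full,
                    "size_bytes": info.st_size,
                    "last_accessed": _stamp(info.st_atime),
                    "last_written": _stamp(info.st_mtime),
                })
                rows += 1
    return rows
```

The report path should lie outside the folder being listed so that the report does not appear in its own output.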

TABLE 4

Example of CSV Table output according to an embodiment

Name | File Ext | File Type | File Category | Description | Is Deleted | Last Accessed | File Created | Last Written | Entry Modified
adobeflashcs3.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56
adobephotoshopcs3.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56
googledesktop.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56
microsoftoffice2003.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56
vistasidebar.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56
visualstudio2005.txt | txt | Text | Document | File, Archive | NO | 25/02/2012 13:56 | 13/11/2011 21:50 | 13/11/2011 21:50 | 25/02/2012 13:56

[0056] The preservation step 7 comprises exporting the selected files, thumbnails, pictures and reports to a container file 20, such as a logical evidence container L01, or another container such as AD1, ZIP, or CTR. By focusing on files of importance rather than copying every bit of a hard drive, it is possible to rapidly process and preserve important data whilst substantially reducing the size of data preserved compared with standard digital forensic processes. The process used for undertaking forensic analysis can be used with the smaller subset of data and results in a faster and more efficient review of the information.
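As an illustrative sketch of the preservation step using the ZIP option mentioned above, the following Python example writes selected files into a compressed container and returns a hash manifest that could be stored with the reports. The function name and the choice of SHA-256 for the manifest are assumptions for this example, and each file is read fully into memory for brevity; formats such as L01 would be written by the relevant forensic software rather than by code of this kind.

```python
import hashlib
import os
import zipfile


def export_subset(root, relative_paths, container_path):
    """Write files into a compressed ZIP and return {archive_name: sha256 hex digest}."""
    manifest = {}
    with zipfile.ZipFile(container_path, "w", compression=zipfile.ZIP_DEFLATED) as container:
        for relative in relative_paths:
            with open(os.path.join(root, relative), "rb") as handle:
                data = handle.read()
            archive_name = relative.replace(os.sep, "/")
            container.writestr(archive_name, data)
            manifest[archive_name] = hashlib.sha256(data).hexdigest()
    return manifest
```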

[0057] The information review could consist of analysis of internet browsing history, filename information, a timeline review, Windows Registry analysis, keyword indexing and searching, hash analysis and other common forensic analysis techniques using a range of tools. Typically these activities can be performed much faster, and more easily, on the reduced data subset compared to the much larger full image. For example, by applying analysis processes to the data subset, entity searches can be conducted across a range of database sources external to the media or investigation. This is done to build an intelligence picture of the entities identified within the data subset. The search results would be presented to a practitioner or investigator, together with an ability to select relevant information for reporting purposes. As an example, a particular email address is prominent within a data subset corresponding with the suspect of a case. Searches are conducted across external source data, such as open source and closed source, to gain a greater understanding of the possible person/s involved with the email address. This can be applied to the range of differing entity information extracted in Step 6, i.e. GPS co-ordinate data can be mapped to show the data in context, rather than just the coordinates. The data from other cases can also be searched to determine if entity information is located across cases, i.e. a website URL in one investigation is also present on another case hard drive, with a possible linkage between disparate cases.

[0058] Following preservation of the data, a review step can be performed. The review step presents the data contained within the subset to an investigator or practitioner to enable them to quickly review the data and select that which may be relevant to an investigation. For example image files could be displayed in a thumbnail grid pattern such as a 4x4 grid of 16 images on a screen. Generation of thumbnails, or a summary image of thumbnails, could be performed as part of the processing step 4. The video thumbnails are presented, as are documents, emails, spreadsheets, and other such files, in a native view or as a text view. The benefits of this review are that analysis can be conducted on the information contained within the subset, potentially alleviating a need to examine the full forensic copy. In one embodiment the reporting step 6 is performed after preservation step 7 on the collected and preserved data.

[0059] The review step may further comprise automated analysis. This may comprise automatically searching through the data subset and extracting a range of information, such as names, addresses, Internet history, chat logs, emails, GPS co-ordinate data, Operating System and software log data, file metadata, cloud storage logs, JPG EXIF information, etc. As this information is extracted from the data subsets for reporting and storage, it can be quite fast to process the data when compared to doing this across a full forensic image. The processed data is then included in a database of information, which is available to a practitioner to review, such as in a timeline of events, or an entity chart showing relationship linkages. The process can be undertaken prior to the quick review, depending on the need of an investigation, i.e. whether there is a need to review the data, or whether there is time to process it first. Automated analysis can also be undertaken subsequent to the review, and the extracted information stored in a database with other case data to enable cross case review and analysis. As an example of a quick review process, the data subset information could be loaded into forensic analysis software such as EnCase, Nuix, or X-Ways. This enables a practitioner or investigator to review the files and data in the subset, and select relevant files. The storage demand for the subsets and reports is quite small, and multiple cases can be merged for review purposes.
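One form of the automated extraction described above, namely finding and counting email addresses in text recovered from the subset, can be sketched in Python as follows. The regular expression is a deliberately simple approximation rather than a full address grammar, and the function name is hypothetical.

```python
import re
from collections import Counter

# Simplified pattern: local part, "@", then at least two dot-separated domain labels.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def count_email_addresses(texts):
    """Count case-insensitive occurrences of email addresses across text fragments."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in EMAIL.findall(text))
    return counts
```

The most frequent addresses in the resulting counter would be candidates for the entity searches and cross case comparisons described earlier.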

[0060] A suitable computing system for implementing the method comprises a display device, a processor, a memory and an input device. The memory may comprise instructions to cause the processor to execute a method described herein. The processor, memory and display device may be included in a standard computing device, such as a desktop computer, a portable computing device such as a laptop computer, tablet or smart phone, or they may be included in a customised device or system. The computing device may be a unitary computing or programmable device, or a distributed device comprising several components operatively (or functionally) connected via wired or wireless connections. An embodiment of a computing apparatus 100 is illustrated in Figure 3 and comprises a central processing unit (CPU) 110, a memory 120, a display apparatus 130, and may include an input device 140 such as a keyboard, mouse, touch screen, etc. The CPU 110 comprises an Input/Output Interface 112, an Arithmetic and Logic Unit (ALU) 114 and a Control Unit and Program Counter element 116 which is in communication with input and output devices (e.g. input device 140 and display apparatus 130) through the Input/Output Interface. The Input/Output Interface may comprise a network interface and/or communications module for communicating with an equivalent communications module in another device using a predefined communications protocol (e.g. Bluetooth, Zigbee, IEEE 802.15, IEEE 802.11, TCP/IP, UDP, etc.). A graphical processing unit (GPU) may also be included. The display apparatus may comprise a flat screen display (e.g. LCD, LED, plasma, touch screen, etc.), a projector, CRT, etc. The computing device may comprise a single CPU (core) or multiple CPUs (multiple core), or multiple processors. The computing device may use a parallel processor, a vector processor, or be a distributed computing device. The memory is operatively coupled to the processor(s) and may comprise RAM and ROM components, and may be provided within or external to the device. The memory may be used to store the operating system and additional software modules or instructions. The processor(s) may be configured to load and execute the software modules or instructions stored in the memory.

[0061] In use the computing apparatus is connected to a digital data source 150 such as a seized hard disk via a write blocker module 151 so that the digital data source is effectively mounted in a read-only mode. The write blocker may be a hardware module, a software module or a combined hardware/software module that enables data to be read from the digital data source via a read line (or connection or data pipe) 152, but any attempt to modify or write data to the digital data source such as via a write line (or connection or data pipe) 153 is blocked to preserve the forensic value of the digital data source. Data read from the digital data source is processed in accordance with the method and the results are stored in local memory 120 or remotely connected storage devices (e.g. hard disks). The method can be implemented in software using a variety of programming languages and operating system utilities such as JAVA, .NET, C++, PERL, PYTHON, batch files, PowerShell, shell scripts, etc. The software can directly call image processing libraries and/or operating system commands or external programs. A user interface can be provided to guide a user through various steps and to ensure forensic integrity is maintained.

[0062] An embodiment of the data reduction method was implemented in software and tested on a variety of real and test digital forensic cases, and was found to provide significant reduction in data storage and archive requirements. Using South Australian Police (SAPOL) ECS case files, the data reduction process was applied to a sample of full forensic images (see Table 5). The subsequent size of the reduced dataset files (L01 in Table 5) was then compared with the size of the forensic copy (E01 in Table 5) and the original media volume sizes (HD in Table 5). Across a sample range of 34 cases from financial years 2012 and 2013 (i.e. 1 July 2011 to 30 June 2013) comprising 144 hard drives and other media, the volume of data was able to be reduced to 0.196 percent of total evidence drive volume.

TABLE 5

Data reduction applied to SAPOL ECS cases

Item | Number of drives | Hard disks HD (in GB) | E01 (in GB) | L01 (in GB) | E01:HD ratio | L01:E01 ratio | L01:HD ratio
Smallest | 1 | 40 | 4.5 | 0.0415 | 11% | 0.92% | 0.10%
Largest | 1 | 1000 | 121 | 0.0143 | 12% | 0.12% | 0.01%
Total (all cases) | 212 | 102396.5 | - | - | - | - | -
E01 | 107 | 45388 | 22040.68 | - | 51.1% | - | -
L01 | 144 | 66438.5 | 5197.9 | 62.98 | 55% | 0.423% | 0.196%
E01 & L01 | 37 | 9430 | - | 22 | - | - | 0.233%
Average (across all) | - | 461.4 | 136.79 | 0.44 | 58.7% | 0.705% | 0.196%

[0063] The reduction process was also applied to the forensic disk copies comprising the Digital Corpora (Garfinkel S, Farrell P, Roussev V & Dinolt G 2009. Bringing science to digital forensics with standardized forensic corpora. DFRWS 2009. Montreal, Canada, http://simson.net/clips/academic/2009.DFRWS.Corpora.pdf). The results are listed in Table 6. While these figures differ from the figures from the SAPOL ECS files, this can be explained in that many of the Corpora images are scenarios purposely built on smaller hard disk drives in a test environment, rather than the larger hard drives observed in actual cases.

TABLE 6

Data reduction applied to Garfinkel (2009) digital corpora forensic images

Item | Hard disks HD (in GB) | E01 (in GB) | L01 (in GB) | E01:HD ratio | L01:E01 ratio | L01:HD ratio
2008 m57 Jean | 10 | 2.83 | 0.088 | 28% | 3.11% | 0.88%
4Dell Latitude | 4.5 | 1 | 0.0735 | 22% | 7.35% | 1.63%
charlie-2009-11-12 | 9.5 | 3.02 | 0.185 | 32% | 6.13% | 1.95%
charlie-work-usb-2009-12-11 | 1 | 0.00883 | 0.0047 | 1% | 53.23% | 0.47%
jo-2009-11-12 | 12 | 3.06 | 0.0971 | 26% | 3.17% | 0.81%
jo-2009-12-11-002 | 14.3 | 5.53 | 0.312 | 39% | 5.64% | 2.18%
nps-2009-domexusers | 40 | 4 | 0.084 | 10% | 2.10% | 0.21%
nps-2011-scenario1 | 74.5 | 34.5 | 0.613 | 46% | 1.78% | 0.82%
nps-2011-scenario4 | 232.8 | 18.1 | 0.668 | 8% | 3.69% | 0.29%
pat-2009-12-11 | 12.1 | 2.97 | 0.243 | 25% | 8.18% | 2.01%
terry-2009-12-11-001 | 19.1 | 7 | 0.157 | 37% | 2.24% | 0.82%
tracy-external-2012-07-03-initial | 13.2 | 3.47 | 0.000518 | 26% | 0.01% | 0.00%
tracy-home-2012-07-03-initial | 17.4 | 3.99 | 0.605 | 23% | 15.16% | 3.48%
tracy-home-2012-07-16-final | 17.4 | 3.99 | 0.471 | 23% | 11.80% | 2.71%
Total | 477.80 | 93.47 | 3.60 | 19.56% | 3.85% | 0.75%
Average | 34.13 | 6.68 | 0.26 | 19.57% | 3.89% | 0.76%

[0064] To highlight the figures in the Corpora (see Table 6), it can be seen that in the 'nps-2009-domexusers' case, from a 40GB hard drive, the E01 file is 4GB (10%) and the resulting data subset is an 84MB L01 file (0.21%). The 'nps-2011-scenario1' disk image is of a 74.5GB hard drive and the forensic copy is 34.5GB (46%), with the resulting data subset consisting of a 613MB L01 file (0.82%). By comparison, one of the SAPOL ECS cases comprised 6TB of hard drives, which when imaged comprised 3TB of E01 forensic copies (50%) and reduced to 1.6GB of L01 data subset files (0.03%). Applying the 0.196 percent reduction percentage to the FBI data discussed in the background could theoretically reduce the 20PB of total data to only 4TB as a reduced subset of the data. The potential storage cost savings are quite significant and the ability to search the data would be considerably faster (resulting in more savings).
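The ratio columns of Tables 5 and 6 follow directly from the three sizes recorded for each item, as the following Python sketch shows for the two Corpora examples highlighted above. The function name and dictionary keys are chosen for this example only.

```python
def reduction_ratios(hd_gb, e01_gb, l01_gb):
    """Return the E01:HD, L01:E01 and L01:HD ratios as percentages."""
    return {
        "E01:HD": 100 * e01_gb / hd_gb,
        "L01:E01": 100 * l01_gb / e01_gb,
        "L01:HD": 100 * l01_gb / hd_gb,
    }
```

Calling the function with 40, 4 and 0.084 gives approximately 10%, 2.10% and 0.21%, and with 74.5, 34.5 and 0.613 gives approximately 46%, 1.78% and 0.82%, in agreement with the corresponding rows of Table 6.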

[0065] Also observed were benefits in conducting evidence analysis by initially collecting a reduced subset and conducting a review while waiting for the full forensic image to complete. Results observed included a subset collection only taking 79 seconds to collect the reduced dataset from a 320GB hard drive (Windows 7 Professional), compared with three hours to complete a full forensic copy and another three hours to verify the copy. Further, using forensic software to process and fully index the full forensic copy from the above 320GB hard drive took nearly six hours, whereas the reduced subset only took two minutes 53 seconds to process and index. In relation to the storage requirements, the E01 images comprised 218GB compared with 687MB for the L01 file (0.215% L01:HD).

[0066] In another example, a case comprised multiple computers and storage devices totalling 8.57 TB. Estimations indicated it would take approximately a week of imaging, and a week of processing, to have the data ready for an investigator to review. Using the method described herein to collect relevant files and thumbnail video, the volume of data collected was 12.3 GB in about 2 hours. This was processed in 23 minutes, and was then available for the investigator to review. In particular the video files were approximately 1.3 TB, and were thumbnailed down to 343 MB. This thus represents a significant reduction in both the size of the dataset and the normal processing time. In many cases this time difference can be crucial to rapidly progress an investigation with persons at risk.

[0067] The above methodology has been applied in a number of cases comprising many terabytes of storage media, which previously have taken up to a fortnight to fully forensically image and process before being ready for an investigator to review. When applied to a range of test cases, the methodology demonstrated a reduction of original media volume to 1.08%, and 1.49% with all picture files, resulting in 74% of the information in 11% of the processing time. When applied to a sample of real world cases, the Logical image (L01) files were only 0.206% of the original media volume, collected in 3.78% of the time.

[0068] The methodology also provides additional benefits to the digital forensic investigation process. When applied in a triage manner, the data reduction method enables rapid collection, processing, indexing and searching of subset data to take place, which can quickly highlight devices that contain potential evidential material. Other devices can be then excluded or given a lower priority if there is less chance of evidential data being present. In practice, a reduced data subset could be collected from each item (even if not analysed) and then archived. This would then assist with any future questions that may arise, such as questions from prosecution or legal counsel prior to court proceedings. A further benefit is that an investigator can rapidly produce a report and supply this to investigators or legal counsel without requiring a full forensic image of every item seized. For example, in one case a review of the subset data located information of relevance in the internet history and the registry files (website listings and recent document entries). This discovery highlighted the need to conduct further analysis of the full forensic disk image. Had there been no information found in the review, the drive would still have been fully examined, but would have been undertaken subsequent to other items of a higher priority in the investigation. Also as the review process can be configured to provide a listing of potentially relevant files to be added to the dataset, files useful to intelligence purposes can be identified early, such as internet history of the user, and specific documents authored by the user such as a resume detailing the person's work history and experience, or an address book with a list of contacts. The method also has benefits when time on site is limited or an item cannot be seized or could only be seized with a further court order. In these cases the reduction process can be used to rapidly identify evidence of potential value, maximising the available time on site and assisting in making a decision to seize equipment or not.

[0069] The ability to index forensic data prior to analysis has been available for many years. However, with the ever-growing size of data, the time to index the data is also growing, so an examiner has to wait longer until the indexing is complete. The process of indexing by its very nature does not fully index every character or word and hence, searches undertaken across an index can potentially miss important evidence when compared with a full text search. By indexing a data subset, rather than the entire forensic image, there will be potential time savings in relation to processing and indexing. For example, as outlined above, indexing a full forensic copy of a 320GB hard drive took nearly six hours, whereas indexing the reduced dataset took only two minutes 53 seconds, roughly a 120-fold reduction in time.

[0070] Further, the method can also be applied to a range of devices including tablets, phones and cloud storage. For example, in relation to mobile phones or tablet computers the reduction method is configured to only save call-related data, internet history, email and other software data files, with large files such as pictures and video not saved within the reduced subset (although a full extract collection would typically be first undertaken for evidential analysis purposes). Cloud storage provides users with an ability to store large amounts of data in remotely accessible storage locations. This can cause issues for an examiner in relation to identifying the data, collecting the data and analysing the data. A review of a data subset can potentially identify cloud stored data faster than waiting for a full forensic image to complete and process (indexing, metadata extraction and other processes). There are a range of issues relating to the collection of data from cloud storage including legal issues, the time to access and preserve the data, and undertaking analysis of the preserved data. Collecting a data subset from cloud storage has potential time and storage size savings. This can be achieved by only collecting the data with potential to provide evidence, rather than collecting every byte of data stored remotely. Conducting a review of a subset of data will also be faster than undertaking a review of a full forensic copy.

[0071] Long-term storage of the reduced subset of data also provides further benefit to the investigators. The data subset files can also be stored with other data subset files; for example, in a structured manner in folders and sub-folders as per the work request number, by financial year, case number allocation, exhibit number or device information. As the reduced data subsets are vastly smaller than full forensic images, it is possible to store a considerable number of subset logical containers in a comparatively small storage space. The resulting subset files can then be reviewed for relevant information. For example if questions arise from investigators, prosecutors and counsel (which can often be many months after the analysis is finalised), it can be beneficial to be able to access the case subset data, such as registry files or internet history, to promptly answer questions relating to user accounts, recent documents, or browsing history, without having to fully reimage or reprocess physical evidence to enable analysis of the information.

[0072] Further, the reduction process allows data subsets of case and device data examined by a law enforcement or government agency to be stored on relatively small hard drives or network storage. This provides the ability to search data quite rapidly or analyse data over a range of cases, which is currently cost and time prohibitive using full forensic images. There are potential intelligence and evidential benefits in relation to an understanding of historical cases, such as the use of a particular URL across historical investigations, or matching illicit file hash values among disparate and historical cases, potentially providing valuable intelligence. An example is loading multiple mobile phone subset datasets (without pictures or videos) into visualisation software to locate links between disparate devices and cases.

[0073] Additionally, performing analysis across different datasets can assist in identifying which files to include and which to exclude to provide the most valuable information. Once files with the greatest potential are identified, this can be incorporated into future filtering processes (e.g. by modifying config files) as well as existing datasets. For example, existing datasets can be modified by purging low-value files, and new high-value files can be added by reprocessing full images if available. Similarly, researching trends over time can assist to provide information to investigators as part of focusing investigations to locate evidence earlier. For example, research of historical case data may highlight a trend showing the increased use of specific internet chat software among specific criminal offenders and as such, future investigations can first look for these data remnants rather than examining data from software that has declining use.
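The matching of file hash values among disparate and historical cases, mentioned in paragraph [0072], could be approached along the following lines. This is an illustrative sketch only and not the implementation used by the inventors: the function name `cross_case_matches`, the case identifiers and the shortened hash strings are all hypothetical, and each stored subset is assumed to have already yielded a set of file hash values.

```python
from collections import defaultdict


def cross_case_matches(case_hashes):
    """Return hash values that occur in more than one case.

    case_hashes maps a case identifier to the set of file hash values
    found in that case's reduced subset. The result maps each shared
    hash (lower-cased) to a sorted list of the cases containing it.
    """
    seen = defaultdict(set)
    for case_id, hashes in case_hashes.items():
        for value in hashes:
            # Normalise case so "BB22" and "bb22" are treated as equal.
            seen[value.lower()].add(case_id)
    return {value: sorted(cases) for value, cases in seen.items() if len(cases) > 1}


# Hypothetical archive of three stored subsets (hash values shortened for clarity).
archive = {
    "case-001": {"aa11", "bb22"},
    "case-002": {"BB22", "cc33"},
    "case-003": {"dd44"},
}
matches = cross_case_matches(archive)
print(matches)
```

The same structure would apply to other shared artefacts, such as a particular URL appearing in the internet history of several cases, by substituting those values for the hash strings.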

[0074] The data reduction methodology addresses a range of areas, including: forensic triage, rapid review, intelligence analysis, presentation, and archival needs. This methodology considers the type of data to be collected, stored, and reviewed, with a focus on data which will provide the maximum information for minimal size (volume and time). The methodology consists of forensically accessing the target media, filtering and selecting key files, processing and compressing files, and in particular video and image files, reviewing and selecting any other files determined to hold possible relevant information, discarding overwritten files, and then exporting the selected files as a logical evidence file. When applied to a variety of real digital forensic cases, the method has provided a significant reduction in data storage and archive requirements, with the volume of data able to be reduced to 0.196 percent of total evidence drive volume. Additionally this method can be performed in parallel with taking a full forensic bit-for-bit copy. In this way relevant data can be much more quickly and easily identified, and can also be more easily analysed.
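The filter, select and export stages summarised in paragraph [0074] can be illustrated with a simplified sketch. Several assumptions are made that are not part of the specification: a ZIP archive stands in for the logical evidence (L01) container, SHA-256 stands in for whichever hash a known-file check would use, the extension list in `KEY_EXTENSIONS` is hypothetical, and the video and image conversion, review and overwritten-file stages are omitted. The function names are likewise illustrative.

```python
import hashlib
import zipfile
from pathlib import Path

# Hypothetical selection list; a real deployment would load this from a config file.
KEY_EXTENSIONS = {".doc", ".docx", ".pdf", ".jpg", ".dat"}
# Hypothetical set of hashes of known third-party files to leave out.
KNOWN_HASHES = set()


def sha256_of(path):
    """Return the SHA-256 hex digest of a file (read whole; fine for a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def select_files(root, extensions=KEY_EXTENSIONS, known=KNOWN_HASHES):
    """Walk root and keep files with a key extension that are not known files."""
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in extensions:
            continue  # not a file type of interest
        if sha256_of(path) in known:
            continue  # identical to a known third-party file
        selected.append(path)
    return selected


def export_subset(root, container, extensions=KEY_EXTENSIONS, known=KNOWN_HASHES):
    """Write the selected files to a compressed ZIP and return their relative names."""
    root = Path(root)
    files = select_files(root, extensions, known)
    with zipfile.ZipFile(container, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.relative_to(root).as_posix())
    return [path.relative_to(root).as_posix() for path in files]
```

A call such as `export_subset("mounted_image", "subset.zip")` would write the selected files to `subset.zip`, where both paths are placeholders. The container should sit outside the source tree, and in practice the source would be accessed read-only through a write blocker or forensic mount.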

[0075] Further the method can be applied in a triage manner to enable rapid collection, processing, indexing and searching of subset data to take place, which can quickly highlight devices that contain potential evidential material. The method also provides the capability to conduct a review of stored data subsets for intelligence analysis, research, archival and historical review purposes. The reduced file sizes provide dramatic reductions in the time to process and further analyse data subsets and gain knowledge and potential evidence from digital forensic data.

[0076] Those of skill in the art would understand that information and signals may be represented using any of a variety of technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

[0077] Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software or instructions, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

[0078] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. For a hardware implementation, processing may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof. Software modules, also known as computer programs, computer codes, or instructions, may contain a number of source code or object code segments or instructions, and may reside in any computer readable medium such as a RAM memory, flash memory, ROM memory, EPROM memory, registers, hard disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, or any other form of computer readable medium. In some aspects the computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). In addition, for other aspects computer-readable media may comprise transitory computer-readable media (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media. In another aspect, the computer readable medium may be integral to the processor. The processor and the computer readable medium may reside in an ASIC or related device. The software codes may be stored in a memory unit and the processor may be configured to execute them. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.

[0079] Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by a computing device. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a compact disc (CD) or floppy disk, etc.), such that a computing device can obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device can be utilized.

[0080] In one form the invention may comprise a computer program product for performing the method or operations presented herein. For example, such a computer program product may comprise a computer (or processor) readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. For certain aspects, the computer program product may include packaging material.

[0081] The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.

[0082] As used herein, the term "determining" encompasses a wide variety of actions. For example, "determining" may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, "determining" may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, "determining" may include resolving, selecting, choosing, establishing and the like.

[0083] Throughout the specification and the claims that follow, unless the context requires otherwise, the words "comprise" and "include" and variations such as "comprising" and "including" will be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.

[0084] The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement of any form of suggestion that such prior art forms part of the common general knowledge.

[0085] It will be appreciated by those skilled in the art that the invention is not restricted in its use to the particular application described. Neither is the present invention restricted in its preferred embodiment with regard to the particular elements and/or features described or depicted herein. It will be appreciated that the invention is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope of the invention as set forth and defined by the following claims.

Claims

1. A data reduction method for digital forensic data, the method comprising:

forensically accessing a data source;

converting one or more video files in the first set of files to one or more composite image files, each composite image file comprising a plurality of frames sampled from one of the one or more video files;

converting one or more image files to a standard format and size, and

updating the first set of files with the converted files; and

exporting the first set of files to a compressed container format.

2. The method as claimed in claim 1, wherein each composite image file comprises a plurality of frames sampled from the video file at a predefined sampling interval or frequency.

3. The method as claimed in claim 2, wherein the predefined sampling frequency is every frame.

4. The method as claimed in claim 1, wherein the video file is divided into contiguous portions, and sampling comprises selecting a frame from each contiguous portion.

5. The method as claimed in claim 1, further comprising processing each sampled frame and omitting frames which fail an image quality check.

6. The method as claimed in claim 1, wherein each composite image file comprises a plurality of thumbnail images arranged according to a predefined layout, and each thumbnail image has a predefined size and represents a frame of the video file.

7. The method as claimed in claim 4, wherein each composite image further comprises an information portion comprising one or more items of contextual information.

8. The method as claimed in claim 1, wherein filtering the file in the data source further comprises performing a known file check on a file, and omitting a file from the first set of files if it is determined to be identical to a known file generated by a third party.

9. The method as claimed in claim 8, where the known file check comprises performing a Hash calculation and comparing the Hash calculation with a known Hash value for the file.

10. The method as claimed in claim 1 further comprising reviewing the processed first set of files, and allowing a user to add or omit files from the first set of files.

11. The method as claimed in claim 1, wherein the filtering step further comprising one or more of recovering deleted files and folders, performing a file signature analysis, and/or expanding compressed container files.

12. The method as claimed in claim 1 further comprising generating a report on all files.

13. The method as claimed in claim 12 wherein reporting comprises providing a list of all files and a report on hard disk drive and partition information.

14. A computer readable medium comprising instructions for causing a computer to perform the method of any one of claims 1 to 13.

15. A computing apparatus comprising a communications interface, a memory and at least one processor, wherein the at least one processor is configured to perform the method of any one of claims 1 to 13.