WO2015166052A1

WO2015166052A1 - Data acquistion

Info

Publication number: WO2015166052A1
Application number: PCT/EP2015/059519
Authority: WO
Inventors: Nicholas PRINGLE
Original assignee: Usw Commercial Services Ltd
Priority date: 2014-04-30
Filing date: 2015-04-30
Publication date: 2015-11-05
Also published as: GB201407605D0

Abstract

The present invention relates to forensic data acquisition and analysis. There is presented a method for copying data from an original resource to a plurality of target resources. Data representing a directory of the original resource is read. Data of the original resource is then prioritised for copying based on the first data representing the directory. At least some of the data of the original resource is then copied, based on the prioritisation, to the plurality of target resources. Also presented is a method of for distributed processing of data so copied. In this processing method, the copied data is received at a first network node, and then distributed to one or more second network nodes. The data is then processed at the one or more second network nodes.

Description

DATA AQUISITION

Technical Field

The present invention relates generally to data acquisition, and more specifically to forensic data acquisition and analysis.

Background

Digital forensics, relating to the acquisition, analysis, and reporting of data used by or stored in digital devices, has existed for several decades. Of the applications of digital forensics, of particular importance is its use in criminal investigations to recover, for example, objective evidence of criminal activity, and to therefore support, for example, prosecution in a trial. In this example, admissibility of digital evidence concerns, amongst other things, its integrity and authenticity. Integrity of digital evidence can turn on the fact that the process of acquiring does not modify the evidence in any way. For example, the act of searching for and opening a file on an operating system may change important attributes of the file, and hence the integrity of any evidence associated with these attributes may be lost. Authenticity of digital evidence concerns the chain of custody from, for example, crime scene to trial, and turns on an ability to demonstrate the integrity of the evidence throughout the acquisition, analysis and reporting stages.

In order to ensure integrity and authenticity of the digital evidence recovered from a digital device, it is customary in the first instance to take a snap-shot of the storage media, for example a computer hard disk or other memory. This commonly involves making an exact forensic duplicate or 'image' of the source disk, including all metadata, for example, information regarding the physical location of data on the source disk (i.e. volume and sector information of clusters comprising files on a hard disk), whilst ensuring no writing to the source disk occurs in the process. Importantly, this forensic copy includes a copy and metadata of the unallocated space of the disk, which may still contain important information relating to deleted files.

In the infancy of digital forensics, disk imaging could only be implemented using specialized programming tools (for example UNIX utility disk duplicator (dd)), but has since enjoyed the introduction of dedicated devices which can, once connected to a source hard disk for example, automatically create a full disk image.

A further notable change in digital forensics over recent years has been the extraordinary increase in storage volume available to the average computer user. For example, in 2000, one may have expected the average hard disk to contain around 50 million sectors (where a sector traditionally contains 512 bytes, a byte being unit of digital information commonly consisting of eight bits, a bit commonly taking the form of a "0" or "1"). More recently, however, the number of sectors in a hard disk may be around 7 billion (i.e. a few terabytes). Although processing speeds and reading/writing speeds have increased along with this increase in storage volume, whereas to produce a forensic image of the average computer hard drive in 2000 may have taken, say, one or two hours, today, it may take as long as 30 or 40 hours to produce an image of an average hard drive.

In many countries, there are laws governing the length of time for which a suspect can be detained without charge, for example, currently in Britain this is typically 48 hours. It may well be the case, for example, that in order to charge a suspect, digital evidence is required from an associated hard drive, but since a forensic copy must be made before inspection of the files within the hard drive, the time before evidence is made available for use in charging may be, for example, 40 hours. There is therefore a clear existing need, and expected growing need in the future as hard drives become larger still, to be able to account for both the need to quickly produce a forensic copy of a hard-drive admissible in a court of law, and, for example, the need to quickly determine the contents of files of interest required to charge suspects.

US2009287647 discloses a method and apparatus for detection of data in a data store. This disclosure relates to initially copying the physical memory of a source disk to a forensic hard drive, followed by the building of predefined and customised reports based on document attributes and filters thereof. The actual data of the files of the report can then be examined by purchasing a command block form a control server. Whilst this disclosure allows the user to choose files of interest to be inspected, this is still subsequent to the forensic copying of the disk, and hence does not account for the need to quickly determine the contents of files of interest mentioned above. US2007168455 discloses a forensics tool for examination and recovery of computer data. This disclosure relates to software that examines a physical drive selected by a user for analysis, shows the user what files would be available on a target disk (or disks) for copying, asks the user to choose which files they want to copy based on limited data such as file size, and based on filtering by, for example, file extension, and copies those chosen files after a corresponding command block has been purchased by the user. Whilst this disclosure may contribute to quickly determining the contents of files of interest, it does not account for the need to quickly produce a forensic image of the hard drive mentioned above.

It is desirable to quickly produce a forensic copy of a disk and to quickly determine the contents of files of interest.

Summary

According to a first aspect of the present invention, there is provided a method for copying data from an original resource to a plurality of target resources, the method comprising: reading, from the original resource, first data representing a directory of the original resource; prioritising, based on the first data representing the directory, second data of the original resource for copying; and copying, based on the prioritising, at least some of the second data of the original resource to the plurality of target resources.

According to a second aspect of the present invention, there is provided an apparatus for copying data from an original resource to a plurality of target resources, the apparatus comprising: means for reading, from the original resource, first data representing a directory of the original resource; means for prioritising, based on the first data representing the directory, second data of the original resource for copying; and means for copying, based on the prioritising, at least some of the second data of the original resource to the plurality of target resources.

Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings. Brief Description of the Drawings

Figure 1 illustrates schematically an example arrangement;

Figure 2 illustrates schematically an example process;

Figure 3 illustrates schematically a conventional workflow;

Figure 4 illustrates schematically and example workflow;

Figure 5 illustrates schematically an example arrangement;

Figure 6 illustrates an exemplary dataflow;

Figure 7 illustrates some data;

Figure 8 illustrates schematically an example arrangement;

Figure 9 illustrates schematically an example process;

Figure 10 illustrates schematically a conventional network;

Figure 11 illustrates schematically an example file format;

Figure 12 illustrates schematically an example network;

Figure 13 illustrates schematically an example network;

Figure 14 illustrates schematically an example dataflow;

Figure 15 illustrates schematically an example workflow;

Figure 16 illustrates an example arrangement; and

Figure 17 illustrates an example arrangement. Detailed Description

Jigsaw Imaging

Figure 1 is a schematic illustration of an exemplary system 100 in which a data transfer device (DTD) 102 (also referred to herein as "Jigsaw imaging hardware") according to the present invention can be implemented.

The system 100 comprises source storage medium (source SM) 104, data transfer device (DTD) 102, image target storage medium (target SM) 106, and a 'Digital Evidence Bag' (DEB) storage medium 108. The DTD 102 is communicatively connected to the source SM 104, and to each of the target SM 106 and DEB SM 108, such that data may be transferred there between. The connections may be by wires, for example, via a USB cable or the like, or wirelessly, for example via radio communications or the like.

Any of the source SM 104, target SM 106 and DEB SM 108 may be any computer readable storage medium, for example, a hard disk drive, or a solid state drive, for example a USB flash drive, or the like.

The source SM 104 stores data, for example data which may be considered as potential evidence for a criminal investigation. The data stored on the source SM 104 may comprise a directory 110, files 112, and unallocated space 114.

The DTD 102 reads data from the source SM 104, and writes data to the target SM 106 and DEB SM 108.

The DTD 102 writes data to the target SM so as to create a complete (forensic) disk image 116 of the source SM 104 on the target SM 106. This forensic disk image 116 may then be used, for example, in a court setting as assurance of the exact contents of the source SM 104 at the time the copy was made. The DTD 102 simultaneously writes data to the DEB SM 108 as 'Digital Evidence Bags' (DEBs) 118 which contain only a portion of the data stored on the source SM 104. These DEBs may contain only those files from the source SM 104 determined to be most likely to contain useful evidence, and so be the subject of analysis by, for example, an investigation team.

The target SM 106 may be a completely cleaned storage medium, and initially may not contain any data at all, including any partition table, partition, or file system within a partition. The target SM 106 is typically larger than the source SM 104.

The DEB SM 108 may incorporate a file system upon which Digital Evidence Bag (DEB) 118 files can be written as ordinary files. Each DEB 118 may comprise a plurality of files or a multitude of DEBs 118 may each contain one file or there may be a combination of multiple files in multiple DEBs 118. Since the DEB SM 108 may only be transferred a portion of the data stored on the source SM 104, the DEB 108 may be smaller than the source SM 104.

The source SM 104 may store and organise data according to the partitioning and formatting of a given file system utilised thereon. Examples of file systems include "New Technology File System" (NTFS) developed by Microsoft Corporation, and 'Hierarchical File System +" (HFS+) developed by Apple Inc.

The file system of the source SM 104 may utilize a directory 110 to keep a record of the data (i.e. files 112) stored within the file system. The directory 110 may store file names of the files 112 in association with an index of a table of contents, for example a so called 'index node' or 'inode'. Such indices represent the disk block locations of the files 112 with which they are associated. The directory 110 may also store attributes of the files 112, for example the size of a file (which may be measured in a number of blocks associated with the file) and other metadata. Other metadata may comprise, for example, the owner of the file, the access permissions associated with the file, manipulation metadata indicating a log of changes or access to the file and other important information associated with the file. Whilst the directory 110 comprises location and other metadata about the files 112 of the file system, it does not contain the files 112 itself.

The directory 110 may be stored in a hidden file of the file system of the source

SM 104, and may be located at a known physical location of the source SM 104. For example, in NTFS, the first 32 files are system files. Of these files, the directory 110 is held in a hidden file called "$MFT", which is always located starting at the first cluster of the partition of the disk constituting the file system (i.e. MFT record 0). Further, '$Bitmap' (MFT record 6) contains a representation of the clusters that are in use and those that are currently unallocated to files.

The DTD 102 transfers data from the source SM 104 simultaneously to the both the target SM 106 and DEB SM 108. In this context, 'simultaneously' includes 'near simultaneously', for example by multiplexing the data at a cluster level. For example, such near simultaneous transfer may comprise write events of the form "Write Cluster nl to target SM 106", "Write Cluster nl to DEB SM 108", "Write Cluster n2 to target SM 106", "Custer n2 to DEB SM 108" and so on.

The DTD 102 is arranged to control the copying of data (i.e. directory 110 and/or files 112) stored on the source SM 104 onto the target SM 106 and DEB SM 108.

As explained in more detail below with reference to figure 2, in embodiments of the present invention, in a data transfer process, first the directory 110 of the source SM 104 is read by the DTD 102 and written to the target SM 106 and the DEB SM 108.

As the directory 110 is read by the DTD 102, the DTD 102 can build up in array 120 (for example implemented on Random access memory (RAM) at the DTD 102) a list of priority files that are likely to be a source of evidence, for example for a criminal investigation, based on the metadata of the files 112 contained in the directory 110. This can be based on the file extension, or the last modification date of the file, or the file size, or any other metadata contained in directory 110. The DTD 102 may then transfer the files from the source SM 104 to the target SM 106 and DEB SM 108 in an order according to that defined in the priority list, and according to paths defined by the location of file data as recorded in the file-system directory 110 of the source SM 104 stored in the array 120.

More specifically, the DTD 102 extracts data in the order according that defined in the priority list from the source SM 104 and writes it to the target SM 106 so as to build up a complete (forensic) image (including directory 110 and all files 112 and all unallocated space) of the source SM 104 at the target SM 106. Each block/cluster is read just once from the source SM 104 and written just once to the target SM 106. As a result, an image of the source SM 104 can be written directly to the target SM 106.

Simultaneously, the DTD 102 writes the extracted data to the DEB SM 108 as

DEBs so as to populate the DEB file system stored thereon, i.e. as each block/cluster is read from the source SM 104, it is written to the target SM 106 while simultaneously creating DEBs of selected files which are sent to the DEB SM 108.

During the data transfer process therefore, the DEB SM 108 may be quickly populated with files that are likely to be a source of evidence. The DEB SM 108 may therefore be disconnected from the DTD 102 and the files stored thereon analysed, for example by forensic investigators, after only a relatively short time, whilst the target SM 108 remains connected to the DTD 102 so as to complete the formation of a complete image of the source SM 104 thereon (necessary for assurance of the evidence, for example later in a court setting).

Figure 2 illustrates an exemplary process at the DTD 102 by which data may be copied by the DTD 102 in system 100 of figure 1.

The process begins at S202, where the device ID and structural data is read from the source SM 104. This data is not part of the contents of the source SM 104 as such and so need not be written to the target SM 106, but may be written to a 'directory metadata DEB' file of the DEB SM 108.

The process then moves onto S204, where the directory 110 (and partition table) is read, interpreted, and stored as an array 120 in a memory (e.g. RAM) of the DTD 102. This stored directory may then serve as a guide to the DTD 102 for the rest of the data transfer process. As each block of data that contains directory/partition information is read, it is written to the target SM 106 and is written to the DEB SM 108 so as to be appended to the 'directory metadata DEB file' of the DEB SM 108. This is read and write may be done just once.

During the reading of the directory data by the DTD 102, or after the directory 110 has been stored in array 120, the directory 110 data may be used to determine and select files that may be of a high likelihood of containing relevant evidence (i.e. high priority files). Priority files may be subdivided into different levels of priority, for example, tier 1 may contain files that are most likely to contain evidence and hence are to be transferred first in the data transfer process, tier 2 may contain files that are second most likely to contain evidence and hence are to be transferred second in the data transfer process, and so on.

As explained in more detail below, priority files may be determined according to selection criteria, which may be predetermined, or may be determined at the time of data transfer, for example, by an operative of the DTD 102. The files may be selected, for example, according to a regular expression acting on the file name or according to dates associated with the files or the like.

Once the directory 110 is written to the target SM 104 and the DEB SM 106, the process moves onto step 206, where files 112 (and unallocated space) are transferred in an order according to the priority list stored in array 120. More specifically, a cluster by cluster copy of a file is transferred to the target SM 106 at the same time a DEB 118 is built on the DEB SM 108. Each cluster that makes up these files is read only once from the source SM 104 and written twice, once to the target SM 106 and once to the DEB SM 108. The DEBs 118 contain not only the file data, but metadata that defines exactly the origin of the data on the original media. Meta data is taken in a comprehensive manner such that the source SM 104 could be completely and identically reconstructed from the totality of the DEBs created during the process (i.e. if it were to run long enough such that all files 112 and unallocated space 114 were transferred into DEBs 118).

Once all the files determined as being at a certain priority level, e.g. 'high priority', have been transferred, the process may move on to step 208, where DTD 102 may automatically stop writing data to the DEB SM 108, and produce an indication that the DEB SM 108 may be disconnected from DTD 102. Of course, during this step, the DTD 102 will continue transferring the rest of the data (i.e. lower priority files and the unallocated space) from the source SM 104 to the target SM 106 so as to create a complete disk image of the source SM 104 on the target SM 106. Upon indication that the DEB SM 108 may be disconnected from DTD 102, a user may disconnect DEB SM 108 and analyse the files stored therein, for example, to identify evidence contained therein.

Once a complete disk image of the source SM 104 has been created on the target

SM 106, the process moves onto S210, where the DTD 102 indicates that the disk imaging onto the target SM 106 is complete, and that the target SM 106 may be disconnected from the DTD 102. Upon indication that the target SM 108 may be disconnected from DTD 102, a user may disconnect target SM 108 and, for example, store it safely so that it may be used as evidence of the contents of the source SM 104 at that time, for example in a court setting.

Although in the above examples, the data was transferred to only two storage media, the data may be transferred to more than two storage media. For example, in addition to the target SM 106, there may be two or more DEB SM 108 connected to the DTD 102. In this case, the DTD 102 may be arranged to transfer only files of a first tier of priority to a first DEB SM 108 before indicating that this first DEB SM 108 may be disconnected. The DTD 102 may then continue to transfer files of a second tier of priority to the second DEB SM 108 before indicating that it may be disconnected. In such a way, investigative analysis may begin on the highest priority files first, and investigative analysis may begin on files of secondary priority at a slightly later time, but considerably sooner than as compared to waiting for the entire disk image to be transferred.

Although in the above examples, data is transferred to storage media, this need not necessarily be the case. As described in more detail below, DEBs may be transferred one by one, for example, as they are written, to a distributed storage and processing system where each DEB can be analysed on the fly. This allows near instantaneous processing files, in the order according to the priority list, at the same time as a forensic image of the source SM 104 is being made.

The data transfer technique of the DTD 102, also referred to herein as 'Jigsaw imaging', will now be described in more detail with reference to specific embodiments of the present invention.

Figures 3 and 4 illustrate the generic sequence of an investigative process according to some conventional models (figure 3) as compared to that of embodiments of the present invention (figure 4).

Figure 3 illustrates how conventional models of the investigative process, comprising the steps of identification of data 302, acquisition of the data 304, ingestion onto an analysis system and automatic processing of the data 306, manual processing of the data 308, and presentation of the results of the analysis 310, typically follow a linear sequence. A consequence of this linear processing is that, at least in a theoretical sense, the start of the next stage is delayed until the previous stage is complete.

As illustrated in figure 4, embodiments of the present invention optimize the steps listed for figure 3 by ensuring that each successive stage starts as soon as it possibly can after the commencement of the previous stage.

The conventional best practice approach to forensic imaging of storage media is referred to as linear imaging. This technique originated in the UNIX utility dd which does a byte for byte copy of an input device to an output device. This takes no account of any files-system formatting on the media.

In embodiments of the present invention, in the technique referred to herein as "Jigsaw Imaging" instead of reading sequentially through the media, the data is tracked through paths defined by the location of file data as recorded in the file-system directory.

In this specific example, the source SM 104 comprises a NTFS file-system. The first 32 files in NTFS are system files. The key file-system meta-data is contained within a system file called $MFT which is MFT record 0. Another note- worthy file is $Bitmap, MFT record 6, which contains a representation of which cluster are in use and which are currently unallocated to files. The file system is used to control the creation of the copied image.

Jigsaw imaging has two types of deliverables. Firstly, an imaged copy or copies of the original source SM 104. This is not in the form of a file stored on a file system, as is the case with EWF (Enhanced write filter) or FTK (Forensic tool kit) but is like the output of dd to a device. Secondly, Digital Evidence Bags or containers (DEB) 118 containing evidential data. This could be a multitude of files forming one DEB or a multitude of DEBS each containing one file or a combination of multiple files in multiple DEBS, or a combination of the pervious arrangements.

In this description, the generic term 'Digital Evidence Bag' (DEB) is used. This can be any one of a variety of formats suitable for storing one of more files, for example AFF (Advanced Forensic Format) or DEB (Digital Evidence Bag) file format. A Digital Evidence Bag as generically referred to herein may also be a Digital Evidence Container. 'Digital Evidence Bag' may also refer to any suitable 'Submission Information Package' (SIP), for example as defined in ISO 14721 (from International Organisation for Standardisation). In one example, a digital evidence bag may comprise any combination of: (a) case specific metadata, for example, any of an evidence reference identifier, a location identifier and a timestamp; (b) data source metadata regarding the original source SM 104; and (c) evidential data, for example, one or more files 112 from the source SM 104. In one example, the case specific metadata, the data source metadata, and the evidential data may each be stored as a separate file in the digital evidence bag.

In embodiments of the invention, Jigsaw imaging uses linear imaging approaches but directs these by interpreting the file-system and using this information to direct the focus of linear copying at the high value areas of the media. First, the disk meta-data is accessed, then the partition data and then, in turn, the directory data for each partition. This forms the border of the file-system into which the gaps can be filled with data from the files. After this, the files themselves are accessed.

As explained in more detail below, a triage approach may be used to focus investigations to a broad subset of files on the media under investigation. For example, for some types of crime, the Internet History may have top priority followed by documents, email, messenger etc. Several methods of selecting file as potential evidence can be used. These include by using the file extension, by file name, by date, by keyword or signature.

The directory of the source SM 104 is read, certain files can be selected to be accessed as a priority because of their likelihood of holding evidence. Jigsaw imaging reads these files and adds them to the evidence DEB(s). Then, all the other files, considered to be of lesser potential value are read, and finally the unallocated space is read to complete the imaging of the partition.

An important feature of Jigsaw imaging is that no matter what the stage, as each block/cluster is read from the source, it is written to the target drive while simultaneously creating DEBs of selected files. Each block/cluster is read just once from the source and written just once to the target drive. The nature of this process means that the output cannot be to a file on a file system as is the case with EWF and FTK, but an image written directly to a storage device.

In this example embodiment, the Jigsaw imaging hardware 102 (i.e. DTD 102) is configured as in Figure 5.

The source evidence is a storage media device 104 (i.e. source SM 104), it could be a hard disk or a memory device like a USB stick. The imaging process has at least two outputs. A target device 106 (i.e. target SM 106), which is larger than the source, and a DEB file storage device 108 (i.e. DEB SM 108). Additional targets and DEB storage may be added as required. Additionally the DEB File storage could be replaced with a link to a network storage facility.

Figure 6, provides an overview of the process according to this example embodiment.

The target drive 106 is completely cleaned storage media and needs to be wiped clean prior to imaging. It does not contain any data at all, not even a partition table, partition or file-system within a partition. The DEB storage media 108 has a file-system upon which DEB files can be written as Ordinary' files.

First, at S602, the device ID and structural data is read from the source device 104. This is not actually part of the media contents and so is not written to the target device 106 but will be written, as part of the header section, to a "directory metadata DEB" file.

Then, at 604, the partition table is read and interpreted as a guide for the rest of the imaging process. As each block of data that contains partition information is read, it is written to the target drive 106 and is also appended to the directory metadata DEB file as part of the header section. This is read and write may be done just once.

At step S606, Jigsaw imaging reads the $MFT file, which always starts in cluster 0, and creates an array of data representing the file structure of the partition. It is noted that even large disks rarely have $MFT files greater than 500MB. As with the partition table data, as each block/cluster of data is read, it is written to the corresponding cluster on the target drive and is also appended to the directory metadata DEB file as part of the header section. The $ Volume MFT Record is read to obtain the volume label and version information which is written to the Directory DEB. The $MFT file contains, not only, entries for 'visible' files which can be seen by the user but also a set of up to 32 'system' files which hold information about the file-system's meta-data that includes, for example, security schemas, unused cluster availability. These files are copied as described below.

In Jigsaw imaging, the array of file/directory data created is used to select files that conform to a selection criteria decided by the acquirer. Currently these can be selected by a regular expression acting on the file name or a selection based on the MAC dates of the file. The files may be prioritised according to criteria as described below. The recording of potentially high value target files may be done as the directory data is read.

At step S608, Jigsaw imaging proceeds by processing each of the selected files. This may be done according to an order of priority established as described above. A cluster by cluster copy of the file is taken to the target drive at the same time a digital evidence bag is built on the DEB storage media 108. Each cluster that makes up these files is read only once from the source and written twice, once to the target 106 and once to the DEB 108. The DEBs 112 contain not only the file data, but meta data that defined exactly the origin of the data on the original media 104. Meta data is taken in a comprehensive manner such that the original disk 104 could be completely and identically reconstructed from the totality of the DEBs created during the imaging process.

There is an option to lookup these checksums against a database of Known File Fingerprints. As the files are read, the SHA1/MD5 is calculated. Known 'Good' files can be discarded at this stage and not included in the primary DEB collection.

At the end of this stage 608 all of the evidence selected as primary will have been collected in DEBs. Figure 5 showed that several storage devices could be used for DEBs. If one was allocated to storing the DEBs produced by stage 4 (S608), it could now be removed and the transport and ingestion to the analysis facility could begin leaving the remainder of the imaging process to continue. More so, if the DEBs were stored on shareable media, they could be ingested and processed as soon as they are created. This could proceed for example in combination with the system as described in more detail below.

After the primary evidence files are copied, in S610, the remaining files are imaged. As each of these is read cluster by cluster and copied to the target drive, the SHA1/MD5 checksum is calculated. This stage includes copying the remaining NTFS 'system' files that occupy inodes less than 32. As with the previous files, these are read cluster by cluster and written cluster by cluster to the target media. As the clusters pass through, a SHA1/MD5 is calculated. If selected, the result can be tested against a database of SHA1 checksums of files known to be 'bad'. At the end of this stage, provided the option was selected, all of the evidence selected as 'bad' from its KFF will have been collected in DEBs. In a similar way to 'DEB exit point 1 ', this data can be processed as the rest of the imaging continues.

The NTFS system file, $volume, contains a bit-mapping of cluster usage. In stage SI 12, it is used to copy, sequentially cluster by cluster, all the remaining data from the original to the target. Jigsaw imaging has an option to write these to DEB(s). Some conventional carving software only supports carving of whole image files. Jigsaw imaging can provide a series of DEBs that contain the data from unallocated areas of the files system. The advantage of this is that it can be used across a distributed system to allow parallel processing of carving without special programming techniques.

Simply, carving involves reading a stream of data and attempting to detect strings of codes that indicate data that can be recovered in some intelligible form. There are three primary factors that govern the success of carving. Firstly, Fragmentation - carving can only recover whole files when they are stored in contiguous clusters and have not been overwritten in part. It can recover part files consisting of a series of contiguous cluster whose run is interrupted when clusters are overwritten. Secondly, Analysis complexity - the more complex the regular express or search technique, the slower the process. For different searches, often multiple passes are needed. Thirdly, the read speed of the storage media 104 will dictate whether this is a disk I/O bound, where the disk cannot keep up with the need to feed the processor, or processor bound operation, where the processor leaves the media I/O idling.

In some embodiments, Jigsaw imaging provides a facility, if required, to write out the unallocated data into DEBs which can be distributed across a distributed processing cluster. Massively parallel carving can then be done applying many processors to analyse the data read from multiple storage media.

The unallocated space DEBs can be limited in size. A natural break is when a 'run' ends when it is interrupted when a cluster is marked as 'in use'. Another is when the user specifies a maximum files size. The default is 500MB but can be changed. Typically, 40% of all files on a system running a Windows OS are less than 10MB. In a 500MB file with 4k clusters, there are 125,000 possible start positions for a file, which must be cluster aligned. A 10MB file occupies 2,500 clusters. This means that a randomly placed 10MB will be truncated only if it starts in any one of the last 2500 clusters. A chance of 2,500/125,000 or 1 :50. A 1GB DEB would reduce this chance to 2,500/250,000 = 1 : 100, and so on.

At this stage all the unallocated DEBs will be written. In a similar way to 'DEB exit point 1 & 2', this data can be processed as the rest of the imaging continues.

Stages 3 - 6 (i.e. S606 to S612) are repeated for each partition found in stage 2 (i.e. S604).

The process then moves onto S614, where linear imaging is used to copy any gaps left between the partitions as recorded in the partition table. Finally, in S616, the HPA/DCO (Host Protected Areas/Device Configuration Overlay) areas are copied cluster by cluster to the target using linear imaging.

An evaluation of a specific implementation of Jigsaw imaging as compared to linear imaging is now provided. The evaluation is based on: overall time to complete the entire image, time to deliver actionable data, resilience to interruption, recovery from read error.

Assuming the linear read speed remains constant, the speed degradation with Jigsaw imaging comes from two actions. Random access seek times when moving between the starts of data runs, e.g. at the start of reading a file and similar seek actions when reading a file that is fragmented.

In the former it is observed that it is not uncommon for disks to now contain, perhaps, 250,000 files, which means 250,000 seeks to the starts of these files. A 2TB NTFS file system consists of 500,000,000 4kb clusters. The data transfer rate from a hard disk is a complex matter and consists of several components. The most important of these is the random seek time, which is typically 9ms for these domestic devices. Even if the 250,000 random seeks are 1000 times slower than the sequential reads the overall impact is to add less than 5% to the overall time. For a 2TB disk, with 500,000,000 clusters: Linear imaging - 500,000,000 x [9 / 1000] compares with 499,750,000 x [9 / 1000] + 250,000 x [9] for Jigsaw imaging.

In the later, of seek times when reading a fragmented file, two observations may be made. Firstly, the new huge disks are often only sparsely filled, and so when files are written they do not need to weave between used clusters. Secondly, when they are filled it is often with write once, read after, files like video. In addition, Windows 8 now runs defragmentation as a scheduled event as a standard configuration item. This overhead may therefore only have a relatively small impact.

According to the above, Jigsaw imaging delivers actionable data very quickly. For example, from initiating imaging, the first evidence could be delivered as a single DEBs within less than a minute. Subsequent DEBs containing high valued data then rolls off as fast as it can be written to the collecting device.

On the issues of thoroughness, resilience to interruption and recovery from error, Jigsaw imaging has the same efficacy as its peers because it could be viewed as a linear imaging applied in an erratic, controlled, manner. Forensic Evidence Data Processing Triage

Now will be described an example method for prioritising files for transfer for use in the embodiments of the invention described herein.

Some embodiments of the present invention utilise a triage methodology based on statistical data gathered on file sizes and their contents. Preselecting subsets of evidence can be used to reduce the overwhelming quantity of data that may be contained on a source SM 104 that is to be analysed by an investigator.

Using conventional linear imaging, it may take many hours to read the data from a source SM 104, and may take perhaps many days, or even weeks, to process it. There are however, different sequences according to which data may be processed. Four examples are presented below.

1) By inode number: Filenames, and their associated details, are stored in a file- system's database (in NTFS, Ext and HFS) as a series of records, one per directory entry. These records are filled chronologically as files are added to the file-system. This means that, at least initially, as the operating system is the first thing written to the file- system, the operating system files will occupy the earlier records. Most likely, next the application programs are stored. These therefore occupy the next series of records and the user files are last to be created and added. When a file is deleted, the record it once occupied is marked as available and can be reused when a new file is created. As the first files stored on a disk are those that comprise the operating system and application programs, there is a tendency for user files, which are the most likely source of evidence, to be placed towards the end of the file-system inode list. If the data is processed in inode sequence order then the processing of the evidence rich data is delayed until last. This does offer a strong argument for a reverse inode sequence processing.

2) Alphabetically by filename: The file system could be passed through in alphabetical order. On windows XP the user files are held under /Documents and Settings and so these would be processed before /Program Files and /Windows. On Windows 7 system this would put /Users before /Windows. In Linux, /etc comes before /home. 3) Completely randomly.

4) By some form of priority: Specific directories or specific files could be dealt with first. For example, a search sequence of may consist of:

/Users/User 1 /Documents

/Users/Userl/Pictures

/Users/User2/Pictures

Index.dat

*0Pg

*.pst

This approach is likely to be most efficient if the sequence is appropriate for the specific investigation. It may also be sensitive not just to the existence of a file but the implications of placing it in a processing queue ahead of other files. For example, it may not be beneficial to process a 25GB video file, which may take 5 minutes, ahead of a single JPG that would take 0.1 second to process.

Examples of specific investigations may include: Personal Hi-Tech: where an individual engages in a crime which uses digital technology as its methodology, for example hacking. Personal General: where an individual engages in 'traditional' crimes like fraud or counterfeiting but actions them on a digital device. Corporate Hi-Tech: This is typically the 'insider threat' of staff that have partial access and exceed that to gain unauthorised access into corporate systems. Corporate General: As with personal General but in a corporate environment.

As shown in figure 7, different file types may be set in advance by, for example, forensic experts, as having a high likelihood of being of relevance in a given class of investigation.

Examples of file types that may be chosen as relevant for a given investigation may include: Documents, word processing, spreadsheets; Engineering drawing; Graphic; Voice file; Video file; Internet history; Email; Messenger; Registry files; Event log; Executable files; Accounting data files; Source code files; Link file; Printer spool file; Thumbs. db. Such categories as shown in figure 7 may comprise collections of relatively few file types. For example, email can be covered by 3 file types with the extensions .pst .mbx .ost . Prioritising of files may achieved, for example, with reference to a Bayesian Scorecard, where a low score may indicate a low likelihood of the file containing relevant evidence, and a high score may indicate a high likelihood of the file containing relevant evidence. Such Bayesian Scorecard may contain scores according to the following parameter:

Files by Extension (type): 1 - 9, representing the likelihood of evidence being present. For example, JPG images are likely to be good sources of evidence whereas .pg5 "Guitar pro 5 Tablature" files are not.

Size of the file: in MB

· KFF: is this known as Ok' in the NIST database, in which case -1, or 'bad' in the CEOP database in which case 1.

• Last Modified Date: Many crimes can be reduced to having been taken place within a certain time period. This would most likely be an array of values for discrete periods of time, with 9 representing periods of high interest and 1 representing periods of low interest. Continuously accessed files like logs could not be classified in this way and so could be allocated 9 as being always of interest or 1 as never.

• Location in the file system: JPGs in "My Documents" are 'normal' and so could be set at 1, however the same file in "/Windows/System32" is abnormal and so could be set at 7.

Prioritising may also be based on other indexes, for example whether the files fall within a given category of files, for example as per figure 7. This allows the investigator to manually tweak the priority to align with their expectations of the location of evidence for the specific case. Prioritising may also be based on the processing rate for the given type of data to be processed and the given equipment.

According to the above, an index number for each file can be created to indicate its likelihood of containing relevant evidence. For example:

Priority Index =

( Kl * Category Value )

+ ( K2 * Ext Value)

+ ( K3 * Size / Processing Rate)

+ ( K4 * KFF index)

+ ( K5 * Date index) + ( K6 * Location normality)

Where KN, N=l-6 are adjustable constants. Note that the KFF index would only apply if the file was read in its entirety. However, this may not be appropriate, and the KFF index may, for example be omitted.

The constants (Kl ..K5) may be adjusted according to feedback from real cases.

If K3 is used it may effect the third term so that it forces either inclusion or exclusion of the file in the priority list. For example, K3 = 1000.

In some example embodiments, the files may be prioritised according to the application queue policies, for example based on parameters such as time between arrivals to the queue, the size of the jobs, and the number of server for the node. Examples of queue policies that may be applied include First in First Out; Processor sharing; Priority; Shortest job Firsts; Pre-emptive Shortest Job First; Shortest Remaining Processing Time. Distributed Forensic cluster: 'FCluster'

In some of the example embodiments described above, data is transferred by the DTD 102 to storage media, but this need not necessarily be the case. As described in more detail below, data, including DEBs 112, may be transferred to a distributed storage and processing system, also referred to herein as "FCluster" where each DEB 112 can be analysed, for example as it becomes available. This allows near instantaneous processing of files, in the order according to the priority list, at the same time as a forensic image of the source SM 104 is made. As described in more detail below, in specific embodiments, assurance of the integrity of the data is maintained at each stage of data acquisition, ingestion, distribution, storage, and processing.

Figure 8 illustrates a system 800 in which exemplary embodiments of the present invention may be implemented.

System 800 comprises, similarly to as in figure 1, source SM 104 and DTD 102 which are communicatively connected, either via a wired or wireless connection, such that DTD 102 may read data from the source SM 104.

System 800 comprises network 810 (also referred to herein as 'FCluster system' 810), which may receive data from DTD 102 and pass the data to components of the network 810 such that the data may be stored and/or processed by components of the network 810. Any one of the components of the network 810 may be communicatively connected to any other one of the components of the network 810. Network 810 may be, for example, a computer network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or the like. The components of network 810 may be connected, for example, using a Virtual Private Network (VPN). Data transferred between any of the components of network 810 may be encrypted.

In figure 8, the components of network 810 comprise 'FCluster' server 802 (e.g. a computer on which 'FCluster' as described herein is implemented, for example as FCluster software 806 installed on server 802); target server 804 which comprises target data base (DB) 106; and storage and processing network nodes (NN) 808a, 808b, and 808c.

In this embodiment, DTD 102 may transfer data from source SM 104 simultaneously to target server 804 and to FCluster server 802. This may be done in a similar way to how data was transferred by DTD 102 from source SM 104 to target SM 106 and DEB SM 108 as described with reference to figures 1 and 2 above. In other words, in this embodiment, DTD 102 transfers data to target server 804 such that a forensic image of source SM 104 may be built up at target DB 106, and simultaneously transfers data (e.g. as Digital Evidence Bags DEBs 112) to FCluster server 802.

Figure 9 illustrates an exemplary process carried out in system 800 according to an embodiment of the present invention.

The process begins at step S902, where, similarly to as described above with reference to figures 1 and 2, the directory of the source SM 104 is first read by the DTD 102, stored into a local memory, and transferred to FCluster server 802 and target server 804. In other words, the data is acquired. On or after reading the directory of source SM 104, a priority list of files to be transferred may be built up at the DTD 102 according to metadata contained in the directory.

Next, in S904, data (including files and/or unallocated space) may be transferred by the DTD 102 from source SM 104 to target server 804 and FCluster server 802 in an order according to the priority list. In other words, the data is ingested into FCluster system 810. Next, in S906, on receiving the DEBs from DTD 102, the FCluster server 802, may distribute the DEBs 112 for storage and/or processing at one or more network nodes 808a, 808b, 808c. In such a way, the DEBs 112 may be processed by FCluster system 810 almost as soon as they are received from DTD 102. A given DEB 112 may be stored at more than one network node 808a, 808b, 808c during processing so as to provide, for example, redundancy and secondary load balancing.

When a DEB 112 is received at a given network node 808a, 808b, 808c, a defined list of tasks is invoked and automatic processes are conducted, for example text indexing if the contents comprises text, or thumb-nailing if the contents comprise images.

FCluster server 802 may not distribute a DEB for processing, for example due to load balancing.

In some embodiments, FClsuter server 802, may, at any time, call target server 804 to provide it with DEBs 112 on demand based on the data stored at the target DB 106. These DEBs may then be distributed to network nodes 808a, 808b, 808c for processing.

In any case, next, at S908, once a given node 808a, 808b, or 808c has finished processing of a given DEB 112 that has been distributed to it, it may send the results of the processing back to FCluster server 802, which in turn may send the results to be stored in a database (not shown).

Finally, at S910, the results may be accessed, visualised and/or reported on-the- fly at reporting node 812.

In such a way, analysis and reporting relating to files from the source SM 104 can be achieved (i) almost as soon as the data is read from the source SM 104, (ii) with those files with the highest likely hood of containing relevant evidence coming first, and (iii) in parallel with the creation of a complete (forensic) image that can be used at a later date.

Further, as described in more detail below, embodiments of the invention allow for complete assurance of the integrity of the data at the acquisition, ingestion, distribution, storage, and processing stages. The distributed processing of data in the FCluster system 810, and the assurance of the integrity of the data being processed, will now be described in more detail with reference to specific embodiments of the present invention.

For data processing in forensic investigations, there has been a resistance to the idea of using an architecture where the data is moved and stored on a multitude of workstations for processing because of a lack of adequate control over data stored in a distributed manner.

Embodiments of the present invention provide for a middleware distributed processing solution, referred to herein as 'FCluster' system 810, which provides assurance for the integrity of data required to be acceptable in a legal submission.

Examples of conventional investigative tools for digital media forensics are 'Forensic Tool Kit' (FTK) and 'EnCase' forensic software. Using such tools, the risk of 'mixing up data' between the evidence media and the host computer is negligible, i.e. there is a negligible risk of data from another image could be introduced because there is no mechanism, other than operator error working on the wrong image, for this to happen. Using such tools, provided the investigator is trained to use these applications as they were intended, the system is inherently assured. Further, in these tools, there is no write ability, such that the data under investigation may be protected.

These conventional systems are based on the principle of always presenting evidence originating 'from the image', and have more than a decade of acceptance and precedence. In such conventional systems, it is the administrative system built around the computer system that provides assurance with existing tools that store or process these images. For example, imaging tools usually make an MD5/SHA1 checksum of either sections of data or the whole media. When the investigator copies the image onto the laboratory storage facility they should run a program to create a new MD5/SHA1 checksum to confirm the data is unchanged from the originally captured evidence item. When this agrees with the original from acquisition they can continue. There are a number of conventional imaging programs used with varying assurance. For example 'disk duplicator' (dd) has no internal check-summing facility, both Expert Witness Format (EnCase) and Smart use file structures within their images to checksum every block, typically 64KB. The integrity of the data is assured because it is seen as one complete, wholesome entity and is internally consistent. Some conventional systems incorporate distributed processing with centralised storage as opposed to truly distributed processing working with distributed storage.

An exemplary conventional distributed processing architecture 1000 that relies on a central, non-distributed, store of forensic images is illustrated in figure 10.

Architecture 1000 comprises a file server 1004 which stores the images, a network switch 1006, and workstations 1002a, 1002b, 1002c, and 1002d, on which data may be processed. The file server 1004 and each of the workstations 1002a, 1002b, 1002c and 1002d are communicatively connected via network switch 1006. Having such a distributed processing architecture that relies on a central, non-distributed, store of forensic images, (e.g. as illustrated in figure 10) implies that the data has to be distributed to the processing nodes (1002a, 1002b, 1002c, 1002d) before it can be subjected to processing. This is the architecture in some conventional forensic tools that support 'distributed' processing.

However, processing times with the topology illustrated in figure 10 is dependent on the connection between the switch 1006 and the file server 1004 which rapidly becomes overloaded and limits scalability. This could be mitigated against by building a storage facility 1004 based on fast SSD storage (450MB/s), SATA III (600MB/s) interfaces and even 10Gb (1000MB/s) Ethernet networking but this can be prohibitively expensive. Even this has limited capabilities in scaling out to even tens of processing hosts 1002a, 1002b, 1002c, 1002d. Even in this case it may still take many hours just to read an image off a source storage media 104.

Further, simultaneous analysis of several images held on the same storage facility 1004 would have a significant impact on data dispersal time and so overall processing time.

Other conventional data processing and storage systems exist where both storage and processing are distributed. For example, Google™ 's distributed data storage and processing model called Hadoop/Map Reduce. However, such systems lack the levels of assurance required in processing data for presentation as evidence in legal proceedings.

Embodiments of the present invention adopt a truly distributed storage and processing approach but built on a foundation of an assurance system rather than amend the existing systems. In example embodiments of the present invention, as described above, the data of the source SM 104 is split into DEBs 112 by the DTD 102, which are fed to the FCluster server 802 of FCluster system 810 for distributed storage and processing. However, in splitting the data of source SM 104 into DEBs 112, the inherent integrity of the Oneness' of the data is lost.

In exemplary embodiments of the present invention, information assurance while processing DEBs 112 in the FCluster system 810 may be provided using some of the following methods:

By a Property of an object: making and testing Checksums, Check digits, size, Control totals

• By the Position/Location of the object: the fact that a file is in a certain location further enhances our faith that it is the correct one.

• By Loops of Authority and Acknowledgement: only accepting data from a device that was authorised to provide it.

· By Access control, restricting and allowing.

• By Separation of process: having the same functionality provided by more than one program and clearly separating stages by function.

By Audit trail: requiring independent sequential stamp, indelible records, recording with an authority.

· By Checklist: testing to see if previous checks have been completed and recording them in a table.

In some embodiments, FCluster 806 is a middleware for conducting forensic processing with high assurance in a cluster environment. In some embodiments, it is a vehicle upon which application programs can be run; it is not an application program.

More specifically, in some embodiments, FCluster 806 is a peer-to-peer middleware for a network of heterogeneous host computers 808a, 808b, 808c. For example, it may be built on Ubuntu Linux, Windows and MacOS, and may use relational database management systems such as MySQL, File Transfer Protocol (FTP) servers, multiprotocol file transfer libraries such as 'libcurl', and read- write drivers such as NTFS-3G.

The specific embodiments described below are described with reference to NTFS file-systems and DEBs 112 with NTFS file system format. However, other file systems may be used instead, and DEBs 112 may be instead formatted according to the specific file system with which they are being used.

Figure 11 shows an FCluster file DEB 112 comprising two parts, according to an exemplary embodiment of the invention. An extensive header section contains XML delimited meta-data about the file's place on the original evidence media (i.e source SM 104). This includes data from the file's entry in the NTFS $MTF and also a list of cluster numbers the file originally occupied on the source file-system together with a SFiAl for each of the clusters. The data section holds the file data which is firstly encrypted using AES-256 with the key sent from FCluster and then UUencoded to reduce problems in portability.

The DEBs 112 themselves are named in a regular manner, for example, [VolumeID]-[SHAl].meta. When the DEB 112 is finally unpacked, decoded and decrypted on the FCluster server 802 the resulting file must have the same SHA1 as its file name suggests and is included within the header section of the DEB. To achieve this it must have been generated on the imaging device authorised by the key created by the FCluster software 806 or it will not decrypt when it is ingested into the FCluster system 810. These form two assurances, one of a property of the file, the name and the 'double entry' of the success of the encryption/decryption key.

In an exemplary embodiment, the FCluster system 810 comprises of 4 sub- systems: acquisition, ingestion, distribution and processing.

Although in figure 8, FCluster server 802, target server 804, and nodes 808a, 808b, and 808c were described as having separate functions, this need not necessarily be the case. In some embodiments, each host in the cluster (804, 802, 808a, 808b 808c) may perform the function of any other host as described above. In some embodiments, each host in the cluster (804, 802, 808a, 808b 808c) may perform all of the FCluster system functions (for example all of the functions (i) - (viii) listed below). In other embodiments, each host may only be allocated three or four functions.

The functions that may be performed by each, some or all of the hosts in the cluster (804, 802, 808a, 808b 808c) may comprise: (i) an Acquisition Authority that creates the cryptographic keys used to authorise imaging; (ii) an Imager that creates the directory meta data DEBs 112, file data DEBs 112 and Image files; (iii) FClusterfs file- system metadata storage, for example, a multi-featured File System in User Space (FUSE) file system based around an SQL database; (iv) a DEB 112 Ingestor that locates expected new evidence and triggers ingestion; (v) a Load Balancer that chooses which storage/processing host should hold the primary copy of the data based on its workload; (vi) a replicator that makes sure there are enough copies of the DEBs 112 to ensure redundancy and also verification that the data is still valid; (vii) a Data Storage server that holds the data; and (viii) Processing for carrying out processing functions (which may be, in most cases, combined with the storage role).

An FClusters system 810 according to exemplary embodiments of the present invention provides assurance at every stage of the process in such a way that the next stage cannot commence if the previous assurance is not satisfied. This assurance functionality may utilise a "File-system in User Space" (FUSE) file-system. More specifically, embodiments of the present invention may utilise a new file system (for example a new Hadoop Distributed File System (HDFS)) implemented as a middleware on top of the native file-system used by the operating system of the computers on which it is run.

Embodiments of the present invention may utilise and merge several existing FUSE file system to form a new file system. This new file system will be referred to herein as FClustersfs. The existing file systems which may be merged to form the FClsuterfs comprise MySQLfs, curlFTPfs, ecryptfs, and Loggedfs.

FClusterfs may be based, for example, on MySQLfs. MySQLfs employs an

SQL database consisting of 3 tables to completely replace the native file system. The 'inodes' table provides storage for file metadata like names, dates/times, size, access rights etc. usually seen as a 'directory'. The 'tree' table stores the hierarchical structure of folders and filenames found in the file-system. The 3rd table 'data blocks' stores the actual data as a series of BLOBs replacing the clusters of the disk format.

In an embodiment, FClusterfs joins together the tree and inodes tables found in MySQLfs. This is possible because, unlike MySQLfs, FClusterfs is read-only and we never need to manipulate directories. This table is 'write once, read-only after' as is described in more detail below. In addition, a table 'meta-data' may be added to store the meta-data from the original location of the data. This is a variable length, large text field and so is better in a table of its own. A single FClusterfs database may store many file-systems. There may be a table, Volumelnformation, which contains a record of each file-system stored within the inodes table. A field 'VolumelD' may be added to inodes to identify which file-system the entry relates to.

The functionality of the 'data b locks' table in MySQLfs may be substituted with the ability to read data stored on remote servers. Connection to remote servers may be achieved using the ftp protocol. This may be advantageous for use with curlFTPfs, specifically, curlFTPfs allows for a connection to an ftp server to be mounted and appear to be part of the host's file system. Although the following description relates to communication using FTP, this not necessarily be the case. Indeed, curlFTPfs is based in the libcurl library and can support not only ftp but SSH, SFTP, HTTP, HTTPS.

Conventionally, curlFTPfs only allows one ftp server per mount. In embodiments of the present invention however, this may be enhanced to be able to access to individual files on any ftp server on a file by file basis. The corresponding server details are stored in the file's record in the 'inodes' table. Each file is held in its entirety on the ftp server. The entire file may be transferred and held in cache in memory. In curlftpfs, 128MB chunks are transferred just once and, if the file is over 128MB, a mosaic is built in cache in memory.

Although FClusterfs does allow data to be transported across the Ethernet network, it is also a means of standardising access to data held locally on its own, as well as remote, ftp server.

Figure 12 shows a schematic diagram of illustrating the differing ways in which differing FClusterfs mounts can reference files stored at different FTP servers. In figure 12, Host A comprises FTP server function that stores File2 and File3, and Host B comprises FTP server function that stores Filel . Both Host A and Host B comprise a Fclusterfs mount that references Filel, File2, and File3. Host A may obtain Filel from the FTP server of Host B over the network, however, Host A may obtain File2 and File3 from its own FTP server function via 127.0.0.1, the localhost loopback connection. Conversely, Host B may obtain File 2 and File 3 from the FTP server of Host A over the network, however, Host B may obtain Filel from its own FTP server function via 127.0.0.1, the localhost loopback connection. In some embodiments, FCluster is peer to peer and so any node can mount a directory that can reference files on any server.

Figure 13 shows an example of a inode table 1304 that may be held at a computer 1302 acting as an FClusterfs MySQL server, for a peer-to-peer FCluster system 1300 comprising servers: 'ftp://server-a', 'ftp://server-b', 'ftp://server-c', and 'ftp://server-d'. A MySQL database, for example stored at FClusterfs MySQL server 1302, may store one or more file system directory data sets. FTP servers (e.g. 'ftp://server-a', 'ftp://server-b', 'ftp://server-c', and 'ftp://server-d') allow access to files stored locally on the host running the FTP server. Each entry in a MySQL database can point to a different ftp server that holds the data for that entry.

Although the computers in figure 13 are shown as each having a specific function, each host computer may host a MySQL database, an FTP server, both or neither. For example, MySQL provides replication and synchronisation as a built in feature, so it is possible to have identical duplicate databases on different host computers. This may be useful, for example, for load balancing and to improve response times.

In some exemplary embodiments, data held on the network of ftp servers may be encrypted and may use techniques from ecryptfs to decrypt data on-the-fly. For example, after data leaves the ftp server media, it passes across the network and is decrypted in the user's host before being held in cached space in RAM in their Virtual File System.

As mentioned above, FCluster is read-only. There is no code to provide functions like write / delete / chown / chmod. This is a fundamental requirement of a forensic system and, fortuitously, greatly simplifies the code.

In some exemplary embodiments, FCluster has auditing which it draws from

Loggedfs. In some embodiments, only significant actions like DEB movement, unpacking and the opening of data- files for processing are logged. However, recording access to parts of a file may not be logged, as it may not be necessary. Audit records may be stored, for example, in a table 'audit' recording date/times, users.

In this embodiment, although the data location url information is available to the user eg ftp://myserver.com/ the username and password need to login to the ftp server and gain access the data is not. It is held in another table 'serveraccessinfo' and is retrieved on-the-fly during a read request by FClusterfs. Users can only access evidence via the MySQLfs file-system which provides data via the local ftp servers.

In some exemplary embodiments, the FCluster system 810 may process local data by the host of the ftp server holding each of the files. The location, url, of the ftp server hosting the data is part of the 'inodes' table extending the fields used by MySQLfs and so the 'locality' of the file can trigger the processing task to be initiated within the host.

The behaviour of an FClusterfs file system may be defined as it is mounted by a command line which contains the following entries:

mysqlfs

~mysql_user=me

~mysql_password=mypassword

-mysql_host=25.63.133.244

~mysql_database=fclusterfs

-volume=74a8f0f627cc0dc6

~audituser='Investigator Name'

/home/user/Desktop/ fsmount

Multiple file systems can be mounted on the user's host system and multiple

SQL servers can provide storage for FClusterfs file-system databases.

Job submission to the cluster is via the 'task' table in FClusterfs which can be populated with jobs and the hosts running each of the data servers, because they are the local custodians of the data, pick up the tasks and initiates them for the data is has in its local storage.

The operation of FCluster according to some exemplary embodiments, and the assurance of the integrity of the data passing there through, will now be described in more detail with reference to figure 14.

Figure 14 is a schematic illustration of the flow of data into and through the FClusters system, according to some exemplary embodiments. Figure 14 shows broadly four zones of assurance, namely acquisition 1402, ingestion 1404, distribution and storage 1406, and processing 1408. In some exemplary embodiments, the data flows through these zones of assurance according to the following. In acquisition assurance zone 1402, as also described above, in some exemplary embodiments, the initial imaging process carried out by DTD 102 has three deliverables: (i) a DEB containing directory metadata (ii) a collection of DEBs, one each for each file that falls into a 'high value' criteria set by the image acquirer; (iii) a conventional 'forensic image', for reference and later extraction of further data.

As also described above, the selection of files to be packaged as DEBs takes a prioritised triage approach collecting only file types expected to have a higher likelihood of containing evidence depending on the case type.

In ingestion assurance zone 1404, the first stage of ingestion into FCluster is when the DEB, containing the data defining the file system directory 110, is imported into the MySQL database at the heart of FClusterfs. At this stage a directory skeleton will exist but no data is available within FCluster.

The file data, in the form of a number of DEBs, is imported as it becomes available. This starts a process of 'filling out' the evidence file system with data associated with each directory entry.

In the distribution and storage assurance zone 1406, the ingested data is distributed across the Datanodes according to a load balancing algorithm which bases its allocation on benchmarking previously created by running a known set of approved program against typical data files.

When a digital evidence bag 112 arrives on its storage host is it unpacked and its contents verified in a number of ways. Only if it is proven to be valid is it accepted and made available via the distributed file system FClusterfs. Upon approval at its storage location, a defined list of tasks is invoked and automatic process is conducted, for example generating text indexing or thumb-nailing images.

To provide redundancy and secondary load balancing, a replication agent firstly ensures constant and routine validation of data by applying a SHA 1 checksum to each file and then ensures there are multiple copies, normally three, of the data is held on separate hosts within the cluster.

The digital evidence bags created at image time may have captured only part of the total evidence required to be processes, and hence subsequently a 'Bag it on demand' system may trigger an on-the-fly acquisition of data that was initially deemed of secondary interest from the image once it has been completed and is available to the cluster. This data is validated and placed in the same assured way as the rest of the system.

FCluster may utilise a wide area network, for example utilising a VPN to connect the nodes, but it will be appreciated that the network may be established in other ways.

In some embodiments, whenever data is transferred between nodes it may be in an encrypted form.

In the processing assurance zone 1408, data processing may preferentially take place locally on the datanode holding the data. This may increase the overall processing speed of the data. In similar way to the use of SHAls to identify 'Bad' files can be used without the actual files being accessed. In this case, results are transferred across the network but not normally the data.

The results may be visualised and reported, where an investigator can inspect the results of the processing.

The assurance of the integrity of the data in each zone 1402, 1404, 1406, and

1408 will now be described in more detail.

Firstly, the assurance of the integrity of data in the Acquisition Assurance zone 1402 and the Ingestion Assurance zone 1404 may be achieved according to the following.

The first assurance in the system is one of the "Loops of Authority and

Acknowledgement" type in which authority is granted to an imaging device to take an image and then FCluster only accepts data that was gathered with that authority. Such a loop of authority is illustrated in Fig 15.

The FCluster administrator generates an 'Authority to image' in the form of a file containing a cryptographic key marked as issued to a specific device. This key is recorded in the Volumelnformation table in FClusterfs. The file is passed to the imaging device and the key will be used to encrypt the evidence gathered before it is sent to FCluster. Many keys can be simultaneously issued to a device to form a 'stock' to be used over a set period of time, the keys have an 'expiry date' associated with them as an added control.

As explained above, the imaging process has three outputs. The DEB containing file-system metadata, representing the directory listing, is the first to be imported. The key used to create encrypt the DEBs was stored in the VolumelD table. If it is present, has not expired or has not been previously fulfilled, the import can proceed. Records are created in the inodes table for each file and directory in the evidence file-system. These includes fields that describe the full path and filename, file size, MAC dates and times etc. If the key is not present in the VolumelD table, the import cannot proceed.

At the end of this process a complete 'framework' of the directory structure and filenames will have been created in the FClusterfs database. It is possible to mount this FClusterfs structure and traverse the directory but as the import of file DEBs that contain file data has not been carried out there is no actual data to analyse in the files.

A series of "checklists" may be used to control the import of the details and contents of the data- file DEBs.

The DEB staging directory, where DEBs of high importance data types are placed ready to be imported, is scanned and any DEBs which form part of a Volume that is expected to be imported are found and opened and the details of the VolumelD, path, filename and size are extracted. The inodes table of FClusterfs is searched to see if this DEB is expected, ie there is an entry previously made by a file-system DEB import but various fields like the original file's SF1A1 and staging directory url are empty. If there is a record that satisfies these criteria then the fields in the inodes table are populated with the meta-data extracted from the data-DEB. If there is a record in the inodes table and it shows it has already been imported it will not be considered again.

The assurance of the integrity of data in the Distribution and storage assurance zone 1406 and the Processing assurance zone 1408 may be achieved according to the following.

Having ingested the volume directory metadata the system is now primed to expect the DEBs 112 of data that makeup that file system. The selection of the primary storage of the data is the first task of the loadbalancer. It allocates a storage server to hold the data held within the DEB and records this in the FCluster inodes table. Allocation is based on the available capacity of the host, its processing power and its estimated time to finish its current task list.

The movefile daemon also uses "checklist" type assurance by constantly scanning the inodes table of FClusterfs for any DEB that has been allocated a datanode, not been marked as being 'in place' and the evidence DEB is staged in a local directory. If these conditions are met the DEB is transferred to the storage datanode as allocated by the loadbalancer. If, any only if, the transfer is successful does movedata update the inode table with 'primarystorageinplace' set to true. In some embodiments, Movedata is the only mechanism whereby actual data can be ingested into the system. It can only operate when all the preconditions, from Ingestion Assurance, are met. If does not scan an evidence folder and import whatever DEBs are present; it imports only expected DEBs, as recorded in the FCluster inodes table, from a folder.

Unpacker daemon constantly scans the inodes table to see if there are any DEBs that are on their local server but not unpacked. It takes the entry from the database and looks to see if the files are on its ftp host, as should be the case from the entries in inodes, not the other way round. A file that simply arrives on the server without an entry in inodes would be ignored. When a suitable DEB is identified it is split into header and data sections. The header, containing the metadata is inserted into the 'meta' table and the header file erased. The data section is uudecoded and the data decrypted with a key stored in the VolumeListing table. This was the key first created and issued by the FCluster and used to encrypt the data in the DEB at acquisition time. If the key does not work the file cannot be decrypted and so the transfer and ingestion would fail. Only if the file decrypts and the resulting file has a SHAl checksum that matches both the name of the file itself and the SHAl as recorded in the inodes table is the datafile finally accepted.

The task daemon scans the tasks table to see if any job is required for a file that it holds locally. Because all file access must take place by utilising the enhanced FClusterfs file- system the file must be the correct file and must have the original content that was collected at imaging time. FClusterfs may also provide fine grained access control to the files within a file system. For example, control of which users can process specific data with specific programs may be implemented.

According to the above described rigorous protocol when importing DEBs into a distributed cluster, levels of assurance necessary when conducting forensic examinations can be achived.

It should be noted that FCluster is read-only and so has no record or file locking code, as a result, even when FCluster draws from a remote ftp server data is cached locally in RAM and never needs to refer to the source for updates or changes. This increases the speed of processing. An increase in processing speed is also achieved by implementing the condition that each storage host should, where possible, process its own local data, hence reducing the dependence of the overall processing speed on the network speed.

Figure 16 a schematic diagram of a Data Transfer Device (DTD) 102 according to an exemplary embodiment of the present invention. The DTD 102 comprises a processor CPU 1604 functionally connected to each of memory 1606, input 1602a, output 1602b and output 1602c. For example, input 1602a may be connected to a source SM 104 as described above, output 1602b may be connected to target SM 106 or target server 804 or any other computer acting as an ingestion point for FClusters system 810, and output 1602c may be connected to DEB SM 108, FCluster server 802, or any other computer acting as an ingestion point for FClusters system 810. Processor CPU 1604 may facilitate the transfer of data from source SM 104 to outputs 1602c and 1602b as described above, and the building up of a priority list of files for transfer as described above, which list may be stored in memory 1606. Memory 1606 may comprise array 120 into which, as described above, the directory of the source SM 104 may be loaded. Memory 1606 may also store software causing the DTD 102 to perform the functions of the DTD 102 as variously described above.

Figure 17 a schematic diagram of a server 802 of the FCluster system 810 according to an exemplary embodiment of the present invention. The server 802 comprises a processor CPU 1704 functionally connected to each of memory 1706, and input/output (I/O) 1702. For example, input 1702 may be connected to a computer network, for example a Wide Area Network (WAN), so as to allow communication from and/or to server 802 with any other device connected thereto, for example the DTD 102, other components of the FClusters system 108, and so on. Processor CPU 1704 may facilitate the authentication, distribution, storage and/or processing of data as described above, for example of data ingested into the FClusters system 108 from the DTD 102. Memory 1706 may comprise a database (not shown) at which data, for example DEBs 112, or complete copies of source SM 104 data, as described above, may be stored. Memory 1706 may also store software 806 causing the server 802 to perform the functions of the server 802, or the functions of any of the component nondes/computers/servers of system 108 as variously described above.

The above embodiments are to be understood as illustrative examples of the invention. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A method for copying data from an original resource to a plurality of target resources, the method comprising:

reading, from the original resource, first data representing a directory of the original resource;

prioritising, based on the first data representing the directory, second data of the original resource for copying; and

copying, based on the prioritising, at least some of the second data of the original resource to the plurality of target resources.

2. The method according to claim 1, the method comprising:

writing, to the plurality of target resources, the first data representing the directory of the original resource.

3. The method according to claim 1 or claim 2, the method comprising:

reading, from the original resource, data indicative of the structure of data in the original resource.

4. The method according to any preceding claim, wherein the copying of the at least some of the second data is based on the location of the second data as recorded in the directory of the original resource.

5. The method according to any preceding claim, wherein copying the at least some of the second data of the original resource comprises:

delivering an imaged copy of the original resource to a first one of the plurality of target resources; and

delivering one or more digital evidence containers comprising one or more files comprising a portion of the second data of the original resource to a second one of the plurality of target resources.

6. The method according to claim 5, wherein copying the at least some of the second data of the original resource comprises:

reading a block or cluster from the original resource;

writing the read block or cluster to the first one of the plurality of target resources; and

simultaneously with writing the read block or cluster to the first one of the plurality of target resources, writing the read block or cluster to a digital evidence container of the second one of the plurality of target resources.

7. The method according to claim 5 or claim 6, wherein each block or cluster of the original resource is read only once from the original resource, written only once to the first one of the plurality of target resources, and written only once to the second one of the plurality of target resources.

8. The method according to any preceding claim, wherein the original resource is a storage media device.

9. The method according to claim 8, wherein the storage media device comprises one of a hard disk and a USB stick.

10. The method according to any of claim 5 to claim 9, wherein the first one of the plurality of target resources is a storage device with a storage capacity larger than a storage capacity of the original resource.

11. The method according to any of claim 5 to 10, wherein the second one of the plurality of target resources is a storage device.

12. The method according to claim 11, wherein the storage device of the second one of the plurality of target resources is a network storage facility located on a network.

13. The method according to any one of claim 5 to 12, wherein the second one of the plurality of target resources comprises a file system on which digital evidence container files can be written as ordinary files.

14. The method according to any one of claim 5 to claim 13, wherein each digital evidence container comprises file data and data that defines the origin of the file data on the original resource.

15. The method according to any preceding claim, wherein the first data representing a directory of the original resource comprises metadata of the files stored on the original resource, and the prioritising comprises:

reading the file metadata;

selecting one or more files to be accessed as a priority based on one or more selection criteria relating to file metadata.

16. The method according to claim 15, wherein the file metadata comprises one or more of a file extension, a file location, a file size, a file name, a modification date, a creation date, a keyword, a file category, a checksum and/or a signature associated with the file.

17. The method according to claim 16, wherein the selecting comprises causing a regular expression to act on a file name associated with a file.

18. The method according to any of claim 15 to claim 17, wherein the selecting one or more files to be accessed as a priority comprises generating an index number for each file based on Bayesian scoring of metadata associated with each file, and wherein the order in which files are accessed for copying corresponds to a rank order of the index number for each file.

19. The method according to any of claim 5 to 18, comprising

calculating a checksum of one or more files of the second data of the original resource; and comparing the calculated checksum against a database of known file fingerprints.

20. The method according to claim 19, wherein, if, based on the comparing, the calculated checksum corresponds with a known good file of the database of known file fingerprints, then not to write the data of the file to the second one of the plurality of target resources.

21. The method according to any of claims 5 to 20, comprising:

after the portion of the second data of the original resource is delivered to the second one of the plurality of target resources, disconnecting the second one of the plurality of target resources whilst the image copy of the original resource is still being delivered to the first one of the plurality of target resources.

22. A method for distributed processing of data copied from an original resource according to the method of any of claims 1 to 21, the method of distributed processing comprising:

receiving the copied data at a first network node;

distributing, from the first network node, the received data to one or more second network nodes;

processing the data at the one or more second network nodes.

23. The method according to claim 22, wherein the received data comprises first data representing a directory of the original resource, and second data of the original resource.

24. The method according to claim 22 or claim 23, comprising:

receiving, at the first network node, data indicative of metadata associated with the received data.

25. The method according to any of claim 22 to claim 24, wherein the received data are formatted within one or more digital evidence containers.

26. The method according to claim 25, wherein one or more of the first and/or second network nodes comprises a file system on which digital evidence container files can be written as ordinary files.

27. The method according to any of claim 22 to 26, comprising:

storing data received at the first network node at one or more of the second network nodes.

28. The method according to claim 27, comprising:

verifying the data stored at one of the network nodes based on data stored at one or more other network nodes.

29. The method according to any one of claim 22 to claim 28, wherein the distribution between the network nodes is based on load balancing according to a benchmark created by running a known set of approved programs against typical data files.

30. The method according to any one of claim 22 to claim 29, comprising: issuing an authority to copy data;

determining whether the received data has been gathered with the issued authority; and

if the received data was gathered without the issued authority, then not accept the data for processing.

31. The method according to claim 30, wherein the authority comprises a file containing a cryptographic key, and the determining whether the received data has been gathered with the issued authority comprises determining whether the cryptographic key has been used to encrypt the gathered data before it is received.

32. An apparatus for copying data from an original resource to a plurality of target resources, the apparatus comprising:

means for reading, from the original resource, first data representing a directory of the original resource;

means for prioritising, based on the first data representing the directory, second data of the original resource for copying; and

means for copying, based on the prioritising, at least some of the second data of the original resource to the plurality of target resources.

33. The apparatus according to claim 32, the apparatus comprising:

means for writing, to the plurality of target resources, the first data representing the directory of the original resource.

34. The apparatus according to claim 32 or claim 33, the apparatus comprising: means for reading, from the original resource, data indicative of the structure of data in the original resource.

35. The apparatus according to any of claim 32 to 34, wherein the copying of the at least some of the second data is based on the location of the second data as recorded in the directory of the original resource.

36. The apparatus according to any of claim 32 to 35, wherein the means for copying the at least some of the second data of the original resource comprises:

means for delivering an imaged copy of the original resource to a first one of the plurality of target resources; and

means for delivering one or more digital evidence containers comprising one or more files comprsing a portion of the second data of the original resource to a second one of the plurality of target resources.

37. The apparatus according to claim 36, wherein the means for copying the at least some of the second data of the original resource comprises:

means for reading a block or cluster from the original resource; means for writing the read block or cluster to the first one of the plurality of target resources; and

means for, simultaneously with writing the read block or cluster to the first one of the plurality of target resources, writing the read block or cluster to a digital evidence container of the second one of the plurality of target resources.

38. The apparatus according to claim 36 or claim 37, wherein each block or cluster of the original resource is read only once from the original resource, written only once to the first one of the plurality of target resources, and written only once to the second one of the plurality of target resources.

39. The apparatus according to any of claim 32 to 38, wherein the original resource is a storage media device.

40. The apparatus according to claim 39, wherein the storage media device comprises one of a hard disk and a USB stick.

41. The apparatus according to any of claim 36 to claim 40, wherein the first one of the plurality of target resources is a storage device with a storage capacity larger than a storage capacity of the original resource.

42. The apparatus according to any of claim 36 to 41, wherein the second one of the plurality of target resources is a storage device.

43. The apparatus according to claim 42, wherein the storage device of the second one of the plurality of target resources is a network storage facility located on a network.

44. The apparatus according to any one of claim 36 to 43, wherein the second one of the plurality of target resources comprises a file system on which digital evidence container files can be written as ordinary files.

45. The apparatus according to any one of claim 36 to claim 44, wherein each digital evidence container comprises file data and data that defines exactly the origin of the file data on the original resource.

46. The apparatus according to any of claim 32 to 45, wherein the first data representing a directory of the original resource comprises metadata of the files stored on the original resource, and the means for prioritising second data of the original resource for copying comprises:

means for reading the file metadata;

means for selecting one or more files to be accessed as a priority based on one or more selection criteria relating to file metadata.

47. The apparatus according to claim 46, wherein the file metadata comprises one or more of a file extension, a file location, a file size, a file name, a modification date, a creation date, a keyword, a file category, a checksum and/or a signature associated with the file.

48. The apparatus according to claim 47, wherein the means for selecting one or more files to be accessed as a priority comprises means for causing a regular expression to act on a file name associated with a file.

49. The apparatus according to any of claim 36 to claim 48, wherein the means for selecting one or more files to be accessed as a priority comprises means for generating an index number for each file based on Bayesian scoring of metadata associated with each file, and wherein the order in which files are accessed for copying corresponds to a rank order of the index number for each file.

50. The apparatus according to any of claim 36 to 49, comprising

means for calculating a checksum of one or more files of the second data of the original resource; and

means for comparing the calculated checksum against a database of known file fingerprints.

51. The apparatus according to claim 50, wherein, if, based on the comparing, the calculated checksum corresponds with a known good file of the database of known file fingerprints, then not to write the data of the file to the second one of the plurality of target resources.

52. The apparatus according to any of claims 36 to 51, comprising:

after the portion of the second data of the original resource is delivered to the second one of the plurality of target resources, means for disconnecting the second one of the plurality of target resources whilst the image copy of the original resource is still being delivered to the first one of the plurality of target resources.

53. An apparatus for distributed processing of data copied from an original resource by the apparatus according to any of claims 32 to 52, the apparatus for distributed processing comprising:

means for receiving the copied data;

means for distributing the received data to one or more network nodes for processing.

54. The apparatus according to claim 53, wherein the received data comprises first data representing a directory of the original resource, and second data of the original resource.

55. The apparatus according to claim 53 or claim 54, comprising:

56. The apparatus according to any of claim 53 to claim 55, wherein the received data are formatted within one or more digital evidence containers.

57. The apparatus according to claim 56, comprising a file system on which digital evidence container files can be written as ordinary files.

58. The apparatus according to any of claim 53 to 57, comprising:

means for causing the data received at the apparatus to be stored at the apparatus and/or at one or more of the network nodes.

59. The apparatus according to claim 58, comprising:

means for causing the data stored at the apparatus and/or at one of the network nodes to be verified based on data stored at one or more other network nodes.

60. The apparatus according to any one of claim 53 to claim 59, wherein the means for distributing the received data comprises a means for load balancing the distribution based on load balancing according to a benchmark created by running a known set of approved programs against typical data files.

61. The apparatus according to any one of claim 53 to claim 60, comprising: means for issuing an authority to copy data;

means for determining whether the received data has been gathered with the issued authority; and

if the received data was gathered without the issued authority, then means for not accepting the data for processing.

62. The method according to claim 61, wherein the authority comprises a file containing a cryptographic key, and the means for determining whether the received data has been gathered with the issued authority comprises means for determining whether the cryptographic key has been used to encrypt the gathered data before it is received.

63. A computer readable medium with a computer program stored thereon which, when executed by a computer causes the computer to perform the method according to any of claim 1 to claim 31.

64. A system for distributed processing of data originating from an original resource, wherein the system is arranged to carry out the method according to any of claim 22 to claim 31.